Floating point
Matthew Phillips (473) 721 posts |
Could we have a bounty to do with exploiting floating point support in the new hardware? We probably need several things: Firstly a replacement FPE module for Beagleboard etc. which will reinterpret old-style FPA instructions and execute them on the VFP or NEON units (whichever gives more benefit). This would immediately give a speed benefit to any software using FPA instructions. This would include hand-crafted stuff, BASIC VI, and presumably C programs using floating point. I’m not sure what compiler and OS support is like for VFP or NEON, but once software starts being compiled to use these instruction sets there may be a need for a VFPE module for older hardware to make it easier for developers to support the wide range of machines still in use. |
Matthew Phillips (473) 721 posts |
Is no-one interested in this bounty proposal? It would be good to get wider exploitation of the hardware floating point in the BeagleBoard, rather than just a few demos. Is my analysis of the requirements way off the mark, or something? |
Jeffrey Lee (213) 6048 posts |
Here’s my 2p:
Although I’ve suggested that we create a replacement FPEmulator before, I’m not actually sure how much of a performance benefit it would give. It could also take a lot of effort to make sure the new code is as accurate as the original code.
I think the better choice for BASIC64 would be to modify it to use the VFP instructions directly. Unfortunately I think it’s possible that some programs use knowledge of the inner workings of BASIC to modify the state directly (e.g. in assembler routines) – so if BASIC64 suddenly switched from using FPA instructions and registers to VFP instructions and registers (and to storing the floating point values in little-endian word order instead of big-endian word order) then those programs could break.
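To illustrate the word-order point above, here is a minimal sketch. It assumes a little-endian ARM system where the FPA layout stores the most significant 32-bit word of a double first and the VFP layout stores the whole value least significant word first. The function names are hypothetical and are not taken from BASIC or from any RISC OS module.

```python
import struct


def double_to_vfp_bytes(value: float) -> bytes:
    """Pack a double in plain little-endian order (assumed VFP memory layout)."""
    return struct.pack("<d", value)


def double_to_fpa_bytes(value: float) -> bytes:
    """Pack a double with the most significant 32-bit word first (assumed FPA layout)."""
    little = struct.pack("<d", value)
    # Swap the two 32-bit words; bytes inside each word stay little-endian.
    return little[4:] + little[:4]


def fpa_bytes_to_double(raw: bytes) -> float:
    """Decode 8 bytes stored in the assumed FPA word order."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack("<d", raw[4:] + raw[:4])[0]


if __name__ == "__main__":
    print(double_to_vfp_bytes(1.0).hex())
    print(double_to_fpa_bytes(1.0).hex())
```

Any assembler routine that reads or pokes a stored value directly would see the two words in the opposite order after such a switch, which is why it could break.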
I don’t think we should worry too much about existing C programs. Any programs which make heavy use of floating point (or where floating point performance was a major performance issue) will have long ago switched to using GCC’s softfloat support, or to hand-written fixed point routines. Therefore any remaining programs will only be making light use of floating point, so wouldn’t see any significant gains from using a new FPEmulator. And any new programs which need to make heavy use of floating point should surely take into account the fact that all future machines will have VFP/NEON available – i.e. they should be using VFP/NEON by default (and relying on a VFPE module for running on old machines), or there should be two (or more) different versions of the program available depending on the user’s machine type (not ideal, but it’s the only way you’d get the best performance for everyone).
Basic OS support for VFP/NEON has been available for about a year now. Assembler support for VFP/NEON is pretty good (extASM, objasm & GCC 4.6 support the full ARMv7 instruction set, and BASIC is in testing). C compiler support is a bit lacking though – no announcement from ROOL as to when they’re aiming to add support for it to their tools (although it might be somewhat dependent on this bounty), and the patch I sent to the GCC team near the start of the year – that would have enabled full VFP/NEON support in C/C++ – doesn’t seem to have made it into their source repo yet (I should probably chase them up on that).
A VFPE module for older hardware is something I’ve suggested myself in the past as well, but it is a lot of work to create a full-blown floating point emulator, and the performance wouldn’t be as good as if a separate non-VFP/NEON version of the program was produced. So it’s tempting to say that people should just deal with the fact that some machines will have VFP/NEON while others won’t. But until “proper” programs start appearing that make use of VFP/NEON we won’t really know if that approach is something programmers/users will be happy with. |
Matthew Phillips (473) 721 posts |
So perhaps we need a bounty for a “proper” program to be produced as a demonstrator of the possibilities. What would benefit most? Would FFmpeg or KinoAMP benefit from hardware floating point? Does anyone have any other suggestions? I just feel that, now we’ve finally got hardware floating point for the first time since the ARM3, more should be done to exploit it. |
Dave Higton (281) 668 posts |
I’d have thought that FFmpeg, KinoAMP and the like would benefit most from using the DSP. I’m probably not a mainstream user, but I can’t think of much that would make significant use of FP at all, other than media decoding, the latter perhaps being best performed on DSPs with the main CPU getting the stream and putting the decoded result to the screen. I may be wrong. |
Jeffrey Lee (213) 6048 posts |
Yes, very true. Unfortunately DSP coprocessors are very machine-specific – although it might be possible to make the RISC OS side fairly generic, any code that runs on the DSP needs to be tailored to that specific machine. The DSP in the OMAP3 isn’t even compatible with the one in the OMAP4 (they’ve changed it from a generic fully programmable processor to a set of fixed-function components designed for decoding current video codecs like H264). Something else that would result in a large performance boost to movie players would be an API to allow access to the YUV video overlay(s). Different machines might require slightly different pixel formats, but overall there’ll be much less variation compared to the DSPs, so it’ll be much easier to write code that will work with everything. This is something I might have done by now, if I wasn’t at a loss as to how best to extend GraphicsV (see here). Maybe things will be a bit clearer once I/we find out the capabilities of the Raspberry Pi. But getting back to the topic of VFP/NEON…
|
Trevor Johnson (329) 1645 posts |
What about DigitalCD? Or does that use MAD or some other integer algorithm? |
André Timmermans (100) 655 posts |
I had a quick look at the FFmpeg sources for NEON optimisations that could be relevant to KinoAmp or DigitalCD, and I only found a NEON version of the IDCT routine, so I guess the corresponding KinoAmp routine could be somewhat optimised. I suspect that KinoAmp would gain far more speed from having access to YUV overlays or non-blocking I/O APIs. |