FP support
GavinWraith (26) 1563 posts |
Jim:
Not with a new Shared C Library.
Dave:
Yes. Charm does this already. See the files lib.src.fp and lib.src.maths in the Charm 2.6.6 distribution. Recompilation would be necessary if the C library were extended with functions to check which FP system was available. It may be that the C runtime would need changing to save and restore VFP state – I am not sure about that. Otherwise I think it might be possible to have different CLib modules to suit each FP system. |
Rick Murray (539) 13840 posts |
Okay. For the lulz. Here’s a C program (you’re on your own for the MakeFile but note that objasm will whinge like hell as we’re mixing FPA and VFP and pre-UAL and UAL):
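Something along these lines – a minimal sketch only; do_fpe, do_vfp, the constants and the loop count are illustrative stand-ins rather than the listing itself. The two routines are assumed to be assembled separately, one with FPA and one with VFP instructions, with results passed through pointers to sidestep floating-point calling-convention differences:

    /* Sketch: time an FPA/FPE multiply routine against a VFP one.
       do_fpe() and do_vfp() are external assembler routines that each
       load two doubles, multiply them and store the result. */
    #include <stdio.h>
    #include <time.h>

    extern void do_fpe(const double *a, const double *b, double *result);
    extern void do_vfp(const double *a, const double *b, double *result);

    #define LOOPS 100000   /* illustrative iteration count */

    static clock_t time_it(void (*fn)(const double *, const double *, double *),
                           double *result)
    {
        double a = 1.234567, b = 8.7654321;
        clock_t start;
        int i;

        start = clock();   /* clock() ticks in centiseconds on RISC OS */
        for (i = 0; i < LOOPS; i++)
            fn(&a, &b, result);
        return clock() - start;
    }

    int main(void)
    {
        double r1 = 0.0, r2 = 0.0;

        printf("FPE: %ld cs  result %f\n", (long)time_it(do_fpe, &r1), r1);
        printf("VFP: %ld cs  result %f\n", (long)time_it(do_vfp, &r2), r2);
        return 0;
    }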
And here’s some assembler to go with it:
Everything is double, as the C compiler promotes single (float) to double before passing it to printf(). The FPE code seems to be broken. It consistently outputs an insanely large value. It used to work, and the code is more or less equivalent to the FPE code generated by the compiler, so I don’t know what’s going wrong. I don’t see what can go wrong with “load a value, load another value, multiply them, store the result”… Doesn’t matter, really. The point is the timings. Um.
290cs vs 7cs. This is fairly consistent, running on a standard Pi, single-tasking. For the time it takes to do this once with FPE, I could do it forty-one times with VFP. This is why the DDE ought to start supporting native hardware FP instead of ancient emulated FP. |
Steve Pampling (1551) 8170 posts |
I may be off track here, but isn’t one of the reasons for RO multimedia handling being a bit sucky something to do with an absence of decent FP? Or rather the use of FPE instead of hardware FP? |
David Feugey (2125) 2709 posts |
This, and slow disc accesses. |
Steve Pampling (1551) 8170 posts |
I did say “one of the reasons”. Unless you have a magic wand, deal with problems one at a time. Faster disc access is at least partly limited by the current hardware. |
Rick Murray (539) 13840 posts |
Err, not really. Same hardware, different OS, makes the standard Pi quite a nice media playback system. One of the main issues is the closed nature of many of the GPUs. The things can be controlled with a binary blob supplied by the manufacturer, which will slot into Linux and, together with the media framework, will provide what is necessary for HD H.264 video. You can see, by looking at our MPlayer port and its just-about 320×240 capabilities, exactly how much the GPU does assist. Without this, we’re kind of stuck.
Anyway, for now, for today, it might be a nice idea if our compiler could perhaps make better use of the easily available facilities. A possible alternative could be to supply code paths for VFP and non-VFP and select which one is used at runtime, depending on the facilities of the host system? This might impact the efficiency of the compiler, so… |
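For what it’s worth, the run-time selection idea can be as simple as picking a function pointer once at start-up. A sketch, where cpu_has_vfp() stands in for whatever probe the host OS offers and the two dot routines are the same code built twice, once with VFP and once without:

    /* Sketch: choose between a VFP build and a non-VFP build of the
       same routine once, at start-up, then call through a pointer. */
    #include <stddef.h>

    extern double dot_vfp(const double *a, const double *b, size_t n);
    extern double dot_fpe(const double *a, const double *b, size_t n);
    extern int cpu_has_vfp(void);          /* hypothetical probe */

    static double (*dot)(const double *, const double *, size_t);

    void fp_init(void)
    {
        dot = cpu_has_vfp() ? dot_vfp : dot_fpe;
    }

    double example(const double *x, const double *y, size_t n)
    {
        return dot(x, y, n);               /* one indirect call per use */
    }

The cost to the non-VFP path is a single indirect call per use, which is the sort of trade-off being suggested here.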
David Feugey (2125) 2709 posts |
Holidays? :) The problem is more on the Neon side. We could have a NeonEmulator, but the use of Neon code will not be very efficient on VFP (without Neon) systems. Conclusion: VFP is definitely possible without any drawback, except on FPA systems (rare today). Neon is possible, but with speed problems on VFP-only systems.
One solution would be to have both versions of the assembler code in the binary. A simpler solution would be to have some *ifNeon CLI command to launch a specific RunImage if Neon is present (easy to make [compile twice], easy to remove if some people want to save space [remove the unneeded RunImage]). If you really want, some *ifFPA and *ifVFP commands could be provided too. IMHO, several binaries will be much more flexible than a fat binary. And it’s more RISCOSish too :) |
jim lesurf (2082) 1438 posts |
IIRC that was always the case, and I assumed it was because the IEEE compliance was based on double-precision error/rounding specs. BTW, for me IEEE compliance was important. And at least one version of the FPE emulation failed at one time. Most were fine, though. As was a faster 3rd-party version I used for a while (I’ve forgotten who produced that). And back in the day when I still had a machine with real FP hardware, I found that hardware was indeed 20–40 times faster than emulation. Made a big difference to programs that used a lot of floating-point number bashing.
I confess to being wary of having to generate multiple binary versions, etc. It opens up scope for odd failures as people start saying “doesn’t work here” or, even more frustrating, “I get a slightly different answer”. Seems to me that dealing with this via CLib/FPEmulator as the go-between makes most sense. But of course I have no idea what VFP/NEON entail, so what I’m saying may not be possible.
In the past I mainly wanted FP hardware for engineering/academic/scientific calculations. These days I’d be in a boat with more passengers, as I feel it would help a lot with processing ‘AV’ data files and streams. These often involve bucketloads of data as well as requiring a lot of number crunching. Sometimes it can all be integer, but other times not. Jim |
Steve Drain (222) 1620 posts |
I have tried to keep an eye on Charm. When I looked several months ago I think I remember the floating point only extended to the arithmetic operations; now I see that all the operations offered by FPA are there. If I understand correctly, to exploit VFP you have to recompile the compiler (written in Charm), so object code will only work in one of the two environments we might be concerned with, not both. I have not delved into the source, but I would be interested to find out how Charm provides the transcendental functions using VFP, and how it handles VFP contexts. Nevertheless, it is clearly an exemplar for changing the C compiler. |
Steve Drain (222) 1620 posts |
I have never been in doubt about the huge speed advantages of using VFP. My own tests confirmed it, but I thought it too obvious to make a point of. Thanks for your explicit numbers. ;-) |
Steve Drain (222) 1620 posts |
Have you looked at my Float module? I considered all the issues raised here when writing it. My solution may not be suitable as it stands, but it is a solution.
First, as Jeffrey pointed out, you have to deal with context. At what level you do this is an important decision, and one that did not have to be made when using FPA. My solution is at task level – not per instruction nor at system level. If BASIC were to do this, it would be during task initialisation with *Basic, and the context pointers stored in the spare words still available in the workspace. That is how I have experimented with it for Basalt. I expect that this can be done similarly in C.
Next is the compatibility between FPA and VFP. The general feeling here seems to be for separate compilation depending on the processor, as with Charm, but I dislike the idea of having different code, despite arguments about what is the route forward. ;-) My solution is to set up a context on systems that can use VFP, but to have null pointers otherwise. The choice of which code to run is then made on whether there is a context or not. This single instruction and branch is of no significance to FPA and a minuscule delay to VFP.
Then there is the problem of endianness:
I agree. Any existing data is stored for FPA, cf. BASIC. So my solution is the same, and requires the overhead of swapping registers before and after VFP code. I think this is acceptable for compatibility.
A problem that has only been discussed tangentially, I think, is precision. Certainly double-precision IEEE is needed, and that is where I stopped, because that is all BASIC requires. However, single precision is used. VFP can provide this, and I think we can ignore NEON. Extended precision is out of the question with VFP, but how important is that?
Lastly, there is the problem of transcendental operations not provided by VFP. This is not trivial, but it has been a problem for computers for a long time. As an amateur, my solutions took some time and effort to tease out, but I expect those with computer science qualifications could rustle them up. How many of those do we have here? The algorithms might need explanation, but that is for another time, if anyone is interested. ;-)
A final comment. I do not see an FPE replacement using VFP as feasible, for all the reasons Jeffrey listed, and I think a VFPE alternative would impose unnecessary conditions on programs running on non-VFP machines, so I would rule both out. |
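To make the word-order point concrete (an illustration only, not code from the Float module): FPA keeps the most significant 32-bit word of a double at the lower address, while VFP uses the opposite order, so data laid out for FPA has to have its two words swapped around any VFP code that touches it:

    /* Sketch: convert the 8-byte in-memory form of a double between
       FPA word order (most significant word first) and VFP word
       order (least significant word first) by swapping the halves. */
    #include <string.h>

    void swap_fp_words(void *p)            /* p points at the stored double */
    {
        unsigned int w[2], t;              /* two 32-bit words on ARM */

        memcpy(w, p, sizeof w);
        t = w[0]; w[0] = w[1]; w[1] = t;
        memcpy(p, w, sizeof w);
    }

In use, the swap is applied to values just before VFP code reads them from FPA-format storage, and again to anything VFP writes back – the overhead being accepted above in exchange for compatibility.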
David Feugey (2125) 2709 posts |
Yep, an elegant solution. But I was just thinking that perhaps it’s time to get this by default.
We could then fall back to VFPEmulator-specific software functions made to be close to FPEmulator. That’s why I suggested Basic VII: a new solution with support for VFP/Neon (and non-VFP computers via VFPEmulator), the fastest way possible, but with small differences from BBC Basic VI that can lead to incompatible code. The same as between BBC Basic V and VI. And perhaps it would be a good time to add some features present in BBC BASIC for Windows: “data structures, PRIVATE variables, long strings, event interrupts, an address of operator, byte variables, a line continuation character, indirect procedure and function calls and improved numeric accuracy.” Directives for the (de)tokeniser (AllowLowcaseKeywords, RemoveFN, RemovePROC, AllowKeywordRewrites, Aliases, Renames) could be added too, to get something more modern (I do this today with a very limited and buggy preprocessor).
So the solution is not hardware FP?
A VFPEmulator, in the same way that current software needs FPEmulator. Not really a big change for users. |
David Feugey (2125) 2709 posts |
From a strategic point of view, RISC OS attracts some people because of BBC Basic. And simply because it’s probably the fastest interpreter on ARM in the world (with no JIT). I think it’s really important to keep this advantage. On Windows, BBC Basic is one of the smallest, and the fastest, interpreters too. That’s really a good reason to use it, and to make cross-platform software (OK: games) with it. I’m OK with add-ons (such as the really excellent Basalt), and with support for legacy platforms, but we can also move on. Just to claim that we still have the fastest ARM interpreter in the world :) |
Steffen Huber (91) 1953 posts |
Can someone summarize the situation with GCC and float stuff? Before we got all that shiny new hardware, I remember that the choice was between “hard float” (FPA/FPE) and “soft float” (an internal GCC math lib). “Soft float” was a lot faster on non-FPU hardware. ISTR that libraries needed to be compiled for the correct calling standard. |
David Feugey (2125) 2709 posts |
On non-FPU hardware, soft float is the fastest solution and hard float the slowest. VFP support should be complete in both GCC and UnixLib. Is it available in stock GCC or in a specific beta version? I don’t know. |
GavinWraith (26) 1563 posts |
It probably depends on what program you run, but I would claim that Lua is faster. The Lua binary is about 88K; that includes extra libraries like lpeg (parsing expression grammars) and bc (big numbers). Basic 64 is smaller at 51K. But Lua does not have the nostalgia factor. |
David Feugey (2125) 2709 posts |
Lua has some libs that can lead to much faster results, but I doubt that each opcode is decoded and run with only a few ARM instructions. The BBC Basic engine is very optimised here (I could say the same of BBC BASIC for Windows). That does not take away from Lua’s qualities, anyway. |
Rick Murray (539) 13840 posts |
Well, there’s one way to settle this. CODEFIGHT!!!
|
GavinWraith (26) 1563 posts |
The standard Lua distribution has always had “#define LUA_NUMBER double” as the default.
You would be right. Some of the Lua VM instructions are pretty complex, especially the ones dealing with tables. Almost every operation depends on whether its operands have metatables. So an addition (+), which in the simplest case would come down to an “ADD result, arg1, arg2” ARM instruction, might be implemented by arbitrarily long user code in Lua (or C or assembler), if either of the operands has been set up to demand it. This is the penalty that has to be paid for user-controlled syntax. In this sense, Lua is not so much a language as a language-kit. It mandates certain aspects (garbage-collected memory management, lexical scoping, multiple return values from functions) but leaves a great deal else free to be defined by the user. The intention behind the register-based Lua VM was that each instruction should do as much work as possible, to cut down on interpretive overhead. |
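A tiny illustration of that point, using the stock Lua C API (the script and its numbers are made up for the example; link against the Lua library):

    /* Sketch: a C host embedding Lua, where the script's '+' runs an
       __add metamethod (arbitrary user code) rather than a plain ADD. */
    #include <stdio.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    static const char *script =
        "local mt = { __add = function(a, b) return a.x + b.x end }\n"
        "local p = setmetatable({x = 1.5}, mt)\n"
        "local q = setmetatable({x = 2.5}, mt)\n"
        "print(p + q)\n";

    int main(void)
    {
        lua_State *L = luaL_newstate();    /* create an interpreter state */
        luaL_openlibs(L);                  /* standard libraries, for print */

        if (luaL_dostring(L, script) != 0)
            fprintf(stderr, "lua: %s\n", lua_tostring(L, -1));

        lua_close(L);
        return 0;
    }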
Rick Murray (539) 13840 posts |
Doesn’t this run the risk of marginalising the language away from serious use? After all, there are always two sides to the story. For example, all of my Windows programs are written with (true)VB because I didn’t grok how C programs started themselves up and I wasn’t confident to alter one of the demo apps to figure it out. VB on the other hand is overly friendly and extremely simple to use even if you pay for it in efficiency (the 6502 emulator I started in VB was taking the piddly more than anything else). Perhaps a person might feel more confident with Lua?
Just like the FPE instructions…
Wouldn’t that be the same in BASIC? |
Jeffrey Lee (213) 6048 posts |
VFP support should be complete in both GCC and UnixLib
Currently the only way to get a VFP/NEON-capable version of GCC is to build it yourself. It’s also worth pointing out that any programs compiled to use VFP/NEON (using GCC) will need SharedUnixLibrary 1.13 – which hasn’t seen a public release yet. If you build GCC yourself you’ll get a copy of it, but to avoid potential differences once the official 1.13 comes along I don’t think the GCC team will be happy with you distributing your own version.
Rick: The reason your do_fpe function returns the wrong value is because of the differing word order for doubles between FPA & VFP. It looks like objasm decided you wanted to use VFP word ordering, which is why your VFP code needs to swap the order on save (for interaction with the FPA CLib) but not on load.
Lack of VNMUL & friends in BASIC looks like an oversight – they should be there now in BASIC 1.61 |
Fred Graute (114) 645 posts |
Thanks for the quick fix, Jeffrey. Here are a few more anomalies I found while extending StrongED’s ASM colouring:
|
Jeffrey Lee (213) 6048 posts |
What’s the hex for those instructions? For VLDM/VSTM the register count is stored in an 8-bit field, so you could theoretically load/store up to 255 registers if the hardware had that many. So I suspect that the debugger is disassembling it correctly, and it’s actually the instruction which is at fault. Which would then lead on to a second question of how you assembled those instructions!
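For reference, the count being described is the low byte of the instruction word, so it is easy to check from the hex by hand or with something like this (the example word is illustrative only):

    /* Sketch: pull the imm8 register-count field (bits 7:0) out of a
       VLDM/VSTM encoding.  For the double-precision form the field
       holds twice the number of registers transferred. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int instr = 0xEC810B08;   /* illustrative VSTM encoding */
        unsigned int imm8  = instr & 0xFF;

        printf("imm8 = %u (%u doubles)\n", imm8, imm8 / 2);
        return 0;
    }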
Steve Drain (222) 1620 posts |
I have looked back at the long list of VFP instructions you provided, but I cannot see VNMUL & friends in it. Could you, or Fred, please post them or point me directly to the relevant ARM document? I will then add them to the VFP manual. However, I notice that I missed out the VFPv4 fused multiply instructions VF[N]MA and VF[N]MS. I cannot recall why, but I do not suppose that they are the same. |
GavinWraith (26) 1563 posts |
Lua has been designed with very specific aims. In particular, it is designed as a C library to be embedded in applications written in C. Lua as a separate programming language is something of a side-issue. The idea is that if you are after speed, you cater for that on the C side of things. So something like RiscLua, which is a statically compiled C application to interpret an appropriate dialect of Lua for RISC OS, is only showing half the story. So, yes, embedding Lua in a pre-existing number-crunching package, to make it easier to use, makes sense. Adding number-crunching facilities to RiscLua makes less sense, IMHO. That is not to say that a future version of RiscLua won’t be using VFP or NEON. Lua 5.3, the latest version, after years of discussion on the forums, addresses the problem that, whereas doubles may be a useful number type to expose to the user, the internal code is only interested in pointers, essentially an integer type. My personal preference is for no coercion and keeping types separate, but that is not seen as making things simple for the casual user. |