FP support
David J. Ruck (33) 1635 posts |
Great stuff, any chance of the same thing for Fireworkz? |
Stuart Swales (8827) 1357 posts |
Quite likely, I would think ;-) |
Paolo Fabio Zaino (28) 1882 posts |
@ Stuart I’ve got my hands full ATM, but if you want I can give it a try this weekend while testing other stuff. If so link to pull down the source for build (or binary for test only) pleaseeeee :) P.S. Awesome work! |
Stuart Swales (8827) 1357 posts |
Have a whirl with this source tarball. I’m interested in comments at present as to whether to polish further – should we go to Code Review? Note that I haven’t yet implemented fetestexcept() and friends for the VFP world (expect NaNs and Infs rather than SIGFPE barfs). http://croftnuisk.co.uk/coltsoft-downloads/other/apcs_softpcs_20210924.zip In the end I had to change PipeDream very little to use this library; I didn’t get C99 double complex with Norcroft /softfp working so had to revert to my own old implementation, and change one trivial inline function to non-inline to stop a compiler barf. [Edit: I forgot about adding -DAPCS_SOFTPCS as well as using -apcs /softfp, as the compiler doesn’t seem to define anything useful] |
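With fetestexcept() and friends not yet implemented for the VFP path, code that used to rely on exception flags or SIGFPE may want to inspect results directly. A minimal sketch in standard C99, assuming only isnan() and isinf() from <math.h>; the names classify_result and checked_divide are hypothetical and not part of apcs_softpcs:

```c
#include <math.h>

/* Hypothetical helpers, not part of apcs_softpcs: classify a result by
 * inspecting the value itself instead of consulting fetestexcept(). */
typedef enum { FP_RESULT_OK, FP_RESULT_NAN, FP_RESULT_INF } fp_result_kind;

static fp_result_kind classify_result(double value)
{
    if (isnan(value)) return FP_RESULT_NAN;
    if (isinf(value)) return FP_RESULT_INF;
    return FP_RESULT_OK;
}

/* Divide, store the quotient, and report what kind of value came back. */
static fp_result_kind checked_divide(double num, double den, double *out)
{
    *out = num / den;
    return classify_result(*out);
}
```

A caller would test the returned kind after each operation it cares about, rather than polling the exception flags afterwards.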
Chris Gransden (337) 1207 posts |
I did a quick test with the flops.c benchmark.
|
Stuart Swales (8827) 1357 posts |
Thanks Chris! 10x, but could be 10x better, eh. As I mentioned somewhere else, I see it as a useful stepping-stone towards using some of the potential performance offered by new hardware without abandoning the old. I’m sure I’m not the only person who is pretty much tied into continuing to use Norcroft for RISC OS targets given various pragmas and globs of assembler. Chris: I tried with flops.c from the interweb to see what ops that used and got vastly different results to yours – could I have a copy please? Ta. |
Chris Gransden (337) 1207 posts |
I’ve just sent it. I’ll see if I can find and build something that is more of a real-world test. |
Stuart Swales (8827) 1357 posts |
Thanks – results now believably closer. Mine are somewhat lower due to older HW (ARMX6@1GHz) but the gap between Norcroft -Otime with apcs_softpcs and gcc -O2 -mfpu=vfp (4.7.4) is also lower, about a factor of three to four, not ten. I do see a factor of ten still between Norcroft FPA and Norcroft with apcs_softpcs. |
David Pitt (3386) 1248 posts |
Some results using this flops.c built with GCC 4.7.4:
*gcc flops.c -o flops -mfpu=vfp
On the 1.5GHz Titanium:
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Iterations = 512000000
NullTime (usec) = 0.0014
MFLOPS(1) = 291.6420
MFLOPS(2) = 295.3921
MFLOPS(3) = 343.1982
MFLOPS(4) = 347.2868
On the RPi400 at 2.4GHz:
Iterations = 512000000
NullTime (usec) = 0.0009
MFLOPS(1) = 697.6939
MFLOPS(2) = 811.5966
MFLOPS(3) = 886.8430
MFLOPS(4) = 898.4188
|
Stuart Swales (8827) 1357 posts |
David: Try with -O2, might get closer to Chris’ results. |
Chris Gransden (337) 1207 posts |
I used -O3. While trying to link something else I get an undefined symbol for __apcs_softpcs__lrintf. |
Stuart Swales (8827) 1357 posts |
Ah, overzealous bit of macro-ing in apcs_softpcs.h! Thanks. lrint and llrint (and friends) didn’t need wrapping, and might not benefit much from VFP-ing as their implementation in the C library is pure ARM w/o FPA. [Edit: the above is true for lrint/lrintl/llrint/llrintl but NOT for lrintf/llrintf when compiled with |
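The undefined __apcs_softpcs__lrintf symbol is the kind of link error a blanket renaming macro produces: the header redirects a name to a wrapper that the library never defines. A sketch of that mechanism, with every name hypothetical (DEMO_WRAPPED, lib_halve, lib_round); the real apcs_softpcs.h may well be organised differently:

```c
/* Hypothetical illustration only; not the contents of apcs_softpcs.h. */

/* Stand-ins for two C library routines. */
static double lib_halve(double x) { return x * 0.5; }
static long   lib_round(double x) { return (long)(x < 0 ? x - 0.5 : x + 0.5); }

/* Paste a prefix onto a routine name, as a wrapping header might. */
#define DEMO_WRAPPED(name) demo_softpcs__##name

/* The one wrapper this demo "library" really defines. */
static double DEMO_WRAPPED(lib_halve)(double x) { return lib_halve(x); }

/* Redirect later callers of lib_halve to the wrapper. */
#define lib_halve(x) DEMO_WRAPPED(lib_halve)(x)

/* Overzealous: also adding
 *     #define lib_round(x) DEMO_WRAPPED(lib_round)(x)
 * without defining demo_softpcs__lib_round would compile (given a
 * prototype) and then fail at link time with an undefined symbol. */

static long halve_and_round(double x)
{
    /* lib_halve goes through the wrapper; lib_round is left unwrapped. */
    return lib_round(lib_halve(x));
}
```

Leaving a routine unwrapped, as with lib_round here, lets it resolve to the ordinary library version, which is the effect of commenting the offending line out of the header.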
Rick Murray (539) 13840 posts |
Norcroft really needs to move away from emitting FPA instructions, and these examples are (yet another) demonstration why. I noticed a few versions ago it has some options for the FPU type. I don’t think they do anything, but maybe it’s planned? Fingers crossed! |
David Pitt (3386) 1248 posts |
“David: Try with -O2, might get closer to Chris’ results.”
Thanks both, -O2 good, -O3 better. (RPi400 at 2400MHz)
*gcc flops.c -o flops -mfpu=vfp -O2
*flops
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Iterations = 512000000
NullTime (usec) = 0.0000
MFLOPS(1) = 1669.2163
MFLOPS(2) = 1127.8841
MFLOPS(3) = 1596.2417
MFLOPS(4) = 1860.7029
*gcc flops.c -o flops -mfpu=vfp -O3
*flops
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Iterations = 512000000
NullTime (usec) = 0.0000
MFLOPS(1) = 1889.5671
MFLOPS(2) = 1158.4400
MFLOPS(3) = 1661.5248
MFLOPS(4) = 2009.1419
|
Stuart Swales (8827) 1357 posts |
Not just the compiler, Rick. The C library is chokka with FPA assembler. For instance the lrintf code could be sped-up usefully by having a VFP code branch as well as the existing FPA branch just to do the callee-narrowing prior to the common ARM bit. But how much are we prepared to annoy people who for good reasons are still running old hardware (“it works for me in my setup”) or emulators (“don’t have hardware at work, just RPCEmu on the laptop”)? I wouldn’t bother to release a VFP-only version of PipeDream or Fireworkz as the performance gains in those applications wouldn’t be worth it for 99% of users, whereas something that usefully boosts performance for anyone with modern-ish hardware without unduly penalising the other users looks like a win to me. Anyone who needs to run at highest performance needs to grab code and compile it to suit their needs. |
Chris Gransden (337) 1207 posts |
Commenting out lrintf in the header got it to link. Here are the results for twolame converting a wav file to mp2 on a RPi CM4 @2.4GHz:
fpa: 374.7 secs
softpcs (APCS_SOFTPCS_RUNTIME_SWITCH: TRUE): 16.18 secs
softpcs (APCS_SOFTPCS_RUNTIME_SWITCH: FALSE): 15.02 secs
gcc 4.7.4 vfp: 3.89 secs
|
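To put the twolame timings in proportion: softpcs comes out at roughly 23x (run-time switch on) and 25x (switch off) faster than FPA, and the GCC VFP build is about 3.9x faster again than the switch-off build. A small sketch of that arithmetic; the function name speedup and the constant names are illustrative only, the times are those quoted in the post:

```c
#include <assert.h>

/* Elapsed times in seconds, as quoted for twolame on the RPi CM4. */
static const double T_FPA            = 374.7;
static const double T_SOFTPCS_SWITCH = 16.18; /* APCS_SOFTPCS_RUNTIME_SWITCH TRUE  */
static const double T_SOFTPCS_FIXED  = 15.02; /* APCS_SOFTPCS_RUNTIME_SWITCH FALSE */
static const double T_GCC_VFP        = 3.89;

/* How many times faster the new run is than the old one. */
static double speedup(double old_secs, double new_secs)
{
    assert(old_secs > 0.0 && new_secs > 0.0);
    return old_secs / new_secs;
}

/* Fraction of the switched build's time saved by dropping the switch. */
static double switch_saving(void)
{
    return (T_SOFTPCS_SWITCH - T_SOFTPCS_FIXED) / T_SOFTPCS_SWITCH;
}
```

On these figures the run-time switch costs about 7% of the elapsed time, which is small next to the gain over FPA.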
Stuart Swales (8827) 1357 posts |
Wow! That’s a win.
Edit: Note that you can assemble the library with APCS_SOFTPCS_RUNTIME_SWITCH set to {FALSE} for more performance when you know the target will have VFP. That setting eliminates an LDR/LDR/TEQ/BEQ sequence for each basic f.p. operator, e.g. on my 1.0GHz ARMX6:
*flops-vfpf [no run-time switch, so VFP required for basic operations, VFP with FPA fallback for library functions]
Iterations = 128000000
NullTime (usec) = 0.0020
MFLOPS(1) = 73.4334
MFLOPS(2) = 73.0166
MFLOPS(3) = 78.5507
MFLOPS(4) = 82.1033
*flops-vfps [run-time switch for VFP/FPA for everything]
Iterations = 128000000
NullTime (usec) = 0.0020
MFLOPS(1) = 60.0289
MFLOPS(2) = 61.3677
MFLOPS(3) = 65.3769
MFLOPS(4) = 67.8630
This benchmark is really just exercising the basic f.p. operators. I forgot to mention, these figures are WITHOUT -Otime for flops.c as that degraded it slightly… Several of the individual methods seem to run faster when flops.c is compiled with -arch 2 -cpu 3 as it uses LDM (and STM) rather than two LDRs to move the double precision values around.
|
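The trade-off behind APCS_SOFTPCS_RUNTIME_SWITCH can be pictured in C, although the library itself does this in assembler. In the sketch below all names (DEMO_RUNTIME_SWITCH, demo_have_vfp, demo_add and the two stand-in paths) are hypothetical; it only shows why a per-call flag test costs something on every basic operator and why compiling it out requires VFP to be present:

```c
#include <stdbool.h>

/* Hypothetical sketch; the real library implements this in assembler. */
#ifndef DEMO_RUNTIME_SWITCH
#define DEMO_RUNTIME_SWITCH 1   /* 0 would mean "VFP is known to be present" */
#endif

static bool demo_have_vfp = false;  /* would be probed once at start-up */
static int  demo_vfp_calls = 0;     /* counters, only to make paths visible */
static int  demo_fpa_calls = 0;

/* Stand-ins for the two implementations of a basic operator. */
static double add_vfp(double a, double b) { demo_vfp_calls++; return a + b; }
static double add_fpa(double a, double b) { demo_fpa_calls++; return a + b; }

static double demo_add(double a, double b)
{
#if DEMO_RUNTIME_SWITCH
    /* Load the flag and test it on every call: the per-operator overhead. */
    if (!demo_have_vfp)
        return add_fpa(a, b);
#endif
    return add_vfp(a, b);
}
```

With the switch compiled out, demo_add always takes the VFP path, matching the "VFP required for basic operations" note on the flops-vfpf run.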
Rick Murray (539) 13840 posts |
I know. It will need some stuff duplicated, but then maybe a little bit of smarts will be able to set up the Stubs jump table accordingly if called using a new SWI (LibInitAPCS32VFP or something?) in order that older FPA software also works as expected.
My thoughts on this are that it isn’t really an annoyance as such. Everything they own and everything they use won’t mysteriously cease to function. The only difference is that upgrades and new releases of some things won’t work.
Firstly, there is precedent from Acorn (think of all the RiscPC extended stuff that was never officially made available for older machines (why do you think Dummy Dynamic Areas was created?)).
Secondly, there is precedent: take a look at https://www.riscosports.co.uk/vfp/ and note that it isn’t aimed at anything pre-5.2x with VFP.
Thirdly, this should be a question for each individual author. Some bend over backwards to use StubsG to support “damn near everything”, while others figure after over a quarter of a century, the RiscPC has had a good run, but it shouldn’t be a millstone preventing future progress.
“Because ancient machines” is a pretty lousy excuse for not having the DDE compiler support hardware maths, and that sort of logic might push people less lazy than me to GCC. I wouldn’t move, as I suck at maths so my code doesn’t tend to be maths heavy, but Chris has provided yet another example of the limitations of emulated FP. I mean, literally, six odd minutes (FPA) versus a mite under four seconds (VFP). I’m not entirely certain what softpcs actually is, but even that hands FPA its arse on a plate, running in at sixteen seconds. Which is way closer to four than it is to six freaking minutes! As such, softpcs would seem an acceptable alternative (how might I use this in my programs? (Norcroft compiler)), as, really, it’s FPA that’s not fit for purpose…
1 It used to be, but these days I try to avoid turning on the power hungry Windows box.
|
Stuart Swales (8827) 1357 posts |
RE: apcs_softpcs – see my post from 19 hours ago (I have no idea how to paste links to individual posts here) |
David Pitt (3386) 1248 posts |
At the required post, click on the “19 hours” link; that is the link required. Copy it from the URL bar. I do this in a second browser window to avoid losing my place. |
Chris Gransden (337) 1207 posts |
Down from 16.18 secs to 15.02 secs. |
Stuart Swales (8827) 1357 posts |
I did wonder about having the first instruction of each function being @David: Thanks – I can only see links to topics and posts but have now found the post id to use by HTML inspection. Let’s see if I can do it: https://www.riscosopen.org/forum/forums/2/topics/3457?page=1#posts-45080 was what inspired me to do this. |
David Pitt (3386) 1248 posts |
It only works from within the topic, not from “Recent posts”. While contemplating this message, look up at the one above: the time, above the name, is a link to that post including the |
Rick Murray (539) 13840 posts |
Mmm, just read bits of it on my phone. It looks like the sort of FP support that was provided with TurboC way back when – use hardware if available, else emulate. It’s a good compromise. Thanks. ;-) |
Rick Murray (539) 13840 posts |
It’s hiding. Don’t use Recent Posts, go into the actual thread. Then look at the posting time above the user’s icon. There’s your link. [alternate: Firefox, install the Display #Anchors add-on] |