FracNEON on 64Bit Linux
Kuemmel (439) 384 posts |
Meanwhile I ported my good old FracNEON single precision benchmark from aarch32 to aarch64 so it runs on 64-bit Linux. It uses some C++/SDL2 to display the results, but the benchmark itself is written in assembler. There are now 3 different versions, showing 3 different optimisation possibilities for single-core coding. I hope to implement multithreading some day as well. I did not only port the initial FracNEON, which is now called ‘opt1’; I also created new versions from scratch with 2 and 3 independent instruction blocks (opt2 and opt3), still calculating exactly the same thing (iterations/result). Having twice as many NEON registers available makes this possible. This ended up as more than 1000 lines of assembler to keep track of all the events when you iterate 12 Mandelbrot pixels in 3 NEON paths in the main loop at the same time…got me some nightmares when bug hunting ;-) …but especially the results (thanks to Chris) on virtual Ubuntu on an Apple M1 show what huge potential there is in using this coding technique at the lowest level on the latest cores. ‘opt3’ is about 137% faster than ‘opt2’. On the RPi4 there are only small gains. You’ll find everything on my homepage, including some graphs and source, here. If you have some other 64-bit Linux ARM device, I’m looking forward to other measurement data. Still hoping to do that someday on RISC OS :-) though I can’t complain about my first experience coding on Linux. |
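For readers who have not seen the technique: the idea behind the 2 and 3 "independent instruction blocks" can be roughly sketched in portable C++. This is not the FracNEON source (which is hand-written NEON assembler working on 4 single precision pixels per register); the names `mandel_single` and `mandel_three` are made up for this illustration, and each "path" here is a single scalar pixel rather than a vector. The point is only that the three chains share no data, so an out-of-order core can overlap their multiplies and adds instead of waiting on one long dependency chain.

```cpp
#include <array>
#include <cstdint>

// Reference version: one pixel, one dependency chain.
inline std::uint32_t mandel_single(float cx, float cy, std::uint32_t max_iter) {
    float x = 0.0f, y = 0.0f;
    std::uint32_t iter = 0;
    while (iter < max_iter) {
        const float x2 = x * x;
        const float y2 = y * y;
        if (x2 + y2 > 4.0f) break;      // |z|^2 > 4: the point has escaped
        y = 2.0f * x * y + cy;          // imaginary part of z^2 + c
        x = x2 - y2 + cx;               // real part of z^2 + c
        ++iter;
    }
    return iter;
}

// Hypothetical interleaved version: three pixels advance in the same loop
// body. Chain k only touches x[k], y[k], iter[k], so the three chains are
// independent of each other.
inline std::array<std::uint32_t, 3> mandel_three(const std::array<float, 3>& cx,
                                                 const std::array<float, 3>& cy,
                                                 std::uint32_t max_iter) {
    std::array<float, 3> x{}, y{};
    std::array<std::uint32_t, 3> iter{};
    std::array<bool, 3> active{true, true, true};

    while (active[0] || active[1] || active[2]) {
        for (int k = 0; k < 3; ++k) {   // small fixed loop, easy to unroll
            if (!active[k]) continue;
            const float x2 = x[k] * x[k];
            const float y2 = y[k] * y[k];
            if (iter[k] >= max_iter || x2 + y2 > 4.0f) {
                active[k] = false;      // this chain is finished
                continue;
            }
            y[k] = 2.0f * x[k] * y[k] + cy[k];
            x[k] = x2 - y2 + cx[k];
            ++iter[k];
        }
    }
    return iter;
}
```

Both functions should return the same iteration counts for the same inputs; the interleaved one just gives the core more independent work per loop pass. A real version would refill a finished path with the next pixel instead of idling, which is presumably where much of the bookkeeping mentioned above comes from.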
Colin Ferris (399) 1814 posts |
I wonder if anyone has managed to jump between 32-bit and 64-bit – you know the British are mad – but we can but try :-) |
Kuemmel (439) 384 posts |
Updated the results with numbers from an Odroid N2+ (Cortex-A73) and, thanks to Chris, from that 80-core monster, the Ampere Altra (Neoverse N1). The Neoverse is initially best for the ‘opt1’ variant and then loses big time against the Apple M1, but generally shows the same trend for the optimisations and is a big step up compared to the A72/A73. Kind of funny to use only 1 of 80 cores :-) I guess it runs almost idle ;-) I’ll probably convert the benchmark to double precision, as ARMv8 supports that on NEON, in contrast to ARMv7, and then I’ll see if I can get my mind into multicore coding on Linux some day…while I wait for RISC OS. |
Kuemmel (439) 384 posts |
Meanwhile I created a NEON double precision version for Linux (a feature only possible on 64-bit aarch64). It runs at more or less half the speed of the single precision NEON version. But of course it’s more ‘realistic’ to use doubles for Mandelbrot sets, since you’ll need the precision as you go deeper into the set. You’ll find it on the same page here. I also added text-only versions for single and double, so people can test even if they only have a command-line Linux running. Thanks again to Chris for testing on the Apple M1 and Neoverse. When I look at the double precision results and compare them to similar code/results I did for x86, it’s even clearer that Apple has reached or even exceeded the performance of x86 in terms of floating point efficiency. Next step is to implement threading/multicore… |
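To illustrate why the extra precision matters at deeper zoom levels, here is a small sketch in C++. The helper `distinct_coords` and the chosen centre and pixel step are assumptions picked for the example, not values from the benchmark. It counts how many different coordinate values a row of pixels actually gets when the pixel spacing becomes very small.

```cpp
#include <cstddef>

// Hypothetical helper: compute n pixel coordinates center + i * step in
// type T and count how many distinct values come out. If several pixels
// collapse onto the same value, the picture turns blocky at that zoom.
template <typename T>
std::size_t distinct_coords(T center, T step, std::size_t n) {
    std::size_t distinct = 0;
    T prev = T(0);
    for (std::size_t i = 0; i < n; ++i) {
        const T coord = center + static_cast<T>(i) * step;
        if (i == 0 || coord != prev) ++distinct;  // coordinates are monotonic
        prev = coord;
    }
    return distinct;
}
```

With a centre of -0.75 and a step of 1e-9, `distinct_coords<float>(-0.75f, 1e-9f, 16)` should collapse all 16 pixels onto a single value, because neighbouring floats near 0.75 are about 6e-8 apart, while `distinct_coords<double>(-0.75, 1e-9, 16)` keeps all 16 apart. The price is that a 128-bit NEON register holds 2 doubles instead of 4 floats, which fits with the roughly halved speed.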