Mandelbrot AArch32 vs AArch64 shoot-out
Kuemmel (439) 384 posts |
Intrigued by the whole AArch32 vs AArch64 discussion, I thought I should explore the ARMv8 instruction set myself by porting my RISC OS Mandelbrot fractal benchmarks to AArch64 on Linux. And as I don’t have a blog or anything, I’ll just leave this here… It was a bit of a hurdle, since my limited Unix experience dates back some 20 years to SunOS/DEC Alpha stations at university. Nevertheless I managed to fire up the GNU Assembler on Kali Linux 64 on my RPi4. Overall it is quite pleasant to work with.

As I didn’t get into graphics coding, I did it for the console only: once the calculations are done, the code prints the iterations, which serve as a checksum for me to see whether the calculations are correct. For timing, the Linux ‘time’ command seems to give results good enough for my purpose.

So I ported all 4 base applications (Fixed Point, Single Floats, Double Floats and NEON (Single Floats)) and compared them between RISC OS and Linux 64. As the NEON code executes so fast, I have changed one thing since I first published that benchmark: all of the benchmarks now loop 10 times. The whole code for both systems (text output only) can be found at that link. The original graphical versions for RISC OS are still here. To build the Linux code, use the batch “./build”, and to run it use “./measure” in each directory.

The results were quite interesting. They are displayed in million iterations per second. While the Single Float and Double Float versions don’t differ much, there are speed gains for NEON and especially for Fixed Point.

What happened with Fixed Point (it uses a 10.22 fixed point format) on the RPi4? Thanks to the 64 Bit instruction SMULL Xx,Wy,Wz (which internally is an SMADDL Xx,Wy,Wz,XZR) replacing the 32 Bit instruction SMULL Ru,Rv,Rx,Ry, I can save some of the orr/shift-ing needed to correct the fixed point format after the multiplication. But that wouldn’t explain all of it. After looking at the Cortex A72 optimization manual, the speed gain became clearer.
ARM improved the cycle timings of the AArch64 SMADDL compared to SMULL on AArch32.

Regarding NEON, some of the new ASIMD instructions that did not exist in AArch32 are beneficial. The so-called across-vector instructions help in particular (UMAXV Sd,Vn.4S). This one puts the maximum of all 4 words into the destination as a scalar result, which helps to detect the exit of the Mandelbrot iteration loop. Previously you had to do that in a more complicated way. What’s more, for single and double floats the scalar “FCMP” instruction sets the needed flags directly. There is no longer any need to transfer the flags, as was the case with “VCMP”.

So even if I’m just scratching the surface here (the ASIMD section alone in the ARM ARM is more than 900 pages :-)), it seems quite clear that ARM is continuously focusing on enhancing the AArch64 instruction set, while no longer paying much attention to AArch32.

Overall it’s very good to have double the number of registers in AArch64, the Xx and Wx. That helps in optimising speed-critical loops. It’s quite nice that you can mostly choose whether you need a 32 or 64 Bit sized register for your operations. Addressing, though, is by Xx only. On the other hand, it must be a nightmare for a compiler to choose among and cope with the sheer number of available instructions…

What gives you a headache at first when you port NEON code from AArch32 to AArch64 is the different syntax (no general “V…” instructions, now “F…” for floating point and no letter for integer…) and the totally changed register mapping: in AArch32 it was adjacent (D0=S1.S0, Q0=S3.S2.S1.S0,…), now it’s (D0=?.S0, D1=?.S1, V1=?.?.?.S1, V1=?.D1,…). So some code needs to be completely rewritten. Luckily there is now an insert instruction called “INS Vx[index1],Vy[index2]”. Once you have got the hang of those things, it’s actually a pleasure to write ARMv8 code, as you get so many more new toys to play with (registers and instructions).
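To make the register mapping change described above concrete, here is a small Python sketch that models each register name as a byte range inside the register bank. The helper names (`a32_span`, `a64_span`, `overlaps`) are made up for this illustration and come from no toolchain; the sketch only encodes the aliasing rules (AArch32: S, D and Q views packed adjacently; AArch64: S and D are the low part of the same-numbered V register).

```python
# Illustrative model only: register names mapped to byte ranges in the
# SIMD/FP register bank. Helper names are hypothetical.

def a32_span(kind, n):
    """AArch32: views are packed adjacently, so D0 = S1.S0 and Q0 = S3..S0."""
    size = {"S": 4, "D": 8, "Q": 16}[kind]
    return (n * size, n * size + size)

def a64_span(kind, n):
    """AArch64: Sn and Dn are the low 4 or 8 bytes of the 16-byte Vn."""
    size = {"S": 4, "D": 8, "V": 16}[kind]
    return (n * 16, n * 16 + size)

def overlaps(a, b):
    """True if two half-open byte ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]

if __name__ == "__main__":
    # AArch32: S1 lives in the upper half of D0
    print("A32 D0/S1 overlap:", overlaps(a32_span("D", 0), a32_span("S", 1)))
    # AArch64: S1 belongs to V1, not to D0
    print("A64 D0/S1 overlap:", overlaps(a64_span("D", 0), a64_span("S", 1)))
    print("A64 V1/S1 overlap:", overlaps(a64_span("V", 1), a64_span("S", 1)))
```

AArch32 code that builds a D or Q register by writing its S halves therefore cannot be translated name by name, which is where an element insert such as INS comes in.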
From that perspective I don’t see any downside to that ISA compared to 32 Bit at all. Of course you’ll miss conditional execution, but at least a little of it is still there, covered by some instructions. In the ARM ARM you’ll also spot that ARMv8 hasn’t stood still since the Cortex A72, which is quite old now by ARM’s internal standards. I think it now goes up to ARMv8.6. You even get stuff in ASIMD like a complete vector dot-product instruction since ARMv8.2. |
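For readers unfamiliar with the 10.22 fixed point format mentioned in the post above, the following Python sketch shows the idea behind the multiply. It is not the benchmark's code: all function names are made up for illustration, a 32 Bit word with 22 fraction bits is assumed, and the split-halves variant is only one plausible way to repair the format from a 32 Bit SMULL result, not necessarily the sequence the original assembler uses.

```python
# Sketch of a 10.22 fixed point multiply. Assumptions (not from the post):
# 32-bit values, 22 fraction bits, hypothetical helper names.
FRAC_BITS = 22
MASK32 = 0xFFFFFFFF

def to_fixed(x):
    """Convert a float to 10.22 fixed point."""
    return int(round(x * (1 << FRAC_BITS)))

def fx_mul_wide(a, b):
    """AArch64 style: one 64-bit product, then a single arithmetic shift."""
    return (a * b) >> FRAC_BITS

def fx_mul_split(a, b):
    """AArch32 style: product arrives as two 32-bit halves (lo, hi)
    that are recombined with a shift/orr pair."""
    p = a * b
    lo = p & MASK32
    hi = (p >> 32) & MASK32
    r = ((lo >> FRAC_BITS) | (hi << (32 - FRAC_BITS))) & MASK32
    return r - (1 << 32) if r & 0x80000000 else r  # back to signed 32-bit

def mandel_iters(cx, cy, max_iter=255):
    """Iterate z = z*z + c in fixed point; return the iteration count."""
    four = to_fixed(4.0)
    x = y = 0
    for i in range(max_iter):
        xx = fx_mul_wide(x, x)
        yy = fx_mul_wide(y, y)
        if xx + yy > four:
            return i
        y = (fx_mul_wide(x, y) << 1) + cy  # uses the old x
        x = xx - yy + cx
    return max_iter

if __name__ == "__main__":
    print(mandel_iters(to_fixed(-0.5), to_fixed(0.0)))
```

Both multiply variants give the same result for in-range operands; the wide one simply needs fewer operations, which matches the saving on orr/shift-ing that the post describes.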
David J. Ruck (33) 1636 posts |
Nice, but it would be good to give the figures for Linux AArch64, Linux AArch32 and RISC OS AArch32, just in case there is any difference from toolchains or execution environment. When we’ve got RISC OS on AArch64, it might be easier to do direct comparisons :) |
Kuemmel (439) 384 posts |
@David: Point taken. I’ll see if I can find time to create an AArch32 Linux version and run it on a 32 Bit Linux. It shouldn’t be too difficult since, apart from syntax differences, it’s the same assembler code as on RISC OS. Toolchains don’t play any role, as the code is 100% assembler. Regarding the execution environment, the RISC OS code already runs single-tasking, whereas on Linux it runs in a command line window within the graphical user interface…so there could be a small overhead on the Linux side. I’ll try to figure out how to boot Linux to the command line without the graphical interface and see if the timings differ. But I don’t expect too big a difference, as the Single and Double Float results are quite in line. …and yes, I’ll keep my fingers crossed for a RISC OS AArch64 within the next 10 years ;-) @Edit: I’ll also check out results from the RPi3…the Cortex-A53 can also do ARMv8… |
David J. Ruck (33) 1636 posts |
That’s the overhead, but Linux can also use the other cores for housekeeping tasks, which might give a small underhead! |
Kuemmel (439) 384 posts |
I repeated the tests with Linux booted to the command line. The difference was within the normal deviations, so no impact. I also added results for the RPi3…quite interesting that it didn’t have that cycle optimisation on the SMADDL back then, compared to the RPi4. The RPi3 was also a bit less ‘happy’ with the NEON enhancements. This shows the evolution clearly, although clock for clock the RPi3 was faster than the RPi4 for fixed point math in AArch32. |
Steffen Huber (91) 1953 posts |
My RISC OS Blog is of course always open for guest postings :-) |