Safeguarding the past, present and future of RISC OS for everyone

News | Downloads | Bugs | Bounties | Forums | Library

Forums → Code review →

Max out VFP performance on the latest ARM cores

3 posts, 2 voices

Apr 17, 2021 12:14pm Kuemmel (439) 384 posts	When there’s a new ARM core running RISC OS you might have seen me always checking my old Mandelbrot benchmarks (Link here ) to get some insight in the evolution of the latest ARM cores, especially regarding the floating point performance. Just in the last years when I looked especially at the Cortex A72 (RPi4) results…somehow the performance clock by clock didn’t really enhance compared to former generations even if those cores should theoretically perform somehow better. I was wondering if the way I wrote that benchmark like 15 years ago was generally an optimal way to implement this in Assembler or if it should have been done differently. For your understanding some short introduction to the Mandelbrot algorithm. Some Pseudo-code is here: `DO x_new = xx - yy + a y_new = 2xy + b i% = i% +1 UNTIL (xx+yy)>4 OR i%>max%` Basically you iterate a special area of the Mandelbrot set with that formula and map it to the screen, either in uniform colour when iteration max% is reached or when it diverged to the colour of the used iterations i% by then. To implement this in single precision VFP assembler I tried at least to avoid pipeline stalls by making each instructions independent to the next one. That looks like this here (missing the loop exit checks): `Pixel 0 at x,y - one iteration VMUL.F32 S2, S0, S0 ;" xx VMUL.F32 S3, S1, S1 ;" yy VMUL.F32 S4, S0, S1 ;" xy VADD.F32 S5, S2, S3 ;" xx + yy (diverge state) VSUB.F32 S0, S2, S3 ;" x_new = xx - yy VADD.F32 S1, S4, S4 ;" y_new = 2xy VADD.F32 S0, S0, S6 ;" x_new = x_new + a VADD.F32 S1, S1,S29 ;" y_new = y_new + b` In my may be naive way I thought that would help the CPU enough to keep the instruction decoder or the usually 2 floating point execution units in one CPU busy enough. It seems I was clearly wrong on this. I guess due to the latency/throughput timings the execution units were still kind of clogged or lets say not used to their full capabilities. But then I remembered some optimisation I did on x86 code that another coder suggested to fully exploit those huge out-of-order windows that modern CPU’s have (also recent ARM cores since may be Cortex A15, but not e.g. Cortex A53, A7, A8, A9. That window is basically an internal table (as far as I get it) where the CPU tries to reorder independent instructions from the flow of code and tries to feed the available execution units as good as possible. Those out-of-orders windows can be as huge as several 100 instructions. For the Cortex A72 it seems to be 128 instructions big. The approach here for such an iterative algorithm is to use as much registers as possible to create as much as possible independent instructions in the code flow. I implemented this by putting a second pixel (so to say the next pixel) calculation within the main iteration loop using totally different registers to the one before: Pixel 0 at init x0,y0 - one iteration VMUL.F32 S2, S0, S0 ;" x0x0 VMUL.F32 S3, S1, S1 ;" y0y0 VMUL.F32 S4, S0, S1 ;" x0y0 VADD.F32 S5, S2, S3 ;" x0x0 + y0y0 (diverge state) VSUB.F32 S0, S2, S3 ;" x0_new = x0x0 - y0y0 VADD.F32 S1, S4, S4 ;" y0_new = 2x0y0 VADD.F32 S0, S0, S6 ;" x0_new = x_new + a0 VADD.F32 S1, S1,S29 ;" y0_new = y_new + b0 Pixel 1 - at init x1,(y1=y0) - one iteration VMUL.F32 S10, S8, S8 ;" x1x1 VMUL.F32 S11, S9, S9 ;" y1y1 VMUL.F32 S12, S8, S9 ;" x1y1 VADD.F32 S13,S10,S11 ;" x1x1 + y1y1 (diverge state) VSUB.F32 S8,S10,S11 ;" x1_new = x1x1 - y1y1 VADD.F32 S9,S12,S12 ;" y1_new = 2x1*y1 VADD.F32 S8, S8,S14 ;" x1_new = x1_new + a1 VADD.F32 S9, S9,S29 ;" y1_new = y1_new + b1 What makes this actually difficult is that you have to implement all the logic that takes care when e.g. pixel 0 diverged or pixel 1 or one of them reached maximum iterations. I leave this to you digging in my source…it gets quite long to handle all of this but works. As I even had more registers left to utilize I even implemented a version for 3 pixels to see if that code is even faster or if I reach another bottleneck or maxed out the CPU already. So in the end I had 3 versions for single precision floating point: FracVFP_single_V1 : 1 pixel loop, 3 times loop unrolling (same as my old benchmark) FracVFP_single_V2 : 2 independent pixel loops, 1 time loop unrolling FracVFP_single_V2 : 3 independent pixel loops, 1 time loop unrolling You can download the new benchmarks here The results showing up where more than I could hope for. You can see the graphs normalised to iterations per clock per Mhz in percentage: http://www.mikusite.de/images/FracV1V2V3results.png (…can’t embed the image somehow…any help on this !? I tried some of the differen syntaxes…not success…) While the old approach V1 as I said didn’t gain over all the cores, the V2 begin to fly with the A72 and went up by 103%. The V3 even further to about 142% compared to V1. To pressure the world champion regarding ARM architecture I asked Chris Gransden to run a Linux 64 version of my code on a virtual Ubuntu system on an Apple M1. You can see that due to the 3 instead of 2 floating point execution units the M1 can even benefit further, showing that the approach works and is scalable once the CPU will get more instruction units. The older CPU’s don’t benefit at all from the new code logic. Special thanks go to Raik, Chris and Jan for testing on various devices. I just wanted to share this as it kind of positively disturbed my knowledge how to optimize code for ARM. Depending on what kind of speed critical loop you have, this could be applicable to other algorithms for sure. Non-iterative should be easier anyway. It seems it’s quite time to revisit some of my old codes :-) I also wonder if the latest compilers do help optimizing stuff like that. I also included versions for double precision which performs the same for most ARM cores anyway. Regarding Fixed-Point and NEON the story is different actually. In AArch 32 we only got half of the amount of registers (16) compared to VFP single and double precision (32). I’m still evaluating but it seems much less effective for the Mandelbrot algorithm as you got to compensate with memory access and different flow. Here AArch 64 would help immensely.

Apr 17, 2021 12:26pm Rick Murray (539) 13840 posts	can’t embed the image somehow… as it kind of positively disturbed my knowledge how to optimize code for ARM Yes, things are a lot more complicated than “the old days”. Code, these days, is written by compilers with all sorts of rules for scheduling instructions. I wouldn’t want to attempt to try it manually. You have demonstrated why. ;-)

Apr 18, 2021 9:20am Kuemmel (439) 384 posts	@Rick: Yeah, it’s kind of crazy. When I tested also my NEON variant I ended up with only like +20% for Cortex A72 but actually lost (!) 35% on a Cortex A53…but of course may be for more general code it’s kind of easier to optimized for both. It’s just sad that so much computation power isn’t utilized I guess. As far as I read here, the Apple M1 has an out-of-order window (or re- order buffer) of may be more than 630 instructions…insane ! I guess it’s the only way they can utilize their huge amount of execution units…that window seems to desperately look out for things to be computed so far ahead in the code, most execution units running may be idle a lot of the time :-)

Reply

To post replies, please first log in.

Forums → Code review →

Search forums

Social

Follow us on

and

ROOL Store

Buy RISC OS Open merchandise here, including SD cards for Raspberry Pi and more.

Donate! Why?

Help ROOL make things happen – please consider donating!

RISC OS IPR

RISC OS is an Open Source operating system owned by RISC OS Developments Ltd and licensed primarily under the Apache 2.0 license.

Description

Developer peer review of proposed code alterations.

Voices

Options

Forums
Login

Contact Us | About Us

The RISC OS Open Beast theme is based on Beast's default layout
Site design © RISC OS Open Limited 2024 except where indicated

Hosted by Arachsys

Powered by Beast © 2006 Josh Goebel and Rick Olson
This site runs on Rails