More ARMv8 Linux adventures
Kuemmel (439) 384 posts |
Some may remember my post here from last year, when I ported my Mandelbrot NEON benchmark to Linux for a single core. Meanwhile I managed to make it multithreaded with the POSIX threads library. Once you know how that works it’s kind of a piece of cake. To max out speed on all cores, the multithreading code assigns one Mandelbrot set line at a time to each available core. When a core finishes its line, it increments the global line counter and takes the next available line, until the set is complete. This ensures that no core ever sits idle, as each line might take a different time to calculate due to the iterative nature, especially with big/little cores. So the parallelisation reaches something like 99%. You can get the code and see a table of results here

Some findings on the side:

I got hold of a Firefly-RK3588S Cortex-A76/A55 board. Results included. I guess something like that will be the next RPi5. That thing is really fast, also has a direct PCI slot for an M.2 SSD, and thanks to the 8nm process still consumes quite little power.

As I did the needed global/atomic variable updates in assembler, I found ARMv8.2 offers a memory add instruction…since when is ARM not a load-calc-store architecture any more ;-) !? So with ARMv8.2 you can do …while with ARMv8 you have to

Though in my case the speedup from the ARMv8.2 variant isn’t there, as I don’t use it very often during runtime.
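For anyone who wants to see the line-at-a-time scheme without reading the assembler, here is a minimal C sketch of the idea. It is not the benchmark's actual code: `HEIGHT`, `NUM_THREADS`, `compute_line` and `render_all` are made-up names for illustration, and the line routine is only a stand-in. The shared counter is bumped with C11 `atomic_fetch_add`, which plays the role of the atomic memory add mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

#define HEIGHT      64  /* hypothetical number of lines in the set */
#define NUM_THREADS 4   /* hypothetical worker count */

static atomic_int next_line;  /* global line counter shared by all workers */
static int done_by[HEIGHT];   /* worker id + 1 per finished line, 0 = not done */

/* Stand-in for the real per-line Mandelbrot routine (assumption). */
static void compute_line(int line, int id)
{
    assert(done_by[line] == 0);  /* each line must be handed out only once */
    done_by[line] = id + 1;
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (;;) {
        /* Atomically take the current line number and bump the counter. */
        int line = atomic_fetch_add(&next_line, 1);
        if (line >= HEIGHT)
            break;               /* set is complete, nothing left to take */
        compute_line(line, id);
    }
    return NULL;
}

/* Runs the workers and returns how many lines were computed. */
int render_all(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    int started = 0;
    int count = 0;

    memset(done_by, 0, sizeof done_by);
    atomic_store(&next_line, 0);

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        if (pthread_create(&threads[i], NULL, worker, &ids[i]) != 0)
            break;               /* carry on with the threads we did get */
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    for (int y = 0; y < HEIGHT; y++)
        if (done_by[y] != 0)
            count++;
    return count;
}
```

Built with something like `gcc -O2 -pthread`, a call to `render_all()` should hand out every line exactly once, whichever core happens to be faster. As far as I know the atomic memory operations (LDADD and friends) belong to the Large System Extensions that came in with ARMv8.1, so a compiler targeting such a core may emit a single instruction for the fetch-add, while for plain ARMv8.0 it has to fall back to a load-exclusive/store-exclusive retry loop or a helper call.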
If anybody has an interesting ARM device running Linux for which I don’t have results listed, and can spare some time to run the benchmark, it would be nice to get in contact. Check the readme.txt for my email address. |
David J. Ruck (33) 1635 posts |
Sacrilege! |
Rick Murray (539) 13840 posts |
Hmmm, the x86 became more RISC (internally), and the ARM is becoming more CISC. In other words, the 6502 was right all along. ;) |
Kuemmel (439) 384 posts |
@Rick: I’d think so, too. The simple benefit of having a memory ADD or similar is that you need fewer registers, especially when you are dealing with constants from memory. Back in the day I thought a memory ADD would be slower in an inner loop, so I tried to put all those constants in registers beforehand as much as I could. On x86 that’s no longer the case, as far as I can tell, for the newer generations of CPU cores since maybe Core Duo or so. I wouldn’t know what it means from the standpoint of a CPU designer, but as a programmer I want to have that in ARM, too. |
Rick Murray (539) 13840 posts |
I think the issue is that if you do that sort of thing, the processor grinds to a halt as it needs to fetch/write values to memory. Unless this one is aimed at spinlocks and such, the data ought to be preloaded by the many tricks (out of order/speculative execution, etc) which should reduce the impact of direct memory access. As for stacking up the registers, that’s why RISC has lots that are (excepting architecture things like R14 and calling protocol things like SP, FP, etc) completely unrestricted in use (unlike “this is a loop counter” and “this is where the results of calculations end up”). |