Cloud Computing Mandelbrot Benchmarking Experience
Kuemmel (439) 384 posts |
Hi there, just sharing some cloud computing experience while optimising and testing my good old Mandelbrot benchmark, which ended up at 1500 lines of assembler code for optimisation variant number 4. I updated the recent code and results, especially for the Amazon Graviton series with up to 64 cores, on my website. You can find it at the link here. It’s kinda fun to test stuff on something that’s like 40 times faster than your RPi4 :-)

I’m doing those tests using the Amazon cloud computing service (AWS), uploading my code to a Linux Ubuntu shell via FTP and running it from there in a terminal in text mode. The cost of the service is quite low, as I only use it for some minutes (the latest Graviton 3, which is Cortex X-level, with 64 cores is 2.48 dollars per hour). I’ve got to say that the overall experience with AWS is quite nice. At first the interface is a bit overwhelming, but once you get used to it it’s straightforward. If you’ve got a question or need a service, they answer right away and get things done within a couple of minutes.

Encouraged by that, I also tried Google cloud computing. What a totally lame experience. When I wanted to use more than 4 cores they told me “to contact my sales representative” LMAO… I told them I’m just an enthusiast and they didn’t get that. After asking again, a lady from Google Germany called me by phone to understand the problem, and I told her. She said she’d forward me to some subcontractor. Then just nothing happened… such a loser company… :-)

For the results, some key findings: ARMv8-A vs. ARMv8.2-A atomic instructions actually begin to make a difference at high core counts. I gained about 5 percent at 64 cores when using the atomic add instruction instead of the load/store-exclusive loop, as shown here:
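(The original assembler listing isn’t reproduced here. As a rough sketch of the same idea in C, my reconstruction rather than the post’s code: with C11 atomics, AArch64 compilers emit a single LDADD for the add when the LSE atomics introduced in ARMv8.1-A are enabled, e.g. with `-march=armv8.2-a`, and fall back to an LDXR/STXR retry loop on plain ARMv8.0-A. Under contention from 64 cores the retry loop is what costs the few percent.)

```c
/* Sketch, not Kuemmel's original code: a shared counter updated by many
 * threads. The compiler chooses the AArch64 instruction sequence:
 *   plain ARMv8.0-A : ldxr / add / stxr / cbnz retry loop
 *   ARMv8.1-A+ LSE  : one ldadd, no retry loop
 * Build with e.g. -march=armv8.2-a to get the LSE form on AArch64. */
#include <pthread.h>
#include <stdatomic.h>

static atomic_long counter;

typedef struct { long iters; } job_t;

static void *count_worker(void *arg)
{
    job_t *job = arg;
    for (long i = 0; i < job->iters; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

/* Run nthreads threads (up to 64), each adding `iters`; return the total. */
long run_threads(int nthreads, long iters)
{
    pthread_t tid[64];
    job_t job = { iters };
    atomic_store(&counter, 0);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, count_worker, &job);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return atomic_load(&counter);
}
```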
Other than that, the benchmark also shows that it’s hard to feed enough work to the 4 NEON execution units of Graviton 3 or the Apple M1. You can only do that with hand-coded assembler, choosing the register usage yourself; no high-level language/compiler will let you do that, as far as I know. But then… who cares about Mandelbrots and assembler ;-)

The parallel efficiency is very close to 100 percent up to 16 cores, then it goes down a bit. It’s still okay at 70 percent for double precision at 64 cores on Graviton 3 for an iterative algorithm. But I guess this is also due to the thread administration overhead, when you consider that each thread/core only gets fewer than 10 lines of the 600 by 600 dots to be calculated, as I assign a complete line to each core at first.

By the way… if you’ve got an Apple M1 running Asahi Linux, I’d still like some test results, as I’ve got some gaps there. |
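The per-scanline work assignment described above can be sketched like this (my reconstruction in C, not the benchmark’s actual assembler; the atomic row counter and the 64-thread cap are assumptions). Each thread grabs the next whole 600-pixel line until none are left, which also illustrates why 64 cores end up with fewer than 10 lines each:

```c
/* Sketch of scanline-based work sharing (a reconstruction, not the
 * benchmark's code): an atomic row counter hands out complete lines,
 * so each of up to 64 threads always takes the next uncomputed row. */
#include <pthread.h>
#include <stdatomic.h>

#define WIDTH   600
#define HEIGHT  600
#define MAXITER 256

static unsigned char image[HEIGHT][WIDTH];
static atomic_int next_row;

/* Escape-time iteration for one complete 600-pixel scanline. */
static void render_row(int y)
{
    for (int x = 0; x < WIDTH; x++) {
        double cr = -2.0 + 3.0 * x / WIDTH;
        double ci = -1.5 + 3.0 * y / HEIGHT;
        double zr = 0.0, zi = 0.0;
        int it = 0;
        while (it < MAXITER && zr * zr + zi * zi < 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            it++;
        }
        image[y][x] = (unsigned char)it;  /* interior points wrap to 0 */
    }
}

static void *row_worker(void *arg)
{
    (void)arg;
    int y;
    /* Keep taking the next whole scanline until all rows are claimed. */
    while ((y = atomic_fetch_add(&next_row, 1)) < HEIGHT)
        render_row(y);
    return NULL;
}

/* Render with nthreads threads (up to 64); returns the rows completed. */
int render(int nthreads)
{
    pthread_t tid[64];
    atomic_store(&next_row, 0);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, row_worker, NULL);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return atomic_load(&next_row) < HEIGHT ? atomic_load(&next_row) : HEIGHT;
}
```

With this scheme the threads self-balance, but at 600 rows and 64 threads each thread still averages under 10 lines, so per-thread startup and join overhead starts to show in the totals.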
David J. Ruck (33) 1635 posts |
Only 600×600? I thought with that amount of power available you’d be generating 4K images at a minimum! |
Kuemmel (439) 384 posts |
Of course, even with realtime zoom at 4K if I do the math :-) …but I kept the 600×600 to make the results comparable to my older versions on RISC OS, and in the end the resolution doesn’t really matter, as it’s a CPU core evaluation. |