RISC OS Open: Forum: Mandelbrot Fractal VFP version

Jan 5, 2011 2:28pm

Kuemmel (439) 384 posts

With lots of help from Terje, I managed to code a VFP-version for single and double precision of my Mandelbrot fractal benchmark FixFrac. You can find it here: FracVFP

As expected, it is not that fast compared to my fixed-point Mandelbrot code. If one looks up the instruction timings for the VFP instructions, the results fit. I know it’s not really comparable to fixed point, but here are the results in seconds (800 MHz) to calculate that Mandelbrot picture:


- Frac VFP Single Precision: 20,28 s 
- Frac VFP Double Precision: 23,55 s
- FixFrac(64BitMul)........:  3,98 s

So it’s about 5 times slower. But anyway, I’m really happy that it works and it’s fun to code for the VFP, as you don’t have to deal with all the issues of fixed point math. There’s a nice Quick Reference Card from ARM to code for VFP. As I’m only using the basic FMUL/FADD/FSUB instructions of the VFP in the time-critical section (due to the nature of the algorithm), I guess the benefit of the VFP over fixed point math can be considerably higher for the more complex instructions (VDIV, VSQRT).
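To make the comparison a bit more concrete, here is a minimal C sketch of the same escape-time loop in floating point and in fixed point. It is not the FixFrac or FracVFP source (those are ARM assembly). The Q8.24 format, the function names and the ±2.5 input limit are assumptions made for this illustration, and the widening 32×32→64-bit multiply is only a guess at what the "64BitMul" label refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Floating point: iterate z = z*z + c until |z|^2 > 4 or max_iter is reached. */
static int mandel_float(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < max_iter) {
        float zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0f)
            break;
        zi = 2.0f * zr * zi + ci;
        zr = zr2 - zi2 + cr;
        n++;
    }
    return n;
}

/* Fixed point: Q8.24 is an assumed format (8 integer bits, 24 fraction bits). */
#define FRAC_BITS 24
typedef int32_t fix_t;

/* 2.5 in Q8.24; with |cr|, |ci| <= 2.5 no intermediate value leaves the range. */
#define C_LIMIT ((fix_t)(5 * (1 << (FRAC_BITS - 1))))

static fix_t fix_from_double(double x)
{
    return (fix_t)(x * (double)(1 << FRAC_BITS));
}

/* Widening multiply: 32x32 -> 64 bits, then drop the extra fraction bits.
   Right-shifting a negative value is assumed to be an arithmetic shift. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

static int mandel_fixed(fix_t cr, fix_t ci, int max_iter)
{
    const fix_t four = (fix_t)4 << FRAC_BITS;
    fix_t zr = 0, zi = 0;
    int n = 0;

    assert(cr >= -C_LIMIT && cr <= C_LIMIT);
    assert(ci >= -C_LIMIT && ci <= C_LIMIT);

    while (n < max_iter) {
        fix_t zr2 = fix_mul(zr, zr), zi2 = fix_mul(zi, zi);
        if (zr2 + zi2 > four)
            break;
        zi = 2 * fix_mul(zr, zi) + ci;
        zr = zr2 - zi2 + cr;
        n++;
    }
    return n;
}
```

The "issues of fixed point math" show up in the second half: a number format has to be chosen, every multiply needs a shift, and the input range has to be limited so that nothing overflows. The floating-point version needs none of that.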

If you want to assemble the source code with ExtASM, make sure you have the latest version, as we had to deal with a small problem regarding the DCFD command (different encoding of FPA and Cortex-A8-VFP). Make sure your desktop is set to 16 million colours and a minimum resolution of 800×600 when starting the application. It’s really just a first release, so there might still be some problems. Any comments welcome.

Jan 6, 2011 8:24am

Terje Slettebø (285) 275 posts

Great job, Michael!

I just wanted to add that the “only” thing I did was to correct the extASM DCFD assembly directive (declare double precision floating-point value) so that it works for the VFP: it turns out that the FPA and the VFP store double-precision values in different word order, so it now uses the VFP word order, and to get the original FPA word order, one needs to add an #fpa directive in the source code.
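The word-order difference can be illustrated with a small C sketch. This is not the extASM code; the function names `dcfd_vfp` and `dcfd_fpa` are made up for the example, and it assumes a host that stores doubles and 64-bit integers in the same byte order. On a little-endian VFP system the low word of a double comes first in memory, while the FPA format puts the high word (sign, exponent, top of the mantissa) first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split an IEEE 754 double into its high and low 32-bit halves. */
static void double_words(double value, uint32_t *high, uint32_t *low)
{
    uint64_t bits;
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &value, sizeof bits);
    *high = (uint32_t)(bits >> 32);
    *low = (uint32_t)(bits & 0xFFFFFFFFu);
}

/* Words in the order a little-endian VFP reads them: low word first. */
static void dcfd_vfp(double value, uint32_t out[2])
{
    uint32_t high, low;
    double_words(value, &high, &low);
    out[0] = low;
    out[1] = high;
}

/* Words in FPA order: high word first. */
static void dcfd_fpa(double value, uint32_t out[2])
{
    uint32_t high, low;
    double_words(value, &high, &low);
    out[0] = high;
    out[1] = low;
}
```

For the value 1.0 (bit pattern 0x3FF0000000000000), `dcfd_vfp` gives the words 0x00000000, 0x3FF00000 and `dcfd_fpa` gives 0x3FF00000, 0x00000000. A constant assembled in the wrong order is therefore read as a completely different number.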

Furthermore, it seems Michael got some help from reading the CubeDemo source code, as well as debugging tools from there, which I’m happy that he found useful.

Lastly, the VFP init code is originally from Jeffrey Lee. :) (It should also be unnecessary in later versions of RISC OS, being included in the VFPSupport module, although I haven’t tested that, yet)

To my knowledge, this is the first Beagleboard RISC OS demo using the VFP/NEON unit, which is quite cool. :)

Even though the timing may not be that impressive compared to fixed-point math, the time should be at least cut in half on the Cortex-A9, with its faster VFP unit (the Cortex-A8 only has a “VFP Lite” unit).

Furthermore, using the NEON unit, it should be possible to get timings comparable to fixed-point, even today.

The last version of extASM hasn’t been uploaded, yet, but I’ll do that tonight.

Jan 6, 2011 11:56pm

Bryan Hogan (339) 592 posts

Does anyone have the original FastBrot program written by Stephen Streater? I seem to remember that only took a few seconds to calculate on an 8MHz ARM2! It would be interesting to get this running on a BB and see how fast it goes.

Plug – Stephen is the guest speaker at ROUGOL on Monday 17th January. It would be fun to have FastBrot running there.

Jan 7, 2011 7:53am

Trevor Johnson (329) 1645 posts

FastBrot

Sorry, no. I guess you’ve already done some searches. It’s listed here and there’s a small chance that Simon Burrows in Nottingham is the same person. (I presume the pdsoft.lancs.ac.uk service is no more.)

[Edit: Or perhaps it’s more likely to be this Simon Burrows. (He was apparently a member of The ARM Club.)]

Jan 7, 2011 11:07pm

Kuemmel (439) 384 posts

@Terje: I think I’ll try to do a NEON version soon, just to learn more about that unit of the CPU.

It’ll require a bit more time, as the Mandelbrot points you iterate in parallel within one SIMD instruction can reach the end of their iteration at different times while others still have to continue, so far more logic has to be implemented to do that fast. I did this in SSE2 on x86 a while ago, so it’ll be nice to see that running on RiscOS, too.

Jan 11, 2011 9:49pm

Kuemmel (439) 384 posts

I did my first steps with the NEON unit. Even though I only used it in a non-SIMD way (making use of just one of the possible 4 single precision numbers in one 128-bit wide “Qx” register), it was way faster than the VFP for single precision.

Compared to the numbers above I got it done in 5,67 seconds !

This again makes sense when looking at instruction cycles (e.g. VADD is about 9-10 cycles for VFP and 2 cycles for NEON on the Cortex-A8).

So, I hope to be able to implement all the iteration logic soon, using all 4 numbers instead of 1 without much overhead. In the ideal case (of course there will be some overhead due to the algorithm) a further speed-up by a factor of 4 is possible. That will beat the fixed point stuff by far.

I guess if one only wants to use single precision, NEON is the way to do it, and that promises a real speed gain compared to fixed point math.

Jan 12, 2011 3:01pm

Terje Slettebø (285) 275 posts

“I did my first steps with the NEON unit. Even though I only used it in a non-SIMD way (making use of just one of the possible 4 single precision numbers in one 128-bit wide “Qx” register), it was way faster than the VFP for single precision.

Compared to the numbers above I got it done in 5,67 seconds !”

Cool. :)

“I guess if one only wants to use single precision, NEON is the way to do it, and that promises a real speed gain compared to fixed point math.”

Yes indeed: :)

“Recommendations: For floating-point operations, use the NEON unit where possible, and only use the VFP unit when needed.”

Jan 14, 2011 6:59pm

Kuemmel (439) 384 posts

Finally I managed to code the first real SIMD NEON version of my Mandelbrot fractal benchmark. You can find it here: FracNEONVFP. It also includes the VFP version.

The NEON code is more than 10 times faster and still “mathematically equal” to the single precision VFP code. There are still some possibilities for optimisation, as I always wait for all 4 pixels in the Qx SIMD register to diverge. And of course I’m still learning more about NEON each day. One could implement logic to feed new pixels into the iteration chain, but that’s for later ;-) At the moment I’m just impressed how fast it is…so it’s a “brave new world” of NEON on RiscOS for all applications that require fast single precision math.

Here are the updated results in seconds (800 MHz). I also tuned the VFP code a bit.
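The "wait for all 4 pixels" scheme can be sketched in portable C, with a plain array of four floats standing in for one Qx register. This is only an illustration of the idea, not the FracNEONVFP code; the function name and the lane arrays are assumptions, and real NEON code would use compare and bit-select instructions instead of the inner `if`.

```c
#include <assert.h>

#define LANES 4 /* four single precision values, as in one 128-bit Q register */

/* Iterate four points together. The loop only ends once every lane has
   diverged or max_iter is reached, so finished lanes sit idle meanwhile. */
static void mandel4_wait_all(const float cr[LANES], const float ci[LANES],
                             int max_iter, int counts[LANES])
{
    float zr[LANES] = {0.0f}, zi[LANES] = {0.0f};
    int active[LANES];

    assert(max_iter >= 0);
    for (int l = 0; l < LANES; l++) {
        active[l] = 1;
        counts[l] = 0;
    }

    for (int n = 0; n < max_iter; n++) {
        int any_active = 0;
        for (int l = 0; l < LANES; l++) {
            float zr2 = zr[l] * zr[l], zi2 = zi[l] * zi[l];
            if (zr2 + zi2 > 4.0f)
                active[l] = 0;          /* lane has diverged: freeze it */
            if (active[l]) {
                zi[l] = 2.0f * zr[l] * zi[l] + ci[l];
                zr[l] = zr2 - zi2 + cr[l];
                counts[l]++;
                any_active = 1;
            }
        }
        if (!any_active)
            break;                      /* all four lanes are done */
    }
}
```

Each lane ends up with the same count a scalar loop would give for that point. The cost is that a quartet containing one slow pixel keeps the other three lanes waiting, which is the remaining optimisation potential mentioned above.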


- Frac VFP Single Precision..: 19,12 s 
- Frac VFP Double Precision..: 22,39 s
- FixFrac(64BitMul)..........:  3,98 s 
- Frac NEON Single Precision.:  1,72 s

@EDIT: Some small corrections due to some code trash in Frac VFP. Download and result corrected.

Jan 14, 2011 7:26pm

Trevor Johnson (329) 1645 posts

This sounds pretty impressive :-) And in case you’ve not all had enough of poor Christmas cracker jokes…

What does the B in Benoît B Mandelbrot stand for?
...

...

...

...

...

...

Benoît B Mandelbrot

Jan 27, 2011 11:05pm

Kuemmel (439) 384 posts

@Trevor :-)

I tuned my NEON code again. As mentioned before, I have now succeeded in implementing all the logic so that pixels which have diverged or reached the maximum iteration count are instantly replaced by new ones, so the full potential of SIMD is used. You can find it here: FracNEONVFP

It also includes the unchanged VFP version. Here is the updated results table in seconds (800 MHz) again.
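The lane replacement idea can again be sketched in portable C. This is an assumed structure for illustration, not the actual NEON code: the `Lanes` struct, `load_point` and `mandel_refill` are made-up names, and the four lanes are emulated with small arrays.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

typedef struct {
    float zr[LANES], zi[LANES], cr[LANES], ci[LANES];
    int iters[LANES];
    size_t index[LANES]; /* which input point each lane is working on */
    int busy[LANES];
} Lanes;

/* Start a fresh point in one lane. */
static void load_point(Lanes *q, int lane, const float *cr, const float *ci, size_t i)
{
    q->zr[lane] = 0.0f;
    q->zi[lane] = 0.0f;
    q->cr[lane] = cr[i];
    q->ci[lane] = ci[i];
    q->iters[lane] = 0;
    q->index[lane] = i;
    q->busy[lane] = 1;
}

/* As soon as a lane's point diverges or hits max_iter, its result is stored
   and the lane is refilled with the next point, so no lane waits for others. */
static void mandel_refill(const float *cr, const float *ci, size_t count,
                          int max_iter, int *result)
{
    Lanes q;
    size_t next = 0;
    int running = 0;

    assert(max_iter >= 0);
    for (int l = 0; l < LANES; l++) {
        q.busy[l] = 0;
        if (next < count) {
            load_point(&q, l, cr, ci, next++);
            running++;
        }
    }

    while (running > 0) {
        for (int l = 0; l < LANES; l++) {
            if (!q.busy[l])
                continue;
            float zr2 = q.zr[l] * q.zr[l], zi2 = q.zi[l] * q.zi[l];
            if (zr2 + zi2 > 4.0f || q.iters[l] >= max_iter) {
                result[q.index[l]] = q.iters[l];
                if (next < count) {
                    load_point(&q, l, cr, ci, next++);
                } else {
                    q.busy[l] = 0;
                    running--;
                }
                continue;
            }
            q.zi[l] = 2.0f * q.zr[l] * q.zi[l] + q.ci[l];
            q.zr[l] = zr2 - zi2 + q.cr[l];
            q.iters[l]++;
        }
    }
}
```

Because each lane remembers the index of its point, results can finish out of order and still land in the right place. Lanes only go idle at the very end, when the input has run out.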


- Frac VFP Single Precision..: 19,12 s 
- Frac VFP Double Precision..: 22,39 s
- FixFrac(64BitMul)..........:  3,98 s 
- Frac NEON Single Precision.:  1,38 s
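For anyone who wants to re-derive the ratios from the table, here is a small C sketch. The timings are copied from the table above; the struct, the array and the function names are just for this example.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double seconds; /* time at 800 MHz, taken from the results table */
} Result;

static const Result results[] = {
    {"Frac VFP Single Precision", 19.12},
    {"Frac VFP Double Precision", 22.39},
    {"FixFrac(64BitMul)", 3.98},
    {"Frac NEON Single Precision", 1.38},
};

#define RESULT_COUNT (sizeof results / sizeof results[0])

/* How many times faster a run is than the baseline run. */
static double speedup(double baseline_seconds, double seconds)
{
    assert(baseline_seconds > 0.0 && seconds > 0.0);
    return baseline_seconds / seconds;
}

/* Print every entry with its speed-up relative to the given baseline. */
static void print_speedups(FILE *out, double baseline_seconds)
{
    for (size_t i = 0; i < RESULT_COUNT; i++) {
        fprintf(out, "%-28s %6.2f s  x%.2f\n", results[i].name,
                results[i].seconds,
                speedup(baseline_seconds, results[i].seconds));
    }
}
```

Calling `print_speedups(stdout, results[0].seconds)` uses the single precision VFP time as the baseline. With these numbers the NEON version comes out at roughly 13.9 times the VFP single precision speed and roughly 2.9 times the fixed point speed.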

As I wrote similar x86 assembly code, I want to give you a feeling for how fast or slow the C-A8 is from that perspective (of course Intel can use double precision with SSE2, so I have to halve the C-A8 results for fair play). Even so, the C-A8 is still about 2 times faster clock for clock than an Intel Atom, and about 5 times slower than the latest Intel i7, which I still find very good, and I guess the C-A8 is, as always, the winner in terms of performance per watt ;-) Now I would really like to see a C-A9 result…praying for a port of RiscOS one day ;-)

I might try a real-time, video-style zoom demo, as at low iteration depth the speed can be more than 10 frames per second even at 800×600. My benchmark uses an iteration depth of 4096.

Looking more closely at NEON, it could also do wonders for integer stuff: the 128-bit wide registers can be used very efficiently for pairwise additions in single instructions (e.g. look here: Sum Integers). This can speed up some data (e.g. image) manipulations by an order of magnitude, I guess. Some day I’ll try what it can do for my fire benchmark…
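As a rough picture of the pairwise addition idea, here is a portable C sketch that emulates it with a four-element array in place of a 128-bit register. It is not taken from the linked example; `pairwise_add` and `sum_u32` are made-up names. On NEON the inner lane loop would be one vector add, and each `pairwise_add` call would correspond to one pairwise add instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* four 32-bit integers, as in one 128-bit register */

/* One pairwise step: adjacent lanes are added, halving the lane count. */
static void pairwise_add(const uint32_t *in, uint32_t *out, size_t out_lanes)
{
    for (size_t i = 0; i < out_lanes; i++)
        out[i] = in[2 * i] + in[2 * i + 1];
}

/* Sum an array: accumulate four lanes at a time, then fold 4 -> 2 -> 1. */
static uint32_t sum_u32(const uint32_t *data, size_t count)
{
    uint32_t acc[LANES] = {0, 0, 0, 0};
    uint32_t pair[2], total[1];
    size_t i = 0;

    assert(data != NULL || count == 0);

    for (; i + LANES <= count; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += data[i + l];

    pairwise_add(acc, pair, 2);
    pairwise_add(pair, total, 1);

    for (; i < count; i++) /* leftover elements that do not fill a register */
        total[0] += data[i];

    return total[0];
}
```

The arithmetic is unsigned, so an overflowing sum simply wraps around. For image data one would more likely work on 8-bit or 16-bit lanes, which gives even more values per register.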

Jan 28, 2011 6:56pm

W P Blatchley (147) 247 posts

I haven’t had a chance to look at this yet, but I’m really looking forward to it! Thanks for making your efforts public. Would you consider putting some information on the ROOL Wiki about some of the things you’ve found out while working on this VFP / NEON code? It could benefit others in the future, I should think!

Mar 4, 2011 2:27am

Trevor Johnson (329) 1645 posts

My data’s in the table in the ‘Benchmarks’ thread.
