RISC OS Open: Forum: RPi performance

Mar 13, 2020 1:43pm

David Gee (1833) 268 posts

Are there any stats that show the relative performance of the various marks of RPi (up to 3B+) on RISC OS? I’m aware that the newer Pis are faster, but much of this comes from having multiple cores which RO can’t use.

Mar 13, 2020 2:37pm

Stuart Painting (5389) 714 posts

Are there any stats that show the relative performance of the various marks of RPi (up to 3B+) on RISC OS?

Chris Hall has some RISC OS benchmarks at http://www.svrsig.org/images/Page36.htm which include most of the Pi models.

Mar 13, 2020 4:31pm

David Feugey (2125) 2709 posts

https://riscos.fr/utilisez.html
(at the end of the page)

Mar 13, 2020 8:32pm

Rick Murray (539) 13840 posts

but much of this comes from having multiple cores which RO can’t use.

While multiple cores will certainly give a speed boost to capable systems, it’s worth noting that there have been increases in the clock speed of the ARM core and changes in architecture. Oh, and I think the RAM access has sped up along the way, which would make a difference.

Certainly, for building RISC OS, my ARMv7 Pi2 is quite a bit nippier than the Pi1, more than a 200MHz difference in clock speed would imply. It’s not just that, it’s also ARM11 → Cortex-A7. The ARM11 wasn’t terribly fast. The Pi2 feels about twice as fast. It also feels faster than the Beagle-xM (Cortex-A8) despite being 100MHz slower in raw clock speed. But perhaps that is due to the ability to push some of the video handling off to the GPU? The Pi3 clocks a mere 1.2GHz with an A53 processor, and the Pi4 has an A72 at 1.5GHz.
So multiple cores aside, there are notable increases in speed available.

Mar 14, 2020 9:43am

Jeff Blyther (1856) 47 posts

I’m probably hijacking this thread, but I’ve just got RO up and running on a pi4 and I’ve been surprised by its performance compared to the pi3b+ (a clock rate increase of only 100MHz), as one of my ARM code programs is running at nearly double the speed.

I can only assume that the out-of-order core is rearranging my very badly written code and turning it into something sensible. But some of the more sanely written code has only seen an increase of about 14%.
So, as Rick pointed out, the ARM core and subsystems play a big role, and the only real way to see if your code/usage is worth the upgrade is to get someone to try it out for you. I’ve got a few pi’s (from pi0 to 3b+) being used by ‘non nerds’ at work and they don’t really notice the difference between them :-(

Mar 14, 2020 2:59pm

George T. Greenfield (154) 748 posts

and they don’t really notice the difference between them :-(

Is that whilst running RISC OS? Over the years I’ve collated !Firebench results on the various machines I’ve owned or tested – as follows. !Firebench is a programme written by Michael Kubel to calculate 16,000 iterations of a 320 × 100 pixel fire, calculating the new pixels from the existing 8 surrounding pixels; in other words, a straight test of processor power. Results are:

Iyo = 40.02 secs [36 secs on re-test]

Pi1 [default] = 16.1 secs
Pi1 [900/333/450] = 11.95 secs
Pi1 [1000/500/500] = 10.59 secs

Pi2 [default] = 11.62 secs
Pi2 [1000>600/450/450] = 9.4 secs

Pi3 [default] = 5.14 secs

IGEP = 3.66 secs
Titanium = 3.57 secs

RPCEmu 0.9.2 [Win7, Intel i3, 2.6GHz] = 19.65 secs

‘Default’ means standard CPU/Core/RAM clock settings.
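
For readers who have not seen the program, here is a rough sketch of what one "fire" iteration of this kind might look like. It is not Michael Kubel's code: the averaging rule (sum of the 8 neighbours divided by 8), the untouched border, and the name fire_step are all assumptions made for illustration.

```python
def fire_step(src):
    """Return a new frame in which every interior pixel is the
    integer mean of its 8 surrounding pixels in the old frame.

    Assumption: border pixels are copied unchanged. The real
    !FireBench may treat edges and scaling differently.
    """
    height = len(src)
    width = len(src[0])
    dst = [row[:] for row in src]  # start from a copy so borders are kept
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = (src[y - 1][x - 1] + src[y - 1][x] + src[y - 1][x + 1]
                     + src[y][x - 1] + src[y][x + 1]
                     + src[y + 1][x - 1] + src[y + 1][x] + src[y + 1][x + 1])
            dst[y][x] = total >> 3  # divide by 8 with a shift
    return dst


# A 320 x 100 frame, as in the figures quoted above.
frame = [[0] * 320 for _ in range(100)]
frame = fire_step(frame)
```

Each pixel needs eight reads and one write for only a handful of adds, which is why a loop like this tends to lean on the memory system as much as on the CPU.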

Mar 14, 2020 4:33pm

Jeff Blyther (1856) 47 posts

I just downloaded !Firebench; unfortunately it seems the size is now 512 × 256, not the 320 × 100 at which your tests were done, but the iterations have been changed to compensate (16000 down to 4096), so the results should be sameish (probably a dangerous assumption!). Anyway, I ran the test on 3 nearby pi’s and got the following results:-
pi2 13.26 sec
pi3b+ 5.97 sec
pi4 3.46 sec
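
As a quick check on the "sameish" assumption, the total number of pixel updates in the two versions can be compared directly. This only counts pixels times iterations from the figures given in the posts; it says nothing about cache effects from the larger frame, and the variable names are just for illustration.

```python
# Total pixel updates = width * height * iterations.
old_work = 320 * 100 * 16000   # older !FireBench settings
new_work = 512 * 256 * 4096    # newer !FireBench settings

ratio = new_work / old_work

print(old_work)          # 512000000
print(new_work)          # 536870912
print(round(ratio, 4))   # 1.0486
```

So the newer version does roughly 5% more pixel updates, which is close but not identical; the larger frame may also behave differently in cache, so timings from the two versions are best not compared too closely.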

Running another program by Michael Kubel, Fixed point integer fractal, gave the following results :-
pi2 2.99 sec
pi3b+ 1.19 sec
pi4 1.46 sec
Yes, it’s not a mistake: the pi3b+ beat the pi4. I can only assume that the A72 doesn’t like SMULL?

Mar 14, 2020 5:16pm

Kuemmel (439) 384 posts

Hi there…it’s me, the author (nickname Kuemmel)…yes, the pi4 was surprisingly slower for the FixFrac; I don’t actually know why. It’s hard to find good CPU cycle tables from ARM that could give a hint. But you can check the versions for FPU and NEON; there the pi4 is faster, of course when you overclock it, but also clock for clock.

At some point I updated !FireBench as it finished too fast. It was originally written back on my StrongARM. Whereas FixFrac and the VFP/NEON versions (check my website here for the latest versions) are true CPU math-crunching benchmarks with no effect from the memory subsystem, !FireBench is more of a memory benchmark as it needs to shuffle a lot of data, though of course some adding/shifting is also needed to compute the pixels. If I have some time I’ll publish a version using NEON to do the fire that’s an order of magnitude faster :-)

Mar 14, 2020 5:55pm

Jeff Blyther (1856) 47 posts

Hi Kuemmel,
Thanks for publishing; at the moment I’m trying to learn NEON/VFP and it’s nice to see your well-commented code… I wish my code was like that!

Yes, with your VFP/NEON fractal versions the pi4 does beat the pi3b+ A53. Maybe on the A72 we don’t get shifts (LSR/LSL) for free anymore? (although that was probably the case after ARM3)

Mar 14, 2020 6:55pm

Kuemmel (439) 384 posts

Hi Jeff…it’s hard to tell; I wouldn’t expect that they did anything ‘bad’ regarding the shifts. Unfortunately I could never find cycle timings for the Cortex-A53, while for the A72 there’s very good data (Link).

Maybe a bit of instruction reordering makes a difference? (I don’t have my RPI4 set up yet, so I can’t test it myself.) You could try, e.g., moving “MOV R3,R3,LSR#(fl%)” at its two positions in the code to between “SMULL R5,R6,R0,R0” and “SMULL R8,R2,R0,R1”.
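
For anyone following along without the source, the SMULL-plus-shift pattern being discussed is the usual way to multiply fixed-point numbers: SMULL gives the full 64-bit signed product of two 32-bit values, and a shift by the number of fractional bits brings it back to the original scale. Below is a small sketch of that idea only; FRAC_BITS stands in for fl% and its value of 16 is an assumption, as are the helper names, and this is not the FixFrac code itself.

```python
FRAC_BITS = 16  # hypothetical number of fractional bits (plays the role of fl%)


def to_fixed(value):
    """Convert a float to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(value * (1 << FRAC_BITS)))


def to_float(fixed):
    """Convert a fixed-point integer back to a float."""
    return fixed / (1 << FRAC_BITS)


def fixed_mul(a, b):
    """Multiply two fixed-point numbers.

    a * b is the wide product (what SMULL leaves in its two result
    registers); shifting right by FRAC_BITS rescales it.
    """
    return (a * b) >> FRAC_BITS


x = to_fixed(1.5)
y = to_fixed(2.0)
print(to_float(fixed_mul(x, y)))  # 3.0
```

On ARM the shift has to be stitched together from the low and high result registers, so each multiply is followed by a couple of dependent shift/combine instructions, which is where instruction ordering could plausibly matter.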

Mar 14, 2020 7:02pm

George T. Greenfield (154) 748 posts

!FireBench is more of a memory benchmark as it needs to shuffle a lot of data

I stand corrected! But that would explain why the Pi1 (and RPCEmu) are roughly twice as fast as the Iyonix, despite all three having very similar CPU performance as measured by RISCOSmark (Iyo 260%, Pi1 253%, RPCEmu 238% [baseline S/Arm RPC = 100%]), and why overclocking the Core and RAM rates on the Pi1 and 2 has quite a dramatic effect on !FireBench performance.

Mar 14, 2020 7:17pm

Jeff Blyther (1856) 47 posts

Done the code shuffle, but the pi4 gave the same answer (I assume the reorder buffer is optimising the order of code anyway), and it slowed down the pi3b+ :-(

Mar 14, 2020 7:23pm

Kuemmel (439) 384 posts

Okay…that behaviour of the RPI3 is kind of weird…I always thought that its dual-issue pipeline would benefit a bit too…the mysteries of modern CPU internals…meanwhile I reordered everything to the max (you can get it here). Does that help for the pi4? As you said, I wouldn’t expect it, as the reorder buffer should do the same job.

Mar 14, 2020 7:34pm

Jeff Blyther (1856) 47 posts

As you guessed, the pi4 stays the same, but you are making my A53 slower! Your original code runs the best!

Mar 15, 2020 12:02pm

Kuemmel (439) 384 posts

Meanwhile I had time to polish my Fire NEON code. Before I release it officially on my website, could you give it a test run on the RPI4 ? Here’s the link

It’s actually not pixel-perfect identical to the old FireBench. In the old one I calculated from the 8 surrounding pixels; now, to make use of NEON parallel computing, I go for a 3×3 pixel block. With the help of NEON long adds and especially the pretty VEXT instruction, things speed up like hell. It’s roughly 3 times faster than the traditional ARM approach and much easier to code, no unmasking of “fire” bytes at all :-)
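
To illustrate why a full 3×3 block suits a vector unit, here is a scalar sketch of the idea: take three copies of each row offset by one pixel (the job VEXT does on NEON registers), add them element by element (the widening "long" adds), then add three such row sums vertically. This is only an illustration of the approach described above, not Kuemmel's code; the function names are made up, and how the 9-pixel sum is then scaled is left out because the post does not say.

```python
def row_sum3(row):
    """Horizontal 3-pixel sums: left + middle + right for each interior pixel.

    The three slices are the same row shifted by 0, 1 and 2 pixels,
    which is the role VEXT plays on NEON registers.
    """
    left, middle, right = row[:-2], row[1:-1], row[2:]
    return [a + b + c for a, b, c in zip(left, middle, right)]


def block_sum_3x3(frame):
    """Sum of each 3x3 block, one value per interior pixel."""
    horiz = [row_sum3(row) for row in frame]
    return [
        [a + b + c for a, b, c in zip(horiz[y - 1], horiz[y], horiz[y + 1])]
        for y in range(1, len(horiz) - 1)
    ]


frame = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
print(block_sum_3x3(frame))  # [[45, 54], [81, 90]]
```

Because every output pixel is computed with the same adds, many pixels can be processed per instruction, and there is no need to mask individual bytes out of packed words as in the plain ARM version.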

P.S.: Is there still no solution to the login problem on this website? I constantly have to delete all browser data (using Chrome on Windows 10).

Mar 15, 2020 12:25pm

Jeff Blyther (1856) 47 posts

Well, that’s interesting..

pi4 2.52 sec
pi3b+ 2.22 sec
pi2 4.29 sec

The A53 beats the A72 again!
Hopefully other pi4 users can run the test just to see if I’ve not got a dodgy pi4 here :-)

Just had a thought: could it be that the pipelines on the A72 are longer than on the A53? So when the core is maxed out, the shorter pipeline is going to win.

Mar 15, 2020 12:35pm

Jeff Blyther (1856) 47 posts

While I’m logged on (using NetSurf on the pi4; I can’t log on using Safari on my Mac) I must say I’m most impressed with RISC OS on the pi4 at such an early stage. Although I’ve only been running it for a day, I can’t seem to make the system go wrong using my pi as I use it in my work environment (spreadsheet + own progs), and it feels really nippy as well!

Mar 15, 2020 12:59pm

David Pitt (3386) 1248 posts

!FireBenchNeon

Titanium  1.96s
RPi4      2.38s
RPi3B+    2.17s

Mar 15, 2020 1:02pm

Jeff Blyther (1856) 47 posts

Phew… my pi4 is not dodgy then.

Mar 15, 2020 2:10pm

Rick Murray (539) 13840 posts

It’s roughly 3 times faster than the traditional ARM approach and much easier to code, no unmasking of “fire” bytes at all :-)

Am I the only person who thinks that, made full screen and left to run in a loop, it would make a great screensaver?

Mar 15, 2020 4:08pm

Kuemmel (439) 384 posts

Hm, two ideas on the RPI4 being slower…is there still some cache setting not set or enabled, or are your RPI4s having throttling problems, as there may be no fan?

@Jeffrey: Hope you read this; maybe you can say something on the cache setting, as this benchmark uses lots of memory bandwidth, or do you have other suspects?

@Rick: I coded something for full screen with that blur algorithm…I’ll publish it soon with the !FireBench Update. But somebody with more OS-coding talent has to convert it to a screen saver ;-)

Mar 15, 2020 6:32pm

Jeff Blyther (1856) 47 posts

The pi is not throttling, but I have found a problem with the pi4.
Yesterday I was using the 21st Feb ROM (very kindly built by Chris Hall) and all my programs were way faster than on my pi3b+ (nearly double on one of my progs), but using today’s ROM (15 Mar) things are a lot slower: the prog that was nearly double the speed is now slower than on my pi3b+ (but only by a few percent….. no it’s not, it’s 25%!!).

So somewhere between these dates a change has been made that’s not good for pi4’s.
Oh Kuemmel, I just ran !FireBench on the old ROM and it’s not made any difference :-( so the mystery continues.

Update.
I just tried today’s ROM on my pi3b+, which was running 5.24, and the ROM change has not made any difference to the timings, so it’s only a pi4 issue.

Mar 16, 2020 7:56am

Jeff Blyther (1856) 47 posts

I just had a little play around to see if I can see why my pi4 is slower on the new ROM. But after doing a reset, my benchmarking tests are saying it’s not. So something in my setup is causing the problem; I will do some more investigating tonight.

Mar 16, 2020 8:36pm

Chris Gransden (337) 1207 posts

I just had a little play around to see if I can see why my pi4 is slower on the new ROM. But after doing a reset, my benchmarking tests are saying it’s not.

Do you have !CpuClock? You can use it to lock the CPU to the highest clock speed. Sometimes single-tasking programs run at the lower clock speed of 600MHz instead of 1500MHz.

Mar 16, 2020 8:54pm

Chris Johnson (125) 825 posts

You can use it to lock the CPU to the highest clock speed.

Yes, you need to set the low speed to the maximum. I used to have it so that you could never set the slow speed to the max possible speed, but after user request I removed that inhibition 8)
