RISC OS Open: Forum: Benchmarks

Mar 10, 2016 10:56am

Kuemmel (439) 384 posts

Since the arrival of the RPI3 I updated all my Mandelbrot Fractal benchmarks (Fixed Point Math, NEON, VFP single precision, VFP double precision) to be more easily used on all platforms. No more screen switching and some code clean-up. You can find the collection here

I compiled some interesting results here:

               Beagle  Panda  IGEPv5  RPi1   RPi2   RPi3
          MHz   1000	1500   1500   700    900    1200
                 [s]     [s]    [s]   [s]    [s]    [s]
							
FixFrac          3,19   1,82   1,48   5,07   3,24   1,44
FracNEON         1,10   0,71   0,51          2,03   0,63
FracVFP_single  15,35   1,82   2,17   7,02   3,28   2,09
FracVFP_double  17,98   2,06   2,18   7,50   4,87   2,09

                       Difference clock by clock
                Pi3 to   Pi3 to   Pi3 to   Pi3 to   Pi3 to
                Beagle    Panda   IGEPv5    Pi1      Pi2
                 [%]      [%]      [%]      [%]      [%]
				 
FixFrac           +85	   +58	   +28	   +105      +69
FracNEON          +46	   +41	    +1	            +142
FracVFP_single   +512	    +9	   +30	    +96	     +18
FracVFP_double   +617	   +23	   +30	   +109	     +75

As you can see the Pi3/Cortex A53 (clock by clock) is the fastest architecture Risc OS runs by now regarding those benchmarks. It’s impressive how they increased even integer speed (FixFrac) with each generation of cores.

Why that is mostly not reflected in other benchmarks is may be due to the reported issue with the memory transfer speeds of the RPI3. If this is a RiscOS specific issue or not still needs to be sorted out. I tried to figure out what exactly is slow and it’s weird to see that VLD/VLDR/VLDM are reasonably fast at low blocks of memory, but all stores and other loads (LDR/LDM/STR/STM/VSTM/VST/VSTR) are super-slow compared to RPI2…

Mar 10, 2016 11:57am

David Pitt (102) 743 posts

Some more interesting results.

                Titanium   RPi3
MHz               1500     1200
                  (s)      (s)

FixFrac           1.47      1.47
FracNeon          0.51      0.64
FracVFP_single    2.2       2.1
FracVFP_double    2.2       2.1

Mar 10, 2016 9:11pm

Chris Gransden (337) 1207 posts

                  RPi3
MHz               1500
                  (s)

FixFrac           1.17
FracNeon          0.51
FracVFP_single    1.68
FracVFP_double    1.68

arm_freq=1500
core_freq=600
sdram_freq=533

Mar 10, 2016 10:27pm

Rick Murray (539) 13840 posts

Oh my… That’s… impressive… for a $35 bit of kit…

Mar 10, 2016 10:37pm

Rick Murray (539) 13840 posts

Kuemmel – the Pi1 has NEON, just a less complete version.
Do you know why the Beagle’s VFP benchmarks are so miserable? That’s kind of pathetic.

I wonder if the Pi3 load and store speed is due to optimisations for 64 bit behaviour that might come at the expense of 32 bit?
I wonder – does the ARM now contain two entirely different execution units inside it, or is it primarily a 64 bit processor with some additional logic to operate in 32 bit mode?

Mar 10, 2016 11:06pm

Jeffrey Lee (213) 6048 posts

Kuemmel – the Pi1 has NEON, just a less complete version.

Wrong. There is no NEON.

Do you know why the Beagle’s VFP benchmarks are so miserable? That’s kind of pathetic.

AIUI they went for a VFP implementation that was cheap (in terms of die space/complexity) rather than one that was fast.

runfast mode can improve the performance, I think by allowing it to use the NEON execution unit, but I think it still falls short of the performance ratio of later architectures like A9.

Mar 11, 2016 9:05am

Rick Murray (539) 13840 posts

Wrong. There is no NEON.

My mistake. Must have been on the Beagle I was playing with it.

AIUI they went for a VFP implementation that was cheap

Hmm, isn’t that ultimately a bit self defeating? Like, we have FP but it’s crap…

Mar 11, 2016 9:24am

Kuemmel (439) 384 posts

I think you were looking at a different thing may be. The RPI 1 has no NEON and also the VFP unit has only half the Dx registers available (16 instead of 32). I took care about that in my code and request only 16 for the VFP versions and of course use not more than 16.

I checked some old posts from Jeffrey, the runfast mode roughly made the Beagle about 40% faster for single precision…but still very slow…kind of outdated hardware to me anyway…even the xm is now more than 6 years old…

Regarding that memory issue of the RPI3 I’m still puzzled, if anybody can code some inline assembler on linux I would provide some input what and how to test so we can see if those problems appear in the linux world also. The memory test Chris was running didn’t show any slowdown on linux compared to RPI2.

Mar 11, 2016 2:21pm

Jeffrey Lee (213) 6048 posts

I wonder – does the ARM now contain two entirely different execution units inside it, or is it primarily a 64 bit processor with some additional logic to operate in 32 bit mode?

I believe it’s a 64bit processor with logic to allow it to decode the AArch32 instructions. So there are probably three instruction decode units (AArch64, ARM, Thumb) which feed into a generic execution pipeline. This probably isn’t that tricky to implement, when you consider that most AArch32 instructions have a direct equivalent AArch64 instruction, and most (all?) AArch64 instructions have the ability to use just the lower 32 bits of the registers, and when an exception is taken from AArch32 to AArch64 it’s a fairly striaghtforward mapping of registers between the two (i.e. the registers are always stored in the AArch64 register file, and the AArch32 instruction decode logic handles the mapping as appropriate)

Regarding that memory issue of the RPI3 I’m still puzzled, if anybody can code some inline assembler on linux I would provide some input what and how to test so we can see if those problems appear in the linux world also. The memory test Chris was running didn’t show any slowdown on linux compared to RPI2.

I haven’t really had any time to look at this yet, apart from a brief look at the docs to see if there are any clues. Like the Cortex-A7 the L2 cache enable is tied to the same L1 D cache enable bit in the system control register, so if the L1 cache is enabled (which it should be!) then L2 should be enabled as well. The other thing I can think of that might be causing an issue is if the stage 2 MMU (for virtualisation) is enabled, as that might be forcing all our memory accesses to be treated as non-cacheable. The stage 2 MMU should be disabled on startup, but it’s possible the bootloader is enabling it for some reason. Checking the hypervisor CP15 registers and checking raw memory performance in Linux is probably the way to go.

Mar 11, 2016 2:58pm

Rick Murray (539) 13840 posts

I think you were looking at a different thing may be. The RPI 1 has no NEON and also the VFP unit has only half the Dx registers available

Yup. Long day at work, I’m stupid, etc etc. :-/

Mar 11, 2016 3:15pm

Rick Murray (539) 13840 posts

This probably isn’t that tricky to implement, when you consider that most AArch32 instructions have a direct equivalent AArch64 instruction,

…except for LDM and STM which don’t have a direct equivalent, so… Grasping at straws but there must be something going on.

Mar 11, 2016 9:34pm

Kuemmel (439) 384 posts

I coded some small memory transfer and load/store test apps here to find out more about that memory transfer strangeness of the RPI3. With 1 KByte load or store I get e.g.:

     PI2   PI3
MHz  900   1200

     [MByte/s]

LDR  1027  2084
LDRD 1992  2851
LDM  4069  8138
VLDR 5744  8138
VLD  4069  5661
VLDM 4650  6734

STR  1027   485
STRD 2077   485
STM  3616   484
VSTR 4438  8223
VST  3616   484
VSTM 3487   486

what is superweird is when I do the size at 2 KByte also the stores off the PI3 are fast, doing it at 4 KByte slow again, no other cpu is showing that. Overall when using memory transfer load plus store it results in a very poor level like the slow store operations of the RPI3. It seems more an issue of store operation/store cache, may be I was wrong addressing the load operations.

Mar 12, 2016 2:06am

Jeffrey Lee (213) 6048 posts

Good news – I found the problem. Our default cache policy was set to writeback, read-allocate. So if the cache policy is followed to the letter, it means the cache shouldn’t be allocating cache lines for writes that miss the cache – and judging by the poor write performance I’m guessing that’s exactly what’s happening (apart from with VSTR, for some reason). Switching to a read+write-allocate policy brings the performance of writes in line with reads.

When I taught the OS about the VMSAv6 memory attributes a few months ago I did briefly experiment with making write-allocate the default (after some prompting from Ben) but didn’t see any statistically significant difference. So either the benchmark I was using was terrible, or the machine I was using (can’t remember which) was ignoring some of the attributes and treating it as read+write allocate.

Anyway, assuming my checkin made it in in time for the nightly build, you should now see much better memory benchmark performance on the Pi 3, and it’s possible other machines will see improved performance too. I’ll leave the benchmarking to the experts, I haven’t even configured my Pi 2 or 3 to run at full speed yet!

Also, Kuemmel: your tests are almost certainly clobbering dest% because you’re rounding down the start addresses of the buffers. Round up instead (e.g. dest%=(dest%+&FF) AND &FFFFFF00) and it should be fine (as long as you keep the bit which allocates the buffers slightly larger than they need to be) (Oh, and thanks for the test code that showed off the problem so well!)

Mar 12, 2016 1:57pm

Kuemmel (439) 384 posts

Thanks Jeffrey for that hint, stupid mistake with the buffers, might have caused some crashes. I’ll do some benchmarking once the nightly built is there. Is there any reference document/pages for that cache setting stuff ? So the user can really set how the cache is working, I thought that’s all “burnt” in the hardware…are there different cases when either this or that makes sense ?

@Chris Hall: Will be very interesting then to run also your benchmarks again.

Mar 13, 2016 10:14am

Kuemmel (439) 384 posts

Hi Jeffrey, run my test again on PI2 and PI3 both old and new ROM. With the new ROM the small memory blocks are totally in line now:

     PI2   PI3
MHz  900   1200

     [MByte/s]

LDR  1050  1878
LDRD 2077  2790
LDM  4245  8138
VLDR 5425  8138
VLD  4245  5425
VLDM 4882  6510

STR  1050  1878
STRD 2122  2790
STM  3756  5744
VSTR 4650  8138
VST  3756  8138
VSTM 3906 10850

Some other things changed in strange ways found with my !MemSpeed test:

- RAM to “VRAM” on PI2 doubled (new ROM now 800 MByte/s compared to 400 MByte/s before)
- RAM to “VRAM” on PI3 in contrary halfed (new ROM 350 MByte/s compared to 700 MByte before)
- RAM to RAM on PI3 at memory blocks exceeding the 2nd level cache is halfed with the new ROM (around 200 instead of 400 MByte/s)

I didn’t use any strange gpu or memory clocking stuff, so doesn’t make much sense to me, I would expect the PI3 same or faster also “above” second level cache, as the RAM is clocked same according to the specs compared to PI2. Does this VRAM behaviour and lower big block memory speed makes sense to you ?

P.S.: formating post question…how can I get rid of extra empty lines in my post in the number tables ? I’m using just pre plus code. I don’t have any empty lines between the number lines above but it still puts them. @EDIT Fixed, thanks David !

Mar 13, 2016 10:22am

David Pitt (102) 743 posts

formating post question…how can I get rid of extra empty lines in my post in the number tables ?

Just have an empty line above the <pre> tag.

Mar 13, 2016 10:35am

David Pitt (102) 743 posts

Fix poor Pi 3 memory benchmark performance

romarks from a 1200MHz RPi3

Before :-

Version number: RISC OS 5.23
Build date    : Wed,02 Mar 2016.04:24:20
Test                                           Benchmark
Processor - Looped instructions (cache)          2320048    1304%
Memory - Multiple register transfer                 1840    1135%

After :-

Version number: RISC OS 5.23
Build date    : Sun,13 Mar 2016.04:23:32
Test                                           Benchmark
Processor - Looped instructions (cache)          2318655    1303%
Memory - Multiple register transfer                23250   14351%

Wow!!! or Hmm!!!

Mar 13, 2016 12:29pm

George T. Greenfield (154) 748 posts

Anyway, assuming my checkin made it in in time for the nightly build,

Just downloaded it, and I see that the ZeroPain directory is included, containing a !Boot file for merging dated 13 Mar 16: does this mean that the ZeroPain module will still allow low-vector-reliant software to run, or am I becoming confused (not for the first time…)? The accompanying notes include the words: “Also note that the module contains a built-in kill switch – it will only run on ROMs built in 2015.” Is that correct?

Mar 13, 2016 1:14pm

Steve Pampling (1551) 8170 posts

does this mean that the ZeroPain module is still working as before

No and yes.
No, the ZeroPain module from ROOL doesn’t work – because it was coded to stop working when the year string of the OS build was no longer 2015

And yes, a modified version of the module that either looks for 2016 or ignores the date will still work.

This from Rick Murray might be useful.

Mar 13, 2016 2:38pm

Jeffrey Lee (213) 6048 posts

Is there any reference document/pages for that cache setting stuff ? So the user can really set how the cache is working, I thought that’s all “burnt” in the hardware…are there different cases when either this or that makes sense ?

Under RISC OS 5 the cache policy for a dynamic area can be specified in the flags when you create the area, as a combination of bits 4-5 and 12-14. Most ARMv7 CPUs support several cache policies, but you’d have to check the TRM for each processor to find out which are actually supported. The hardware also supports specifying separate policies for the L1 and L2 caches, but under RISC OS the same policy will be used for both.

Note: Ignore the “(Writeback if available, or write alloc for areas mapped by HAL)” comment for the default CB policy. The default CB policy for ARMv6+ is now writeback, write-alloc, and for ARMv5 and below I don’t think the OS has ever treated memory mapped by the HAL differently to memory mapped by the OS.

Application space and any other bits of cacheable non-DA memory (e.g. kernel & HAL workspace) use the default CB policy.

Some other things changed in strange ways found with my !MemSpeed test:

- RAM to “VRAM” on PI2 doubled (new ROM now 800 MByte/s compared to 400 MByte/s before)
- RAM to “VRAM” on PI3 in contrary halfed (new ROM 350 MByte/s compared to 700 MByte before)
- RAM to RAM on PI3 at memory blocks exceeding the 2nd level cache is halfed with the new ROM (around 200 instead of 400 MByte/s)

I didn’t use any strange gpu or memory clocking stuff, so doesn’t make much sense to me, I would expect the PI3 same or faster also “above” second level cache, as the RAM is clocked same according to the specs compared to PI2. Does this VRAM behaviour and lower big block memory speed makes sense to you ?

VRAM speed should be the same as the old ROM. Is it possible that during some of the tests the GPU has been underclocking the system to avoid overheating? When I was testing I did see some big differences when repeating the same test multiple times, so running a test just once doesn’t seem to be enough to get 100% trustworthy results.

Another possibility is that maximum performance depends on being able to get the right “rhythm” of memory accesses, and that sometimes something happens to mess with that (interrupt occurring at a specific time, different physical RAM pages being used from one test to another and affecting their distribution in the cache, etc.)

does this mean that the ZeroPain module is still working as before

No and yes.
No, the ZeroPain module from ROOL doesn’t work – because it was coded to stop working when the year string of the OS build was no longer 2015

There is a new version of ZeroPain in the works (with better error reporting), but other things popping up over the past few weeks have been keeping me from working on it.

When thinking about the fact that the current version doesn’t work my internal dialogue is always along the lines of “Hmm, I should probably remove the current version of ZeroPain until the new one is ready” followed by “Nah, I’m sure the new one will be ready in a couple of weeks! Removing the old one and updating the readmes to explain what’s going on would just be a waste of time”

Mar 13, 2016 4:23pm

Steve Pampling (1551) 8170 posts

There is a new version of ZeroPain in the works (with better error reporting)

Nice.
Much as some might see the capture error and carry on as a crutch for old buggy software even they have to admit that having the OS go base over apex like it sometime odes doesn’t help anyone.

Mar 13, 2016 6:11pm

Mike Freestone (2564) 131 posts

When thinking about the fact that the current version doesn’t work my internal dialogue is always along the lines

We’re extremely grateful to the voices in your head for everything you do for risc os – on my pandaboard here the lack of zeropain hasn’t caused any problems with the apps in daily use on a zpp 5.23

Mar 13, 2016 8:03pm

Chris Hall (132) 3554 posts

Benchmarks updated – yes the memory benchmark increases massively on the Pi 3 (but !UnTarBZ2 still doesn’t work even after updating the SHared Unix Lib to 1.14 (the app needs recompiling as the SUL is not a shared library at all).

Mar 23, 2016 9:58pm

Chris Hall (132) 3554 posts

Benchmarks updated for the igepv5 (RapidoIg) with SATA for hard disc speeds.

Mar 28, 2016 8:04am

Matthew Phillips (473) 721 posts

Link to benchmarks for convenience (had to go back to previous page).

Benchmarks

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options

Mar 10, 2016 10:56am Kuemmel (439) 384 posts	Since the arrival of the RPI3 I updated all my Mandelbrot Fractal benchmarks (Fixed Point Math, NEON, VFP single precision, VFP double precision) to be more easily used on all platforms. No more screen switching and some code clean-up. You can find the collection here I compiled some interesting results here: `Beagle Panda IGEPv5 RPi1 RPi2 RPi3 MHz 1000 1500 1500 700 900 1200 [s] [s] [s] [s] [s] [s] FixFrac 3,19 1,82 1,48 5,07 3,24 1,44 FracNEON 1,10 0,71 0,51 2,03 0,63 FracVFP_single 15,35 1,82 2,17 7,02 3,28 2,09 FracVFP_double 17,98 2,06 2,18 7,50 4,87 2,09` `Difference clock by clock Pi3 to Pi3 to Pi3 to Pi3 to Pi3 to Beagle Panda IGEPv5 Pi1 Pi2 [%] [%] [%] [%] [%] FixFrac +85 +58 +28 +105 +69 FracNEON +46 +41 +1 +142 FracVFP_single +512 +9 +30 +96 +18 FracVFP_double +617 +23 +30 +109 +75` As you can see the Pi3/Cortex A53 (clock by clock) is the fastest architecture Risc OS runs by now regarding those benchmarks. It’s impressive how they increased even integer speed (FixFrac) with each generation of cores. Why that is mostly not reflected in other benchmarks is may be due to the reported issue with the memory transfer speeds of the RPI3. If this is a RiscOS specific issue or not still needs to be sorted out. I tried to figure out what exactly is slow and it’s weird to see that VLD/VLDR/VLDM are reasonably fast at low blocks of memory, but all stores and other loads (LDR/LDM/STR/STM/VSTM/VST/VSTR) are super-slow compared to RPI2…

Mar 10, 2016 11:57am David Pitt (102) 743 posts	Some more interesting results. Titanium RPi3 MHz 1500 1200 (s) (s) FixFrac 1.47 1.47 FracNeon 0.51 0.64 FracVFP_single 2.2 2.1 FracVFP_double 2.2 2.1

Mar 10, 2016 9:11pm Chris Gransden (337) 1207 posts	RPi3 MHz 1500 (s) FixFrac 1.17 FracNeon 0.51 FracVFP_single 1.68 FracVFP_double 1.68 arm_freq=1500 core_freq=600 sdram_freq=533

Mar 10, 2016 10:27pm Rick Murray (539) 13840 posts	Oh my… That’s… impressive… for a $35 bit of kit…

Mar 10, 2016 10:37pm Rick Murray (539) 13840 posts	Kuemmel – the Pi1 has NEON, just a less complete version. Do you know why the Beagle’s VFP benchmarks are so miserable? That’s kind of pathetic. I wonder if the Pi3 load and store speed is due to optimisations for 64 bit behaviour that might come at the expense of 32 bit? I wonder – does the ARM now contain two entirely different execution units inside it, or is it primarily a 64 bit processor with some additional logic to operate in 32 bit mode?

Mar 10, 2016 11:06pm Jeffrey Lee (213) 6048 posts	Kuemmel – the Pi1 has NEON, just a less complete version. Wrong. There is no NEON. Do you know why the Beagle’s VFP benchmarks are so miserable? That’s kind of pathetic. AIUI they went for a VFP implementation that was cheap (in terms of die space/complexity) rather than one that was fast. runfast mode can improve the performance, I think by allowing it to use the NEON execution unit, but I think it still falls short of the performance ratio of later architectures like A9.

Mar 11, 2016 9:05am Rick Murray (539) 13840 posts	Wrong. There is no NEON. My mistake. Must have been on the Beagle I was playing with it. AIUI they went for a VFP implementation that was cheap Hmm, isn’t that ultimately a bit self defeating? Like, we have FP but it’s crap…

Mar 11, 2016 9:24am Kuemmel (439) 384 posts	I think you were looking at a different thing may be. The RPI 1 has no NEON and also the VFP unit has only half the Dx registers available (16 instead of 32). I took care about that in my code and request only 16 for the VFP versions and of course use not more than 16. I checked some old posts from Jeffrey, the runfast mode roughly made the Beagle about 40% faster for single precision…but still very slow…kind of outdated hardware to me anyway…even the xm is now more than 6 years old… Regarding that memory issue of the RPI3 I’m still puzzled, if anybody can code some inline assembler on linux I would provide some input what and how to test so we can see if those problems appear in the linux world also. The memory test Chris was running didn’t show any slowdown on linux compared to RPI2.

Mar 11, 2016 2:21pm Jeffrey Lee (213) 6048 posts	I wonder – does the ARM now contain two entirely different execution units inside it, or is it primarily a 64 bit processor with some additional logic to operate in 32 bit mode? I believe it’s a 64bit processor with logic to allow it to decode the AArch32 instructions. So there are probably three instruction decode units (AArch64, ARM, Thumb) which feed into a generic execution pipeline. This probably isn’t that tricky to implement, when you consider that most AArch32 instructions have a direct equivalent AArch64 instruction, and most (all?) AArch64 instructions have the ability to use just the lower 32 bits of the registers, and when an exception is taken from AArch32 to AArch64 it’s a fairly striaghtforward mapping of registers between the two (i.e. the registers are always stored in the AArch64 register file, and the AArch32 instruction decode logic handles the mapping as appropriate) Regarding that memory issue of the RPI3 I’m still puzzled, if anybody can code some inline assembler on linux I would provide some input what and how to test so we can see if those problems appear in the linux world also. The memory test Chris was running didn’t show any slowdown on linux compared to RPI2. I haven’t really had any time to look at this yet, apart from a brief look at the docs to see if there are any clues. Like the Cortex-A7 the L2 cache enable is tied to the same L1 D cache enable bit in the system control register, so if the L1 cache is enabled (which it should be!) then L2 should be enabled as well. The other thing I can think of that might be causing an issue is if the stage 2 MMU (for virtualisation) is enabled, as that might be forcing all our memory accesses to be treated as non-cacheable. The stage 2 MMU should be disabled on startup, but it’s possible the bootloader is enabling it for some reason. Checking the hypervisor CP15 registers and checking raw memory performance in Linux is probably the way to go.

Mar 11, 2016 2:58pm Rick Murray (539) 13840 posts	I think you were looking at a different thing may be. The RPI 1 has no NEON and also the VFP unit has only half the Dx registers available Yup. Long day at work, I’m stupid, etc etc. :-/

Mar 11, 2016 3:15pm Rick Murray (539) 13840 posts	This probably isn’t that tricky to implement, when you consider that most AArch32 instructions have a direct equivalent AArch64 instruction, …except for LDM and STM which don’t have a direct equivalent, so… Grasping at straws but there must be something going on.

Mar 11, 2016 9:34pm Kuemmel (439) 384 posts	I coded some small memory transfer and load/store test apps here to find out more about that memory transfer strangeness of the RPI3. With 1 KByte load or store I get e.g.: `PI2 PI3 MHz 900 1200 [MByte/s] LDR 1027 2084 LDRD 1992 2851 LDM 4069 8138 VLDR 5744 8138 VLD 4069 5661 VLDM 4650 6734 STR 1027 485 STRD 2077 485 STM 3616 484 VSTR 4438 8223 VST 3616 484 VSTM 3487 486` what is superweird is when I do the size at 2 KByte also the stores off the PI3 are fast, doing it at 4 KByte slow again, no other cpu is showing that. Overall when using memory transfer load plus store it results in a very poor level like the slow store operations of the RPI3. It seems more an issue of store operation/store cache, may be I was wrong addressing the load operations.

Mar 12, 2016 2:06am Jeffrey Lee (213) 6048 posts	Good news – I found the problem. Our default cache policy was set to writeback, read-allocate. So if the cache policy is followed to the letter, it means the cache shouldn’t be allocating cache lines for writes that miss the cache – and judging by the poor write performance I’m guessing that’s exactly what’s happening (apart from with VSTR, for some reason). Switching to a read+write-allocate policy brings the performance of writes in line with reads. When I taught the OS about the VMSAv6 memory attributes a few months ago I did briefly experiment with making write-allocate the default (after some prompting from Ben) but didn’t see any statistically significant difference. So either the benchmark I was using was terrible, or the machine I was using (can’t remember which) was ignoring some of the attributes and treating it as read+write allocate. Anyway, assuming my checkin made it in in time for the nightly build, you should now see much better memory benchmark performance on the Pi 3, and it’s possible other machines will see improved performance too. I’ll leave the benchmarking to the experts, I haven’t even configured my Pi 2 or 3 to run at full speed yet! Also, Kuemmel: your tests are almost certainly clobbering dest% because you’re rounding down the start addresses of the buffers. Round up instead (e.g. dest%=(dest%+&FF) AND &FFFFFF00) and it should be fine (as long as you keep the bit which allocates the buffers slightly larger than they need to be) (Oh, and thanks for the test code that showed off the problem so well!)

Mar 12, 2016 1:57pm Kuemmel (439) 384 posts	Thanks Jeffrey for that hint, stupid mistake with the buffers, might have caused some crashes. I’ll do some benchmarking once the nightly built is there. Is there any reference document/pages for that cache setting stuff ? So the user can really set how the cache is working, I thought that’s all “burnt” in the hardware…are there different cases when either this or that makes sense ? @Chris Hall: Will be very interesting then to run also your benchmarks again.

Mar 13, 2016 10:14am Kuemmel (439) 384 posts	Hi Jeffrey, run my test again on PI2 and PI3 both old and new ROM. With the new ROM the small memory blocks are totally in line now: `PI2 PI3 MHz 900 1200 [MByte/s] LDR 1050 1878 LDRD 2077 2790 LDM 4245 8138 VLDR 5425 8138 VLD 4245 5425 VLDM 4882 6510 STR 1050 1878 STRD 2122 2790 STM 3756 5744 VSTR 4650 8138 VST 3756 8138 VSTM 3906 10850` Some other things changed in strange ways found with my !MemSpeed test: - RAM to “VRAM” on PI2 doubled (new ROM now 800 MByte/s compared to 400 MByte/s before) - RAM to “VRAM” on PI3 in contrary halfed (new ROM 350 MByte/s compared to 700 MByte before) - RAM to RAM on PI3 at memory blocks exceeding the 2nd level cache is halfed with the new ROM (around 200 instead of 400 MByte/s) I didn’t use any strange gpu or memory clocking stuff, so doesn’t make much sense to me, I would expect the PI3 same or faster also “above” second level cache, as the RAM is clocked same according to the specs compared to PI2. Does this VRAM behaviour and lower big block memory speed makes sense to you ? P.S.: formating post question…how can I get rid of extra empty lines in my post in the number tables ? I’m using just pre plus code. I don’t have any empty lines between the number lines above but it still puts them. @EDIT Fixed, thanks David !

Mar 13, 2016 10:22am David Pitt (102) 743 posts	formating post question…how can I get rid of extra empty lines in my post in the number tables ? Just have an empty line above the <pre> tag.

Mar 13, 2016 10:35am David Pitt (102) 743 posts	Fix poor Pi 3 memory benchmark performance romarks from a 1200MHz RPi3 Before :- Version number: RISC OS 5.23 Build date : Wed,02 Mar 2016.04:24:20 Test Benchmark Processor - Looped instructions (cache) 2320048 1304% Memory - Multiple register transfer 1840 1135% After :- Version number: RISC OS 5.23 Build date : Sun,13 Mar 2016.04:23:32 Test Benchmark Processor - Looped instructions (cache) 2318655 1303% Memory - Multiple register transfer 23250 14351% Wow!!! or Hmm!!!

Mar 13, 2016 12:29pm George T. Greenfield (154) 748 posts	Anyway, assuming my checkin made it in in time for the nightly build, Just downloaded it, and I see that the ZeroPain directory is included, containing a !Boot file for merging dated 13 Mar 16: does this mean that the ZeroPain module will still allow low-vector-reliant software to run, or am I becoming confused (not for the first time…)? The accompanying notes include the words: “Also note that the module contains a built-in kill switch – it will only run on ROMs built in 2015.” Is that correct?

Mar 13, 2016 1:14pm Steve Pampling (1551) 8170 posts	does this mean that the ZeroPain module is still working as before No and yes. No, the ZeroPain module from ROOL doesn’t work – because it was coded to stop working when the year string of the OS build was no longer 2015 And yes, a modified version of the module that either looks for 2016 or ignores the date will still work. This from Rick Murray might be useful.

Mar 13, 2016 2:38pm Jeffrey Lee (213) 6048 posts	Is there any reference document/pages for that cache setting stuff ? So the user can really set how the cache is working, I thought that’s all “burnt” in the hardware…are there different cases when either this or that makes sense ? Under RISC OS 5 the cache policy for a dynamic area can be specified in the flags when you create the area, as a combination of bits 4-5 and 12-14. Most ARMv7 CPUs support several cache policies, but you’d have to check the TRM for each processor to find out which are actually supported. The hardware also supports specifying separate policies for the L1 and L2 caches, but under RISC OS the same policy will be used for both. Note: Ignore the “(Writeback if available, or write alloc for areas mapped by HAL)” comment for the default CB policy. The default CB policy for ARMv6+ is now writeback, write-alloc, and for ARMv5 and below I don’t think the OS has ever treated memory mapped by the HAL differently to memory mapped by the OS. Application space and any other bits of cacheable non-DA memory (e.g. kernel & HAL workspace) use the default CB policy. Some other things changed in strange ways found with my !MemSpeed test: - RAM to “VRAM” on PI2 doubled (new ROM now 800 MByte/s compared to 400 MByte/s before) - RAM to “VRAM” on PI3 in contrary halfed (new ROM 350 MByte/s compared to 700 MByte before) - RAM to RAM on PI3 at memory blocks exceeding the 2nd level cache is halfed with the new ROM (around 200 instead of 400 MByte/s) I didn’t use any strange gpu or memory clocking stuff, so doesn’t make much sense to me, I would expect the PI3 same or faster also “above” second level cache, as the RAM is clocked same according to the specs compared to PI2. Does this VRAM behaviour and lower big block memory speed makes sense to you ? VRAM speed should be the same as the old ROM. Is it possible that during some of the tests the GPU has been underclocking the system to avoid overheating? When I was testing I did see some big differences when repeating the same test multiple times, so running a test just once doesn’t seem to be enough to get 100% trustworthy results. Another possibility is that maximum performance depends on being able to get the right “rhythm” of memory accesses, and that sometimes something happens to mess with that (interrupt occurring at a specific time, different physical RAM pages being used from one test to another and affecting their distribution in the cache, etc.) does this mean that the ZeroPain module is still working as before No and yes. No, the ZeroPain module from ROOL doesn’t work – because it was coded to stop working when the year string of the OS build was no longer 2015 There is a new version of ZeroPain in the works (with better error reporting), but other things popping up over the past few weeks have been keeping me from working on it. When thinking about the fact that the current version doesn’t work my internal dialogue is always along the lines of “Hmm, I should probably remove the current version of ZeroPain until the new one is ready” followed by “Nah, I’m sure the new one will be ready in a couple of weeks! Removing the old one and updating the readmes to explain what’s going on would just be a waste of time”

Mar 13, 2016 4:23pm Steve Pampling (1551) 8170 posts	There is a new version of ZeroPain in the works (with better error reporting) Nice. Much as some might see the capture error and carry on as a crutch for old buggy software even they have to admit that having the OS go base over apex like it sometime odes doesn’t help anyone.

Mar 13, 2016 6:11pm Mike Freestone (2564) 131 posts	When thinking about the fact that the current version doesn’t work my internal dialogue is always along the lines We’re extremely grateful to the voices in your head for everything you do for risc os – on my pandaboard here the lack of zeropain hasn’t caused any problems with the apps in daily use on a zpp 5.23

Mar 13, 2016 8:03pm Chris Hall (132) 3554 posts	Benchmarks updated – yes the memory benchmark increases massively on the Pi 3 (but !UnTarBZ2 still doesn’t work even after updating the SHared Unix Lib to 1.14 (the app needs recompiling as the SUL is not a shared library at all).

Mar 23, 2016 9:58pm Chris Hall (132) 3554 posts	Benchmarks updated for the igepv5 (RapidoIg) with SATA for hard disc speeds.

Mar 28, 2016 8:04am Matthew Phillips (473) 721 posts	Link to benchmarks for convenience (had to go back to previous page).