Benchmarks
Kuemmel (439) 384 posts |
Since the arrival of the RPI3 I updated all my Mandelbrot Fractal benchmarks (Fixed Point Math, NEON, VFP single precision, VFP double precision) to be more easily used on all platforms: no more screen switching, plus some code clean-up. You can find the collection here. I compiled some interesting results here:
As you can see, the Pi3/Cortex-A53 is (clock for clock) the fastest architecture RISC OS runs on so far, as far as these benchmarks are concerned. It's impressive how they increased even integer speed (FixFrac) with each generation of cores. That this is mostly not reflected in other benchmarks may be due to the reported issue with the memory transfer speeds of the RPI3. Whether this is a RISC OS specific issue or not still needs to be sorted out. I tried to figure out what exactly is slow, and it's weird to see that VLD/VLDR/VLDM are reasonably fast for small blocks of memory, but all stores and other loads (LDR/LDM/STR/STM/VSTM/VST/VSTR) are super slow compared to the RPI2…
David Pitt (102) 743 posts |
Some more interesting results.

                Titanium  RPi3
MHz                 1500  1200
                     (s)   (s)
FixFrac             1.47  1.47
FracNeon            0.51  0.64
FracVFP_single       2.2   2.1
FracVFP_double       2.2   2.1
Chris Gransden (337) 1207 posts |
                RPi3
MHz             1500
                 (s)
FixFrac         1.17
FracNeon        0.51
FracVFP_single  1.68
FracVFP_double  1.68

arm_freq=1500
Rick Murray (539) 13840 posts |
Oh my… That’s… impressive… for a $35 bit of kit… |
Rick Murray (539) 13840 posts |
Kuemmel – the Pi1 has NEON, just a less complete version. I wonder if the Pi3 load and store speed is due to optimisations for 64 bit behaviour that might come at the expense of 32 bit? |
Jeffrey Lee (213) 6048 posts |
Wrong. There is no NEON.
AIUI they went for a VFP implementation that was cheap (in terms of die space/complexity) rather than one that was fast. runfast mode can improve the performance, I think by allowing it to use the NEON execution unit, but I think it still falls short of the performance ratio of later architectures like A9. |
Rick Murray (539) 13840 posts |
My mistake. Must have been on the Beagle I was playing with it.
Hmm, isn’t that ultimately a bit self defeating? Like, we have FP but it’s crap… |
Kuemmel (439) 384 posts |
I think you were maybe looking at a different thing. The RPI 1 has no NEON, and its VFP unit only has half the Dx registers available (16 instead of 32). I took care of that in my code and request only 16 registers for the VFP versions, and of course use no more than 16. I checked some old posts from Jeffrey; the runfast mode made the Beagle roughly 40% faster for single precision…but still very slow…kind of outdated hardware to me anyway…even the xM is now more than 6 years old…

Regarding that memory issue of the RPI3 I'm still puzzled. If anybody can code some inline assembler on Linux, I can provide some input on what and how to test, so we can see if those problems also appear in the Linux world. The memory test Chris was running didn't show any slowdown on Linux compared to the RPI2.
Jeffrey Lee (213) 6048 posts |
I believe it's a 64-bit processor with logic to allow it to decode the AArch32 instructions. So there are probably three instruction decode units (AArch64, ARM, Thumb) which feed into a generic execution pipeline. This probably isn't that tricky to implement, when you consider that most AArch32 instructions have a direct equivalent AArch64 instruction, most (all?) AArch64 instructions have the ability to use just the lower 32 bits of the registers, and when an exception is taken from AArch32 to AArch64 it's a fairly straightforward mapping of registers between the two (i.e. the registers are always stored in the AArch64 register file, and the AArch32 instruction decode logic handles the mapping as appropriate).
I haven't really had any time to look at this yet, apart from a brief look at the docs to see if there are any clues. Like the Cortex-A7, the L2 cache enable is tied to the same bit as the L1 D cache enable in the system control register, so if the L1 cache is enabled (which it should be!) then L2 should be enabled as well. The other thing I can think of that might be causing an issue is if the stage 2 MMU (for virtualisation) is enabled, as that might be forcing all our memory accesses to be treated as non-cacheable. The stage 2 MMU should be disabled on startup, but it's possible the bootloader is enabling it for some reason. Checking the hypervisor CP15 registers and checking raw memory performance in Linux is probably the way to go.
Rick Murray (539) 13840 posts |
Yup. Long day at work, I’m stupid, etc etc. :-/ |
Rick Murray (539) 13840 posts |
…except for LDM and STM which don’t have a direct equivalent, so… Grasping at straws but there must be something going on. |
Kuemmel (439) 384 posts |
I coded some small memory transfer and load/store test apps here to find out more about that memory transfer strangeness of the RPI3. With 1 KByte load or store I get e.g.:
What is super weird is that when I set the size to 2 KByte the stores on the PI3 are also fast, but at 4 KByte they are slow again; no other CPU shows that. Overall, when using a memory transfer of load plus store, the result drops to a very poor level, in line with the slow store operations of the RPI3. It seems to be more an issue of the store operation/store cache; maybe I was wrong to blame the load operations.
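For anyone who wants to reproduce this kind of measurement, a minimal store-bandwidth sketch in BBC BASIC with the inline assembler might look like the following. This is not the actual test code from the apps above: the block size, iteration count, buffer alignment and the centisecond TIME-based timing are all illustrative assumptions.

REM Minimal store-bandwidth sketch; block size, iteration count and
REM centisecond timing are illustrative assumptions.
size%  = 1024                          : REM 1 KByte block, as in the tests above
iters% = 100000
DIM code% 256
DIM space% size% + &FF
buf% = (space% + &FF) AND &FFFFFF00    : REM round UP to a 256 byte boundary

FOR pass% = 0 TO 2 STEP 2
  P% = code%
  [ OPT pass%
  .store_test                          ; R0 = buffer, R1 = iteration count
    STMFD   R13!, {R4-R11, R14}
  .outer
    MOV     R2, R0                     ; restart at the buffer base
    MOV     R3, #size% DIV 32          ; 32 bytes written per STMIA below
  .inner
    STMIA   R2!, {R4-R11}              ; store 8 registers = 32 bytes
    SUBS    R3, R3, #1
    BNE     inner
    SUBS    R1, R1, #1
    BNE     outer
    LDMFD   R13!, {R4-R11, PC}
  ]
NEXT

A% = buf% : B% = iters%                : REM CALL copies A%-H% into R0-R7
t% = TIME
CALL store_test
t% = TIME - t%
IF t% < 1 THEN t% = 1                  : REM avoid division by zero on very fast runs
PRINT "Approx. store bandwidth: "; (size% * iters% / 1048576) / (t% / 100); " MByte/s"

Swapping the STMIA for an LDMIA (or for VSTM/VLDM with VFP registers) gives the corresponding load or VFP variants.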
Jeffrey Lee (213) 6048 posts |
Good news – I found the problem. Our default cache policy was set to writeback, read-allocate. So if the cache policy is followed to the letter, it means the cache shouldn't be allocating cache lines for writes that miss the cache – and judging by the poor write performance I'm guessing that's exactly what's happening (apart from with VSTR, for some reason). Switching to a read+write-allocate policy brings the performance of writes in line with reads.

When I taught the OS about the VMSAv6 memory attributes a few months ago I did briefly experiment with making write-allocate the default (after some prompting from Ben) but didn't see any statistically significant difference. So either the benchmark I was using was terrible, or the machine I was using (can't remember which) was ignoring some of the attributes and treating it as read+write allocate.

Anyway, assuming my checkin made it in in time for the nightly build, you should now see much better memory benchmark performance on the Pi 3, and it's possible other machines will see improved performance too. I'll leave the benchmarking to the experts – I haven't even configured my Pi 2 or 3 to run at full speed yet!

Also, Kuemmel: your tests are almost certainly clobbering dest% because you're rounding down the start addresses of the buffers. Round up instead (e.g. dest%=(dest%+&FF) AND &FFFFFF00) and it should be fine, as long as you keep the bit which allocates the buffers slightly larger than they need to be – see the sketch below.

(Oh, and thanks for the test code that showed off the problem so well!)
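To make the buffer-alignment point concrete, a minimal sketch (the variable names other than dest% and the sizes are illustrative):

REM Allocate slightly more than needed, then round the start UP, never down,
REM so the aligned pointer stays inside the allocation.
size%  = 4096
DIM block% size% + &FF                 : REM &FF spare bytes for the alignment
dest%  = (block% + &FF) AND &FFFFFF00  : REM round UP to a 256 byte boundary
REM Rounding DOWN (block% AND &FFFFFF00) could point below the start of the
REM allocation and clobber whatever happens to live just before it.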
Kuemmel (439) 384 posts |
Thanks Jeffrey for that hint – a stupid mistake with the buffers, it might have caused some crashes. I'll do some benchmarking once the nightly build is there. Is there any reference document or page for that cache setting stuff? So the user can really choose how the cache works – I thought that was all "burnt" into the hardware… Are there different cases where one policy or the other makes sense?

@Chris Hall: It will also be very interesting to run your benchmarks again.
Kuemmel (439) 384 posts |
Hi Jeffrey, I ran my test again on the PI2 and PI3, on both the old and the new ROM. With the new ROM the small memory blocks are totally in line now:
Some other things changed in strange ways, found with my !MemSpeed test:

- RAM to "VRAM" on the PI2 doubled (new ROM now 800 MByte/s compared to 400 MByte/s before)

I didn't use any strange GPU or memory clocking stuff, so this doesn't make much sense to me. I would expect the PI3 to be the same or faster also "above" the second level cache, as the RAM is clocked the same as on the PI2 according to the specs. Does this VRAM behaviour and lower big-block memory speed make sense to you?

P.S.: a post-formatting question… how can I get rid of the extra empty lines in the number tables in my post? I'm using just pre plus code. I don't have any empty lines between the number lines above, but it still inserts them. @EDIT Fixed, thanks David!
David Pitt (102) 743 posts |
Just have an empty line above the <pre> tag. |
David Pitt (102) 743 posts |
"Fix poor Pi 3 memory benchmark performance" – ROMark results from a 1200MHz RPi3.

Before:
Version number: RISC OS 5.23
Build date    : Wed, 02 Mar 2016 04:24:20

Test                                       Benchmark
Processor - Looped instructions (cache)     2320048   1304%
Memory - Multiple register transfer            1840   1135%

After:
Version number: RISC OS 5.23
Build date    : Sun, 13 Mar 2016 04:23:32

Test                                       Benchmark
Processor - Looped instructions (cache)     2318655   1303%
Memory - Multiple register transfer           23250  14351%

Wow!!! or Hmm!!!
George T. Greenfield (154) 748 posts |
Just downloaded it, and I see that the ZeroPain directory is included, containing a !Boot file for merging dated 13 Mar 16: does this mean that the ZeroPain module will still allow low-vector-reliant software to run, or am I becoming confused (not for the first time…)? The accompanying notes include the words: “Also note that the module contains a built-in kill switch – it will only run on ROMs built in 2015.” Is that correct? |
Steve Pampling (1551) 8170 posts |
No and yes. And yes, a modified version of the module that either looks for 2016 or ignores the date will still work. This from Rick Murray might be useful. |
Jeffrey Lee (213) 6048 posts |
Under RISC OS 5 the cache policy for a dynamic area can be specified in the flags when you create the area, as a combination of bits 4-5 and 12-14 (see the sketch at the end of this post). Most ARMv7 CPUs support several cache policies, but you'd have to check the TRM for each processor to find out which are actually supported. The hardware also supports specifying separate policies for the L1 and L2 caches, but under RISC OS the same policy will be used for both.

Note: ignore the "(Writeback if available, or write alloc for areas mapped by HAL)" comment for the default CB policy. The default CB policy for ARMv6+ is now writeback, write-alloc, and for ARMv5 and below I don't think the OS has ever treated memory mapped by the HAL differently to memory mapped by the OS. Application space and any other bits of cacheable non-DA memory (e.g. kernel & HAL workspace) use the default CB policy.

"Some other things changed in strange ways found with my !MemSpeed test:"

VRAM speed should be the same as with the old ROM. Is it possible that during some of the tests the GPU has been underclocking the system to avoid overheating? When I was testing I did see some big differences when repeating the same test multiple times, so running a test just once doesn't seem to be enough to get 100% trustworthy results. Another possibility is that maximum performance depends on being able to get the right "rhythm" of memory accesses, and that sometimes something happens to mess with that (an interrupt occurring at a specific time, different physical RAM pages being used from one test to another and affecting their distribution in the cache, etc.).

"does this mean that the ZeroPain module is still working as before"

There is a new version of ZeroPain in the works (with better error reporting), but other things popping up over the past few weeks have been keeping me from working on it. When thinking about the fact that the current version doesn't work, my internal dialogue is always along the lines of "Hmm, I should probably remove the current version of ZeroPain until the new one is ready", followed by "Nah, I'm sure the new one will be ready in a couple of weeks! Removing the old one and updating the readmes to explain what's going on would just be a waste of time".
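As an illustration of where those flag bits go, a dynamic area can be created from BASIC roughly like this. This is only a sketch: the name, sizes and the value placed in bits 12-14 are placeholders, and the exact policy encodings supported on a given CPU should be taken from the OS_DynamicArea documentation and the processor's TRM.

REM Sketch of creating a dynamic area with an explicit cache policy.
REM The policy value is a placeholder; see the OS_DynamicArea documentation
REM for the encodings of bits 4-5 and 12-14.
policy% = 0                          : REM 0 = default policy (assumed)
flags%  = policy% << 12              : REM bits 12-14 carry the cache policy
size%   = 1024 * 1024                : REM initial (and maximum) size: 1 MByte

SYS "OS_DynamicArea", 0, -1, size%, -1, flags%, size%, 0, 0, "TestArea" TO , area%, , base%
PRINT "Area "; area%; " mapped at &"; STR$~base%

REM ...use the memory at base%...

SYS "OS_DynamicArea", 1, area%       : REM remove the area again when done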
Steve Pampling (1551) 8170 posts |
Nice. |
Mike Freestone (2564) 131 posts |
We're extremely grateful to the voices in your head for everything you do for RISC OS – on my Pandaboard here the lack of ZeroPain hasn't caused any problems with the apps in daily use on a zpp 5.23.
Chris Hall (132) 3554 posts |
Benchmarks updated – yes, the memory benchmark increases massively on the Pi 3 (but !UnTarBZ2 still doesn't work even after updating the Shared Unix Lib to 1.14 – the app needs recompiling, as the SUL is not a shared library at all).
Chris Hall (132) 3554 posts |
Benchmarks updated for the igepv5 (RapidoIg) with SATA for hard disc speeds. |
Matthew Phillips (473) 721 posts |
Link to benchmarks for convenience (had to go back to previous page). |