Benchmarks
Chris Gransden (337) 1207 posts |
If you’re worried about security you really shouldn’t be using RISC OS.
If !Boot gets screwed up it’s now easy to re-image the SD card and start again. |
Chris Gransden (337) 1207 posts |
As has already been mentioned, there isn’t really much that currently runs on RISC OS, or is suited to it, that requires high-bandwidth IO. You’re better off upgrading to a machine with a faster CPU if you want to get more useful work done.
The whole point of using an SD card image is to lower the barriers to getting started with RISC OS, regardless of the hardware, and then go from there. |
Jess Hampshire (158) 865 posts |
But if you have a truly read-only system, how could it be insecure? You are back to the old days of a ROM based system.
Depends on the situation. If RISC OS is designed around that assumption, you immediately lose the option of simple workstations you can allow unknown people to use. (For example RISC OS on the Pi is so cheap you could imagine an Internet Cafe with no charge for machine hire.) |
Steve Revill (20) 1361 posts |
I would say that if you want to have Choices and Scrap in RAMFS and you want to protect your SD card, just flip the write-protect switch on the card. But the RPi doesn’t appear to have wired this up so SDFS cannot know that your card is write-protected. :( |
Jan Rinze (235) 368 posts |
I see in the benchmarks that SDFS is supposed to be reasonably fast. |
Ben Avison (25) 445 posts |
Yes, all three are supported (and MMC and MMCplus as well, plus all the different form factors of all the above). SDHC implies a capacity of 4GB – 32GB. These are too large to be accessed using DOSFS at present, but they are usable if FileCore-formatted. A future version of FAT32FS may also be able to provide access to FAT-formatted SDHC cards. SDXC implies a capacity of 32GB – 2TB, and they come pre-formatted to exFAT, a different format that I don’t think even Linux supports yet (and which has patent protection issues). Again, these work if you format them with a FileCore format.
This may partly be down to the specific card you’re using. Inside an SD card (or a USB flash stick) there is a controller and flash memory. Flash memory by its design can be read in small blocks, but written only in much larger ones. The controller has a dual purpose: to deal with wear levelling, and to create the illusion that you can write in much smaller blocks than the flash memory actually supports. The controller will typically buffer multiple write blocks and coalesce smaller writes into these before writing them to flash – but this does mean that if you are writing to more blocks than the controller is buffering, then performance falls off a cliff. (This is actually a vastly simplified explanation – some people have tried to reverse-engineer the algorithms used and have documented them on the web.)
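As a rough illustration (an untested sketch, not taken from any real benchmark; the file name, sizes and iteration counts are arbitrary), something like this writes the same total amount of data each pass but spreads the writes across an increasing number of widely separated areas of a scratch file, so more and more flash blocks are open at once:

/* Untested sketch: time writes scattered across more and more regions of a
   scratch file.  On many SD cards throughput drops sharply once more flash
   blocks are "open" than the controller can buffer. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BLOCK    4096
#define FILESIZE (64L * 1024 * 1024)
#define WRITES   1024

int main(void)
{
    static char buf[BLOCK];
    memset(buf, 0xAA, sizeof buf);

    /* create the scratch file at full size first, so the timed passes only
       overwrite existing data rather than extending the file */
    FILE *f = fopen("ScratchFile", "wb");
    if (!f) return 1;
    for (long done = 0; done < FILESIZE; done += BLOCK)
        fwrite(buf, 1, BLOCK, f);
    fclose(f);

    for (int regions = 1; regions <= 64; regions *= 2) {
        f = fopen("ScratchFile", "r+b");
        if (!f) return 1;

        clock_t t0 = clock();
        for (int i = 0; i < WRITES; i++) {
            /* round-robin across 'regions' widely separated areas */
            long offset = (long)(i % regions) * (FILESIZE / regions)
                        + (long)(i / regions) * BLOCK;
            fseek(f, offset, SEEK_SET);
            fwrite(buf, 1, BLOCK, f);
        }
        fclose(f);
        printf("%2d regions: %.2fs\n", regions,
               (double)(clock() - t0) / CLOCKS_PER_SEC);
    }
    return 0;
}

Treat the numbers as indicative only. |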
Kuemmel (439) 384 posts |
As the BASIC assembler on the Raspberry Pi is running now, I was able to let Jeffrey run my Mandelbrot benchmarks (adapted for that board regarding the 1920×1080 screen mode and the maximum of 16/32 floating point registers; you can find the code here). The results are quite interesting. Have a look at the diagram below: regarding the VFP unit, the Raspberry Pi is about 3 times faster per MHz than the Beagleboard and about 2 times slower than the Pandaboard. So an unexpected bonus compared to the Beagle. When I looked at the TRMs I found that the results are not that surprising, as the ARM1176 has really low latencies and therefore a much faster implementation of the VFP unit than the Beagle. That should also give good results in applications like POVRAY or other floating point intensive apps. Maybe somebody has got POVRAY running on the Pi? |
Chris Gransden (337) 1207 posts |
POVRay almost works with hardware floating point on RPi. It outputs the first few lines of teapot.pov and then crashes with an illegal instruction error and produces an abort: Address &FC236628 is at offset &00000724 in module VFPSupport.

I did manage to get a few floating point benchmark programs running. I’ve included figures for softfloat fpa and vfp hardware floating point. The figures for the Raspberry Pi include the default clock speed of 700MHz and also a stable overclock of 950MHz. Any higher than that and it became unstable. The Pandaboard ES (pes) was running at 1500MHz and the Beagleboard xM (xm) at 1000MHz. The flops and scimark2 values are both in MFlops. The figures should give a rough idea of the relative performance of hardware and software floating point on the three platforms.

flops              pes       pes       xm       xm       rpi 700  rpi 950  rpi 700   rpi 950
                   fpa       vfp       fpa      vfp      fpa      fpa      vfp       vfp
MFLOPS(1)          15.5761   389.4106  7.9314   63.4146  4.1086   5.9251   128.0246  183.7405
MFLOPS(2)          14.2849   284.3301  7.2491   52.1245  3.8734   5.5834   83.0849   119.4389
MFLOPS(3)          16.9423   320.7002  8.2020   57.0487  4.5887   6.6187   95.5322   137.3158
MFLOPS(4)          18.9992   341.8342  8.8748   60.7553  5.1591   7.4400   105.0884  151.0961

scimark2
Composite Score    16.12     196.65    8.11     37.81    3.73     6.28     30.45     49.13
FFT                8.33      200.03    4.42     27.25    2.47     3.66     25.00     40.53
SOR                29.66     302.03    14.55    54.25    7.00     10.38    52.82     83.64
MonteCarlo         9.40      74.98     4.53     18.39    0.69     3.87     17.08     24.58
Sparse matmult     17.00     192.19    8.71     53.89    4.43     6.85     26.77     46.55
LU                 16.22     214.00    8.33     35.26    4.07     6.65     30.59     50.34

whetstone
MWIPS              80.637    1030.875  42.361   207.512  20.529   30.489   210.954   313.151

romark v1.01 (Scores relative to a RiscPC SA 202MHz)
                                              PES    XM     RPI    RPI
Clock Speed (MHz)                             1500   1000   700    950
Processor - Looped instructions (cache)       1325%  584%   255%   366%
Memory - Multiple register transfer           7453%  1969%  483%   769%
Rectangle Copy - Graphics acceleration test   816%   202%   426%   656%
Icon Plotting - 16 colour sprite with mask    715%   407%   400%   638%
Draw Path - Stroke narrow line                479%   241%   244%   365%
Draw Fill - Plot filled shape                 348%   198%   245%   387%
|
Raik (463) 2061 posts |
On my RPi povray and lame are not working. The right version of the VFPSupport module is missing. |
Rick Murray (539) 13840 posts |
So… What’s with the OMAP3 that makes it so slow? It looks like it kicks ass over the Pi on raw number crunching, although it in turn gets flogged by the PES, which is half again faster in clock rate and much faster than that would suggest at raw data shifting (that memory transfer rate is insane). Would I be (anywhere near) right if I guessed that there’s a reasonably decent path to memory, but the interconnect between CPU, VFP, and GPU (etc) sucks? BTW, does this test include memory “reserved” for the GPU? My xM in Linux struggled to play videos (480p) until I got U-Boot to reserve some video memory, at which point it all went much more smoothly. I understand RISC OS doesn’t have GPU access, however I am wondering if even as a framebuffer it needs to have some workspace zoned off…? |
Kuemmel (439) 384 posts |
Regarding the slow VFP OMAP3 results: one can read up a little bit here. It seems that the Cortex-A8 only has an unpipelined VFP Lite implementation. One solution for this (at least for single precision) is to use the NEON unit for those computations, as far as the instructions are there. @Chris: to prove this for C code, is it possible to tell GCC to use only NEON and not the VFP unit when compiling? Of course this would not work for code using double precision values; I guess the compiler would then use the software library.
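From what I’ve read (untested with the RISC OS GCC port, and mandel.c is just a placeholder name), the usual incantation is something like:

gcc -O3 -mcpu=cortex-a8 -mfpu=neon -funsafe-math-optimizations -o mandel mandel.c

GCC will apparently only use NEON for ordinary floating point arithmetic when -funsafe-math-optimizations (or -ffast-math) is given, precisely because NEON flushes denormals and so isn’t fully IEEE compliant; depending on the toolchain you may also need -mfloat-abi=softfp. Double precision code would still go through the VFP or the software library. |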
Jeffrey Lee (213) 6048 posts |
For the graphics ops there are effectively two things that are being tested:
For the memory performance test, you need to bear in mind that the standard version of romark repeatedly copies a 128KB block of data from A to B. So for OMAP3/OMAP4 it’s more of a cache performance test than a memory performance test (OMAP3 has 256KB of L2 cache, OMAP4 has 1MB). Kuemmel has some nice graphs here highlighting how much of a difference the caches make to performance.
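As a quick way of seeing the effect yourself (just a sketch, not the romark code; the sizes are arbitrary), copying the same total amount of data with different block sizes shows the bandwidth drop once the block no longer fits in L2:

/* Sketch only (not romark): copy 64MB in total using various block sizes.
   Throughput drops once the block no longer fits in the L2 cache. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t total   = 64u * 1024 * 1024;
    const size_t sizes[] = { 64u << 10, 128u << 10, 512u << 10, 4u << 20 };

    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
        size_t block = sizes[s];
        char *src = malloc(block), *dst = malloc(block);
        if (!src || !dst) return 1;
        memset(src, 1, block);

        clock_t t0 = clock();
        for (size_t done = 0; done < total; done += block)
            memcpy(dst, src, block);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%4uKB blocks: %.1f MB/s\n",
               (unsigned)(block >> 10), secs > 0 ? 64.0 / secs : 0.0);
        free(src);
        free(dst);
    }
    return 0;
}
|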
Chris Gransden (337) 1207 posts |
Here’s some more figures to give an idea of real world relative performance of hardware floating point.

              RPi      RPi      xM       PES
              700MHz   950MHz   1000MHz  1500MHz
lame 3.99.3   1.27x    1.98x    1.5x     8.1x
povray 3.6.1  148s     83s      106s     21s
|
Rick Murray (539) 13840 posts |
Ah, but having read the stuff Kuemmel linked to (a few messages above), the Beagle will always suffer in these tests because the OMAP3 has a precise-but-slow VFP implementation; its speedy hardware floating point is the fast single-precision NEON (which is adequate for most real-world stuff). The video codec add-on for the video player on my phone is optimised for NEON, so if it is good enough for XviD and the like, it ought to be good enough for MP3! To put this into context, it takes the VFP between 18 and 21 cycles to perform a single-precision multiply-accumulate. The NEON, on the other hand, can do two per cycle. I would imagine both lame and povray could benefit hugely from using the NEON FP if single precision is sufficient for the calculations. Are you able to build these programs from source? Does your compiler support NEON? |
Chris Gransden (337) 1207 posts |
I was expecting the ‘armv6 vfp’ times to be much slower than the ‘armv7 neon/vfp’ times but they are virtually the same. Plus the ‘armv6 vfp’ executables also run on armv7, which saves having to have two different executables. |
Jeffrey Lee (213) 6048 posts |
To boost VFP performance on OMAP3 (and perhaps OMAP4?) try enabling “runfast mode”, which will allow some (but not all) of the VFP instructions to run in the NEON pipeline. This involves enabling flush-to-zero and default NaN modes, i.e. setting bits 24 & 25 of R3 when calling VFPSupport_CreateContext. On a 600MHz BB this results in Kuemmel’s single precision fractal benchmark completing in 21.06s instead of 28.75s. Note that this will only work with single precision math, since NEON can’t do double precision floats. Enabling flush-to-zero and default NaN mode is also the way to work around the support code requirement on the Pi (we don’t have any support code yet to deal with the situations the hardware can’t cope with, so if bits 24 & 25 of FPSCR aren’t set some software may fail with undefined instruction errors). I’m not quite sure what the best way of enforcing this in C apps is – perhaps by using a custom build of unixlib with _FPU_DEFAULT set to &3000000.
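If you just want to experiment from C without rebuilding unixlib, something along these lines should do it (an untested sketch using GCC-style inline assembly, not part of VFPSupport, and it assumes a VFP context has already been created for the application):

/* Untested sketch: set flush-to-zero (FZ, bit 24) and default NaN (DN,
   bit 25) directly in the FPSCR.  Assumes a VFP context already exists. */
static void enable_runfast(void)
{
    unsigned int fpscr;
    __asm__ volatile ("fmrx %0, fpscr" : "=r" (fpscr));  /* read FPSCR    */
    fpscr |= (1u << 24) | (1u << 25);                     /* set FZ and DN */
    __asm__ volatile ("fmxr fpscr, %0" : : "r" (fpscr));  /* write it back */
}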
I suspect the times are virtually the same because the compiler isn’t smart enough to work out which bits can be done in NEON instead of VFP. Plus it might be wary of automatically using NEON since it isn’t fully IEEE compliant. |
Raik (463) 2061 posts |
I tried CDRip R10 with lame and the same CD with the same settings and comparable hardware (DVD-ROM, hard disc etc.): Panda A4 – 1GHz: <25min |
patric aristide (434) 418 posts |
Amazing how the RPi works out to be 38% faster than the BB-xM in this case. That’s quite significant I’d say. |
Chris Gransden (337) 1207 posts |
I’ve redone the SDL Quake 1 ‘timedemo demo1’ benchmark. This time using GCC 4.7.3 armv6/vfp.

                       Softfloat  Softfloat  vfp      vfp
                       800x600    640x480    800x600  640x480
Raspberry Pi 700MHz    7.5 fps    10.5       12.6     17.9
Raspberry Pi 950MHz    11.0 fps   15.4       18.8     26.6
Beagleboard xM 1GHz    17.8 fps   24.7       25.0     35.3
Pandaboard ES 1.5GHz   28.5 fps   40.2       51.1     74.4
|
Malcolm Hussain-Gambles (1596) 811 posts |
I did the fractal test on my Pi (overclocked): |
Rick Murray (539) 13840 posts |
Significant that you aren’t using code optimised for the strengths of the OMAP3’s FP [1]. The standard multi-capable FP is slow. I guess TI were expecting everybody to use the faster NEON; maybe they revised their thinking on this when designing the OMAP4? Either way, let’s see a NEON-FP-only build, if such a thing exists.
[1] I think of it like the difference between MMX and 3DNow. Same idea, different way of doing it, not compatible. ;-) |
Kuemmel (439) 384 posts |
@Jeffrey: Interesting, that “runfast mode”… for your record, it doesn’t have any effect on my benchmark on my OMAP4; the results stay the same. So it seems quite beneficial, but BB only… wonder why… checking the TRMs now…
Update… in the ARM11/Cortex-A8 TRMs the runfast mode is described. For the Cortex-A9 the runfast mode is not in the TRMs any more (but flush-to-zero and NaN are still there!). Besides the flush-to-zero and NaN thing, the runfast mode additionally says “all exception enable bits are cleared”. I just didn’t find out how you would implement that?
…for the others, you just need to know that the runfast mode isn’t IEEE compliant any more… but in which cases that is really necessary, I don’t know; I guess trial and error… |
Jeffrey Lee (213) 6048 posts |
Replace “TI” with “ARM”, and “OMAP4” with “Cortex-A9” and you might be right :) The ARM cores in OMAP3/OMAP4 are all fairly standard designs from ARM without any tweaks having been made to the instruction pipelines.
With the fully pipelined VFP in the A9 there probably wasn’t any point in them implementing a special runfast mode.
The exception enable bits are in bits 8-15 of the FPSCR. I’m fairly certain they’re ignored on Cortex-A8 (since FP exceptions aren’t supported at all), but for runfast mode to work on ARM11 you would have to make sure they’re clear. I did experiment with runfast mode on ARM11, but it didn’t make things significantly faster (Fractal demo about 5% faster). |
Jerome Mathevet (1630) 19 posts |
Does the RISC OS MPlayer use any FP operations for decoding videos? |
Chris Gransden (337) 1207 posts |
There’s no benefit from hardware floating point. I’ve built a version that is optimised for armv6/7 but it only runs 5-10% faster. |