RISC OS Open: Forum: Benchmarks

Jul 7, 2012 4:35pm

Chris Gransden (337) 1207 posts

The other good reason for having a read only boot drive is security.

If you’re worried about security you really shouldn’t be using RISC OS.

If the bootdrive is properly locked, then users messing with setting etc. is not going to be a problem.

If !Boot gets screwed up it’s now easy to re-image the SD card and start again.

Jul 7, 2012 5:05pm

Chris Gransden (337) 1207 posts

Disc access is already slower than the Iyonix as it goes via USB, and writing to SD is slower still.

As has already been mentioned there isn’t really that much currently that runs on RISC OS or is suited to, that requires high bandwidth IO. You’re better off upgrading to a machine with a faster cpu if you want to get more useful work done.

capacity in SD cards – does the board accept SD/SDHC/SDXC and what limits does that imply for a working – not experimental – computer?

The whole point of using an SD card image is to lower the barriers on getting started with RISC OS regardless of what hardware that may be and then go from there.

Jul 7, 2012 5:20pm

Jess Hampshire (158) 865 posts

If you’re worried about security you really shouldn’t be using RISC OS.

But if you have a truly read-only system, how could it be insecure? You are back to the old days of a ROM based system.

If !Boot gets screwed up it’s now easy to re-image the SD card and start again.

Depends on the situation. If RISC OS is designed for this assumption, you immediately lose the option of simple workstations you can allow unknow people to use. (For example RISC OS on Pi is so cheap you could imagine an Internet Cafe with no charge for machine hire.)

Jul 7, 2012 9:46pm

Steve Revill (20) 1361 posts

I would say that if you want to have Choices and Scrap in RAMFS and you want to protect your SD card, just flip the write-protect switch on the card. But the RPi doesn’t appear to have wired this up so SDFS cannot know that your card is write-protected. :(

Jul 8, 2012 11:43am

Jan Rinze (235) 368 posts

I see in the benchmarks that SDFS is supposed to be reasonably fast.
However on my Beagle Board it seems that it has a problem when writing multiple files. It becomes horribly slow.
This seems to be similar to an old bug with SCSIFS where there was a 4 files / second limit..
Anyone else has this same experience?

Jul 8, 2012 9:55pm

Ben Avison (25) 445 posts

does the board accept SD/SDHC/SDXC

Yes, all three are supported (and MMC and MMCplus as well, plus all the different form factors of all the above).

SDHC implies a capacity of 4GB – 32GB. These are too large to be accessed using DOSFS at present, but they are usable if FileCore formated. A future version of FAT32FS may also be able to provide access to FAT-formatted SDHC cards.

SDXC implies a capacity of 32GB – 2TB, and they come pre-formatted to exFAT, a different format that I don’t think even Linux supports yet (and has patent protection issues). Again, these work if you format them with a FileCore format.

However on my Beagle Board it seems that it has a problem when writing multiple files. It becomes horribly slow.

This may partly be down to the specific card you’re using. Inside an SD card (or a USB flash stick) there is a controller and flash memory. Flash memory by its design can be read in small blocks, but written only in much larger ones. The controller has a dual purpose, to deal with wear levelling, and also to create the illusion that you can write in much smaller blocks than is actually supported by the flash memory. The controller will typically buffer multiple write blocks and coalesce smaller writes into these before writing them to flash – but this does mean that if you are writing to more blocks than the controller is buffering, then performance falls off a cliff. (This is actually a vastly simplified explanation – some people have tried to reverse-engineer the algorithms used and documented them on the web.)

Oct 10, 2012 6:20pm

Kuemmel (439) 384 posts

As the Basic Assembler on the Raspberry Pi is running now I was able to let Jeffrey run my Mandelbrot Benchmarks (adapted for that board regarding screen mode 1920×1080 and the issue about a maximum of 16/32 floating point registers, you can find the code here). The results are quite interesting. Have a look at the diagram below:

I can say that regarding the VFP unit, the Raspberry Pi is about 3 times faster per MHz than the Beagle Board and about 2 times slower than the Panda Board. So an unexpected bonus compared to the Beagle.

When I looked at the TRM’s I found that the results are not that suprising, as the ARM1176 got really low latencies and therefore much faster implemention of the VFP unit than the Beagle.

That should result, for example also in good results for example in applications like POVRAY or whatever floating point intense apps. May be somebody got POVRAY running at the Pi ?

Oct 13, 2012 5:30pm

Chris Gransden (337) 1207 posts

That should result, for example also in good results for example in applications like POVRAY or whatever floating point intense apps. May be somebody got POVRAY running at the Pi ?

POVRay almost works with hardware floating point on RPi. It outputs the first few lines of teapot.pov and then crashes with an illegal instruction error and produces an abort,

Address &FC236628 is at offset &00000724 in module VFPSupport

I did manage to get a few floating point benchmark programs running. I’ve included figures for softfloat fpa and vfp hardware floating point. The figures for the raspberry pi include the default clock speed of 700MHz and also a stable overclock of 950MHz. Any higher than that and it became unstable. The Pandaboard ES (pes) was running at 1500MHz and the Beagleboard xM (xm) at 1000MHz.

The flops and scimark2 values are both in MFlops. The figures should give a rough idea of the relative performance of hardware and software floating point on the three platforms.


flops
                pes      pes       xm      xm       rpi 700  rpi 950  rpi 700   rpi 950
                fpa      vfp       fpa     vfp      fpa      fpa      vfp       vfp
MFLOPS(1)       15.5761  389.4106  7.9314  63.4146  4.1086   5.9251   128.0246  183.7405
MFLOPS(2)       14.2849  284.3301  7.2491  52.1245  3.8734   5.5834    83.0849  119.4389
MFLOPS(3)       16.9423  320.7002  8.2020  57.0487  4.5887   6.6187    95.5322  137.3158
MFLOPS(4)       18.9992  341.8342  8.8748  60.7553  5.1591   7.4400   105.0884  151.0961


scimark2

Composite Score 16.12    196.65     8.11   37.81    3.73      6.28     30.45     49.13

FFT              8.33    200.03     4.42   27.25    2.47      3.66     25.00     40.53
SOR             29.66    302.03    14.55   54.25    7.00     10.38     52.82     83.64
MonteCarlo       9.40     74.98     4.53   18.39    0.69      3.87     17.08     24.58
Sparse matmult  17.00    192.19     8.71   53.89    4.43      6.85     26.77     46.55
LU              16.22    214.00     8.33   35.26    4.07      6.65     30.59     50.34


whetstone

MWIPS           80.637   1030.875  42.361  207.512  20.529   30.489    210.954   313.151

romark v1.01 (Scores relative to a RiscPC SA 202MHz)

                                                  PES      XM     RPI     RPI
Clock Speed (MHz)                                1500    1000     700     950
                                                 
Processor - Looped instructions (cache)          1325%    584%    255%    366% 
Memory - Multiple register transfer              7453%   1969%    483%    769%
Rectangle Copy - Graphics acceleration test       816%    202%    426%    656%
Icon Plotting - 16 colour sprite with mask        715%    407%    400%    638%
Draw Path - Stroke narrow line                    479%    241%    244%    365%
Draw Fill - Plot filled shape                     348%    198%    245%    387%

Oct 13, 2012 6:22pm

Raik (463) 2061 posts

On my RPi povray and lame not working. The right version module VFPSupport ist missing.

Oct 13, 2012 6:23pm

Rick Murray (539) 13840 posts

So… What’s with the OMAP3 that it is so slow? It looks like it kicks ass over the Pi on raw number crunching, although it itself gets flogged by the PES that’s half again faster in clockrate and much faster than it would seem with raw data shifting (that memory transfer rate is insane).
But to have a 1GHz Beagle flunk on graphics ops and, it would seem, VFP, by a processor running almost half as fast? That’s gotta hurt!

Would I be (anywhere near) right if I guessed that there’s a reasonably decent path to memory, but the interconnect between CPU, VFP, and GPU (etc) sucks?

BTW, does this test include memory “reserved” for the GPU? My xM in Linux struggled to play videos (480p) until I got U-Boot to reserve some video memory. At which point it all went much more smoothly. I understand RISC OS doesn’t have GPU access, however I am wondering if even as a framebuffer it needs to have some workspace zoned off…?

Oct 14, 2012 9:26am

Kuemmel (439) 384 posts

Regarding the slow VFP OMAP3 results: One can read up a little bit here . It seems that the Cortex-A8 only got an unpipielined VFP-lite implementation.

One solution for this is (at least for single precision) to use the NEON unit for that computations as far as the instructions are there.

@Chris: To prove this for C-code, is it possible to tell the GCC to use only NEON and not the VFP unit to compile ? Of course this would not work for code using double precision values, I guess then the compiler would use the software library.

Oct 15, 2012 1:18pm

Jeffrey Lee (213) 6048 posts

For the graphics ops there are effectively two things that are being tested:

CPU performance (icon plotting, draw path, draw fill). RISC OS 5 doesn’t map screen memory as cacheable, so screen reads/writes are always going to be significantly slower than accesses to most other areas of memory. However the Pi has a system-level L2 cache, so clock-for-clock it’s always going to be faster than the other machines.
DMA performance (rectangle copy). The OMAP3 DMA controller seems to be a tad slow. Luckily it looks like they fixed it for the OMAP4. Either that or we haven’t configured it properly and it’s running at a slow clock speed.

For the memory performance test, you need to bear in mind that the standard version of romark does a test that involves repeatedly copying a 128KB block of data from A to B. So for OMAP3/OMAP4 it’s more of a cache performance test than a memory performance test (OMAP3 has 256KB L2 cache, OMAP4 has 1MB). Kuemmel has some nice graphs here highlighting how much of a difference the caches make to performance.

Oct 17, 2012 6:07pm

Chris Gransden (337) 1207 posts

Here’s some more figures to give an idea of real world relative performance of hardware floating point.
The figures for lame were using the default encoding settings showing the numbers of times faster than real time to encode a random wav file.
The povray times are the time in seconds it took to render ‘abyss.pov’ -w320 -h120.


             RPi      RPi     xM      PES
             700Mhz   950Mhz  1000MHz 1500Mhz

lame 3.99.3  1.27x    1.98x   1.5x    8.1x


povray 3.6.1 148s     83s     106s    21s

Oct 17, 2012 8:08pm

Rick Murray (539) 13840 posts

Here’s some more figures to give an idea of real world relative performance of hardware floating point.

Ah, but having read the stuff Kuemmel linked to (a few messages above), the Beagle will always suffer in these tests because the OMAP3 has a precise-but-slow VFP implementation, because the speedy hardware floating point is the fast single-precision NEON (which is adequate for most real-world stuff); the video codec add-on for the video player on my phone is optimised for NEON, so if it is good enough for XviD and the like, it ought to be good enough for MP3!

To put this into context, it takes the VFP between 18 and 21 cycles to perform a single precision multiply-accumulate. The NEON, on the other hand, can do two per cycle.

I would imagine both lame and povray could benefit hugely from using the NEON FP if single precision is sufficient for the calculations.

Are you able to build these programs from source? Does your compiler support NEON?

Oct 17, 2012 9:16pm

Chris Gransden (337) 1207 posts

Are you able to build these programs from source? Does your compiler support NEON?

I was expecting the ‘armv6 vfp’ times to be much slower than the ‘armv7 neon/vfp’ times but they are virtualy the same. Plus the ‘armv6 vfp’ executables also run on armv7. Saves having to have two different executables.

Oct 18, 2012 11:28pm

Jeffrey Lee (213) 6048 posts

To boost VFP performance on OMAP3 (and perhaps OMAP4?) try enabling “runfast mode”, which will allow some (but not all) of the VFP instructions to run in the NEON pipeline. This involves enabling flush-to-zero and default NaN modes, i.e. setting bits 24 & 25 of R3 when calling VFPSupport_CreateContext. On a 600MHz BB this results in Kuemmel’s single precision fractal benchmark completing in 21.06s instead of 28.75s. Note that this will only work with single precision math, since NEON can’t do double precision floats.

Enabling flush-to-zero and default NaN mode is also the way to work around the support code requirement on the Pi (we don’t have any support code yet to deal with the situations the hardware can’t cope with, so if bits 24 & 25 of FPSCR aren’t set some software may fail with undefined instruction errors). I’m not quite sure what the best way of enforcing this in C apps is – perhaps by using a custom build of unixlib with _FPU_DEFAULT set to &3000000.

I was expecting the ‘armv6 vfp’ times to be much slower than the ‘armv7 neon/vfp’ times but they are virtualy the same. Plus the ‘armv6 vfp’ executables also run on armv7. Saves having to have two different executables.

I suspect the times are virtually the same because the compiler isn’t smart enough to work out which bits can be done in NEON instead of VFP. Plus it might be wary of automatically using NEON since it isn’t fully IEEE compliant.

Oct 19, 2012 4:13am

Raik (463) 2061 posts

I try CDRipR10 with lame and the same CD with the same settings and comparable Hardware (DVDROM, Harddisc etc.):
CD with 40min (13 songs); lame: VQ4, 192kb/s, stereo

Panda A4 – 1Ghz: <25min
xM – 1 Ghz: 56min
xM – set min/max CPUClock to 1GHz: 50min
RPi – 600MHz: 40min
RPi – 950MHz: 31min

Oct 19, 2012 11:43am

patric aristide (434) 418 posts

Amazing how the RPi works out to be 38% faster than the BB-xM in this case. That’s quite significant I’d say.

Oct 19, 2012 3:48pm

Chris Gransden (337) 1207 posts

I’ve redone the SDL Quake 1 ‘timedemo demo1’ benchmark. This time using GCC 4.7.3 armv6/vfp.


                       Softfloat    Softfloat  vfp      vfp
                       800x600      640x480    800x600  640x480
Raspberry Pi 700MHz     7.5 fps     10.5       12.6     17.9
Raspberry Pi 950MHz    11.0 fps     15.4       18.8     26.6
Beagleboard xM 1GHz    17.8 fps     24.7       25.0     35.3
Pandaboard ES 1.5GHz   28.5 fps     40.2       51.1     74.4

Oct 19, 2012 5:48pm

Malcolm Hussain-Gambles (1596) 811 posts

I did the fractal test on my Pi (overclocked):
I got 7.25s for Double precision and 6.78 for single precision.
Not sure how this translates to above figures

Oct 19, 2012 6:08pm

Rick Murray (539) 13840 posts

Amazing how the RPi works out to be 38% faster than the BB-xM in this case. That’s quite significant I’d say.

Significant that you aren’t using code optimised for the strengths of the OMAP3’s FP ¹. The standard multicapable FP is slow. I guess TI were expecting everybody to use the faster NEON, maybe they revised their thinking on this when designing the OMAP4?

Either way, let’s see a NEON-FP only build, if such a thing exists.

¹ I think of it like the difference between MMX and 3DNow. Same idea, different way of doing it, not compatible. ;-)

Oct 19, 2012 6:31pm

Kuemmel (439) 384 posts

@Jeffrey: Interesting, that “runfast mode”,…for your record, it doesn’t have any effect on my benchmark on my OMAP4, the results stay the same. So it seems some quite benefitial, but BB only…wonder why…checking now the TRM’s…

Update…so in ARM11/C-A8 the runfast mode is described. For C-A9 the runfast mode is not in the TRM’s any more (but flush-to-zero and NaN are still there !)

Besides the flush-to-zero and NaN thing the runfast mode says additionally “all exception enable bits are cleared”. I just didn’t find out how would you implement that ?

…for the others, you just need to know that the runfast mode isn’t IEE compatible any more…but in which occasion that is really necessary, I don’t know, guess try and error…

Oct 19, 2012 7:15pm

Jeffrey Lee (213) 6048 posts

I guess TI were expecting everybody to use the faster NEON, maybe they revised their thinking on this when designing the OMAP4?

Replace “TI” with “ARM”, and “OMAP4” with “Cortex-A9” and you might be right :) The ARM cores in OMAP3/OMAP4 are all fairly standard designs from ARM without any tweaks having been made to the instruction pipelines.

Update…so in ARM11/C-A8 the runfast mode is described. For C-A9 the runfast mode is not in the TRM’s any more (but flush-to-zero and NaN are still there !)

With the fully pipelined VFP in the A9 there probably wasn’t any point in them implementing a special runfast mode.

Besides the flush-to-zero and NaN thing the runfast mode says additionally “all exception enable bits are cleared”. I just didn’t find out how would you implement that ?

The exception enable bits are in bits 8-15 of the FPSCR. I’m fairly certain they’re ignored on Cortex-A8 (since FP exceptions aren’t supported at all), but for runfast mode to work on ARM11 you would have to make sure they’re clear. I did experiment with runfast mode on ARM11, but it didn’t make things significantly faster (Fractal demo about 5% faster).

Oct 19, 2012 8:51pm

Jerome Mathevet (1630) 19 posts

Does RISCOS Mplayer use any fp-operation for decoding videos ?

Oct 20, 2012 11:24am

Chris Gransden (337) 1207 posts

Does RISCOS Mplayer use any fp-operation for decoding videos ?

There’s no benefit from hardware floating point. I’ve built a version that is optimised for armv6/7 but it only runs 5-10% faster.

Benchmarks

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options

Jul 7, 2012 4:35pm Chris Gransden (337) 1207 posts	The other good reason for having a read only boot drive is security. If you’re worried about security you really shouldn’t be using RISC OS. If the bootdrive is properly locked, then users messing with setting etc. is not going to be a problem. If !Boot gets screwed up it’s now easy to re-image the SD card and start again.

Jul 7, 2012 5:05pm Chris Gransden (337) 1207 posts	Disc access is already slower than the Iyonix as it goes via USB, and writing to SD is slower still. As has already been mentioned there isn’t really that much currently that runs on RISC OS or is suited to, that requires high bandwidth IO. You’re better off upgrading to a machine with a faster cpu if you want to get more useful work done. capacity in SD cards – does the board accept SD/SDHC/SDXC and what limits does that imply for a working – not experimental – computer? The whole point of using an SD card image is to lower the barriers on getting started with RISC OS regardless of what hardware that may be and then go from there.

Jul 7, 2012 5:20pm Jess Hampshire (158) 865 posts	If you’re worried about security you really shouldn’t be using RISC OS. But if you have a truly read-only system, how could it be insecure? You are back to the old days of a ROM based system. If !Boot gets screwed up it’s now easy to re-image the SD card and start again. Depends on the situation. If RISC OS is designed for this assumption, you immediately lose the option of simple workstations you can allow unknow people to use. (For example RISC OS on Pi is so cheap you could imagine an Internet Cafe with no charge for machine hire.)

Jul 7, 2012 9:46pm Steve Revill (20) 1361 posts	I would say that if you want to have Choices and Scrap in RAMFS and you want to protect your SD card, just flip the write-protect switch on the card. But the RPi doesn’t appear to have wired this up so SDFS cannot know that your card is write-protected. :(

Jul 8, 2012 11:43am Jan Rinze (235) 368 posts	I see in the benchmarks that SDFS is supposed to be reasonably fast. However on my Beagle Board it seems that it has a problem when writing multiple files. It becomes horribly slow. This seems to be similar to an old bug with SCSIFS where there was a 4 files / second limit.. Anyone else has this same experience?

Jul 8, 2012 9:55pm Ben Avison (25) 445 posts	does the board accept SD/SDHC/SDXC Yes, all three are supported (and MMC and MMCplus as well, plus all the different form factors of all the above). SDHC implies a capacity of 4GB – 32GB. These are too large to be accessed using DOSFS at present, but they are usable if FileCore formated. A future version of FAT32FS may also be able to provide access to FAT-formatted SDHC cards. SDXC implies a capacity of 32GB – 2TB, and they come pre-formatted to exFAT, a different format that I don’t think even Linux supports yet (and has patent protection issues). Again, these work if you format them with a FileCore format. However on my Beagle Board it seems that it has a problem when writing multiple files. It becomes horribly slow. This may partly be down to the specific card you’re using. Inside an SD card (or a USB flash stick) there is a controller and flash memory. Flash memory by its design can be read in small blocks, but written only in much larger ones. The controller has a dual purpose, to deal with wear levelling, and also to create the illusion that you can write in much smaller blocks than is actually supported by the flash memory. The controller will typically buffer multiple write blocks and coalesce smaller writes into these before writing them to flash – but this does mean that if you are writing to more blocks than the controller is buffering, then performance falls off a cliff. (This is actually a vastly simplified explanation – some people have tried to reverse-engineer the algorithms used and documented them on the web.)

Oct 10, 2012 6:20pm Kuemmel (439) 384 posts	As the Basic Assembler on the Raspberry Pi is running now I was able to let Jeffrey run my Mandelbrot Benchmarks (adapted for that board regarding screen mode 1920×1080 and the issue about a maximum of 16/32 floating point registers, you can find the code here). The results are quite interesting. Have a look at the diagram below: I can say that regarding the VFP unit, the Raspberry Pi is about 3 times faster per MHz than the Beagle Board and about 2 times slower than the Panda Board. So an unexpected bonus compared to the Beagle. When I looked at the TRM’s I found that the results are not that suprising, as the ARM1176 got really low latencies and therefore much faster implemention of the VFP unit than the Beagle. That should result, for example also in good results for example in applications like POVRAY or whatever floating point intense apps. May be somebody got POVRAY running at the Pi ?

Oct 13, 2012 5:30pm Chris Gransden (337) 1207 posts	That should result, for example also in good results for example in applications like POVRAY or whatever floating point intense apps. May be somebody got POVRAY running at the Pi ? POVRay almost works with hardware floating point on RPi. It outputs the first few lines of teapot.pov and then crashes with an illegal instruction error and produces an abort, Address &FC236628 is at offset &00000724 in module VFPSupport I did manage to get a few floating point benchmark programs running. I’ve included figures for softfloat fpa and vfp hardware floating point. The figures for the raspberry pi include the default clock speed of 700MHz and also a stable overclock of 950MHz. Any higher than that and it became unstable. The Pandaboard ES (pes) was running at 1500MHz and the Beagleboard xM (xm) at 1000MHz. The flops and scimark2 values are both in MFlops. The figures should give a rough idea of the relative performance of hardware and software floating point on the three platforms. flops pes pes xm xm rpi 700 rpi 950 rpi 700 rpi 950 fpa vfp fpa vfp fpa fpa vfp vfp MFLOPS(1) 15.5761 389.4106 7.9314 63.4146 4.1086 5.9251 128.0246 183.7405 MFLOPS(2) 14.2849 284.3301 7.2491 52.1245 3.8734 5.5834 83.0849 119.4389 MFLOPS(3) 16.9423 320.7002 8.2020 57.0487 4.5887 6.6187 95.5322 137.3158 MFLOPS(4) 18.9992 341.8342 8.8748 60.7553 5.1591 7.4400 105.0884 151.0961 scimark2 Composite Score 16.12 196.65 8.11 37.81 3.73 6.28 30.45 49.13 FFT 8.33 200.03 4.42 27.25 2.47 3.66 25.00 40.53 SOR 29.66 302.03 14.55 54.25 7.00 10.38 52.82 83.64 MonteCarlo 9.40 74.98 4.53 18.39 0.69 3.87 17.08 24.58 Sparse matmult 17.00 192.19 8.71 53.89 4.43 6.85 26.77 46.55 LU 16.22 214.00 8.33 35.26 4.07 6.65 30.59 50.34 whetstone MWIPS 80.637 1030.875 42.361 207.512 20.529 30.489 210.954 313.151 romark v1.01 (Scores relative to a RiscPC SA 202MHz) PES XM RPI RPI Clock Speed (MHz) 1500 1000 700 950 Processor - Looped instructions (cache) 1325% 584% 255% 366% Memory - Multiple register transfer 7453% 1969% 483% 769% Rectangle Copy - Graphics acceleration test 816% 202% 426% 656% Icon Plotting - 16 colour sprite with mask 715% 407% 400% 638% Draw Path - Stroke narrow line 479% 241% 244% 365% Draw Fill - Plot filled shape 348% 198% 245% 387%

Oct 13, 2012 6:22pm Raik (463) 2061 posts	On my RPi povray and lame not working. The right version module VFPSupport ist missing.

Oct 13, 2012 6:23pm Rick Murray (539) 13840 posts	So… What’s with the OMAP3 that it is so slow? It looks like it kicks ass over the Pi on raw number crunching, although it itself gets flogged by the PES that’s half again faster in clockrate and much faster than it would seem with raw data shifting (that memory transfer rate is insane). But to have a 1GHz Beagle flunk on graphics ops and, it would seem, VFP, by a processor running almost half as fast? That’s gotta hurt! Would I be (anywhere near) right if I guessed that there’s a reasonably decent path to memory, but the interconnect between CPU, VFP, and GPU (etc) sucks? BTW, does this test include memory “reserved” for the GPU? My xM in Linux struggled to play videos (480p) until I got U-Boot to reserve some video memory. At which point it all went much more smoothly. I understand RISC OS doesn’t have GPU access, however I am wondering if even as a framebuffer it needs to have some workspace zoned off…?

Oct 14, 2012 9:26am Kuemmel (439) 384 posts	Regarding the slow VFP OMAP3 results: One can read up a little bit here . It seems that the Cortex-A8 only got an unpipielined VFP-lite implementation. One solution for this is (at least for single precision) to use the NEON unit for that computations as far as the instructions are there. @Chris: To prove this for C-code, is it possible to tell the GCC to use only NEON and not the VFP unit to compile ? Of course this would not work for code using double precision values, I guess then the compiler would use the software library.

Oct 15, 2012 1:18pm Jeffrey Lee (213) 6048 posts	For the graphics ops there are effectively two things that are being tested: CPU performance (icon plotting, draw path, draw fill). RISC OS 5 doesn’t map screen memory as cacheable, so screen reads/writes are always going to be significantly slower than accesses to most other areas of memory. However the Pi has a system-level L2 cache, so clock-for-clock it’s always going to be faster than the other machines. DMA performance (rectangle copy). The OMAP3 DMA controller seems to be a tad slow. Luckily it looks like they fixed it for the OMAP4. Either that or we haven’t configured it properly and it’s running at a slow clock speed. For the memory performance test, you need to bear in mind that the standard version of romark does a test that involves repeatedly copying a 128KB block of data from A to B. So for OMAP3/OMAP4 it’s more of a cache performance test than a memory performance test (OMAP3 has 256KB L2 cache, OMAP4 has 1MB). Kuemmel has some nice graphs here highlighting how much of a difference the caches make to performance.

Oct 17, 2012 6:07pm Chris Gransden (337) 1207 posts	Here’s some more figures to give an idea of real world relative performance of hardware floating point. The figures for lame were using the default encoding settings showing the numbers of times faster than real time to encode a random wav file. The povray times are the time in seconds it took to render ‘abyss.pov’ -w320 -h120. RPi RPi xM PES 700Mhz 950Mhz 1000MHz 1500Mhz lame 3.99.3 1.27x 1.98x 1.5x 8.1x povray 3.6.1 148s 83s 106s 21s

Oct 17, 2012 8:08pm Rick Murray (539) 13840 posts	Here’s some more figures to give an idea of real world relative performance of hardware floating point. Ah, but having read the stuff Kuemmel linked to (a few messages above), the Beagle will always suffer in these tests because the OMAP3 has a precise-but-slow VFP implementation, because the speedy hardware floating point is the fast single-precision NEON (which is adequate for most real-world stuff); the video codec add-on for the video player on my phone is optimised for NEON, so if it is good enough for XviD and the like, it ought to be good enough for MP3! To put this into context, it takes the VFP between 18 and 21 cycles to perform a single precision multiply-accumulate. The NEON, on the other hand, can do two per cycle. I would imagine both lame and povray could benefit hugely from using the NEON FP if single precision is sufficient for the calculations. Are you able to build these programs from source? Does your compiler support NEON?

Oct 17, 2012 9:16pm Chris Gransden (337) 1207 posts	Are you able to build these programs from source? Does your compiler support NEON? I was expecting the ‘armv6 vfp’ times to be much slower than the ‘armv7 neon/vfp’ times but they are virtualy the same. Plus the ‘armv6 vfp’ executables also run on armv7. Saves having to have two different executables.

Oct 18, 2012 11:28pm Jeffrey Lee (213) 6048 posts	To boost VFP performance on OMAP3 (and perhaps OMAP4?) try enabling “runfast mode”, which will allow some (but not all) of the VFP instructions to run in the NEON pipeline. This involves enabling flush-to-zero and default NaN modes, i.e. setting bits 24 & 25 of R3 when calling VFPSupport_CreateContext. On a 600MHz BB this results in Kuemmel’s single precision fractal benchmark completing in 21.06s instead of 28.75s. Note that this will only work with single precision math, since NEON can’t do double precision floats. Enabling flush-to-zero and default NaN mode is also the way to work around the support code requirement on the Pi (we don’t have any support code yet to deal with the situations the hardware can’t cope with, so if bits 24 & 25 of FPSCR aren’t set some software may fail with undefined instruction errors). I’m not quite sure what the best way of enforcing this in C apps is – perhaps by using a custom build of unixlib with _FPU_DEFAULT set to &3000000. I was expecting the ‘armv6 vfp’ times to be much slower than the ‘armv7 neon/vfp’ times but they are virtualy the same. Plus the ‘armv6 vfp’ executables also run on armv7. Saves having to have two different executables. I suspect the times are virtually the same because the compiler isn’t smart enough to work out which bits can be done in NEON instead of VFP. Plus it might be wary of automatically using NEON since it isn’t fully IEEE compliant.

Oct 19, 2012 4:13am Raik (463) 2061 posts	I try CDRipR10 with lame and the same CD with the same settings and comparable Hardware (DVDROM, Harddisc etc.): CD with 40min (13 songs); lame: VQ4, 192kb/s, stereo Panda A4 – 1Ghz: <25min xM – 1 Ghz: 56min xM – set min/max CPUClock to 1GHz: 50min RPi – 600MHz: 40min RPi – 950MHz: 31min

Oct 19, 2012 11:43am patric aristide (434) 418 posts	Amazing how the RPi works out to be 38% faster than the BB-xM in this case. That’s quite significant I’d say.

Oct 19, 2012 3:48pm Chris Gransden (337) 1207 posts	I’ve redone the SDL Quake 1 ‘timedemo demo1’ benchmark. This time using GCC 4.7.3 armv6/vfp. Softfloat Softfloat vfp vfp 800x600 640x480 800x600 640x480 Raspberry Pi 700MHz 7.5 fps 10.5 12.6 17.9 Raspberry Pi 950MHz 11.0 fps 15.4 18.8 26.6 Beagleboard xM 1GHz 17.8 fps 24.7 25.0 35.3 Pandaboard ES 1.5GHz 28.5 fps 40.2 51.1 74.4

Oct 19, 2012 5:48pm Malcolm Hussain-Gambles (1596) 811 posts	I did the fractal test on my Pi (overclocked): I got 7.25s for Double precision and 6.78 for single precision. Not sure how this translates to above figures

Oct 19, 2012 6:08pm Rick Murray (539) 13840 posts	Amazing how the RPi works out to be 38% faster than the BB-xM in this case. That’s quite significant I’d say. Significant that you aren’t using code optimised for the strengths of the OMAP3’s FP ¹. The standard multicapable FP is slow. I guess TI were expecting everybody to use the faster NEON, maybe they revised their thinking on this when designing the OMAP4? Either way, let’s see a NEON-FP only build, if such a thing exists. ¹ I think of it like the difference between MMX and 3DNow. Same idea, different way of doing it, not compatible. ;-)

Oct 19, 2012 6:31pm Kuemmel (439) 384 posts	@Jeffrey: Interesting, that “runfast mode”,…for your record, it doesn’t have any effect on my benchmark on my OMAP4, the results stay the same. So it seems some quite benefitial, but BB only…wonder why…checking now the TRM’s… Update…so in ARM11/C-A8 the runfast mode is described. For C-A9 the runfast mode is not in the TRM’s any more (but flush-to-zero and NaN are still there !) Besides the flush-to-zero and NaN thing the runfast mode says additionally “all exception enable bits are cleared”. I just didn’t find out how would you implement that ? …for the others, you just need to know that the runfast mode isn’t IEE compatible any more…but in which occasion that is really necessary, I don’t know, guess try and error…

Oct 19, 2012 7:15pm Jeffrey Lee (213) 6048 posts	I guess TI were expecting everybody to use the faster NEON, maybe they revised their thinking on this when designing the OMAP4? Replace “TI” with “ARM”, and “OMAP4” with “Cortex-A9” and you might be right :) The ARM cores in OMAP3/OMAP4 are all fairly standard designs from ARM without any tweaks having been made to the instruction pipelines. Update…so in ARM11/C-A8 the runfast mode is described. For C-A9 the runfast mode is not in the TRM’s any more (but flush-to-zero and NaN are still there !) With the fully pipelined VFP in the A9 there probably wasn’t any point in them implementing a special runfast mode. Besides the flush-to-zero and NaN thing the runfast mode says additionally “all exception enable bits are cleared”. I just didn’t find out how would you implement that ? The exception enable bits are in bits 8-15 of the FPSCR. I’m fairly certain they’re ignored on Cortex-A8 (since FP exceptions aren’t supported at all), but for runfast mode to work on ARM11 you would have to make sure they’re clear. I did experiment with runfast mode on ARM11, but it didn’t make things significantly faster (Fractal demo about 5% faster).

Oct 19, 2012 8:51pm Jerome Mathevet (1630) 19 posts	Does RISCOS Mplayer use any fp-operation for decoding videos ?

Oct 20, 2012 11:24am Chris Gransden (337) 1207 posts	Does RISCOS Mplayer use any fp-operation for decoding videos ? There’s no benefit from hardware floating point. I’ve built a version that is optimised for armv6/7 but it only runs 5-10% faster.