RISC OS Open: Forum: Idea for discussion: Rewriting RISC OS using a high-level language

Oct 18, 2011 7:04pm

Ok, the results are in, at least for 32 BPP plotting:

When everything was done in all-C++ code, the result was as follows (“Ext” is the new module and timings are in seconds):

32 BPP, no mask

Plot action 0, OS=0.19, Ext=0.19
Plot action 1, OS=1.98, Ext=2.73
Plot action 2, OS=1.98, Ext=2.73
Plot action 3, OS=1.98, Ext=2.73
Plot action 4, OS=1.98, Ext=2.73
Plot action 5, OS=1.98, Ext=2.7
Plot action 6, OS=1.98, Ext=2.73
Plot action 7, OS=1.98, Ext=2.73

Rewriting just the routine for plotting a single row of a sprite in non-overwrite mode in assembly code, I get this result:

32 BPP, no mask

Plot action 0, OS=0.19, Ext=0.19
Plot action 1, OS=1.98, Ext=1.41
Plot action 2, OS=1.98, Ext=1.41
Plot action 3, OS=1.98, Ext=1.4
Plot action 4, OS=1.98, Ext=1.41
Plot action 5, OS=1.98, Ext=1.41
Plot action 6, OS=1.98, Ext=1.41
Plot action 7, OS=1.98, Ext=1.41

All timings are done on real BeagleBoard hardware.

The more challenging work now is to make plotting work efficiently for sprites with less than 32 BPP.

Updates will follow…

Oct 19, 2011 7:41am

Rik Griffin (98) 264 posts

A bit late, but interestingly, here’s what Norcroft 5.69 makes of the code posted earlier:

 STMDB   R13!,{R4-R9,R14}
 MOV     R2,#0
 ADD     R3,R0,R2,LSL #4
 LDMIA   R3!,{R12,R14}
 LDMIA   R3,{R4,R5}
 ADD     R3,R1,R2,LSL #4
 LDMIB   R3,{R6-R8}
 LDR     R9,[R3,#0]
 ORR     R14,R6,R14
 ORR     R6,R7,R4
 ORR     R12,R9,R12
 STMIA   R3!,{R12,R14}
 ORR     R8,R8,R5
 STMIA   R3,{R6,R8}
 ADD     R2,R2,#1
 CMP     R2,#&0A            ; =10
 BNE     &00008088
 LDMIA   R13!,{R4-R9,PC}

The loads are somewhat oddly split up – the first one (from spriteRow) into 2 x LDM and the second one (from screenRow) into an LDM and an LDR. Similarly for the stores, but I think at least the ORR between the two stores is “free”.

Oct 19, 2011 9:03am

W P Blatchley (147) 247 posts

A bit late, but interestingly, here’s what Norcroft 5.69 makes of the code posted earlier:

For others looking for it, the code from this post, I think: https://www.riscosopen.org/forum/forums/5/topics/731?page=4#posts-8700

@Terje, great results! Looks like you’re making brilliant progress.

Oct 19, 2011 9:13am

Trevor Johnson (329) 1645 posts

Rewriting just the routine for plotting a single row of a sprite in non-overwrite mode in assembly code

So for this routine, the compiled C++ output isn’t "better than humans" in this case?

Oct 19, 2011 9:58am

Terje Slettebø (285) 275 posts

A bit late, but interestingly, here’s what Norcroft 5.69 makes of the code posted earlier:

Interesting, it seems that Norcroft produces better code than GCC in this case…

Too bad the Norcroft compiler is only an earlier, “partial” implementation of C++, and important things like exceptions are missing. I also suspect that its handling of templates is not that complete, although I haven’t tested that.

If I get the code to work with the Norcroft compiler (even if it needs to be adjusted a little), I’ll report on what kind of performance I get there.

@Trevor:

So for this routine, the compiled C++ output isn’t “better than humans” in this case?

No, I’m afraid not… As much as I’d like to avoid assembly code in the new module, it’s just not acceptable to have plotting of 8 and 16 BPP sprites half the speed of the OS version, and plotting of 4 to 1 BPP pixels be 20-50 times slower than the OS version, so a different approach is needed…

It would be possible to “spoon-feed” the compiler some more, telling it in details how to do things (“First load these words, then OR these together, then write them all out, then increment word counter”, etc.), but you’d then essentially be writing assembly code in C++, and you wouldn’t get optimal performance, anyway, so you may just go all the way and write it entirely in assembly code.

Now, experience has shown that typically you only need to rewrite relatively small, time-critical parts in assembly code, while the rest of the module retains high-level code, so that you get easier maintenance, in general and high performance.

The problem for sub-32 BPP pixel plotting (and the problem is particularly severe for 4 to 1 BPP, but it’s also significant for 8 and 16 BPP plotting, as mentioned above) is that, frankly, you can’t expect the compiler to do the kind of transformations needed for this to become efficient…

What transformations may that be? Well, it is of fundamental importance to know that all operations work faster on whole words, than if you have to work on bytes and bits, so if you need to plot e.g. an 8 BPP sprite to an arbitrary screen location, the word alignment of each pixel may be different for the sprite and screen, so you can’t do simple LDM/STM for the sub-32 BPP plotting in general (only as a special case where they do align with each other).

Currently, the way the new module handles that is to write a byte/halfword at a time, for 8 and 16 BPP (slow), and to write just a few bits at a time (masked and shifted) for less than 8 BPP (super-slow)…

The OS version does something clever: It reads a bunch of words (8) from the sprite using LDM, shifts them right so they align with the screen pointer alignment, and write 7 words to the screen using STM (one word is “lost” in the shifting). The result: Super-fast plotting for all colour depths (1-32)… :)

I’m currently implementing this algorithm in the new module, too.

As mentioned above, it appears only the “plot row” function needs to be reimplemented in assembly in this way, so the majority of the module should still be possible to keep as high-level code.

Oct 19, 2011 11:33am

Andrew Rawnsley (492) 1445 posts

Just to put a certain ghost to rest, can I request that a C (rather than C++) or C++-norcroft-friendly version be tested and run through Norcroft? Since that’s current the defacto standard build tool, and has historically had a reputation for being “top dog” in performance terms, it would be worth seeing figures to tell if this is still the case, before resolving to C++/GCC. Sorry to dig this up again, but now that benchmarking is practical, it would seem wise to test all the options before settling in for the long haul…

Oct 19, 2011 3:58pm

Terje Slettebø (285) 275 posts

Hi Andrew.

Just to put a certain ghost to rest, can I request that a C (rather than C++) or C++-norcroft-friendly version be tested and run through Norcroft?

Sure. :) I had planned to try just that:

If I get the code to work with the Norcroft compiler (even if it needs to be adjusted a little), I’ll report on what kind of performance I get there.

Oct 19, 2011 5:40pm

Steve Revill (20) 1361 posts

can I request that a C (rather than C++) or C++-norcroft-friendly version be tested and run through Norcroft?

Seconded. Anyway, sounds like Terje is going to take a look at this. It may turn out that you’ll have to sacrifice a little of the abstraction that C++ provides in order to produce something which gets close(er) to the speed of the OS plotting code. Don’t forget that the OS code is coping with all kinds of cases (e.g. plot transformed) that we’re not at a point to compare alternatives against.

Oct 21, 2011 12:29pm

Rik Griffin (98) 264 posts

Interesting, it seems that Norcroft produces better code than GCC in this case…

In my (very) limited experience, Norcroft usually does better than gcc, for ARM / RISCOS anyway. The gcc optimiser for x86 is pretty good though.

Too bad the Norcroft compiler is only an earlier, “partial” implementation of C++, and important things like exceptions are missing. I also suspect that its handling of templates is not that complete, although I haven’t tested that.

However Norcroft has complete (I think) support for C99. Given that relying on gcc to compile C++ code would mean introducing a dependency on gcc into the RISCOS build process, would it be a good idea to stick to C99 rather than C++?

Oct 21, 2011 5:27pm

Terje Slettebø (285) 275 posts

However Norcroft has complete (I think) support for C99. Given that relying on gcc to compile C++ code would mean introducing a dependency on gcc into the RISCOS build process, would it be a good idea to stick to C99 rather than C++?

It would be an idea, and if everyone was happy with it, I don’t see any problem with it.

However, for me, if I wouldn’t be able to use the higher-level abstractions available in C++, I’m afraid that doing this work just wouldn’t hold any attraction to me, I’m sorry…

Some people may feel the same way about C, that not being able to use pure C would not hold any attraction to them, and that’s ok by me… :) I’m not out to “convert” anyone… ;)

I’d also be happy if more parts of the kernel was converted to a higher-level language, be it C, C++ or whatever.

In any case, it’s my experience so far that it’s small sections of the code that eats up the lion’s share of processing time, where the difference to the OS code is 20-50 times slower, and there I hardly think that another compiler would make a significant difference.

Still, I plan to compile just those time-critical pieces of code using Norcroft, compare it with GCC, and post the results here.

Fortunately, RISC OS is module-based, and the only part you can’t actually replace as a module (because it has the module support, for one thing), the kernel, has great customisation possibilities.

Therefore, you don’t really need to be able to build all of it in the same process. You can try out modules implemented in whatever language you want, without having to rebuild the OS.

So far, it seems like the most promising approach is to implement the time-critical parts in assembly code (for sprites that’s the “plot row” function), so that’s what I’m currently trying out.

If someone are subsequently able to convert that code to roughly equivalently performant C code, I’d be delighted, as I’d much rather have C code than assembly code if possible… :)

Oct 22, 2011 8:21pm

Terje Slettebø (285) 275 posts

Hi all.

I had originally not thought of posting any more performance figures before the full sprite plotting functionality was complete (including scaled/transformed plotting), but since some of the earlier results were so abysmal, with half the speed of the OS for 8- and 16-bit plotting, and 20-50 times slower for sprites with less than 8 BPP, I figured it would be good to give an update.

As usual, the test has been run on a BeagleBoard, and consists of plotting a 100 × 100 pixel sprite 1000 times ¹, using different BPP and plot action (overwrite, AND, OR, etc.).

The figures have been rounded to two significant digits, as the third digit varied from run to run (“OS” is the OS and “Ext” is the new module)

32 BPP, no mask

Plot action 0, OS=0.2, Ext=0.2
Plot action 1, OS=2.1, Ext=2.1
Plot action 2, OS=2.1, Ext=2.1
Plot action 3, OS=2.1, Ext=2.1
Plot action 4, OS=2.1, Ext=2.1
Plot action 5, OS=2.1, Ext=2.1
Plot action 6, OS=2.1, Ext=2.1
Plot action 7, OS=2.1, Ext=2.1

16 bpp, no mask

Plot action 0, OS=0.1, Ext=0.1
Plot action 1, OS=1.0, Ext=1.0
Plot action 2, OS=1.0, Ext=1.0
Plot action 3, OS=1.0, Ext=1.0
Plot action 4, OS=1.0, Ext=1.0
Plot action 5, OS=1.0, Ext=1.0
Plot action 6, OS=1.0, Ext=1.0
Plot action 7, OS=1.0, Ext=1.0

8 BPP, no mask

Plot action 0, OS=0.1, Ext=0.1
Plot action 1, OS=0.5, Ext=0.5
Plot action 2, OS=0.5, Ext=0.5
Plot action 3, OS=0.5, Ext=0.5
Plot action 4, OS=0.5, Ext=0.5
Plot action 5, OS=0.5, Ext=0.5
Plot action 6, OS=0.5, Ext=0.5
Plot action 7, OS=0.5, Ext=0.5

4 BPP, no mask

Plot action 0, OS=0.1, Ext=0.1
Plot action 1, OS=0.3, Ext=0.3
Plot action 2, OS=0.3, Ext=0.3
Plot action 3, OS=0.3, Ext=0.3
Plot action 4, OS=0.3, Ext=0.3
Plot action 5, OS=0.3, Ext=0.0
Plot action 6, OS=0.3, Ext=0.3
Plot action 7, OS=0.3, Ext=0.3

2 BPP, no mask

Plot action 0, OS=0.1, Ext=0.1
Plot action 1, OS=0.2, Ext=0.2
Plot action 2, OS=0.2, Ext=0.2
Plot action 3, OS=0.2, Ext=0.2
Plot action 4, OS=0.2, Ext=0.2
Plot action 5, OS=0.2, Ext=0.0
Plot action 6, OS=0.2, Ext=0.2
Plot action 7, OS=0.2, Ext=0.2

1 BPP, no mask

Plot action 0, OS=0.0, Ext=0.1
Plot action 1, OS=0.1, Ext=0.1
Plot action 2, OS=0.1, Ext=0.1
Plot action 3, OS=0.1, Ext=0.1
Plot action 4, OS=0.1, Ext=0.1
Plot action 5, OS=0.1, Ext=0.0
Plot action 6, OS=0.1, Ext=0.1
Plot action 7, OS=0.1, Ext=0.1

I’m not kidding; the figures really follow each other that closely…

What has been done is to rewrite the function for plotting a sprite row in the different cases in assembly code, and linking that in.

The rest of the module may stay as C++ (possibly converted to C, if the community prefers that), and compiled using GCC. I’ll also try to compile it with Norcroft, both the parts now rewritten in assembly code, and the whole module, and measure the difference.

Implementing plotting with mask using assembly code remains, but it’s not expected to give any surprises (it’s already implemented using C++ from before).

Plotting is now done in an entirely different way from the original implementation, always using LDM/STM for plotting all colour depths, just like the OS does, and the performance difference is dramatic.

¹ Other test cases have also been used, plotting both much smaller and much larger sprites, and the performance is the same.

Oct 23, 2011 1:23pm

Martin Bazley (331) 379 posts

Could you explain these?

Plot action 5, OS=2.1, Ext=2.1

Plot action 5, OS=1.0, Ext=1.0

Plot action 5, OS=0.5, Ext=0.5

In sub-8bpp modes, Ext seems to consistently get 0.0!

Oct 23, 2011 1:48pm

Terje Slettebø (285) 275 posts

Hi Martin.

Good catch. :) I think you’ll realise the answer if you look up “plot action 5”… :) Hint: It’s “no change”

I noticed those, too, and oddly enough, I’m not checking especially for this plot action, so it does the usual load/store on it, but possibly the system detects no changes on the store, and optimises it away. For some reason, it doesn’t happen for 8 BPP and above, or for the OS version.

In any case, it would be easy to check for this for all colour depths and return immediately, but it may not be worth it (who uses this plot action, anyway?). However, if it comes for free, why not? :)

By the way, “Ext” comes from “extension”, and is the preliminary SWI prefix used, to not collide with the OS.

Oct 24, 2011 6:29am

Terje Slettebø (285) 275 posts

@Uwe:

Terje, is it possible to add a different behaviour that uses the mask(byte) not as a switch but as alpha value?

Interestingly, now that all sprite plotting is done with LDM/STM, what you suggest may be the most appropriate way to handle mask… :) It seems to be how the OS does it, too.

Nov 11, 2011 12:40pm

Uwe Kall (215) 120 posts

@Terje:

Have you tried it? :-)

Nov 14, 2011 3:15pm

Rik Griffin (98) 264 posts

I’d be interested to see what you can come up with for plotting 32 bpp sprites with transparency. In case it’s useful, here’s the plot loop from Popcorn, to plot a 32 bit sprite to a 32 bit frame buffer, with an alpha value between 0 (totally transparent) to 255 (totally opaque).



; R12 -> sprite data
; R11 -> frame buffer
; R9  = number of pixels (words, not bytes) to plot
; r7 = translucency value (alpha = 0 to 255)
plot_bit_translucent
        RSB        r6,r7,#&ff                ; 1 - alpha

        CMP        r9,#0
trans_loop
        BEQ        bit_finished

        LDR        r0,[r11]                ; frame buffer pixel
        LDR        r1,[r12],#4             ; source pixel

        AND        r2,r0,#&ff              ; frame buffer red component
        AND        r3,r1,#&ff              ; source red component

        MUL        r3,r7,r3                ; src * alpha
        MLA        r5,r6,r2,r3             ; + back * (1 - alpha)
        AND        r5,r5,#&ff00
        ; r5 is red component << 8

        AND        r2,r0,#&ff00            ; back green << 8
        AND        r3,r1,#&ff00            ; src green << 8

        MUL        r3,r7,r3                ; src * alpha
        MLA        r4,r6,r2,r3             ; + back * (1 - alpha)
        AND        r4,r4,#&ff0000
        ; r4 is green component << 16

        AND        r2,r0,#&ff0000          ; back blue << 16
        AND        r3,r1,#&ff0000          ; src blue << 16

        MUL        r3,r7,r3                ; src * alpha
        MLA        r3,r6,r2,r3             ; + back * (1 - alpha)
        AND        r3,r3,#&ff000000
        ; r3 is blue component << 24

        MOV        r5,r5,LSR#8             ; combine the 3 components
        ORR        r5,r5,r4,LSR#8          ; into the correct bytes of r5
        ORR        r5,r5,r3,LSR#8

        STR        r5,[r11],#4

        SUBS       r9,r9,#1
        B          trans_loop

Nov 14, 2011 3:24pm

Rik Griffin (98) 264 posts

Why is it so hard to get a reasonably formatted code block on these forums? Where did all my indentation go and where did those random blank lines come from?

Nov 14, 2011 3:59pm

Trevor Johnson (329) 1645 posts

Have you got blank lines around the pre tags?

Nov 14, 2011 5:45pm

Rik Griffin (98) 264 posts

Ah that’s better, thanks Mr Johnson.

Nov 14, 2011 11:57pm

Jeffrey Lee (213) 6048 posts

Top tip: Since you’re using a constant alpha value for the entire sprite, and the result of each calculation won’t go over 65535, you can use the upper and lower halves of a register to calculate two values at once. After a few minutes work, I’ve arrived at this piece of code – still compatible with everything down to ARMv2, but it processes two source pixels per loop, while the loop itself is only two instructions longer than yours. It’s completely untested, and could do with a few of the instructions being reordered to help avoid pipeline stalls, and will need an extra case adding somewhere for any 1 pixel remainder, but in terms of pixels-per-instruction it’s about as good as you’ll get from plain ARM.

; R12 -> sprite data
; R11 -> frame buffer
; R9  = number of pixels (words, not bytes) to plot
; r7 = translucency value (alpha = 0 to 255)
plot_bit_translucent
	RSB	r6,r7,#&ff	; 1 - alpha
	CMP	r9,#0
	BEQ	bit_finished
	LDR	r0,=&ff00ff
trans_loop
	LDMIA	r11,{r1-r2}	; 2 frame buffer pixels
	LDMIA	r12!,{r3-r4}	; 2 sprite pixels
	AND	r5,r1,#&ff00	; framebuffer green 1
	ORR	r5,r5,r2,LSL #16 ; framebuffer green 2 (and red 2)
	AND	r5,r0,r5,LSR #8	; shift down and mask out the red byte
	AND	r1,r1,r0	; framebuffer red, blue 1
	AND	r2,r2,r0	; framebuffer red, blue 2
	MUL	r5,r6,r5	; (green 1, green 2) * (1 - alpha)
	MUL	r1,r6,r1	; (red 1, blue 1) * (1 - alpha)
	MUL	r2,r6,r2	; (red 2, blue 2) * (1 - alpha)
	AND	r8,r3,#&ff00	; sprite green 1
	ORR	r8,r8,r4,LSL #16 ; sprite green 2 (and red 2)
	AND	r8,r0,r8,LSR #8	; shift down and mask out the red byte
	AND	r3,r3,r0	; sprite red, blue 1
	AND	r4,r4,r0	; sprite red, blue 2
	MLA	r5,r7,r8,r5	; (green 1, green 2) += sprite (green 1, green 2) * alpha
	MLA	r1,r7,r3,r1	; etc.
	MLA	r2,r7,r4,r2
	AND	r5,r0,r5,LSR #8 ; final green 1 & 2
	AND	r1,r0,r1,LSR #8 ; final red, blue 1
	AND	r2,r0,r2,LSR #8 ; final red, blue 2
	ORR	r1,r1,r5,LSL #8 ; insert green 1
	ORR	r2,r2,r5,LSR #8 ; insert green 2
	STMIA	r11!,{r1-r2}
	SUBS	r9,r9,#2
	BNE	trans_loop

Nov 15, 2011 10:43am

Rik Griffin (98) 264 posts

Cool, I’ll try to incorporate that into Popcorn soon. Thank you! Hopefully this will be useful for Terje as well.

Nov 15, 2011 5:33pm

Terje Slettebø (285) 275 posts

@Uwe:

Have you tried it? :-)

Yes, I have and it works fine for the “old” sprite format (where the mask has the same number of bits as the pixels). :) It’s more challenging with 1 BPP mask “new” sprite format, though.

@Rik and Jeffrey:

Thanks for the code examples. So far, I’m only working on the existing OS_SpriteOp functionality, though, i.e. without alpha channel, but this is certainly a possible extension.

Nov 16, 2011 10:27pm

Uwe Kall (215) 120 posts

It’s more challenging with 1 BPP mask “new” sprite format, though.

especially if you want gradual transparency ;-)
So when is it time for the next demo?

Nov 16, 2011 11:34pm

Terje Slettebø (285) 275 posts

It’s more challenging with 1 BPP mask “new” sprite format, though.

especially if you want gradual transparency ;-)

:)

Nov 17, 2011 2:03pm

Jeffrey Lee (213) 6048 posts

So when is it time for the next demo?

Recently I’ve been thinking that metaballs could be a fun thing to use as the basis for a simple demo. AFAIK the last time they were seen in a RISC OS demo was around 10 years ago, for the Evolution competition – so this should provide a good opportunity to show just how much more powerful a beagleboard is than a StrongARM RiscPC.

Now I just need to find the time to work on it, without feeling guilty about neglecting my other duties ;-)

Idea for discussion: Rewriting RISC OS using a high-level language

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options