Idea for discussion: Rewriting RISC OS using a high-level language
Terje Slettebø (285) 275 posts |
Ok, the results are in, at least for 32 BPP plotting: When everything was done in all-C++ code, the result was as follows (“Ext” is the new module and timings are in seconds):
Rewriting just the routine for plotting a single row of a sprite in non-overwrite mode in assembly code, I get this result:
All timings are done on real BeagleBoard hardware. The more challenging work now is to make plotting work efficiently for sprites with less than 32 BPP. Updates will follow… |
Rik Griffin (98) 264 posts |
A bit late, but interestingly, here’s what Norcroft 5.69 makes of the code posted earlier: STMDB R13!,{R4-R9,R14} MOV R2,#0 ADD R3,R0,R2,LSL #4 LDMIA R3!,{R12,R14} LDMIA R3,{R4,R5} ADD R3,R1,R2,LSL #4 LDMIB R3,{R6-R8} LDR R9,[R3,#0] ORR R14,R6,R14 ORR R6,R7,R4 ORR R12,R9,R12 STMIA R3!,{R12,R14} ORR R8,R8,R5 STMIA R3,{R6,R8} ADD R2,R2,#1 CMP R2,#&0A ; =10 BNE &00008088 LDMIA R13!,{R4-R9,PC} The loads are somewhat oddly split up – the first one (from spriteRow) into 2 x LDM and the second one (from screenRow) into an LDM and an LDR. Similarly for the stores, but I think at least the ORR between the two stores is “free”. |
W P Blatchley (147) 247 posts |
For others looking for it, the code from this post, I think: https://www.riscosopen.org/forum/forums/5/topics/731?page=4#posts-8700 @Terje, great results! Looks like you’re making brilliant progress. |
Trevor Johnson (329) 1645 posts |
So for this routine, the compiled C++ output isn’t "better than humans" in this case? |
Terje Slettebø (285) 275 posts |
Interesting, it seems that Norcroft produces better code than GCC in this case… Too bad the Norcroft compiler is only an earlier, “partial” implementation of C++, and important things like exceptions are missing. I also suspect that its handling of templates is not that complete, although I haven’t tested that. If I get the code to work with the Norcroft compiler (even if it needs to be adjusted a little), I’ll report on what kind of performance I get there. @Trevor:
No, I’m afraid not… As much as I’d like to avoid assembly code in the new module, it’s just not acceptable to have plotting of 8 and 16 BPP sprites half the speed of the OS version, and plotting of 4 to 1 BPP pixels be 20-50 times slower than the OS version, so a different approach is needed… It would be possible to “spoon-feed” the compiler some more, telling it in details how to do things (“First load these words, then OR these together, then write them all out, then increment word counter”, etc.), but you’d then essentially be writing assembly code in C++, and you wouldn’t get optimal performance, anyway, so you may just go all the way and write it entirely in assembly code. Now, experience has shown that typically you only need to rewrite relatively small, time-critical parts in assembly code, while the rest of the module retains high-level code, so that you get easier maintenance, in general and high performance. The problem for sub-32 BPP pixel plotting (and the problem is particularly severe for 4 to 1 BPP, but it’s also significant for 8 and 16 BPP plotting, as mentioned above) is that, frankly, you can’t expect the compiler to do the kind of transformations needed for this to become efficient… What transformations may that be? Well, it is of fundamental importance to know that all operations work faster on whole words, than if you have to work on bytes and bits, so if you need to plot e.g. an 8 BPP sprite to an arbitrary screen location, the word alignment of each pixel may be different for the sprite and screen, so you can’t do simple LDM/STM for the sub-32 BPP plotting in general (only as a special case where they do align with each other). Currently, the way the new module handles that is to write a byte/halfword at a time, for 8 and 16 BPP (slow), and to write just a few bits at a time (masked and shifted) for less than 8 BPP (super-slow)… The OS version does something clever: It reads a bunch of words (8) from the sprite using LDM, shifts them right so they align with the screen pointer alignment, and write 7 words to the screen using STM (one word is “lost” in the shifting). The result: Super-fast plotting for all colour depths (1-32)… :) I’m currently implementing this algorithm in the new module, too. As mentioned above, it appears only the “plot row” function needs to be reimplemented in assembly in this way, so the majority of the module should still be possible to keep as high-level code. |
Andrew Rawnsley (492) 1445 posts |
Just to put a certain ghost to rest, can I request that a C (rather than C++) or C++-norcroft-friendly version be tested and run through Norcroft? Since that’s current the defacto standard build tool, and has historically had a reputation for being “top dog” in performance terms, it would be worth seeing figures to tell if this is still the case, before resolving to C++/GCC. Sorry to dig this up again, but now that benchmarking is practical, it would seem wise to test all the options before settling in for the long haul… |
Terje Slettebø (285) 275 posts |
Hi Andrew.
Sure. :) I had planned to try just that:
|
Steve Revill (20) 1361 posts |
Seconded. Anyway, sounds like Terje is going to take a look at this. It may turn out that you’ll have to sacrifice a little of the abstraction that C++ provides in order to produce something which gets close(er) to the speed of the OS plotting code. Don’t forget that the OS code is coping with all kinds of cases (e.g. plot transformed) that we’re not at a point to compare alternatives against. |
Rik Griffin (98) 264 posts |
In my (very) limited experience, Norcroft usually does better than gcc, for ARM / RISCOS anyway. The gcc optimiser for x86 is pretty good though.
However Norcroft has complete (I think) support for C99. Given that relying on gcc to compile C++ code would mean introducing a dependency on gcc into the RISCOS build process, would it be a good idea to stick to C99 rather than C++? |
Terje Slettebø (285) 275 posts |
It would be an idea, and if everyone was happy with it, I don’t see any problem with it. However, for me, if I wouldn’t be able to use the higher-level abstractions available in C++, I’m afraid that doing this work just wouldn’t hold any attraction to me, I’m sorry… Some people may feel the same way about C, that not being able to use pure C would not hold any attraction to them, and that’s ok by me… :) I’m not out to “convert” anyone… ;) I’d also be happy if more parts of the kernel was converted to a higher-level language, be it C, C++ or whatever. In any case, it’s my experience so far that it’s small sections of the code that eats up the lion’s share of processing time, where the difference to the OS code is 20-50 times slower, and there I hardly think that another compiler would make a significant difference. Still, I plan to compile just those time-critical pieces of code using Norcroft, compare it with GCC, and post the results here. Fortunately, RISC OS is module-based, and the only part you can’t actually replace as a module (because it has the module support, for one thing), the kernel, has great customisation possibilities. Therefore, you don’t really need to be able to build all of it in the same process. You can try out modules implemented in whatever language you want, without having to rebuild the OS. So far, it seems like the most promising approach is to implement the time-critical parts in assembly code (for sprites that’s the “plot row” function), so that’s what I’m currently trying out. If someone are subsequently able to convert that code to roughly equivalently performant C code, I’d be delighted, as I’d much rather have C code than assembly code if possible… :) |
Terje Slettebø (285) 275 posts |
Hi all. I had originally not thought of posting any more performance figures before the full sprite plotting functionality was complete (including scaled/transformed plotting), but since some of the earlier results were so abysmal, with half the speed of the OS for 8- and 16-bit plotting, and 20-50 times slower for sprites with less than 8 BPP, I figured it would be good to give an update. As usual, the test has been run on a BeagleBoard, and consists of plotting a 100 × 100 pixel sprite 1000 times 1, using different BPP and plot action (overwrite, AND, OR, etc.). The figures have been rounded to two significant digits, as the third digit varied from run to run (“OS” is the OS and “Ext” is the new module) 32 BPP, no mask Plot action 0, OS=0.2, Ext=0.2 16 bpp, no mask Plot action 0, OS=0.1, Ext=0.1 8 BPP, no mask Plot action 0, OS=0.1, Ext=0.1 4 BPP, no mask Plot action 0, OS=0.1, Ext=0.1 2 BPP, no mask Plot action 0, OS=0.1, Ext=0.1 1 BPP, no mask Plot action 0, OS=0.0, Ext=0.1 I’m not kidding; the figures really follow each other that closely… What has been done is to rewrite the function for plotting a sprite row in the different cases in assembly code, and linking that in. The rest of the module may stay as C++ (possibly converted to C, if the community prefers that), and compiled using GCC. I’ll also try to compile it with Norcroft, both the parts now rewritten in assembly code, and the whole module, and measure the difference. Implementing plotting with mask using assembly code remains, but it’s not expected to give any surprises (it’s already implemented using C++ from before). Plotting is now done in an entirely different way from the original implementation, always using LDM/STM for plotting all colour depths, just like the OS does, and the performance difference is dramatic. 1 Other test cases have also been used, plotting both much smaller and much larger sprites, and the performance is the same. |
Martin Bazley (331) 379 posts |
Could you explain these?
In sub-8bpp modes, Ext seems to consistently get 0.0! |
Terje Slettebø (285) 275 posts |
Hi Martin. Good catch. :) I think you’ll realise the answer if you look up “plot action 5”… :) Hint: It’s “no change” I noticed those, too, and oddly enough, I’m not checking especially for this plot action, so it does the usual load/store on it, but possibly the system detects no changes on the store, and optimises it away. For some reason, it doesn’t happen for 8 BPP and above, or for the OS version. In any case, it would be easy to check for this for all colour depths and return immediately, but it may not be worth it (who uses this plot action, anyway?). However, if it comes for free, why not? :) By the way, “Ext” comes from “extension”, and is the preliminary SWI prefix used, to not collide with the OS. |
Terje Slettebø (285) 275 posts |
@Uwe:
Interestingly, now that all sprite plotting is done with LDM/STM, what you suggest may be the most appropriate way to handle mask… :) It seems to be how the OS does it, too. |
Uwe Kall (215) 120 posts |
@Terje: Have you tried it? :-) |
Rik Griffin (98) 264 posts |
I’d be interested to see what you can come up with for plotting 32 bpp sprites with transparency. In case it’s useful, here’s the plot loop from Popcorn, to plot a 32 bit sprite to a 32 bit frame buffer, with an alpha value between 0 (totally transparent) to 255 (totally opaque).
|
Rik Griffin (98) 264 posts |
Why is it so hard to get a reasonably formatted code block on these forums? Where did all my indentation go and where did those random blank lines come from? |
Trevor Johnson (329) 1645 posts |
Have you got blank lines around the pre tags? |
Rik Griffin (98) 264 posts |
Ah that’s better, thanks Mr Johnson. |
Jeffrey Lee (213) 6048 posts |
Top tip: Since you’re using a constant alpha value for the entire sprite, and the result of each calculation won’t go over 65535, you can use the upper and lower halves of a register to calculate two values at once. After a few minutes work, I’ve arrived at this piece of code – still compatible with everything down to ARMv2, but it processes two source pixels per loop, while the loop itself is only two instructions longer than yours. It’s completely untested, and could do with a few of the instructions being reordered to help avoid pipeline stalls, and will need an extra case adding somewhere for any 1 pixel remainder, but in terms of pixels-per-instruction it’s about as good as you’ll get from plain ARM. ; R12 -> sprite data ; R11 -> frame buffer ; R9 = number of pixels (words, not bytes) to plot ; r7 = translucency value (alpha = 0 to 255) plot_bit_translucent RSB r6,r7,#&ff ; 1 - alpha CMP r9,#0 BEQ bit_finished LDR r0,=&ff00ff trans_loop LDMIA r11,{r1-r2} ; 2 frame buffer pixels LDMIA r12!,{r3-r4} ; 2 sprite pixels AND r5,r1,#&ff00 ; framebuffer green 1 ORR r5,r5,r2,LSL #16 ; framebuffer green 2 (and red 2) AND r5,r0,r5,LSR #8 ; shift down and mask out the red byte AND r1,r1,r0 ; framebuffer red, blue 1 AND r2,r2,r0 ; framebuffer red, blue 2 MUL r5,r6,r5 ; (green 1, green 2) * (1 - alpha) MUL r1,r6,r1 ; (red 1, blue 1) * (1 - alpha) MUL r2,r6,r2 ; (red 2, blue 2) * (1 - alpha) AND r8,r3,#&ff00 ; sprite green 1 ORR r8,r8,r4,LSL #16 ; sprite green 2 (and red 2) AND r8,r0,r8,LSR #8 ; shift down and mask out the red byte AND r3,r3,r0 ; sprite red, blue 1 AND r4,r4,r0 ; sprite red, blue 2 MLA r5,r7,r8,r5 ; (green 1, green 2) += sprite (green 1, green 2) * alpha MLA r1,r7,r3,r1 ; etc. MLA r2,r7,r4,r2 AND r5,r0,r5,LSR #8 ; final green 1 & 2 AND r1,r0,r1,LSR #8 ; final red, blue 1 AND r2,r0,r2,LSR #8 ; final red, blue 2 ORR r1,r1,r5,LSL #8 ; insert green 1 ORR r2,r2,r5,LSR #8 ; insert green 2 STMIA r11!,{r1-r2} SUBS r9,r9,#2 BNE trans_loop |
Rik Griffin (98) 264 posts |
Cool, I’ll try to incorporate that into Popcorn soon. Thank you! Hopefully this will be useful for Terje as well. |
Terje Slettebø (285) 275 posts |
@Uwe:
Yes, I have and it works fine for the “old” sprite format (where the mask has the same number of bits as the pixels). :) It’s more challenging with 1 BPP mask “new” sprite format, though. @Rik and Jeffrey: Thanks for the code examples. So far, I’m only working on the existing OS_SpriteOp functionality, though, i.e. without alpha channel, but this is certainly a possible extension. |
Uwe Kall (215) 120 posts |
especially if you want gradual transparency ;-) |
Terje Slettebø (285) 275 posts |
:) |
Jeffrey Lee (213) 6048 posts |
Recently I’ve been thinking that metaballs could be a fun thing to use as the basis for a simple demo. AFAIK the last time they were seen in a RISC OS demo was around 10 years ago, for the Evolution competition – so this should provide a good opportunity to show just how much more powerful a beagleboard is than a StrongARM RiscPC. Now I just need to find the time to work on it, without feeling guilty about neglecting my other duties ;-) |