Idea for discussion: Rewriting RISC OS using a high-level language
Steve Revill (20) 1361 posts |
Adding ARMv7 support to the Norcroft compiler is probably a few man-weeks of work. Migrating all of the assembler-sourced code from ObjAsm to asasm or similar is almost certainly a much bigger task, for a few reasons: 1. there are a lot of ARM components. Similar arguments exist for the C components, too. This is just my speculation, but I would find it hard to believe that the reverse is true. |
Martin Bazley (331) 379 posts |
I believe one of the current major focuses of RISC OS GCC development is improving asasm to support said features. It doesn’t yet, but it will do. Right now, as you say, effectively rewriting half the kernel is far more trouble than it’s worth. |
Terje Slettebø (285) 275 posts |
You guys may be interested in knowing that steady progress is being made. Lately, I’ve been focusing on plotting sprites (source and target BPP being the same), and the following now works:
Care is being taken to not do more work than needed for a given combination of parameters (for example, the mask is not tracked for non-mask plotting), as well as using fast implementation for simpler cases (e.g. using memcpy() in the case of 8/16/32 BPP, plot action 0 (overwrite), and no mask). Next up, and this is a big one: Plotting sprites with less than 8 BPP… |
Steve Revill (20) 1361 posts |
Nice work, Terje. One question: are you in a position to do a bit of early benchmarking (likely to be quite bodgy, I’d imagine) to see how the two versions compare? I accept it’s not necessarily going to give you a very useful answer – performance is probably an issue you’ve not been focused on too much yet – but it’s an interesting question. |
Terje Slettebø (285) 275 posts |
Hi Steve.
Mind reader… ;) I had actually planned to do just that, once I get it working also for less-than-8 BPP sprites, which I planned to work on today. I’ve had quite a bit of mail correspondence with, and encouragement from, Richard Spencer (“hilltop” here, the author of the Memphis RAM disc alternative) as the design has progressed, and as he mentioned, it’s going to be tough competition against hand-written assembly code… :) Nevertheless, quite a lot of work has gone into performance, as it would be comparatively “easy” to make a system that would handle all variations of BPP, mask, plot action, etc., and be dog slow in all of them… It’s a different matter to ensure that the simple cases run fast, and that it doesn’t do anything more than needed in a given case (as mentioned above). However, having done that, any further optimisation needs to be guided by profiling, so that’s already in the plans. I’ll post figures here once it’s ready. The tests also need to be quite varied… It needs to be fast for large sprites, but also for small sprites, because a lot of the sprites in RISC OS are quite small, like the window furniture and icons, so we need several kinds of tests.
Interesting and vitally important… :) The last thing I want to do is to spend lots of time on this, with not enough up-front thinking about performance, and end up with something that doesn’t “fly”… Therefore, although “premature optimisation” should be avoided, performance has to be an important factor from the very start… That means making the most of the processor and memory system, which means using them in a way that works well, like plotting rows from left to right (rather than e.g. top to bottom), to make use of locality of reference, the write-back buffer, etc. Thanks for the encouragement. :) |
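The locality point above can be illustrated with a minimal sketch (names invented): the inner loop walks along a row, so consecutive iterations touch consecutive addresses, which suits the cache and the write-back buffer; swapping the loops would stride through memory instead.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: sprite and screen rows are laid out consecutively in
// memory, so the inner loop advances through sequential addresses.
void copyRowMajor(const std::uint8_t* src, std::uint8_t* dst,
                  std::size_t width, std::size_t height, std::size_t stride)
{
    for (std::size_t y = 0; y != height; ++y)      // one pass per row
        for (std::size_t x = 0; x != width; ++x)   // sequential within the row
            dst[y * stride + x] = src[y * stride + x];
}
```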
Ben Avison (25) 445 posts |
Um, maybe. Could get a bit messy getting dependencies correct when the object files are built from the same source files. I think it would be simpler to add ColourTrans to the “BuildHost” build, install it in RiscOS.Library, and then RMEnsure it from RiscOS.Env.!Common. Of course, this is also one of those things that make cross-compilation difficult, so it would also be useful to replace this with a portable C application that uses the same algorithm as ColourTrans. |
Terje Slettebø (285) 275 posts |
The implementation of sprite plotting (same source and target colour depth) is complete, and the preliminary results are in. There were some surprises, and some things that were rather expected. The module was compiled using O3 optimisation, and it includes a temporary additional reason code for testing, which plots a given sprite a given number of times; for this test, a 100 × 100 pixel sprite was plotted 1000 times, with various settings for colour depth, plot action, and mask. A pointer to the sprite was used when plotting, to eliminate the time taken to look the sprite up. The test was run on VirtualAcorn on a dual-core Duo PC, but the relative timing of the OS and the new version should be similar on a native system. Also, when testing, both the OS and the new versions were plotted side by side, to check that the resulting images were identical. Here are the results of the performance test (“Ext” is the new module, since the name of the SWI is “Ext_SpriteOp”, and the times are in seconds):
32 BPP, no mask: Plot action 0, OS=0.1, Ext=0.09
I’m quite pleased that the new module matches the OS for this first test using a 32 BPP sprite, and it’s somewhat surprising that the new module is actually faster than the OS for non-overwrite plotting…
16 BPP, no mask: Plot action 0, OS=0.07, Ext=0.07
For 16 BPP sprites, except for the overwrite case, the OS is about twice as fast as the new module, so it clearly does some optimisation here (perhaps combining pairs of pixels into words before plotting).
8 BPP, no mask: Plot action 0, OS=0.05, Ext=0.05
We get a similar result for 8 BPP sprites: the OS is about twice as fast, probably for the same reason as for 16 BPP sprites.
4 BPP, no mask: Plot action 0, OS=0.04, Ext=0.83
For sprites of 4 BPP and less, the difference is – as expected – quite dramatic in this initial test. 
The new module is based on generic programming, employing an STL-based architecture of iterators for sprites, rows, and pixels, and for sub-byte pixels, quite a lot of bit fiddling is done for each pixel plotted. The OS, on the other hand, does some clever optimisations for sub-8 BPP sprites, involving loading words from the sprite and shifting them the right number of bits before plotting.
2 BPP, no mask: Plot action 0, OS=0.04, Ext=0.83
Same result as above, for the same reason.
1 BPP, no mask: Plot action 0, OS=0.02, Ext=0.83
Same result as above, for the same reason.
32 BPP, mask: Plot action 0, OS=2.13, Ext=0.76
Again, we have a surprise: the new module is considerably faster than the OS for plotting 32 BPP sprites with a mask (about 2-3 times as fast).
16 BPP, mask: Plot action 0, OS=1.22, Ext=0.75
It’s still faster for 16 BPP with a mask, but not to the same degree, probably due to the same speedup as noted for 16 BPP sprites without a mask.
8 BPP, mask: Plot action 0, OS=0.15, Ext=0.76
Apparently, the OS has some optimisation for 8 BPP sprites with a mask, as here it’s considerably faster than the new version.
4 BPP, mask: Plot action 0, OS=0.09, Ext=1.22
As for the non-mask case, sub-8 BPP sprite plotting is considerably faster in the OS version than in the new version.
2 BPP, mask: Plot action 0, OS=0.06, Ext=1.2
Same result as above, for the same reason.
1 BPP, mask: Plot action 0, OS=0.05, Ext=1.16
Same result as above, for the same reason.
Summary: Frankly, I’m quite relieved now that the preliminary numbers are finally in (we also need to run tests with small and large sprites, but I suspect the numbers there will resemble these). 
When working on the development, I never knew how the real result would turn out versus the OS version, and I’m relieved that, at least for 8/16/32 BPP sprites with mask/no mask, the numbers are comparable, with the new version being somewhat faster for 32 BPP sprites, and 2-3 times as fast for 32 BPP sprites with mask, while the OS is about twice as fast for 8/16 BPP non-mask plotting. For 8/16 BPP plotting, as well as for sub-8 BPP plotting, there are clearly opportunities for optimisation. |
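As a rough illustration of the sub-byte bit fiddling Terje describes, here is a minimal sketch of what a 4 BPP “pixel accessor” has to do for every pixel. All names are invented, and the packing (first pixel in the low nibble of each byte) is an assumption about the sprite format:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: reading or writing one 4 BPP pixel costs a shift and a
// mask, which is the per-pixel overhead the OS avoids by loading whole
// words and shifting them into place.
struct Pixel4Ref {
    std::uint8_t* byte;   // the byte holding this pixel
    unsigned      shift;  // 0 for the low nibble, 4 for the high nibble

    std::uint8_t get() const { return (*byte >> shift) & 0xF; }
    void set(std::uint8_t v)
    {
        *byte = static_cast<std::uint8_t>(
            (*byte & ~(0xF << shift)) | ((v & 0xF) << shift));
    }
};

// "Index" pixel n of a packed 4 BPP row, two pixels per byte.
Pixel4Ref pixel4(std::uint8_t* row, std::size_t n)
{
    return Pixel4Ref{row + n / 2, static_cast<unsigned>((n % 2) * 4)};
}
```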
Trevor Johnson (329) 1645 posts |
These sound like excellent results! Thanks for sharing them and for all the other updates on the way. Is there any chance of ROOL considering the issue of a press release (at an appropriate time, when there have been further comparisons), maybe aimed at programming journals/academics? Perhaps a competition for work on other modules could be launched… supported by a small value bounty to cover the prize. While this sort of work might not be considered of key importance on the road map, it’s very newsworthy IMHO. |
Terje Slettebø (285) 275 posts |
Thanks for the nice feedback, Trevor. :) My main motivation for starting a reimplementation of parts of RISC OS in a higher-level language is that it’s simply “too hard” and too time-consuming to work with raw assembly code, and especially given the limited developer resources we have, it simply doesn’t get done, except for some core kernel parts (thanks Jeffrey and the others!), leaving much of the OS simply stale… By having a foundation written in a higher-level language like C/C++, we at least have the possibility that more people may chip in and start doing updates to the OS, so that we may actually get somewhere with our roadmap… :) I’ve learned a lot from the sprite work, and the lessons learned might be applicable to other kernel graphics work as well. Not to mention that it would be nice to have a 3D graphics API sometime… It may be best to get some more experience with this approach before doing much publicising, but by all means, I’m all for it if someone else would like to chip in and work on reimplementation as well. :) As an aside, I’ve noticed that this thread has been mentioned at a few RISC OS news sites. The results above appear to validate the viability of this approach (reimplementing parts of RISC OS using C/C++), so I’ll postpone any optimisations for now, and instead work on completing the rest of the OS_SpriteOp functionality. There’s quite a bit that remains, including more advanced plotting, like plotting to a different colour depth, colour table translations, scaling and rotating sprites, etc. Fun! :) |
Andrew Rawnsley (492) 1445 posts |
One comment I would make is that it would be better to test on real OMAP (or similar) hardware, since that’s where the results are going to end up. Testing/benchmarking on VA can be a very odd experience (I speak from considerable experience historically!). I have benchmarks showing VA to be 10x (or more) faster than an Iyonix (eg. firebench) which is obviously not true. I just ignored these outliers when publishing results, but it does mean that you can’t always expect to get parity between VA and real hardware, despite the end result (ie. VA operation) feeling so similar. But, on other business, congratulations on getting this far :) :) :) |
Jeffrey Lee (213) 6048 posts |
I’ll second Andrew’s comments about profiling on real hardware! The good news is that you can access the Cortex-A8 performance counter registers from user mode, so by using the cycle count register you can easily get high-accuracy timings of small sections of code. When I get home I’ll post a few snippets of assembler showing you how to do it. It should be much better than using OS_ReadMonotonicTime for everything :) You can also do a similar thing on the Iyonix, although if you want to profile from user mode you have to use one of the 200MHz IOP timers instead of the performance counter hardware. |
Terje Slettebø (285) 275 posts |
Hi Andrew. Thanks for the tip. :) I’ll fire up my BeagleBoard this evening and rerun the test there. I’ll then update the table above. As mentioned, the reason I felt that wasn’t so important to begin with was that it was about comparing two different modules running on the same system, so we’re only talking about relative performance here. I guess that may still lead to some surprises, though, so I’ll rerun the test. The reason I used VA is that I’ve been doing the development there (it lets me use my preferred Windows text editor, and it also tends to be somewhat faster than the BeagleBoard, especially for file operations). |
Alan Robertson (52) 420 posts |
I doff my cap to you sir. |
Jeffrey Lee (213) 6048 posts |
Here’s some code for accessing the Cortex-A8 cycle counter from user mode. First, to enable the counter:

    SWI "OS_EnterOS"
    MOV R0,#1
    MCR CP15,0,R0,C9,C14,0 ; Enable user mode access
    MVN R0,#0
    MCR CP15,0,R0,C9,C12,2 ; Disable performance counters
    MCR CP15,0,R0,C9,C14,2 ; Disable counter interrupts
    MOV R0,#&80000000
    MCR CP15,0,R0,C9,C12,1 ; Enable cycle counter
    MOV R0,#7
    MCR CP15,0,R0,C9,C12,0 ; Enable & reset
    SWI "OS_LeaveOS"

Then the counter can simply be read using “MRC CP15,0,<reg>,C9,C13,0”. As you’d expect, it counts the number of CPU clock cycles. More info can be found in the Cortex-A8 TRM. |
Terje Slettebø (285) 275 posts |
Thanks, Jeffrey. For the timing I’ve done so far, I’ve just used OS_ReadMonotonicTime (as you guessed :) ), which seems to work well enough, but it’s nice to know how to use higher-resolution counters. In any case, I’ve re-run the test on the BeagleBoard, and the results have come out somewhat differently, which seems to indicate that extra writes to screen memory are penalised particularly severely – reasonably enough, given the huge difference in clock speed between the core and the main memory. This seems to have been somewhat masked in the VirtualAcorn timing, whereas when testing on the BeagleBoard, it shows up more clearly. For reference, the sprite used for the testing looks like this. The form has been chosen for easier testing of functionality, not for performance testing, but the actual content of the sprite shouldn’t affect the performance. However, as will be seen from the results below, the amount of the sprite masked away certainly seems to play quite a large role, with the masked plotting actually being faster than the unmasked one, presumably since there’s less writing to the screen memory. When reading the following test results, please keep in mind that no optimisation has yet been done, so the purpose of these measurements is not least to identify potential opportunities for optimisation. Here are the results from the BeagleBoard test:
32 BPP, no mask: Plot action 0, OS=0.21, Ext=0.21
Sadly, the new version no longer leads for non-overwrite plotting.
16 BPP, no mask: Plot action 0, OS=0.1, Ext=0.11
The difference grows for 16 BPP sprites. I suspect this is due to the OS combining pixels into words before writing, while the new version writes each 8/16 BPP pixel separately.
8 BPP, no mask: Plot action 0, OS=0.1, Ext=0.08
The same goes for 8 BPP sprites.
4 BPP, no mask: Plot action 0, OS=0.07, Ext=2.61
As in the earlier test, there’s a lot of opportunity for optimisation for sub-8 BPP sprites. 
2 BPP, no mask: Plot action 0, OS=0.07, Ext=2.58
1 BPP, no mask: Plot action 0, OS=0.06, Ext=2.57
32 BPP, mask: Plot action 0, OS=3.93, Ext=0.54
Now, this one is interesting… Apparently, the OS version does more work for masked sprites, while the new version – as only a tiny part of the sprite remains after applying the mask – only writes a little to the screen, so at least for this kind of heavily masked sprite, the new version comes out ahead.
16 BPP, mask: Plot action 0, OS=1.61, Ext=0.49
The new version’s lead is still there for 16 BPP sprites with a mask, although less pronounced.
8 BPP, mask: Plot action 0, OS=0.71, Ext=0.49
For 8 BPP sprites with a mask they are roughly equal.
4 BPP, mask: Plot action 0, OS=0.34, Ext=0.94
Sub-8 BPP sprites with a mask behave similarly to their non-masked counterparts.
2 BPP, mask: Plot action 0, OS=0.14, Ext=0.88
1 BPP, mask: Plot action 0, OS=0.08, Ext=0.85
In summary, minimising screen memory writes, and combining pixels into (multiple) word writes, seems to be a useful way forward for any optimisation. Also, for modern systems, it probably pays to focus on fast plotting for 8/16/32 BPP, rather than the sub-8 BPP sprites, at least to start with. |
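The “combine pixels into word writes” idea from the summary can be sketched like this for 8 BPP with plot action 1 (OR); the function name is invented, and this is only an illustration of the technique, not the module’s code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch only: four 8 BPP pixels are loaded, combined and stored as one
// 32-bit word, cutting the number of screen-memory accesses to a quarter;
// the tail loop handles any leftover pixels. Because OR works bitwise, the
// word-sized OR gives the same result as four byte-sized ORs on any
// endianness.
void plotRow8Or(const std::uint8_t* sprite, std::uint8_t* screen,
                std::size_t pixels)
{
    std::size_t i = 0;
    for (; i + 4 <= pixels; i += 4) {
        std::uint32_t s, d;
        std::memcpy(&s, sprite + i, 4);   // one load for four pixels
        std::memcpy(&d, screen + i, 4);
        d |= s;
        std::memcpy(screen + i, &d, 4);   // one store for four pixels
    }
    for (; i != pixels; ++i)              // remaining 0-3 pixels
        screen[i] |= sprite[i];
}
```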
Uwe Kall (215) 120 posts |
Terje, is it possible to add a different behaviour that uses the mask (byte) not as a switch but as an alpha value? I think it might be worth a try, and be fun… What do you think? Concerning what needs to be optimised: how about the overall speed? I mean, if 8 BPP is as fast as 32 BPP, and 32 BPP plotting is fast ‘enough’, then probably the optimisation for lower colour resolution should better not clutter the code and endanger your goal of better code readability. Even on a RiscPC with Viewfinder I tended to use high colour depth modes as far as they were possible in conjunction with high resolution. |
Terje Slettebø (285) 275 posts |
Hi Uwe.
It certainly would, but I’d then expect that the only results from the last timing where the new version is ahead of the OS version, namely plotting with a mask, would be lost as an advantage, because of the many unnecessary screen writes, which appear to be really expensive. At the moment, I’m looking into what makes things slow in the new version (some of it I know, some of it I need to determine, and all of it I need more details on), and in all likelihood, it’s “only” the inner loop that needs to be optimised, i.e. the function that draws a row of the sprite. There are several different variants of the implementation of this row-draw function, such as these:

std::memcpy(screenRow.begin(),spriteRow.begin(),spriteRow.size()); // Plot 8/16/32 BPP sprites with overwrite mode, no mask (this one gets identical timing to the OS version)

std::copy(spriteRow.begin(),spriteRow.end(),screenRow.begin()); // As above, but for 1/2/4 BPP, using “pixel iterators”

etc.
Having the performance of some of the last figures where the new version is still ahead drop like a rock is not my idea of “fun”… ;) Consider these:
32 BPP, no mask: Plot action 1, OS=2.1, Ext=2.88 (the OS is ahead)
32 BPP, mask: Plot action 1, OS=3.93, Ext=0.65 (the new version is ahead)
I can’t think of any reason other than the avoided screen writes for the new version being faster (although I haven’t checked what the OS version is doing here, and it’s nontrivial to read that from the kernel source), but to humour you, I could try using the mask as an alpha value as you suggest. :)
The Draw module is another module it would have been nice to reimplement in C/C++, although I see at least as big a challenge in getting fast performance there as with the sprite plotting (which, compared to Draw paths – except for scaled/rotated sprites – is comparatively simple). It might enable us to implement some ArtWorks-like improvements in the OS.
Code clutter is probably not such a big concern here, as the handling of different plotting situations (BPP, plot action, mask, etc.) is done by different functions, especially at the “lowest” level: plotting a single row of a sprite. Also, there are no switch-cases or if-elses used to determine plotting policy: everything is done by polymorphism (virtual function calls). However, I certainly agree with you about the optimisation priority: for now, I’ll focus on making 8/16/32 BPP plotting fast (8 and 16 BPP plotting is rather a lot slower than the OS in the latest timing above). As for 1/2/4 BPP, I guess hardly anyone uses 16 colours or less these days. However, as a developer, I find it “inelegant” to have something use way more resources than it (apparently) needs. For now, I feel it’s best to keep the focus on 8/16/32 BPP plotting. “I’ll be back…” ;) Thanks for your comments. :) |
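The “no switch-case or if-else, everything by polymorphism” design Terje describes might look roughly like the following; all class and function names here are invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Sketch only: the plotting policy (plot action, mask, BPP) is resolved
// once per call into a concrete row plotter, so the per-row code contains
// no branch on the policy, only a virtual call.
struct RowPlotter {
    virtual ~RowPlotter() = default;
    virtual void plot(const std::uint8_t* src, std::uint8_t* dst,
                      std::size_t n) const = 0;
};

struct OverwriteRow : RowPlotter {        // plot action 0, no mask
    void plot(const std::uint8_t* src, std::uint8_t* dst,
              std::size_t n) const override
    { std::memcpy(dst, src, n); }
};

struct OrRow : RowPlotter {               // plot action 1, no mask
    void plot(const std::uint8_t* src, std::uint8_t* dst,
              std::size_t n) const override
    { for (std::size_t i = 0; i != n; ++i) dst[i] |= src[i]; }
};

// Chosen once, up front; the row loop then just calls plotter->plot(...).
std::unique_ptr<RowPlotter> makePlotter(int plotAction)
{
    if (plotAction == 1) return std::make_unique<OrRow>();
    return std::make_unique<OverwriteRow>();
}
```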
Andrew Rawnsley (492) 1445 posts |
I suspect some of the desktop graphics/icons are still 16 colour for legacy reasons, and size. I have a sneaky suspicion that the tool-sprites, for example, fall into this category. However, I could be wrong, and OS5 may be different to OS4 etc in that regard – I haven’t checked, I’m afraid. On the other hand, this may be an impetus to replace any remaining low colour depth graphics with more modern ones. |
Terje Slettebø (285) 275 posts |
Having started optimising the code, I started from the top of the test results, in other words this one (the BeagleBoard results):
As we can see, the first case where the new version falls behind the OS version is when using plot action 1 (OR). To investigate the potential for optimising this and other cases, I started out writing a hand-written assembly function for this case (only the function plotting a row of the sprite). Here’s the result
:D I’ve not forgotten my “assembly code kung fu”… ;) In other words, it’s absolutely possible to optimise these cases, and keep the amount of assembly code fairly limited. However, as the main motivation for this reimplementation was to move away from assembly code, I’ll investigate whether it’s possible to coax the compiler into producing roughly equivalently performant code. Also, the main reason for this assembly code “experiment” was to provide a baseline against which compiled code performance may be measured, as well as to experiment with various implementation strategies. The assembly code version here is probably about as fast as you can get it: it uses LDM/STM to load/store four words/pixels at a time, so the question is whether it’s possible to get the compiler to do the same… By the way, I’m using GAS as the assembler here, as I’ve read somewhere that ASASM isn’t (yet) able to handle VFP/NEON instructions (which, by the way, might lead to a further optimisation for code like this).

Edit: Using only a loop of LDR/STR makes the new version take about twice as long as when using LDM/STM, ending up taking 3 seconds where the OS only takes 2.1, so LDM/STM (or something like it from VFP/SIMD) definitely seems to be needed to match or exceed OS performance.

Edit: Nope, not even when using GCC with -march=armv7-a and -O3 did I manage to get it to use LDM/STM… Here’s the program I used:

struct PixelBlock
{
    int pixel[4];
};

PixelBlock *spriteRow;
PixelBlock *screenRow;

void f(PixelBlock *spriteRow, PixelBlock *screenRow)
{
    for(int i=0; i!=10; ++i)
    {
        PixelBlock sprite=spriteRow[i];
        PixelBlock screen=screenRow[i];

        screen.pixel[0]|=sprite.pixel[0];
        screen.pixel[1]|=sprite.pixel[1];
        screen.pixel[2]|=sprite.pixel[2];
        screen.pixel[3]|=sprite.pixel[3];

        screenRow[i]=screen;
    }
}

int main()
{
    f(spriteRow,screenRow);
}

It seems assembly code is here to stay, at least for now… |
Jeffrey Lee (213) 6048 posts |
Have you tried using __restrict on the pointers? It should result in the compiler producing better code, perhaps even using LDM/STM if you’re lucky.
<pre> works for me, although maybe it needs to be <notextile><pre> to stop [1] being translated to 1: hello, I am not using notextile screen.pixel[0] |= sprite.pixel[0]; hello, I am using notextile screen.pixel[0] |= sprite.pixel[0]; Nope, everything seems fine without using <notextile>. |
Terje Slettebø (285) 275 posts |
Thanks for the tip, I’ll try that. Edit: I’ve tested it now, and oddly enough, the code using __restrict ended up slightly longer, and with about the same number of LDM/STM as the original version (which is to say, just a few; the vast majority being LDR/STR). Looking at the example code here, I can’t see any inefficiencies that would result from pointer aliasing: Every value is used exactly once, so we’re not reusing a value that might have been overwritten since it was first used, and therefore may have to be reloaded (such as in the example here).
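For contrast, a hypothetical case where __restrict does make a difference is one where a value is re-read after an intervening store; this example is invented, not taken from the thread’s code:

```cpp
// Sketch only: __restrict is a GCC-style extension (plain `restrict` in
// C99). Here in[0] is re-read after every store through out; without the
// non-aliasing promise the compiler must reload in[0] on each iteration,
// since out[i] might overwrite it. In the sprite row code above, every
// value is loaded and stored exactly once, so there is no such reload
// for __restrict to remove.
void scaleByFirst(int* __restrict out, const int* __restrict in, int n)
{
    for (int i = 0; i != n; ++i)
        out[i] = in[0] * i;   // in[0] can now be kept in a register
}
```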
I found now that for some reason, if you don’t have a blank line above/below <pre>/</pre>, it won’t come out right. |
Trevor Johnson (329) 1645 posts |
ISTR it being the same for |
Jeff Doggett (257) 234 posts |
Try to avoid using for loops when you know that the loop will execute at least once. The C standard defines a for loop in terms of a while loop:

for(int i=0; i!=10; ++i)

Most compilers will produce better code with a do..while loop:

unsigned int i;

Also, you should use unsigned ints to represent unsigned quantities. And if you’re not using i in the loop, then decrementing will produce faster code…

i = 10; |
Terje Slettebø (285) 275 posts |
Hi Jeff. First of all, the above code is just some test code I used to try to make the compiler use LDM/STM (which didn’t work), not the actual code. Secondly, the actual code works generally in terms of “iterators”, which may not have a < operator, but which are guaranteed to have == and != operators. In this case, either would work (since it’s raw pointers, also in the actual code), but there are other concerns: having done a search on this, the opinion on using < versus != is mixed. One view is that using < expresses the intent that the variable is always less than the given value. One issue with <, though, is that you don’t know the postcondition, i.e. what i is after the loop. With != you know what it is, namely what you test against. Again, that may not matter in this case, but one problem with using < as a “safer” alternative is that it may hide bugs, and that’s not something I want… If, for some reason, i is higher than 10 from the start, then that’s a bug, and that’s something I want to know about (in the form of an “infinite” loop), rather than the program appearing to work correctly. The latter is my main motivation for using !=: it’s “safer”, in the sense that you discover bugs quickly, rather than having them hidden by some “safe” defensive programming practice… Regarding signed/unsigned int: again, the consensus seems to be mixed. I’ve earlier suggested the use of unsigned int myself, but in a long discussion about it years ago at comp.sys.lang.c++.*, it emerged that using unsigned int in C/C++ for one thing doesn’t help detect problems, and may actually lead to undetected ones (you may assign negative values to unsigned variables, and they are silently converted to the wrong value), so I’ve tended to play it safe and use signed variables. In any case, it should not make a difference, performance-wise, to the code above. 
Lastly, about using do-while rather than for and using decrement and testing against zero: Yes, I’m aware of these, but in the real code, I test against the end pointer, anyway, not using a counter variable. Again, the code above in no way has any resemblance to the actual code, so trying to “optimise” it is pretty futile… :) |