Idea for discussion: Rewriting RISC OS using a high-level language
Steve Revill (20) 1361 posts |
Adding ARMv7 support to the Norcroft compiler is probably a few man-weeks of work. Migrating all of the assembler-sourced code from ObjAsm to asasm or similar is almost certainly a much bigger task, for a few reasons: 1. there are a lot of ARM components. Similar arguments exist for the C components, too. This is just my speculation, but I would find it hard to believe that the reverse is true. |
Martin Bazley (331) 379 posts |
I believe one of the current major focuses of RISC OS GCC development is improving asasm to support said features. It doesn’t yet, but it will do. Right now, as you say, effectively rewriting half the kernel is far more trouble than it’s worth. |
Terje Slettebø (285) 275 posts |
You guys may be interested in knowing that steady progress is being made. Lately, I’ve been focusing on plotting sprites (source and target BPP being the same), and the following now works:
Care is being taken to not do more work than needed for a given combination of parameters (for example, the mask is not tracked for non-mask plotting), as well as using fast implementation for simpler cases (e.g. using memcpy() in the case of 8/16/32 BPP, plot action 0 (overwrite), and no mask). Next up, and this is a big one: Plotting sprites with less than 8 BPP… |
Steve Revill (20) 1361 posts |
Nice work, Terje. One question: are you in a position to do a bit of early benchmarking (likely to be quite bodgy, I’d imagine) to see how the two versions compare? I accept it’s not necessarily going to give you a very useful answer – performance is probably an issue you’ve not been focused on too much yet – but it’s an interesting question. |
Terje Slettebø (285) 275 posts |
Hi Steve.
Mind reader… ;) I had actually planned to do just that, once I get it working also for less-than-8 BPP sprites, which I planned to work on today. I’ve had quite a bit of mail correspondence with, and encouragement from, Richard Spencer (“hilltop” here, the author of the Memphis RAM disc alternative) as the design has progressed, and as he mentioned, it’s going to be tough competition against hand-written assembly code… :) Nevertheless, quite a lot of work has gone into performance, as it would be comparatively “easy” to make a system that would handle all variations of BPP, mask, plot action, etc., and be dog slow in all of them… It’s a different matter to ensure that the simple cases run fast, and that it doesn’t do anything more than needed in a given case (as mentioned above). However, having done that, any further optimisation needs to be guided by profiling, so that’s already in the plans. I’ll post figures here once it’s ready. The tests also need to be quite varied… It needs to be fast for large sprites, but also for small sprites, because a lot of the sprites in RISC OS are quite small, like the window furniture and icons, so we need several kinds of tests.
Interesting and vitally important… :) The last thing I want to do is to spend lots of time on this, with not enough up-front thinking about performance, and end up with something that doesn’t “fly”… Therefore, although “premature optimisation” should be avoided, performance has to be an important factor from the very start… That means making the most of the processor and memory system, which means using them in a way that works well, like plotting rows from left to right (rather than e.g. top to bottom), to make use of locality of reference, the write-back buffer, etc. Thanks for the encouragement. :) |
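The locality point above can be illustrated with a minimal sketch (names invented): the inner loop walks along a row, so consecutive iterations touch consecutive addresses, which suits the cache and the write-back buffer; swapping the loops would stride through memory instead.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: sprite and screen rows are laid out consecutively in
// memory, so the inner loop advances through sequential addresses.
void copyRowMajor(const std::uint8_t* src, std::uint8_t* dst,
                  std::size_t width, std::size_t height, std::size_t stride)
{
    for (std::size_t y = 0; y != height; ++y)      // one pass per row
        for (std::size_t x = 0; x != width; ++x)   // sequential within the row
            dst[y * stride + x] = src[y * stride + x];
}
```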
Ben Avison (25) 445 posts |
Um, maybe. Could get a bit messy getting dependencies correct when the object files are built from the same source files. I think it would be simpler to add ColourTrans to the “BuildHost” build, install it in RiscOS.Library, and then RMEnsure it from RiscOS.Env.!Common. Of course, this is also one of those things that make cross-compilation difficult, so it would also be useful to replace this with a portable C application that uses the same algorithm as ColourTrans. |
Terje Slettebø (285) 275 posts |
The implementation of sprite plotting (same source and target colour depth) is complete, and the preliminary results are in. There were some surprises, and some things that were rather expected. The module was compiled using O3 optimisation, and it includes a temporary additional reason code for testing, which plots a given sprite a given number of times; for this test, a 100 × 100 pixel sprite was plotted 1000 times, with various settings for colour depth, plot action, and mask. A pointer to the sprite was used when plotting, to eliminate the time taken to look the sprite up. The test was run on VirtualAcorn on a dual-core Duo PC, but the relative timing of the OS and the new version should be similar on a native system. Also, when testing, both the OS and the new versions were plotted side by side, to check that the resulting images were identical. Here are the results of the performance test (“Ext” is the new module, since the name of the SWI is “Ext_SpriteOp”, and the times are in seconds):
32 BPP, no mask: Plot action 0, OS=0.1, Ext=0.09
I’m quite pleased that the new module matches the OS for this first test using a 32 BPP sprite, and it’s somewhat surprising that the new module is actually faster than the OS for non-overwrite plotting…
16 BPP, no mask: Plot action 0, OS=0.07, Ext=0.07
For 16 BPP sprites, except for the overwrite case, the OS is about twice as fast as the new module, so it clearly does some optimisation here (perhaps combining pairs of pixels into words before plotting).
8 BPP, no mask: Plot action 0, OS=0.05, Ext=0.05
We get a similar result for 8 BPP sprites: the OS is about twice as fast, probably for the same reason as for 16 BPP sprites.
4 BPP, no mask: Plot action 0, OS=0.04, Ext=0.83
For sprites of 4 BPP and less, the difference is – as expected – quite dramatic in this initial test. 
The new module is based on generic programming, employing an STL-based architecture of iterators for sprites, rows, and pixels, and for sub-byte pixels, quite a lot of bit fiddling is done for each pixel plotted. The OS, on the other hand, does some clever optimisations for sub-8 BPP sprites, involving loading words from the sprite and shifting them the right number of bits before plotting.
2 BPP, no mask: Plot action 0, OS=0.04, Ext=0.83
Same result as above, for the same reason.
1 BPP, no mask: Plot action 0, OS=0.02, Ext=0.83
Same result as above, for the same reason.
32 BPP, mask: Plot action 0, OS=2.13, Ext=0.76
Again, we have a surprise: the new module is considerably faster than the OS for plotting 32 BPP sprites with a mask (about 2-3 times as fast).
16 BPP, mask: Plot action 0, OS=1.22, Ext=0.75
It’s still faster for 16 BPP with a mask, but not to the same degree, probably due to the same speedup as noted for 16 BPP sprites without a mask.
8 BPP, mask: Plot action 0, OS=0.15, Ext=0.76
Apparently, the OS has some optimisation for 8 BPP sprites with a mask, as here it’s considerably faster than the new version.
4 BPP, mask: Plot action 0, OS=0.09, Ext=1.22
As for the non-mask case, sub-8 BPP sprite plotting is considerably faster in the OS version than in the new version.
2 BPP, mask: Plot action 0, OS=0.06, Ext=1.2
Same result as above, for the same reason.
1 BPP, mask: Plot action 0, OS=0.05, Ext=1.16
Same result as above, for the same reason.
Summary: Frankly, I’m quite relieved now that the preliminary numbers are finally in (we also need to run tests with small and large sprites, but I suspect the numbers there will resemble these). 
When working on the development, I never knew how the real result would turn out versus the OS version, and I’m relieved that, at least for 8/16/32 BPP sprites with mask/no mask, the numbers are comparable, with the new version being somewhat faster for 32 BPP sprites, and 2-3 times as fast for 32 BPP sprites with mask, while the OS is about twice as fast for 8/16 BPP non-mask plotting. For 8/16 BPP plotting, as well as for sub-8 BPP plotting, there are clearly opportunities for optimisation. |
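As a rough illustration of the sub-byte bit fiddling Terje describes, here is a minimal sketch of what a 4 BPP “pixel accessor” has to do for every pixel. All names are invented, and the packing (first pixel in the low nibble of each byte) is an assumption about the sprite format:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: reading or writing one 4 BPP pixel costs a shift and a
// mask, which is the per-pixel overhead the OS avoids by loading whole
// words and shifting them into place.
struct Pixel4Ref {
    std::uint8_t* byte;   // the byte holding this pixel
    unsigned      shift;  // 0 for the low nibble, 4 for the high nibble

    std::uint8_t get() const { return (*byte >> shift) & 0xF; }
    void set(std::uint8_t v)
    {
        *byte = static_cast<std::uint8_t>(
            (*byte & ~(0xF << shift)) | ((v & 0xF) << shift));
    }
};

// "Index" pixel n of a packed 4 BPP row, two pixels per byte.
Pixel4Ref pixel4(std::uint8_t* row, std::size_t n)
{
    return Pixel4Ref{row + n / 2, static_cast<unsigned>((n % 2) * 4)};
}
```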
Trevor Johnson (329) 1645 posts |
These sound like excellent results! Thanks for sharing them and for all the other updates on the way. Is there any chance of ROOL considering the issue of a press release (at an appropriate time, when there have been further comparisons), maybe aimed at programming journals/academics? Perhaps a competition for work on other modules could be launched… supported by a small value bounty to cover the prize. While this sort of work might not be considered of key importance on the road map, it’s very newsworthy IMHO. |
Terje Slettebø (285) 275 posts |
Thanks for the nice feedback, Trevor. :) My main motivation for starting a reimplementation of parts of RISC OS in a higher-level language is that it’s simply “too hard” and too time-consuming to work with raw assembly code, and especially given the limited developer resources we have, it simply doesn’t get done, except for some core kernel parts (thanks Jeffrey and the others!), leaving much of the OS simply stale… By having a foundation written in a higher-level language like C/C++, we at least have the possibility that more people may chip in and start doing updates to the OS, so that we may actually get somewhere with our roadmap… :) I’ve learned a lot from the sprite work, and the lessons learned might be applicable to other kernel graphics work as well. Not to mention that it would be nice to have a 3D graphics API sometime… It may be best to get some more experience with this approach before doing much publicising, but by all means, I’m all for it if someone else would like to chip in and work on reimplementation as well. :) As an aside, I’ve noticed that this thread has been mentioned at a few RISC OS news sites. The results above appear to validate the viability of this approach (reimplementing parts of RISC OS using C/C++), so I’ll postpone any optimisations for now, and instead work on completing the rest of the OS_SpriteOp functionality. There’s quite a bit that remains, including more advanced plotting, like plotting to a different colour depth, colour table translations, scaling and rotating sprites, etc. Fun! :) |
Andrew Rawnsley (492) 1445 posts |
One comment I would make is that it would be better to test on real OMAP (or similar) hardware, since that’s where the results are going to end up. Testing/benchmarking on VA can be a very odd experience (I speak from considerable experience historically!). I have benchmarks showing VA to be 10x (or more) faster than an Iyonix (eg. firebench) which is obviously not true. I just ignored these outliers when publishing results, but it does mean that you can’t always expect to get parity between VA and real hardware, despite the end result (ie. VA operation) feeling so similar. But, on other business, congratulations on getting this far :) :) :) |
Jeffrey Lee (213) 6048 posts |
I’ll second Andrew’s comments about profiling on real hardware! The good news is that you can access the Cortex-A8 performance counter registers from user mode, so by using the cycle count register you can easily get high-accuracy timings of small sections of code. When I get home I’ll post a few snippets of assembler showing you how to do it. It should be much better than using OS_ReadMonotonicTime for everything :) You can also do a similar thing on the Iyonix, although if you want to profile from user mode you have to use one of the 200MHz IOP timers instead of the performance counter hardware. |
Terje Slettebø (285) 275 posts |
Hi Andrew. Thanks for the tip. :) I’ll fire up my BeagleBoard this evening and rerun the test there. I’ll then update the table above. As mentioned, the reason I felt that wasn’t so important to begin with was that it was about comparing two different modules running on the same system, so we’re only talking about relative performance here. I guess that may still lead to some surprises, though, so I’ll rerun the test. The reason I used VA is that I’ve been doing the development there (it lets me use my preferred Windows text editor, and it also tends to be somewhat faster than the BeagleBoard, especially for file operations). |
Alan Robertson (52) 420 posts |
I doff my cap to you sir. |
Jeffrey Lee (213) 6048 posts |
Here’s some code for accessing the Cortex-A8 cycle counter from user mode. First, to enable the counter:

    SWI "OS_EnterOS"
    MOV R0,#1
    MCR CP15,0,R0,C9,C14,0 ; Enable user mode access
    MVN R0,#0
    MCR CP15,0,R0,C9,C12,2 ; Disable performance counters
    MCR CP15,0,R0,C9,C14,2 ; Disable counter interrupts
    MOV R0,#&80000000
    MCR CP15,0,R0,C9,C12,1 ; Enable cycle counter
    MOV R0,#7
    MCR CP15,0,R0,C9,C12,0 ; Enable & reset
    SWI "OS_LeaveOS"

Then the counter can simply be read using “MRC CP15,0,<reg>,C9,C13,0”. As you’d expect, it counts the number of CPU clock cycles. More info can be found in the Cortex-A8 TRM. |
Terje Slettebø (285) 275 posts |
Thanks, Jeffrey. For the timing I’ve done so far, I’ve just used OS_ReadMonotonicTime (as you guessed :) ), which seems to work well enough, but it’s nice to know how to use higher-resolution counters. In any case, I’ve re-run the test on the BeagleBoard, and the results have come out somewhat differently, which seems to indicate that extra writes to screen memory are penalised particularly severely – reasonably enough, given the huge difference in clock speed between the core and the main memory. This seems to have been somewhat masked in the VirtualAcorn timing, whereas when testing on the BeagleBoard, it shows up more clearly. For reference, the sprite used for the testing looks like this. The form has been chosen for easier testing of functionality, not for performance testing, but the actual content of the sprite shouldn’t affect the performance. However, as will be seen from the results below, the amount of the sprite masked away certainly seems to play quite a large role, with the masked plotting actually being faster than the unmasked one, presumably since there’s less writing to the screen memory. When reading the following test results, please keep in mind that no optimisation has yet been done, so the purpose of these measurements is not least to identify potential opportunities for optimisation. Here are the results from the BeagleBoard test:
32 BPP, no mask: Plot action 0, OS=0.21, Ext=0.21
Sadly, the new version no longer leads for non-overwrite plotting.
16 BPP, no mask: Plot action 0, OS=0.1, Ext=0.11
The difference grows for 16 BPP sprites. I suspect this is due to the OS combining pixels into words before writing, while the new version writes each 8/16 BPP pixel separately.
8 BPP, no mask: Plot action 0, OS=0.1, Ext=0.08
The same goes for 8 BPP sprites.
4 BPP, no mask: Plot action 0, OS=0.07, Ext=2.61
As in the earlier test, there’s a lot of opportunity for optimisation for sub-8 BPP sprites. 
2 BPP, no mask: Plot action 0, OS=0.07, Ext=2.58
1 BPP, no mask: Plot action 0, OS=0.06, Ext=2.57
32 BPP, mask: Plot action 0, OS=3.93, Ext=0.54
Now, this one is interesting… Apparently, the OS version does more work for masked sprites, while the new version – as only a tiny part of the sprite remains after applying the mask – only writes a little to the screen, so at least for this kind of heavily masked sprite, the new version comes out ahead.
16 BPP, mask: Plot action 0, OS=1.61, Ext=0.49
The new version’s lead is still there for 16 BPP sprites with a mask, although less pronounced.
8 BPP, mask: Plot action 0, OS=0.71, Ext=0.49
For 8 BPP sprites with a mask they are roughly equal.
4 BPP, mask: Plot action 0, OS=0.34, Ext=0.94
Sub-8 BPP sprites with a mask behave similarly to their non-masked counterparts.
2 BPP, mask: Plot action 0, OS=0.14, Ext=0.88
1 BPP, mask: Plot action 0, OS=0.08, Ext=0.85
In summary, minimising screen memory writes, and combining pixels into (multiple) word writes, seems to be a useful way forward for any optimisation. Also, for modern systems, it probably pays to focus on fast plotting for 8/16/32 BPP, rather than the sub-8 BPP sprites, at least to start with. |
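The “combine pixels into word writes” idea from the summary can be sketched like this for 8 BPP with plot action 1 (OR); the function name is invented, and this is only an illustration of the technique, not the module’s code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch only: four 8 BPP pixels are loaded, combined and stored as one
// 32-bit word, cutting the number of screen-memory accesses to a quarter;
// the tail loop handles any leftover pixels. Because OR works bitwise, the
// word-sized OR gives the same result as four byte-sized ORs on any
// endianness.
void plotRow8Or(const std::uint8_t* sprite, std::uint8_t* screen,
                std::size_t pixels)
{
    std::size_t i = 0;
    for (; i + 4 <= pixels; i += 4) {
        std::uint32_t s, d;
        std::memcpy(&s, sprite + i, 4);   // one load for four pixels
        std::memcpy(&d, screen + i, 4);
        d |= s;
        std::memcpy(screen + i, &d, 4);   // one store for four pixels
    }
    for (; i != pixels; ++i)              // remaining 0-3 pixels
        screen[i] |= sprite[i];
}
```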
Uwe Kall (215) 120 posts |
Terje, is it possible to add a different behaviour that uses the mask (byte) not as a switch but as an alpha value? I think it might be worth a try, and be fun… What do you think? Concerning what needs to be optimised: how about the overall speed? I mean, if 8 BPP is as fast as 32 BPP, and 32 BPP plotting is fast ‘enough’, then probably the optimisation for lower colour resolution should better not clutter the code and endanger your goal of better code readability. Even on a RiscPC with Viewfinder I tended to use high colour depth modes as far as they were possible in conjunction with high resolution. |
Terje Slettebø (285) 275 posts |
Hi Uwe.
It certainly would, but I’d then expect that the only results from the last timing where the new version is ahead of the OS version, namely plotting with a mask, would be lost as an advantage, because of the many unnecessary screen writes, which appear to be really expensive. At the moment, I’m looking into what makes things slow in the new version (some of it I know, some of it I need to determine, and all of it I need more details on), and in all likelihood, it’s “only” the inner loop that needs to be optimised, i.e. the function that draws a row of the sprite. There are several different variants of the implementation of this row-draw function, such as these:

std::memcpy(screenRow.begin(),spriteRow.begin(),spriteRow.size()); // Plot 8/16/32 BPP sprites with overwrite mode, no mask (this one gets identical timing to the OS version)

std::copy(spriteRow.begin(),spriteRow.end(),screenRow.begin()); // As above, but for 1/2/4 BPP, using “pixel iterators”

etc.
Having the performance of some of the last figures where the new version is still ahead drop like a rock is not my idea of “fun”… ;) Consider these:
32 BPP, no mask: Plot action 1, OS=2.1, Ext=2.88 (the OS is ahead)
32 BPP, mask: Plot action 1, OS=3.93, Ext=0.65 (the new version is ahead)
I can’t think of any reason other than the avoided screen writes for the new version being faster (although I haven’t checked what the OS version is doing here, and it’s nontrivial to read that from the kernel source), but to humour you, I could try using the mask as an alpha value as you suggest. :)
The Draw module is another module it would have been nice to reimplement in C/C++, although I see at least as big a challenge in getting fast performance there as with the sprite plotting (which, compared to Draw paths – except for scaled/rotated sprites – is comparatively simple). It might enable us to implement some ArtWorks-like improvements in the OS.
Code clutter is probably not such a big concern here, as the handling of different plotting situations (BPP, plot action, mask, etc.) is done by different functions, especially at the “lowest” level: plotting a single row of a sprite. Also, there are no switch-cases or if-elses used to determine plotting policy: everything is done by polymorphism (virtual function calls). However, I certainly agree with you about the optimisation priority: for now, I’ll focus on making 8/16/32 BPP plotting fast (8 and 16 BPP plotting is rather a lot slower than the OS in the latest timing above). As for 1/2/4 BPP, I guess hardly anyone uses 16 colours or less these days. However, as a developer, I find it “inelegant” to have something use way more resources than it (apparently) needs. For now, I feel it’s best to keep the focus on 8/16/32 BPP plotting. “I’ll be back…” ;) Thanks for your comments. :) |
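The “no switch-case or if-else, everything by polymorphism” design Terje describes might look roughly like the following; all class and function names here are invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Sketch only: the plotting policy (plot action, mask, BPP) is resolved
// once per call into a concrete row plotter, so the per-row code contains
// no branch on the policy, only a virtual call.
struct RowPlotter {
    virtual ~RowPlotter() = default;
    virtual void plot(const std::uint8_t* src, std::uint8_t* dst,
                      std::size_t n) const = 0;
};

struct OverwriteRow : RowPlotter {        // plot action 0, no mask
    void plot(const std::uint8_t* src, std::uint8_t* dst,
              std::size_t n) const override
    { std::memcpy(dst, src, n); }
};

struct OrRow : RowPlotter {               // plot action 1, no mask
    void plot(const std::uint8_t* src, std::uint8_t* dst,
              std::size_t n) const override
    { for (std::size_t i = 0; i != n; ++i) dst[i] |= src[i]; }
};

// Chosen once, up front; the row loop then just calls plotter->plot(...).
std::unique_ptr<RowPlotter> makePlotter(int plotAction)
{
    if (plotAction == 1) return std::make_unique<OrRow>();
    return std::make_unique<OverwriteRow>();
}
```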
Andrew Rawnsley (492) 1445 posts |
I suspect some of the desktop graphics/icons are still 16 colour for legacy reasons, and size. I have a sneaky suspicion that the tool-sprites, for example, fall into this category. However, I could be wrong, and OS5 may be different to OS4 etc in that regard – I haven’t checked, I’m afraid. On the other hand, this may be an impetus to replace any remaining low colour depth graphics with more modern ones. |
Terje Slettebø (285) 275 posts |
Having started optimising the code, I started from the top of the test results, in other words this one (the BeagleBoard results):
As we can see, the first case where the new version falls behind the OS version is when using plot action 1 (OR). To investigate the potential for optimising this and other cases, I started out writing a hand-written assembly function for this case (only the function plotting a row of the sprite). Here’s the result
:D I’ve not forgotten my “assembly code kung fu”… ;) In other words, it’s absolutely possible to optimise these cases, and keep the amount of assembly code fairly limited. However, as the main motivation for this reimplementation was to move away from assembly code, I’ll investigate whether it’s possible to coax the compiler into producing roughly equivalently performant code. Also, the main reason for this assembly code “experiment” was to provide a baseline against which compiled code performance may be measured, as well as to experiment with various implementation strategies. The assembly code version here is probably about as fast as you can get it: it uses LDM/STM to load/store four words/pixels at a time, so the question is whether it’s possible to get the compiler to do the same… By the way, I’m using GAS as the assembler here, as I’ve read somewhere that ASASM isn’t (yet) able to handle VFP/NEON instructions (which, by the way, might lead to a further optimisation for code like this).

Edit: Using only a loop of LDR/STR makes the new version take about twice as long as when using LDM/STM, ending up taking 3 seconds where the OS only takes 2.1, so LDM/STM (or something like it from VFP/SIMD) definitely seems to be needed to match or exceed OS performance.

Edit: Nope, not even when using GCC with -march=armv7-a and -O3 did I manage to get it to use LDM/STM… Here’s the program I used:

struct PixelBlock
{
    int pixel[4];
};

PixelBlock *spriteRow;
PixelBlock *screenRow;

void f(PixelBlock *spriteRow, PixelBlock *screenRow)
{
    for(int i=0; i!=10; ++i)
    {
        PixelBlock sprite=spriteRow[i];
        PixelBlock screen=screenRow[i];

        screen.pixel[0]|=sprite.pixel[0];
        screen.pixel[1]|=sprite.pixel[1];
        screen.pixel[2]|=sprite.pixel[2];
        screen.pixel[3]|=sprite.pixel[3];

        screenRow[i]=screen;
    }
}

int main()
{
    f(spriteRow,screenRow);
}

It seems assembly code is here to stay, at least for now… |
Jeffrey Lee (213) 6048 posts |
Have you tried using __restrict on the pointers? It should result in the compiler producing better code, perhaps even using LDM/STM if you’re lucky.
<pre> works for me, although maybe it needs to be <notextile><pre> to stop [1] being translated to 1: hello, I am not using notextile screen.pixel[0] |= sprite.pixel[0]; hello, I am using notextile screen.pixel[0] |= sprite.pixel[0]; Nope, everything seems fine without using <notextile>. |
Terje Slettebø (285) 275 posts |
Thanks for the tip, I’ll try that. Edit: I’ve tested it now, and oddly enough, the code using __restrict ended up slightly longer, and with about the same number of LDM/STM as the original version (which is to say, just a few; the vast majority being LDR/STR). Looking at the example code here, I can’t see any inefficiencies that would result from pointer aliasing: Every value is used exactly once, so we’re not reusing a value that might have been overwritten since it was first used, and therefore may have to be reloaded (such as in the example here).
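For contrast, a hypothetical case where __restrict does make a difference is one where a value is re-read after an intervening store; this example is invented, not taken from the thread’s code:

```cpp
// Sketch only: __restrict is a GCC-style extension (plain `restrict` in
// C99). Here in[0] is re-read after every store through out; without the
// non-aliasing promise the compiler must reload in[0] on each iteration,
// since out[i] might overwrite it. In the sprite row code above, every
// value is loaded and stored exactly once, so there is no such reload
// for __restrict to remove.
void scaleByFirst(int* __restrict out, const int* __restrict in, int n)
{
    for (int i = 0; i != n; ++i)
        out[i] = in[0] * i;   // in[0] can now be kept in a register
}
```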
I found now that for some reason, if you don’t have a blank line above/below <pre>/</pre>, it won’t come out right. |
Trevor Johnson (329) 1645 posts |
ISTR it being the same for |
Jeff Doggett (257) 234 posts |
Try to avoid using for loops when you know that the loop will execute at least once. The C standard defines a for loop in terms of a while loop:

for(int i=0; i!=10; ++i)

Most compilers will produce better code with a do..while loop:

unsigned int i;

Also, you should use unsigned ints to represent unsigned quantities. And if you’re not using i in the loop, then decrementing will produce faster code…

i = 10; |
Terje Slettebø (285) 275 posts |
Hi Jeff. First of all, the above code is just some test code I used to try to make the compiler use LDM/STM (which didn’t work), not the actual code. Secondly, the actual code works generally in terms of “iterators”, which may not have a < operator, but which are guaranteed to have == and != operators. In this case, either would work (since it’s raw pointers, also in the actual code), but there are other concerns: having done a search on this, the opinion on using < versus != is mixed. One view is that using < expresses the intent that the variable is always less than the given value. One issue with <, though, is that you don’t know the postcondition, i.e. what i is after the loop. With != you know what it is, namely what you test against. Again, that may not matter in this case, but one problem with using < as a “safer” alternative is that it may hide bugs, and that’s not something I want… If, for some reason, i is higher than 10 from the start, then that’s a bug, and that’s something I want to know about (in the form of an “infinite” loop), rather than the program appearing to work correctly. The latter is my main motivation for using !=: it’s “safer”, in the sense that you discover bugs quickly, rather than having them hidden by some “safe” defensive programming practice… Regarding signed/unsigned int: again, the consensus seems to be mixed. I’ve earlier suggested the use of unsigned int myself, but in a long discussion about it years ago at comp.sys.lang.c++.*, it emerged that using unsigned int in C/C++ for one thing doesn’t help detect problems, and may actually lead to undetected ones (you may assign negative values to unsigned variables, and they are silently converted to the wrong value), so I’ve tended to play it safe and use signed variables. In any case, it should not make a difference, performance-wise, to the code above. 
Lastly, about using do-while rather than for and using decrement and testing against zero: Yes, I’m aware of these, but in the real code, I test against the end pointer, anyway, not using a counter variable. Again, the code above in no way has any resemblance to the actual code, so trying to “optimise” it is pretty futile… :) |