Idea for discussion: Rewriting RISC OS using a high-level language
Chris (121) 472 posts |
Alpha channel sprites would be a very welcome development. |
Terje Slettebø (285) 275 posts |
“Don’t confuse me with facts, I’ve made up my mind.” :P Seriously, though, ARM assembly programming is wonderful, provided that either: a) You have an army of programmers, or: b) You have a life expectancy of 10,000 years… In short: ARM assembly programming takes time… Lots of it… Moreover, you may not even get the fastest code that way, because you may for example use a simple linear search for symbols, while if you used hash lookup, your performance could improve by several orders of magnitude, dwarfing any micro-optimisations you get from assembly code… Case in point: The first version of extASM used linear search for its symbols, which led to something like O(N^2) performance relative to code size N, so assembling the 350 KB extASM source itself took over three minutes on an ARM2 (it could be done in about 30 seconds using the BASIC assembler)… It was later changed to hash lookup, which changed its performance dramatically, and the final version (which also included support for FPA instructions, floating point expressions, a large numeric and string function library, etc.) could assemble its 500 KB source in about 15 seconds… Going from linear search to hash lookup was not trivial, but if you did it in a language like C++, it would be more or less as easy as changing from std::vector to std::map (or std::unordered_map, in this case): A one-line change… More time for play… ;) I still have a soft spot for ARM assembly, though, so I’ll read the page you link to… Edit: Having now read the article, and to give a less flippant reply: Yes, certainly, to make use of things like SIMD instructions, you may need to use assembly code (or compiler intrinsics), by all means. But if you want to implement a complete application using assembly code, then although it’s fun, it takes a lot of time… That tradeoff made more sense when computing power was expensive, and labour comparatively cheap, but nowadays, it’s the other way around… |
Jeffrey Lee (213) 6048 posts |
Not quite the same thing, but perhaps more relevant, is David Thomas’s guide on writing efficient C for ARM. If you’ve worked with ARM assembly quite a bit or looked at the output of a compiler then you might have already worked out most of the things he’s listed, but it’s useful to have it all spelled out. Unfortunately a number of the techniques for ‘massaging’ code like this do end up obscuring the purpose, or increase the risk of bugs, so trying to write ARM/compiler-friendly code isn’t always the best choice. |
Terje Slettebø (285) 275 posts |
Thanks, this is certainly useful, and I’ll have a look at it. A few observations:
Sadly, this is no longer the case. The clock frequency has stopped rising, and instead we get more cores (who ordered that? :P), so to get this kind of speed improvement these days, you have to go for concurrent programming (where possible). The free lunch is over.
You’re right: Most of these things fall out from knowing how you would do things in assembly code, and I’ve found that I almost unconsciously apply them, like making sure structures are laid out in a way that minimises padding.
Some of the things he recommends are done automatically by modern compilers, like hoisting invariant code out of loops, so before you “uglify” your code with an optimisation, make sure that it actually makes a difference! He also links to a page of compiler optimisations, which one should be familiar with, as well. Some of the optimisations actually result in cleaner code (IMHO), like avoiding repetition, and in such cases, we get the best of both worlds (elegance and efficiency). :)
Fortunately, starting out with the clearest possible code (no clever tricks) tends to give the compiler the best chance to optimise, and “tricks” can actually reduce the chance of optimised code, as the compiler doesn’t understand what you’re doing, and so produces inefficient code just to be sure! So if you optimise, don’t forget to profile afterwards, to see if your change actually was an improvement.
Unfortunately, one place where Fortran (and maybe C) still may have an edge over C++ is inefficiencies from pointer aliasing, but there are techniques you can use to improve the resulting code. Using sentinels may be a less obvious optimisation. Loop unrolling may be best left to the compiler. Interestingly, one of the things that excited me about template metaprogramming in C++ many years ago was the possibility of loop unrolling.
This is often done by the compiler these days, but there are still cases of program transformations possible with such metaprogramming that compilers don’t do (because they are too hard for compilers to do, lacking specific domain knowledge). Another one that may not be obvious: Using char or short instead of int can actually decrease performance… Note that avoiding array indexing may sometimes produce suboptimal code. This is an interesting optimisation which unfortunately may create confusing code… An argument for OO programming. :) |
Andrew Rawnsley (492) 1445 posts |
Since we’re talking about C and C++, I ran headlong into this over the weekend when going in to update the Mpro editor. Whilst the whole of Mpro and MsgServe is in C, the editor is in C++, using several template/class libraries. And I was only going in to add a “simple” extra button for something… Facepalm! |
Terje Slettebø (285) 275 posts |
That’s the downside of frameworks and complex languages like C++: They may involve a significant learning curve… The upside is that once you’ve learned them, development tends to go significantly faster, resulting in less code… Studies 1 have shown that the bug density in code tends to be roughly the same independent of language. Therefore, the less code you have, the fewer bugs you’ll tend to have, too… For myself, for a language I may use in any number of projects, the effort has always been worth it… I guess I’ve had similar problems as you when trying to understand C programs… In this case, though, it’s usually the complexity of the program that presents the largest barrier to understanding, not the complexity of the language or the (often nonexistent) framework. 1 I didn’t find a reference to research, but I did find a quote referring to research on the topic: http://www.faqs.org/docs/artu/ch13s01.html “The reasons go back to perhaps the most important empirical result in software engineering, one we’ve cited before: the defect density of code, bugs per hundred lines, tends to be a constant independent of implementation language. More lines of code means more bugs, and debugging is the most expensive and time-consuming part of development.” |
Terje Slettebø (285) 275 posts |
Just a note saying that I have started on a reimplementation of SpriteUtils/OS_SpriteOp, but I quickly found that I need to brush up on my C++ skills, as I’ve been away from active C++ programming for many years, and therefore have become rather “rusty” at it… That will take a while… On the plus side, I think I’ve come to an elegant way of implementing the various operations, with support for all colour depths, using the same algorithm code… :) For anyone else wanting to work on reimplementing graphics code in RISC OS, there should be plenty to go around… :) |
Steve Revill (20) 1361 posts |
It’s interesting that you’ve hit on one of the most speed-critical, and perhaps most heavily used, parts of the OS. Well, that’s not quite true, I take it you’re only looking at the SpriteUtils module, so it’s not like all Sprite operations will suddenly become slower and the desktop experience will suffer. And I by no means say this to be negative about ARM vs C vs C++ or your programming skills :) What I do think, however, is that stuff like the sprite operations seems like it could well benefit from being tightly-written ARM code, or even NEON in places. My suggestion, and this is a bit late in the day, would be to look at all the higher-level ARM code modules (so you can steer clear of the guts of RISC OS for the time being) and look for the ones that perform the least speed-critical functions. Or, find the ones that are very simple, so that even if recoded in C++ it’s unlikely you’d end up with something significantly slower. |
Trevor Johnson (329) 1645 posts |
Isn’t it (going to be) possible to include VFP/NEON within C/C++ source code for RISC OS? 1, 2. (Although maybe not yet with the Acorn/Castle/ROOL tools?) |
Terje Slettebø (285) 275 posts |
Hi Steve.
Yeah, I’m fully aware of that… :) Considering that it’s used all over the place in the Desktop, it’s paramount that it stays highly performant. Yet, I chose to take a stab at this one, for the following reasons: Three of my passions in life are:
Reimplementing OS_SpriteOp would give me all three in the same project… :) Furthermore, it’s pretty stand-alone, and may be implemented using “normal” code, unlike e.g. the Wimp, where you need to deal with things like swapping applications in and out of memory, stack frames, SWIs that don’t return in a normal fashion (e.g. Wimp_StartTask), etc.
I haven’t looked into how handling of sprites was divided between SpriteUtils and the rest of the OS, thinking that perhaps that module also implemented OS_SpriteOp. In any case, I did mean reimplementing OS_SpriteOp, not just the star commands.
That may well be true; I guess we’ll find out… :) One problem with assembly code is of course that it’s very hard to make fundamental changes, like adding an alpha channel, or support for different colour spaces, and even if you can do it, working in assembly code is kind of expert stuff, where few people would be able to contribute. Nevertheless, we can’t compromise on performance, so the aim is to write code that has roughly the same performance as the original code. If that turns out to be difficult in some places, then some functions could well be written in assembly code, or inline assembly (as I understand it, the development version of GCC does have support for VFP/SIMD code). Then again, some operations are used more than others, and if e.g. flipping a sprite along its x or y axis turns out to be somewhat slower than the original code, but with the advantage of more maintainable code (that could be optimised later if necessary), that might be acceptable, although I leave that to the community to decide.
Having browsed through the modules again, now, I’m not sure which it should be. Many if not most of the modules in RISC OS are performance-critical, and/or involve low-level code (like drivers). It should also be something that’s fun to do, and makes a difference for people, so that one is motivated to work on it. That kind of limits the options for me, at least for the first project. I might be motivated to do more “mundane” stuff later, or something requiring more specialised knowledge, but this was one I could embark on with my existing knowledge. Any suggestions for alternative modules? |
Steve Revill (20) 1361 posts |
Hi Terje. It sounds like you’re fully aware of all the issues so it’ll be very interesting to see how you get on. Best of luck! Given that this is community-based development, I believe that it’s important that people are working on areas of the code that they are personally interested in. :) Ultimately, I do agree that the proliferation and complexity of the ARM code within RISC OS is currently a barrier to its progress. |
Ben Avison (25) 445 posts |
Perhaps you didn’t know that C code can be linked with assembly code? It’s quite common for the most time-critical parts of C software to be written in assembly. So this announcement is relevant here – work on adding VFP and NEON code to RISC OS modules can start whenever anyone fancies. Though it would be nice to at least wrap it in build switches for the benefit of anyone who hasn’t upgraded to ObjAsm 4. I might also point at this bounty with regards to utilisation of VFP and NEON from C code, though support for it to date has been a little disappointing… |
Trevor Johnson (329) 1645 posts |
Or seen it but forgotten the content and not bothered checking before posting! Thanks.
The important words being “to date”! |
Andrew Rawnsley (492) 1445 posts |
I suspect one of the reasons this bounty hasn’t been supported too much this far is that it is a bounty for work on a closed source, commercial product. Personally, I’d hope that this kind of dev would be paid for by the “6 monthly paid-for upgrade” policy that has been introduced on the C-tools, although I appreciate that is rather a niche product. I know nothing of compiler design, but one would have thought the work done on ObjAsm would be transferable to the C compiler, at least on an intellectual level (ie. knowing what needs to be implemented, potential pitfalls and syntax etc). |
Jeffrey Lee (213) 6048 posts |
Terje: If you’re still looking for things to rewrite in C, how about ColourTrans? (source in castle.RiscOS.Sources.Video.Render.Colours) The main reason I suggest it is that I’ve just had to fix a problem where ColourTrans was softloading itself in order to help build some lookup tables that are held in Resources. However since objasm 4 will try to use new instructions like MOVW for generating immediate constants, building the resources will most likely cause a horrible crash if the host isn’t the same architecture as the target. At the moment I’ve fixed it by making it build a second copy of the module that’s safe to softload, but it’s a bit of a nasty hack. The lookup tables are actually created by the ‘maketables’ utility, which is written in C. The only dependency it has on ColourTrans is the ColourTrans_ReturnColourNumberForMode SWI – so if the module was rewritten in C it should be easy enough to separate the required code out into a separate file which can then be shared between ColourTrans and maketables. Apart from removing the need to softload a copy of the module it would also remove a barrier to being able to build the module on non-RISC OS hosts. There is the usual downside that rewriting the module in C will almost certainly impair performance in some areas. But it looks like no attempt has been made to tweak the code for each new CPU/architecture version, so there are probably a few areas where a rewrite would provide a good performance boost. E.g. there’s the FindCol macro in s.Commons which could easily benefit from some NEON vectorisation to allow it to compare 4 colours at once (disregarding the fact that the NEON code would currently have to be written in assembler!) |
Steve Revill (20) 1361 posts |
That’s a fair point. In fact, the money we make from C Tools and sales is put into the general pot for ROOL activities thus far. It’s probably the main thing that has kept us afloat in the last couple of years, with donations coming a close second. We are updating the tool chain all the time, taking time out where we can to perform this work. However, because we don’t pay ourselves, this is a slow activity. Our idea with the bounty was that it might attract a developer (maybe even one of us at ROOL) to do this vital development and bring it to the platform sooner rather than later. I appreciate it’s a closed-source toolchain; our license prohibits us from having it any other way. But we’re currently (and for the foreseeable future) stuck with the fact that the vast majority of the OS and disc image require these tools to build and if we want to progress the RISC OS sources to take advantage of modern architecture features, it’s going to be considerably less effort to develop the existing tools than to try to migrate the code base to something like GCC. IMO. :) |
Andrew Rawnsley (492) 1445 posts |
I certainly don’t advocate moving away from Norcroft for the main dev toolchain – it has served us well, and “if it ain’t broke”.. etc. I have been thinking about this, and have a few suggestions. These may be rubbish, but they might be helpful in some small way… Whilst I have no problem (personally) with Norcroft being closed source, the price of entry is a potential barrier to new blood. It occurred to me that perhaps one solution (that might make bounty donations more forthcoming) is the release of a basic set of tools (perhaps with minimal documentation, libraries and so on) that would allow coders to use Norcroft “for free”. The commercial version would be more complete, and include source for libraries and other portions present (perhaps lumps of code from the online repository, for things like toolbox modules and so on), with more extensive documentation. I would gladly pay for recent prints of the manuals, for example, perhaps printed via one of the online services (I don’t know how financially viable this would be). A suggestion might be a move to a subscription scheme. The “free” toolset would be updated every couple of years, let’s say, but you could offer private, early access to interim upgrades to subscribers. To simplify things, you could base this on year (2012, 2013 and so on). Perhaps a 20-25ukp a year price. The logic behind this is that not everyone will benefit from every update, but kept affordable, across a year the upgrades would be worthwhile for most. For example, I haven’t rushed to get the recent ObjAsm update since I’ve never used ObjAsm (I’m a C coder). However, updates to cc, link, squeeze, find and diff are all extremely useful to me. |
Andrew Rawnsley (492) 1445 posts |
A further suggestion for a related product… A proper “step by step” guide to downloading the source, installing it to hard disc (on a RISC OS machine) and preparing it for use/compilation via Norcroft. Ideally this would be printed, introducing what source is where, and contain necessary obey files to set up the build environment (don’t assume we all have everything in the default locations!). I think this would sell nicely for (say) 15ukp at shows, as a “getting started” kit. I mention this as there have been a few “how to build the OS on RISC OS” threads, and many of us don’t run our own linux/cvs servers. |
Trevor Johnson (329) 1645 posts |
For the moment, the tools can’t be installed on an ARMv7 machine (although when copied across from another machine, they do run). |
Sprow (202) 1158 posts |
Don’t you just want to use Ben’s “HostTools” shared makefiles instead? Take a look at the change in revision 4.10 of SpriteExtend’s makefile for an example: it compiles the intermediate tools on whatever the host machine is regardless of the target. |
Jeffrey Lee (213) 6048 posts |
Alas, that only covers the C compiler. As far as I know ColourTrans is the only assembler module that softloads itself as part of the build process, but if there are others then I guess it would make sense to update HostTools to be able to cope with them. |
Terje Slettebø (285) 275 posts |
Funny you should mention it, as having browsed the RISC OS modules, that looked like a good candidate for reimplementation, so I was thinking of having a go at it after OS_SpriteOp… :) As for OS_SpriteOp: Work is well underway there. So far, the following has been reimplemented, both as SWI call and star commands (star command version mentioned in the following where relevant), for all colour depths:
Next up will be some more sprite manipulation: Insert/remove rows and columns, to get an even better grasp on the fundamental sprite operations. To avoid a large module, I’ve (regrettably) avoided C++ iostreams, and instead use printf() (for formatted output), or simply OS_Write*, for simpler output.
Regarding high-performance code and high-level programming: Given that OS_SpriteOp needs to be high-performing, I’ve given this quite a lot of thought, lately. I’ve come to the same conclusion as before: Express as much as possible of your intent in the code, i.e. rather than having some code doing X, make an abstraction X. Case in point: For sprite x/y flipping, I’ve implemented an optimisation in the x-axis flip case: That is, rather than doing a pixel-for-pixel exchange between pairs of rows of the sprite, since full rows are exchanged, it’s much faster to do a simple memcpy() for those rows. With a little luck, memcpy() is implemented as a fast memory copy function, i.e. using LDM/STM where possible. However, having those memcpy() calls in “client code” (i.e. free functions outside the SpriteArea/Sprite class) just didn’t feel right: Optimisations that rely on such “intimate” knowledge of sprite layout should be better encapsulated, but how? I’ve come to the conclusion that there’s a missing abstraction here: The concept of “rows” and “columns”, and by having member functions like readRow()/writeRow(), this memcpy() optimisation may be safely tucked away. My point with all of this is that by creating a useful conceptual model, with useful abstractions, you have better opportunities for optimisations, as you may then target specific parts of the system, while still allowing “client code” to operate using high-level abstractions. To start with, I mostly implement things in the most straightforward manner possible, i.e. no optimisations unless they and their advantages are “obvious”: Using memcpy() rather than iterating over the pixels for x-flip is an example of such. Then, following profiling, optimisations may be done where needed, or where there are opportunities for it.
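A minimal sketch of the row-swap optimisation described above (the function name and the flat pixel-buffer layout are simplifying assumptions of mine, not the real OS_SpriteOp data structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Flip a sprite along its x axis by exchanging whole rows of pixel
// data with memcpy(), instead of swapping pixel by pixel. On ARM,
// a good memcpy() uses LDM/STM, moving many words per instruction.
// 'pixels' points at rows*rowBytes bytes of tightly packed rows.
void flipX(uint8_t *pixels, std::size_t rowBytes, std::size_t rows) {
    if (rows < 2) return;                      // nothing to exchange
    std::vector<uint8_t> tmp(rowBytes);        // one scratch row
    for (std::size_t top = 0, bottom = rows - 1;
         top < bottom; ++top, --bottom) {
        uint8_t *a = pixels + top * rowBytes;
        uint8_t *b = pixels + bottom * rowBytes;
        std::memcpy(tmp.data(), a, rowBytes);
        std::memcpy(a, b, rowBytes);
        std::memcpy(b, tmp.data(), rowBytes);
    }
}
```

Hidden behind readRow()/writeRow() members on the sprite class, the same trick stays encapsulated: client code asks for rows, and only the class knows that a row is a contiguous run of bytes that memcpy() can move wholesale.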
I’m fortunate to know ARM assembly quite well, too, so in cases where I’ve wondered how to do something efficiently, I’ve taken a peek at the implementation (I’ll do that for all functions, eventually, to make sure that the new code does all the old one did, especially things like error handling, but otherwise, the PRM tends to be a better reference than the code), such as how to plot sprites on arbitrary word boundaries. I found that the original code does something weird/clever there: It shifts the sprite data left/right so that it aligns with the screen memory, and then uses STM for plotting. This shifting is done on-the-fly. This would only be needed for sprites with colour depths less than 32 bits, so I think it’s no secret that at least the new one will definitely be fastest when running in 32-bit colour mode, where an optimised version of routines could dispense with the bit fiddling… :) Going back to what you said above: There’s every possibility of using VFP/NEON where it matters in a reimplemented version of ColourTrans, for example, and with sensible abstractions you’ll typically be able to isolate such optimisations in a way that preserves the high-level nature of the code in general. As mentioned, I’m currently working on OS_SpriteOp, and there’s a lot to do, so I’ll be busy with that in the foreseeable future. Remember that it also includes things like scaled/transformed sprite plotting, and I have some ideas that may make such operations “super-fast”, taking inspiration from texture mapping… :) Therefore, if anyone else would like to do something with ColourTrans: Be my guest. :) However, if it’s still not reimplemented by the time OS_SpriteOp is done, and if we find its performance to be adequate, I’ll be happy to have a go at ColourTrans. |
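The on-the-fly shifting trick can be sketched like this (a hedged approximation of mine, assuming the sprite format's little-endian pixel packing, where the leftmost pixel sits in the least significant bits): each misaligned destination word is funnelled together from two adjacent source words.

```cpp
#include <cstdint>

// Combine two adjacent source words so the result lines up with a
// word-aligned destination. bitShift (0..31) is the sub-word offset
// of the sprite data relative to screen memory. With the shifted
// words in registers, the plot loop can then write with STM.
uint32_t alignWord(uint32_t lo, uint32_t hi, unsigned bitShift) {
    if (bitShift == 0) return lo;  // shifting a uint32_t by 32 is UB in C++
    return (lo >> bitShift) | (hi << (32u - bitShift));
}
```

A 32 bpp sprite always starts on a word boundary in both source and destination, which is why that colour depth can skip this bit fiddling entirely.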
John Tytgat (244) 13 posts |
I find that difficult to believe. Can you elaborate on this ? |
Sprow (202) 1158 posts |
Ah – didn’t realise that. ColourTrans is probably a bit freaky then, but the contents of HostTools looks pretty simple. Maybe Ben could be convinced to extend the idea. |
Terje Slettebø (285) 275 posts |
Just to note that this might be a useful next step for the sprite implementation. I realise that some of this also requires kernel changes, which could be a future project. The sprite reimplementation is proceeding, and while being able to operate on and plot sprites with varying attributes is quite well-understood, the challenge is to provide efficient implementations for the cases where it may do it in an optimised way (like using memcpy() where BPP is 8 or more, and there’s no mask or non-zero plot action), without an explosion in (generated) code. |