Sometimes assembler is still useful ;)
Rick Murray (539) 13850 posts |
For something I’m working on, prototyping various ideas in BASIC, I needed to read a pixel from the screen and “brighten” it. And do it another hundred and twenty thousand times. ;-) I am reading data directly from screen memory. Using the OS calls to do this is just tooooooo slow. As I am using a 256 colour mode, the bit layout in memory is weird (refer to PRM 1-561). My initial code would pull apart the byte into red, green, blue, and tint. Adjust, then recombine, then write back. It was, understandably, really slow. So my next idea was to use a lookup table. So I wrote a program to generate the lookup table, and then simply used the screen colour value (0-255) as an index into the table which would give a replacement byte to write. This was quite a bit faster…but still slow. So my final idea was to write an assembler routine to work out which bit of memory to fiddle with, read/index/write. There are a few speed tweaks like using writeback #1 after write (saves an increment), and counting my loops backwards with SUBS (will be EQ when zero, saves a CMP). So much so, that when I come to code the real thing in C (as I plan to), I might simply port the function as it is into ObjAsm format and use it. So… once in a while, assembler has its uses. ;) |
Jeff Doggett (257) 234 posts |
If you write the C code to match the assembler, I would expect the resultant code to run just as fast. |
Kuemmel (439) 384 posts |
…the good old C vs ASM topic ;-) …only chance for a verdict is to benchmark it. When I look at the Cortex-A72 optimization guide by ARM (Link here) e.g. at section 4.5/page 38 – load/store throughput I wouldn’t think that any compiler would do those optmisations out of the blue adapted for a specific context…but as always in this discussion, if you want to save time and have code easy to be maintained…don’t do it… |
Rick Murray (539) 13850 posts |
Possibly, but this means rewriting the code (again), this time in C. As I have something that works (written to use with BASIC), I might just copy that over.
One can specify architecture and processor core to the compiler. I’m not sure if it makes much difference, never tried it and examined the code.
Save time? Saving time would be using assembler. It’s already been written, tested, and debugged. As for maintenance, it’s quite liberally commented. It’s important to comment copiously, so you can hit the ground running when you come back to a project five years later and you’re like “I wrote this?”. ;-) |
Clive Semmens (2335) 3276 posts |
Five years later? I’ve been trying to update inadequately commented assembler routines I wrote thirty years ago… |
David Feugey (2125) 2709 posts |
And the big question is: is there an integrated C compiler in BBC Basic? :) |
Colin Ferris (399) 1818 posts |
There is BASIC to ‘C’ convertor prog – but I think Rick likes a challenge. :-) |
Clive Semmens (2335) 3276 posts |
That’s a rather different beast from an integrated C compiler though, which is far more like the embedded assembler. |
David J. Ruck (33) 1636 posts |
It does make a significant difference, especially with the newer cores, being able to optimise for the instruction issue width, and use new instructions introduced with each architecture. But its only useful if you are targetting specific hardware for a project, or just your own machine. RISC OS runs on a large number of cores and architectures, so you you need to use the compilers lowest common denominator code, which works on everything but is far from optimal on any of it. The beauty of writing it in C, is optimising for cores or architectures is only a option change and a recompile away. |
Clive Semmens (2335) 3276 posts |
Is there, or could there be, an option to make at least heavily worked parts of the code be compiled in parallel for different architectures, with branches switched according to the hardware the code is running on? Obviously it would make the code bigger, but would surely be worth it at least for those heavily worked routines – which the programmer would presumably have to mark somehow. (The same trick would be useful in BBC BASIC, too, for built-in functions. I don’t think expecting BASIC programmers to mark up other heavily-used routines makes much sense.) |
David J. Ruck (33) 1636 posts |
That is definitely possible, I’ve not seen anyone doing that with integer code in RISC OS, but it’s been done for floating point, with FPA, VFP2, VFP3 and Neon variations. I don’t think it’s worth it for BASIC, as if speed is that critical, don’t use an interpreted language. Same goes for Python, but that is mitigated by the possibility of the heavy lifting being done by library code written in C, which can be compiled for the particular CPU at installation time. |
Rick Murray (539) 13850 posts |
That’s exactly it. The only reason the assembler was written at all is because I’m using a shell written in BASIC to test out ideas before “doing it properly in C”. ;-) And, like you say, BASIC was just too damn slow at doing this particular part. Painfully so.
For reading bytes from the screen, using them to index a byte table to get a new value to byte write back to the screen, are there actually any architecturally specific things worth considering that would make a significant difference? Given it’s screen memory (and probably not cached), a better speed increase might be that once the address is word aligned, to read words at a time, splitting the bytes into registers for lookup, and then recombining to write a word at a time…but then we’re getting into a lot of additional effort for what gain? |
Paolo Fabio Zaino (28) 1882 posts |
I actually been testing this for the conversion of my M/Forth from ASM to C, after some work and tests turned out that the best approach (and compromise) between simplicity and performance is to adopt what I called “Boot Strap Module” approach. Right now, my M/Forth has a module called mforth that loads as BASIC does and can be used to either run or interactively compile forth applications as you would do to run a BASIC file. However, what it does underneath is it will load the appropriate Forth VM for the specific platform (and eventually the specific app as well). The module contains all the routines that do not need optimisations (like file loading and such) while the Forth bytecode is executed by a VM that is loaded at runtime. Such VM is compiled with all the possible optimisations for each specific RISC OS platforms (Rpi1, Rpi2, Archimedes with ARM3, Rpi4 etc…). So at module load time: When the user runs a Forth program: Given that the VMs are all very small it seems to be ok on Raspberry Pis etc…, this also has another advantage: The Forth code can specify which version of the VM it wants to run on, so it can be used to warrantee compatibility too in case one wants that. The other advantage is that the code stays small. As soon as I have time to finish it, I’ll release the whole package so people can evaluate if it’s a useful approach or not for their needs. The reason I opted for this approach is because SD cards seem to be quite bad at loading time, so the less we load the better it is. Also given that each VM is an executable AIF, it can be squeezed too, making it even faster at loading from an SD card. Just my 0.5c as always, but maybe someone may find it useful… |
Charles Ferguson (8243) 427 posts |
I just tried this, for the laugh… https://godbolt.org/z/G3q1E1P53 – has the (rudimentary) code I created, and its compiled version to see what happened. Initially I wanted to just do the operation that was originally described and then see what happened when I converted it to word operations which can be more efficient and have more opportunity for optimisations… If you cannot see the Compiler Explorer page, the code that I produced was pretty simple: #define OPTIMISE (1) int screen_cipher(unsigned char *screen, int screenlen, unsigned char *lut) { #if OPTIMISE == 0 int i; /* Plain old boring code */ for (i=0; i<screenlen; i++) { unsigned char c = *screen; c = lut[c]; *screen++ = c; } #else /* Optimise using word accesses */ unsigned long *screenword = (unsigned long *)screen; unsigned long *screenend = (unsigned long *)(screen + screenlen); for (; screenword < screenend;) { unsigned long w = *screenword; unsigned long c1 = lut[(w>>0) & 255]; unsigned long c2 = lut[(w>>8) & 255] << 8; unsigned long c3 = lut[(w>>16) & 255] << 16; unsigned long c4 = lut[(w>>24) & 255] << 24; w = c1 | c2 | c3 | c4; *screenword++ = w; } #endif } Which produced the amusing code: screen_cipher: push {r4, r5, lr} add r4, r0, r1 cmp r0, r4 popcs {r4, r5, pc} .L3: ldr r3, [r0] uxtb r5, r3 ubfx lr, r3, #8, #8 ubfx ip, r3, #16, #8 ldrb r1, [r2, r3, lsr #24] @ zero_extendqisi2 ldrb r3, [r2, r5] @ zero_extendqisi2 ldrb lr, [r2, lr] @ zero_extendqisi2 ldrb ip, [r2, ip] @ zero_extendqisi2 orr r3, r3, r1, lsl #24 orr r3, r3, lr, lsl #8 orr r3, r3, ip, lsl #16 str r3, [r0], #4 cmp r4, r0 bhi .L3 pop {r4, r5, pc} The use of the `uxtb`/`ubfx` saves the need to AND off the bits you didn’t care about. There are only 3 of these instruction needed because the high byte can be implicitly truncated by the shift on load (the LSR #24). So then I thought… can we get it to be more clever if we do this as a 64bit operation? https://godbolt.org/z/h33PhxYx1 Again, for those without access to compiler explorer, the loop is now: /* Optimise using multiple word accesses */ unsigned long long *screenword = (unsigned long long *)screen; unsigned long long *screenend = (unsigned long long *)(screen + screenlen); for (; screenword < screenend;) { unsigned long long w = *screenword; unsigned long long c1 = lut[(w>>0) & 255]; unsigned long long c2 = lut[(w>>8) & 255] << 8; unsigned long long c3 = lut[(w>>16) & 255] << 16; unsigned long long c4 = lut[(w>>24) & 255] << 24; unsigned long long c5 = lut[(w>>32) & 255]; unsigned long long c6 = lut[(w>>40) & 255] << 8; unsigned long long c7 = lut[(w>>48) & 255] << 16; unsigned long long c8 = lut[(w>>56) & 255] << 24; w = c1 | c2 | c3 | c4 | ((c5 | c6 | c7 | c8) << 32); *screenword++ = w; } which comes out as: screen_cipher: push {r4, r5, r6, r7, r8, r9, lr} add r4, r0, r1 cmp r0, r4 popcs {r4, r5, r6, r7, r8, r9, pc} .L3: ldr r3, [r0] ldr r1, [r0, #4] ubfx ip, r3, #16, #8 ubfx r8, r3, #8, #8 uxtb r7, r1 uxtb r9, r3 ubfx r6, r1, #8, #8 ldrb lr, [r2, r3, lsr #24] @ zero_extendqisi2 ldrb ip, [r2, ip] @ zero_extendqisi2 ubfx r5, r1, #16, #8 ldrb r3, [r2, r7] @ zero_extendqisi2 ldrb r8, [r2, r8] @ zero_extendqisi2 lsl lr, lr, #24 ldrb r7, [r2, r1, lsr #24] @ zero_extendqisi2 lsl ip, ip, #16 ldrb r1, [r2, r9] @ zero_extendqisi2 ldrb r6, [r2, r6] @ zero_extendqisi2 orr ip, ip, r8, lsl #8 ldrb r5, [r2, r5] @ zero_extendqisi2 orr r1, lr, r1 orr r3, r3, r7, lsl #24 orr ip, ip, r1 orr r3, r3, r6, lsl #8 str ip, [r0], #8 orr r3, r3, r5, lsl #16 orr r3, r3, lr, asr #31 cmp r4, r0 str r3, [r0, #-4] bhi .L3 pop {r4, r5, r6, r7, r8, r9, pc} It’s quite an interesting optimisation process. The compiler has essentially split up the code that I had built into 64bit values so that it’s only operating on the 32bit parts independantly, to the extent that the write of the 64bit word actually happens 4 instructions apart, with the low part written first, then the trailing operations are performed before we store the high word. That said, I suspect my code has a problem somewhere as I don’t know how there can be an ASR #31 in there – surely that’s just going to make the whole word have set bits if the top bit of lr was set? Meh, that’s what tests are for… As for the comment about doing this using OS calls being slow… I think it depends on what calls you were using and whether you intend it to work in other depths (which you should). In a 256 (or fewer) colour mode the fastest way to brighten the screen would be to change the palette to make it brighter, assuming you wanted the whole screen changed. If you only want a part of the screen changed, as a one off operation, create a sprite coloured white, plot it scaled to cover the region that you’re interested in with transparency of a proportion you like for the brightening. This will be slower than custom code, but it’ll work with defined interfaces – and will therefore work with systems that don’t have frame buffers, or which are in different depths than you need in a specific case (because you shouldn’t write code that relies on a given screen depth). On the other hand, should you wish the fade out a region of the screen, the simplest way to do that would be to read the region into a sprite, then use the ColourTrans tables / Sprite colour mapping to plot the sprite back with modifications present. Again, no direct frame buffer access required, and the operations work in any mode. Anyhow… I guess my point was… no, I don’t think that assembler has much of a use for this. You can start from a high level language and have relatively decent code written which you can test nicely, and vary without having to resort to low level assembler. Only when your profiling says that it’s not good enough do you need to optimise in assembler – and even then, changing your algorithm is invariably easier in C. And going straight to directly manipulating the screen is terrible from the perspective of making code that works well across the board on many platforms. In the specific case where you’re working with a 256 colour mode data only, the table you built is essentially a pixel translation table, so using a sprite operation means one call with all the code having already been written (and JIT’d for just your case). |
Rick Murray (539) 13850 posts |
;-) That’s half the fun of it. Thing is, while the word access makes some interesting optimisations, one cannot guarantee that the transition will be on a word boundary. Not insurmountable, but makes the code more complicated.
No, my comment was that doing it in BASIC was slow.
It would be better to support “whatever”, but the behaviour of 256 colour modes (especially their colour handling) is different enough that it ought to be treated as a separate thing.
Thank you. I never thought about it that way. I’ll need to investigate grabbing a part of the screen as a sprite, then plotting it back with a modified table. That would probably be the most optimal method, yes… |
Rick Murray (539) 13850 posts |
Thank you, Charles. I set up my old Pi outside so I could enjoy the nice weather (a photo, if you’re interested). All that assembler was easily replaced by SpriteOp 16 (grab a bit of the screen), SpriteOp 52 (paste it back with translation), then SpriteOp 25 (delete the sprite). The OS takes care of bounds checking (so I don’t have to). It’s basically three OS calls and a couple of lines to update offsets of things). Much simpler. ;-) In benchmarking (do it for 100 frames, use a stopwatch), both methods took “about 3.5 seconds”, so there’s still room for improvement when this stuff is written in C rather than BASIC. ;-) BTW, is the forum having a nervous breakdown? I keep having to log in… |
Charles Ferguson (8243) 427 posts |
Interesting; I expected it to be a more noticeable (on the sort of data you were using) difference – ie the price you pay for using non-custom APIs is that it is slower, but with the benefit of being more manageable and compatible with other systems. I hadn’t expected it to be approximately comparable – just ‘acceptable’. Good to know. It’s possible that if you are going to do it repeatedly you don’t need to delete the sprite when you graph. I have a feeling it will delete the old one for you. Can’t remember whether it does, and I’ve not implemented the ‘grab sprite from screen’ yet 1 so don’t have it in the local cache. As for the question about colour,tint… honestly, it’s so much pain working with those operations that I try not to think about it these days 2. ColourTrans_ReturnGCOLNumber maybe… Chris Bazley was always my guy for explaining to me how I’d done it wrong, usually backed up by Ian Jeffray when Chris wasn’t around. Also, one of the formats for ColoutTrans_GenerateTable uses GCOL numbers. Not sure if that’s useful for you… or anyone else. Pretty sure I haven’t implemented that yet, ‘cos it’s weird. 1 The plumbing is there, ‘cos I’ve implemented Screensave, but just haven’t needed to do it yet. 2 And that’s after having implemented most of ColourTrans and the palette operations in all the modes. |
Rick Murray (539) 13850 posts |
Well spotted!
Me too, but we are comparing code that reads bytes, converts them, and writes them back. There’s a lot of memory access there, and it probably suffers for that – about 121,200 instructions to chunder through a 600×200 block. SpriteOp probably writes as words as soon as it can, so fewer instructions executed for the same result, meaning there are cycles to spare for the extra stuff it does. Plus the SWIs are called from BASIC so we’re hopping in and out of the interpreter. End result? They’re about the same. ;-) |
Rick Murray (539) 13850 posts |
You do not need shared libraries for C. You absolutely do need libraries, however. This is because the core of C is extremely small, and most facilities are implemented as libraries. It just so happens that library sharing (either in the style of the SharedCLib module or the GCC shared libraries) is often beneficial as it means the applications don’t need to include the same baggage over and over and over. Now, as long as you have a compiler and the base set of libraries, you can write generic C on just about anything. Some care will be needed (as something like “int” means different things on different platforms), but it is quite possible.
That’s because it is a rather artificial operation. A fast way to write 16 bytes of the same data to word aligned addresses. C, on the other hand, has to contend with making functional code for “copy this data from here to there” where the size is not necessarily known until runtime, the addresses may not be word aligned, it will be a copy (so needing reads as well as writes, and potentially from addresses with different alignment offsets), and the areas read from and written to may even overlap. If you want faster copying, you should not be writing your own code for it. You should be using the appropriate library function, such as memcpy(). For setting a fixed value, as in your example, there is memset(), but note again it has to cope in situations where yours wouldn’t. The more optimised library functions are here, and it’s worth noting the similarity between your code and Two of the main things in favour of a compiler, for me, are:
Don’t get me wrong here, there’s plenty to hate about C too. My personal one is that there’s no standardised attempt at giving the option of dealing with bounds checking of, well, anything 2, nor automatic detection of null pointers. Two problems that have visited chaos upon the world. 1 The “every machine cycle counts” ship sailed long ago. If a little function takes three nanoseconds instead of two and a half, the only way you’re going to notice is accumulated executions on the order of millions. I mean, FFS, we’re wasting vast amounts of time with an emulation of an ancient floating point co-processor, instead of the internal hardware that is like two orders of magnitude faster. And even that’s faster than interpreting BASIC. So the concept of optimisation really needs to be taken in context. 2 Not that assembler provides you with anything at all, mind… |
David J. Ruck (33) 1636 posts |
Lets not feed the troll again, we’ve already done this one several times. |