Sometimes assembler is still useful ;)

20 posts, 9 voices

Jul 8, 2021 6:19pm Rick Murray (539) 13840 posts	For something I’m working on, prototyping various ideas in BASIC, I needed to read a pixel from the screen and “brighten” it. And do it another hundred and twenty thousand times. ;-) I am reading data directly from screen memory. Using the OS calls to do this is just tooooooo slow. As I am using a 256 colour mode, the bit layout in memory is weird (refer to PRM 1-561). My initial code would pull apart the byte into red, green, blue, and tint. Adjust, then recombine, then write back. It was, understandably, really slow. So my next idea was to use a lookup table. So I wrote a program to generate the lookup table, and then simply used the screen colour value (0-255) as an index into the table which would give a replacement byte to write. This was quite a bit faster…but still slow. So my final idea was to write an assembler routine to work out which bit of memory to fiddle with, read/index/write. There are a few speed tweaks like using writeback #1 after write (saves an increment), and counting my loops backwards with SUBS (will be EQ when zero, saves a CMP). This works nice and fast. So much so, that when I come to code the real thing in C (as I plan to), I might simply port the function as it is into ObjAsm format and use it. So… once in a while, assembler has its uses. ;)

Jul 9, 2021 8:50am Jeff Doggett (257) 234 posts	If you write the C code to match the assembler, I would expect the resultant code to run just as fast.

Jul 9, 2021 11:21am Kuemmel (439) 384 posts	…the good old C vs ASM topic ;-) …only chance for a verdict is to benchmark it. When I look at the Cortex-A72 optimization guide by ARM (Link here) e.g. at section 4.5/page 38 – load/store throughput I wouldn’t think that any compiler would do those optmisations out of the blue adapted for a specific context…but as always in this discussion, if you want to save time and have code easy to be maintained…don’t do it…

Jul 9, 2021 4:53pm Rick Murray (539) 13840 posts	I would expect the resultant code to run just as fast. Possibly, but this means rewriting the code (again), this time in C. As I have something that works (written to use with BASIC), I might just copy that over. I wouldn’t think that any compiler would do those optmisations out of the blue adapted for a specific context One can specify architecture and processor core to the compiler. I’m not sure if it makes much difference, never tried it and examined the code. if you want to save time and have code easy to be maintained…don’t do it… Save time? Saving time would be using assembler. It’s already been written, tested, and debugged. As for maintenance, it’s quite liberally commented. It’s important to comment copiously, so you can hit the ground running when you come back to a project five years later and you’re like “I wrote this?”. ;-)

Jul 9, 2021 5:22pm Clive Semmens (2335) 3276 posts	It’s important to comment copiously, so you can hit the ground running when you come back to a project five years later and you’re like “I wrote this?”. ;-) Five years later? I’ve been trying to update inadequately commented assembler routines I wrote thirty years ago…

Jul 9, 2021 5:42pm David Feugey (2125) 2709 posts	Possibly, but this means rewriting the code (again), this time in C And the big question is: is there an integrated C compiler in BBC Basic? :)

Jul 9, 2021 6:02pm Colin Ferris (399) 1814 posts	There is BASIC to ‘C’ convertor prog – but I think Rick likes a challenge. :-)

Jul 9, 2021 6:08pm Clive Semmens (2335) 3276 posts	There is BASIC to ‘C’ convertor prog That’s a rather different beast from an integrated C compiler though, which is far more like the embedded assembler.

Jul 9, 2021 7:42pm David J. Ruck (33) 1635 posts	One can specify architecture and processor core to the compiler. I’m not sure if it makes much difference, never tried it and examined the code. It does make a significant difference, especially with the newer cores, being able to optimise for the instruction issue width, and use new instructions introduced with each architecture. But its only useful if you are targetting specific hardware for a project, or just your own machine. RISC OS runs on a large number of cores and architectures, so you you need to use the compilers lowest common denominator code, which works on everything but is far from optimal on any of it. The beauty of writing it in C, is optimising for cores or architectures is only a option change and a recompile away.

Jul 10, 2021 6:56am Clive Semmens (2335) 3276 posts	…you you need to use the compilers lowest common denominator code, which works on everything but is far from optimal on any of it. Is there, or could there be, an option to make at least heavily worked parts of the code be compiled in parallel for different architectures, with branches switched according to the hardware the code is running on? Obviously it would make the code bigger, but would surely be worth it at least for those heavily worked routines – which the programmer would presumably have to mark somehow. (The same trick would be useful in BBC BASIC, too, for built-in functions. I don’t think expecting BASIC programmers to mark up other heavily-used routines makes much sense.)

Jul 12, 2021 5:50pm David J. Ruck (33) 1635 posts	That is definitely possible, I’ve not seen anyone doing that with integer code in RISC OS, but it’s been done for floating point, with FPA, VFP2, VFP3 and Neon variations. I don’t think it’s worth it for BASIC, as if speed is that critical, don’t use an interpreted language. Same goes for Python, but that is mitigated by the possibility of the heavy lifting being done by library code written in C, which can be compiled for the particular CPU at installation time.

Jul 12, 2021 8:12pm Rick Murray (539) 13840 posts	as if speed is that critical, don’t use an interpreted language. That’s exactly it. The only reason the assembler was written at all is because I’m using a shell written in BASIC to test out ideas before “doing it properly in C”. ;-) And, like you say, BASIC was just too damn slow at doing this particular part. Painfully so. an option to make at least heavily worked parts of the code be compiled in parallel for different architectures, with branches switched according to the hardware the code is running on? For reading bytes from the screen, using them to index a byte table to get a new value to byte write back to the screen, are there actually any architecturally specific things worth considering that would make a significant difference? Given it’s screen memory (and probably not cached), a better speed increase might be that once the address is word aligned, to read words at a time, splitting the bytes into registers for lookup, and then recombining to write a word at a time…but then we’re getting into a lot of additional effort for what gain? The code, when in use, drops the framerate to ~36fps…but then the code, when in use, only happens for three frames at a time so this stall isn’t noticeable. Continued further optimisation of this would seem to me to be concentrating on something that isn’t important.

Jul 13, 2021 11:57pm Paolo Fabio Zaino (28) 1882 posts	Is there, or could there be, an option to make at least heavily worked parts of the code be compiled in parallel for different architectures, with branches switched according to the hardware the code is running on? I actually been testing this for the conversion of my M/Forth from ASM to C, after some work and tests turned out that the best approach (and compromise) between simplicity and performance is to adopt what I called “Boot Strap Module” approach. Right now, my M/Forth has a module called mforth that loads as BASIC does and can be used to either run or interactively compile forth applications as you would do to run a BASIC file. However, what it does underneath is it will load the appropriate Forth VM for the specific platform (and eventually the specific app as well). The module contains all the routines that do not need optimisations (like file loading and such) while the Forth bytecode is executed by a VM that is loaded at runtime. Such VM is compiled with all the possible optimisations for each specific RISC OS platforms (Rpi1, Rpi2, Archimedes with ARM3, Rpi4 etc…). So at module load time: - mforth module initialise and determine which platform it’s running on When the user runs a Forth program: - mforth module loads the forth source file and/or dictionary the user has requested (or double clicked on) - mforth module does the JIT bycode generation - mforth module dynamically loads the appropriate mforth runtime VM that will execute the forth bytecode as fast as possible on the specific platform Given that the VMs are all very small it seems to be ok on Raspberry Pis etc…, this also has another advantage: The Forth code can specify which version of the VM it wants to run on, so it can be used to warrantee compatibility too in case one wants that. The other advantage is that the code stays small. As soon as I have time to finish it, I’ll release the whole package so people can evaluate if it’s a useful approach or not for their needs. The reason I opted for this approach is because SD cards seem to be quite bad at loading time, so the less we load the better it is. Also given that each VM is an executable AIF, it can be squeezed too, making it even faster at loading from an SD card. Just my 0.5c as always, but maybe someone may find it useful…

Jul 17, 2021 11:37pm Charles Ferguson (8243) 427 posts	I just tried this, for the laugh… https://godbolt.org/z/G3q1E1P53 – has the (rudimentary) code I created, and its compiled version to see what happened. Initially I wanted to just do the operation that was originally described and then see what happened when I converted it to word operations which can be more efficient and have more opportunity for optimisations… If you cannot see the Compiler Explorer page, the code that I produced was pretty simple: #define OPTIMISE (1) int screen_cipher(unsigned char screen, int screenlen, unsigned char lut) { #if OPTIMISE == 0 int i; /* Plain old boring code / for (i=0; i<screenlen; i++) { unsigned char c = screen; c = lut[c]; screen++ = c; } #else / Optimise using word accesses / unsigned long screenword = (unsigned long )screen; unsigned long screenend = (unsigned long )(screen + screenlen); for (; screenword < screenend;) { unsigned long w = screenword; unsigned long c1 = lut[(w>>0) & 255]; unsigned long c2 = lut[(w>>8) & 255] << 8; unsigned long c3 = lut[(w>>16) & 255] << 16; unsigned long c4 = lut[(w>>24) & 255] << 24; w = c1 \| c2 \| c3 \| c4; screenword++ = w; } #endif } Which produced the amusing code: screen_cipher: push {r4, r5, lr} add r4, r0, r1 cmp r0, r4 popcs {r4, r5, pc} .L3: ldr r3, [r0] uxtb r5, r3 ubfx lr, r3, #8, #8 ubfx ip, r3, #16, #8 ldrb r1, [r2, r3, lsr #24] @ zero_extendqisi2 ldrb r3, [r2, r5] @ zero_extendqisi2 ldrb lr, [r2, lr] @ zero_extendqisi2 ldrb ip, [r2, ip] @ zero_extendqisi2 orr r3, r3, r1, lsl #24 orr r3, r3, lr, lsl #8 orr r3, r3, ip, lsl #16 str r3, [r0], #4 cmp r4, r0 bhi .L3 pop {r4, r5, pc} The use of the `uxtb`/`ubfx` saves the need to AND off the bits you didn’t care about. There are only 3 of these instruction needed because the high byte can be implicitly truncated by the shift on load (the LSR #24). So then I thought… can we get it to be more clever if we do this as a 64bit operation? https://godbolt.org/z/h33PhxYx1 Again, for those without access to compiler explorer, the loop is now: / Optimise using multiple word accesses / unsigned long long screenword = (unsigned long long )screen; unsigned long long screenend = (unsigned long long )(screen + screenlen); for (; screenword < screenend;) { unsigned long long w = screenword; unsigned long long c1 = lut[(w>>0) & 255]; unsigned long long c2 = lut[(w>>8) & 255] << 8; unsigned long long c3 = lut[(w>>16) & 255] << 16; unsigned long long c4 = lut[(w>>24) & 255] << 24; unsigned long long c5 = lut[(w>>32) & 255]; unsigned long long c6 = lut[(w>>40) & 255] << 8; unsigned long long c7 = lut[(w>>48) & 255] << 16; unsigned long long c8 = lut[(w>>56) & 255] << 24; w = c1 \| c2 \| c3 \| c4 \| ((c5 \| c6 \| c7 \| c8) << 32); screenword++ = w; } which comes out as: screen_cipher: push {r4, r5, r6, r7, r8, r9, lr} add r4, r0, r1 cmp r0, r4 popcs {r4, r5, r6, r7, r8, r9, pc} .L3: ldr r3, [r0] ldr r1, [r0, #4] ubfx ip, r3, #16, #8 ubfx r8, r3, #8, #8 uxtb r7, r1 uxtb r9, r3 ubfx r6, r1, #8, #8 ldrb lr, [r2, r3, lsr #24] @ zero_extendqisi2 ldrb ip, [r2, ip] @ zero_extendqisi2 ubfx r5, r1, #16, #8 ldrb r3, [r2, r7] @ zero_extendqisi2 ldrb r8, [r2, r8] @ zero_extendqisi2 lsl lr, lr, #24 ldrb r7, [r2, r1, lsr #24] @ zero_extendqisi2 lsl ip, ip, #16 ldrb r1, [r2, r9] @ zero_extendqisi2 ldrb r6, [r2, r6] @ zero_extendqisi2 orr ip, ip, r8, lsl #8 ldrb r5, [r2, r5] @ zero_extendqisi2 orr r1, lr, r1 orr r3, r3, r7, lsl #24 orr ip, ip, r1 orr r3, r3, r6, lsl #8 str ip, [r0], #8 orr r3, r3, r5, lsl #16 orr r3, r3, lr, asr #31 cmp r4, r0 str r3, [r0, #-4] bhi .L3 pop {r4, r5, r6, r7, r8, r9, pc} It’s quite an interesting optimisation process. The compiler has essentially split up the code that I had built into 64bit values so that it’s only operating on the 32bit parts independantly, to the extent that the write of the 64bit word actually happens 4 instructions apart, with the low part written first, then the trailing operations are performed before we store the high word. That said, I suspect my code has a problem somewhere as I don’t know how there can be an ASR #31 in there – surely that’s just going to make the whole word have set bits if the top bit of lr was set? Meh, that’s what tests are for… As for the comment about doing this using OS calls being slow… I think it depends on what calls you were using and whether you intend it to work in other depths (which you should). In a 256 (or fewer) colour mode the fastest way to brighten the screen would be to change the palette to make it brighter, assuming you wanted the whole screen changed. If you only want a part of the screen changed, as a one off operation, create a sprite coloured white, plot it scaled to cover the region that you’re interested in with transparency of a proportion you like for the brightening. This will be slower than custom code, but it’ll work with defined interfaces – and will therefore work with systems that don’t have frame buffers, or which are in different depths than you need in a specific case (because you shouldn’t write code that relies on a given screen depth). On the other hand, should you wish the fade out a region of the screen, the simplest way to do that would be to read the region into a sprite, then use the ColourTrans tables / Sprite colour mapping to plot the sprite back with modifications present. Again, no direct frame buffer access required, and the operations work in any mode. Anyhow… I guess my point was… no, I don’t think that assembler has much of a use for this. You can start from a high level language and have relatively decent code written which you can test nicely, and vary without having to resort to low level assembler. Only when your profiling says that it’s not good enough do you need to optimise in assembler – and even then, changing your algorithm is invariably easier in C. And* going straight to directly manipulating the screen is terrible from the perspective of making code that works well across the board on many platforms. In the specific case where you’re working with a 256 colour mode data only, the table you built is essentially a pixel translation table, so using a sprite operation means one call with all the code having already been written (and JIT’d for just your case).

Jul 18, 2021 12:16am Rick Murray (539) 13840 posts	I just tried this, for the laugh… ;-) That’s half the fun of it. Thing is, while the word access makes some interesting optimisations, one cannot guarantee that the transition will be on a word boundary. Not insurmountable, but makes the code more complicated. As for the comment about doing this using OS calls being slow… No, my comment was that doing it in BASIC was slow. (because you shouldn’t write code that relies on a given screen depth) It would be better to support “whatever”, but the behaviour of 256 colour modes (especially their colour handling) is different enough that it ought to be treated as a separate thing. I’ve written a routine to translate between colour,tint and the in-memory colour values (the latter being what Paint shows in its palette). I’m not aware of a ColourTrans call that does this, but then the algorithm is pretty simple. Reason – sprites are designed using Paint. Paint has 256 numbered colours. These numbers work if you’re bunging data into screen memory. They don’t work with OS calls that want colour and tint (like seeing a pixel in OS_SpriteOp). the table you built is essentially a pixel translation table, so using a sprite operation Thank you. I never thought about it that way. I’ll need to investigate grabbing a part of the screen as a sprite, then plotting it back with a modified table. That would probably be the most optimal method, yes…

Jul 18, 2021 3:21pm Rick Murray (539) 13840 posts	Thank you, Charles. I set up my old Pi outside so I could enjoy the nice weather (a photo, if you’re interested). All that assembler was easily replaced by SpriteOp 16 (grab a bit of the screen), SpriteOp 52 (paste it back with translation), then SpriteOp 25 (delete the sprite). The OS takes care of bounds checking (so I don’t have to). It’s basically three OS calls and a couple of lines to update offsets of things). Much simpler. ;-) In benchmarking (do it for 100 frames, use a stopwatch), both methods took “about 3.5 seconds”, so there’s still room for improvement when this stuff is written in C rather than BASIC. ;-) BTW, is the forum having a nervous breakdown? I keep having to log in…

Jul 18, 2021 7:24pm Charles Ferguson (8243) 427 posts	Interesting; I expected it to be a more noticeable (on the sort of data you were using) difference – ie the price you pay for using non-custom APIs is that it is slower, but with the benefit of being more manageable and compatible with other systems. I hadn’t expected it to be approximately comparable – just ‘acceptable’. Good to know. It’s possible that if you are going to do it repeatedly you don’t need to delete the sprite when you graph. I have a feeling it will delete the old one for you. Can’t remember whether it does, and I’ve not implemented the ‘grab sprite from screen’ yet ¹ so don’t have it in the local cache. As for the question about colour,tint… honestly, it’s so much pain working with those operations that I try not to think about it these days ². ColourTrans_ReturnGCOLNumber maybe… Chris Bazley was always my guy for explaining to me how I’d done it wrong, usually backed up by Ian Jeffray when Chris wasn’t around. Also, one of the formats for ColoutTrans_GenerateTable uses GCOL numbers. Not sure if that’s useful for you… or anyone else. Pretty sure I haven’t implemented that yet, ‘cos it’s weird. ¹ The plumbing is there, ‘cos I’ve implemented Screensave, but just haven’t needed to do it yet. ² And that’s after having implemented most of ColourTrans and the palette operations in all the modes.

Jul 19, 2021 12:26pm Rick Murray (539) 13840 posts	I have a feeling it will delete the old one for you. Well spotted! https://gitlab.riscosopen.org/RiscOS/Sources/Kernel/-/blob/master/s/vdu/vdugrafj#L149 I expected it to be a more noticeable (on the sort of data you were using) difference Me too, but we are comparing code that reads bytes, converts them, and writes them back. There’s a lot of memory access there, and it probably suffers for that – about 121,200 instructions to chunder through a 600×200 block. SpriteOp probably writes as words as soon as it can, so fewer instructions executed for the same result, meaning there are cycles to spare for the extra stuff it does. Plus the SWIs are called from BASIC so we’re hopping in and out of the interpreter. End result? They’re about the same. ;-)

Jul 31, 2021 3:40pm Rick Murray (539) 13840 posts	While C often does better at crossing the barrier between different CPUs, the code often ends up tied to a set of shared libraries and an expected OS arch, and thus is not portable. You do not need shared libraries for C. You absolutely do need libraries, however. This is because the core of C is extremely small, and most facilities are implemented as libraries. It just so happens that library sharing (either in the style of the SharedCLib module or the GCC shared libraries) is often beneficial as it means the applications don’t need to include the same baggage over and over and over. Now, as long as you have a compiler and the base set of libraries, you can write generic C on just about anything. Some care will be needed (as something like “int” means different things on different platforms), but it is quite possible. A little example of my own is, as I believe I mentioned earlier, that crappy Arthur clone I threw together last winter. The main library work there was making a RISC OS clone of the TurboC graphics library (because I’m a lot more familiar with RISC OS, so it was much easier to get it to do the same things as Borland GDI rather than trying to get TurboC’s environment to do things like RISC OS). There did need to be some system-specific code, but not a lot. I got RISC OS to fake a lot of how TurboC behaves. So in the end, the libraries and support code sorted out, the same base code for the window manager and apps builds on both systems. A slightly larger example is OvationPro. When this was first written, I don’t imagine David Pilling even considered the possibility of a Windows version. And, yet, a Windows version eventually turned up. I don’t have access to the sources so I’m not sure exactly how this was done, but I’d imagine that he wrote some library code for Windows to fake the behaviour of the bits of RISC OS that he needed in Windows, so the same base code compiles and works on both machines. And the uber example? Linux. Ported widely, with a standard known as POSIX that specifies certain things being ways in order to make a system that is “Unix-like”. No C compiler comes near to this level of optimization for this operation, That’s because it is a rather artificial operation. A fast way to write 16 bytes of the same data to word aligned addresses. C, on the other hand, has to contend with making functional code for “copy this data from here to there” where the size is not necessarily known until runtime, the addresses may not be word aligned, it will be a copy (so needing reads as well as writes, and potentially from addresses with different alignment offsets), and the areas read from and written to may even overlap. If you want faster copying, you should not be writing your own code for it. You should be using the appropriate library function, such as memcpy(). For setting a fixed value, as in your example, there is memset(), but note again it has to cope in situations where yours wouldn’t. The more optimised library functions are here, and it’s worth noting the similarity between your code and `_memset`. Plus, I would imagine, it doesn’t reliably cope with non-aligned addresses. Two of the main things in favour of a compiler, for me, are: It deals with the mundane. Sorting out registers, unstacking what’s been stacked, making the correct sorts of memory accesses… it may not be better than a highly competent assembler programmer, however most of us are not that. So the compiler produces perfectly reasonable code ¹ in a lot less time than it would take to write the same in low level by taking care of all the mundane stuff. The high level code is the high level code. Targetting a different CPU is a question of altering the compiler options if the choice is between, say, an ARM Cortex-A72 versus an ARM926. Amusingly, on the larger and more capable compilers such as GCC, it’s not actually that much different to choosing whether the output is for an ARM926 or a 80486… Yup, entirely different architecture, same code. Don’t get me wrong here, there’s plenty to hate about C too. My personal one is that there’s no standardised attempt at giving the option of dealing with bounds checking of, well, anything ², nor automatic detection of null pointers. Two problems that have visited chaos upon the world. It’s just that I grew up when C was hot stuff and before most of the trendy modern language paradigms of today even existed. Hell, I was probably arsing around with TurboC when the designers of these trendy modern paradigms were in diapers, if not just bishie bubbles in the eyes of their parents. ;-) ¹ The “every machine cycle counts” ship sailed long ago. If a little function takes three nanoseconds instead of two and a half, the only way you’re going to notice is accumulated executions on the order of millions. I mean, FFS, we’re wasting vast amounts of time with an emulation of an ancient floating point co-processor, instead of the internal hardware that is like two orders of magnitude faster. And even that’s faster than interpreting BASIC. So the concept of optimisation really needs to be taken in context. ² Not that assembler provides you with anything at all, mind…

Aug 1, 2021 7:51am David J. Ruck (33) 1635 posts	Lets not feed the troll again, we’ve already done this one several times.

Reply

To post replies, please first log in.

Forums → General →

Sometimes assembler is still useful ;)

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options