Shifting via the stack? Huh?
Rick Murray (539) 13851 posts |
There is an amount of code like this in DeskLib:
I have changed the single register load at the end to This is one instruction larger, but surely quicker on any ARM ever built (only stacks the return address, simply shifts registers around in the core)?
Was there a specific reason to swap registers using the stack like that (remember, bits of DeskLib are compatible with RISC OS 2), or is it really as weird as it looks to me? There’s also a lot of paranoid saving of R14 around SWI calls that is a complete waste of time in user-mode applications (but rather important in SVC mode). I’m reluctant to touch that in case unexpected things break horribly. |
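As a sketch of the sort of veneer being described (the register choices are illustrative guesses, not quoted from the DeskLib source): the old form shuffles a value between registers by pushing it and popping it back into a different one, while the rewrite stacks only the return address and does the shuffle with a MOV.

    ; Illustrative only - not the actual DeskLib code.
    old_style_veneer
            STMFD   sp!, {r1, lr}           ; push r1 and the return address
            ; ... veneer body ...
            LDMFD   sp!, {r0, pc}           ; pop: the old r1 comes back in r0, lr in pc

    new_style_veneer
            STR     lr, [sp, #-4]!          ; only the return address touches the stack
            ; ... veneer body ...
            MOV     r0, r1                  ; shuffle in the register bank instead
            LDR     pc, [sp], #4            ; return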
Steve Pampling (1551) 8172 posts |
Excess register saves: Generic code blocks or macro expansion? Or both in a nice mix. That said, I wonder how much the repeated use of those multi-register loads / saves slows down the system. |
Theo Markettos (89) 919 posts |
It’s probably not too bad. The store will go into the L1 cache, from where the load will pick it up (if not from the store buffer). Since they’re adjacent, there’s a good chance the locations will be in the same line, so there won’t be any extra cache pollution. The line will eventually get evicted and have to go to DRAM, but it would have to do that anyway because you stored SP in it. The STM and LDM will get converted into 3 and then 2 micro-ops. So, assuming you can access L1 at the same speed as registers (which is very likely), you only waste 2 cycles doing the STM/LDM rather than the STR/MOV/MOV. Superscalar-wise, the STR can happen in parallel with the other register loads/stores, so the whole lot might take 2 cycles (STR/MOV/MOV) or 4 cycles (STM/LDM) – slower, but not much slower, certainly not microseconds. |
Rick Murray (539) 13851 posts |
Thanks – great answer. I think I’ll convert it to MOV/MOV as it seems “purer”¹ to me! But many thanks again for a detailed explanation. ¹ Extremely subjective! |
Theo Markettos (89) 919 posts |
I’d probably do that too. In a lot of cases the overhead is hidden by the microarchitecture, but at some point you might run into a core whose microarchitecture doesn’t support it – what if you wanted it to run on a Cortex-M3, for instance? Keeping it ‘pure’ avoids unnecessary potential issues – unless you introduce new ones. On the other hand, relying on the microarchitecture can make life easier: AArch64 doesn’t have conditional instructions because they cause instruction dependencies, and the branch predictor can negate the overhead of branches, giving equivalent performance to conditionals. |
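To make that concrete (hedging slightly: AArch64 keeps a small set of conditional select instructions such as CSEL, it just drops general predication), here is "keep the larger of two values" written both ways – the first half is AArch32 in objasm syntax, the second AArch64 in GNU assembler syntax:

    ; AArch32: almost any instruction can take a condition code
            CMP     r0, r1
            MOVLT   r0, r1                  ; r0 = max(r0, r1) with no branch at all

    // AArch64: no general predication, so either branch...
            cmp     w0, w1
            b.ge    1f
            mov     w0, w1                  // only reached when w0 < w1
    1:
    // ...or use a conditional select and avoid the branch entirely
            cmp     w0, w1
            csel    w0, w0, w1, ge          // w0 = (w0 >= w1) ? w0 : w1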
Rick Murray (539) 13851 posts |
I converted most of them, but things like this make my head hurt at this time of night…
|
Steve Drain (222) 1620 posts |
I have a couple of slightly related questions for those who know these things. I have read some of the ARM documentation and have concluded that, for modern ARM processors: 1. branch prediction is so good that it is not worth using stretches of conditional instructions rather than a jump over them; 2. LDM fetches words in 64-bit pairs, so it is faster than two LDRs for two registers. Is either of those remotely correct? If so, it changes my conception of writing efficient assembler from ARM2 days. ;-) In any case, with the speeds and variations between processors, is it really worth bothering, and should the best guide be to just write clear and understandable code with no tricks? |
Kuemmel (439) 384 posts |
…regarding 1. and 2., I think you really have to benchmark that with a real-world piece of code on each specific processor, and especially for branch prediction the conclusion might change between OMAP3/4/5/… that’s the pity of these modern architectures. It depends a lot on pipeline length and the instruction units used during the flow of code… No tricks? Then what’s the fun in assembler? ;-) …I can just advise commenting almost every line of assembler, as it’s quite a horror to track down what you coded some years ago without ;-) |
Theo Markettos (89) 919 posts |
I can’t speak for any particular ARM architecture, but speaking generally: 1. Branch predictors are pretty good these days. There are several problems with stretches of conditional instructions: 2. That really depends on the setup of a particular processor – how many ports there are from the cache and to the register file – as to whether multiple loads can be issued at once. There isn’t a whole lot of difference between an LDM of two registers (turned into two micro-ops) and two LDRs (already two micro-ops). Note that there’s also Load/Store Double (LDRD/STRD, AArch32) and Load/Store Pair (LDP/STP, AArch64). Exact details vary with the processor – for instance ARM11 is single-issue with a 64-bit memory bus, so LDM or LDRD is likely quicker than 2x LDR, but the Cortex-A15 is superscalar, so the 2x LDR can be issued in parallel. I think – I haven’t stared at the manuals in too much detail. As for whether this stuff matters, I’d say write clean, understandable code (hint: don’t use assembler!) and then resort to tricks for the key parts where it matters (that’s what profiling is for). No point in unnecessary ‘optimisation’ (i.e. making it harder to read). |
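To put question 2 in concrete terms, these are the three obvious ways of fetching a pair of words on AArch32 (illustrative objasm-style code; as described above, which one wins depends on the core):

    ; r0 points at two adjacent words; fetch them into r2 and r3.
            LDMIA   r0, {r2, r3}            ; one instruction, split into micro-ops internally
    ; or (ARMv5TE and later; r0 should be at least 4-byte aligned, ideally 8)
            LDRD    r2, r3, [r0]            ; load double - one 64-bit access on a 64-bit bus
    ; or
            LDR     r2, [r0]                ; two plain loads - these may dual-issue on a
            LDR     r3, [r0, #4]            ; superscalar core such as the Cortex-A15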
Rick Murray (539) 13851 posts |
Personally, I would use conditionals if the code is short and it avoids one or more short branches. Branch prediction may be good these days; however, people are still using RiscPC-era hardware. There’s something a bit messed up about talking about micro-ops on a RISC processor. |
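A typical case is an absolute-value step, where a couple of conditional instructions replace a short forward branch (illustrative code; on a core with no branch predictor the branchy form costs a pipeline flush whenever the branch is taken):

    ; r0 = abs(r0), conditional version - no branch at all
            CMP     r0, #0
            RSBLT   r0, r0, #0              ; r0 = 0 - r0, executed only if r0 was negative

    ; r0 = abs(r0), branch version
            CMP     r0, #0
            BGE     skip
            RSB     r0, r0, #0
    skip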
Theo Markettos (89) 919 posts |
I can see the argument for sticking with Risc PC-friendly versions: the ARM6/7/StrongARM doesn’t have many cycles to spare, so cycles matter. On the other hand, it might not be optimal for a Cortex-A15, but the A15 has plenty more cycles to spare.
ARM isn’t a RISC processor these days, and hasn’t been for some time. Anyway, the RISC v CISC debate is no longer relevant today. There’s a nice paper by Emily Blem et al. from Wisconsin-Madison (paywalled – but there are some decent writeups (1), (2), and a previous version is available). They show that, for modern workloads (Cortex-A8 and above), there’s no difference in performance/power between the ARM, x86, or MIPS ISAs. The chips out there are merely different implementation points in the same space, and those points are largely defined by microarchitecture rather than ISA. |
Steve Drain (222) 1620 posts |
Thanks for the replies. @Kuemmel It does seem that there is no obvious strategy for writing assembler that is best on all processors, but for what I do it is not really that important. I cannot agree more with the recommendation to comment nearly every instruction, and when coding for fun it is good to save even one instruction where possible. ;-) @Theo What you said first confirms my less expert reading of the ARM documents. I generally follow your second comment about writing for the RiscPC. Indeed, I still have the ARM2 in mind – a 3-instruction pipeline, no branch prediction and no cache. @Rick I echo your sentiments about how the code looks, but that leads me to often use subroutines (BL) rather than macros, because it is logically neater. Nothing startling has arisen and I will go on much as I have been. ;-) |
Rick Murray (539) 13851 posts |
Commenting in moderation – pushing #12 to select the OS_GBPB operation should carry a comment to say what the operation is, but no, you wouldn’t call SWI OS_GBPB and comment that you called SWI OS_GBPB. ;-) What is vitally important is to comment things that are unusual, specifically to say why the code is not as might be expected. Perhaps the instruction ordering was rearranged to save pipeline stalls or whatever – it should be commented so you don’t revisit it years later and wonder “say whut?!?!”.
I am of this mindset, though I really do wonder about the point of saving a couple of unnecessary instructions when in the background the machine will be servicing thousands of interrupts and context switching tasks and so on and so on. But, then again, if there are going to be slack wasteful instructions, it ain’t gonna be in MY code. ;-)
Certainly. And add out-of-order execution to the mix – confused yet? ;-)
Quite so. But then, knowing assembler, I suspect we’re a dying breed. |
Jeffrey Lee (213) 6048 posts |
To deal with 26/32-bit APCS function entry/exit you should probably write some macros – that should save you a lot of brain ache, and allow for other niceties (optional placement of function names prior to the function body, to allow the names to appear in APCS stack traces). The Shared C Library uses macros for the entry/exit of most of its assembler functions, but I’m not sure whether the licence would allow you to copy them into DeskLib. Optimisation is fun, and different CPUs do like different optimisations (I’m reminded of the time I spent optimising ColourTrans – there are now several different algorithms it uses, based on D-cache size, instruction set features/performance, etc.), but I wouldn’t worry too much about shaving one or two cycles off arbitrary SWI wrappers. Remember the golden rules of optimisation:
So if you find that calling 10 million SWIs is slow, that’s probably a sign that you shouldn’t be calling 10 million SWIs, not that you need to optimise the SWI wrappers :) |
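For what it is worth, a minimal sketch of the kind of macro Jeffrey means, assuming objasm-style conditional assembly; the names FnEntry/FnReturn and the APCS_32 switch are made up for this example and are not the Shared C Library’s macros. The key difference is the trailing ^ on the 26-bit exit, which restores the caller’s flags from the stacked R14 and must not appear in 32-bit code:

            GBLL    APCS_32
    APCS_32 SETL    {TRUE}                  ; {FALSE} for a 26-bit build

            MACRO
    $label  FnEntry $reglist                ; $reglist must be non-empty in this sketch
    $label  STMFD   sp!, {$reglist, lr}     ; save working registers and the return address
            MEND

            MACRO
            FnReturn $reglist
      [ APCS_32
            LDMFD   sp!, {$reglist, pc}     ; 32-bit exit: load PC only, leave the flags alone
      |
            LDMFD   sp!, {$reglist, pc}^    ; 26-bit exit: ^ also restores the stacked flags
      ]
            MEND

    ; usage:
    my_func FnEntry "r4,r5"
            ; ... function body ...
            FnReturn "r4,r5"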
Steve Pampling (1551) 8172 posts |
Possibly. Plenty of comments in the code helps when you don’t read assembler like English. ¹ Hackcherly it was on split-apart Bathams beer mats in the pub up the road² from my digs and it worked first time. |
GavinWraith (26) 1563 posts |
Reading Emily Blem’s paper (thanks Theo) left me in the same mood. Processors have become a lot more complicated over the years, and compilers have had to become a lot more sophisticated. Is there any role left for assembly language skills, apart from education or tinkering? I have a fantasy about think tanks being set up to address the problem of how mankind could cope with reconstituting its technical knowledge after a catastrophe without the assistance of machines. So many ancient skills we have lost forever; who knows when we may need them again? No doubt in The Culture ( https://en.wikipedia.org/wiki/The_Culture )…
Kuemmel (439) 384 posts |
I think assembler will always be relevant when it comes to speed… at least the demoscene will keep it alive if nobody else does ;-) It’s getting a bit off-topic, but I want to show you how complicated or surprising optimization has become on x86… Intel have really done wonders (massively parallel instruction units) in their CPUs, starting with the Core 2 Duo. See this example from my Mandelbrot SSE2 code (SSE2 is like NEON – here it’s operating on two double floats with one instruction): One might think using these two blocks for points 1/2 and points 3/4 is just classic loop unrolling… in fact the second block mostly uses a different set of working registers, and that doubles the throughput… so it’s something like an out-of-order system that works over much larger areas of code than just the instruction before or after… it’s really tough to get your mind set to this, and it makes optimization on x86 quite a fun, beneficial headache… it could be done in C maybe, too, but you really have to change your code to use these benefits… and it doesn’t do anything on older CPUs, and AMD is totally different again…
|
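The interleaving trick carries over to plain ARM as well; as a loose illustration (my own sketch, not Kuemmel’s SSE2 listing), here is an array sum that processes two elements per loop but gives each its own accumulator, so the two dependency chains are independent and a superscalar or out-of-order core can overlap them:

    ; In:  r0 = pointer to a word-aligned buffer, r1 = word count (assumed even and non-zero)
    ; Out: r0 = sum of the words
    sum_words
            STMFD   sp!, {r4, lr}
            MOV     r2, #0                  ; accumulator for chain A
            MOV     r3, #0                  ; accumulator for chain B
    loop
            LDR     r12, [r0], #4           ; element for chain A
            LDR     r4, [r0], #4            ; element for chain B
            ADD     r2, r2, r12             ; chain A
            ADD     r3, r3, r4              ; chain B - no dependency on chain A
            SUBS    r1, r1, #2
            BGT     loop
            ADD     r0, r2, r3              ; combine the partial sums
            LDMFD   sp!, {r4, pc}

A single-accumulator version would serialise every ADD behind the previous one; splitting the chain is what gives the extra execution units something useful to do.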