Shifting via the stack? Huh?
Rick Murray (539) 13851 posts |
There is an amount of code like this in DeskLib:
I have changed the single register load at the end to This is one instruction larger, but surely quicker on any ARM ever built (only stacks the return address, simply shifts registers around in the core)?
Was there a specific reason to swap registers using the stack like that (remember, bits of DeskLib are compatible with RISC OS 2), or is it really as weird as it looks to me? There’s also a lot of paranoid saving of R14 around SWI calls that is a complete waste of time in user-mode applications (but rather important in SVC mode). I’m reluctant to touch that in case unexpected things break horribly. |
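As a sketch of the sort of veneer being described (the register choices are illustrative guesses, not quoted from the DeskLib source): the old form shuffles a value between registers by pushing it and popping it back into a different one, while the rewrite stacks only the return address and does the shuffle with a MOV.

    ; Illustrative only - not the actual DeskLib code.
    old_style_veneer
            STMFD   sp!, {r1, lr}           ; push r1 and the return address
            ; ... veneer body ...
            LDMFD   sp!, {r0, pc}           ; pop: the old r1 comes back in r0, lr in pc

    new_style_veneer
            STR     lr, [sp, #-4]!          ; only the return address touches the stack
            ; ... veneer body ...
            MOV     r0, r1                  ; shuffle in the register bank instead
            LDR     pc, [sp], #4            ; return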
Steve Pampling (1551) 8172 posts |
Excess register saves: Generic code blocks or macro expansion? Or both in a nice mix. That said, I wonder how much the repeated use of those multi-register loads / saves slows down the system. |
Theo Markettos (89) 919 posts |
It’s probably not too bad. The store will go into the L1 cache, from where the load will pick it up (if not from the store buffer). Since they’re adjacent, there’s a good chance the locations will be in the same line, so there won’t be any extra cache pollution. The line will eventually get evicted and have to go to DRAM, but it would have to do that anyway because you stored SP in it. The STM and LDM will get converted into 3 and then 2 micro-ops. So, assuming you can access L1 at the same speed as registers (which is very likely), you only waste 2 cycles doing the STM/LDM rather than the STR/MOV/MOV. Superscalar-wise, the STR can happen in parallel with the other register loads/stores, so the whole lot might take 2 cycles (STR/MOV/MOV) or 4 cycles (STM/LDM) – slower, but not much slower, certainly not microseconds. |
Rick Murray (539) 13851 posts |
Thanks – great answer. I think I’ll convert it to MOV/MOV as it seems “purer”¹ to me! But many thanks again for a detailed explanation. ¹ Extremely subjective! |
Theo Markettos (89) 919 posts |
I’d probably do that too. In a lot of cases the overhead is hidden by the microarchitecture, but at some point you might run into a core whose microarchitecture doesn’t support it – what if you wanted it to run on a Cortex-M3, for instance? Keeping it ‘pure’ avoids unnecessary potential issues – unless you introduce new ones. On the other hand, relying on the microarchitecture can make life easier: AArch64 doesn’t have conditional instructions because they cause instruction dependencies, and the branch predictor can negate the overhead of branches, giving equivalent performance to conditionals. |
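To make that concrete (hedging slightly: AArch64 keeps a small set of conditional select instructions such as CSEL, it just drops general predication), here is "keep the larger of two values" written both ways – the first half is AArch32 in objasm syntax, the second AArch64 in GNU assembler syntax:

    ; AArch32: almost any instruction can take a condition code
            CMP     r0, r1
            MOVLT   r0, r1                  ; r0 = max(r0, r1) with no branch at all

    // AArch64: no general predication, so either branch...
            cmp     w0, w1
            b.ge    1f
            mov     w0, w1                  // only reached when w0 < w1
    1:
    // ...or use a conditional select and avoid the branch entirely
            cmp     w0, w1
            csel    w0, w0, w1, ge          // w0 = (w0 >= w1) ? w0 : w1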
Rick Murray (539) 13851 posts |
I converted most of them, but things like this make my head hurt at this time of night…
|
Steve Drain (222) 1620 posts |
I have a couple of slightly related questions for those who know these things. I have read some of the ARM documentation and have concluded that, for modern ARM processors: 1. branch prediction is so good that it is not worth using stretches of conditional instructions rather than a jump over them; 2. LDM fetches words in 64-bit pairs, so it is faster than two LDRs for two registers. Is either of those remotely correct? If so, it changes my conception of writing efficient assembler from ARM2 days. ;-) In any case, with the speeds and variations between processors, is it really worth bothering, and should the best guide be to just write clear and understandable code with no tricks? |
Kuemmel (439) 384 posts |
…regarding 1. and 2., I think you really have to benchmark that with a real-world piece of code on each specific processor, and especially for branch prediction the conclusion might change between OMAP3/4/5/… that’s the pity of these modern architectures. It depends a lot on pipeline length and the instruction units used during the flow of code… No tricks? Then what’s the fun in assembler? ;-) …I can just advise commenting almost every line of assembler, as it’s quite a horror to track down what you coded some years ago without ;-) |
Theo Markettos (89) 919 posts |
I can’t speak for any particular ARM architecture, but speaking generally: 1. Branch predictors are pretty good these days. There are several problems with stretches of conditional instructions: 2. That really depends on the setup of a particular processor – how many ports there are from the cache and to the register file – as to whether multiple loads can be issued at once. There isn’t a whole lot of difference between an LDM of two registers (turned into two micro-ops) and two LDRs (already two micro-ops). Note that there’s also Load/Store Double (LDRD/STRD, AArch32) and Load/Store Pair (LDP/STP, AArch64). Exact details vary with the processor – for instance ARM11 is single-issue with a 64-bit memory bus, so LDM or LDRD is likely quicker than 2x LDR, but the Cortex-A15 is superscalar, so the 2x LDR can be issued in parallel. I think – I haven’t stared at the manuals in too much detail. As for whether this stuff matters, I’d say write clean, understandable code (hint: don’t use assembler!) and then resort to tricks for the key parts where it matters (that’s what profiling is for). No point in unnecessary ‘optimisation’ (i.e. making it harder to read). |
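To put question 2 in concrete terms, these are the three obvious ways of fetching a pair of words on AArch32 (illustrative objasm-style code; as described above, which one wins depends on the core):

    ; r0 points at two adjacent words; fetch them into r2 and r3.
            LDMIA   r0, {r2, r3}            ; one instruction, split into micro-ops internally
    ; or (ARMv5TE and later; r0 should be at least 4-byte aligned, ideally 8)
            LDRD    r2, r3, [r0]            ; load double - one 64-bit access on a 64-bit bus
    ; or
            LDR     r2, [r0]                ; two plain loads - these may dual-issue on a
            LDR     r3, [r0, #4]            ; superscalar core such as the Cortex-A15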
Rick Murray (539) 13851 posts |
Personally, I would use conditionals if the code is short and it avoids one or more short branches. Branch prediction may be good these days; however, people are still using RiscPC-era hardware. There’s something a bit messed up about talking about micro-ops on a RISC processor. |
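A typical case is an absolute-value step, where a couple of conditional instructions replace a short forward branch (illustrative code; on a core with no branch predictor the branchy form costs a pipeline flush whenever the branch is taken):

    ; r0 = abs(r0), conditional version - no branch at all
            CMP     r0, #0
            RSBLT   r0, r0, #0              ; r0 = 0 - r0, executed only if r0 was negative

    ; r0 = abs(r0), branch version
            CMP     r0, #0
            BGE     skip
            RSB     r0, r0, #0
    skip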
Theo Markettos (89) 919 posts |
I can see the argument for sticking with Risc PC-friendly versions: the ARM6/7/StrongARM doesn’t have many cycles to spare, so cycles matter. On the other hand, it might not be optimal for a Cortex-A15, but the A15 has plenty more cycles to spare.
ARM isn’t a RISC processor these days, and hasn’t been for some time. Anyway, the RISC v CISC debate is no longer relevant today. There’s a nice paper by Emily Blem et al. from Wisconsin-Madison (paywalled – but there are some decent writeups (1), (2), and a previous version is available). They show that, for modern workloads (Cortex-A8 and above), there’s no difference in performance/power between the ARM, x86, or MIPS ISAs. The chips out there are merely different implementation points in the same space, and those points are largely defined by microarchitecture rather than ISA. |
Steve Drain (222) 1620 posts |
Thanks for the replies. @Kuemmel It does seem that there is no obvious strategy for writing assembler that is best on all processors, but for what I do it is not really that important. I cannot agree more with the recommendation to comment nearly every instruction, and when coding for fun it is good to save even one instruction where possible. ;-) @Theo What you said first confirms my less expert reading of the ARM documents. I generally follow your second comment about writing for the RiscPC. Indeed, I still have the ARM2 in mind – a 3-instruction pipeline, no branch prediction and no cache. @Rick I echo your sentiments about how the code looks, but that leads me to often use subroutines (BL) rather than macros, because it is logically neater. Nothing startling has arisen and I will go on much as I have been. ;-) |
Rick Murray (539) 13851 posts |
Commenting in moderation – pushing #12 to select the OS_GBPB operation should carry a comment to say what the operation is, but no, you wouldn’t call SWI OS_GBPB and comment that you called SWI OS_GBPB. ;-) What is vitally important is to comment things that are unusual, specifically to say why the code is not as might be expected. Perhaps the instruction ordering was rearranged to save pipeline stalls or whatever – it should be commented so you don’t revisit it years later and wonder “say whut?!?!”.
I am of this mindset, though I really do wonder about the point of saving a couple of unnecessary instructions when in the background the machine will be servicing thousands of interrupts and context switching tasks and so on and so on. But, then again, if there are going to be slack wasteful instructions, it ain’t gonna be in MY code. ;-)
Certainly. And add out-of-order execution to the mix – confused yet? ;-)
Quite so. But then, knowing assembler, I suspect we’re a dying breed. |
Jeffrey Lee (213) 6048 posts |
To deal with 26/32-bit APCS function entry/exit you should probably write some macros – that should save you a lot of brain ache, and allow for other niceties (optional placement of function names prior to the function body, to allow the names to appear in APCS stack traces). The Shared C Library uses macros for the entry/exit of most of its assembler functions, but I’m not sure whether the licence would allow you to copy them into DeskLib. Optimisation is fun, and different CPUs do like different optimisations (I’m reminded of the time I spent optimising ColourTrans – there are now several different algorithms it uses, based on D-cache size, instruction set features/performance, etc.), but I wouldn’t worry too much about shaving one or two cycles off arbitrary SWI wrappers. Remember the golden rules of optimisation:
So if you find that calling 10 million SWIs is slow, that’s probably a sign that you shouldn’t be calling 10 million SWIs, not that you need to optimise the SWI wrappers :) |
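For what it is worth, a minimal sketch of the kind of macro Jeffrey means, assuming objasm-style conditional assembly; the names FnEntry/FnReturn and the APCS_32 switch are made up for this example and are not the Shared C Library’s macros. The key difference is the trailing ^ on the 26-bit exit, which restores the caller’s flags from the stacked R14 and must not appear in 32-bit code:

            GBLL    APCS_32
    APCS_32 SETL    {TRUE}                  ; {FALSE} for a 26-bit build

            MACRO
    $label  FnEntry $reglist                ; $reglist must be non-empty in this sketch
    $label  STMFD   sp!, {$reglist, lr}     ; save working registers and the return address
            MEND

            MACRO
            FnReturn $reglist
      [ APCS_32
            LDMFD   sp!, {$reglist, pc}     ; 32-bit exit: load PC only, leave the flags alone
      |
            LDMFD   sp!, {$reglist, pc}^    ; 26-bit exit: ^ also restores the stacked flags
      ]
            MEND

    ; usage:
    my_func FnEntry "r4,r5"
            ; ... function body ...
            FnReturn "r4,r5"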
Steve Pampling (1551) 8172 posts |
Possibly. Plenty of comments in the code helps when you don’t read assembler like English. ¹ Hackcherly it was on split-apart Bathams beer mats in the pub up the road² from my digs and it worked first time. |
GavinWraith (26) 1563 posts |
Reading Emily Blem’s paper (thanks Theo) left me in the same mood. Processors have become a lot more complicated over the years, and compilers have had to become a lot more sophisticated. Is there any role left for assembly language skills, apart from education or tinkering? I have a fantasy about think tanks being set up to address the problem of how mankind could cope with reconstituting its technical knowledge after a catastrophe without the assistance of machines. So many ancient skills we have lost forever; who knows when we may need them again? No doubt in The Culture ( https://en.wikipedia.org/wiki/The_Culture )…
Kuemmel (439) 384 posts |
I think assembler will always be relevant when it comes to speed… at least the demoscene will keep it alive if nobody else does ;-) It’s getting a bit off-topic, but I want to show you how complicated or surprising optimization has become on x86… Intel have really done wonders (massively parallel instruction units) in their CPUs, starting with the Core 2 Duo. See this example from my Mandelbrot SSE2 code (SSE2 is like NEON – here it’s operating on two double floats with one instruction): One might think using these two blocks for points 1/2 and points 3/4 is just classic loop unrolling… in fact the second block mostly uses a different set of working registers, and that doubles the throughput… so it’s something like an out-of-order system that works over much larger areas of code than just the instruction before or after… it’s really tough to get your mind set to this, and it makes optimization on x86 quite a fun, beneficial headache… it could be done in C maybe, too, but you really have to change your code to use these benefits… and it doesn’t do anything on older CPUs, and AMD is totally different again…
|
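The interleaving trick carries over to plain ARM as well; as a loose illustration (my own sketch, not Kuemmel’s SSE2 listing), here is an array sum that processes two elements per loop but gives each its own accumulator, so the two dependency chains are independent and a superscalar or out-of-order core can overlap them:

    ; In:  r0 = pointer to a word-aligned buffer, r1 = word count (assumed even and non-zero)
    ; Out: r0 = sum of the words
    sum_words
            STMFD   sp!, {r4, lr}
            MOV     r2, #0                  ; accumulator for chain A
            MOV     r3, #0                  ; accumulator for chain B
    loop
            LDR     r12, [r0], #4           ; element for chain A
            LDR     r4, [r0], #4            ; element for chain B
            ADD     r2, r2, r12             ; chain A
            ADD     r3, r3, r4              ; chain B - no dependency on chain A
            SUBS    r1, r1, #2
            BGT     loop
            ADD     r0, r2, r3              ; combine the partial sums
            LDMFD   sp!, {r4, pc}

A single-accumulator version would serialise every ADD behind the previous one; splitting the chain is what gives the extra execution units something useful to do.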