Dual-Issue, pipelines, out-of-order, in-order...
Kuemmel (439) 384 posts
Following Jeffrey’s “invitation” ;-) to discuss dual-issue, pipelines, out-of-order and in-order CPU design, I took my chance to post an issue that could be interesting in that sense and keeps me confused… I got some code within my Mandelbrot fixed point math (10.22 format) that looks like this:
…but the opposite is true: the first version is still faster on Cortex A7, A8, A9, A53 and A15 (didn’t test on StrongARM though) and I don’t get why… seems optimizing assembler nowadays is trial and error ;-) E.g. the A53 is an in-order design but can dual-issue… doesn’t seem to help. The A15 is an out-of-order design – also no success… or I’m getting it all wrong. Even ARM’s documents say on dual issue: “There must be no data dependency between the two instructions. That is, the second instruction must not have any source registers that are destination registers of the first instruction.”
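For reference, the 10.22 operation that the SMULL/MOV/ORR sequences in the replies below implement can be sketched in C like this (a minimal sketch – the names are illustrative, not from the original post):

```c
#include <stdint.h>

/* Sketch of 10.22 fixed-point multiplication: 10 integer bits,
   22 fraction bits.  SMULL produces the full 64-bit product; the
   MOV ...,LSR #22 / ORR ...,LSL #10 pair then extracts bits [53:22],
   i.e. the product shifted right by 22.  (The third sequence in the
   thread shifts by 21 instead, presumably folding the doubling of the
   Mandelbrot cross term 2*x*y into the shift.)                       */
typedef int32_t q10_22;

static q10_22 q22_mul(q10_22 a, q10_22 b)
{
    int64_t p = (int64_t)a * (int64_t)b;   /* SMULL: 64-bit product   */
    return (q10_22)(p >> 22);              /* MOV/ORR recombination   */
}
```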
Rick Murray (539) 13840 posts
The A8Time program reports:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    SMULL r3,r9,r1,r1      blocked during multi-cycle op
  2    SMULL (cycle 2)        blocked during multi-cycle op
  3    SMULL (cycle 3)        need pipeline 0 for multiply
  4    SMULL r5,r6,r0,r0      blocked during multi-cycle op
  5    SMULL (cycle 2)        blocked during multi-cycle op
  6    SMULL (cycle 3)        need pipeline 0 for multiply
  7    SMULL r8,r2,r0,r1      blocked during multi-cycle op
  8    SMULL (cycle 2)        blocked during multi-cycle op
  9    SMULL (cycle 3)        MOV r3,r3,LSR #22
 10    ORR r9,r3,r9,LSL #10   MOV r5,r5,LSR #22
 11    ORR r6,r5,r6,LSL #10   wait for r8
 12    wait for r8            wait for r8
 13    MOV r8,r8,LSR #21      wait for r2
 14    ORR r2,r8,r2,LSL #11

And for the second:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    SMULL r3,r9,r1,r1      blocked during multi-cycle op
  2    SMULL (cycle 2)        blocked during multi-cycle op
  3    SMULL (cycle 3)        need pipeline 0 for multiply
  4    SMULL r5,r6,r0,r0      blocked during multi-cycle op
  5    SMULL (cycle 2)        blocked during multi-cycle op
  6    SMULL (cycle 3)        need pipeline 0 for multiply
  7    SMULL r8,r2,r0,r1      blocked during multi-cycle op
  8    SMULL (cycle 2)        blocked during multi-cycle op
  9    SMULL (cycle 3)        MOV r3,r3,LSR #22
 10    ORR r9,r3,r9,LSL #10   MOV r5,r5,LSR #22
 11    ORR r6,r5,r6,LSL #10   wait for r8
 12    wait for r8            wait for r8
 13    MOV r8,r8,LSR #21      wait for r2
 14    ORR r2,r8,r2,LSL #11

It appears that SMULL is costly, blocking the processor for three cycles or five instructions, but not having the result available until seven cycles or thirteen instructions later. Ouch.
Jeffrey Lee (213) 6048 posts
Copy-paste error from Rick – these are the actual second results:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    SMULL r3,r9,r1,r1      blocked during multi-cycle op
  2    SMULL (cycle 2)        blocked during multi-cycle op
  3    SMULL (cycle 3)        need pipeline 0 for multiply
  4    SMULL r5,r6,r0,r0      blocked during multi-cycle op
  5    SMULL (cycle 2)        blocked during multi-cycle op
  6    SMULL (cycle 3)        need pipeline 0 for multiply
  7    SMULL r8,r2,r0,r1      blocked during multi-cycle op
  8    SMULL (cycle 2)        blocked during multi-cycle op
  9    SMULL (cycle 3)        MOV r3,r3,LSR #22
 10    MOV r5,r5,LSR #22      wait for r8
 11    wait for r8            wait for r8
 12    wait for r8            wait for r8
 13    MOV r8,r8,LSR #21      ORR r9,r3,r9,LSL #10
 14    ORR r6,r5,r6,LSL #10   ORR r2,r8,r2,LSL #11

So the second one is slower by half a cycle, plus however long it takes for the ORRs to produce their results. You might think you’d be able to make use of the first two “need pipeline 0 for multiply” bubbles, but SMULL takes too long to calculate its results – it’s only partway through the third SMULL that the results from the first become available.
Jeffrey Lee (213) 6048 posts
Also, if you eliminate the SMULLs then a8time reports both routines as having the same cycle timing – 3 cycles, both pipelines fully occupied. So the first routine looks like this, with the A8 able to forward the result of each MOV to the ORR in the other pipeline:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    MOV r3,r3,LSR #22      ORR r9,r3,r9,LSL #10
  2    MOV r5,r5,LSR #22      ORR r6,r5,r6,LSL #10
  3    MOV r8,r8,LSR #21      ORR r2,r8,r2,LSL #11

But if the MOV were something more complex (e.g. an ADD with r3 as destination) then it looks like you’d suffer a stall.
Kuemmel (439) 384 posts
Thanks Jeffrey and Rick – I see, everything is delayed due to the SMULLs (still faster than not using them for the fixed point math…). The Cortex-A8 documentation only quotes 3 cycles for an SMULL, so I guess they didn’t mention the extra result latency there. Only now did I find Ben’s website with much more data. I didn’t know about that interesting tool “a8time”; as far as I can see it’s part of the DDE – might be worth buying it even if it’s a bit outdated with the A8, as I think it still gives some nice hints, as I see here. Does a8time also cover the ARMv6, NEON and VFP instructions?
Jeffrey Lee (213) 6048 posts
It supports all the ARM instructions that the A8 does. However, it doesn’t support NEON. I’m not sure about VFP.
Rick Murray (539) 13840 posts
Except SWI/SVC and B/BL, it seems. While it is not possible to give a meaningful reply for a SWI call, it would be interesting to see what effect this and branching have on a program. If branching is expensive, an unrolled loop might be better, for instance.
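The unrolling idea can be sketched in plain C (a hypothetical example, not from the thread): processing four elements per iteration quarters the number of loop branches the hardware has to predict.

```c
#include <stddef.h>

/* Hypothetical 4x-unrolled sum: one loop branch per four elements
   instead of one per element.  Assumes n is a multiple of 4; a real
   version would add a cleanup loop for the remainder.  The separate
   accumulators also break the add-to-add dependency chain, which
   helps dual-issue in the same way as the MOV/ORR interleaving above. */
static int sum_unrolled(const int *a, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```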
Jeffrey Lee (213) 6048 posts
B & BL will be subject to branch prediction, so the cost of the branch will be dependent almost entirely on how well the prediction is working – see Ben’s cycle timings page for some details. Odd that he mentions that “Tight loops in particular often have extra stalls” – you’d assume the hardware could fully predict the branch to be taken. But maybe it can only predict one branch ahead, so if the loop length is shorter than the pipeline length you’ll be in trouble? Not sure offhand if the A8 is able to do anything special for SWI.
Kuemmel (439) 384 posts
Meanwhile I got some interesting additions to the topic. I got my copy of A8Time in the bundle of Nut Pi – quite nice, but as Jeffrey said and suspected, it doesn’t cover NEON and VFP. Luckily a similar tool exists for free online; I don’t know about its quality, but it seems pretty decent while playing around a little. Furthermore it also accepts VFP and NEON instructions – you can find it here.

Another thing I did was contact ARM to ask whether any tool is available, free or commercially, for cores like the A7/A53/A15 or whatever. They responded quite quickly, saying that something like a specific static code analyzer doesn’t exist. They explained that back in time that approach might have made sense, but now, due to the complex dependencies (e.g. complex pipelines, reordering of instructions, branch prediction, multiple outstanding data transactions, multiple physical aliases of architectural registers, automatic speculative loads, etc.), it isn’t applicable any more. He also explained that the cost of instruction dependencies is nowadays usually dwarfed by the cost of external memory accesses and poor cache utilization (might be especially true for non-RISC OS systems…).

One hint he gave is to utilize the PMU (Performance Monitor Unit) that each Cortex-Ax has; it can be configured to count many kinds of events (cycles, instructions executed, cache misses). I remember something like this on x86, like a cycle counter. My question is: can this be easily accessed under RISC OS? If yes, maybe somebody has got a short example code to do that?
Jeffrey Lee (213) 6048 posts
There isn’t any direct support for performance counters in the OS, but it is fairly straightforward to write your own code to access them. FIQprof/profanal is one example – although it’s aimed at profiling CPU activity as a whole rather than individual apps. I think on most cores you can configure the performance counters to be accessible from user mode, so it’s easy to access them directly from the routines you’re interested in.
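As a minimal sketch of a user-mode read of the ARMv7-A cycle counter (PMCCNTR) via GCC inline assembly – assuming privileged code has already enabled the counter (PMCR.E, PMCNTENSET bit 31) and user-mode access (PMUSERENR.EN); on RISC OS that setup would need a small privileged helper, which is not shown here:

```c
#include <stdint.h>

/* Sketch: read the ARMv7-A PMU cycle counter (PMCCNTR).
   Assumes a privileged helper has already set PMCR.E, PMCNTENSET
   bit 31 and PMUSERENR.EN so user mode may read the register.
   The non-ARM branch is only a placeholder so the sketch compiles
   on other targets.                                              */
static inline uint32_t pmu_read_cycles(void)
{
#if defined(__arm__)
    uint32_t cycles;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
#else
    return 0;
#endif
}
```

To time a routine, bracket it with two reads and subtract; the counter is 32-bit and wraps, but unsigned subtraction still gives the right delta for short measurements.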
Jeffrey Lee (213) 6048 posts
Some interesting optimisation tips that I’ve just found: http://pandorawiki.org/Assembly_Code_Optimization The page is aimed at the Cortex-A8, so some care might be needed when applying the tips to other processors (e.g. placing NOPs in between conditional branches).