Dual-Issue, pipelines, out-of-order, in-order...
Kuemmel (439) 384 posts
Following Jeffrey’s “invitation” ;-) to discuss dual-issue, pipelines, out-of-order and in-order CPU design, I took my chance to post an issue that could be interesting in that sense and keeps me confused… I got some code within my Mandelbrot fixed point math (10.22 format) that looks like this:
…but the opposite is true: the first version is still faster on Cortex A7, A8, A9, A53 and A15 (didn’t test on StrongARM though) and I don’t get why… seems optimizing assembler nowadays is trial and error ;-) E.g. the A53 is an in-order design but can dual-issue… doesn’t seem to help. The A15 is an out-of-order design – also no success… or I’m getting it all wrong. Even ARM’s documents say on dual issue: “There must be no data dependency between the two instructions. That is, the second instruction must not have any source registers that are destination registers of the first instruction.”
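For reference, the 10.22 operation that the SMULL/MOV/ORR sequences in the replies below implement can be sketched in C like this (a minimal sketch – the names are illustrative, not from the original post):

```c
#include <stdint.h>

/* Sketch of 10.22 fixed-point multiplication: 10 integer bits,
   22 fraction bits.  SMULL produces the full 64-bit product; the
   MOV ...,LSR #22 / ORR ...,LSL #10 pair then extracts bits [53:22],
   i.e. the product shifted right by 22.  (The third sequence in the
   thread shifts by 21 instead, presumably folding the doubling of the
   Mandelbrot cross term 2*x*y into the shift.)                       */
typedef int32_t q10_22;

static q10_22 q22_mul(q10_22 a, q10_22 b)
{
    int64_t p = (int64_t)a * (int64_t)b;   /* SMULL: 64-bit product   */
    return (q10_22)(p >> 22);              /* MOV/ORR recombination   */
}
```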
Rick Murray (539) 13840 posts
The A8Time program reports:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    SMULL r3,r9,r1,r1      blocked during multi-cycle op
  2    SMULL (cycle 2)        blocked during multi-cycle op
  3    SMULL (cycle 3)        need pipeline 0 for multiply
  4    SMULL r5,r6,r0,r0      blocked during multi-cycle op
  5    SMULL (cycle 2)        blocked during multi-cycle op
  6    SMULL (cycle 3)        need pipeline 0 for multiply
  7    SMULL r8,r2,r0,r1      blocked during multi-cycle op
  8    SMULL (cycle 2)        blocked during multi-cycle op
  9    SMULL (cycle 3)        MOV r3,r3,LSR #22
 10    ORR r9,r3,r9,LSL #10   MOV r5,r5,LSR #22
 11    ORR r6,r5,r6,LSL #10   wait for r8
 12    wait for r8            wait for r8
 13    MOV r8,r8,LSR #21      wait for r2
 14    ORR r2,r8,r2,LSL #11

And for the second:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    SMULL r3,r9,r1,r1      blocked during multi-cycle op
  2    SMULL (cycle 2)        blocked during multi-cycle op
  3    SMULL (cycle 3)        need pipeline 0 for multiply
  4    SMULL r5,r6,r0,r0      blocked during multi-cycle op
  5    SMULL (cycle 2)        blocked during multi-cycle op
  6    SMULL (cycle 3)        need pipeline 0 for multiply
  7    SMULL r8,r2,r0,r1      blocked during multi-cycle op
  8    SMULL (cycle 2)        blocked during multi-cycle op
  9    SMULL (cycle 3)        MOV r3,r3,LSR #22
 10    ORR r9,r3,r9,LSL #10   MOV r5,r5,LSR #22
 11    ORR r6,r5,r6,LSL #10   wait for r8
 12    wait for r8            wait for r8
 13    MOV r8,r8,LSR #21      wait for r2
 14    ORR r2,r8,r2,LSL #11

It appears that SMULL is costly, blocking the processor for three cycles or five instructions, but not having the result available until seven cycles or thirteen instructions later. Ouch.
Jeffrey Lee (213) 6048 posts
Copy-paste error from Rick – these are the actual second results:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    SMULL r3,r9,r1,r1      blocked during multi-cycle op
  2    SMULL (cycle 2)        blocked during multi-cycle op
  3    SMULL (cycle 3)        need pipeline 0 for multiply
  4    SMULL r5,r6,r0,r0      blocked during multi-cycle op
  5    SMULL (cycle 2)        blocked during multi-cycle op
  6    SMULL (cycle 3)        need pipeline 0 for multiply
  7    SMULL r8,r2,r0,r1      blocked during multi-cycle op
  8    SMULL (cycle 2)        blocked during multi-cycle op
  9    SMULL (cycle 3)        MOV r3,r3,LSR #22
 10    MOV r5,r5,LSR #22      wait for r8
 11    wait for r8            wait for r8
 12    wait for r8            wait for r8
 13    MOV r8,r8,LSR #21      ORR r9,r3,r9,LSL #10
 14    ORR r6,r5,r6,LSL #10   ORR r2,r8,r2,LSL #11

So the second one is slower by half a cycle, plus however long it takes for the ORRs to produce their results. You might think you’d be able to make use of the first two “need pipeline 0 for multiply” bubbles, but SMULL takes too long to calculate its results – it’s only partway through the third SMULL that the results from the first become available.
Jeffrey Lee (213) 6048 posts
Also, if you eliminate the SMULLs then a8time reports both routines as having the same cycle timing – 3 cycles, both pipelines fully occupied. So the first routine looks like this, with the A8 able to forward the result of each MOV to the ORR in the other pipeline:

Cycle  Pipeline 0             Pipeline 1
================================================================================
  1    MOV r3,r3,LSR #22      ORR r9,r3,r9,LSL #10
  2    MOV r5,r5,LSR #22      ORR r6,r5,r6,LSL #10
  3    MOV r8,r8,LSR #21      ORR r2,r8,r2,LSL #11

But if the MOV were something more complex (e.g. an ADD with r3 as destination) then it looks like you’d suffer a stall.
Kuemmel (439) 384 posts
Thanks Jeffrey and Rick – I see, everything is delayed due to the SMULLs (still faster than not using them for the fixed point math…). The Cortex-A8 documentation only quotes 3 cycles for an SMULL, so I guess they didn’t mention the extra result latency there. Only now did I find Ben’s website with much more data. I didn’t know about that interesting tool “a8time”; as far as I can see it’s part of the DDE – might be worth buying it even if it’s a bit outdated with the A8, as I think it still gives some nice hints, as I see here. Does a8time also cover the ARMv6, NEON and VFP instructions?
Jeffrey Lee (213) 6048 posts
It supports all the ARM instructions that the A8 does. However, it doesn’t support NEON. I’m not sure about VFP.
Rick Murray (539) 13840 posts
Except SWI/SVC and B/BL, it seems. While it is not possible to give a meaningful reply for a SWI call, it would be interesting to see what effect this and branching have on a program. If branching is expensive, an unrolled loop might be better, for instance.
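The unrolling idea can be sketched in plain C (a hypothetical example, not from the thread): processing four elements per iteration quarters the number of loop branches the hardware has to predict.

```c
#include <stddef.h>

/* Hypothetical 4x-unrolled sum: one loop branch per four elements
   instead of one per element.  Assumes n is a multiple of 4; a real
   version would add a cleanup loop for the remainder.  The separate
   accumulators also break the add-to-add dependency chain, which
   helps dual-issue in the same way as the MOV/ORR interleaving above. */
static int sum_unrolled(const int *a, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```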
Jeffrey Lee (213) 6048 posts
B & BL will be subject to branch prediction, so the cost of the branch will be dependent almost entirely on how well the prediction is working – see Ben’s cycle timings page for some details. Odd that he mentions that “Tight loops in particular often have extra stalls” – you’d assume the hardware could fully predict the branch to be taken. But maybe it can only predict one branch ahead, so if the loop length is shorter than the pipeline length you’ll be in trouble? Not sure offhand if the A8 is able to do anything special for SWI.
Kuemmel (439) 384 posts
Meanwhile I got some interesting additions to the topic. I got my copy of A8Time in the bundle of Nut Pi – quite nice, but as Jeffrey said and suspected, it doesn’t cover NEON and VFP. Luckily a similar tool exists for free online; I don’t know about its quality, but it seems pretty decent while playing around a little. Furthermore it also accepts VFP and NEON instructions – you can find it here.

Another thing I did was contact ARM to ask whether any tool is available, free or commercially, for cores like the A7/A53/A15 or whatever. They responded quite quickly, saying that something like a specific static code analyzer doesn’t exist. They explained that back in time that approach might have made sense, but now, due to the complex dependencies (e.g. complex pipelines, reordering of instructions, branch prediction, multiple outstanding data transactions, multiple physical aliases of architectural registers, automatic speculative loads, etc.), it isn’t applicable any more. He also explained that the cost of instruction dependencies is nowadays usually dwarfed by the cost of external memory accesses and poor cache utilization (might be especially true for non-RISC OS systems…).

One hint he gave is to utilize the PMU (Performance Monitor Unit) that each Cortex-Ax has; it can be configured to count many kinds of events (cycles, instructions executed, cache misses). I remember something like this on x86, like a cycle counter. My question is: can this be easily accessed under RISC OS? If yes, maybe somebody has got a short example code to do that?
Jeffrey Lee (213) 6048 posts
There isn’t any direct support for performance counters in the OS, but it is fairly straightforward to write your own code to access them. FIQprof/profanal is one example – although it’s aimed at profiling CPU activity as a whole rather than individual apps. I think on most cores you can configure the performance counters to be accessible from user mode, so it’s easy to access them directly from the routines you’re interested in.
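As a minimal sketch of a user-mode read of the ARMv7-A cycle counter (PMCCNTR) via GCC inline assembly – assuming privileged code has already enabled the counter (PMCR.E, PMCNTENSET bit 31) and user-mode access (PMUSERENR.EN); on RISC OS that setup would need a small privileged helper, which is not shown here:

```c
#include <stdint.h>

/* Sketch: read the ARMv7-A PMU cycle counter (PMCCNTR).
   Assumes a privileged helper has already set PMCR.E, PMCNTENSET
   bit 31 and PMUSERENR.EN so user mode may read the register.
   The non-ARM branch is only a placeholder so the sketch compiles
   on other targets.                                              */
static inline uint32_t pmu_read_cycles(void)
{
#if defined(__arm__)
    uint32_t cycles;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
#else
    return 0;
#endif
}
```

To time a routine, bracket it with two reads and subtract; the counter is 32-bit and wraps, but unsigned subtraction still gives the right delta for short measurements.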
Jeffrey Lee (213) 6048 posts
Some interesting optimisation tips that I’ve just found: http://pandorawiki.org/Assembly_Code_Optimization The page is aimed at the Cortex-A8, so some care might be needed when applying the tips to other processors (e.g. placing NOPs in between conditional branches).