ARM6 fast multiply/division
Jeffrey Lee (213) 6048 posts |
Although the 1MHz monotonic timer project hasn’t progressed much over the past few months, one thing that I did start to look at recently was the math needed for converting between internal timer values and 1MHz values (and vice-versa). This is essentially a 64bit unsigned int multipled by a 32bit unsigned int (producing a 96bit intermediate value) and then divided by a 32bit unsigned int, producing a 64bit result. It’ll probably have to detect overflow, and there’ll probably have to be some rounding (but I’m not sure yet whether rounding up or rounding to nearest will be the most useful). Since most machines don’t have hardware divide, I’ve already decided that the division will be handled by multiplication by a reciprocal of the denominator. Actually, both of the 32bit values are going to be constant, so it can just pre-calculate A/B (to sufficient precision), so that the main routine only needs to perform (e.g.) a 64×32 → 96 bit multiply and then shift the result by a power of two. StrongARM and above have UMULL, so this is all pretty straightforward: EXPORT scale_calc ; In: a1 = Numerator (N) ; a2 = Denominator (D) ; Out: a1 = Scale (X) ; a2 = Shift (S) ; Where X >= 0x80000000, and X>>S = N/D ; S will be in the range [0,63] scale_calc ; Shfit N & D as far left as possible CLZ ip, a2 MOV a4, a2, LSL ip CLZ a3, a1 SUB a2, a3, ip MOV a3, a1, LSL a3 MOV a1, #0 ; Now a1 = X, a2 = S, a3 = N, a4 = D ; Both N and D have their top bit set ; Make sure that N < D CMP a3, a4 SUBHS a3, a3, a4 ADDHS a1, a1, #1 10 ; N < D, D >= 0x80000000 ; There are two interesting possibilities for N: ; (1) 0 <= N < 0x80000000 ; (2) 0x80000000 <= N < D ; For (1), a shift left won't overflow, but might cause N >= D ; For (2), a shift left will overflow, and will cause N >= D MOVS a3, a3, LSL #1 CMPCC a3, a4 ; (1), is N >= D? SUBCS a3, a3, a4 ; Keep N < D ADCS a1, a1, a1 ADD a2, a2, #1 BPL %BT10 CMP a3, #0 ADDNE a1, a1, #1 ; Round up. Overflow should be impossible. MOV pc, lr EXPORT mul_asm ; In: a1,a2 = 64 bit value ; a3 = Scale (X) ; a4 = Shift (S) ; Out: a1,a2 = ((a1,a2) * a3) >> a4 mul_asm STMFD sp!,{v1,v2,lr} MOV ip,a2 UMULL a1,a2,a3,a1 MOV v1,#0 UMLAL a2,v1,a3,ip ; 96 bit product {a1,a2,v1} CMP a4,#32 MOVHS v2,a1 ; 128 bit product {v2,a1,a2,v1} MOVHS a1,a2 ; MOVHS a2,v1 ; MOVHS v1,#0 ; MOVLO v2,#0 ; Return {a1,a2} >> (a4 & 31) ANDS a4,a4,#31 RSBNE ip,a4,#32 ORRNE v2,v2,a1,LSL ip MOVNE a1,a1,LSR a4 ORRNE a1,a1,a2,LSL ip MOVNE a2,a2,LSR a4 ORRNE a2,a2,v1,LSL ip ; Round up CMP v2,#1 ADCS a1,a1,#0 ADC a2,a2,#0 LDMFD sp!,{v1,v2,pc} But ARM6 & ARM7 don’t have UMULL, and they have terrible MUL performance (up to 17 cycles), so swapping the UMULLs for multiple MULs isn’t going to be great. So for those CPUs the approach will probably be to replace mul_asm with a routine that gets generated by a code generator. Whenever the scale factors change (which will be relatively infrequently) the code generator can be run again to generate the new routine. The tricky bit is that I’m not sure if there’s an “optimal” code generator available. And this routine is a bit more special than the typical case (e.g. code generators designed for 32×32 multiply or 32/32 division), so even if a generator was available it might not work as well when modified for this case. I’ve managed to produce an OK-ish generator of my own, but I know it’s possible to do better, and for slow machines like the ARM6 every cycle might matter. So my challenge for you is, can anyone do better than the below? void mul_gen(scale_t s) { int8_t raw_terms[33] = {0}; int bits = 0; uint32_t scale = s.scale; while (!(scale & (1<<bits))) bits++; int8_t *terms = raw_terms+1; bits = 32-bits; int shift = -1; int dir = 1; /* Yuck, we need to special-case the initial inversion */ if ((scale & 0xc0000000) == 0xc0000000) { terms[shift] += dir; dir = -dir; scale = ~scale; } shift++; while (bits > 0) { if (scale & 0x80000000) { terms[shift] += dir; } else if ((scale & 0xe0000000) == 0x60000000) { terms[shift] += dir; dir = -dir; scale = ~scale; } shift++; bits--; scale = scale<<1; if (dir < 0) { scale |= 1; } } /* Worst case: 17 ops */ printf("%08x = \n",s.scale); uint32_t val = 0; for(int i=-1;i<32;i++) { if (terms[i]) { printf("%c (x>>%d)\n",terms[i] > 0 ? '+' : '-',i); if (terms[i] > 0) { val += 0x80000000u >> i; } else { val -= 0x80000000u >> i; } } } printf("%08x\n",val); printf(" MOV outlo,#&ffffffff ; Bias for rounding\n"); printf( "MOV outmid,#0\n"); printf(" MOV outhi,#0\n"); shift = s.shift-33; int cyc=3; for(int i=-1;i<32;i++) { shift++; /* Range now [-32,63] */ if (!terms[i]) { continue; } printf(" ; %c= X >> %d\n",terms[i] > 0 ? '+' : '-',shift); char *noc = terms[i] > 0 ? "ADD" : "SUB"; char *c = terms[i] > 0 ? "ADC" : "SBC"; if (shift > 32) { printf(" %sS outlo,outlo,hi,LSL #%d\n",noc,64-shift); printf(" %sS outmid,outmid,hi,LSR #%d\n",c,shift&31); printf(" %s outhi,outhi,#0\n",c); cyc+=3; printf(" %sS outlo,outlo,lo,LSR #%d\n",noc,shift&31); printf(" %sS outmid,outmid,#0\n",c); printf(" %s outhi,outhi,#0\n",c); cyc+=3; } else if (shift == 32) { printf(" %sS outlo,outlo,lo\n",noc); printf(" %sS outmid,outmid,hi\n",c); printf(" %s outhi,outhi,#0\n",c); cyc+=3; } else if (shift > 0) { printf(" %sS outmid,outmid,hi,LSL #%d\n",noc,32-shift); printf(" %s outhi,outhi,hi,LSR #%d\n",c,shift); cyc+=2; printf(" %sS outlo,outlo,lo,LSL #%d\n",noc,32-shift); printf(" %sS outmid,outmid,lo,LSR #%d\n",c,shift); printf(" %s outhi,outhi,#0\n",c); cyc+=3; } else if (shift == 0) { printf(" %sS outmid,outmid,lo\n",noc); printf(" %s outhi,outhi,hi\n",c); cyc+=2; } else if (shift > -32) { printf(" %sS outmid,outmid,lo,LSL #%d\n",noc,-shift); printf(" %s outhi,outhi,lo,LSR #%d\n",c,32+shift); printf(" %s outhi,outhi,hi,LSL #%d\n",noc,-shift); cyc+=3; } else { printf(" %s outhi,outhi,lo\n",noc); cyc+=1; } } printf(" MOV lo,outmid\n"); printf(" MOV hi,outhi\n"); printf(" MOV pc,lr\n"); cyc += 2; // Ignore MOV pc,lr in cycle count printf("; %d cycles\n",cyc); } Which will happily work with this register mapping: lo RN 0 hi RN 1 outlo RN 2 outmid RN 3 outhi RN 12 So, assuming people understood this hastily-written post, can anyone do any better? ;-) The above just uses a simple sequence of additions/subtractions. More advanced versions would probably extract common sub-expressions, and would eliminate some of the MOVs at the start & end. Also it doesn’t deal with overflow detection. |
Kuemmel (439) 384 posts |
Hi Jeffrey, I guess I can’t help without taking a day off and digging into that ;-) …but I instantly remember some code from Peter Teichmann way back in time who had may be one of the fastest div codes. Site is offline but I found it with the wayback machine here and found also similar thing from Jörg Scheurich here . Nevermind if I was on the wrong track here… |
Jeffrey Lee (213) 6048 posts |
Those look like they’re all division routines that will cope with arbitrary divisors. Good performance if your divisor is unknown, but in this case we know the same divisor will be used many times, so it’s more efficient to generate custom code each time the divisor changes. This 3DO documentation (part of a developer’s manual I guess?) has a few pages which talk about the subject: Division by a constant, Multiplication by a constant, ARM6 MUL performance issues. There’s also this research paper which talks about a few different approaches to the problem. Or for something more RISC OS related, the following Acorn Computing cover discs featured code for fast division (as part of the Tech Forum series):
I’m not sure if there’s an online source for the cover discs, but 8bs has scans of most of the issues: http://8bs.com/tmucovers.htm September ’94 is where they lay down the challenge to create a code generator for fast division. |
Kuemmel (439) 384 posts |
Ok, now it made click. Interesting stuff, need to read on those. In good old DOS 16 Bit real mode sizecoding it’s very common to use a sequence “mov ax,0xcccd / mul di”, where di would be the screen memory pointer in 320×200×8 Bit. With that ‘magic’ constant you can derive screen x (in 0…255) and y (0…200) in the higher 2 Bytes of the 32 Bit result. Now I see were those consants came from… |