RISC OS Open: Forum: ARM6 fast multiply/division

Jul 10, 2020 1:01pm

Jeffrey Lee (213) 6048 posts

Although the 1MHz monotonic timer project hasn’t progressed much over the past few months, one thing that I did start to look at recently was the math needed for converting between internal timer values and 1MHz values (and vice-versa). This is essentially a 64bit unsigned int multipled by a 32bit unsigned int (producing a 96bit intermediate value) and then divided by a 32bit unsigned int, producing a 64bit result. It’ll probably have to detect overflow, and there’ll probably have to be some rounding (but I’m not sure yet whether rounding up or rounding to nearest will be the most useful).

Since most machines don’t have hardware divide, I’ve already decided that the division will be handled by multiplication by a reciprocal of the denominator. Actually, both of the 32bit values are going to be constant, so it can just pre-calculate A/B (to sufficient precision), so that the main routine only needs to perform (e.g.) a 64×32 → 96 bit multiply and then shift the result by a power of two.

StrongARM and above have UMULL, so this is all pretty straightforward:

        EXPORT   scale_calc
        
; In: a1 = Numerator (N)
;     a2 = Denominator (D)
; Out: a1 = Scale (X)
;      a2 = Shift (S)
; Where X >= 0x80000000, and X>>S = N/D
; S will be in the range [0,63]
scale_calc
        ; Shfit N & D as far left as possible
        CLZ      ip, a2
        MOV      a4, a2, LSL ip
        CLZ      a3, a1
        SUB      a2, a3, ip
        MOV      a3, a1, LSL a3
        MOV      a1, #0
        ; Now a1 = X, a2 = S, a3 = N, a4 = D
        ; Both N and D have their top bit set
        ; Make sure that N < D
        CMP      a3, a4
        SUBHS    a3, a3, a4
        ADDHS    a1, a1, #1
10
        ; N < D, D >= 0x80000000
        ; There are two interesting possibilities for N:
        ; (1) 0 <= N < 0x80000000
        ; (2) 0x80000000 <= N < D
        ; For (1), a shift left won't overflow, but might cause N >= D
        ; For (2), a shift left will overflow, and will cause N >= D
        MOVS     a3, a3, LSL #1
        CMPCC    a3, a4     ; (1), is N >= D?
        SUBCS    a3, a3, a4 ; Keep N < D
        ADCS     a1, a1, a1
        ADD      a2, a2, #1
        BPL      %BT10
        CMP      a3, #0
        ADDNE    a1, a1, #1 ; Round up. Overflow should be impossible.
        MOV      pc, lr

        EXPORT   mul_asm
        
; In: a1,a2 = 64 bit value
;     a3 = Scale (X)
;     a4 = Shift (S)
; Out: a1,a2 = ((a1,a2) * a3) >> a4
mul_asm
        STMFD    sp!,{v1,v2,lr}
        MOV      ip,a2
        UMULL    a1,a2,a3,a1
        MOV      v1,#0
        UMLAL    a2,v1,a3,ip
        ; 96 bit product {a1,a2,v1}
        CMP      a4,#32
        MOVHS    v2,a1 ; 128 bit product {v2,a1,a2,v1}
        MOVHS    a1,a2 ;
        MOVHS    a2,v1 ;
        MOVHS    v1,#0 ;
        MOVLO    v2,#0
        ; Return {a1,a2} >> (a4 & 31)
        ANDS     a4,a4,#31
        RSBNE    ip,a4,#32
        ORRNE    v2,v2,a1,LSL ip
        MOVNE    a1,a1,LSR a4
        ORRNE    a1,a1,a2,LSL ip
        MOVNE    a2,a2,LSR a4
        ORRNE    a2,a2,v1,LSL ip
        ; Round up
        CMP      v2,#1
        ADCS     a1,a1,#0
        ADC      a2,a2,#0
        LDMFD    sp!,{v1,v2,pc}

But ARM6 & ARM7 don’t have UMULL, and they have terrible MUL performance (up to 17 cycles), so swapping the UMULLs for multiple MULs isn’t going to be great. So for those CPUs the approach will probably be to replace mul_asm with a routine that gets generated by a code generator. Whenever the scale factors change (which will be relatively infrequently) the code generator can be run again to generate the new routine.

The tricky bit is that I’m not sure if there’s an “optimal” code generator available. And this routine is a bit more special than the typical case (e.g. code generators designed for 32×32 multiply or 32/32 division), so even if a generator was available it might not work as well when modified for this case.

I’ve managed to produce an OK-ish generator of my own, but I know it’s possible to do better, and for slow machines like the ARM6 every cycle might matter. So my challenge for you is, can anyone do better than the below?

void mul_gen(scale_t s)
{
    int8_t raw_terms[33] = {0};
    int bits = 0;
    uint32_t scale = s.scale;
    while (!(scale & (1<<bits)))
        bits++;
    int8_t *terms = raw_terms+1;
    bits = 32-bits;
    int shift = -1;
    int dir = 1;
    /* Yuck, we need to special-case the initial inversion */
    if ((scale & 0xc0000000) == 0xc0000000)
    {
        terms[shift] += dir;
        dir = -dir;
        scale = ~scale;
    }
    shift++;
    while (bits > 0)
    {
        if (scale & 0x80000000)
        {
            terms[shift] += dir;
        }
        else if ((scale & 0xe0000000) == 0x60000000)
        {
            terms[shift] += dir;
            dir = -dir;
            scale = ~scale;
        }
        shift++;
        bits--;
        scale = scale<<1;
        if (dir < 0)
        {
            scale |= 1;
        }
    }
    /* Worst case: 17 ops */
    printf("%08x = \n",s.scale);
    uint32_t val = 0;
    for(int i=-1;i<32;i++)
    {
        if (terms[i])
        {
            printf("%c (x>>%d)\n",terms[i] > 0 ? '+' : '-',i);
            if (terms[i] > 0)
            {
                val += 0x80000000u >> i;
            }
            else
            {
                val -= 0x80000000u >> i;
            }
        }
    }
    printf("%08x\n",val);

    printf(" MOV outlo,#&ffffffff ; Bias for rounding\n");
    printf( "MOV outmid,#0\n");
    printf(" MOV outhi,#0\n");
    shift = s.shift-33;
    int cyc=3;
    for(int i=-1;i<32;i++)
    {
        shift++; /* Range now [-32,63] */
        if (!terms[i])
        {
            continue;
        }
        printf(" ; %c= X >> %d\n",terms[i] > 0 ? '+' : '-',shift);
        char *noc = terms[i] > 0 ? "ADD" : "SUB";
        char *c = terms[i] > 0 ? "ADC" : "SBC";
        if (shift > 32)
        {
            printf(" %sS outlo,outlo,hi,LSL #%d\n",noc,64-shift);
            printf(" %sS outmid,outmid,hi,LSR #%d\n",c,shift&31);
            printf(" %s outhi,outhi,#0\n",c);
            cyc+=3;
            printf(" %sS outlo,outlo,lo,LSR #%d\n",noc,shift&31);
            printf(" %sS outmid,outmid,#0\n",c);
            printf(" %s outhi,outhi,#0\n",c);
            cyc+=3;
        }
        else if (shift == 32)
        {
            printf(" %sS outlo,outlo,lo\n",noc);
            printf(" %sS outmid,outmid,hi\n",c);
            printf(" %s outhi,outhi,#0\n",c);
            cyc+=3;
        }
        else if (shift > 0)
        {
            printf(" %sS outmid,outmid,hi,LSL #%d\n",noc,32-shift);
            printf(" %s outhi,outhi,hi,LSR #%d\n",c,shift);
            cyc+=2;
            printf(" %sS outlo,outlo,lo,LSL #%d\n",noc,32-shift);
            printf(" %sS outmid,outmid,lo,LSR #%d\n",c,shift);
            printf(" %s outhi,outhi,#0\n",c);
            cyc+=3;
        }
        else if (shift == 0)
        {
            printf(" %sS outmid,outmid,lo\n",noc);
            printf(" %s outhi,outhi,hi\n",c);
            cyc+=2;
        }
        else if (shift > -32)
        {
            printf(" %sS outmid,outmid,lo,LSL #%d\n",noc,-shift);
            printf(" %s outhi,outhi,lo,LSR #%d\n",c,32+shift);
            printf(" %s outhi,outhi,hi,LSL #%d\n",noc,-shift);
            cyc+=3;
        }
        else
        {
            printf(" %s outhi,outhi,lo\n",noc);
            cyc+=1;
        }
    }
    printf(" MOV lo,outmid\n");
    printf(" MOV hi,outhi\n");
    printf(" MOV pc,lr\n");
    cyc += 2; // Ignore MOV pc,lr in cycle count
    printf("; %d cycles\n",cyc);
}

Which will happily work with this register mapping:

lo      RN       0
hi      RN       1
outlo   RN       2
outmid  RN       3
outhi   RN       12

So, assuming people understood this hastily-written post, can anyone do any better? ;-)

The above just uses a simple sequence of additions/subtractions. More advanced versions would probably extract common sub-expressions, and would eliminate some of the MOVs at the start & end. Also it doesn’t deal with overflow detection.

Jul 10, 2020 3:46pm

Kuemmel (439) 384 posts

Hi Jeffrey, I guess I can’t help without taking a day off and digging into that ;-) …but I instantly remember some code from Peter Teichmann way back in time who had may be one of the fastest div codes. Site is offline but I found it with the wayback machine here and found also similar thing from Jörg Scheurich here . Nevermind if I was on the wrong track here…

Jul 10, 2020 5:27pm

Jeffrey Lee (213) 6048 posts

…but I instantly remember some code from Peter Teichmann way back in time who had may be one of the fastest div codes. Site is offline but I found it with the wayback machine here and found also similar thing from Jörg Scheurich here . Nevermind if I was on the wrong track here…

Those look like they’re all division routines that will cope with arbitrary divisors. Good performance if your divisor is unknown, but in this case we know the same divisor will be used many times, so it’s more efficient to generate custom code each time the divisor changes.

This 3DO documentation (part of a developer’s manual I guess?) has a few pages which talk about the subject: Division by a constant, Multiplication by a constant, ARM6 MUL performance issues. There’s also this research paper which talks about a few different approaches to the problem.

Or for something more RISC OS related, the following Acorn Computing cover discs featured code for fast division (as part of the Tech Forum series):

9409 (September 1994)
9412A (December 1994) (code generator)
9413A (Special 1994) (code generator)
9502 (February 1995) (no code, just an addendum to the article)
9503 (March 1995)

I’m not sure if there’s an online source for the cover discs, but 8bs has scans of most of the issues: http://8bs.com/tmucovers.htm September ’94 is where they lay down the challenge to create a code generator for fast division.

Jul 10, 2020 6:20pm

Kuemmel (439) 384 posts

Ok, now it made click. Interesting stuff, need to read on those. In good old DOS 16 Bit real mode sizecoding it’s very common to use a sequence “mov ax,0xcccd / mul di”, where di would be the screen memory pointer in 320×200×8 Bit. With that ‘magic’ constant you can derive screen x (in 0…255) and y (0…200) in the higher 2 Bytes of the 32 Bit result. Now I see were those consants came from…

ARM6 fast multiply/division

Reply

Search forums

Social

ROOL Store

Donate! Why?

RISC OS IPR

Description

Voices

Options

Jul 10, 2020 1:01pm Jeffrey Lee (213) 6048 posts	Although the 1MHz monotonic timer project hasn’t progressed much over the past few months, one thing that I did start to look at recently was the math needed for converting between internal timer values and 1MHz values (and vice-versa). This is essentially a 64bit unsigned int multipled by a 32bit unsigned int (producing a 96bit intermediate value) and then divided by a 32bit unsigned int, producing a 64bit result. It’ll probably have to detect overflow, and there’ll probably have to be some rounding (but I’m not sure yet whether rounding up or rounding to nearest will be the most useful). Since most machines don’t have hardware divide, I’ve already decided that the division will be handled by multiplication by a reciprocal of the denominator. Actually, both of the 32bit values are going to be constant, so it can just pre-calculate A/B (to sufficient precision), so that the main routine only needs to perform (e.g.) a 64×32 → 96 bit multiply and then shift the result by a power of two. StrongARM and above have UMULL, so this is all pretty straightforward: EXPORT scale_calc ; In: a1 = Numerator (N) ; a2 = Denominator (D) ; Out: a1 = Scale (X) ; a2 = Shift (S) ; Where X >= 0x80000000, and X>>S = N/D ; S will be in the range [0,63] scale_calc ; Shfit N & D as far left as possible CLZ ip, a2 MOV a4, a2, LSL ip CLZ a3, a1 SUB a2, a3, ip MOV a3, a1, LSL a3 MOV a1, #0 ; Now a1 = X, a2 = S, a3 = N, a4 = D ; Both N and D have their top bit set ; Make sure that N < D CMP a3, a4 SUBHS a3, a3, a4 ADDHS a1, a1, #1 10 ; N < D, D >= 0x80000000 ; There are two interesting possibilities for N: ; (1) 0 <= N < 0x80000000 ; (2) 0x80000000 <= N < D ; For (1), a shift left won't overflow, but might cause N >= D ; For (2), a shift left will overflow, and will cause N >= D MOVS a3, a3, LSL #1 CMPCC a3, a4 ; (1), is N >= D? SUBCS a3, a3, a4 ; Keep N < D ADCS a1, a1, a1 ADD a2, a2, #1 BPL %BT10 CMP a3, #0 ADDNE a1, a1, #1 ; Round up. Overflow should be impossible. MOV pc, lr EXPORT mul_asm ; In: a1,a2 = 64 bit value ; a3 = Scale (X) ; a4 = Shift (S) ; Out: a1,a2 = ((a1,a2) * a3) >> a4 mul_asm STMFD sp!,{v1,v2,lr} MOV ip,a2 UMULL a1,a2,a3,a1 MOV v1,#0 UMLAL a2,v1,a3,ip ; 96 bit product {a1,a2,v1} CMP a4,#32 MOVHS v2,a1 ; 128 bit product {v2,a1,a2,v1} MOVHS a1,a2 ; MOVHS a2,v1 ; MOVHS v1,#0 ; MOVLO v2,#0 ; Return {a1,a2} >> (a4 & 31) ANDS a4,a4,#31 RSBNE ip,a4,#32 ORRNE v2,v2,a1,LSL ip MOVNE a1,a1,LSR a4 ORRNE a1,a1,a2,LSL ip MOVNE a2,a2,LSR a4 ORRNE a2,a2,v1,LSL ip ; Round up CMP v2,#1 ADCS a1,a1,#0 ADC a2,a2,#0 LDMFD sp!,{v1,v2,pc} But ARM6 & ARM7 don’t have UMULL, and they have terrible MUL performance (up to 17 cycles), so swapping the UMULLs for multiple MULs isn’t going to be great. So for those CPUs the approach will probably be to replace mul_asm with a routine that gets generated by a code generator. Whenever the scale factors change (which will be relatively infrequently) the code generator can be run again to generate the new routine. The tricky bit is that I’m not sure if there’s an “optimal” code generator available. And this routine is a bit more special than the typical case (e.g. code generators designed for 32×32 multiply or 32/32 division), so even if a generator was available it might not work as well when modified for this case. I’ve managed to produce an OK-ish generator of my own, but I know it’s possible to do better, and for slow machines like the ARM6 every cycle might matter. So my challenge for you is, can anyone do better than the below? void mul_gen(scale_t s) { int8_t raw_terms[33] = {0}; int bits = 0; uint32_t scale = s.scale; while (!(scale & (1<<bits))) bits++; int8_t terms = raw_terms+1; bits = 32-bits; int shift = -1; int dir = 1; / Yuck, we need to special-case the initial inversion / if ((scale & 0xc0000000) == 0xc0000000) { terms[shift] += dir; dir = -dir; scale = ~scale; } shift++; while (bits > 0) { if (scale & 0x80000000) { terms[shift] += dir; } else if ((scale & 0xe0000000) == 0x60000000) { terms[shift] += dir; dir = -dir; scale = ~scale; } shift++; bits--; scale = scale<<1; if (dir < 0) { scale \|= 1; } } / Worst case: 17 ops / printf("%08x = \n",s.scale); uint32_t val = 0; for(int i=-1;i<32;i++) { if (terms[i]) { printf("%c (x>>%d)\n",terms[i] > 0 ? '+' : '-',i); if (terms[i] > 0) { val += 0x80000000u >> i; } else { val -= 0x80000000u >> i; } } } printf("%08x\n",val); printf(" MOV outlo,#&ffffffff ; Bias for rounding\n"); printf( "MOV outmid,#0\n"); printf(" MOV outhi,#0\n"); shift = s.shift-33; int cyc=3; for(int i=-1;i<32;i++) { shift++; / Range now [-32,63] / if (!terms[i]) { continue; } printf(" ; %c= X >> %d\n",terms[i] > 0 ? '+' : '-',shift); char noc = terms[i] > 0 ? "ADD" : "SUB"; char *c = terms[i] > 0 ? "ADC" : "SBC"; if (shift > 32) { printf(" %sS outlo,outlo,hi,LSL #%d\n",noc,64-shift); printf(" %sS outmid,outmid,hi,LSR #%d\n",c,shift&31); printf(" %s outhi,outhi,#0\n",c); cyc+=3; printf(" %sS outlo,outlo,lo,LSR #%d\n",noc,shift&31); printf(" %sS outmid,outmid,#0\n",c); printf(" %s outhi,outhi,#0\n",c); cyc+=3; } else if (shift == 32) { printf(" %sS outlo,outlo,lo\n",noc); printf(" %sS outmid,outmid,hi\n",c); printf(" %s outhi,outhi,#0\n",c); cyc+=3; } else if (shift > 0) { printf(" %sS outmid,outmid,hi,LSL #%d\n",noc,32-shift); printf(" %s outhi,outhi,hi,LSR #%d\n",c,shift); cyc+=2; printf(" %sS outlo,outlo,lo,LSL #%d\n",noc,32-shift); printf(" %sS outmid,outmid,lo,LSR #%d\n",c,shift); printf(" %s outhi,outhi,#0\n",c); cyc+=3; } else if (shift == 0) { printf(" %sS outmid,outmid,lo\n",noc); printf(" %s outhi,outhi,hi\n",c); cyc+=2; } else if (shift > -32) { printf(" %sS outmid,outmid,lo,LSL #%d\n",noc,-shift); printf(" %s outhi,outhi,lo,LSR #%d\n",c,32+shift); printf(" %s outhi,outhi,hi,LSL #%d\n",noc,-shift); cyc+=3; } else { printf(" %s outhi,outhi,lo\n",noc); cyc+=1; } } printf(" MOV lo,outmid\n"); printf(" MOV hi,outhi\n"); printf(" MOV pc,lr\n"); cyc += 2; // Ignore MOV pc,lr in cycle count printf("; %d cycles\n",cyc); } Which will happily work with this register mapping: lo RN 0 hi RN 1 outlo RN 2 outmid RN 3 outhi RN 12 So, assuming people understood this hastily-written post, can anyone do any better? ;-) The above just uses a simple sequence of additions/subtractions. More advanced versions would probably extract common sub-expressions, and would eliminate some of the MOVs at the start & end. Also it doesn’t deal with overflow detection.

Jul 10, 2020 3:46pm Kuemmel (439) 384 posts	Hi Jeffrey, I guess I can’t help without taking a day off and digging into that ;-) …but I instantly remember some code from Peter Teichmann way back in time who had may be one of the fastest div codes. Site is offline but I found it with the wayback machine here and found also similar thing from Jörg Scheurich here . Nevermind if I was on the wrong track here…

Jul 10, 2020 5:27pm Jeffrey Lee (213) 6048 posts	…but I instantly remember some code from Peter Teichmann way back in time who had may be one of the fastest div codes. Site is offline but I found it with the wayback machine here and found also similar thing from Jörg Scheurich here . Nevermind if I was on the wrong track here… Those look like they’re all division routines that will cope with arbitrary divisors. Good performance if your divisor is unknown, but in this case we know the same divisor will be used many times, so it’s more efficient to generate custom code each time the divisor changes. This 3DO documentation (part of a developer’s manual I guess?) has a few pages which talk about the subject: Division by a constant, Multiplication by a constant, ARM6 MUL performance issues. There’s also this research paper which talks about a few different approaches to the problem. Or for something more RISC OS related, the following Acorn Computing cover discs featured code for fast division (as part of the Tech Forum series): 9409 (September 1994) 9412A (December 1994) (code generator) 9413A (Special 1994) (code generator) 9502 (February 1995) (no code, just an addendum to the article) 9503 (March 1995) I’m not sure if there’s an online source for the cover discs, but 8bs has scans of most of the issues: http://8bs.com/tmucovers.htm September ’94 is where they lay down the challenge to create a code generator for fast division.

Jul 10, 2020 6:20pm Kuemmel (439) 384 posts	Ok, now it made click. Interesting stuff, need to read on those. In good old DOS 16 Bit real mode sizecoding it’s very common to use a sequence “mov ax,0xcccd / mul di”, where di would be the screen memory pointer in 320×200×8 Bit. With that ‘magic’ constant you can derive screen x (in 0…255) and y (0…200) in the higher 2 Bytes of the 32 Bit result. Now I see were those consants came from…