Does anyone have a fast AVR assembler algorithm to divide a 32-bit unsigned number by 10 with remainder, i.e. as required for simple decimal output conversion? The traditional shift and subtract method takes about 400 clock cycles, so anything significantly faster than this would be useful.
Efficient 32-bit divide by 10 with remainder
Well, divide by 10 is the same as multiply by 0.1, I guess, and the binary representation of 0.1 is 0.0001100110011001100110011001100110011001100110011..(recurring), which is basically saying that 1/10th of a number is 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + etc.
So take your number, shift it right 4 places (1/16) and put that into an "accumulator". Then shift it once more (1/32) and add that to the accumulator; then shift 3 more times and add, once more and add, 3 more times and add, once more and add, and so on - stopping once the shifted value reaches 0. At that point you should have 1/10th (ish!) in the accumulator.
Having said all that, I only just thought it up and cannot guarantee that it WILL work, but I don't see why not. If it does work then the chances are I just re-invented the wheel!
Cliff
PS Just noticed you need a remainder too - it's not immediately obvious to me how that one "falls out" I'm afraid (well it is Monday morning!)
PPS Except to notice that x10 decimal is x1010 binary, so multiplying the quotient back by 10 is just (shift left 1) + (shift left 3). Do that on the result, then subtract from the original to get the remainder.
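For what it's worth, here is a rough C sketch of the approach Cliff describes (the function name and the fix-up loop are illustrative additions, not anything from the thread; the truncating shifts under-estimate the quotient slightly, so the correction loop at the end recovers the exact quotient and remainder):

#include <stdint.h>

/* Divide a by 10 by adding shifted copies of a following the binary
 * expansion 0.000110011001100...b of 1/10, then fix up the truncation
 * error.  Illustrative sketch only. */
void div10_shift_add(uint32_t a, uint32_t *q, uint32_t *r)
{
    uint32_t acc = 0;
    uint32_t t = a >> 4;                /* 1/16 */

    while (t != 0) {
        acc += t;  t >>= 1;             /* adds 1/16, 1/256, 1/4096, ... */
        acc += t;  t >>= 3;             /* adds 1/32, 1/512, 1/8192, ... */
    }

    /* acc is now a slight under-estimate of a/10; correct it. */
    uint32_t rem = a - ((acc << 3) + (acc << 1));   /* a - 10*acc */
    while (rem >= 10) {
        rem -= 10;
        acc += 1;
    }
    *q = acc;
    *r = rem;
}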
Trouble is, 1/10 is a recurring binary fraction, so how far down do I have to carry out this process? Down to 32 bits, I suppose, which means at least 16 32-bit additions, not to mention lots of 32-bit shifts. And I would need a lot of extra registers to store partial results. And then of course I have to multiply back and subtract to get the remainder - not too bad using the MUL instruction, but I still think the whole process would take longer (and use a lot more code) than the 400-odd cycles for the conventional method.
I thought I had a solution by using the MUL instruction to multiply each byte by 25 or 26, take the MS byte of the result, multiply back by 10 and correct the result by +/-1 - messy but possibly faster than bit-by-bit - then I realised that after the first byte I would have a carry and so would have to divide a 2-byte number in the range 0-2559 by 10, and my method didn't seem such a good idea any more.
I just wondered whether, back in the old days of very restricted instruction sets (and no hardware multiply or divide), the pioneers of programming had devised a cunning method of decimal conversion that I could apply to one of the restricted instruction sets of today. It's not looking very promising - even Knuth seems to bypass the problem by imagining a computer that works in decimal.
It's not too much of a big deal. A Mega168 running at 20MHz can convert five numbers to decimal in less than a millisecond - it would just be neat to get it down a bit faster.
PS. Just thought of a possible optimisation of your idea: Start by multiplying the original number by 3 (i.e. shift left 1 and add the original number - though of course the number might now be 34 bits!), then you need to multiply by 0.00001000100010001...b (i.e. 1/30). This mostly involves only shifting by 4 (after the first shift), which can be done quickly using the SWAP instruction. I still think it would be slower to do the whole 32 bits at once like this, but maybe to divide 0-2559 by 10 and process a byte at a time (applicable to numbers of indefinite length) ... hmmm, needs more thought.
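A sketch of how that PS might look in C (names are illustrative, not from the thread; note that 3*A can be 34 bits wide, so a 64-bit intermediate is used here, which is exactly the extra register cost it would carry on the AVR):

#include <stdint.h>

/* "Multiply by 3, then multiply by 1/30": 1/10 = 3 * (1/30), and
 * 1/30 = 2^-5 + 2^-9 + 2^-13 + ..., so after the first shift every
 * further step is a shift by 4 (SWAP-friendly on the AVR). */
void div10_times3(uint32_t a, uint32_t *q, uint32_t *r)
{
    uint64_t t = (uint64_t)a * 3;       /* can be up to 34 bits wide */
    uint32_t acc = 0;

    t >>= 5;                            /* first term: (3a) * 2^-5 */
    while (t != 0) {
        acc += (uint32_t)t;
        t >>= 4;                        /* each later term is 1/16 of the previous */
    }

    uint32_t rem = a - 10u * acc;       /* acc slightly under-estimates a/10 */
    while (rem >= 10) {                 /* short correction loop */
        rem -= 10;
        acc += 1;
    }
    *q = acc;
    *r = rem;
}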
Take a look at this project (16-bit divide) and see if you can expand it to 32 bits?
timbierman,
Yup for a full 32 bit number you would need quite a few shift/adds (15 worst case I think) but on smaller numbers, because you break out as soon as the original goes to 0 it could break much earlier. So the length of time to execute will be dependent on how big the original number is.
Cliff
Do you mean the "16/16 bit division" thread? I've looked there and it just seems to point to app note 200, which describes the traditional shift and subtract methods from antiquity. What I am looking for is a short cut, particularly in the case of dividing by ten, if it exists.
Or am I missing an attachment to your post?
You may want to have a look at
https://www.avrfreaks.net/index.php?module=FreaksAcademy&func=viewItem&item_id=131&item_type=project
( Div16_xx Math Lib)
It's a performance-optimized implementation of 'division by reciprocal multiplication', based on Mr. Jones's excellent primer on this stuff: http://www.cs.uiowa.edu/~jones/bcd/divide.html
:)
I'm not sure how you'd decide that your result is going to be zero when you are shifting and adding - the next add might well make it non-zero. However, your general point is good: most numbers will be no more than 24, 16 or even 8 bits to begin with, and during decimal output the number will always shrink as conversion proceeds. I can skip 8, 16 or 24 times around the shift-subtract loop by shifting leading zeros off a byte at a time at the beginning.
Not quite the magic bullet I was looking for, but it could easily cut execution time by about half.
Always useful to talk these things through with someone else :)
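Something like the following, perhaps - a C sketch of the plain restoring divide with the leading-zero-byte skip bolted on (the structure and names are illustrative, not actual AVR code from the thread):

#include <stdint.h>

/* Conventional restoring shift-and-subtract divide by 10, with whole leading
 * zero bytes skipped first so small numbers don't pay for 32 loop passes. */
void div10_shift_sub(uint32_t a, uint32_t *q, uint8_t *r)
{
    uint8_t bits = 32;
    while (bits > 8 && (a >> 24) == 0) {
        a <<= 8;                 /* shift a whole zero byte off the top */
        bits -= 8;               /* ...and skip 8 passes of the loop below */
    }

    uint32_t quot = 0;
    uint8_t  rem  = 0;
    for (uint8_t i = 0; i < bits; i++) {
        rem = (uint8_t)((rem << 1) | (a >> 31));  /* bring in the next bit */
        a <<= 1;
        quot <<= 1;
        if (rem >= 10) {         /* restoring step: does 10 "go into" rem? */
            rem -= 10;
            quot |= 1;
        }
    }
    *q = quot;
    *r = rem;
}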
@Tim:
;***************************************************************************
;*
;* Function "Div16_10"
;* Divides an unsigned 16 bit word (XH:XL) by 10
;* Returns quotient in YH:YL and remainder in XL
;*
;* Author: Andreas Lenze (andreas.lenze@t-online.de)
;* Equations by D. W. Jones:
;*
;* Reciprocal mul w. extra precision:
;*   unsigned int A;
;*   unsigned int scaled_reciprocal = 0xCCCD;
;*   unsigned int Q;          /* the quotient */
;*
;*   Q = ((A * 0xCCCD) >> 19)
;*
;* Uses: high regs: 7 (r17, r18, r19, X, Y)
;*       low regs:  3 (r0, r1, r2)
;*
;* words:  36 (w. push/pop = 8 words)
;* cycles: 46 (w. push/pop = 16 cycles)
;*
;* Note: Hardware multiplier required ("mul" instruction)
;*
;***************************************************************************
Div16_10:
        push    r2
        push    r19
        push    r18
        push    r17

        ldi     YH,0xCC         ; scaled reciprocal for /10
        ldi     YL,0xCD

; Q = A * 0xCCCD
; (r19:r18:r17[:rXX] = XH:XL * YH:YL)
        clr     r2
        mul     XH, YH          ; ah * bh
        movw    r19:r18, r1:r0
        mul     XL, YL          ; al * bl
        mov     r17,r1          ; r0 to [rXX] is superfluous
        mul     XH, YL          ; ah * bl
        add     r17, r0
        adc     r18, r1
        adc     r19, r2
        mul     YH, XL          ; bh * al
        add     r17, r0
        adc     r18, r1
        adc     r19, r2

; Q = Q >> 16: use r19:r18 as word
; Q = Q >> 3
        lsr     r19             ; do the last 3 shifts
        ror     r18
        lsr     r19
        ror     r18
        lsr     r19
        ror     r18
; r19:r18 now "Q" (= result >> 19)

; R = A - 10*Q;
        ldi     r17,10          ; multiply r19:r18 by 10
        mul     r18, r17        ; al * bl
        sub     XL,r0
        clr     XH
        movw    YL,r18          ; make copy of Q

; XL holds "R"
; YH:YL holds "Q"
        pop     r17
        pop     r18
        pop     r19
        pop     r2
        ret
;**** End of function Div16_10 -----------------------------------------****
This is from above mentioned 'academy project' - uses 30 cycles (w/o push/pop) for a 16 bit divide. If you're looking for 'magic bullets', this *could* be a starting point for further research.
Thanks - that's the sort of thing I was looking for. I'll have to see how easily it generalises to 32 bits. My first concern is that I might need a second bank of 4 registers to accumulate what is basically a 64-bit result. The program in which I need to do this has most of the register bank permanently allocated to a time-critical interrupt routine, so I can't borrow registers temporarily for foreground tasks. Still, it may be possible.
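For reference, the 32-bit analogue of the 0xCCCD trick uses the constant 0xCCCCCCCD = ceil(2^35/10); the rounding error stays below one for every 32-bit input, so the result is exact with no fix-up step. A C sketch (function name is mine; the 64-bit product is what costs the registers - on the AVR it would be assembled from 8x8 MULs much like the 16-bit routine above, and only the top bits of the product are actually needed):

#include <stdint.h>

/* Exact 32-bit divide by 10 via reciprocal multiplication:
 * q = floor(a * 0xCCCCCCCD / 2^35), i.e. the high word of a 32x32
 * multiply shifted right by 3 more places. */
static inline uint32_t div10_32(uint32_t a, uint32_t *rem)
{
    uint32_t q = (uint32_t)(((uint64_t)a * 0xCCCCCCCDu) >> 35);
    *rem = a - q * 10u;                 /* multiply back for the remainder */
    return q;
}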
Sorry, timbierman, I totally failed to paste the link to the project I wanted to tip You about. All is well that ends well though as alenze posted a link to the very same project (the reciprocal division stuff). Hope I didn't cause too much confusion!
No problem. I don't have time to check out a 32-bit version of the routine in the project right now, but it looks promising. So far I've managed to knock 50% off total message formatting time by using the multiply-by-reciprocal method for 8-bit divisions, and optimising 32-bit divisions where the top 1, 2 or 3 bytes are zero. Not a bad day's work so far.
(This may all seem like cheese-paring, but the messages I need to convert come in at anything up to one every 2 milliseconds, so shaving half a millisecond off processing is a big improvement).
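For completeness, an 8-bit reciprocal divide by 10 boils down to a single hardware MUL: q = (n * 205) >> 11, where 205 = ceil(2^11/10). A sketch (my choice of constant and code, not necessarily what the library or timbierman actually uses):

#include <stdint.h>

/* 8-bit divide by 10 with one 8x8 multiply: on the AVR this is a MUL,
 * then the high byte of the 16-bit product shifted right three times. */
static inline uint8_t div10_8(uint8_t n, uint8_t *rem)
{
    uint8_t q = (uint8_t)(((uint16_t)n * 205u) >> 11);
    *rem = (uint8_t)(n - q * 10u);
    return q;
}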