Is there a simple way to round a 24 bit number to the nearest 100 (ex: 1234567 gives 1234600, 14445970 gives 1446000, 223344 gives 223300).

This is using avr assembler. I suppose I may have to resort to some large mult or divide routines again.

Author

Message

Is there a simple way to round a 24 bit number to the nearest 100 (ex: 1234567 gives 1234600, 14445970 gives 1446000, 223344 gives 223300).

This is using avr assembler. I suppose I may have to resort to some large mult or divide routines again.

value = value + 50

value = value - (value mod 100)

You could easily round to the nearest 128...

Or you could do it during printing, when you're dividing anyway.

There are assorted "interesting" algorithms for dividing or multiplying by a constant. This "divide by 10" function (from "Hacker's Delight") looks particularly interesting because of all the shifts that are multiples of 8bits.

unsigned divu10(unsigned n) { unsigned q, r; q = (n >> 1) + (n >> 2); q = q + (q >> 4); q = q + (q >> 8); q = q + (q >> 16); q = q >> 3; r = n - q*10; return q + ((r + 6) >> 4); // return q + (r > 9); }

To round to the nearest 100 wiould require doing this twice.

I wonder if ther is something even trickier & shorter. I did get my program working using a long divide & long multiply (add 50, div by 100 & mult by 100), so now its more of an interesting quetion.

Surely it depends on the rounding you need.

http://en.wikipedia.org/wiki/Rounding#Types_of_rounding

Quote:

Mod is simply a divide where you keep the remainder instead of the result. So if you already have the (add 50, div by 100 & mult by 100) solution, then you only need a rearrangement of the code to get my solution which saves you the multiply.
Is there an easy way to do the mod?

Quote:

There was similar "divide by 100" code, but it didn't look like it would optimize as well for a CPU without a barrel shifter.
I wonder if ther is something even trickier & shorter.

Here is some pseudocode that might help.

The loops should be completely unrolled.

Use a macro for the big one.

I think it will come out to 65 instructions total.

hi, mid, lo=src for mask=0x01; mask; mask<<=1 if hi & mask lo+=mask*36%100 if carry lo+=56 if mid & mask lo+=mask*56%100 if carry lo+=56 while lo>=100 lo-=100 if lo>=50 result=src+(100-lo) else result=src-lo

It just rounds to a multiple of 100.

If you want decimal digits, this won't give them to you.

If you HAVE modulo, you'd just do

r = (v+50) - ((v+50)%100)

But Modulo is usually just as expensive and long as division.

Where do 36 and 56 come from?

If you HAVE modulo, you'd just do

r = (v+50) - ((v+50)%100)

But Modulo is usually just as expensive and long as division.

Where do 36 and 56 come from?

Once you unroll the big loop, mask becomes constants and the assembler can do the percent work.

That is why the big loop unrolls to only 65 instructions.

Quote:

I think it will come out to 65 instructions total.

Quote:

That is why the big loop unrolls to only 65 instructions.

Quote:

But Modulo is usually just as expensive and long as division.

+1 on the last...riddle me that--how are the two %100 each loop done "free"?

Quote:

But Modulo is usually just as expensive and long as division.

+1 on the last...riddle me that--how are the two %100 each loop done "free"?

Once you unroll the big loop, mask becomes constants and the assembler can do the percent work.

Ah. I get it. p Part of it, anyway...

When you said to unroll the loop, you meant to expand all the unrolled constants as well.

if hi & 1 lo += (1*36)%100 if carry lo +=56 if mid & 1 lo += (1*56)%100 : ; next unrolled iteration if hi & 2 lo += (2*36)%100 if carry lo += 56 if mid & 2 lo += (2*56)%100 :

That should work out pretty well using .irp and such; perhaps I'll give it a try. Thanks for the hints!

mid shouldn't nest inside hi.

Hmm. Does this look right? It expands to a lot more than 65 instructions (100+); don't forget that you can't do "sbic sreg, x" (that would have helped.)

.irp mask, 1, 2, 4, 8, 16, 32, 64, 128 andi NH, mask breq 0f subi NL, -((36*mask)%100) brcc 0f subi NL, -56 0: andi NM, mask breq 1f subi NL, -((56*mask)%100) brcc 1f subi NL, -56 1: .endr

Quote:

Modulo is division, only what you keep as the return value is different.
But Modulo is usually just as expensive and long as division.

The .irp code seems like it would work.

It expands to 80 instructions.

One can get it down to 65 if one realizes that my nested ifs do not need to be nested.

@Koshchi: long division naturally produces a remainder,

but not all division is done that way, especially division by a constant.

Size, AVG speed, worst case speed, easy to understand?

I don't like the +50 version because it can overflow.

If it's worst case speed (and a AVR with HW mul) I would go for some ASM code that do the division with a mul by 1/100.

mul by 167772 and forget about the lower 24 bit, could be a way.

Quote:

I don't like the +50 version because it can overflow.

Surely not unless the round operation itself would actually overflow?

Quote:I don't like the +50 version because it can overflow.

Surely not unless the round operation itself would actually overflow?

2**24-1+50 will overflow.

BTW with a slightly different algorithm, my big loop can be replaced by 57 instructions.

With hardware multiply and careful ordering,

my big loop can be replaced by 39 words and 41 cycles.

Debugging code you don't understand is difficult!

It's a good thing that this comment was there: "; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0", cause that's not what the first few lines actually calculated.

Here is a corrected version that should be callable from C (with a "long" as argument, where we only use 24 bits.) It worked for all powers of 2. It's still not perfect, though; I'll try to do additional debugging, but am hoping something jumps out. First errors:

***ERROR***Rounding 228 Got: 100 ***ERROR***Rounding 229 Got: 100 ***ERROR***Rounding 230 Got: 100

.global round100 round100: ; rounds input to multiple of 100 ; input src2, src1, src0 ; output src2, src1, src0 ;;; If we use r24:r22, we should be compatible with C sending us a long ;;; (and sending it back.) #define src2 r24 #define src1 r23 #define src0 r22 ; they need not be upper registers ; clears r1 (and r0) push r0 push r1 ; 35 words, up to 40 cycles ; temporaries s56, s36, sum0 need to be upper registers ; s56, s36 could be the same register #define s36 r26 #define s56 r26 #define sum0 r27 ; temporaries s0, sum1 do not #define s0 r18 #define sum1 r19 ;;; convenience macros .macro addi r, c subi \r, -(\c) .endm .macro cmpi r, c cpi \r, \c .endm ldi s36, 36 ; 2**16 mod 100 mul src2, s36 mov sum0, r0 mov sum1, r1 ldi s56, 56 ; 2**8 mod 100 mul src1, s56 add sum0, r0 adc sum1, r1 ; <=91 clr s0 add sum0, src0 adc sum1, s0 ; <= 92 ; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0 ; r1, sum0 <- 36*sum1 + sum0 mul sum1, s56 add sum0, r0 adc r1, s0 ; <= 21 ; r1, sum0 <- 36*r1 + sum0 mul r1, s56 add sum0, r0 adc r1, s0 ; <= 4 ; r1, sum0 <- 36*r1 + sum0 mul r1, s56 ; clears r1 add sum0, r0 brcc 1f addi sum0, 56 brcc 1f addi sum0, 56 1: sbrc sum0, 7 ; 128 subi sum0, 100 ; <=155 cmpi sum0, 100 brlt .+2 subi sum0, 100 ; <=99 ; s0, s0, sum0 <- sum0 or sum0-100 cmpi sum0, 50 brlt .+4 subi sum0, 100 dec s0 ; s0 repurposed as upper 2 bytes of subtrahend ; src2, src1, src0 -= s0, s0, sum0 sub src0, sum0 sbc src1, s0 sbc src2, s0 pop r1 pop r0 ret

And the Arduino test sketch. (no, Arduino doesn't like .S files. I let the compile fail, then manually link in the assembled .S file and subsequent upload steps.)

void setup() { Serial.begin(9600); } extern "C" unsigned long round100(unsigned long); unsigned long n; void loop() { unsigned long r; unsigned char i = 0; for (n=1; n < (1L<<24); n+=1) { i++; r = round100(n); if (r != ((n+50)/100L)*100L) { Serial.print("***ERROR***"); i=0; } if (i == 0) { Serial.print("Rounding "); Serial.println(n); Serial.print("Got: "); Serial.println(round100(n)); } } Serial.println("Done!"); while (Serial.read() < 0) ; }

In this section, the comments say *36, but the code has *56. I get more correct answers if I go with the code, but...

And I was wrong about powers of 2 working; 2048 didn't work.

; r1, sum0 <- 36*sum1 + sum0 mul sum1, s56 add sum0, r0 adc r1, s0 ; <= 21 ; r1, sum0 <- 36*r1 + sum0 mul r1, s56 add sum0, r0 adc r1, s0 ; <= 4 ; r1, sum0 <- 36*r1 + sum0 mul r1, s56 ; clears r1 add sum0, r0 brcc 1f addi sum0, 56 brcc 1f addi sum0, 56

I think the problem is the first brlt instruction.

Changing the brlt's to brcc's should do the trick.

brlt is for a signed test.

For the second test, the values are in 0..127,

whether or not interpreted as signed,

so the difference is not important.

What are the pop instructions for?

I didn't see the corresponding pushes.

Also, some of the comments still have 36 where they should have 56.

The code runs on the principle that

0x10000*d2 + 0x100*d1 + d0 == 36*d2 + 56*d1 + d0 mod 100.

Except for the initial input, d2 was 0.

Quote:

Changing the brlt's to brcc's should do the trick.

Changing to "brlo" (unsigned equiv of brlt, which is actually brcs) helps. Still gets errors:

***ERROR***Rounding 21456 Got: 21444 ***ERROR***Rounding 21457 Got: 21444 ***ERROR***Rounding 21458

Quote:

What are the pop instructions for?

I didn't see the corresponding pushes.

they're there: r0 isn't necessary, I guess, but r1 is the gcc "known zero" register, so it needs saved.

; clears r1 (and r0) push r0 push r1

Level: Moderator

Joined: Mon. Jul 18, 2005

Posts: 96425 View posts

Location: (using avr-gcc in) Finchingfield, Essex, England

Quote:

so it needs saved.

Surely all it actually needs is clearing to 0:

eor r1, r1

at the end?

Quote:

so it needs saved.

Surely all it actually needs is clearing to 0:eor r1, r1at the end?

As noted in a comment, r1 is already clear.

Perhaps another comment is in order.

(I see how it works, now. Mostly.)

I found my problem:

add sum0, r0 brcc 1f addi sum0, 56 brcc 1f addi sum0, 56

Remember that "addi" is a macro that does actually does a subi (because avr doesn't have addi!) Apparently that makes the state of the carry bit opposite what I'd expect from an add operation. It worked better when i reverse the condition, but since 56 is already in a register anyway, I thought it would be clearer if I just did a reg-reg add.

The code now works up to an input of: 14548980

It now looks like:

;; deleted; see subsequent post.

Last Edited: Sat. Apr 20, 2013 - 08:01 AM

Ahh. It turns out that the assumption that r1 is cleared by the last multiply is not quite true. (I knew I had saved it for a reason!) Changing it thusly results in correct answers over the whole range:

;;; one more time... 0x100*r1 + sum0 = 36*(0) + 56*r1 + sum0 ; r1, sum0 <- 56*r1 + sum0 mul r1, s56 ; clears r1 cpse r1, s0 ; (except occasionally) add sum0, s56 add sum0, r0

Quote:

If this "function" should be able to be in a ISR

The function is currently written to the ABI specs of gcc. If you wanted to call it from an ISR, the ISR would be responsible for saving R0 and R1 (and the standard gcc ISR prologue does do this.) If you wanted to include the code in a non-C ISR, it would have to save a LOT more registers.

BTW: what made things finally 'click' for me was that THIS:

Quote:

0x10000*d2 + 0x100*d1 + d0 == 36*d2 + 56*d1 + d0 mod 100.

is repeated multiple times. Each time, the result gets smaller. Once the result fits in a single byte, the "modulo 100" becomes a trivial "up to two subtractions." Very neat.

.

Final (?) version:

;;; minimized code to round a 24bit number to the nearest 100 (decimal) ;;; Algorithm by Michael Hennebry ("Skeeve") ;;; https://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=131682 ;;; ;;; The code runs on the principle that ;;; 0x10000*d2 + 0x100*d1 + d0 == 36*d2 + 56*d1 + d0 mod 100. .global round100 round100: ; rounds input to multiple of 100 ; input src2, src1, src0 ; output src2, src1, src0 ;;; If we use r24:r22, we should be compatible with C sending us a long ;;; (and sending it back.) All registers used are "call-used" and shouldn't ;;; need to be saved (except for r1, which is assigned "known zero" by C, but ;;; used by the avr multiply instructions. #define src2 r24 #define src1 r23 #define src0 r22 ; they need not be upper registers ; clears r1 ;;; if r1 is really cleared, we don't need to save it. But be safe. push r1 ; 35 words, up to 40 cycles ; temporaries s56, s36, sum0 need to be upper registers ; s56, s36 could be the same register #define s36 r26 #define s56 r26 #define sum0 r27 ; temporaries s0, sum1 do not #define s0 r18 #define sum1 r19 ldi s36, 36 ; 2**16 mod 100 mul src2, s36 mov sum0, r0 mov sum1, r1 ldi s56, 56 ; 2**8 mod 100 mul src1, s56 add sum0, r0 adc sum1, r1 ; <=91 clr s0 add sum0, src0 adc sum1, s0 ; <= 92 ; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0 ;;; repeat... 0x100*sum1 + sum0 = 36*(0) + 56*sum1 + sum0 ; r1, sum0 <- 56*sum1 + sum0 mul sum1, s56 add sum0, r0 adc r1, s0 ; <= 21 (sum < 0x21FF) ;;; repeat... 0x100*r1 + sum0 = 36*(0) + 56*r1 + sum0 ; r1, sum0 <- 56*r1 + sum0 mul r1, s56 add sum0, r0 adc r1, s0 ; <= 4 (sum < 0x4FF) ;;; one more time... 0x100*r1 + sum0 = 36*(0) + 56*r1 + sum0 ; r1, sum0 <- 56*r1 + sum0 mul r1, s56 ; clears r1 cpse r1, s0 ; (except occasionally) add sum0, s56 add sum0, r0 brcc 1f ;last 56*r1 + sum0 add sum0, s56 brcc 1f add sum0, s56 1: sbrc sum0, 7 ; 128 subi sum0, 100 ; <=155 cpi sum0, 100 brlo 2f subi sum0, 100 ; <=99 2: ; s0, s0, sum0 <- sum0 or sum0-100 cpi sum0, 50 brlt 3f subi sum0, 100 dec s0 ; s0 repurposed as upper 2 bytes of subtrahend 3: ; src2, src1, src0 -= s0, s0, sum0 sub src0, sum0 sbc src1, s0 sbc src2, s0 pop r1 ret

Quote:

<= 4 should be <=5

clears R1 should be <=1

Also, 21 != 0x21.

R1 does not need to be saved, just cleared.

That would save 3 cycles.

The last multiply sequence takes 9 cycles in the worst case.

It could be replaced by an 8-cycle sequence of skips and adds.