simple way to round?

Go To Last Post
36 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is there a simple way to round a 24 bit number to the nearest 100 (ex: 1234567 gives 1234600, 14445970 gives 1446000, 223344 gives 223300).

This is using avr assembler. I suppose I may have to resort to some large mult or divide routines again.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

divide by 100. if the remainder > 50 then add 1 to the dividend then multiply the dividend by 100. If you already have the number as bcd, then test the 10's digit for >=5 and add 100 else zero the 1's and 10's digits.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

value = value + 50
value = value - (value mod 100)

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is there an easy way to do the mod? I suppose there's no other way than a long 24 bit division..wan't sure if there were any "tricks" that would work.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You could easily round to the nearest 128...
Or you could do it during printing, when you're dividing anyway.

There are assorted "interesting" algorithms for dividing or multiplying by a constant. This "divide by 10" function (from "Hacker's Delight") looks particularly interesting because of all the shifts that are multiples of 8bits.

unsigned divu10(unsigned n) { unsigned q, r;
   q = (n >> 1) + (n >> 2);
   q = q + (q >> 4);
   q = q + (q >> 8);
   q = q + (q >> 16);
q = q >> 3;
r = n - q*10;
return q + ((r + 6) >> 4);
// return q + (r > 9);
}
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks--That's an interesting trick, I suppose it can be done in asm, even with the *10 required.
To round to the nearest 100 wiould require doing this twice.
I wonder if ther is something even trickier & shorter. I did get my program working using a long divide & long multiply (add 50, div by 100 & mult by 100), so now its more of an interesting quetion.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Surely it depends on the rounding you need.
http://en.wikipedia.org/wiki/Rounding#Types_of_rounding

Four legs good, two legs bad, three legs stable.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
Is there an easy way to do the mod?
Mod is simply a divide where you keep the remainder instead of the result. So if you already have the (add 50, div by 100 & mult by 100) solution, then you only need a rearrangement of the code to get my solution which saves you the multiply.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why do you want to do this? The more we know about why (and under what circumstances) you want to do it, the better answers we might provide.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I wonder if ther is something even trickier & shorter.
There was similar "divide by 100" code, but it didn't look like it would optimize as well for a CPU without a barrel shifter.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Here is some pseudocode that might help.
The loops should be completely unrolled.
Use a macro for the big one.
I think it will come out to 65 instructions total.

hi, mid, lo=src
for mask=0x01; mask; mask<<=1
    if hi & mask
        lo+=mask*36%100
        if carry
            lo+=56
    if mid & mask
        lo+=mask*56%100
        if carry
            lo+=56

while lo>=100
    lo-=100

if lo>=50
    result=src+(100-lo)
else
    result=src-lo

It just rounds to a multiple of 100.
If you want decimal digits, this won't give them to you.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't get [skeeve's code] at all.
If you HAVE modulo, you'd just do
r = (v+50) - ((v+50)%100)
But Modulo is usually just as expensive and long as division.
Where do 36 and 56 come from?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
I don't get [skeeve's code] at all.
If you HAVE modulo, you'd just do
r = (v+50) - ((v+50)%100)
But Modulo is usually just as expensive and long as division.
Where do 36 and 56 come from?
56==0x100%100
Once you unroll the big loop, mask becomes constants and the assembler can do the percent work.
That is why the big loop unrolls to only 65 instructions.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I think it will come out to 65 instructions total.

Quote:

That is why the big loop unrolls to only 65 instructions.

Quote:

But Modulo is usually just as expensive and long as division.

+1 on the last...riddle me that--how are the two %100 each loop done "free"?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
Quote:

But Modulo is usually just as expensive and long as division.

+1 on the last...riddle me that--how are the two %100 each loop done "free"?
skeeve wrote:
Once you unroll the big loop, mask becomes constants and the assembler can do the percent work.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ah. I get it. p Part of it, anyway...
When you said to unroll the loop, you meant to expand all the unrolled constants as well.

if hi & 1
  lo += (1*36)%100
  if carry
     lo +=56
  if mid & 1
     lo += (1*56)%100
   :
; next unrolled iteration
if hi & 2
  lo += (2*36)%100
  if carry
    lo += 56
  if mid & 2
    lo += (2*56)%100
   :

That should work out pretty well using .irp and such; perhaps I'll give it a try. Thanks for the hints!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mid shouldn't nest inside hi.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hmm. Does this look right? It expands to a lot more than 65 instructions (100+); don't forget that you can't do "sbic sreg, x" (that would have helped.)

.irp mask, 1, 2, 4, 8, 16, 32, 64, 128
	andi NH, mask
	breq 0f
	  subi NL, -((36*mask)%100)
	  brcc 0f
	    subi NL, -56
0:	andi NM, mask
	breq 1f
	  subi NL, -((56*mask)%100)
	  brcc 1f
	    subi NL, -56
1:
.endr
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
But Modulo is usually just as expensive and long as division.
Modulo is division, only what you keep as the return value is different.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The .irp code seems like it would work.
It expands to 80 instructions.
One can get it down to 65 if one realizes that my nested ifs do not need to be nested.

@Koshchi: long division naturally produces a remainder,
but not all division is done that way, especially division by a constant.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But if you have a form of division that does not produce a remainder, then it can't be used for modulo, so I don't see the relevance.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What do you mean with easy ? :)
Size, AVG speed, worst case speed, easy to understand?
I don't like the +50 version because it can overflow.
If it's worst case speed (and a AVR with HW mul) I would go for some ASM code that do the division with a mul by 1/100.
mul by 167772 and forget about the lower 24 bit, could be a way.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I don't like the +50 version because it can overflow.

Surely not unless the round operation itself would actually overflow?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
Quote:
I don't like the +50 version because it can overflow.

Surely not unless the round operation itself would actually overflow?
2**24-1 will round down.
2**24-1+50 will overflow.

BTW with a slightly different algorithm, my big loop can be replaced by 57 instructions.

With hardware multiply and careful ordering,
my big loop can be replaced by 39 words and 41 cycles.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
  ; rounds input to multiple of 100
  ; input  src2, src1, src0
  ; output src2, src1, src0
  ; they need not be upper registers
  ; clears r1
  ; 35 words, up to 40 cycles
  ; temporaries s56, s36, sum0 need to be upper registers
  ; temporaries s0, sum1 do not
  ; s56, s36 could be the same register


  ldi s36, 36  ; 2**16  mod 100
  mul src0, s36
  mov sum0, r0
  mov sum1, r1

  ldi s56, 56  ; 2**8 mod 100
  mul src2, s56

  add sum0, r0
  adc sum1, r1 ; <=91

  clr s0
  add sum0, src0
  adc sum1, s0 ; <= 92

  ; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0

  ; r1, sum0 <- 36*sum1 + sum0
  mul sum1, s56
  add sum0, r0
  adc r1, s0 ; <= 21

  ; r1, sum0 <- 36*r1 + sum0
  mul r1, s56
  add sum0, r0
  adc r1, s0   ; <= 4

  ; r1, sum0 <- 36*r1 + sum0
  mul r1, s56  ; clears r1
  add sum0, r0
  brcc 1f
  addi sum0, 56
  brcc 1f
  addi sum0, 56
  1:
  sbrc sum0, 7 ; 128
  subi sum0, 100 ; <=155
 
  cmpi sum0, 100
  brlt .+2
  subi sum0, 100 ; <=99

  ; s0, s0, sum0 <- sum0 or sum0-100
  cmpi sum0, 50
  brlt .+4
  subi sum0, 100
  dec s0  ; s0 repurposed as upper 2 bytes of subtrahend

  ; src2, src1, src0 -= s0, s0, sum0
  sub src0, sum0
  subc src1, s0
  subc src2, s0

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Debugging code you don't understand is difficult!
It's a good thing that this comment was there: "; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0", cause that's not what the first few lines actually calculated.

Here is a corrected version that should be callable from C (with a "long" as argument, where we only use 24 bits.) It worked for all powers of 2. It's still not perfect, though; I'll try to do additional debugging, but am hoping something jumps out. First errors:

***ERROR***Rounding 228
Got:     100
***ERROR***Rounding 229
Got:     100
***ERROR***Rounding 230
Got:     100
.global round100
round100:
  ; rounds input to multiple of 100
  ; input  src2, src1, src0
  ; output src2, src1, src0
;;; If we use r24:r22, we should be compatible with C sending us a long
;;; (and sending it back.)
#define src2 r24
#define src1 r23
#define src0 r22
  ; they need not be upper registers
  ; clears r1 (and r0)
	push r0
	push r1

  ; 35 words, up to 40 cycles
  ; temporaries s56, s36, sum0 need to be upper registers
  ; s56, s36 could be the same register
#define s36 r26
#define s56 r26	
#define sum0 r27
  ; temporaries s0, sum1 do not
#define s0  r18
#define sum1 r19

;;; convenience macros
.macro addi r, c
	subi \r, -(\c)
.endm
.macro cmpi r, c
	cpi \r, \c
.endm

  ldi s36, 36	; 2**16  mod 100
  mul src2, s36
  mov sum0, r0
  mov sum1, r1

  ldi s56, 56  ; 2**8 mod 100
  mul src1, s56

  add sum0, r0
  adc sum1, r1 ; <=91

  clr s0
  add sum0, src0
  adc sum1, s0 ; <= 92

  ; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0

  ; r1, sum0 <- 36*sum1 + sum0
  mul sum1, s56
  add sum0, r0
  adc r1, s0 ; <= 21

  ; r1, sum0 <- 36*r1 + sum0
  mul r1, s56
  add sum0, r0
  adc r1, s0   ; <= 4

  ; r1, sum0 <- 36*r1 + sum0
  mul r1, s56  ; clears r1
  add sum0, r0
  brcc 1f
  addi sum0, 56
  brcc 1f
  addi sum0, 56
1:
  sbrc sum0, 7 ; 128
  subi sum0, 100 ; <=155
 
  cmpi sum0, 100
  brlt .+2
  subi sum0, 100 ; <=99

  ; s0, s0, sum0 <- sum0 or sum0-100
  cmpi sum0, 50
  brlt .+4
  subi sum0, 100
  dec s0  ; s0 repurposed as upper 2 bytes of subtrahend

  ; src2, src1, src0 -= s0, s0, sum0
  sub src0, sum0
  sbc src1, s0
  sbc src2, s0 
	
	pop r1
	pop r0
	ret

And the Arduino test sketch. (no, Arduino doesn't like .S files. I let the compile fail, then manually link in the assembled .S file and subsequent upload steps.)

void setup()
{
  Serial.begin(9600);
}
extern "C" unsigned long round100(unsigned long);

unsigned long n;
void loop()
{
  unsigned long r;
  unsigned char i = 0;

  for (n=1; n < (1L<<24); n+=1) {
    i++;
    r = round100(n);
    if (r != ((n+50)/100L)*100L) {
      Serial.print("***ERROR***");
        i=0;
    }
    if (i == 0) {
      Serial.print("Rounding ");
      Serial.println(n);
      Serial.print("Got:     ");
      Serial.println(round100(n));
    }
  }
  Serial.println("Done!");

  while (Serial.read() < 0)
    ;
}
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

In this section, the comments say *36, but the code has *56. I get more correct answers if I go with the code, but...

And I was wrong about powers of 2 working; 2048 didn't work.

  ; r1, sum0 <- 36*sum1 + sum0
  mul sum1, s56
  add sum0, r0
  adc r1, s0 ; <= 21

  ; r1, sum0 <- 36*r1 + sum0
  mul r1, s56
  add sum0, r0
  adc r1, s0   ; <= 4

  ; r1, sum0 <- 36*r1 + sum0
  mul r1, s56  ; clears r1
  add sum0, r0
  brcc 1f
  addi sum0, 56
  brcc 1f
  addi sum0, 56 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I think the problem is the first brlt instruction.
Changing the brlt's to brcc's should do the trick.
brlt is for a signed test.
For the second test, the values are in 0..127,
whether or not interpreted as signed,
so the difference is not important.
What are the pop instructions for?
I didn't see the corresponding pushes.

Also, some of the comments still have 36 where they should have 56.

The code runs on the principle that
0x10000*d2 + 0x100*d1 + d0 == 36*d2 + 56*d1 + d0 mod 100.
Except for the initial input, d2 was 0.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
Changing the brlt's to brcc's should do the trick.

Changing to "brlo" (unsigned equiv of brlt, which is actually brcs) helps. Still gets errors:

***ERROR***Rounding 21456
Got:     21444
***ERROR***Rounding 21457
Got:     21444
***ERROR***Rounding 21458

Quote:
What are the pop instructions for?
I didn't see the corresponding pushes.

they're there: r0 isn't necessary, I guess, but r1 is the gcc "known zero" register, so it needs saved.

  ; clears r1 (and r0)
   push r0
   push r1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

so it needs saved.

Surely all it actually needs is clearing to 0:

eor r1, r1

at the end?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
Quote:

so it needs saved.

Surely all it actually needs is clearing to 0:

eor r1, r1

at the end?

Doesn't even need additional clearing.
As noted in a comment, r1 is already clear.
Perhaps another comment is in order.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

(I see how it works, now. Mostly.)

I found my problem:

  add sum0, r0
  brcc 1f
  addi sum0, 56
  brcc 1f
  addi sum0, 56 

Remember that "addi" is a macro that does actually does a subi (because avr doesn't have addi!) Apparently that makes the state of the carry bit opposite what I'd expect from an add operation. It worked better when i reverse the condition, but since 56 is already in a register anyway, I thought it would be clearer if I just did a reg-reg add.
The code now works up to an input of: 14548980
It now looks like:

 ;; deleted; see subsequent post.
Last Edited: Sat. Apr 20, 2013 - 08:01 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If this "function" should be able to be in a ISR they (r0 and r1) need to be stored, and you don't know the value!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ahh. It turns out that the assumption that r1 is cleared by the last multiply is not quite true. (I knew I had saved it for a reason!) Changing it thusly results in correct answers over the whole range:

;;; one more time...  0x100*r1 + sum0 = 36*(0) + 56*r1 + sum0
  ; r1, sum0 <- 56*r1 + sum0
  mul r1, s56  ; clears r1
  cpse r1, s0  ;  (except occasionally)
   add sum0, s56
  add sum0, r0

Quote:
If this "function" should be able to be in a ISR

The function is currently written to the ABI specs of gcc. If you wanted to call it from an ISR, the ISR would be responsible for saving R0 and R1 (and the standard gcc ISR prologue does do this.) If you wanted to include the code in a non-C ISR, it would have to save a LOT more registers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

BTW: what made things finally 'click' for me was that THIS:

Quote:
0x10000*d2 + 0x100*d1 + d0 == 36*d2 + 56*d1 + d0 mod 100.

is repeated multiple times. Each time, the result gets smaller. Once the result fits in a single byte, the "modulo 100" becomes a trivial "up to two subtractions." Very neat.
.
Final (?) version:

;;; minimized code to round a 24bit number to the nearest 100 (decimal)
;;; Algorithm by Michael Hennebry ("Skeeve")
;;; https://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=131682
;;; 
;;; The code runs on the principle that
;;;   0x10000*d2 + 0x100*d1 + d0 == 36*d2 + 56*d1 + d0 mod 100.
	
.global round100
round100:
  ; rounds input to multiple of 100
  ; input  src2, src1, src0
  ; output src2, src1, src0
;;; If we use r24:r22, we should be compatible with C sending us a long
;;; (and sending it back.)  All registers used are "call-used" and shouldn't
;;; need to be saved (except for r1, which is assigned "known zero" by C, but
;;; used by the avr multiply instructions.
#define src2 r24
#define src1 r23
#define src0 r22
  ; they need not be upper registers
  ; clears r1
;;; if r1 is really cleared, we don't need to save it.  But be safe.
	push r1

  ; 35 words, up to 40 cycles
  ; temporaries s56, s36, sum0 need to be upper registers
  ; s56, s36 could be the same register
#define s36 r26
#define s56 r26	
#define sum0 r27
  ; temporaries s0, sum1 do not
#define s0  r18
#define sum1 r19

  ldi s36, 36	; 2**16  mod 100
  mul src2, s36
  mov sum0, r0
  mov sum1, r1

  ldi s56, 56  ; 2**8 mod 100
  mul src1, s56

  add sum0, r0
  adc sum1, r1 ; <=91

  clr s0
  add sum0, src0
  adc sum1, s0 ; <= 92

  ; 0x100*sum1 + sum0 == 36*src2 + 56*src1 + src0
	
;;; repeat...  0x100*sum1 + sum0 = 36*(0) + 56*sum1 + sum0
  ; r1, sum0 <- 56*sum1 + sum0
  mul sum1, s56
  add sum0, r0
  adc r1, s0 ; <= 21  (sum < 0x21FF)

;;; repeat...  0x100*r1 + sum0 = 36*(0) + 56*r1 + sum0
  ; r1, sum0 <- 56*r1 + sum0
  mul r1, s56
  add sum0, r0
  adc r1, s0   ; <= 4  (sum < 0x4FF)
	
;;; one more time...  0x100*r1 + sum0 = 36*(0) + 56*r1 + sum0
  ; r1, sum0 <- 56*r1 + sum0
  mul r1, s56  ; clears r1
  cpse r1, s0  ;  (except occasionally)
   add sum0, s56
  add sum0, r0

  brcc 1f		;last 56*r1 + sum0
  add sum0, s56
  brcc 1f
  add sum0, s56
1:
  sbrc sum0, 7 ; 128
   subi sum0, 100 ; <=155
 
  cpi sum0, 100
  brlo 2f
   subi sum0, 100 ; <=99
2:
  ; s0, s0, sum0 <- sum0 or sum0-100
  cpi sum0, 50
  brlt 3f
   subi sum0, 100
   dec s0  ; s0 repurposed as upper 2 bytes of subtrahend
3:
  ; src2, src1, src0 -= s0, s0, sum0
  sub src0, sum0
  sbc src1, s0
  sbc src2, s0 
	
	pop r1
	ret

Quote:

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A couple of my comments are wrong:
<= 4 should be <=5
clears R1 should be <=1
Also, 21 != 0x21.
R1 does not need to be saved, just cleared.
That would save 3 cycles.
The last multiply sequence takes 9 cycles in the worst case.
It could be replaced by an 8-cycle sequence of skips and adds.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods