## Sum of many 2-byte signed numbers?

30 posts / 0 new
Author
Message

How can one keep a running sum of many two-byte signed integers?

The path taken thus far has been to determine if the first signed integer  is negative. If so, it is placed in the lower two bytes of the total and  the upper three bytes of the total are filled with 0xFF FF FF.  If found to be positive it is just placed in the lower two bytes.

From that point on, the lower bytes are added.  If the Vflag is clear, nothing is done.  Where I get into trouble is when the Vflag is set (when I must modify the next higher byte(s)).

Is there a way to properly handle handle the overflow based on the other S-register flags?

Moderator:  If this is not the proper forum for this message, please transfer it.

Last Edited: Mon. May 1, 2017 - 05:58 PM

I think you are making it too hard.

-- Start with 0.  No need to examine the first number for signedness.

-- Add each new number to the total.  No need to examine either the number or total for signedness.

How many 16-bit numbers are you going to add to the total?  If up to e.g. 256, then use a 24-bit or 32-bit "accumulator".  32-bit accumulator would handle up to 64K 16-bit numbers with max value.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Add each new number to the total.  No need to examine either the number or total for signedness.

Perhaps the unstated question (or answer) is that each sum must simply propagate over ALL 5 bytes (even though only 2 byte numbers are being added)..then the current sum is represented by the 5 byte total.

adc sum2, zero   (some register set to zero)

for signed numbers just extend the leading ones

```reset:  clr sum0
clr sum1
clr sum2
clr sum3
clr sum4

ldi ZL, low(-23555)
ldi ZH, high(-23555)

ldi ZL, low(30555)
ldi ZH, high(30555)

ldi ZL, low(-12345)
ldi ZH, high(-12345)

done:   nop
rjmp done

;  zh:zl contains number to add
addup:  clr extend  ;for positive value
sbrc ZH,7   ;check if neg value
ser extend  ;extend leading ones for negative values

ret```

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Tue. May 2, 2017 - 07:45 AM

avrcandies wrote:
then the current sum is represented by the 5 byte total.

Indeed, OP's mention of 5 bytes is interesting.  That is 2**24 of max value (actually 2**25 as it is signed so max is 2**15) or 16 million to 32 million "samples".  If the typical value is e.g. 1/10 of max then 10x more, etc.

I guess OP's mention of the V and S flags implies assembler.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

One way is to simply sign-extend each 2-byte value to 5 bytes.

Another is to have separate sums for positive and negative values.

Form the "sum" of the negatives by subtracting.

Subtract the "sum" of negatives from the sum of positives.

If efficiency is a consideration,

you might want 3- and 4-byte intermediate sums.

Iluvatar is the better part of Valar.

avrcandies wrote:
for signed numbers just extend the leading ones

Aah, yes, I missed that case when I said "no fussing".

Just for fun, let's see what a couple compilers do adding a signed int to a signed long...

GCC:

```volatile int addend;
volatile long accumulator;
```
```		accumulator += addend;
8a:	80 91 00 01 	lds	r24, 0x0100	; 0x800100 <_edata>
8e:	90 91 01 01 	lds	r25, 0x0101	; 0x800101 <_edata+0x1>
92:	40 91 02 01 	lds	r20, 0x0102	; 0x800102 <accumulator>
96:	50 91 03 01 	lds	r21, 0x0103	; 0x800103 <accumulator+0x1>
9a:	60 91 04 01 	lds	r22, 0x0104	; 0x800104 <accumulator+0x2>
9e:	70 91 05 01 	lds	r23, 0x0105	; 0x800105 <accumulator+0x3>
a2:	09 2e       	mov	r0, r25
a4:	00 0c       	add	r0, r0
a6:	aa 0b       	sbc	r26, r26
a8:	bb 0b       	sbc	r27, r27
aa:	84 0f       	add	r24, r20
ac:	95 1f       	adc	r25, r21
ae:	a6 1f       	adc	r26, r22
b0:	b7 1f       	adc	r27, r23
b2:	80 93 02 01 	sts	0x0102, r24	; 0x800102 <accumulator>
b6:	90 93 03 01 	sts	0x0103, r25	; 0x800103 <accumulator+0x1>
ba:	a0 93 04 01 	sts	0x0104, r26	; 0x800104 <accumulator+0x2>
be:	b0 93 05 01 	sts	0x0105, r27	; 0x800105 <accumulator+0x3>
```

I'd have to work it out on paper, but the highlighted lines are the sign-extend?

CodeVision:  (CWD can be read as "Convert Word to Double-word I think")

```00003c d010      	RCALL __CWD1
...
00004c 9508      	RET

__CWD1:
00004d 2f6f      	MOV  R22,R31
00004f 0b66      	SBC  R22,R22
000050 2f76      	MOV  R23,R22
000051 9508      	RET
```

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

is this for speed size or just to get it to work?

the easy one is like the GCC sign extend and make the 5 byte addition , (optimize have one register that is 0 and become FF if MSB of number is 1, and use that register for the last 3 adds).

this is all you need

```;  zh:zl contains number to add
addup:  clr extend  ;for positive value
sbrc ZH,7   ;check if neg value
ser extend  ;extend leading ones for negative values

ret```

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

On re-reading, I think OP's problem is just how to do a multibyte add.
I think OP sign-extends correctly,
but for addition he uses the V flag (overflow) instead of the C flag (carry).

```; sign-extend  3 cycles
CLR A2   ; zero
SBRC A1,7  ; if negative
DEC A2    ; 0xFF

; S += A   5 cycles
ADC S3, A2  ; A3 would be same as A2

Note that it takes 3 cycles to note the sign of the addend
and 5 cycles to do a 5-byte addition.
To do better than 8 cycles at least one will need improving.
Note that it takes at least 4 cycles just to fetch 2 bytes.

For 5 bytes to be useful, I expect that this is in a large loop.
The calculation above ignored the branching required for said loop.
Enough unrolling might make it neglible.
Nesting the loop so that the inner loop only does 3-byte
of adjusting A2 requires only 2 cycles.
Adjust the sum outside the inner loop.
The brings us down to 5 cycles for most additions,
4 cycles to fetch each addend and however
many cycles it took to generate each addend.

Iluvatar is the better part of Valar.

Nesting the loop so that the inner loop only does 3-byte

can you explain?  I believe you must ALWAYS add all 5 bytes, or risk chaos.  You could put in extra "skip" checks, but those would make things even longer.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

I still want to hear more about the 10s of millions of samples. ;)

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

avrcandies wrote:

Nesting the loop so that the inner loop only does 3-byte

can you explain?  I believe you must ALWAYS add all 5 bytes, or risk chaos.  You could put in extra "skip" checks, but those would make things even longer.

```sum5=0
for
sum3=0
for ; at most 0x100 times  might need unrolling
sum5+=sum3```

Iluvatar is the better part of Valar.

with 8 clk for this, we need to know more about how many numbers we talk about, how big they normally are, and does the sum go up/down or will the sum be around 0.

which format do they come in, and what else does the micro do?

do you need the partial sum while you are doing this, or is this the sum of a known amount of numbers?

if it's the sum of 2^n amount of numbers it could be faster to handle this as positive numbers (I guess add 32768 can be done a fast way), and then add . the final result then need to be shifted but that is only once.

An other way could be to add all negative numbers one place and the positive an other place. (but if that is smart depends of what is really needed)

Don't add the negative numbers separately.

Test and branch or jump is 2 or 3 cycles.

Just counting the negative numbers only requires 2 cycles.

Iluvatar is the better part of Valar.

Thanks everyone.  I set up for a five byte sum because I haven't decided how many additions to do.

I did get a notification after the first message from theusch.  The message gave me more confidence and I found I had made an error in one of the stim files I wrote.  I received no notification of the subsequent messages.

The goal is to add up a running sum as the data comes in and then, at some point, find the average.  The input data is in the range 0x00 to 0x1F and this is converted to two byte SIN and two byte COS signed values stored in a look up table.  The SINs and COSs are summed in two separated additions.  The average of each is then found.  The code seems to work at 128 measurements for all inputs 0x02 (22.5 degrees), for all inputs 0x1D (337.5 degrees), and for alternating inputs 0x02; 0x1D (the latter average express 0x0000 as 0x8000).

Yet to be written is a more complex stim file.  Also, my code may not be very efficient.  Luckily, the single byte data occurs only every 800 uS so it should get the job done.

As it now exists, my code uses BRCC to skip adjusting the higher bytes of the sum.  Otherwise the next higher byte is incremented by an ADC with a cleared register and a BRCC if the still higher byte can be left unmodified.

Last Edited: Wed. May 3, 2017 - 02:52 PM

make a 0x20 array where you ++ (one more of this value), and then only do the rest one time.

John3 wrote:
I set up for a five byte sum because I haven't decided how many additions to do.
theusch wrote:
I still want to hear more about the 10s of millions of samples. ;)

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

make a 0x20 array where you ++ (one more of this value), and then only do the rest one time.

if you do y=sin(x), taking the avg of many x, then find sin using discrete look up is not the same as taking the avg of (many sin(x)) ,due to quantization.

If you did an avg of many x followed by a true high res sin(x) calc, you would get the resolution.

The question is how much sin resolution is finally needed.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

The final resolution should be a good as possible, but anything much over five degrees probably is overkill.  This is a hobby project to try to adapt a 1970's era radio direction finder to use with a modern day mapping program.

The input data is in 11.25 degree increments.  Sine and cosine values are from a table.  The table was made by multiplying by 16383 and converting to signed two-byte hex.  As each reading comes in, the sine and cosine value for it is added to the appropriate sum.  After 2^N (right now 128) the average sine and average cosine are found by right shifting the sums.

The quadrant is found by testing bit 7 of the high byte of the average sine and average cosine.

What is left is to resolve the angle represented by the sine and cosine averages.

The 10's of millions of samples can be limited to, perhaps, only 1024.  :)  That is approximately under one second of measurements.  It's also OK to postpone beginning a series of measurements while the result is output by the UART/USART.

Last Edited: Wed. May 3, 2017 - 05:18 PM

linear relation:

in out

1  3

2  6

3  9

4  12

in: 1,2,3,4,1,2,2,2,2==> avg=2.111111, LUT out==>6 (not so accurate)

outs: 3,6,9,12,3,6,6,6,6==>avg=6.33333 much better

COULD interpolate 2.111 in the lookup table for same result

Beware when the relation is nonlinear, an extreme example:

Nonlinear square relation:

in out

1  1

2  4

3  9

4  16

in: 1,2,3,4,1,2,2,2,2==> avg=2.111111, LUT out==>4 (not so accurate)

outs:  1 4 9 16 1 4 4 4 4==>avg=5.2222 much better and correct

CAN'T interpolate 2.111 in the lookup table for same result (and 2.2111^2===4.457)

Bottom line---if your final answer comes straight from look up, it will be as low res as the table itself.  If the table is linear, interpolation can be used to improve.  If the table is nonlinear, then the avg of the input correlated to the avg of the output can range from good to horrible (depending upon the degree of nonlinearity).

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Wed. May 3, 2017 - 06:01 PM

No :

if we  only receive values [0..0x1f] it can't be better, just faster

remember there is not a linear relation to the numbers!

You have to do this :

0x00 recerived 210 times

0x01 recerived 110 times

0x02 recerived 550 times

0x03 recerived 1260 times

We need to do this:

sin(0x00) *210 +sin(0x01) *110 +sin(0x02) *550 +sin(0x03) *1260.........

and then div with the sum (but if it a 2^n numbers it's just a shift).

It don't save much in addition we started with (the is a good gain if written in ASM), but 1000's of LUT reads will become 32

(where sin (x) is a LUT)

Last Edited: Wed. May 3, 2017 - 05:55 PM

We need to do this:

sin(0x00) *210 +sin(0x01) *110 +sin(0x02) *550 +sin(0x03) *1260.........

This approach if used "literally" is not so hot, since there are several totals to keep track of (1 total for each LUT item), plus a multiply for each.

For high accuracy it is best to take the incoming angle, look up the sin and form the answer summation then finally avg of those values.  Of course taking a million samples to reduce variance doesn't get very far if the lut values have limited accuracy themselves.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

John3 wrote:
The 10's of millions of samples can be limited to, perhaps, only 1024.
theusch wrote:
Indeed, OP's mention of 5 bytes is interesting. That is 2**24 of max value (actually 2**25 as it is signed so max is 2**15) or 16 million to 32 million "samples". If the typical value is e.g. 1/10 of max then 10x more, etc.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

But you are missing one key point here, OP receive only the values 0..31 and nothing else!

It will make no difference if you make the avg of  200 sums of the LUT or 1 LUT and multiply with 200!

sparrow2 wrote:
But you are missing one key point here, OP receive only the values 0..31 and nothing else!
I was working off the original post, which implied 16-bit signed values.

Indeed, if 8-bit signed values, then 256 fit into a 16-bit accumulator.

If really 5 bits extended to 8 (and if indeed 0-31 and not signed), then many thousands fit into a 16-bit accumulator and this whole discussion becomes relatively moot.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

it's because OP add the sin(x) and not x, but because x only can be 32 different numbers, it suggest just to find out how many of each, and only do the LUT once (for each input value group.)

But from the information in #16 we now know that there is 800us between numbers, so no real need for optimization.

I'm not clear what more OP needs to know.

By now he should know how to add a 2-byte signed number to a 5-byte signed number.

Five bytes is overkill, with only 128 numbers to add, three bytes is enough.

Even with twice that many, three bytes would be enough/

Iluvatar is the better part of Valar.

I'm not clear what more OP needs to know.

I suggest he looks up & studies CORDIC

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

I'm not clear what more OP needs to know.

I'm not  either.  :)   Seven years ago I used CORDIC rotation to figure angles with an accelerometer, but that was the last time I did anything with micro-controllers.

I have written some code to move the bearing into the first geometric quadrant and hope to resole the bearing there.  That needs to be finished.   Then in quadrant I, I hope to find if the completed bearing needs more than 45 degrees added in.  That will determine whether the adjusted average sin or adjusted average cosine will be used to further reduce the angle.

I realize I need to learn more about using the stimulator.  Right now, the stim file generates the interrupt and the input data.  It would be nice to separate the interrupt stim from the data stim, but synchronize them so there is just one data input per interrupt.

That may be the topic for another thread.