## Shift right multi byte using reciprocal multiplication

43 posts / 0 new
Author
Message

Everyone knows, to shift right a 24 bits variable (3 bytes) it needs 3 instructions per bit shifted, if doing straight without using loop.

per bit:

ASR ByteH ; if signed, or LSR if unsigned

ROR ByteM

ROR ByteL

When shifting 6 bits, it will use 18 clock cycles, if this is part of a repeating routine, it can easily eat time.

Using reciprocal multiplication can save some cycles.

Shift right ABC (24 bits) "n" bits:
The same as ABC x 2^(8-n), then ignore LSByte, result into ABC.
For "n" 4~7 it saves clock cycles.
If ABC is signed and negative, needs to pad resulting "A" n MSbits with "1", not showing below.

If n=4, D=0x10
If n=5, D=0x08
If n=6, D=0x04

A   B   C
x D
--- --- --- ---
CDH  /
BDH BDL  /
--- --- --- ---
A   B   C   /

Option A (10 clock cycles), considering variable Zero = 0:

```        MUL A, D          ;

MUL C, D          ;
MOV C, R1         ; CDH

MUL B, D
ADD C, R0         ; CDH + BDL

Option B,  just changing sequence, same clock cycles.

```        MUL C, D           ;
MOV C, R1          ; CDH

MUL A, D           ;

MUL B, D           ;
ADD C, R0          ; CDH + BDL

Any other ideas for less than 10 clock cycles?

The example above will run in 10 clock cycles no matter the "n" for unsigned.

In the normal shifting fashion, 6 bits will eat 18 clock cycles for unsigned (LSR) or signed (ASR).

Wagner Lipnharski
Orlando Florida USA

Last Edited: Wed. Oct 6, 2021 - 03:29 AM

24 bit, right shift by 6 bits, in 8 cycles (assumes  EXTRA_HI is zeroed, or that the top 6 EXTRA_HI bits should be ignored)

```LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI
LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI ;Right-justified result in [EXTRA_HI : HIGH : MED]```

I think this is correct, unless I became fumble fingered.

Since the low byte goes to the trashcan, all of the commands could actually be ROL

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Wed. Oct 6, 2021 - 02:53 AM

Remember that shifting 6 times "down" is the same as shifting 2 times up. (and move the bytes around, )

for sign extend:

SBC a register with itself will give 0 if carry clr and 0xff is is set.

so something like:

sbc r19,r19

and 3 moves if you need the correct place.

Last Edited: Wed. Oct 6, 2021 - 04:21 AM

wagnerlip wrote:
Any other ideas for less than 10 clock cycles

9 clock cycles:

In your examples, if X is Even reg, let A be Even reg+1 and then you can replace two mov by movw X, R0.

Great find!

and then you can replace two mov by movw X, R0

more explicit notation  movw A:X, r1:r0

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

UPS

You have to remember that MUL take 2 clk so your code take 13 not 10 clk.

My code take 8 for signed and 9 for unsigned.(and the 6 top bit of r19)

sparrow2 wrote:
You have to remember that MUL take 2 clk so your code take 13 not 10 clk.

My 0856 AVR instruction set file shows only 1 clock cycle for MUL (unsigned) instruction.

avrcandies wrote:
You have to remember that MUL take 2 clk so your code take 13 not 10 clk.

Wagner Lipnharski
Orlando Florida USA

avrcandies wrote:

24 bit, right shift by 6 bits, in 8 cycles (assumes  EXTRA_HI is zeroed, or that the top 6 EXTRA_HI bits should be ignored)

```LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI
LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI ;Right-justified result in [EXTRA_HI : HIGH : MED]```

I think this is correct, unless I became fumble fingered.

Since the low byte goes to the trashcan, all of the commands could actually be ROL

Yes, I considered it on my calculations, some shift count may use different approach.

For example, 1, 2 and 3 bit shift right is economically use plain and simple ASR/ROR, since they will use 3, 6 or 9 clock cycles.

But 4 and above benefits from the reciprocal multiplication.

The backward (left shift) can also be used in place of 6 and 7 right shifts.

The only problem that I gave up pursuing backwards, is considering negative value.

Lets say 6 bits right would be 2 set of backwards, it means 8 clock cycles, then move the 3 bytes back right into position since they are 1 byte shifted to the left, and then, check if the highest bit shifted is 1, meaning the whole thing is negative, then propagate that bit to the bits ("0") at the left of the highest byte.  It will blow up more than 12 or 13 clock cycles.

What I tried with great effort is keeping the reciprocal multiplication to results in the same registers as it started, except for "X" temporary register.

Wagner Lipnharski
Orlando Florida USA

My 0856 AVR instruction set file shows only 1 clock cycle for MUL

The newer manual , and all of the datasheets with an instruction set summary, show 2 cycles for MUL.

is 8bits always sufficient for the reciprocal of a power of 2?  My occasional experience with integer reciprocals is that they tend to need a lot of bits to yield correct results :-(  (usually as part of the "faster divide by 10" discussions, so NOT involving powers-of-2.)

sparrow2 wrote:

Remember that shifting 6 times "down" is the same as shifting 2 times up. (and move the bytes around, )

for sign extend:

SBC a register with itself will give 0 if carry clr and 0xff is is set.

so something like:

sbc r19,r19

and 3 moves if you need the correct place.

Also a great idea, when the shift count is 6 or 7.

I need to remember that, ADC itself, itself is the same as LSL.

Still the problem that R19 is the left over from all the shift lefts, and at the end it will required to move (reposition) the bytes to the right.

Also, the same problem if the number is negative, need to propagate the highest bit to the leftmost zero bits on R19.

This all thing makes part of the CORDIC bit shifting, it goes from 1 to 21 bit shifting to the right, or 3 or 4 8 bits registers, depending of how high I want accuracy, 24 or 32 bits.

When the bit shifting count is multiple of 8 it is easy, just move the whole bytes,

When it is 1, 2 or 3 bits, also easy, will use 3, 6, 9 clock cycles (for 3 bytes) or 4, 8, 12 for 4 bytes.

For 4~7 seems the reciprocal multiplication still winning with 10 cycles, but increasing cycles when negative.

Will keep fighting this thing.

Thanks.

Wagner Lipnharski
Orlando Florida USA

grohote wrote:

wagnerlip wrote:
Any other ideas for less than 10 clock cycles

9 clock cycles:

In your examples, if X is Even reg, let A be Even reg+1 and then you can replace two mov by movw X, R0.

That is a great catch, and simple, really.

The temp register (x) can easily be aside the A register.

That is around 10% increase in performance... :-)

Wagner Lipnharski
Orlando Florida USA

First MUL take 2 clk. (and I found the datasheet that show 1 clk, but at least the table at the top show 2 clk for all families with HW mul)(there is a clone that do it in 1 clk, but that is an other story)

2.

I hate to use "fake" instructions, I want to write the same code as I see it in the debugger.

3.

In #10 you wrote :

need to propagate the highest bit to the leftmost zero bits on R19

No the SBC will leave R19 with 0 if a positive number or 0xff if negative, which is a correct handling of a signed number.

4. I would also count the load of D as one clk.

I have not spend time on it but I think that you can handle signed numbers by using some of the signed mul instructions.

Are you sure that you need the add with zero?

(It looks like you just could or the bits (the mul have a 16 result but only 8 can hold a value the rest are 0) one of the bits in the add will always be 0 ).

as I remember the fastest way for shift by 4 is using swap.

Last Edited: Thu. Oct 7, 2021 - 12:50 PM

Is the intention to use hand-crafted methods where the number of shifts is known in advance for that location in the code, or to somehow choose which method to use on the fly?

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

I hate to use "fake" instructions, I want to write the same code as I see it in the debugger.

@sparrow2

Do you know if there is a complete list of which commands are "clones" of the others.   Let the real commands stand tall!

Sparrow flies in for the win!

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

No but I just ran thru the instructions and found these instructions:

CBR  is an ANDI with mask
CLR  is EOR register with it self
LSL  ADD register with it self
ROL  ADC register with it self
TST  AND register with it self

avrcandies wrote:

I hate to use "fake" instructions, I want to write the same code as I see it in the debugger.

@sparrow2

Do you know if there is a complete list of which commands are "clones" of the others.   Let the real commands stand tall!

The specializations are the "fakes".

LSL Rfred is ADD Rfred, Rfred .

Moderation in all things. -- ancient proverb

I learned (hard way) that inc reg and add reg, rOne are not the same.

For example, inc XL adc XH,rZero - does not works, must be: add XL,rOne  adc XH, rZero

Brian Fairchild wrote:
Is the intention to use hand-crafted methods where the number of shifts is known in advance for that location in the code, or to somehow choose which method to use on the fly?

Methods using mul in principle can be more generic than the ones using shifts.

westfw wrote:
is 8bits always sufficient for the reciprocal of a power of 2?

8 bits is enough for the inverses 1/2 to 1/128, they are exact because you can use the trick of multiplying by 256/256 ( which is 1, obviously). So

```1/2 * 256/256 = 256/2 * 1/256 = 128/256
1/4 * 256/256 = 256/4 * 1/256 = 64/256
...
1/128 = ... = 2/256```

In sum, multiplying by inverse powers of 2 is the same as multiplying by powers of 2 and then dividing by 256 (or 65536, 256^n etc). Since division by 256^n is free and exact (just moving registers around, or even just reassigning them) it's always possible to do this exactly (unlike powers of 10).

Last Edited: Thu. Oct 7, 2021 - 10:09 PM

MUL with 2**n is a poor mans barrel shifter.

and if you want to shift a byte within it self you or the result(r0 and r1 ) together . (the good thing about MUL is that the 8 unwanted bits are 0)

Methods using mul in principle can be more generic than the ones using shifts.

No you first need to convert n to a mask.

with shift you just make a loop that run n times

grohote wrote:

I learned (hard way) that inc reg and add reg, rOne are not the same.

For example, inc XL adc XH,rZero - does not works, must be: add XL,rOne  adc XH, rZero

The instruction set shows INC does not affect C Status bit.

Wagner Lipnharski
Orlando Florida USA

Confirmed, at least on simulator, MUL instruction, AtMega328, is 2 clock cycles.

Wagner Lipnharski
Orlando Florida USA

The, considering use of movw, and X=r20. A=r21 for example, minimum of 12 cycles for 24 bits positive numbers, for any quantity of right shift bits, being D the reciprocal 2^(8-n).

```C0:     MUL C, D           ;
MOV C, R1          ; CDH

MUL A, D           ;

MUL B, D           ;
ADD C, R0          ; CDH + BDL

Obviously we need to do a total calculation to finish this doubt forever.

The above takes 12 cycles for "n" being any number from 1 to 7.

We can define the question of shifting 3 bytes to the right, in a sequence from 1 to 23 bits.

Lets assume the bytes are A, B and C (lsb), and F will be the counter 1 to 23.

It  is clear that when the count is multiple of 8, just moving the whole byte to the right will be economic, C=B, B=A, A=0, just 3 clock cycles.

This will happens on count of 8 and 16, so we need to keep comparing the count with 8, on each loop of the shifter.

```A1:   cpi F, 8
brlo B0
mov C, B
mov B, A
clr A
subi F,8
rjmp A1

```

As we can see above, it is NOT only 3 clock cycles, it takes 11 clock cycles if 16>F>7,  or 19 cycles if F>15 and a waste of 3 if F<8.

The total A0 cycles will be 11+19+21*3 = 93 cycles.

Considering the normal right shift if nothing else;

```B0:   ASR A
ROR B
ROR C```

for each F count, 3 cycles only.

If using only 23 times B0 (not using A0), it would take, it will take a total of 1+2+3...+22+23 shifts = 276 packs of B0, total of 828 cycles.

If mixing A0 and B0, A0 will run on F=8 and 16, and B0 runs 3 times when F<8, it would be;

3 * B0 * (1+2+3+4+5+6+7) = B0*28*3 * 3Cycles = 252 cycles

and we need to add the A0 93 cycles, total of 252+93 = 345 cycles.

Not counting here the loop control and the moving ending values to the use of the application, just bit shifting.

B0 takes care of negative numbers, since the first shift is ASR, it propagates bit 7 down the range.

But A0 has no negative numbers control, it needs to be added.

```A1:   cpi F, 8
brlo B0
mov C, B
mov B, A
clr A
sbrc B,7   ; if B = 0b1xxxxxx
com a      ; A=FF
subi F,8
rjmp A1```

The total A1 cycles will be 13+21+21*3 = 97 cycles.

Total A1 & B0 = 349 cycles, with total control for negative numbers.

A B C exit shifted right  correctly from 1 to 23 times, even if negative.

Now, lets go back to the C0 routine (multiplication) above, it takes 12 cycles per every whole shift, from 1 to 7 shift counts.

It can of course do up to 21 shifts promoting 3 times 7 shifts, fixing the rest with B0 routine.

But as C0 takes 12 cycles, it is the same as B0 running 4 chaining shifts, and 1, 2 and 3 shifts B0 gains.

So, C0 only gains for 5, 6 or 7 shifts.

Also, C0 needs some dressing around, first, the D registers is not F, need a conversion.

F = 7, D = 2

F = 6, D = 4

F = 5, D = 8

This can be done by compare and branch, by table, or other.

and second, it needs the negative number fix and that is not simple.

The negative works by converting all the left zeros bits on resulting A to "1", if the original A7=1, there are several ways to do it.

I can not see any of the above dressing of C0 be done with less than 10 to 12 instructions, branches, and perhaps more than 15 to 18 cycles.

Lets see, below a very wild brute guessing for dressing B0 with C0, with reciprocal value for D, and dressing for negative numbers.

F is the number of shifts, 1 to 23 in the sequence.

The routine try to use C0 in bumps of no more than 7 shifts, so C1 and C2 below takes care of that and loops to C1 if F>13

C3 do the economical simple 3 instructions low cost shift in case F<5.

C0 is the actual above routine, with the negative dressing, easier than I thought, since temp1 was carrying the value to be ORED to resulting A in case original A was negative.

```C1:   tst F
breq exit
mov temp,F
cpi temp,8     ; if count < 8 can go directly for the reciprocal multiplication
brlo C2
ldi temp,7     ; if count > 7 then do 7 reciprocal mult and reduce 7 from count

C2:   sub F,temp
cpi temp,5     ; if count < 5 no reciprocal for this, just bare B0 times count
brlo C3        ; this line and the above could be eliminated with expense or more cycles below if temp<5

ldi D,2
ldi temp1, \$FE ; dressing left 7 bits if negative
cpi temp,7     ; if count =7 D=2 and do reciprocal
beq C0

ldi D,4
ldi temp1, \$FC ; dressing left 6 bits if negative
cpi temp,6     ; if count =6 D=4 and reciprocal
breq C0

ldi D,8
ldi temp1, \$F8 ; dressing left 5 bits if negative
cpi temp,5     ; if count =5 D=8 and reciprocal
breq C0

C3:   asr A          ; here repeat bare shift "temp" times, up to 4, economical.
ror B
ror C
dec temp
brne C
rjmp C1

C0:   sbrs A,7       ; skip next if original A is not negative, bit 7=0
clr temp1      ; no negative bits
;  here the regular C0, reciprocal multiplication, A B C and D are set
or A,temp1     ; repopulate left bits if original A was negative
```

We have a bunch of instructions to accommodate B0 and C0 routines to run together.

Considering we will have at least 4x B0s when F is lower than 5, one extra B0 when F=21, C0 when F=5, 6, 7, 12, 13, 14, 18, 19, 20.

We have around 20 cycles per iteration just for the overhead/dressing, guessing 1/2 of it per F value, 200 cycles, plus the 349 cycles above, 549 cycles for B0 & C0, still better than 828 for just B0 chain sequence.  Need to simulate the above to get the real cycles count, and perhaps optimize the brute code.  May be including the A1 code (97 cycles) could reduce 2xC0 (8 and 16).

But A1 & B0 do it all with only 349 cycles, so that is the winner, forget the reciprocal multiplication, even being a good idea, the overhead murdered it bloody violently.

And for the curious, yeah, I had nothing better to do this Saturday evening.  ;-)  ... and that closed the doubt.

Wagner Lipnharski
Orlando Florida USA

And one extra thing, if speed matter the code often is placed in ISR, which mean that you have to store R0 and R1 for each interrupt, that would be an extra 8 clk with push/pop or 2 clk if you reserve 2 registers and use MOVW.

avrcandies wrote:
Do you know if there is a complete list of which commands are "clones" of the others.
See this post:

https://www.avrfreaks.net/commen...

It's in a thread about writing an AVR simulator in AVR C (don't ask!) and I showed my analysed list of AVR opcodes in which the "=" lines are simply synonyms ant not "real" insttructions:

```0000 0000 0000 0000   nop				1
0000 0001 dddd rrrr   movw    v,v		1
0000 0010 dddd rrrr   muls    d,d		1
0000 0011 0ddd 0rrr   mulsu   a,a		1
0000 0011 0ddd 1rrr   fmul    a,a		1
0000 0011 1ddd 0rrr   fmuls   a,a		1
0000 0011 1ddd 1rrr   fmulsu  a,a		1
0000 01rd dddd rrrr   cpc     r,r		1
0000 10rd dddd rrrr   sbc     r,r		1
0000 11rd dddd rrrr   add     r,r		1
=             lsl     r

0001 00rd dddd rrrr   cpse    r,r		1
0001 01rd dddd rrrr   cp      r,r		1
0001 10rd dddd rrrr   sub     r,r		1
0001 11rd dddd rrrr   adc     r,r		1
=             rol     r

0010 00rd dddd rrrr   and     r,r		1
=             tst     r
0010 01rd dddd rrrr   eor     r,r		1
=             clr     r
0010 10rd dddd rrrr   or      r,r		1
0010 11rd dddd rrrr   mov     r,r		1

0011 KKKK dddd KKKK   cpi     d,M		1

0100 KKKK dddd KKKK   sbci    d,M		1

0101 KKKK dddd KKKK   subi    d,M		1

0110 KKKK dddd KKKK   ori     d,M		1
=             sbr     d,M

0111 KKKK dddd KKKK   andi    d,M		1
=             cbr     d,n

100! 000d dddd ee-+   ld      r,e		1
100! 001r rrrr ee-+   st      e,r		1
10o0 oo0d dddd booo   ldd     r,b		1
10o0 oo1r rrrr booo   std     b,r		1

1001 000d dddd 0000   lds     r,i		2
1001 000d dddd 010+   lpm     r,z		1
1001 000d dddd 011+   elpm    r,z		1
1001 000r rrrr 1111   pop     r			1
1001 001d dddd 0000   sts     i,r		2
1001 001r rrrr 1111   push    r			1
1001 0100 0000 1001   ijmp				1
1001 0100 0001 1001   eijmp				1
1001 0100 0SSS 1000   bset    S			1
=             sec
=             sez
=             sen
=             sev
=             ses
=             seh
=             set
=             sei
1001 0100 1SSS 1000   bclr    S			1
=             clc
=             clz
=             cln
=             clv
=             cls
=             clh
=             clt
=             cli
1001 0101 0000 1000   ret				1
1001 0101 0000 1001   icall				1
1001 0101 0001 1000   reti				1
1001 0101 0001 1001   eicall			1
1001 0101 1000 1000   sleep				1
1001 0101 1001 1000   break				1
1001 0101 1010 1000   wdr				1
1001 0101 1100 1000   lpm     ?			1
1001 0101 1101 1000   elpm    ?			1
1001 0101 1110 1000   spm				1
1001 010h hhhh 110h   jmp     h			2
1001 010h hhhh 111h   call    h			2
1001 010r rrrr 0000   com     r			1
1001 010r rrrr 0001   neg     r			1
1001 010r rrrr 0010   swap    r			1
1001 010r rrrr 0011   inc     r			1
1001 010r rrrr 0101   asr     r			1
1001 010r rrrr 0110   lsr     r			1
1001 010r rrrr 0111   ror     r			1
1001 010r rrrr 1010   dec     r			1
1001 0110 KKdd KKKK   adiw    w,K		1
1001 0111 KKdd KKKK   sbiw    w,K		1
1001 1000 pppp psss   cbi     p,s		1
1001 1001 pppp psss   sbic    p,s		1
1001 1010 pppp psss   sbi     p,s		1
1001 1011 pppp psss   sbis    p,s		1
1001 11rd dddd rrrr   mul     r,r		1

1011 0PPd dddd PPPP   in      r,P		1
1011 1PPr rrrr PPPP   out     P,r		1

1100 LLLL LLLL LLLL   rjmp    L			1

1101 LLLL LLLL LLLL   rcall   L			1

1110 KKKK dddd KKKK   ldi     d,M		1
=			  ser     d (K=FF)

1111 00ll llll lsss   brbs    s,l		1
=             brcs    l
=             brlo    l
=             breq    l
=             brmi    l
=             brvs    l
=             brlt    l
=             brhs    l
=             brts    l
=             brie    l
1111 01ll llll lsss   brbc    s,l		1
=             brcc    l
=             brsh    l
=             brne    l
=             brpl    l
=             brvc    l
=             brge    l
=             brhc    l
=             brtc    l
=             brid    l
1111 100d dddd 0sss   bld     r,s		1
1111 101d dddd 0sss   bst     r,s		1
1111 110r rrrr 0sss   sbrc    r,s		1
1111 111r rrrr 0sss   sbrs    r,s		1```

I seem to remember it reduces the "132/135 opcodes" that are mentioned at the top of each AVR datasheet to something like 78 actual opcodes.

I seem to remember it reduces the "132/135 opcodes" that are mentioned at the top of each AVR datasheet to something like 78 actual opcodes.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

clawson wrote:
reduces the "132/135 opcodes" that are mentioned at the top of each AVR datasheet to something like 78 actual opcodes.

As they're keen to emphasise that it's RISC (ie, Reduced Instruction Set), why would they want to pump the number up ?!

Top Tips:

1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

awneil wrote:

As they're keen to emphasise that it's RISC (ie, Reduced Instruction Set), why would they want to pump the number up ?!

To help those poor ASM coders who have to rely on knowing each and every instruction, and who expect the instruction set to be obvious.

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

I suppose some might argue that it is easier to remember things like SEI which is really just a virtual instruction for BSET 7. A lot of the virtual instructions are the BSET and BCLR synonyms. BRBS and BRBC also have eight virtual synonyms each too. Again something like a BRNE x is probably easier to remember than BRBC 1, x though if there was a Z=1 somewhere then perhaps you could live with BRBC Z, x ?

Actually there are 9 synonyms under BRBS and BRBC so of those 9 two must be synonyms of each other in each group! .... ah yes, BRCS and BRLO are the same thing  (and in turn there are both really BRBS 0, x)

Brian Fairchild wrote:
To help those poor ASM coders

Poor ASM coders are perfectly happy with a half of that number of instructions. Some of them will never be necessary.

My latest project with 'Tiny13, 90% code use: Instructions used: 51 out of 105 (48.6%)

My previous ATmega16 project: Instructions used: 75 out of 113 (66.4%)

Last Edited: Mon. Oct 11, 2021 - 05:55 PM

grohote wrote:

My latest project with 'Tiny13, 90% code use: Instructions used: 51 out of 105 (48.6%)

My previous ATmega16 project: Instructions used: 75 out of 113 (66.4%)

I had a quick look at the report from my compiler for some production code.

45k LOC compiling down to 19352 bytes (on a 328)

Instructions used: 73 out of 116 (62.9%)

```.lds  :   0 .lds.l:   0 .sts  :   0 .sts.l:   0 adc   :  89 add   :  98
adiw  : 162 and   :   0 andi  :  66 asr   :   0 bclr  :   0 bld   :   2
brbc  :   0 brbs  :   0 brcc  :   2 brcs  :   0 break :   0 breq  : 157
brge  :  34 brhc  :   0 brhs  :   0 brid  :   0 brie  :   0 brlo  :  35
brlt  :  29 brmi  :   1 brne  : 312 brpl  :   4 brsh  :  15 brtc  :   5
brts  :   0 brvc  :   0 brvs  :   0 bset  :   0 bst   :   2 call  : 545
cbi   :  67 cbr   :   1 clc   :   0 clh   :   0 cli   :   5 cln   :   0
clr   :  83 cls   :   0 clt   :   3 clv   :   0 clz   :   0 com   :  11
cp    :  69 cpc   : 109 cpi   : 439 cpse  :   0 dec   :  13 des   :   0
eor   :   8 fmul  :   0 fmuls :   0 fmulsu:   0 icall :   8 ijmp  :   0
in    :  12 inc   :   2 jmp   :  37 ld    :  99 ldd   : 294 ldi   :1489
lds   : 350 lpm   :  14 lsl   :  39 lsr   :   8 mov   : 201 movw  : 128
mul   :  20 muls  :   0 mulsu :   0 neg   :  15 nop   :   0 or    :   9
ori   :  23 out   :  35 pop   :  18 push  :  18 rcall :  85 ret   :  63
reti  :   7 rjmp  : 469 rol   :  52 ror   :   9 sbc   :  42 sbci  :  46
sbi   :  52 sbic  :   8 sbis  :  16 sbiw  :  55 sbr   :   2 sbrc  :  20
sbrs  :  16 sec   :   0 seh   :   0 sei   :   3 sen   :   0 ser   :  17
ses   :   0 set   :   3 sev   :   0 sez   :   0 sleep :   0 spm   :   0
st    : 420 std   : 144 sts   : 244 sub   :  24 subi  : 119 swap  :   0
tst   :   9 wdr   :   3 ```

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

Seems that we have similar numbers.

There is my biggest ASM project, and my record score on Instructions used:

"ATmega32" instruction use summary:
.lds  :   0 .sts  :   0 adc   : 140 add   : 149 adiw  :  21 and   :  30
andi  :  78 asr   :   2 bclr  :   0 bld   :   6 brbc  :   0 brbs  :   0
brcc  :  31 brcs  :  55 break :   0 breq  : 199 brge  :   1 brhc  :   0
brhs  :   0 brid  :   0 brie  :   0 brlo  :  97 brlt  :   6 brmi  :  10
brne  : 318 brpl  :  15 brsh  :  35 brtc  :   7 brts  :   0 brvc  :   0
brvs  :   0 bset  :   0 bst   :   3 call  : 793 cbi   : 111 cbr   :  16
clc   :   6 clh   :   0 cli   :   2 cln   :   0 clr   : 194 cls   :   0
clt   :   2 clv   :   0 clz   :   5 com   :  44 cp    :  43 cpc   :  42
cpi   : 241 cpse  :  75 dec   : 199 eor   :  54 fmul  :   0 fmuls :   0
fmulsu:   0 icall :   0 ijmp  :  30 in    :  31 inc   : 140 jmp   : 122
ld    : 153 ldd   :  86 ldi   :1528 lds   : 612 lpm   :  21 lsl   :  64
lsr   :  69 mov   : 427 movw  : 148 mul   :  18 muls  :   0 mulsu :   0
neg   :   3 nop   :  74 or    :  51 ori   :   1 out   :  72 pop   : 273
push  : 258 rcall : 739 ret   :1020 reti  :   4 rjmp  : 828 rol   :  50
ror   :  67 sbc   :  34 sbci  :  61 sbi   :  84 sbic  :  10 sbis  :   2
sbiw  :   4 sbr   :  24 sbrc  : 148 sbrs  :  94 sec   :  10 seh   :   0
sei   :   2 sen   :   0 ser   :  25 ses   :   0 set   :   3 sev   :   0
sez   :   2 sleep :   1 spm   :   0 st    : 176 std   :  32 sts   : 615
sub   :  40 subi  : 124 swap  :  23 tst   : 143 wdr   :   0
Instructions used: 83 out of 113 (73.5%)

"ATmega32" memory use summary [bytes]:
Segment   Begin    End      Code   Data   Used    Size   Use%
---------------------------------------------------------------
[.cseg] 0x000000 0x006ec8  27422    938  28360   32768  86.5%
[.dseg] 0x000060 0x0007e4      0   1888   1888    2048  92.2%

Well, the AVR have some features of dubious utility, such as the half-carry flag. It's not surprising that instructions that manipulate this flag are rarely, if ever, used.

El Tangas wrote:

Well, the AVR have some features of dubious utility, such as the half-carry flag. It's not surprising that instructions that manipulate this flag are rarely, if ever, used.

Oh come on, H flag is so nice...

Very useful when converting hex to bcd, math in bcd, or a software DAA...

or producing a square wave every 90 instructions without even setting a counter... ;-)

CLR R16

SEC

ROL R16

BRHC PC+2

SBI PINB,2

RJMP PC-3

or producing two square waves, 90° out of phase to each other, full wave every 126 instructions, 63.5kHz if clock 8MHz;

CLR R16

SEC

ROL R16

BRHC PC+2

SBI PINB, 2

BRCC PC+2

SBI PINB, 3

RJMP PC-5

Wagner Lipnharski
Orlando Florida USA

What I would love is to have a second set of registers, R00' to R31' that you could load with a specific value, and activate a compare R05 with R05' (for example) and set a flag or an interrupt...

Think, you are plotting pixels on a color LCD, the screen is like a scope, there are X and Y axis lines in the center of screen, your line being plotted will cross the axis, but you don't want to destroy (plot over) the axis.

So, for every pixel you put on LCD, you need to compare if the address is the same of the axis, and if yes, ignore that plot.

For that simple verification you added 4 or more instructions FOR EVERY pixel plotted, comparing addresses.

If you could compare in hardware the registers containing the address to plot, against a preset value in the counterpart registers, and be interrupted ONLY when it matches... wonderful, isn't it?

May be more simple, R24&R25 and R26&R27 could be compared against any other register pair, programmed in two other compare register I/Os, CPRA and CPRB (8 bits each)

CPRA is the address of the low register (R00~R31, 0x00~0x1F) to compare against R24&R25 and CPRB the same to R26&R27.

Bit 7 of CPRA or CPRB if "1" enables interrupt, Bit 6 is a flag that will be "1" whenever CPRA or B matches the register vectored.

CPRA = 0x06 means R06 will be compared against R24, and R07 against R25, when matches, CPRA bit 6 will be up.

CPRA = 0x86 means the same, but interrupt (vector CPRA_INT) will be generated.

Hardware compare for registers, the same as used for counter/timers.

Wagner Lipnharski
Orlando Florida USA

wagnerlip wrote:
Oh come on, H flag is so nice...

I knew someone would say something like that the moment I hit the "post" button

Personally, I'd rather have a parity flag instead of H, but oh, well...

I like the H flag so much that I left it (and its associated test instructions) out of my nearly-8080 emulator :)

When did anyone in the real world last do BCD arithmetic? (Which seems to work quite well in the 6502 without a half-carry flag...)

Neil

barnacle wrote:
out of my nearly-8080 emulator :)
Did the 8080 have "DAA" like the Z80 does ?

EDIT: answering my own question - so it seems the 8080, like the Z80 that followed it, had DAA but in the Z80 they added the N flag to know is the previous op was an ADD or a SUB whereas in 8080 DAA only made sense after an ADD so it was more limited. But basically if you added something like 0x19 + 0x16 the initially the accumulator would hold 0x2F but after DAA it would be adjusted to 0x35 because 16 + 19 = 35 in decimal. The op is really "if lower nibble > 9 then add 6"

Last Edited: Wed. Oct 13, 2021 - 11:09 AM

wagnerlip wrote:
Using reciprocal multiplication can save some cycles.

Perhaps David Grossman can help to save cycles:

https://getpocket.com/explore/it...

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

theusch wrote:

Perhaps David Grossman can help to save cycles:

... if you are multiplying numbers that contain 50000 decimal digits or more.

The names of the guys who helped us all to "save cycles" are well-known: Anatoly Karatsuba and Andrei Toom - the fathers of high-efficiency multiplication and the following derivative work by Schönhage, Strassen, Fürer and others.

The title immediately hints that a "garbage grade source" qualification is likely.

Hence, the first step after opening the page is, of course, textual search for the combination "Karatsuba"... Comes up with nothing. End of story. The article at the link is immediately stamped as "garbage grade", gets crumpled, draws a happy parabolic trajectory through the air and bounces off the wall into the nearest waste basket.

P.S. Ah, this is actually from "Popular Mechanics"... People normally get insulted by links to "Popular Mechanics" so apparently the new thing is to camouflage them using "pin" sites like pocket.com. It is a form of rickrolling, I guess...

Dessine-moi un mouton

Last Edited: Tue. Nov 30, 2021 - 06:24 PM

Other than being a bit of a non sequitur, I do not see the problem with post 39.

'Twas even labeled as a reference to Popular Mechanics.

BTW what is the matter with a reference to Popular Mechanics?

Moderation in all things. -- ancient proverb

I found the whole thing rather interesting & wanted to see the method.  My only disappointment is the video shows the old way & says there is this great new way we have, but never shows it being done!  Oh well.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!