Shift right multi byte using reciprocal multiplication

Go To Last Post
38 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Everyone knows, to shift right a 24 bits variable (3 bytes) it needs 3 instructions per bit shifted, if doing straight without using loop.

 

per bit:

ASR ByteH ; if signed, or LSR if unsigned

ROR ByteM

ROR ByteL

 

When shifting 6 bits, it will use 18 clock cycles, if this is part of a repeating routine, it can easily eat time.

Using reciprocal multiplication can save some cycles.

 

Shift right ABC (24 bits) "n" bits:
The same as ABC x 2^(8-n), then ignore LSByte, result into ABC.
For "n" 4~7 it saves clock cycles.
If ABC is signed and negative, needs to pad resulting "A" n MSbits with "1", not showing below.

If n=4, D=0x10
If n=5, D=0x08
If n=6, D=0x04

    
         A   B   C   
               x D 
    --- --- --- ---        
            CDH  /
    ADH ADL      / 
        BDH BDL  /
    --- --- --- ---   
     A   B   C   /

Option A (10 clock cycles), considering variable Zero = 0:

       

        MUL A, D          ; 
        MOV A, R1         ; ADH
        MoV X, R0         ; ADL
        
        MUL C, D          ;
        MOV C, R1         ; CDH
        
        MUL B, D
        MOV B, X          ; ADL
        ADD C, R0         ; CDH + BDL
        ADC B, R1         ; ADL + BDH + cy
        ADC A, ZERO       ; ADH + cy

        
Option B,  just changing sequence, same clock cycles.
      

        MUL C, D           ; 
        MOV C, R1          ; CDH

        MUL A, D           ;
        MOV A, R1          ; ADH
        MOV X, R0          ; ADL
        
        MUL B, D           ;
        MOV B, X           ; ADL
        ADD C, R0          ; CDH + BDL
        ADC B, R1          ; ADL + BDH + cy
        ADC A, ZERO        ; ADH + cy

 

Any other ideas for less than 10 clock cycles?

The example above will run in 10 clock cycles no matter the "n" for unsigned.

In the normal shifting fashion, 6 bits will eat 18 clock cycles for unsigned (LSR) or signed (ASR).
 

Wagner Lipnharski
Orlando Florida USA

Last Edited: Wed. Oct 6, 2021 - 03:29 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

24 bit, right shift by 6 bits, in 8 cycles (assumes  EXTRA_HI is zeroed, or that the top 6 EXTRA_HI bits should be ignored)

 

LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI
LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI ;Right-justified result in [EXTRA_HI : HIGH : MED]

I think this is correct, unless I became fumble fingered.

Since the low byte goes to the trashcan, all of the commands could actually be ROL

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Wed. Oct 6, 2021 - 02:53 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Remember that shifting 6 times "down" is the same as shifting 2 times up. (and move the bytes around, )

for sign extend:

SBC a register with itself will give 0 if carry clr and 0xff is is set. 

so something like:

add r16,r16

adc r17,r17

adc r18,r18

sbc r19,r19

add r16,r16

adc r17,r17

adc r18,r18

adc r19,r19

and 3 moves if you need the correct place. 

Last Edited: Wed. Oct 6, 2021 - 04:21 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wagnerlip wrote:
Any other ideas for less than 10 clock cycles

 

9 clock cycles:

In your examples, if X is Even reg, let A be Even reg+1 and then you can replace two mov by movw X, R0.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Great find! 

and then you can replace two mov by movw X, R0

more explicit notation  movw A:X, r1:r0 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

UPS 

You have to remember that MUL take 2 clk so your code take 13 not 10 clk.

 

My code take 8 for signed and 9 for unsigned.(and the 6 top bit of r19)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
You have to remember that MUL take 2 clk so your code take 13 not 10 clk.

 

My 0856 AVR instruction set file shows only 1 clock cycle for MUL (unsigned) instruction.

avrcandies wrote:
You have to remember that MUL take 2 clk so your code take 13 not 10 clk.
 

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avrcandies wrote:

24 bit, right shift by 6 bits, in 8 cycles (assumes  EXTRA_HI is zeroed, or that the top 6 EXTRA_HI bits should be ignored)

 

LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI
LSL LOW
ROL MED
ROL HIGH
ROL EXTRA_HI ;Right-justified result in [EXTRA_HI : HIGH : MED]

I think this is correct, unless I became fumble fingered.

Since the low byte goes to the trashcan, all of the commands could actually be ROL

 

Yes, I considered it on my calculations, some shift count may use different approach.

For example, 1, 2 and 3 bit shift right is economically use plain and simple ASR/ROR, since they will use 3, 6 or 9 clock cycles.

But 4 and above benefits from the reciprocal multiplication.

The backward (left shift) can also be used in place of 6 and 7 right shifts.

 

The only problem that I gave up pursuing backwards, is considering negative value.

Lets say 6 bits right would be 2 set of backwards, it means 8 clock cycles, then move the 3 bytes back right into position since they are 1 byte shifted to the left, and then, check if the highest bit shifted is 1, meaning the whole thing is negative, then propagate that bit to the bits ("0") at the left of the highest byte.  It will blow up more than 12 or 13 clock cycles.

 

What I tried with great effort is keeping the reciprocal multiplication to results in the same registers as it started, except for "X" temporary register.

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My 0856 AVR instruction set file shows only 1 clock cycle for MUL

The newer manual , and all of the datasheets with an instruction set summary, show 2 cycles for MUL.

 

is 8bits always sufficient for the reciprocal of a power of 2?  My occasional experience with integer reciprocals is that they tend to need a lot of bits to yield correct results :-(  (usually as part of the "faster divide by 10" discussions, so NOT involving powers-of-2.)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

Remember that shifting 6 times "down" is the same as shifting 2 times up. (and move the bytes around, )

for sign extend:

SBC a register with itself will give 0 if carry clr and 0xff is is set. 

so something like:

add r16,r16

adc r17,r17

adc r18,r18

sbc r19,r19

add r16,r16

adc r17,r17

adc r18,r18

adc r19,r19

and 3 moves if you need the correct place. 

 

Also a great idea, when the shift count is 6 or 7.

I need to remember that, ADC itself, itself is the same as LSL.

Still the problem that R19 is the left over from all the shift lefts, and at the end it will required to move (reposition) the bytes to the right.

Also, the same problem if the number is negative, need to propagate the highest bit to the leftmost zero bits on R19.

 

This all thing makes part of the CORDIC bit shifting, it goes from 1 to 21 bit shifting to the right, or 3 or 4 8 bits registers, depending of how high I want accuracy, 24 or 32 bits.

When the bit shifting count is multiple of 8 it is easy, just move the whole bytes,

When it is 1, 2 or 3 bits, also easy, will use 3, 6, 9 clock cycles (for 3 bytes) or 4, 8, 12 for 4 bytes.

For 4~7 seems the reciprocal multiplication still winning with 10 cycles, but increasing cycles when negative.

 

Will keep fighting this thing.

Thanks. 

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

grohote wrote:

wagnerlip wrote:
Any other ideas for less than 10 clock cycles

 

9 clock cycles:

In your examples, if X is Even reg, let A be Even reg+1 and then you can replace two mov by movw X, R0.

 

That is a great catch, and simple, really.

The temp register (x) can easily be aside the A register.

That is around 10% increase in performance... :-)

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

First MUL take 2 clk. (and I found the datasheet that show 1 clk, but at least the table at the top show 2 clk for all families with HW mul)(there is a clone that do it in 1 clk, but that is an other story)

 

2.

LSL is an added name for ADD

ROL is ADC

I hate to use "fake" instructions, I want to write the same code as I see it in the debugger.

 

3.

In #10 you wrote :

 need to propagate the highest bit to the leftmost zero bits on R19

No the SBC will leave R19 with 0 if a positive number or 0xff if negative, which is a correct handling of a signed number.

 

4. I would also count the load of D as one clk.

 

I have not spend time on it but I think that you can handle signed numbers by using some of the signed mul instructions.

 

add:

Are you sure that you need the add with zero? 

(It looks like you just could or the bits (the mul have a 16 result but only 8 can hold a value the rest are 0) one of the bits in the add will always be 0 ).

 

as I remember the fastest way for shift by 4 is using swap.

 

 

          

 

 

  

Last Edited: Thu. Oct 7, 2021 - 12:50 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is the intention to use hand-crafted methods where the number of shifts is known in advance for that location in the code, or to somehow choose which method to use on the fly?

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I hate to use "fake" instructions, I want to write the same code as I see it in the debugger.

@sparrow2

Do you know if there is a complete list of which commands are "clones" of the others.   Let the real commands stand tall! 

Sparrow flies in for the win!

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

No but I just ran thru the instructions and found these instructions: 

  CBR  is an ANDI with mask
  CLR  is EOR register with it self
  LSL  ADD register with it self
  ROL  ADC register with it self
  TST  AND register with it self
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avrcandies wrote:

I hate to use "fake" instructions, I want to write the same code as I see it in the debugger.

@sparrow2

Do you know if there is a complete list of which commands are "clones" of the others.   Let the real commands stand tall!

The specializations are the "fakes".

LSL Rfred is ADD Rfred, Rfred .

Moderation in all things. -- ancient proverb

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I learned (hard way) that inc reg and add reg, rOne are not the same.

 

For example, inc XL adc XH,rZero - does not works, must be: add XL,rOne  adc XH, rZero

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Brian Fairchild wrote:
Is the intention to use hand-crafted methods where the number of shifts is known in advance for that location in the code, or to somehow choose which method to use on the fly?

 

Methods using mul in principle can be more generic than the ones using shifts.

 

westfw wrote:
is 8bits always sufficient for the reciprocal of a power of 2?

8 bits is enough for the inverses 1/2 to 1/128, they are exact because you can use the trick of multiplying by 256/256 ( which is 1, obviously). So

1/2 * 256/256 = 256/2 * 1/256 = 128/256
1/4 * 256/256 = 256/4 * 1/256 = 64/256
...
1/128 = ... = 2/256

In sum, multiplying by inverse powers of 2 is the same as multiplying by powers of 2 and then dividing by 256 (or 65536, 256^n etc). Since division by 256^n is free and exact (just moving registers around, or even just reassigning them) it's always possible to do this exactly (unlike powers of 10).

Last Edited: Thu. Oct 7, 2021 - 10:09 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

MUL with 2**n is a poor mans barrel shifter.

 

and if you want to shift a byte within it self you or the result(r0 and r1 ) together . (the good thing about MUL is that the 8 unwanted bits are 0)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Methods using mul in principle can be more generic than the ones using shifts.

 

No you first need to convert n to a mask.

 

with shift you just make a loop that run n times 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

grohote wrote:

I learned (hard way) that inc reg and add reg, rOne are not the same.

 

For example, inc XL adc XH,rZero - does not works, must be: add XL,rOne  adc XH, rZero

 

The instruction set shows INC does not affect C Status bit.

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Confirmed, at least on simulator, MUL instruction, AtMega328, is 2 clock cycles.

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The, considering use of movw, and X=r20. A=r21 for example, minimum of 12 cycles for 24 bits positive numbers, for any quantity of right shift bits, being D the reciprocal 2^(8-n).

C0:     MUL C, D           ; 
        MOV C, R1          ; CDH

        MUL A, D           ;
        MOVW A:X, R1,R0    ; ADH:ADL
        
        MUL B, D           ;
        MOV B, X           ; ADL
        ADD C, R0          ; CDH + BDL
        ADC B, R1          ; ADL + BDH + cy
        ADC A, ZERO        ; ADH + cy

Obviously we need to do a total calculation to finish this doubt forever.

The above takes 12 cycles for "n" being any number from 1 to 7.

 

We can define the question of shifting 3 bytes to the right, in a sequence from 1 to 23 bits.

Lets assume the bytes are A, B and C (lsb), and F will be the counter 1 to 23.

 

It  is clear that when the count is multiple of 8, just moving the whole byte to the right will be economic, C=B, B=A, A=0, just 3 clock cycles.

This will happens on count of 8 and 16, so we need to keep comparing the count with 8, on each loop of the shifter.

A1:   cpi F, 8
      brlo B0
      mov C, B
      mov B, A
      clr A
      subi F,8
      rjmp A1
      

As we can see above, it is NOT only 3 clock cycles, it takes 11 clock cycles if 16>F>7,  or 19 cycles if F>15 and a waste of 3 if F<8.

The total A0 cycles will be 11+19+21*3 = 93 cycles.

 

Considering the normal right shift if nothing else;

B0:   ASR A
      ROR B
      ROR C

for each F count, 3 cycles only.

If using only 23 times B0 (not using A0), it would take, it will take a total of 1+2+3...+22+23 shifts = 276 packs of B0, total of 828 cycles.

 

If mixing A0 and B0, A0 will run on F=8 and 16, and B0 runs 3 times when F<8, it would be;

3 * B0 * (1+2+3+4+5+6+7) = B0*28*3 * 3Cycles = 252 cycles

and we need to add the A0 93 cycles, total of 252+93 = 345 cycles.

 

Not counting here the loop control and the moving ending values to the use of the application, just bit shifting.

B0 takes care of negative numbers, since the first shift is ASR, it propagates bit 7 down the range.

 

But A0 has no negative numbers control, it needs to be added.

 

A1:   cpi F, 8
      brlo B0
      mov C, B
      mov B, A
      clr A
      sbrc B,7   ; if B = 0b1xxxxxx
      com a      ; A=FF
      subi F,8
      rjmp A1

The total A1 cycles will be 13+21+21*3 = 97 cycles.

Total A1 & B0 = 349 cycles, with total control for negative numbers.

A B C exit shifted right  correctly from 1 to 23 times, even if negative.

 

Now, lets go back to the C0 routine (multiplication) above, it takes 12 cycles per every whole shift, from 1 to 7 shift counts.

It can of course do up to 21 shifts promoting 3 times 7 shifts, fixing the rest with B0 routine.

But as C0 takes 12 cycles, it is the same as B0 running 4 chaining shifts, and 1, 2 and 3 shifts B0 gains.

So, C0 only gains for 5, 6 or 7 shifts.

Also, C0 needs some dressing around, first, the D registers is not F, need a conversion.

F = 7, D = 2

F = 6, D = 4

F = 5, D = 8

This can be done by compare and branch, by table, or other.

and second, it needs the negative number fix and that is not simple.

The negative works by converting all the left zeros bits on resulting A to "1", if the original A7=1, there are several ways to do it.   

I can not see any of the above dressing of C0 be done with less than 10 to 12 instructions, branches, and perhaps more than 15 to 18 cycles.

 

Lets see, below a very wild brute guessing for dressing B0 with C0, with reciprocal value for D, and dressing for negative numbers.

F is the number of shifts, 1 to 23 in the sequence.

The routine try to use C0 in bumps of no more than 7 shifts, so C1 and C2 below takes care of that and loops to C1 if F>13

C3 do the economical simple 3 instructions low cost shift in case F<5.

C0 is the actual above routine, with the negative dressing, easier than I thought, since temp1 was carrying the value to be ORED to resulting A in case original A was negative.

  

C1:   tst F
      breq exit
      mov temp,F
      cpi temp,8     ; if count < 8 can go directly for the reciprocal multiplication
      brlo C2
      ldi temp,7     ; if count > 7 then do 7 reciprocal mult and reduce 7 from count

C2:   sub F,temp
      cpi temp,5     ; if count < 5 no reciprocal for this, just bare B0 times count
      brlo C3        ; this line and the above could be eliminated with expense or more cycles below if temp<5

      ldi D,2     
      ldi temp1, $FE ; dressing left 7 bits if negative  
      cpi temp,7     ; if count =7 D=2 and do reciprocal
      beq C0

      ldi D,4
      ldi temp1, $FC ; dressing left 6 bits if negative
      cpi temp,6     ; if count =6 D=4 and reciprocal
      breq C0

      ldi D,8
      ldi temp1, $F8 ; dressing left 5 bits if negative
      cpi temp,5     ; if count =5 D=8 and reciprocal
      breq C0

C3:   asr A          ; here repeat bare shift "temp" times, up to 4, economical.
      ror B 
      ror C 
      dec temp      
      brne C
      rjmp C1

C0:   sbrs A,7       ; skip next if original A is not negative, bit 7=0
      clr temp1      ; no negative bits
      ;  here the regular C0, reciprocal multiplication, A B C and D are set
      or A,temp1     ; repopulate left bits if original A was negative

We have a bunch of instructions to accommodate B0 and C0 routines to run together.

Considering we will have at least 4x B0s when F is lower than 5, one extra B0 when F=21, C0 when F=5, 6, 7, 12, 13, 14, 18, 19, 20.  

We have around 20 cycles per iteration just for the overhead/dressing, guessing 1/2 of it per F value, 200 cycles, plus the 349 cycles above, 549 cycles for B0 & C0, still better than 828 for just B0 chain sequence.  Need to simulate the above to get the real cycles count, and perhaps optimize the brute code.  May be including the A1 code (97 cycles) could reduce 2xC0 (8 and 16).

 

But A1 & B0 do it all with only 349 cycles, so that is the winner, forget the reciprocal multiplication, even being a good idea, the overhead murdered it bloody violently.

 

And for the curious, yeah, I had nothing better to do this Saturday evening.  ;-)  ... and that closed the doubt.

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

And one extra thing, if speed matter the code often is placed in ISR, which mean that you have to store R0 and R1 for each interrupt, that would be an extra 8 clk with push/pop or 2 clk if you reserve 2 registers and use MOVW. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

avrcandies wrote:
Do you know if there is a complete list of which commands are "clones" of the others.
See this post:

 

https://www.avrfreaks.net/commen...

 

It's in a thread about writing an AVR simulator in AVR C (don't ask!) and I showed my analysed list of AVR opcodes in which the "=" lines are simply synonyms ant not "real" insttructions:

 

0000 0000 0000 0000   nop				1
0000 0001 dddd rrrr   movw    v,v		1
0000 0010 dddd rrrr   muls    d,d		1
0000 0011 0ddd 0rrr   mulsu   a,a		1
0000 0011 0ddd 1rrr   fmul    a,a		1
0000 0011 1ddd 0rrr   fmuls   a,a		1
0000 0011 1ddd 1rrr   fmulsu  a,a		1
0000 01rd dddd rrrr   cpc     r,r		1
0000 10rd dddd rrrr   sbc     r,r		1
0000 11rd dddd rrrr   add     r,r		1
        =             lsl     r

0001 00rd dddd rrrr   cpse    r,r		1
0001 01rd dddd rrrr   cp      r,r		1
0001 10rd dddd rrrr   sub     r,r		1
0001 11rd dddd rrrr   adc     r,r		1
        =             rol     r

0010 00rd dddd rrrr   and     r,r		1
        =             tst     r
0010 01rd dddd rrrr   eor     r,r		1
        =             clr     r
0010 10rd dddd rrrr   or      r,r		1
0010 11rd dddd rrrr   mov     r,r		1

0011 KKKK dddd KKKK   cpi     d,M		1

0100 KKKK dddd KKKK   sbci    d,M		1

0101 KKKK dddd KKKK   subi    d,M		1

0110 KKKK dddd KKKK   ori     d,M		1
        =             sbr     d,M

0111 KKKK dddd KKKK   andi    d,M		1
        =             cbr     d,n

100! 000d dddd ee-+   ld      r,e		1
100! 001r rrrr ee-+   st      e,r		1
10o0 oo0d dddd booo   ldd     r,b		1
10o0 oo1r rrrr booo   std     b,r		1

1001 000d dddd 0000   lds     r,i		2
1001 000d dddd 010+   lpm     r,z		1
1001 000d dddd 011+   elpm    r,z		1
1001 000r rrrr 1111   pop     r			1
1001 001d dddd 0000   sts     i,r		2
1001 001r rrrr 1111   push    r			1
1001 0100 0000 1001   ijmp				1
1001 0100 0001 1001   eijmp				1
1001 0100 0SSS 1000   bset    S			1
        =             sec
        =             sez
        =             sen
        =             sev
        =             ses
        =             seh
        =             set
        =             sei
1001 0100 1SSS 1000   bclr    S			1
        =             clc
        =             clz
        =             cln
        =             clv
        =             cls
        =             clh
        =             clt
        =             cli
1001 0101 0000 1000   ret				1
1001 0101 0000 1001   icall				1
1001 0101 0001 1000   reti				1
1001 0101 0001 1001   eicall			1
1001 0101 1000 1000   sleep				1
1001 0101 1001 1000   break				1
1001 0101 1010 1000   wdr				1
1001 0101 1100 1000   lpm     ?			1
1001 0101 1101 1000   elpm    ?			1
1001 0101 1110 1000   spm				1
1001 010h hhhh 110h   jmp     h			2
1001 010h hhhh 111h   call    h			2
1001 010r rrrr 0000   com     r			1
1001 010r rrrr 0001   neg     r			1
1001 010r rrrr 0010   swap    r			1
1001 010r rrrr 0011   inc     r			1
1001 010r rrrr 0101   asr     r			1
1001 010r rrrr 0110   lsr     r			1
1001 010r rrrr 0111   ror     r			1
1001 010r rrrr 1010   dec     r			1
1001 0110 KKdd KKKK   adiw    w,K		1
1001 0111 KKdd KKKK   sbiw    w,K		1
1001 1000 pppp psss   cbi     p,s		1
1001 1001 pppp psss   sbic    p,s		1
1001 1010 pppp psss   sbi     p,s		1
1001 1011 pppp psss   sbis    p,s		1
1001 11rd dddd rrrr   mul     r,r		1

1011 0PPd dddd PPPP   in      r,P		1
1011 1PPr rrrr PPPP   out     P,r		1

1100 LLLL LLLL LLLL   rjmp    L			1

1101 LLLL LLLL LLLL   rcall   L			1

1110 KKKK dddd KKKK   ldi     d,M		1
	    =			  ser     d (K=FF)

1111 00ll llll lsss   brbs    s,l		1
        =             brcs    l
        =             brlo    l
        =             breq    l
        =             brmi    l
        =             brvs    l
        =             brlt    l
        =             brhs    l
        =             brts    l
        =             brie    l
1111 01ll llll lsss   brbc    s,l		1
        =             brcc    l
        =             brsh    l
        =             brne    l
        =             brpl    l
        =             brvc    l
        =             brge    l
        =             brhc    l
        =             brtc    l
        =             brid    l
1111 100d dddd 0sss   bld     r,s		1
1111 101d dddd 0sss   bst     r,s		1
1111 110r rrrr 0sss   sbrc    r,s		1
1111 111r rrrr 0sss   sbrs    r,s		1

I seem to remember it reduces the "132/135 opcodes" that are mentioned at the top of each AVR datasheet to something like 78 actual opcodes.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I seem to remember it reduces the "132/135 opcodes" that are mentioned at the top of each AVR datasheet to something like 78 actual opcodes.

Thanks!  I always wondered about this ---how many were "clones"

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
reduces the "132/135 opcodes" that are mentioned at the top of each AVR datasheet to something like 78 actual opcodes.

As they're keen to emphasise that it's RISC (ie, Reduced Instruction Set), why would they want to pump the number up ?!

 

frown

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:

As they're keen to emphasise that it's RISC (ie, Reduced Instruction Set), why would they want to pump the number up ?!

 

To help those poor ASM coders who have to rely on knowing each and every instruction, and who expect the instruction set to be obvious. 

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I suppose some might argue that it is easier to remember things like SEI which is really just a virtual instruction for BSET 7. A lot of the virtual instructions are the BSET and BCLR synonyms. BRBS and BRBC also have eight virtual synonyms each too. Again something like a BRNE x is probably easier to remember than BRBC 1, x though if there was a Z=1 somewhere then perhaps you could live with BRBC Z, x ?

 

Actually there are 9 synonyms under BRBS and BRBC so of those 9 two must be synonyms of each other in each group! .... ah yes, BRCS and BRLO are the same thing  (and in turn there are both really BRBS 0, x)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Brian Fairchild wrote:
To help those poor ASM coders

 

Poor ASM coders are perfectly happy with a half of that number of instructions. Some of them will never be necessary.

My latest project with 'Tiny13, 90% code use: Instructions used: 51 out of 105 (48.6%)

My previous ATmega16 project: Instructions used: 75 out of 113 (66.4%)

 

Last Edited: Mon. Oct 11, 2021 - 05:55 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

grohote wrote:

My latest project with 'Tiny13, 90% code use: Instructions used: 51 out of 105 (48.6%)

My previous ATmega16 project: Instructions used: 75 out of 113 (66.4%)

 

I had a quick look at the report from my compiler for some production code.

 

45k LOC compiling down to 19352 bytes (on a 328)

Instructions used: 73 out of 116 (62.9%)

 

.lds  :   0 .lds.l:   0 .sts  :   0 .sts.l:   0 adc   :  89 add   :  98 
adiw  : 162 and   :   0 andi  :  66 asr   :   0 bclr  :   0 bld   :   2 
brbc  :   0 brbs  :   0 brcc  :   2 brcs  :   0 break :   0 breq  : 157 
brge  :  34 brhc  :   0 brhs  :   0 brid  :   0 brie  :   0 brlo  :  35 
brlt  :  29 brmi  :   1 brne  : 312 brpl  :   4 brsh  :  15 brtc  :   5 
brts  :   0 brvc  :   0 brvs  :   0 bset  :   0 bst   :   2 call  : 545 
cbi   :  67 cbr   :   1 clc   :   0 clh   :   0 cli   :   5 cln   :   0 
clr   :  83 cls   :   0 clt   :   3 clv   :   0 clz   :   0 com   :  11 
cp    :  69 cpc   : 109 cpi   : 439 cpse  :   0 dec   :  13 des   :   0 
eor   :   8 fmul  :   0 fmuls :   0 fmulsu:   0 icall :   8 ijmp  :   0 
in    :  12 inc   :   2 jmp   :  37 ld    :  99 ldd   : 294 ldi   :1489 
lds   : 350 lpm   :  14 lsl   :  39 lsr   :   8 mov   : 201 movw  : 128 
mul   :  20 muls  :   0 mulsu :   0 neg   :  15 nop   :   0 or    :   9 
ori   :  23 out   :  35 pop   :  18 push  :  18 rcall :  85 ret   :  63 
reti  :   7 rjmp  : 469 rol   :  52 ror   :   9 sbc   :  42 sbci  :  46 
sbi   :  52 sbic  :   8 sbis  :  16 sbiw  :  55 sbr   :   2 sbrc  :  20 
sbrs  :  16 sec   :   0 seh   :   0 sei   :   3 sen   :   0 ser   :  17 
ses   :   0 set   :   3 sev   :   0 sez   :   0 sleep :   0 spm   :   0 
st    : 420 std   : 144 sts   : 244 sub   :  24 subi  : 119 swap  :   0 
tst   :   9 wdr   :   3 

 

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Seems that we have similar numbers.

 

There is my biggest ASM project, and my record score on Instructions used:

 

"ATmega32" instruction use summary:
.lds  :   0 .sts  :   0 adc   : 140 add   : 149 adiw  :  21 and   :  30
andi  :  78 asr   :   2 bclr  :   0 bld   :   6 brbc  :   0 brbs  :   0
brcc  :  31 brcs  :  55 break :   0 breq  : 199 brge  :   1 brhc  :   0
brhs  :   0 brid  :   0 brie  :   0 brlo  :  97 brlt  :   6 brmi  :  10
brne  : 318 brpl  :  15 brsh  :  35 brtc  :   7 brts  :   0 brvc  :   0
brvs  :   0 bset  :   0 bst   :   3 call  : 793 cbi   : 111 cbr   :  16
clc   :   6 clh   :   0 cli   :   2 cln   :   0 clr   : 194 cls   :   0
clt   :   2 clv   :   0 clz   :   5 com   :  44 cp    :  43 cpc   :  42
cpi   : 241 cpse  :  75 dec   : 199 eor   :  54 fmul  :   0 fmuls :   0
fmulsu:   0 icall :   0 ijmp  :  30 in    :  31 inc   : 140 jmp   : 122
ld    : 153 ldd   :  86 ldi   :1528 lds   : 612 lpm   :  21 lsl   :  64
lsr   :  69 mov   : 427 movw  : 148 mul   :  18 muls  :   0 mulsu :   0
neg   :   3 nop   :  74 or    :  51 ori   :   1 out   :  72 pop   : 273
push  : 258 rcall : 739 ret   :1020 reti  :   4 rjmp  : 828 rol   :  50
ror   :  67 sbc   :  34 sbci  :  61 sbi   :  84 sbic  :  10 sbis  :   2
sbiw  :   4 sbr   :  24 sbrc  : 148 sbrs  :  94 sec   :  10 seh   :   0
sei   :   2 sen   :   0 ser   :  25 ses   :   0 set   :   3 sev   :   0
sez   :   2 sleep :   1 spm   :   0 st    : 176 std   :  32 sts   : 615
sub   :  40 subi  : 124 swap  :  23 tst   : 143 wdr   :   0
Instructions used: 83 out of 113 (73.5%)

"ATmega32" memory use summary [bytes]:
Segment   Begin    End      Code   Data   Used    Size   Use%
---------------------------------------------------------------
[.cseg] 0x000000 0x006ec8  27422    938  28360   32768  86.5%
[.dseg] 0x000060 0x0007e4      0   1888   1888    2048  92.2%

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well, the AVR have some features of dubious utility, such as the half-carry flag. It's not surprising that instructions that manipulate this flag are rarely, if ever, used.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

El Tangas wrote:

Well, the AVR have some features of dubious utility, such as the half-carry flag. It's not surprising that instructions that manipulate this flag are rarely, if ever, used.

 

Oh come on, H flag is so nice... 

Very useful when converting hex to bcd, math in bcd, or a software DAA...

or producing a square wave every 90 instructions without even setting a counter... ;-)

 

CLR R16

SEC

ROL R16

BRHC PC+2

SBI PINB,2

RJMP PC-3

 

or producing two square waves, 90° out of phase to each other, full wave every 126 instructions, 63.5kHz if clock 8MHz;

 

CLR R16

SEC

ROL R16

BRHC PC+2

SBI PINB, 2

BRCC PC+2

SBI PINB, 3

RJMP PC-5

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What I would love is to have a second set of registers, R00' to R31' that you could load with a specific value, and activate a compare R05 with R05' (for example) and set a flag or an interrupt...

Think, you are plotting pixels on a color LCD, the screen is like a scope, there are X and Y axis lines in the center of screen, your line being plotted will cross the axis, but you don't want to destroy (plot over) the axis.

So, for every pixel you put on LCD, you need to compare if the address is the same of the axis, and if yes, ignore that plot.

For that simple verification you added 4 or more instructions FOR EVERY pixel plotted, comparing addresses.

If you could compare in hardware the registers containing the address to plot, against a preset value in the counterpart registers, and be interrupted ONLY when it matches... wonderful, isn't it? 

 

May be more simple, R24&R25 and R26&R27 could be compared against any other register pair, programmed in two other compare register I/Os, CPRA and CPRB (8 bits each)

CPRA is the address of the low register (R00~R31, 0x00~0x1F) to compare against R24&R25 and CPRB the same to R26&R27.

Bit 7 of CPRA or CPRB if "1" enables interrupt, Bit 6 is a flag that will be "1" whenever CPRA or B matches the register vectored.

 

CPRA = 0x06 means R06 will be compared against R24, and R07 against R25, when matches, CPRA bit 6 will be up.

CPRA = 0x86 means the same, but interrupt (vector CPRA_INT) will be generated.

 

Hardware compare for registers, the same as used for counter/timers.

Wagner Lipnharski
Orlando Florida USA

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wagnerlip wrote:
Oh come on, H flag is so nice... 

I knew someone would say something like that the moment I hit the "post" button wink

Personally, I'd rather have a parity flag instead of H, but oh, well...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I like the H flag so much that I left it (and its associated test instructions) out of my nearly-8080 emulator :)

 

When did anyone in the real world last do BCD arithmetic? (Which seems to work quite well in the 6502 without a half-carry flag...)

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

barnacle wrote:
out of my nearly-8080 emulator :)
Did the 8080 have "DAA" like the Z80 does ?

 

EDIT: answering my own question - so it seems the 8080, like the Z80 that followed it, had DAA but in the Z80 they added the N flag to know is the previous op was an ADD or a SUB whereas in 8080 DAA only made sense after an ADD so it was more limited. But basically if you added something like 0x19 + 0x16 the initially the accumulator would hold 0x2F but after DAA it would be adjusted to 0x35 because 16 + 19 = 35 in decimal. The op is really "if lower nibble > 9 then add 6"

Last Edited: Wed. Oct 13, 2021 - 11:09 AM