Right Rotate 32 bit value - minimal instructions

Go To Last Post
101 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi,

 

I'm currently writing a project in assembly (because it's more interesting than using the higher level language) for the ATMega328P, I have a 32-bit dword stored in registers {R3...R0} and my aim is to right rotate the value in these registers by n bits (n value is stored in R23:R22):

 

My code so far is below, it works but I can't help thinking it could be done in fewer instructions - I'm hoping someone with more experience might be able to point me in the right direction?

 

Any advice on more elegant code would be appreciated!

 

Many Thanks 

 

Ps. You'll have to excuse the comments formatting - the code editor doesn't seem to like assembly code much

 

start_test:
    LDI		R22, 0x05
    LDI		R23, 0x00

Rotate:
    CLC					; Clear the carry flag
    SBRC	R0, 0	    ; If bit 0 in the lower byte is clear,
                        ;skip setting carry flag
    SEC					; Set carry flag
    ROR		R3			; ROR R3 1 bit with carry
    ROR		R2			; ROR R2 1 bit with carry
    ROR		R1			; ROR R1 1 bit with carry
    ROR		R0			; ROR R0 1 bit with carry
    ADIW	R24, 1		; Increment R25:R24 by 1
    CLZ					; Clear the zero flag
    CP		R22,R24		; Compare to R23:R22 (0x0005)
    CPC		R23,R25
    BREQ	ExitLoop	; Exit if equal
    JMP		Rotate
    ExitLoop:

 

Last Edited: Sat. Feb 10, 2018 - 08:32 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

n does not need to be 16-bit.  And rather than counting up to n, decrement n down to zero.

 

Greg Muth

Portland, OR, US

Xplained/Pro/Mini Boards mostly

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Something like this:

Rotate:
    CLC			    ; Clear the carry flag
    SBRC	R0, 0	; If bit 0 in the lower byte is clear,
                    ;    skip setting carry flag
    SEC			    ; Set carry flag
    ROR		R3	    ; ROR R3 1 bit with carry
    ROR		R2	    ; ROR R2 1 bit with carry
    ROR		R1	    ; ROR R1 1 bit with carry
    ROR		R0	    ; ROR R0 1 bit with carry
    DEC		R22
    BRNE	Rotate
    RET

 

Greg Muth

Portland, OR, US

Xplained/Pro/Mini Boards mostly

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'd also try writing it in a high level language and see what that produces (the guys who write the code generation stuff are often fiendishly good Asm programmers!)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Greg_Muth wrote:
n does not need to be 16-bit. And rather than counting up to n, decrement n down to zero.
And don't use a Subtract for it but a Decrement. DEC does not touch the carry bit, so you don't need to "recreate" it on every loop iteration.

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you all for your responses. 

 

Greg_Muth / Sternst - thanks for the code I wasn't aware DEC also set the zero flag when it hit 0 so there would be no need to compare registers! I've programmed in high level languages but am fairly new to assembly (other than a bit of Z80 programming in college) I've also reduced n to an 8 bit value in one register.

 

Clawson - thanks for the advice, I'll try writing it in higher level decompiling and see what I get! 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

NickB wrote:
I can't help thinking it could be done in fewer instructions

 

Does fewer mean less words or less clocks?

 

Less clocks has lots of options.

 

Less words you are not really going to get much better than Greg has shown.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

And don't use a Subtract for it but a Decrement. DEC does not touch the carry bit, so you don't need to "recreate" it on every loop iteration

S


 

 

 

Edit: remove misleading routine
 

When in the dark remember-the future looks brighter than ever.

Last Edited: Sun. Feb 11, 2018 - 08:02 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avrcandies wrote:

So the brne goes back to ROR R3, saving 3 instructions each cycle (still need them preceeding the first time through the loop).

What would be the result of rotating 0x00000001 two positions ?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

When shifting more than 7 positions you can "shift" bytes like this:

Ror32:
	subi r22, 8
	brlo Ror32Add8

	push r0
	mov r0, r1
	mov r1, r2
	mov r2, r3
	pop r3

	brne Ror32
	ret

Ror32Add8:
	subi r22, 248

Ror32Rotate:
	clc
	sbrc r0, 0
	sec

	ror r3
	ror r2
	ror r1
	ror r0
	dec r22
	brne Ror32Rotate
	ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can obviously mask out multiple of 32 bits from n.

 

 

Think a bit about tradeoffs. Is it a good tradeoff if your algorithm is more expensive for some n, and cheaper for others? For instance, doing the loop seven times seems like it might be expensive. In some cases it might make sense to use and/or operations to move blocks of bits at a time. Possibly not, though, because the shifts and masks might be spendy. But also consider; a rotate right of 7 bits is a rotate left of one and a rotate right of eight, with rotate right of eight being possible to optimize into whole-byte moves.

 

I know nothing about AVR conventions for register saving, etc., so I can't advise on that. I would be sort of tempted, though, to do it with a fixed number of masks/shifts... oh, wait. rotate is single-bit only, making that a lot less appealing.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What would be the result of rotating 0x00000001 two positions ?

Apparently, this statement really won't help...And don't use a Subtract for it but a Decrement. DEC does not touch the carry bit, so you don't need to "recreate" it on every loop iteration, since the last bit of the register set always needs carried to the front.

When in the dark remember-the future looks brighter than ever.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

andrewm1973 wrote:

NickB wrote:
I can't help thinking it could be done in fewer instructions

 

Does fewer mean less words or less clocks?

 

 

^^^^That.

 

In microcontrollers 'less words' is usually not important, and if it is then you've picked the wrong chip for the job. 'Less clocks' though is often important. Sure, you can throw a faster chip at the problem but that brings think like extra power consumption along with it.

 

As this post is in 'Compilers and General Programming' then I guess the real answer is to use a 32-bit chip with a barrel shifter. Which'll win in both words and clocks.

 

"This forum helps those that help themselves."

"How have you proved that your chip is running at xxMHz?" - Me

"If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avrcandies wrote:
Apparently, this statement really won't help...And don't use a Subtract for it but a Decrement. DEC does not touch the carry bit, so you don't need to "recreate" it on every loop iteration, since the last bit of the register set always needs carried to the front.
Yes, I was wrong there.

 

But there is still a small improvement possible. Instead of preloading the carry bit you could just copy the LSB:

Rotate:
    ror r3
    bst r0,0
    bld r3,7
    ror r2
    ror r1
    ror r0
    dec r22
    brne Rotate
    ret

 

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Brian Fairchild wrote:
In microcontrollers 'less words' is usually not important, and if it is then you've picked the wrong chip for the job. 'Less clocks' though is often important.

 

Disagree with that.

 

Sometimes you want to save clocks.

 

Sometimes you want to save words.

 

I'll wait for OP to respond his/her needs.

 

Also sometimes you didn't get to choose the microcontroller.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I think that Brian has put it very eloquently. You can't improve on 7 cycles on an AVR. It is going to be up to 40 cycles for big shifts with intelligence. The simple code in #14 is going to be 9 x N cycles which gets pretty expensive for N rotates.
.
An ARM barrel shifter does it all in one cycle.
.
Only the OP knows her requirements. If Flash size is down to 9 available words maintenance is going to be a nightmare.
.
David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you all for your responses, I'll try and expand with more details on the problem. 

 

n is predetermined and will not be greater than 25, the values possible for n are {2, 6, 7, 11, 13,17, 18, 19, 22, 25}  

 

By less instructions I mean reducing the number of CPU instructions in order to keep the clock cycles down as much as possible (I want it to complete the process as fast as possible) I know some instructions take more clock cycles than others (looking at the datasheet it seems to be either 1, 2 or 3 cycles) but in general I figure less instructions to complete a task mean less clock cycles to complete the task.

 

 But also consider; a rotate right of 7 bits is a rotate left of one and a rotate right of eight, with rotate right of eight being possible to optimize into whole-byte moves.

 

I'm afraid I'm new to masking so will have to look that one up and do a bit of reading. I do like the approach above though and can see how it would work where 8-n < n or 16-n < n etc..  

 

Using the code above in #14 it looks like the largest would be 9 x 25 = 225 clock cycles which seems like a lot to me to do a simple rotation, however I could swap whole bytes like below with MOVW / MOV then do one ROR:

 

R3 | R2 | R1 | R0

R2 | R1 | R0 | R3

 

 

I do have a lot of space to play with in the program memory so rather than trying to encapsulate it all in one subroutine I could write something that's more tailored and use byte shifting for certain n values that would suit it? 

 

Currently I only have a ATMega328P available to me on an ardruino board so for now I'm stuck using 8-bit registers, it would definitely make sense to move to a 32-bit chip with barrel shifter though to be honest I am quite enjoying the challenge. 

 

Thanks again for all your responses, I'm definitely learning a lot! 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Follow Cliff's advice.    Write a simple app in C.   Inspect the LSS file to see what the C compiler does.    Whit -O1, -O2, -O3

 

I would guess that it will optimise 7-bits right as 8-bits left with 1-bit right.   Similarly with your other shifts.    My 40-cycles in #16 was guesswork.

 

I could test your {2, 6, 7, 11, 13,17, 18, 19, 22, 25} in the Simulator to give real-life cycles.

 

The simple answer is that I would expect GCC to beat the 9 x N figure from #14.

 

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

NickB wrote:

n is predetermined and will not be greater than 25, the values possible for n are {2, 6, 7, 11, 13,17, 18, 19, 22, 25}  

 

And is each of those values on n equally likely to be used as all the others?

 

NickB wrote:

By less instructions I mean reducing the number of CPU instructions in order to keep the clock cycles down as much as possible ...

 

Generally, the fastest way to speed up any iterative process is to unroll it.So you could have a line of 13 32-bit right rotates, for the n = 2 to 13 cases, and 15 32-bit left rotates for the n= 17 to 25 cases, and you just jump to the right entry point depending on the value of n. Like that you save the expense of the loop counter, test, and jump at the expense of the test on n at the top. Some of the longer rotates could be a mix of a smaller rotate and a byte swap. To make your test of n as efficient as possible you'd test the most likely values first

 

Or you have 10 optimal routines for your 10 different shift amounts and a jump table at the top which picks the routine address to jump to based on n.

"This forum helps those that help themselves."

"How have you proved that your chip is running at xxMHz?" - Me

"If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

First for optimized code you need to look at the hole picture!

Perhaps we optimize this, but you lose the gain in the "mapping" to the routine.

 

Do you want max speed in avg. or worst case? 

 

 

So do you know which value you have (2, 6, 7, 11, 13,17, 18, 19, 22, 25) so you without any cost make 10 different routines and just call the needed one?

 

 

2 : make 2 shift R

6 : make 2 shift L and move bytes

7 : make 1 shift L and move bytes

11:make 3 shift R and move bytes (perhaps there is a way using swap and one shift)

13:make 3 shift L and move word (again perhaps swap)

17:make 1 shift R and move word

18:make 2 shift R and move word

19:make 3 shift R and move word (swap ?)

22:make 2 shift L and move bytes

25:make 1 shift R and move bytes

 

11 13 and 19 is the slow ones perhaps it can be done faster by doing the shifts using the mul instruction but then you have to move the number away from r1:r0

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well,  I was too idle to write the code in AS7.   From Sparrow2.   11, 13, 19 would be 26 cycles if unrolled.   35 with loop.   i.e. much better than my initial 40-cycle  guess.

 

worst case 12 rotates would be 4 shift R and move bytes i.e. 33 cycles if unrolled.

MUL might be useful.    See what the C compiler does.

 

You must make your own decisions about inline sequences or subroutine calls.   An RCALL+RET is 7 cycles.

 

Untested.   I did the cycle calcs in my head.

 

David.

Last Edited: Sun. Feb 11, 2018 - 02:27 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

how do you get 26?

 

it's something like this for 11(untested number in r27:r26:25:r24):

	mov r0,r24
	ror r0 ; now carry is correct
	ror r27
	ror r26
	ror r25
	ror r24
	ror r27
	ror r26
	ror r25
	ror r24
	ror r27
	ror r26
	ror r25
	ror r24
	mov r0,r24
	mov r24,r25
	mov r25,r26
	mov r26,r27
	mov r27,r0

that is only 19 clk

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
it's something like this for 11
You made the same mistake I made in #5. You can't use the carry bit from the end of one rotating cycle as input for the next cycle.

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

???

Why not

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
??? Why not
After finishing the rotation the carry bit contains the old LSB (equal to the current MSB), but for the next rotation you need the current LSB in the carry bit.

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am guessing:

Rotate27:
; rotate right 3 bits
    ror r3
    bst r0,0
    bld r3,7
    ror r2
    ror r1
    ror r0
    ror r3
    bst r0,0
    bld r3,7
    ror r2
    ror r1
    ror r0
    ror r3
    bst r0,0
    bld r3,7
    ror r2
    ror r1
    ror r0
; move bytes left 8 bits (right 24)
	mov tmp,r0
	mov r0,r1
	mov r1,r2
	mov r2,r3
	mov r3,tmp
;
    ret

I make that 23 cycles (without the RET)

 

Obviously it is wise to test these properly in the Simulator.

 

You can probably save the 3 bottom bits at the start.   Do 3 sets of ROR.

Then place the 3 saved bits into the top byte with BST, BLD

That would be 1 + N*2 for the save/BLD and N*4 for the right shift + 5 for the byte move.

i.e. 6 + 6*N = 24 cycles.  which is no better than the original

 

Ah-ha.  You can use AND, SWAP, LSL, OR instead of 3 sets of BST/BLD.   5 cycles vs 6 cycles.   No better than original 23 cycles.

 

David.

Last Edited: Sun. Feb 11, 2018 - 03:55 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

How about  the following, for overall flexibility  (might not be optimal for a specific shift):

 

 

			andi r22, 31  ;0-31 shifts with argument wraparound
			
			cpi r22, 16
   			brlo less_than16
  
   			push r1   ;fast 16 rotate
   			mov r1, r3
   			pop r3
   			push r0
   			mov r0, r2
   			pop r2
  
   			subi r22, 16 ;just did 16

less_than16:		cpi r22, 8
			brlo less_than8	

			push r0  ;fast 8 bit rotate
			mov r0, r1
			mov r1, r2
			mov r2, r3
			pop r3

			subi r22, 8   ;just did 8

less_than8: 		cpi r22, 7
			breq shift_7more
			cpi r22, 6
			breq shift_6more
			cpi r22, 5
			breq shift_5more
			cpi r22, 4
			breq shift_4more
			cpi r22, 3
			breq shift_3more
			cpi r22, 2
			breq shift_2more
			cpi r22, 1
			breq shift_1more
			ret

shift_7more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

shift_6more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

shift_5more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

shift_4more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

shift_3more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

shift_2more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

shift_1more:		lsr r3   ;msb will be entered by the following two instructions
		    	bst r0,0
		    	bld r3,7
		    	ror r2
		    	ror r1
		    	ror r0

			ret

edit : improved tabs for reading

edit:  increased range to 31 shifts

When in the dark remember-the future looks brighter than ever.

Last Edited: Sun. Feb 11, 2018 - 05:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

yes I get my error (shifting 33 bit not 32)

so the best 3 shift I can find is :

	lsr r3
	ror r2
	ror r1
	ror r0
	ror r24
	lsr r3
	ror r2
	ror r1
	ror r0
	ror r24
	lsr r3
	ror r2
	ror r1
	ror r0
	ror r24
	andi r24,0xe0
	or r3,r24

untested but should work.

Last Edited: Sun. Feb 11, 2018 - 05:34 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

yes I get my error (shifting 33 bit not 32)

so the best 3 shift I can find is :

	lsr r3 ; prior to this you need to make sure carry bit is cleared
	ror r2
	ror r1
	ror r0
	ror r24 ; prior to this you need to make sure bits 0, 1, 2 are cleared, otherwise they would go to the carry bit and end up on r3
	lsr r3
	ror r2
	ror r1
	ror r0
	ror r24
	lsr r3
	ror r2
	ror r1
	ror r0
	ror r24
	andi r24,0xe0
	or r3,r24

untested but should work.

therefore on every shift you initially save one cycle - in total 3, and add two at the end. Adding these two more cycles I mentioned you end up one cycle longer than the variant presented many posts before.

 

What this website has become; like full of fake news. As a beginner looking for help, you ask, and expect to get help. You need in fact to outsmart the answers here in order not to fall in a trap, which kinda defeats the purpose of asking.

 

Edit: Oops, that lsr... Note that r24 would likely need a push and pop which is 4 cycles.

Last Edited: Mon. Feb 12, 2018 - 04:54 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

angelu wrote:
prior to this you need to make sure carry bit is cleared

...

prior to this you need to make sure bits 0, 1, 2 are cleared, otherwise they would go to the carry bit and end up on r3

Nope. He is using LSR for r3, not ROR.

 

angelu wrote:
What this website has become; like full of fake news.
You mean like your own statements above?

Perfect shot into your own foot, I think.

Stefan Ernst

Last Edited: Sun. Feb 11, 2018 - 06:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I do not understand the code in 27, 28 and 29.

I think the worst case for this code is 44 cycles, including ret.

rotateright:
  sbrc R22, 4
  rjmp leftbytes
  ; n & ~7 <= 8
  sbrs R22, 3
  rjmp shiftbits

  ; shift rite one byte
  mov R9, R0
  mov R0, R1
  mov R1, R2
  mov R2, R3
  mov R3, R9
  rjmp shiftbits

leftbytes:
  sbrc R22, 3
  rjmp leftonebyte
  ; two bytes
  xor r0, r2
  xor r2, r0
  xor r0, r2
  xor r1, r3
  xor r3, r1
  xor r1, r3
  rjmp shiftbits

leftonebyte:
  mov R9, R3
  mov R3, R2
  mov R2, R1
  mov R1, R0
  mov R0, R9

shiftbits:
  sbrc R22, 2
  rjmp rotleft
  ; n & 7 in 0..3

  mov R9, R0
  .macro rite
  lsr R9
  ror R3
  ror R2
  ror R1
  ror R0
  .endm

  sbrs R22, 1
  rjmp rite1
  rite
  rite
rite1:
  sbrc R22, 0
  ret
  rite
  ret

rotleft:
  ; n & 7 in 4..7
  ; shift left 1..4 times

  mov R9, R3
  .macro left
  lsl R9
  rol R0
  rol R1
  rol R2
  rol R3
  .endm

  ; at least once
  left
  ; if n even, again
  sbrs R22
  rjmp left2
  left

left2:
  sbrc R22, 1  ; n & 7 >= 6?
  ret          ; yes
  left
  left
  ret

I expect it to work for all values of n.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't really thing glib offhand comments made without qualifying the OP request are eloquent or actually even useful.

 

So now it has been established by NickB that the shifts come from a small subset {2, 6, 7, 11, 13,17, 18, 19, 22, 25} and that he is enjoying the challenge of doing it on a Mega328.

 

As Brian, David and Sparrow have pointed out it is now obvious that unrolling loops and/or decisions is going to the fastest way in terms of clock cycles.  (Using an ARM with a 32bit barrel shifter would not be an enjoyable challenge.)

 

I would like to ask NickB (OP) a few more questions.

 

1, Same as others have asked - is the distribution even

2, Is it advantageous to have one of the shifts faster than others

3, Is the subset of numbers to shift by ever likely to change

4, Are we allowed to use another register pair

5, Can we trash r22:23

6, What is the big picture.  Why do you need to shift a 32 bit value by a strange subset of values.  Where does this subset come from.  What is happening upstream from this.  We may be able to come up with a solution that is even faster and maybe even more enjoyable.

 

The more info we know, the more fun we can make this.  At very least we can show once again that a human can usually beat a compiler but the economics of doing that is hardly ever worth the effort and you should just use the compiler.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm really grateful for so many of you offering your own approach to solve this, I think all the comments have been useful in moving forward to the best solution. 

 

So I'll provide as much information as I can think of:

 

1. The bigger picture is that I'm writing a hashing function in my spare time for fun and learning - what I've learnt already and a new way of thinking has already been useful in my day job!   

2. I will be reading a 32-bit value from SRAM into registers, creating 3 copies, rotating each copy by a different value n then XORing the 3 results of the rotates into a final value.

    This is done until all values stored in SRAM are read - around 256 to 512 bytes. 

3. As such the set of possible values for n {2, 6, 7, 11, 13,17, 18, 19, 22, 25} will not change.

4. The values can be divided into subsets that will always be performed together on the three copies {3, 7 18} {10, 17, 19} {6, 11, 25} {2, 13, 22}

   

    So for instance I will copy the 32-bit value into the registers make 3 copies of that value, rotate each copy by 3, 7 and 18 bits respectively,

    then XOR the three results from the rotates.

    The numbers 3, 7 and 18 will not need to be used again until the next 32-bit value is read from SRAM

 

5. The registers in the original code I posted in #1 were arbitrarily chosen so you can trash R22:R23, however in my head I had set out the registers like so:

 

Set1:  R3   | R2   | R1   | R0      -  Buffer for storing 32-bit value from SRAM

Set2:  R7   | R6   | R5   | R4      

Set3:  R11 | R10 | R9   | R8     

Set4:  R15 | R14 | R13 | R12

 

My main reason for storing the 32-bit value in the R3...R0 registers is to have it persist and save reading it from SRAM multiple times to create the copies - as far as I can tell two MOVW's are faster than four LD's.

 

I could either have one routine that uses only one register set and caters for all possible values of n.

 

Or I could either make the three copies in different register sets and let each register set have their own inline routine with it's own n value e.g.

 

R3.....R0     Routine for rotating by 3,    10,   6,      2 bits 

R7.....R4     Routine for rotating by 7,    17,   11,    13 bits

R11...R8     Routine for rotating by 18,  19,    25,    22 bits

 

I'm favouring the latter unless any of you can suggest reasons why the former may be better? 

 

As you've all mentioned above it looks like unrolling loops and combining rotates with byte switching is going to be the fastest way of accomplishing it. I'll try writing something taking all the suggestions above into account and see what I come up with.

 

Once again thank you for all your responses and I hope you're finding solving this as interesting as I am! 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm confused why don't

{2, 6, 7, 11, 13,17, 18, 19, 22, 25}

match

{3, 7 18} {10, 17, 19} {6, 11, 25} {2, 13, 22}

?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry, my mistake it should read

{7 18} {17, 19} {6, 11, 25} {2, 13, 22}

That'll teach me not to respond at 2 in the morning!

The sets with only two values only require 2 copies and rotations

I meant to use {6, 11, 25} in the example just below that section.

Last Edited: Mon. Feb 12, 2018 - 07:42 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A combination of byte rotates along with left and right bit rotates would speed things up vs just doing bit rotates.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Then it's just a matter of align the bits correct, no need for byte move, that can be handled with who gets XOR'ed.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

And because it's so many times, I guess that some LUT's could speed things up.

 

You still need to answer :

 

Do you want max speed in avg. or worst case? 

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm a little confused about the desired numbers,

but it is clear that they will be known at compile time.

Optimizing for each value of n might be practical.

More importantly, one can reduce the effective value of n.

 

As noted by others, byte rotates can be taken care of in the xor implementation.

For rotates less than eight:

rot(s1, V1) xor rot(s2, V2) = rot(s1, V1 xor rot(s2-s2, V2))

requires at most 5 total shifts

rot(s1, V1) xor rot(s2, V2) xor rot(s3, V3) =
rot(s1, V1) xor rot(s2, V2 xor rot(s3-s2, V3)) =
rot(s1, V1 xor rot(s2-s1, V2 xor rot(s3-s2, V3)))

requires at most 6 shifts

 

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I see that mister fake news has made an update in #29

therefore on every shift you initially save one cycle - in total 3, and add two at the end. Adding these two more cycles I mentioned you end up one cycle longer than the variant presented many posts before.

 

What this website has become; like full of fake news. As a beginner looking for help, you ask, and expect to get help. You need in fact to outsmart the answers here in order not to fall in a trap, which kinda defeats the purpose of asking.

 

Edit: Oops, that lsr... Note that r24 would likely need a push and pop which is 4 cycles.

I assume that he hasn't done much AVR ASM (but probably on cisc micros), at least r24 and r25 (only non pointer reg. pair that can do 16bit oprations) should never hold any long lasting data, so push and pop would be stupid!

(OP use r22,r23,r24,25 in #1)

 

I guess that I have to add that I don't expect this to be code in a ISR then it would be an other matter.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

at least r24 and r25 (only non pointer reg. pair that can do 16bit oprations) should never hold any long lasting data (I intentionally left the remaining text out)

1. Except a 16 bit loop counter where this routine would (most likely) be called from. Not to think about a double (nested) loop where you may not even have enough 16 bit register pairs available.

2. Since this started as a generic 32 bit rotate routine, I believe some people using compilers may find it useful to add to their code (in asm). Using a compiler I think you have less control of the register usage.

sparrow2 wrote:

(OP use r22,r23,r24,25 in #1)

Because he used it as a 16bit counter. Once it was replaced with an 8 bit one, r22 have been used.

 

Leaving the implementation aside, can we agree that the routine in #28 uses more hardware resources that the one would be built from the one in #14 ? I do find the one using the T bit clever.

 

@sparrow

I do have a nickname here and you can use it. While I did criticize a general trend that sometime, some people are rush to post things before making sure it is right (including me), with reference to one of your posts, It was a comment in general (there were other misleading posts before yours). However, I did not name anyone in particular let alone to do the way you did.

 

sparrow2 wrote:

I assume that he ..

I got it. Too low to comment on it.

 

@moderator(s)

May I have that appellative edited to a non mocking form or deleted ?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

None of OP's examples require shifting more than 5 total positions.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

Last Edited: Tue. Feb 13, 2018 - 03:04 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

angelu wrote:
@moderator(s) May I have that appellative edited to a non mocking form or deleted ?
No, because when I search this thread for "fake news" to see where this school yard argument started I see it was you who started it in #29 apparently insulting the quality of sparrow2's response. So I would suggest he has every right to feel aggrieved. If you were a regular here you would know that sparrow2 rather famously posts some of the very best Asm advice we see here and does so on a regular basis so I would be a bit slower to insult in future if I were you.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Oooooh! That's harsh!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Update about the 3 shifts the code in #28 seems to work (here is a copy but with other registers (there is a reason))

	lsr r5
	ror r4
	ror r3
	ror r2
	ror r24
	lsr r5
	ror r4
	ror r3
	ror r2
	ror r24
	lsr r5
	ror r4
	ror r3
	ror r2
	ror r24
	andi r24,0xe0
	or r5,r24

And it take 17 clk (it's 1 clk faster than others)

But then I took a look at my wrong code in #22 and found out that it's actually faster still to use it (parts of) and then make a correction at the end.

    lsr r5; shift a 0 in
    ror r4
    ror r3
    ror r2
    ror r5
    ror r4
    ror r3
    ror r2
    ror r5
    ror r4
    ror r3
    ror r2
    ror r5 ;top 3 bit now correct
    mov r24,r5
    andi r24,0x1f
    adc r5,r24 ; shift low 5 bit 1 up, get C in aswell

This is 16 clk

 

Now using MUL (and that was why I changed reg. numbers)

; MUL with 32 is the same a shift 3 bit down
    ldi  r24,32
    mul  r3,r24
    movw r6,r0
    mul  r5,r24
    movw r8,r0
    mul  r2,r24
    or   r9,r0
    or   r6,r1
    mul  r4,r24
    or   r7,r0
    or   r8,r1
    

This is 15 clk

And the benefit is that the org number don't change. And it's free to shift either 3,11,19,27 it's just the way the bytes get's moved around.

 

Now for op's problem I guess it's faster just to make xor direct from r1:r0 (double amount of xor but no moves (first of the 3 numbers need to be moved) ).

 

It's untested but I'm sure that the idea is correct.

Last Edited: Tue. Feb 13, 2018 - 06:17 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

With a spare register, an in-place shift of n positions

in either direction can be done in 1+5n cycles.

Adding in 4 xor's makes it 5+5n cycles.

 

With MUL, an out-of-place shift can be done in 15 cycles.

XORing the target only costs 2 more cycles: 17 cycles.

 

In either case, MUL is a time win if n is 3 or 4.

n need not be larger than 4.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I will just add:

If you have a zero register, you save one clk when shifting left (use ADC with 0 to get C into LSB).

 

So now we are close to have a number for OP's  {6, 11, 25}

 

do 11 with MUL = 15 clk

do  6 with MUL but no mov(w) and eor direct into the result for 11 = 17 clk (1 ldi 4 mul 8 eor)

do 25 in place and eor = 10 clk (6 for shift 4 eor).

 

So about 42 clk. For a uses of 11 registers (r1:r0  2*32 bit plus 1 high reg.)

If we preload 2 reg. from start , we are down at 40 clk. 

(do 6 with real shift will cost the same 11 clk + 4 clk eor + 2 clk for the extra copy of the number)

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

As you only need ROT for specific offsets, you can hand-optimize for these offsets.
 
However, as you only need it for specific offsets, the question arises: what for. If it's some checksum like CRC, then it might be a good idea to have more global perspective and whether you actually need rotation or some other adjustments of the algorithm brings best performance to you.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I get 36 cycles for {6, 11, 25}

'Tis equivalent to {1, 3, -2}

  1: Make a zero register.

11: Rotate copyA two (3-1) places right (ADC doesn't help)

 4: copyA ^= rearange_bytes(copyB)

 6: Rotate copyA one place right (ADC doesn't help)

10: Rotate copyB two places left  (ADC helps)

 4: copyA ^= rearrange_bytes(copyB)

 

Edit: got confused about left=high and right=low

Edit: hadn't realized all three were the same

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

Last Edited: Mon. Feb 19, 2018 - 03:29 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

for #48  read #33 it take out the guesses.

 

for #49

now we start to need the bigger picture (see #20), your code expect 3 copy's of the number , my code only use the number (1 copy).

And because we know that OP works on block's of code, I don't think that you should count "make a zero register" as one clk (it's a part of init).

 

For left rotations you can do it in 5 clk aswell if you have a 0x7f  register!

 

move MSB into carry can be done with:

 

CP _127,r_highest_byte

 

and then 4 times ADC (ROL) 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
for #49

now we start to need the bigger picture (see #20), your code expect 3 copy's of the number , my code only use the number (1 copy).

Even my original version only requied two at a time.
Quote:
For left rotations you can do it in 5 clk aswell if you have a 0x7f  register!

 

move MSB into carry can be done with:

 

CP _127,r_highest_byte

 

and then 4 times ADC (ROL)

Contrary to my previous algorithm, left rotates are also what ADC is good for.

Neither help right rotates.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?