## fast bit mask from a number

21 posts / 0 new
Author
Message

Ok I'm still playing with the AVR emulator on a AVR, and I want a fast way to make a bit mask.
So I want to make this mapping :
xxxxx000 -> 00000001
xxxxx001 -> 00000010
xxxxx010 -> 00000100
xxxxx011 -> 00001000
xxxxx100 -> 00010000
xxxxx101 -> 00100000
xxxxx110 -> 01000000
xxxxx111 -> 10000000

The best code I can think of is this:

R0 holds a number 0..7 (best if high 5 bits are don't care)
R17 get's the result

ldi r17,1
sbrc r0,1
ldi r17,4
sbrc r0,0
sbrc r0,2
swap r17

Any better way ?

Jens

table lookup?

hehehe, There was a "fastest" thread about this a while back. And the solution you have above is exactly what Chuck and myself arrived at (only the registers used are different, and you used add r17,r17 instead of shl r17 which has the exact same effect). It is likely as optimal as you'll ever get. It's already faster than a table lookup, if you do all the necessary checking. (still faster even if you don't, in most cases)

If you ask me this is the fastest, most elegant, and most compact code to do this. You can shave off a few cycles with a table approach, but you have to cut corners to do it, and limiting the input to the 0-7 range is one of them.

https://www.avrfreaks.net/index.p...

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Wed. Dec 12, 2007 - 05:43 AM

That really depends on what you mean by "better" (faster, less room, more elegant).

Regards,
Steve A.

The Board helps those that help themselves.

in c
you would

r17 = 1<<r0;// shift left 1 r0 times

AVR has an LSL instruction but only shift by one place,
perhaps you could create a loop around the LSL?

Troy

to steve
Better means faster faster

Glitch
And that got me thinking, I have no problem to make a 8 byte RAM lookup (and that way I don't need to use Z as for LPM)
So If I save the lookup addr in R6 and r7 the code can be
MOVW X,r6
ADD XL,r16 ; no page change
LD R17,X

From 7 to 4 CLK

Jens
A dane in Seattle

Quote:
and you used add r17,r17 instead of shl r17 which has the exact same effect

Not only the same effect, but the same opcode. The assembler converts LSL r17 to ADD r17, r17.

Regards,
Steve A.

The Board helps those that help themselves.

As long as you you can GUARANTEE that the table is aligned on an 8 byte boundary AND that the value in R16 never exceeds 7.

Table alignment is easy. But guaranteeing the range of the value likely means you're masking it somewhere (min 1 cycle), so your 4 clocks is misleading. You're also not counting the loading of the table address into r6:r7 (min 2 cycles)

The original code posted needs no loads, and is not affected by values greater than 7 in r16.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Koshchi wrote:
Quote:
and you used add r17,r17 instead of shl r17 which has the exact same effect

Not only the same effect, but the same opcode. The assembler converts LSL r17 to ADD r17, r17.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Hi glitch
For the emulator I don't care for anything but speed when running so my init code will make the table and load R6 and R7 and they will never change (I have no problems giving away some low registers).
So for the BLD,BST,SBRC,SBRS instructions I only need to add a ANDI r16,7.
And for the BSET and BCLR it will be SWAP R16 and then ANDI r16,7
So yes it is 4 or 5 but still way better, it's importend to get the brances fast.

Jens

Well it's 5 cycles, because of the AND, so you're only saving 2 cycles. since you can leave out the andi altogether with the other code.

So if you really think 2 cycles is worth 2 registers and 8 bytes of ram, then by all means go that way. But don't kid yourself, you're only saving 2 cycles not 3.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Wed. Dec 12, 2007 - 07:04 AM

my mistake I was thinking 3 instructions 3 CLK and not 4.
And yes I know that it's not sounding of much but it's about 10% in speed for the emulator.
But for normal things I will still use the old rutine. I made a bit array lib back in 2000 that use it.(but that was on a org. 8515 without MOVW).

Jens

Actually thought of another missed cycle. For the branches you would need to copy the low byte, of the instruction, before hitting the mask & table code, because the byte is modified and it also contains part of the branch address. The initial code, does not modify anything, so no copy is required. Thus saving yet another register, and cycle in the overall process, vs the table method.

So your true savings are down to 1 cycle. Not very attractive if you ask me.

Now let's talk hard numbers here. That's 1 second longer runtime for every million branches, assuming you're AVR is running at 1MHz.

And for BSET & BCLR I would make a separate version of the bit# to bit routine working off the upper nibble instead, instead of using SWAP. Taking another cycle out of what you have to do for the table technique, making it at par in performance with the conditional branch instructions, and only 1 cycle longer for the skip instructions.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Now you are ruine my day, but your are rigth so i will not change anything.
But then I will save 2 clk by splitting the BRBS and BRBC that.(I need to do something).
or make a RAM page where the tabel is repeated 32 times so the 5 top bit are don't care. then the code is

ldi XH,page
mov XL,r16
ld r17,X

or even
ldi XH,page
ld r17,X

if I load then instruction low byte into XL and not R16

and only
ld r17,X

if I don't use XH for anything else.

Or I just do something else !!!!

Remember I just do this for fun. This reminds me of a unsigned MUL I made for a MEGA103 that takes between 26 and 28 clk. But use 512 byte of RAM! useless!!!

Jens

I realize you're doing this just for fun. Here's my take on how I thought an emulator might look. I avoided call/ret so I don't need to use loops, and jumps are faster. The emulator is essentially stackless, and ramless, except for the 32 bytes needed to hold the virtual registers. This leaves the stack & stack pointer free for direct use by the emulated code. The only time we would need stack is in an interrupt, at which point the emulator could temporarily use part of the VM's stack, as all we need it for is PC long enough to register the interrupt

some notes here:
Z is used for jump tables by the VM
X is used as the virtual PC
virt_SREG is some CPU register dedicaed to holding the virtual SREG contents
r16:r17 holds the current instruction
r15 holds the interrupt vector that needs to execute
the T bit of SREG is a flag indicating that an IRQ needs to be fired

```.org 0
rjmp RESET
rjmp virtIRQ1
.
.
.

; example of the VMs internal IRQ handler
virtIRQ1:
set ; mark an event
ldi R15, 0x01 ; load the interrupt vector
ret ; note we use ret and not reti, so that
; interrupts stay disabled
; until the virtual handler runs and
; finishes with its reti
.
.
.

RESET:
.
.
.
clr r15 ; select the virtual reset vect by default
em_IRQ:
clt ;  clear the virtual IRQ pending flag
push YH ; save SP
push YL
ldi YH, high(virtFlashBase)

fetch:
brts em_IRQ

post_IRQ_fetch: ; IRQs return here, to guarantee that
; at least one op gets executed after
; (aligned to 256 byte boundary)
ld  r16, X+ ; opl
ld  r17, X+ ; oph
mov  r30, r17
ijmp
; end of fetch

optable:
rjmp fetch ; entry 0x00, NOP
.
.
.
rjmp em_brbs ; entry 0xf0
rjmp em_brbs ; entry 0xf1
rjmp em_brbs ; entry 0xf2
rjmp em_brbs ; entry 0xf3
rjmp em_brbc ; entry 0xf4
rjmp em_brbc ; entry 0xf5
rjmp em_brbc ; entry 0xf6
rjmp em_brbc ; entry 0xf7
.
.
.
rjmp em_sbrs ; entry 0xff
; end of opcode table

; dispatch routines
em_brbc:
ldi  r18, 0x01
sbrc r16, 1
ldi  r18, 0x04
sbrc r16, 0
shl  r18
sbrc r16, 2
swap r18

and  r18, virt_SREG

brne em_brbc_done
lsr  r17 ; shift the offset into position
ror  r16
lsr  r17
ror  r16
andi r16, 0xfe ; its a word address, so clear the LSB
clr  r17
sbci r17, 0 ; sign extend it
em_brbc_done:
rjmp fetch ; go get the next opcode

em_brbs:
ldi  r18, 0x01
sbrc r16, 1
ldi  r18, 0x04
sbrc r16, 0
shl  r18
sbrc r16, 2
swap r18

and  r18, virt_SREG

breq em_brbs_done
lsr  r17 ; shift the offset into position
ror  r16
lsr  r17
ror  r16
andi r16, 0xfe ; it's a word offset, so LSB=0 always
clr  r17 ; promote to 16bit
sbci r17, 0 ; sign extend it
em_brbs_done:
rjmp fetch ; go get the next opcode

em_reti:
pop YL
pop YH
sei ; re-enable IRQs
rjmp post_IRQ_fetch

```

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Wed. Dec 12, 2007 - 09:13 AM

ok this is my code, not to many comments, but it works
and with the emulated code placed in flash 0x0300
The ldi don't work (make a nop)
the speed is
NOP 16 clk
MOV 29 clk
INC 39 clk
brance not taken 25 clk
brance taken 38

R13 and R14 is used to save emulated and real SREG

```.org 0000
.include "m88def.inc"

ldi		yh,0x06 ; y is PC*2 so emulated code start in 0x0300
ldi 	yl,0x0
ldi 	zh,1	; Z point at the rjmp tabel
ldi		xl,0	; X point at 1 RAM page r0 is first
ldi		xh,1	; addr at that page
rjmp	top
; this is a test
; the code run from flash
topmish:
dec		zh
top:
;push 	zh
;mov		r0,zh
movw	z,y
lpm		r16,z+
lpm		r30,z
;pop		zh
;mov		zl,r0
;mov		zh,r0
ldi		zh,1
ijmp

; jump tabel
.org 0x100
rjmp nopinst	;0000 0000
rjmp nopinst	;0000 0001
rjmp nopinst	;0000 0010
rjmp nopinst	;0000 0011
rjmp nopinst	;0000 0100
rjmp nopinst	;0000 0101
rjmp nopinst	;0000 0110
rjmp nopinst	;0000 0111
rjmp nopinst	;0000 1000
rjmp nopinst	;0000 1001
rjmp nopinst	;0000 1010
rjmp nopinst	;0000 1011
rjmp nopinst	;0000 1100
rjmp nopinst	;0000 1101
rjmp nopinst	;0000 1110
rjmp nopinst	;0000 1111
rjmp nopinst	;0001 0000
rjmp nopinst	;0001 0001
rjmp nopinst	;0001 0010
rjmp nopinst	;0001 0011
rjmp nopinst	;0001 0100
rjmp nopinst	;0001 0101
rjmp nopinst	;0001 0110
rjmp nopinst	;0001 0111
rjmp nopinst	;0001 1000
rjmp nopinst	;0001 1001
rjmp nopinst	;0001 1010
rjmp nopinst	;0001 1011
rjmp nopinst	;0001 1100
rjmp nopinst	;0001 1101
rjmp nopinst	;0001 1110
rjmp nopinst	;0001 1111
rjmp nopinst	;0010 0000
rjmp nopinst	;0010 0001
rjmp nopinst	;0010 0010
rjmp nopinst	;0010 0011
rjmp nopinst	;0010 0100
rjmp nopinst	;0010 0101
rjmp nopinst	;0010 0110
rjmp nopinst	;0010 0111
rjmp nopinst	;0010 1000
rjmp nopinst	;0010 1001
rjmp nopinst	;0010 1010
rjmp nopinst	;0010 1011
rjmp movrrinst	;0010 1100
rjmp movrrinst	;0010 1101
rjmp movrrinst	;0010 1110
rjmp movrrinst	;0010 1111
rjmp nopinst	;0011 0000
rjmp nopinst	;0011 0001
rjmp nopinst	;0011 0010
rjmp nopinst	;0011 0011
rjmp nopinst	;0011 0100
rjmp nopinst	;0011 0101
rjmp nopinst	;0011 0110
rjmp nopinst	;0011 0111
rjmp nopinst	;0011 1000
rjmp nopinst	;0011 1001
rjmp nopinst	;0011 1010
rjmp nopinst	;0011 1011
rjmp nopinst	;0011 1100
rjmp nopinst	;0011 1101
rjmp nopinst	;0011 1110
rjmp nopinst	;0011 1111
rjmp nopinst	;0100 0000
rjmp nopinst	;0100 0001
rjmp nopinst	;0100 0010
rjmp nopinst	;0100 0011
rjmp nopinst	;0100 0100
rjmp nopinst	;0100 0101
rjmp nopinst	;0100 0110
rjmp nopinst	;0100 0111
rjmp nopinst	;0100 1000
rjmp nopinst	;0100 1001
rjmp nopinst	;0100 1010
rjmp nopinst	;0100 1011
rjmp nopinst	;0100 1100
rjmp nopinst	;0100 1101
rjmp nopinst	;0100 1110
rjmp nopinst	;0100 1111
rjmp nopinst	;0101 0000
rjmp nopinst	;0101 0001
rjmp nopinst	;0101 0010
rjmp nopinst	;0101 0011
rjmp nopinst	;0101 0100
rjmp nopinst	;0101 0101
rjmp nopinst	;0101 0110
rjmp nopinst	;0101 0111
rjmp nopinst	;0101 1000
rjmp nopinst	;0101 1001
rjmp nopinst	;0101 1010
rjmp nopinst	;0101 1011
rjmp nopinst	;0101 1100
rjmp nopinst	;0101 1101
rjmp nopinst	;0101 1110
rjmp nopinst	;0101 1111
rjmp nopinst	;0110 0000
rjmp nopinst	;0110 0001
rjmp nopinst	;0110 0010
rjmp nopinst	;0110 0011
rjmp nopinst	;0110 0100
rjmp nopinst	;0110 0101
rjmp nopinst	;0110 0110
rjmp nopinst	;0110 0111
rjmp nopinst	;0110 1000
rjmp nopinst	;0110 1001
rjmp nopinst	;0110 1010
rjmp nopinst	;0110 1011
rjmp nopinst	;0110 1100
rjmp nopinst	;0110 1101
rjmp nopinst	;0110 1110
rjmp nopinst	;0110 1111
rjmp andiinst	;0111 0000
rjmp andiinst	;0111 0001
rjmp andiinst	;0111 0010
rjmp andiinst	;0111 0011
rjmp andiinst	;0111 0100
rjmp andiinst	;0111 0101
rjmp andiinst	;0111 0110
rjmp andiinst	;0111 0111
rjmp andiinst	;0111 1000
rjmp andiinst	;0111 1001
rjmp andiinst	;0111 1010
rjmp andiinst	;0111 1011
rjmp andiinst	;0111 1100
rjmp andiinst	;0111 1101
rjmp andiinst	;0111 1110
rjmp andiinst	;0111 1111
rjmp nopinst	;1000 0000
rjmp nopinst	;1000 0001
rjmp nopinst	;1000 0010
rjmp nopinst	;1000 0011
rjmp nopinst	;1000 0100
rjmp nopinst	;1000 0101
rjmp nopinst	;1000 0110
rjmp nopinst	;1000 0111
rjmp nopinst	;1000 1000
rjmp nopinst	;1000 1001
rjmp nopinst	;1000 1010
rjmp nopinst	;1000 1011
rjmp nopinst	;1000 1100
rjmp nopinst	;1000 1101
rjmp nopinst	;1000 1110
rjmp nopinst	;1000 1111
rjmp nopinst	;1001 0000
rjmp nopinst	;1001 0001
rjmp nopinst	;1001 0010
rjmp nopinst	;1001 0011
rjmp mishinst	;1001 0100
rjmp mishinst	;1001 0101
rjmp nopinst	;1001 0110
rjmp nopinst	;1001 0111
rjmp nopinst	;1001 1000
rjmp nopinst	;1001 1001
rjmp nopinst	;1001 1010
rjmp nopinst	;1001 1011
rjmp nopinst	;1001 1100
rjmp nopinst	;1001 1101
rjmp nopinst	;1001 1110
rjmp nopinst	;1001 1111
rjmp nopinst	;1010 0000
rjmp nopinst	;1010 0001
rjmp nopinst	;1010 0010
rjmp nopinst	;1010 0011
rjmp nopinst	;1010 0100
rjmp nopinst	;1010 0101
rjmp nopinst	;1010 0110
rjmp nopinst	;1010 0111
rjmp nopinst	;1010 1000
rjmp nopinst	;1010 1001
rjmp nopinst	;1010 1010
rjmp nopinst	;1010 1011
rjmp nopinst	;1010 1100
rjmp nopinst	;1010 1101
rjmp nopinst	;1010 1110
rjmp nopinst	;1010 1111
rjmp nopinst	;1011 0000
rjmp nopinst	;1011 0001
rjmp nopinst	;1011 0010
rjmp nopinst	;1011 0011
rjmp nopinst	;1011 0100
rjmp nopinst	;1011 0101
rjmp nopinst	;1011 0110
rjmp nopinst	;1011 0111
rjmp nopinst	;1011 1000
rjmp nopinst	;1011 1001
rjmp nopinst	;1011 1010
rjmp nopinst	;1011 1011
rjmp nopinst	;1011 1100
rjmp nopinst	;1011 1101
rjmp nopinst	;1011 1110
rjmp nopinst	;1011 1111
rjmp rjmpinst	;1100 0000
rjmp rjmpinst	;1100 0001
rjmp rjmpinst	;1100 0010
rjmp rjmpinst	;1100 0011
rjmp rjmpinst	;1100 0100
rjmp rjmpinst	;1100 0101
rjmp rjmpinst	;1100 0110
rjmp rjmpinst	;1100 0111
rjmp rjmpinst	;1100 1000
rjmp rjmpinst	;1100 1001
rjmp rjmpinst	;1100 1010
rjmp rjmpinst	;1100 1011
rjmp rjmpinst	;1100 1100
rjmp rjmpinst	;1100 1101
rjmp rjmpinst	;1100 1110
rjmp rjmpinst	;1100 1111
rjmp nopinst	;1101 0000
rjmp nopinst	;1101 0001
rjmp nopinst	;1101 0010
rjmp nopinst	;1101 0011
rjmp nopinst	;1101 0100
rjmp nopinst	;1101 0101
rjmp nopinst	;1101 0110
rjmp nopinst	;1101 0111
rjmp nopinst	;1101 1000
rjmp nopinst	;1101 1001
rjmp nopinst	;1101 1010
rjmp nopinst	;1101 1011
rjmp nopinst	;1101 1100
rjmp nopinst	;1101 1101
rjmp nopinst	;1101 1110
rjmp nopinst	;1101 1111
rjmp nopinst	;1110 0000
rjmp nopinst	;1110 0001
rjmp nopinst	;1110 0010
rjmp nopinst	;1110 0011
rjmp nopinst	;1110 0100
rjmp nopinst	;1110 0101
rjmp nopinst	;1110 0110
rjmp nopinst	;1110 0111
rjmp nopinst	;1110 1000
rjmp nopinst	;1110 1001
rjmp nopinst	;1110 1010
rjmp nopinst	;1110 1011
rjmp nopinst	;1110 1100
rjmp nopinst	;1110 1101
rjmp nopinst	;1110 1110
rjmp nopinst	;1110 1111
rjmp nopinst	;1111 0000
rjmp nopinst	;1111 0001
rjmp nopinst	;1111 0010
rjmp nopinst	;1111 0011
rjmp brbcinst	;1111 0100
rjmp brbcinst	;1111 0101
rjmp brbcinst	;1111 0110
rjmp brbcinst	;1111 0111
rjmp nopinst	;1111 1000
rjmp nopinst	;1111 1001
rjmp nopinst	;1111 1010
rjmp nopinst	;1111 1011
rjmp nopinst	;1111 1100
rjmp nopinst	;1111 1101
rjmp nopinst	;1111 1110
rjmp nopinst	;1111 1111
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 incinst
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish
rjmp	 topmish

.org 0x400
brbcinst:
ldi		r17,1
sbrc	r16,1
ldi		r17,4
sbrc	r16,0
sbrc	r16,2
swap	r17
and		r17,r14
breq	brbcinst1
rjmp	top
brbcinst1:
lsr		zl
ror		r16
lsr		zl
ror		r16
asr		r16
ldi		r17,0
sbrc	r16,6
ldi		r17,0xff
add		yl,r16 ; mul by 2 because Y=2*PC
rjmp	top

incinst:
mov		xl,r16
swap	xl
andi	xl,0x0f
sbrc	r12,0
subi	xl,-16
ld		r17,x
in 		r13,sreg
out 	sreg,r14
inc 	r17
in 		r14,sreg
out 	sreg,r13
st		x,r17
rjmp	topmish

mishinst:
inc		zh
mov		r12,zl
mov		zl,r16
andi	zl,0x0f
ijmp

nopinst:
rjmp	 top

movrrinst:
mov		xl,r16
andi	xl,0x0f
sbrc	zl,1
subi	xl,-16
ld		r17,x
mov		xl,r16
swap	xl
andi	xl,0x0f
sbrc	zl,0
subi	xl,-16
st		x,r17
rjmp 	top

rjmpinst:
andi	xl,0x0f
sbrc	xl,3
ori		xl,0xf0
add		yl,r16 ; mul by 2 because Y=2*PC
rjmp 	top

andiinst:
;opcode 0111 KKKK dddd KKKK
swap	r16
mov		xl,r16
andi 	xl,0x0f
subi 	xl,-16
ld 		r17,x
andi 	r16,0xf0
andi 	r30,0x0f
or 		r16,r30
swap 	r16
in 		r13,sreg
out 	sreg,r14
and 	r17,r16
in 		r14,sreg
out 	sreg,r13
st 		x,r17
rjmp top

;test code start this code get's emulated
.org 0x300
;q:
ldi r16,0
mov	r1,r16
q:
inc	r1
brne q
andi r16,0x80
mov r10,r20
rjmp w
mov r11,r10
w:
mov r1,r2
mov r10,r20
rjmp q
```

delete of 2. copy

Last Edited: Wed. Dec 12, 2007 - 09:17 AM

So as you can see I have to make a double IJMP for codes that start with 1001 010x xxxx xxxx that is push com neg swap inc asr lsr ror dec .... + call ret etc that only have 1 opcode and that slow it down. but I only make a 4 bit lookup.
So far no use of stack at all!!!

I only emulate from flash because it's so easy to write the code that way, Thats why my load code is a bit a little bad, it should come from RAM or eeprom.

Jens

Take a look at the code fragment I posted above, it might give you some ideas, though it looks like we're pretty close.

By my calculation a branch should run 21/31 if executing from RAM, and 22/32 from flash. And it includes support for interrupts.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Yes is is close, and I can see that I should restructure some of the things, and as you can see it is a "working code" not clean. But this time I think that you count a bit wrong, how can you change two LD to two LPM in one clk ?

Have fun
Jens
I'm done for today.