## General C: Determine MSB position in 32 Bit variable


Hi,
do you have an idea how to quickly determine the MSB in a 32 bit variable
without masking each bit one after the other?

Examples:
0x1234 -> MSB = 12
0x2345 -> MSB = 13
0x4567 -> MSB = 14

Sebastian

`union {}`

Hello,

Maybe you are looking for a function like ffsl(), declared in <string.h>.

I don't know how optimal it is, but you can study the generated code or download the avr-libc sources. Don't expect any kind of miracle: the AVR architecture has no instruction to get the position of the last non-zero bit.
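One caution, though: the ffs family finds the position of the *least* significant set bit (1-based, with 0 for a zero input), which is the opposite end from what the original question asks for. A portable sketch of those semantics for comparison (the helper name my_ffsl is mine, not a library function):

```c
#include <stdint.h>

/* 1-based position of the LEAST significant set bit, 0 for value == 0,
   mirroring the documented behaviour of ffsl(). */
int my_ffsl(uint32_t value)
{
    int pos = 1;
    if (value == 0)
        return 0;
    while ((value & 1u) == 0) {  /* shift until the lowest set bit reaches bit 0 */
        value >>= 1;
        pos++;
    }
    return pos;
}
```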

Bye. Carlos.

You could do a successive approximation. AND with 0xffff0000. If the answer is 0, AND with 0xffffff00, else AND with 0xff000000. I'll have to think about how to generate the masks on the fly though.

You could also check each byte against 0 starting with the high byte. That would reduce shifting to a max of 8, or successive approximation on a single byte.

Regards,
Steve A.

The Board helps those that help themselves.

clawson is correct that a union is the fastest way to do this.

[Moderator hat]
And since this is a general C question, I'm moving this thread from the AVR GCC Forum to the AVR Forum.

Does MSB mean Most Significant Bit or Byte? You said 'in a 32 bit variable', but your examples used 16 bit values?

Imagecraft compiler user

Union is not as easy...

``` union {
uint32_t  dword;
uint8_t   byte[4];
} msb;
```

Now, if we have a LITTLE ENDIAN device, the MOST SIGNIFICANT BYTE is msb.byte[3]. If it is BIG ENDIAN, then it is msb.byte[0]. So it is not a general C question, but a device-specific one.
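As a host-side illustration of that point, a small sketch (my own, not from the thread) that reports which array index holds the most significant byte on the machine it runs on:

```c
#include <stdint.h>

/* Returns the array index that holds the MOST significant byte of a
   uint32_t on this machine: 3 on a little-endian target, 0 on a
   big-endian one. */
int msbyte_index(void)
{
    union {
        uint32_t dword;
        uint8_t  byte[4];
    } u;

    u.dword = 0x12345678u;               /* 0x12 is the most significant byte */
    return (u.byte[0] == 0x78) ? 3 : 0;  /* low byte first => little endian */
}
```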

Hello Sebastian,

what do you want to do? Determine the position of the MSB (like you said in the topic) or the value of the MSB?
Another way would be to interpret the value as a signed 32 bit integer and check if it is negative (i.e. MSB set). But I think this is also a trick like the union trick, which works on nearly all systems but is not guaranteed to work on every machine by the C standard...
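The sign test mentioned there can be written directly; note it only tells you whether bit 31 is set, and it relies on two's-complement representation (true on AVR and essentially every current target). A minimal sketch:

```c
#include <stdint.h>

/* Reinterpreting the value as signed makes bit 31 the sign bit, so
   "negative" means "bit 31 set". Assumes two's complement.
   Returns nonzero if bit 31 of value is set. */
int bit31_set(uint32_t value)
{
    return (int32_t)value < 0;
}
```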

Michael

P.S: Your example is not clear for me.

Shifting the Most Significant Bit of a 32 bit variable to the right would look like this:
0x80000000
0x40000000
0x20000000
0x10000000
0x08000000
...
0x00000001

In the beginning was the Word, and the Word was with God, and the Word was God.

Quote:
So it is not general C, but device specific question.

But he is not asking which is the most significant byte. He is asking how to find out what is the highest bit that contains a 1.

Regards,
Steve A.

The Board helps those that help themselves.

```
nzbit = 0;
unsigned long msk = 0x00000001;
for (i = 0; i < 32; i++) {
    if ((n & msk) != 0) nzbit = i;
    msk <<= 1;
}
```

Imagecraft compiler user

I think Steve's idea sounds interesting.
Checking each byte against not zero (starting with byte3)
and then right shift of the first non zero byte.

I'm going to check this against the ffsl() function.

I was quite unsure where to post this topic since this is a general C-question I want
to implement with GCC. So next time I will know better!

Regards
Sebastian

Hi,

the ffsl() function fails since it only finds the LSB, and it also consumes too many cycles.
But Steve's idea works fine:

```
uint8_t get_msb (uint32_t value)
{
    union byte_access_t
    {
        uint32_t dword;
        uint8_t  byte[4];
    } byteAccess;
    uint8_t msb;
    uint8_t byte;

    byteAccess.dword = value;

    if (byteAccess.byte[3] > 0)
    {
        byte = byteAccess.byte[3];
        msb = 24;
    }
    else if (byteAccess.byte[2] > 0)
    {
        byte = byteAccess.byte[2];
        msb = 16;
    }
    else if (byteAccess.byte[1] > 0)
    {
        byte = byteAccess.byte[1];
        msb = 8;
    }
    else if (byteAccess.byte[0] > 0)
    {
        byte = byteAccess.byte[0];
        msb = 0;
    }
    else
    {
        return 0;
    }

    do
    {
        msb++;
        byte /= 2;
    }
    while (byte > 0);

    return msb;
}
```

I've tested this function with some random numbers in AVR Studio
Here are some execution samples (cycle times include: parameter handover,
function execution, and storing return value in register):

0x000041A7 -> 49 cycles
0x10D63AF1 -> 35 cycles
0x60B7ACD9 -> 43 cycles
0x00FC4111 -> 50 cycles

The code was compiled with GCC 3.4.6 with the -s option for an ATmega8.
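For anyone who wants a quick functional check of the byte-select-then-shift idea on a PC (functional only, not cycle-accurate), here is a standalone sketch that mirrors the function above, using mask tests instead of the union so it runs on any host, plus a naive reference scan to compare against:

```c
#include <stdint.h>

/* Byte-select then shift: returns 1..32 for the highest set bit,
   0 if value == 0. */
uint8_t get_msb_check(uint32_t value)
{
    uint8_t msb, byte;

    if      (value & 0xFF000000u) { byte = (uint8_t)(value >> 24); msb = 24; }
    else if (value & 0x00FF0000u) { byte = (uint8_t)(value >> 16); msb = 16; }
    else if (value & 0x0000FF00u) { byte = (uint8_t)(value >> 8);  msb = 8;  }
    else if (value != 0)          { byte = (uint8_t)value;         msb = 0;  }
    else                          { return 0; }

    do { msb++; byte >>= 1; } while (byte > 0);
    return msb;
}

/* Naive reference: scan from bit 31 downward. */
uint8_t ref_msb(uint32_t value)
{
    int i;
    for (i = 31; i >= 0; i--)
        if (value & (1ul << i))
            return (uint8_t)(i + 1);
    return 0;
}
```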

Regards
Sebastian

Here's my size optimized version:

```
#include <stdint.h>

int8_t FindMSB( uint32_t value )
// returns -1, if value == 0
{
    int8_t BitIndex = 31;
    do {
        if( (int32_t) value < 0 ) break;
        value <<= 1;
    } while( --BitIndex >= 0 );

    return BitIndex;
}
```

(14 instructions)

--Brian

Nice work Brian!
Could you even make your code faster than mine???
Since I need speed not small size.
Regards
Sebastian

Quote:

Since I need speed not small size

For speed, I think that a check on each byte for empty would end up making for a faster solution as you outlined. Would the high bit be evenly distributed?

30 to 50 cycles doesn't seem bad. Probably a few could be gained with ASM, and if the low half tends to have the MSb, then some tricks with ADIW might help with a couple cycles.

30-50 cycles is 4-6us @8MHz. Plus the function call overhead? Still doesn't sound too bad. What are the constraints? What goal are you trying to achieve?

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Another way:

```
#include <stdint.h>

uint8_t get_msb (uint32_t value)
{
    union byte_access_t
    {
        uint32_t b32;
        uint16_t b16[2];
        uint8_t  b8[4];
    } v;

    uint8_t msb = 0;

    v.b32 = value;

    if( v.b16[1] ){
        msb = 16;
        v.b16[0] = v.b16[1];
    }

    if( v.b8[1] ){
        msb += 8;
        v.b8[0] = v.b8[1];
    }

    if( v.b8[0] > 0x0F ){
        msb += 4;
        v.b8[0] >>= 4;
    }

    if( v.b8[0] > 0x03 ){
        msb += 2;
        v.b8[0] >>= 2;
    }

    if( v.b8[0] > 0x01 ){
        msb += 1;
        v.b8[0] >>= 1;
    }

    msb += v.b8[0];

    return msb; // 0, 1, ... 32
}
```

It needs 21..28 cycles without call & return.

Peter

Here's my pass... I haven't run it through a simulator. It basically uses binary subdivision to find the bit.

```
// returns 0 for no bit found, otherwise 1-32 for the bit position
// of the msb found

uint8_t find_msb(uint32_t value) {
    uint16_t msw;
    uint8_t  msb;
    uint8_t  bit_index;

    bit_index = 0;

    // Binary split for the first 2 passes
    // the compiler should optimize the shifts to simple byte moves

    if(value > 0x0000ffffl) {
        bit_index = 16;
        msw = value >> 16;
    } else msw = value;

    if(msw > 0x00ff) {
        bit_index += 8;
        msb = msw >> 8;
    } else msb = msw;

    // at this point we are at sub-byte shifts, which
    // are not optimal for multi-bit moves using shifting,
    // so we want to use multiplication instead, and
    // basically shifting left instead of right now.

    // on parts that do not support the mul instruction, I would
    // suggest using Steve's scan method from this point forward

    bit_index += 8;

    if(msb < 0x10) {
        bit_index -= 4;
        msb *= 16;
    }

    if(msb < 0x40) {
        bit_index -= 2;
        msb *= 4;
    }

    if(msb < 0x80) {
        bit_index -= 1;
    }

    if(msb == 0) return 0;

    return bit_index;
}
```

Should result in somewhere around 25 cycles worst case.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

GREAT!!!!
Many thanks!
I'm going to check it out.

Regards
Sebastian

Note that Peter's and my solutions are essentially the same, except his uses a union and I do it algebraically. You might get a further reduction in cycles by combining some of my ideas with his. (He posted while I was composing mine; I didn't see his result till after.)

Specifically, by using the union you can get rid of the need for the 32 bit compare. The 16 bit compare is also more optimized. Then, once down to the final 8 bits, my method should be a bit quicker.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

You might take glitch's mask and shift method one step further since shifting by 4 should be optimized to a SWAP command (plus an AND). But the set up for it might negate any savings. You'll just have to try it and see.

Regards,
Steve A.

The Board helps those that help themselves.

Careful what y'all post. ;)

Quote:
Finding a significant bit in a computer data word
Document: United States Patent 6904518
Abstract: The most or least significant bit of a datum can be determined using parallel operations. This may result in faster location of the most or least significant bit without necessarily introducing more overhead in some embodiments.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

If size really doesn't matter, then a lookup table after the byte is determined will probably shave half the cycles and be more deterministic as well. Kudos to this thread:
http://c.ittoolbox.com/groups/te...
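For reference, the table-driven idea looks roughly like this in C (a sketch of mine; on an AVR the 256-entry table would be precomputed and placed in flash, indexed with LPM, whereas here it is simply built on first use):

```c
#include <stdint.h>

/* Returns 1..32 for the highest set bit of value, 0 if value == 0. */
uint8_t get_msb_lut(uint32_t value)
{
    static uint8_t table[256];   /* table[b] = MSB position of byte b, 0 for b == 0 */

    if (table[1] == 0) {         /* build the per-byte table on first use */
        int b;
        for (b = 1; b < 256; b++) {
            uint8_t m = 0, v = (uint8_t)b;
            while (v) { m++; v >>= 1; }
            table[b] = m;
        }
    }

    if (value & 0xFF000000u) return (uint8_t)(24 + table[value >> 24]);
    if (value & 0x00FF0000u) return (uint8_t)(16 + table[(value >> 16) & 0xFF]);
    if (value & 0x0000FF00u) return (uint8_t)( 8 + table[(value >>  8) & 0xFF]);
    return table[value & 0xFF];
}
```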

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Hey, hey guys not so fast!

I just checked the versions of "glitch" and "danni".
glitch: Your code wasn't compiled as expected, so I had to make some changes

- The execution cycles of find_msb are measured by Timer0 of ATmega8.
- A pseudo random value (32 Bit) is used to feed function find_msb
- The results are sent out via RS232 (115200-8-N-1) in CSV format
- Column 1 contains the pseudo random number
- Column 2 contains the calculated MSB
- Column 3 contains the measured execution cycles

The different find_msb versions can be selected by a compiler switch which is defined at the top of file find_msb.c

I already recorded the output of both find_msb versions and put the results
into an Excel file:
"danni"'s version won, with a maximum of 36 execution cycles.
The adjusted version of "glitch" was, with a maximum of 38 cycles, very fast too.

Look at the Excel sheet and the project files in the attachment if you're
interested.

Thanks again for all your help!
I'm going to check all your new input tomorrow.

Regards
Sebastian

## Attachment(s):

The cited patent has no applicability here.

A lookup table implementation will not be faster (worst case) than the solutions presented so far, as the LPM instruction takes three cycles, and setting up the "Z" register to use it takes two more. It might be faster on average, as it can discard more computations on small input numbers than the other methods described.

There's a bunch of stuff like this in 'snippets' on Simtel

Imagecraft compiler user

theusch wrote:
Careful what y'all post. ;)
Quote:
Finding a significant bit in a computer data word
Document: United States Patent 6904518
Abstract: The most or least significant bit of a datum can be determined using parallel operations. This may result in faster location of the most or least significant bit without necessarily introducing more overhead in some embodiments.

Lee

Perfect example of a software patent that shouldn't have been granted. The technique is simply using a binary sieve (a well-known technique) to find the MSB.

It's like granting a patent for using a sieve to strain peas instead of pasta.

The technique works well, and improves in performance with an increase in word size of the processor. However it has little advantage over other methods on an 8 bit processor.
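For the curious, one well-known word-parallel formulation (whether it is exactly what the patent claims, I won't judge) first smears the most significant set bit into every lower position and then counts the resulting ones; it pays off on wide registers far more than on an 8 bit AVR:

```c
#include <stdint.h>

/* Returns 1..32 for the highest set bit of v, 0 if v == 0. */
uint8_t get_msb_parallel(uint32_t v)
{
    /* Smear the most significant set bit into all lower positions,
       turning v into 2^n - 1 where n is the MSB position. */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    /* Count the ones with the classic parallel population count. */
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (uint8_t)((v * 0x01010101u) >> 24);
}
```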

If someone would like to time it, here is an implementation of the code:

```
#define MaskHi_16 0xffff0000l

// returns 0 for no bit found, otherwise 1-32 for the bit position
// of the msb found

uint8_t find_msb(uint32_t value) {
    uint8_t  result = 1;
    uint32_t value_s;

    if(value == 0) return 0;

    value_s = value >> 1;

    return result;
}
```

S-Sohn: What didn't compile as expected? Could you elaborate, please?

Here are a few variants you could try:

```
uint8_t find_msb(uint32_t value) {
    uint16_t msw;
    uint8_t  msb;
    uint8_t  bit_index;

    bit_index = 0;

    // Binary split for the first 2 passes
    // the compiler should optimize the shifts to simple byte moves

    if(value & 0xffff0000l) {
        bit_index = 16;
        msw = value >> 16;
    } else msw = value;

    if(msw & 0xff00) {
        bit_index += 8;
        msb = msw >> 8;
    } else msb = msw;

    // at this point we are at sub-byte shifts, which
    // are not optimal for multi-bit moves using shifting,
    // so we want to use multiplication instead, and
    // basically shifting left instead of right now.

    // on parts that do not support the mul instruction, I would
    // suggest using Steve's scan method from this point forward

    bit_index += 8;

    if(msb < 0x10) {
        bit_index -= 4;
        msb *= 16;
    }

    if(msb < 0x40) {
        bit_index -= 2;
        msb *= 4;
    }

    if(msb < 0x80) {
        bit_index -= 1;
    }

    if(msb == 0) return 0;

    return bit_index;
}
```

or what I think should be a bit more optimized

```
uint8_t find_msb(uint32_t value) {
    uint16_t msw;
    uint8_t  msb;
    uint8_t  bit_index;

    bit_index = 1;

    if(value & 0xffff0000l) {
        bit_index = 16;
        msw = value >> 16;
    } else msw = value;

    if(msw & 0xff00) {
        bit_index += 8;
        msb = msw >> 8;
    } else msb = msw;

    if(msb & 0xf0) {
        bit_index += 8;
        msb = (msb >> 4) | (msb << 4);
        msb &= 0x0f;
    }

    bit_index += 4;

    if(msb < 0x04) {
        bit_index -= 2;
        msb *= 4;
    }

    if(msb < 0x08) {
        bit_index -= 1;
    }

    if(msb == 0) return 0;
    return bit_index;
}
```

Finally one could optimize the sieve method above for an 8 bit processor:

```
#define MaskHi_4 0xf0

uint8_t find_msb(uint32_t value) {
    uint16_t msw;
    uint8_t  msb, msb_s;
    uint8_t  bit_index = 1;

    if(value & 0xffff0000l) {
        bit_index = 16;
        msw = value >> 16;
    } else msw = value;

    if(msw & 0xff00) {
        bit_index += 8;
        msb = msw >> 8;
    } else msb = msw;

    if(msb == 0) return 0;
    msb_s = msb >> 1;

    return bit_index;
}
```

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

OK, since we're cycle-counting, here's a version in assembly.

If each of the bits in the value you are checking has a 50% chance of being set, then 50% of the time the MSB will be the highest bit (32). In such a situation this code will average 20.4 cycles, ranging from 17 to 41 cycles.

If the bits are not random, and the MSB has an equal probability of being any of bits 31-0, or absent (when value = 0), then the average will be 30.8 cycles.

get_msb_asm.S:

```; c-prototype:  uint8_t get_msb_asm (uint32_t value)

get_msb_asm: .global get_msb_asm

; input:
;   value -> R22/23/24/25
;
; result:
;   R24/25 -> Bit Index [0..32] ( must be 16bit )
;
; working regs:
;   R23 -> Value of most significant BYTE.
;   R24 -> BaseIndex or BitIndex

CheckB3:
; if MSByte is not B3, Goto CheckB2
tst R25
breq CheckB2

; MSByte is B3
mov R23, R25       ; MSByte = B3
ldi R24, 32        ; BaseIndex = 32
clr R25            ; clear result.B1

FindBit:    ; MSByte _must_not_ equal zero

;if Bit7 set, return
sbrc R23, 7
ret

;if Bit6 set, goto Bit6 return
sbrc R23, 6
rjmp ReturnBit6

;if Bit5 set, goto Bit5 return
sbrc R23, 5
rjmp ReturnBit5

;if Bit4 set, goto Bit4 return
sbrc R23, 4
rjmp ReturnBit4

;if Bit3 set, goto Bit3 return
sbrc R23, 3
rjmp ReturnBit3

;if Bit2 set, goto Bit2 return
sbrc R23, 2
rjmp ReturnBit2

;if Bit1 CLEAR, goto _Bit0_ return
sbrs R23, 1
rjmp ReturnBit0

; Bit0 need not be checked, since MSByte is guaranteed to be != 0
;    sbrs R23, 0
;    rjmp Bit0

ReturnBit1: subi R24, 6        ;Return BaseIndex-6
ret

ReturnBit6: subi R24, 1        ;Return BaseIndex-1
ret
ReturnBit5: subi R24, 2        ;Return BaseIndex-2
ret
ReturnBit4: subi R24, 3        ;Return BaseIndex-3
ret
ReturnBit3: subi R24, 4        ;Return BaseIndex-4
ret
ReturnBit2: subi R24, 5        ;Return BaseIndex-5
ret
ReturnBit0: subi R24, 7        ;Return BaseIndex-7
ret

CheckB2:
; if MSByte is not B2, Goto CheckB1
tst R24
breq CheckB1

mov R23, R24       ; MSByte = B2
ldi R24, 24        ; BaseIndex = 24
rjmp FindBit

CheckB1:
; if MSByte is not B1, Goto CheckB0
tst R23
breq CheckB0
;    mov R23, R23       ; MSByte = B1
ldi R24, 16       ; BaseIndex = 16
rjmp FindBit

CheckB0:
; Assume MSByte will be B0
mov R23, R22      ; MSByte = B0
ldi R24, 8        ; BaseIndex = 8

; if MSByte is B0, Goto FindBit
tst R23
brne FindBit

; all bits are clear, return zero

;    clr R25          ; R25 is already zero
clr R24
ret

```

--Brian

Last Edited: Fri. Jun 2, 2006 - 05:17 AM

Here's my assembly version that uses a 256-byte lookup table.

I didn't plan for this, but it _always_ takes 25 cycles.

It could be reduced to 23 cycles if the lookup table is put on a 256-byte boundary.

get_msb_asm_table.S:

```#include

; uint8_t get_msb_asm_table (uint32_t value)

get_msb_asm_table: .global get_msb_asm_table

; input:
;   value -> R22/23/24/25
;   const_zero -> R1
;
; result:
;   R24/25 -> Bit Index [0..32] ( must be 16bit )
;
; working regs:
;   Z -> &msb_table
;   R24 -> BaseIndex or BitIndex

ldi ZH, hi8( msb_table )
ldi ZL, lo8( msb_table )    ;OPT: this can be removed, if msb_table is aligned by 256

CheckW1:
; if MSWord is not W1, goto CheckW0
breq CheckW0

; if MSByte is not B3, goto CheckB2
tst R25
breq UseB2

; MSByte is B3
add ZL, R25         ; Z += B3

lpm R24, Z          ; Get BitIndex for byte
subi R24, -24       ; BitIndex += 24
clr R25             ; Clear BitIndex.B1
ret

UseB2:
add ZL, R24         ; Z += B2

lpm R24, Z          ; Get BitIndex for byte
subi R24, -16       ; BitIndex += 16

; R25(B3) will always be zero
;    clr R25            ; Clear BitIndex.B1
ret

CheckW0:

; if MSByte is not B1, Goto UseB0
tst R23
breq UseB0

; MSByte is B1
add ZL, R23         ; Z += B1

lpm R24, Z          ; Get BitIndex for byte
subi R24, -8        ; BitIndex += 8

; R25(B3) will always be zero
;    clr R25            ; Clear BitIndex.B1
ret

UseB0:
; if MSByte is B0, Goto FindBit
; R22(B0) does not need to be tested for zero, the table lookup takes care of this
;    tst R22
;    breq ReturnZero

add ZL, R22         ; Z += B0

lpm R24, Z          ; Get BitIndex for byte
;    subi R24, 0        ; BitIndex += 0

; R25(B3) will always be zero
;    clr R25            ; Clear BitIndex.B1
ret

;----------------------------------------------------------------------------------
msb_table: .global msb_table
#include "msb_table.h"
```

--Brian

## Attachment(s):

Quote:

...the LPM instruction takes three cycles, and setting up the "Z" register ...

Did I >>say<< that the table should be in flash for optimal speed? ;) I thought size/resources were no object, only speed.

Quote:

It could be reduced to 23 cycles if the lookup table is put on a 256-byte boundary.

Yes, when I was noodling with the "fastest" one needs to place things at the best spot. In this case, a lookup table on a magic boundary may save an instruction or two.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Lee,

I've only worked with the Mega8, so I have a strong prejudice against jamming large tables into the sram.

On the other hand, if the table is placed into sram it can easily be placed at a 'magic' boundary.

So, I'll leave optimizing my lookup code down to 22 cycles and adding a sram table generator as a simple exercise for the reader... :P

--Brian

glitch:
The compiler doesn't recognize when it won't need to test the whole variable:

```    if (value & 0xffff0000l)
112:	dc 01       	movw	r26, r24
114:	cb 01       	movw	r24, r22
116:	80 70       	andi	r24, 0x00	; 0
118:	90 70       	andi	r25, 0x00	; 0
11a:	00 97       	sbiw	r24, 0x00	; 0
11c:	a1 05       	cpc	r26, r1
11e:	b1 05       	cpc	r27, r1
120:	31 f0       	breq	.+12     	; 0x12e ```

The same happens to your ?: operations, too.
I adjusted your 1st version in a way that gcc understands which operation is really neccessary:

```uint8_t find_msb (uint32_t value)
{
10c:	9b 01       	movw	r18, r22
10e:	ac 01       	movw	r20, r24
union my_uint32_t
{
uint32_t dword;
uint16_t word[2];
uint8_t byte[4];
} val;
uint8_t  bit_index;

val.dword = value;
bit_index = 0;
110:	90 e0       	ldi	r25, 0x00	; 0

// Binary split for the first 2 passes
// the compiler should optimize the shifts to simple byte moves

if (val.word[1] > 0)
112:	41 15       	cp	r20, r1
114:	51 05       	cpc	r21, r1
116:	11 f0       	breq	.+4      	; 0x11c
{
bit_index = 16;
118:	90 e1       	ldi	r25, 0x10	; 16
val.word[0] = val.word[1];
11a:	9a 01       	movw	r18, r20
}

if (val.byte[1] > 0)
11c:	33 23       	and	r19, r19
11e:	11 f0       	breq	.+4      	; 0x124
{
bit_index += 8;
120:	98 5f       	subi	r25, 0xF8	; 248
val.byte[0] = val.byte[1];
122:	23 2f       	mov	r18, r19
}

// at this point we are at sub-byte shifts, which
// are not optimial for multi-bit moves using shifting,
// so we want  to use multiplication instead, and
// basically shifting left instead of right now.

// on parts that do not support the mul instruction, I would
// suggest using Steves scan method from this point forward

bit_index += 8;
124:	98 5f       	subi	r25, 0xF8	; 248

if (val.byte[0] < 0x10)
126:	20 31       	cpi	r18, 0x10	; 16
128:	18 f4       	brcc	.+6      	; 0x130
{
bit_index -= 4;
12a:	94 50       	subi	r25, 0x04	; 4
val.byte[0] *= 16;
12c:	22 95       	swap	r18
12e:	20 7f       	andi	r18, 0xF0	; 240
}

if (val.byte[0] < 0x40)
130:	20 34       	cpi	r18, 0x40	; 64
132:	18 f4       	brcc	.+6      	; 0x13a
{
bit_index -= 2;
134:	92 50       	subi	r25, 0x02	; 2
val.byte[0] *= 4;
136:	22 0f       	add	r18, r18
138:	22 0f       	add	r18, r18
}

if (val.byte[0] < 0x80)
13a:	27 ff       	sbrs	r18, 7
{
bit_index -= 1;
13c:	91 50       	subi	r25, 0x01	; 1
}

if (val.byte[0] == 0)
13e:	22 23       	and	r18, r18
140:	19 f4       	brne	.+6      	; 0x148
return 0;
142:	80 e0       	ldi	r24, 0x00	; 0
144:	90 e0       	ldi	r25, 0x00	; 0
146:	08 95       	ret

return bit_index;
148:	89 2f       	mov	r24, r25
14a:	99 27       	eor	r25, r25
}
14c:	08 95       	ret```

If you look at the 2nd part of the function, you'll see that gcc doesn't use the mul instruction as you expected.
In this case gcc found better solutions.
Since this behavior is my constant companion, I always look at the assembler generated by gcc to
check whether gcc understood what I wanted.

Brian:
Your solution seems promising!
I'll try it out as soon as possible.

It's been a long and exciting night... but now it's time for me to go to work.
So please don't mind if I can't send any reply before tomorrow.

Regards
Sebastian

S-Sohn, the speed tests that you did were not necessarily optimal. They may give a good representation of average speed, but they may not find the highest value. But there is a much simpler way of doing it.

There are only 33 possibilities for the return value, and in each case which other bits are set besides the MSB is irrelevant to the speed of the function. So you only really need to time those 33 possibilities. With these results you can easily see what the real maximum and minimum values are. You can also calculate the average from them.

There are two methods of averaging that might be of use. The first is a simple average of the 33 values. This would be relevant in cases where each bit has as equal chance of being the msb. This would be a specialized case and would not be applicable in the majority of cases.

More likely the numbers are random. In this case, half of the time bit 31 will be the MSB, 1/4 of the time bit 30, and so on. So the time taken must be weighted by how often the case is likely to occur. The equation for the average is:

`avg = t31/(2^1) + t30/(2^2) + t29/(2^3) + ... + t0/(2^32) + tnone/(2^33)`
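Spelled out in code, the weighting amounts to this (a sketch; t[bit] holds the cycle count measured when that bit is the MSB, and t_none the count for a zero input):

```c
/* Weighted average per the formula above: the time for bit 31 as MSB gets
   weight 1/2, bit 30 weight 1/4, ..., bit 0 weight 1/2^32, and the
   value == 0 case weight 1/2^33. */
double weighted_avg(const double t[32], double t_none)
{
    double avg = 0.0, w = 0.5;
    int bit;
    for (bit = 31; bit >= 0; bit--) {  /* t[bit] = cycles when 'bit' is the MSB */
        avg += t[bit] * w;
        w /= 2.0;
    }
    return avg + t_none * w;           /* after the loop, w == 1/2^33 */
}

/* Helper: if all 33 cases take the same time t, the weighted average comes
   out just a hair under t, since the weights sum to 1 - 2^-33. */
double weighted_avg_const(double t)
{
    double arr[32];
    int i;
    for (i = 0; i < 32; i++) arr[i] = t;
    return weighted_avg(arr, t);
}
```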

I rewrote your code to test for the 33 cases and calculate the averages. Here are the results. The columns are (in order) high, low, simple average and weighted average:

```S_SOHN    55    19    36     43
DANNI     37    30    33     36
GLITCH    39    33    36     35```

I also tested the straightforward method of testing the high bit and shifting if it is 0, or exiting if it is 1. The simple average is much worse, as you would expect: 158 cycles, with a high of 290 (I had to change the code to check for timer overflow!). However, the weighted average was only 24! This is because bit 31 had the shortest time of 15. So it could be that all this bit twiddling was for nothing!
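For reference, the straightforward method being timed there is essentially this (a sketch):

```c
#include <stdint.h>

/* Test the top bit; if clear, shift left and try again. Very fast when the
   MSB is near the top and slow when it is near the bottom -- which is
   exactly why its weighted average comes out so low for random inputs.
   Returns 1..32 for the highest set bit, 0 if value == 0. */
uint8_t get_msb_shift(uint32_t value)
{
    uint8_t pos;
    for (pos = 32; pos != 0; pos--) {
        if (value & 0x80000000u)
            return pos;
        value <<= 1;
    }
    return 0;
}
```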

Regards,
Steve A.

The Board helps those that help themselves.

If you want a laugh, take a look at my entry for best weighted average.

It has a minimum execution time of 15 cycles, including 11 cycles of overhead for parameter-setup and the call/return. So, that's only 4 cycles of actual work!

If I did the math correctly, the weighted average should be 7 cycles (excluding the overhead).

((... and it only uses 135 instruction words ...)) :oops:

edit: Well, crap... I can't get the file to attach, so here it is as text.

get_msb_waopt.S:

```; c-prototype:  uint8_t get_msb_waopt (uint32_t value)

get_msb_waopt: .global get_msb_waopt

#define TA( MSB )               \
sbrs R25, (MSB-1)&0x7            \
$    rjmp NotBit##MSB                \
$    clr R25                         \
$    ldi R24, MSB                    \
$    ret                             \
$ NotBit##MSB:
TA( 32 ) $ TA( 31 ) $ TA( 30 ) $ TA( 29 ) $ TA( 28 ) $ TA( 27 ) $ TA( 26 ) $ TA( 25 )

#define TB( MSB )                 \
sbrs 22+((MSB-1)/8), (MSB-1)&0x7    \
$    rjmp NotBit##MSB                   \
$    ldi R24, MSB                       \
$    ret                                \
$ NotBit##MSB:
TB( 24 ) $ TB( 23 ) $ TB( 22 ) $ TB( 21 ) $ TB( 20 ) $ TB( 19 ) $ TB( 18 ) $ TB( 17 )
TB( 16 ) $ TB( 15 ) $ TB( 14 ) $ TB( 13 ) $ TB( 12 ) $ TB( 11 ) $ TB( 10 ) $ TB( 9 )
TB( 8 ) $ TB( 7 ) $ TB( 6 ) $ TB( 5 ) $ TB( 4 ) $ TB( 3 ) $ TB( 2 ) // $ TB( 1 )

sbrc 22, 0
ldi R24, 1
ret
```

Yes, a LUT (Look Up Table) should be the fastest, since after figuring out which byte has the MSB (pretty much the same overhead for all the methods) its only overhead is loading a 16 bit pointer, adding an offset to it (the byte with the MSB) and executing the LPM instruction. The drawback is the amount of memory required for the table. In this case size was no object and speed was the goal, so this method works well. I chose to go the algebraic route, because it's more challenging to come up with a quick, yet compact, piece of code to do the same job.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

bcb wrote:
It has a minimum execution time of 15 cycles, including 11 cycles of overhead for parameter-setup and the call/return. So, that's only 4 cycles of actual work!

Umm, not quite.. it takes more than 4 cycles to determine which byte of the 4 byte value needs to be used. This is part of the calculation, and not 'set-up' overhead.

Here is an optimized version of the sieve. If the compiler does not
optimize the 32 bit and 16 bit compares, use the union method.

```
#define MaskHi_4 0xf0

uint8_t find_msb(uint32_t value) {
    uint16_t msw;
    uint8_t  msb, msb_s;
    uint8_t  bit_index = 0;

    if(value & 0xffff0000l) {
        bit_index += (16 >> 3);
        msw = value >> 16;
    } else msw = value;

    if(msw & 0xff00) {
        bit_index += (8 >> 3);
        msb = msw >> 8;
    } else msb = msw;

    if(msb == 0) return 0;

    msb_s = msb >> 1;
    bit_index <<= 1;
    bit_index <<= 1;
    bit_index <<= 1;

    bit_index++;

    return bit_index;
}
```

Here is a semi-optimized assembly version of the above. It should
be relatively "C" friendly. It assumes that R16-R21 are scratch registers
and can be modified without saving/restoring. It has a worst case runtime
of 25 cycles (not including call/ret), and a best case of 21 (for non-zero values).
A zero value will be processed in 10 cycles.

```find_msb_sieve:
; could use R22 and up for input, but wanted to keep it 'C'
; friendly; most compilers pass parameters starting at R16.
; in a full assembly program this can be moved to shave off a few
; cycles, allowing for an adiw for the initial compare. Also the result
; could be left in R20, to save another cycle
;
; input:
;   value -> R16/17/18/19

;  result:
;   R16 -> 8bit index of bit 0=not found, 1-32 bit position

; working registers:
;   R20,  R21

tstW1:
mov   R21, R19
or    R21, R18
breq  tstW0

tstB3:
tst   R19
breq  tstB2

mov   R21, R19
lsr   r21
ldi   r20, 3
mov   R16, R19
mov   R17, R19

andi  R19, $F0
cp    R21, R19
rol   R20
andi  R16, $cc
cp    R21, R16
rol   R20
andi  R17, $aa
cp    R21, R17
rol   R20

inc   R20
mov   R16, R20
ret

tstB2:
mov   R21, R18
lsr   r21
ldi   r20, 2
mov   R16, R18
mov   R17, R16

andi  R16, $F0
cp    R21, R16
rol   R20
andi  R17, $cc
cp    R21, R17
rol   R20
andi  R18, $aa
cp    R21, R18
rol   R20

inc   R20
mov   R16, R20
ret

tstW0:
tst   R17
breq  tstB0

mov   R21, R17
lsr   r21
ldi   r20, 1
mov   R16, R17
mov   R18, R16

andi  R16, $F0
cp    R21, R16
rol   R20
andi  R17, $cc
cp    R21, R17
rol   R20
andi  R18, $aa
cp    R21, R18
rol   R20

inc   R20
mov   R16, R20
ret

tstB0:
tst   R16
breq  done

mov   R21,R16
lsr   r21
clr   R20
mov   R17, R16
mov   R18, R16

andi  R16, $F0
cp    R21, R16
rol   R20
andi  R17, $cc
cp    R21, R17
rol   R20
andi  R18, $aa
cp    R21, R18
rol   R20

inc   R20
mov   R16, R20

done:
ret
```

In comparison to the LUT method, which requires about 9 cycles of setup/execution once the byte is known, this method requires about 14. So for an 8 bit test it runs at about 60% of the speed of the LUT. As the input value increases in size the difference becomes less and less, though this method will always be slightly slower, by approximately 5 cycles per pass.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

I'm not claiming my last method was completely practical, but I expect it would have the best weighted average.

glitch wrote:
Umm, not quite.. it takes more than 4 cycles to determine which byte of the 4 byte value needs to be used. This is part of the calculation, and not 'set-up' overhead.

My get_msb_waopt code skips the entire byte selection process.

If the highest bit of the input is set (result = 32), this is what gets executed:

```; input = R22/R23/R24/R25  from least to most significant byte

get_msb_waopt:
sbrs R25, 7      ;1
rjmp NotBit32    ;1 -
clr R25          ;1
ldi R24, 32      ;1
ret
NotBit32:
<....>
```

4 cycles, 8 with the return.

Now if you imagine the above code repeating 32 times to do a brute force check of all 32 bits, you will find that it will run exceedingly fast for the first few bits, and exceedingly slow for the rest.

Now, if each of the 33 results are equally probable, it won't win any speed tests.

But, for Steve's weighted average described above, this code is speed-optimal. ;)

So, like anything else, the definition of 'best' depends upon what the real world inputs will be. At this point we're just guessing.

--Brian

Quote:
and exceedingly slow for the rest.

It's not really that bad. The maximum would be 107 (including the overhead) and the simple average is 62. This still beats the simple shift and mask method, and it still beats or ties the other methods through 16 bits.

Regards,
Steve A.

The Board helps those that help themselves.

Ok, I thought you were referring to the LUT method.

SBRS = 2 cycles if the skip is taken, 1 cycle if not
RJMP = 2 cycles

so essentially you have a 3 cycle penalty for each bit you have to traverse. Interesting... but I think in practice it's not as useful as the other methods, as the input is not likely random, and is more likely used for range checking.

Incidentally, you can shave another 2 cycles off if you rearrange things a bit. Really this is just an unrolled version of the standard brute-force shift & test. In this case it's 2 cycles (not incl. call/ret) for bit 31, and a penalty of 3 cycles for each additional bit.

```; input r16/17/18/19
; output r20
get_msb_brute:
ldi  r20, 32    ; 1
sbrc r19,7      ; 1/2
ret             ;

dec  r20        ;	1
sbrc r19,6	  ; 1/2
ret
.
.
.
dec  r20        ;	1
sbrc r19,0      ; 1/2
ret             ;

dec  r20        ;	1
sbrc r18,7	  ; 1/2
ret
.
.
.
dec  r20        ;	1
sbrc r18,0      ; 1/2
ret             ;

dec  r20        ;	1
sbrc r17,7	  ; 1/2
ret
.
.
.
dec  r20        ;	1
sbrc r17,0      ; 1/2
ret             ;

dec  r20        ;	1
sbrc r16,7	  ; 1/2
ret
.
.
.
dec  r20        ;	1
sbrc r16,0	  ; 1/2
ret

dec  r20        ;	1
ret
```

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Fri. Jun 2, 2006 - 08:21 PM

Koshchi wrote:
Quote:
and exceedingly slow for the rest.

It's not really that bad. The maximum would be 107 (including the overhead) and the simple average is 62. This still beats the simple shift and mask method, and it still beats or ties the other methods through 16 bits.

Actually I think the crossover is more likely around 8 bits in, so for anything with an msb around bit 24 or below it will be slower than the other methods.

The optimized sieve method I posted has a runtime of 22 cycles if the msb is in the high order byte. The brute force method will take 4 + (3*7) = 25 cycles if bit 24 (bit 0 of the high order byte) is the msb. The unrolled brute force that I posted above would match it for the high byte, and be slower for anything else.
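For reference, here is a C sketch (my own illustration, not code posted in this thread) of the plain shift-and-test baseline being compared above. It uses the same result convention as the unrolled asm: 32 for bit 31, down to 1 for bit 0, and 0 if no bit is set.

```c
#include <stdint.h>

/* Walk down from bit 31 until a set bit is found.  Worst case
   (input 0 or 1) tests all 32 bit positions, which is where the
   per-bit penalty discussed above adds up. */
static uint8_t find_msb_shift(uint32_t v)
{
    uint8_t pos;
    for (pos = 32; pos > 0; pos--)
        if (v & (1UL << (pos - 1)))
            return pos;
    return 0;   /* no bit set */
}
```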

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

glitch wrote:

```; input r16/17/18/19
; output r20
get_msb_brute:
ldi  r20, 32    ; 1
sbrc r19,7      ; 1/2
ret             ;

dec  r20        ;	1
sbrc r19,6	   ; 1/2
ret
. . .```

Nice catch glitch.

Unfortunately for me, I'm using GCC, whose C parameter passing scheme isn't compatible with your modification.

A GCC version would make your tweaks look like this:

```; input r22/23/24/25
; output r24/25
get_msb_brute:
clr  r25        ; 1 -- clobbers the input

ldi  r24, 32    ; 1 -- clobbers the input
sbrc r25,7      ; 1/2
ret

dec  r24        ; 1
sbrc r25,6      ; 1/2
ret
. . .```

It would be a fine mod if the caller were written in Asm instead of C.

--Brian

Ok, I'm not up on GCC parameter passing. I thought it passed parameters like IAR, starting at r16. You could mov/copy the shared registers into some scratch registers, but the cost of doing so would bring it on par with your solution, or even add an additional cycle. Note that even my code would not work well for IAR, as the result would need to be in r16 as well, although I could cheat a little there by passing a dummy variable as the first parameter; this would put the value in R20-24, and the dummy parameter & return value in R16.

I am curious, though: if you're returning a uint8_t, why are you concerned about r25 being cleared before exit? It shouldn't be involved in the result, since it's an 8-bit value (r24). That reduction would shave a cycle off, bringing you down to 3+(3n). I guess I'll have to go look at the GCC docs to see what's going on.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

Last Edited: Fri. Jun 2, 2006 - 09:44 PM

For function return values, WinAVR's GCC promotes 8-bit types to 16-bit types. So the calling function would expect the upper 8 bits to be zeroed.

I can't explain the advantage of that behavior. That's just the way it is.

bcb wrote:
For function return values, WinAVR's GCC promotes 8-bit types to 16-bit types. So the calling function would expect the upper 8 bits to be zeroed.

I can't explain the advantage of that behavior. That's just the way it is.

Interesting, you would think that it would be more efficient to do the promotion after the return, allowing it to be optimized away. I wonder if that's an artifact of GCC being generalized to work on multiple platforms, or if it's just an implementation quirk in the AVR port? Though even with its quirks it seems to produce damn good code. I use IAR myself, mostly because that's what was available back when I started using AVRs. I've been happy and haven't felt a need to switch, but would use GCC if I moved away from IAR.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

bcb wrote:
For function return values, WinAVR's GCC promotes 8-bit types to 16-bit types. So the calling function would expect the upper 8 bits to be zeroed.

I can't explain the advantage of that behavior. That's just the way it is.

https://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=37011&highlight=

Regards
Sebastian

Many thanks for all your support!!!
So here's my implementation:

```#include <avr/io.h>
#include <stdint.h>

typedef union
{
uint32_t dword;
uint16_t word[2];
uint8_t byte[4];
} my_uint32_t;

void find_msb (uint32_t value);
void do_something (uint8_t value);

// The Return Value Of Function find_msb
// Is Stored Here To Prevent Function Overhead
// (Smallest Return Value Of GCC Is A 16 Bit Number)
register uint8_t ReturnValue asm("r16");

int main (void)
{
// The variable From Which
// The MSB Shall Be Determined
my_uint32_t value;
// Just A Simple Counter
uint8_t count;
// The Result Of Function find_msb Is Stored
// In This Variable (Same Register Location
// As Return Value Of Function find_msb)
register uint8_t msb asm("r16");

value.dword = 0;

for (count = 33; count > 0; count--)
{
find_msb (value.dword);
// This Will Cause No Operation Since Both Values
// Are Located In The Same Registers
msb = ReturnValue;
// Just A Dummy Function
do_something (msb);
// Calculate Next Value From Which The
// MSB Shall Be Calculated
value.dword *= 2;
value.byte[0] |= 1;
}

while (1);

return 0;
}

void find_msb (uint32_t value)
{
my_uint32_t val;

val.dword = value;
ReturnValue = 0;

// Is MSB Located In High Word?
if (val.word[1] != 0)
{
ReturnValue += 16;
val.word[0] = val.word[1];
}

// Is MSB Located In High Byte?
if (val.byte[1] != 0)
{
ReturnValue += 8;
val.byte[0] = val.byte[1];
}

// Is MSB Located In High Nibble?
if (val.byte[0] > 0x0F)
{
ReturnValue += 4;
val.byte[0] >>= 4;
}

// Is MSB Located In The 2
// Most Significant Bits Of Nibble?
if (val.byte[0] > 0x03)
{
ReturnValue += 2;
val.byte[0] >>= 2;
}

// Is MSB Located In The
// Most Significant Bit Of 2 Bits
if (val.byte[0] > 0x01)
{
ReturnValue += 1;
val.byte[0] >>= 1;
}

ReturnValue += val.byte[0];
}

void do_something (uint8_t value)
{
while (value > 0)
{
value--;
}
}```
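One caveat raised earlier in the thread: the byte[]/word[] indexing in this union assumes a little-endian layout, which is correct for AVR but not for every machine. A quick sketch (my own, not posted in the thread) of how that assumption could be checked on another target:

```c
#include <stdint.h>

/* Returns 1 on a little-endian target, where byte[0] of a union'd
   uint32_t is the least significant byte -- the layout assumed by
   find_msb() above. */
static int is_little_endian(void)
{
    union { uint32_t dword; uint8_t byte[4]; } u = { 0x01020304UL };
    return u.byte[0] == 0x04;
}
```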

The mixed listing shows how efficiently the MSB is calculated and
that no overhead is generated for function find_msb():
(Compiled With GCC 3.4.6 with -Os optimization For ATmega8)

```@0000002E: find_msb
---- main.c ---------------------------------------------------------------------------------------
67:           ReturnValue = 0;
+0000002E:   E000        LDI     R16,0x00         Load immediate
70:           if (val.word[1] != 0)
+0000002F:   9700        SBIW    R24,0x00         Subtract immediate from word
+00000030:   F011        BREQ    PC+0x03          Branch if equal
72:               ReturnValue += 16;
+00000031:   E100        LDI     R16,0x10         Load immediate
73:               val.word[0] = val.word[1];
+00000032:   01BC        MOVW    R22,R24          Copy register pair
77:           if (val.byte[1] != 0)
+00000033:   2377        TST     R23              Test for Zero or Minus
+00000034:   F011        BREQ    PC+0x03          Branch if equal
79:               ReturnValue += 8;
+00000035:   5F08        SUBI    R16,0xF8         Subtract immediate
80:               val.byte[0] = val.byte[1];
+00000036:   2F67        MOV     R22,R23          Copy register
84:           if (val.byte[0] > 0x0F)
+00000037:   3160        CPI     R22,0x10         Compare with immediate
+00000038:   F018        BRCS    PC+0x04          Branch if carry set
86:               ReturnValue += 4;
+00000039:   5F0C        SUBI    R16,0xFC         Subtract immediate
87:               val.byte[0] >>= 4;
+0000003A:   9562        SWAP    R22              Swap nibbles
+0000003B:   706F        ANDI    R22,0x0F         Logical AND with immediate
92:           if (val.byte[0] > 0x03)
+0000003C:   3064        CPI     R22,0x04         Compare with immediate
+0000003D:   F018        BRCS    PC+0x04          Branch if carry set
94:               ReturnValue += 2;
+0000003E:   5F0E        SUBI    R16,0xFE         Subtract immediate
95:               val.byte[0] >>= 2;
+0000003F:   9566        LSR     R22              Logical shift right
+00000040:   9566        LSR     R22              Logical shift right
100:          if (val.byte[0] > 0x01)
+00000041:   3062        CPI     R22,0x02         Compare with immediate
+00000042:   F010        BRCS    PC+0x03          Branch if carry set
102:              ReturnValue += 1;
+00000043:   5F0F        SUBI    R16,0xFF         Subtract immediate
103:              val.byte[0] >>= 1;
+00000044:   9566        LSR     R22              Logical shift right
107:          ReturnValue += val.byte[0];
+00000046:   9508        RET                      Subroutine return
@00000047: do_something
+00000047:   2388        TST     R24              Test for Zero or Minus
+00000048:   F011        BREQ    PC+0x03          Branch if equal
116:              value--;
+00000049:   5081        SUBI    R24,0x01         Subtract immediate
+0000004A:   CFFC        RJMP    PC-0x0003        Relative jump
+0000004B:   9508        RET                      Subroutine return
@0000004C: main
+0000004C:   E5CF        LDI     R28,0x5F         Load immediate
+0000004D:   E0D4        LDI     R29,0x04         Load immediate
+0000004E:   BFDE        OUT     0x3E,R29         Out to I/O location
+0000004F:   BFCD        OUT     0x3D,R28         Out to I/O location
39:           value.dword = 0;
+00000050:   24CC        CLR     R12              Clear Register
+00000051:   24DD        CLR     R13              Clear Register
+00000052:   0176        MOVW    R14,R12          Copy register pair
41:           for (count = 33; count > 0; count--)
+00000053:   E2C1        LDI     R28,0x21         Load immediate
43:               find_msb (value.dword);
+00000054:   01C7        MOVW    R24,R14          Copy register pair
+00000055:   01B6        MOVW    R22,R12          Copy register pair
+00000056:   DFD7        RCALL   PC-0x0028        Relative call subroutine
48:               do_something (msb);
+00000057:   2F80        MOV     R24,R16          Copy register
+00000058:   DFEE        RCALL   PC-0x0011        Relative call subroutine
51:               value.dword *= 2;
+00000059:   0CCC        LSL     R12              Logical Shift Left
+0000005A:   1CDD        ROL     R13              Rotate Left Through Carry
+0000005B:   1CEE        ROL     R14              Rotate Left Through Carry
+0000005C:   1CFF        ROL     R15              Rotate Left Through Carry
52:               value.byte[0] |= 1;
+0000005D:   E081        LDI     R24,0x01         Load immediate
+0000005E:   2AC8        OR      R12,R24          Logical OR
41:           for (count = 33; count > 0; count--)
+0000005F:   50C1        SUBI    R28,0x01         Subtract immediate
+00000060:   F799        BRNE    PC-0x0C          Branch if not equal
55:           while (1);
+00000061:   CFFF        RJMP    PC-0x0000        Relative jump```

Once More: MANY THANKS!!!
Regards
Sebastian

Sebastian,

I've modified your code so it uses the standard C calling convention by avoiding the use of R16 as a return value.

The nifty thing is that it has the exact same cycle count as your version.

It takes advantage of the fact that the 16bit return value is overlapped with the highword of the 32bit input. It treats value.word[1] as the return value.

```uint8_t find_msb_ssohn_std (uint32_t value)
{
my_uint32_t val;

val.dword = value;

// Result will be placed in val.word[1]

// Is MSB Located In High Word?
if (val.word[1] != 0)
{
val.word[0] = val.word[1];
val.byte[2] = 16;   // Result = 16
val.byte[3] = 0;    // Make sure high byte is cleared
}
//    else  Do nothing. val.word[1] is known to be zero
//    {
//        val.byte[2] = 0;   // Result = 0
//        val.byte[3] = 0;   // Make sure high byte is cleared
//    }

// Is MSB Located In High Byte?
if (val.byte[1] != 0)
{
val.byte[2] += 8;   // Result += 8
val.byte[0] = val.byte[1];
}

// Is MSB Located In High Nibble?
if (val.byte[0] > 0x0F)
{
val.byte[2] += 4;   // Result += 4
val.byte[0] >>= 4;
}

// Is MSB Located In The 2
// Most Significant Bits Of Nibble?
if (val.byte[0] > 0x03)
{
val.byte[2] += 2;   // Result += 2
val.byte[0] >>= 2;
}

// Is MSB Located In The
// Most Significant Bit Of 2 Bits
if (val.byte[0] > 0x01)
{
val.byte[2] += 1;   // Result += 1
val.byte[0] >>= 1;
}

val.byte[2] += val.byte[0];

return val.word[1];
}```

The only problem is that the longest path through the function is taken when the highest bit is set. So if there is a 50% chance that the result will be 32, then this isn't the best routine.

Otherwise, the speed/size/clarity of this function should make it a good general library routine.

In addition, the above code will run 0.5 cycle faster (on a simple average), when the return type is changed to uint16_t.

`uint16_t find_msb_ssohn_std (uint32_t value)`

This lets us manage the 8->16 bit return-value expansion ourselves, rather than leaving it to the compiler.

And lastly, this should be compatible with all compilers, but compilers other than GCC might not have the same speed advantage.

--Brian

Quote:
The only problem is that the the longest path through the function is when the highest bit is set. So if there is a 50% chance of that the result will be 32, then this isn't the best routine.

To solve this problem I would rearrange the structure. Start ReturnValue as the highest value instead of the lowest value, then change all the 'if's from this form:

```    // Is MSB Located In High Word?
if (val.word[1] != 0)
{
ReturnValue += 16;
val.word[0] = val.word[1];
} ```

to this form:

```    // Is MSB Located In Low Word?
if (val.word[1] == 0)
{
ReturnValue -= 16;
val.word[1] = val.word[0];
} ```

This way, if the highest bit is set, the code will follow the shortest path. Then if the input numbers are random, the average execution time will be much shorter. But if each bit has an equal chance of being the msb, the average is the same as with your arrangement, and you have lost nothing.
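Put together, that rearrangement looks roughly like this in portable C (a sketch of my own, shifting left instead of copying union members; same 1-based result convention as Sebastian's routine, 0 for a zero input):

```c
#include <stdint.h>

/* Start from the top and subtract: when bit 31 is set, none of the
   branches are taken and the function returns by the shortest path. */
static uint8_t find_msb_topdown(uint32_t v)
{
    uint8_t pos = 32;                              /* assume bit 31 */
    if ((v & 0xFFFF0000UL) == 0) { pos -= 16; v <<= 16; }
    if ((v & 0xFF000000UL) == 0) { pos -= 8;  v <<= 8;  }
    if ((v & 0xF0000000UL) == 0) { pos -= 4;  v <<= 4;  }
    if ((v & 0xC0000000UL) == 0) { pos -= 2;  v <<= 2;  }
    if ((v & 0x80000000UL) == 0) { pos -= 1;  v <<= 1;  }
    if (v == 0)                                    /* no bit set at all */
        pos = 0;
    return pos;
}
```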

Regards,
Steve A.

The Board helps those that help themselves.

Great ideas, impossible to ignore them!!!!!
Thank you!

If one accepts that C (and other language) library writers take great pains to do stuff "right", wouldn't examination of log2() and FP normalization routines be pertinent?

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Lee:
Sorry, but my english seems to be too bad to understand your question.

Sebastian

Lee is saying that the writers of the LibC maths routines work as hard as they can to make their routines as small and as "good" as they can get them, so their code should be close to optimal.

The log2(y) (logarithm base 2) function returns "x" where 2^x=y. Say you have the number four: log2(4) returns 2, since 2^2=4. You can use this to find the MSB, since the integer part of the value returned by log2(y) is the bit number of the MSB. Unfortunately the log2() function uses floating point maths, which will waste a significant amount of space. So Lee is suggesting that you download the LibC source code, have a look at the log2() routine, and write your own version which only returns the whole number.
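That relationship can be sketched as an integer-only routine (my own illustration, not code from avr-libc; the ilog2 name is made up):

```c
#include <stdint.h>

/* Integer floor(log2(y)) for y > 0, done with a naive scan; any of the
   faster MSB routines in this thread could replace the loop.  Note the
   0-based result: ilog2(4) is 2 because 2^2 == 4, whereas the 1-based
   MSB routines above would return 3 for the same input. */
static uint8_t ilog2(uint32_t y)
{
    uint8_t pos = 0;
    while (y >>= 1)     /* shift out bits below the MSB, counting them */
        pos++;
    return pos;
}
```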

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

Further, many floating point operations require normalization to find the highest 1 bit. Maybe this is in fact what OP is up to.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.