How to check all bits in uint32_t ? 8 Bit microprocessor.

Go To Last Post
44 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi guys,

AVR noobie here trying to figure out how to fast read multiple bits in uint32_t variable.

I am using SBIT() macro to read and write bits.

 

sbit.h


#ifndef SBIT_H_
#define SBIT_H_

struct bits {
	uint8_t b0:1;
	uint8_t b1:1;
	uint8_t b2:1;
	uint8_t b3:1;
	uint8_t b4:1;
	uint8_t b5:1;
	uint8_t b6:1;
	uint8_t b7:1;
	uint8_t b8:1;
	uint8_t b9:1;
	uint8_t b10:1;
	uint8_t b11:1;
	uint8_t b12:1;
	uint8_t b13:1;
	uint8_t b14:1;
	uint8_t b15:1;
	uint8_t b16:1;
	uint8_t b17:1;
	uint8_t b18:1;
	uint8_t b19:1;
	uint8_t b20:1;
	uint8_t b21:1;
	uint8_t b22:1;
	uint8_t b23:1;
	uint8_t b24:1;
	uint8_t b25:1;
	uint8_t b26:1;
	uint8_t b27:1;
	uint8_t b28:1;
	uint8_t b29:1;
	uint8_t b30:1;
	uint8_t b31:1;
} __attribute__((__packed__));

#define SBIT(port,pin) ((*(volatile struct bits*)&port).b##pin)

#endif /* SBIT_H_ */

Example.c

#include <avr/io.h>
#include <sbit.h>

volatile uint32_t data = 0xAAAAAAAA;
volatile uint8_t temp = 0; 

int main(void)
{
    SBIT(data,0) = 1;				// Works fine no problem.
    temp = SBIT(data,0);			// Also works.

    for(int count=0; count < 32; count++)	// To check each individual bits of data
    {
        if(SBIT(data,count) == 1)		// Does not work ERROR - 'volatile struct bits' has no member named 'bcount'
        {
            // Do something.
        }
    }

    while (1)
    {
        __asm__ __volatile__ ("nop");
    }
}

The problem is I am passing the value of variable count into SBIT(port,pin). But the "pin" expects a manual input like 0,1,2 etc. Is there a way to make this work so I can just pass the variable containing the number. I am sure there must be other ways, but I need this specific way with the loop and  SBIT() macro.

 

Second Question - Is there a faster way to do this? Less clock cycles. 

 

My project contains handling individual 32 LEDS + other code.

Thanks.

This topic has a solution.
Last Edited: Sat. Nov 30, 2019 - 03:35 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If the issue is how to control 32 LEDs with the bits from a 32-bit variable,

methinks your goal should be clarity rather than speed.

C does not allow bit arrays,

so you might want to use a byte per LED.

The resulting code is likely  to be more clear as well as faster.

 

Your attempt appears to be an effort to affect individual bits

regardless of other bits that might need manipulation later.

For consecutive bits, if you really need speed,

indexing and mask shifting are the way to go.

Burying stuff in macros is not.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Speed is not my goal right now, primary concern is to check each bit through loop. Can't use bytes need to save space for storing the LED effects. Indexing and mask shifting don't know how to do. Will look into it.

 

Last Edited: Fri. Nov 29, 2019 - 06:17 AM
This reply has been marked as the solution. 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 2

I would head in the direction of using functions-

https://godbolt.org/z/MSH-a5

 

They will be as small as defines, you get the compiler to do some work which it does quite well, and also get the compiler to complain when you do things wrong instead of having to figure out what the preprocessor did with all the text substitutions.

 

The functions as static functions can go in a header, also. 

 

I think the less the preprocessor is involved, the better. Sometimes it seems there is no choice, but more often than you think it is not required and the compiler is more often a better tool.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Heisen wrote:
The problem is I am passing the value of variable count into SBIT(port,pin).
You need to understand how C and the preprocessor works. The preprocessor is a phase that executes before compilation - not something at run time. So what happens to your code here is that it is first preprocessed to create:

# 3 ".././main.c"
struct bits {
 uint8_t b0:1;
 uint8_t b1:1;
 uint8_t b2:1;
 uint8_t b3:1;
 uint8_t b4:1;
 uint8_t b5:1;
 uint8_t b6:1;
 uint8_t b7:1;
 uint8_t b8:1;
 uint8_t b9:1;
 uint8_t b10:1;
 uint8_t b11:1;
 uint8_t b12:1;
 uint8_t b13:1;
 uint8_t b14:1;
 uint8_t b15:1;
 uint8_t b16:1;
 uint8_t b17:1;
 uint8_t b18:1;
 uint8_t b19:1;
 uint8_t b20:1;
 uint8_t b21:1;
 uint8_t b22:1;
 uint8_t b23:1;
 uint8_t b24:1;
 uint8_t b25:1;
 uint8_t b26:1;
 uint8_t b27:1;
 uint8_t b28:1;
 uint8_t b29:1;
 uint8_t b30:1;
 uint8_t b31:1;
} __attribute__((__packed__));



volatile uint32_t data = 0xAAAAAAAA;
volatile uint8_t temp = 0;

int main(void)
{
 ((*(volatile struct bits*)&data).b0) = 1;
 temp = ((*(volatile struct bits*)&data).b0);

 for(int count=0; count < 32; count++)
 {
  if(((*(volatile struct bits*)&data).bcount) == 1)
  {

  }
 }

 while (1)
 {
  __asm__ __volatile__ ("nop");
 }
}

The way the macro works is using ## to "glue" a fixed pin (/bit) number 0..31 to the letter 'b' to make .b17 or .b21 but because you have tried to use the macro with a variable called "count" then all it has done is to concatentae "count" onto 'b' to make "bcount".

 

Bottom line is you can't use a compile time macro for this.

 

Curt showed an approach. Here's my take on the same thing just to show a simplified version:

#include <avr/io.h>

void SETBIT(uint32_t * data, uint8_t bitnum, uint8_t state) {
	if (state) {
		*data |= (1 << bitnum);
	}
	else {
		*data &= ~(1 << bitnum);
	}
}

uint8_t TESTBIT(uint32_t data, uint8_t bitnum) {
	uint8_t retval = 0;
	
	if (data & (1UL << bitnum)) {
		retval = 1;
	}
	return retval;
}

volatile uint32_t data = 0xAAAAAAAA;
volatile uint8_t temp = 0;

int main(void)
{
	SETBIT(&data, 13, 0);				// Works fine no problem.
	temp = TESTBIT(data, 13);	   // Also works.

	for(int count=0; count < 32; count++)	// To check each individual bits of data
	{
		if(TESTBIT(data,count) == 1)
		{
			// Do something.
		}
	}

	while (1)
	{
		__asm__ __volatile__ ("nop");
	}
}

The problem is that if you are going to have a run-time varying bit index then you almost inevitably need to involve (1 << n) to work out a bit mask based on the bit index and <<n is an "expensive" operation in computing terms.

 

One thing you might do to mitigate this is to pre-calculate the bit masks:

const __flash uint32_t masks[] = {
	0x00000001,
	0x00000002,
	0x00000004,
	0x00000008,
	0x00000010,
	0x00000020,
	0x00000040,
	0x00000080,
	0x00000100,
	0x00000200,
	0x00000400,
	0x00000800,
	0x00001000,
	0x00002000,
	0x00004000,
	0x00008000,
	0x00010000,
	0x00020000,
	0x00040000,
	0x00080000,
	0x00100000,
	0x00200000,
	0x00400000,
	0x00800000,
	0x01000000,
	0x02000000,
	0x04000000,
	0x08000000,
	0x10000000,
	0x20000000,
	0x40000000,
	0x80000000
};

This does "cost" 128 bytes of flash but now, wherever you might have used (1UL << n) you just use masks[n]. This way the compiler does not need to do run-time 1 bit shifts but just read the right entry in this table (it's still going to be doing a masks+(n * 4) kind of operation though.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you want to test whether each bit of uint32_t data is set then why not

 

int main(void)
{
    uint32_t data = 0xaaaaaaaa;

    for (uint32_t mask = 1; mask != 0; mask <<= 1)
    {
        if (data & mask)
        {
            /* the bit was set */
        }
    }
    while (1);
}

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The previous 2 ideas are probably not any better/worse than the function version I showed-

 

https://godbolt.org/z/D8pt6X

 

I didn't count instructions/clocks in each loop, but the function version is the smallest code. They may be all close enough to not worry about, so then go with the easiest to read or least code or whatever is the tie-breaker. Its also possible there is something better, but I would guess anything better starts 'looking' worse.

Last Edited: Fri. Nov 29, 2019 - 11:55 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Heisen wrote:
Speed is not my goal right now, primary concern is to check each bit through loop. Can't use bytes need to save space for storing the LED effects. Indexing and mask shifting don't know how to do. Will look into it.
If you do not know how to do indexing or mask shifting,

that might be a problem in itself.

 

How much SRAM do you have?

A byte per LED costs at most an extra 28 bytes.

Less if you have more than one set of bits.

The bits could be interleaved to make more use of already used byte.

 

In any case, go for clarity first.

If the result is fast and frugal enough, you are done.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

These functions work absolutely fine. yes Thanks.

These tiny functions are also faster than using below macros which I was looking at before. I don't know if it is true or not but in my simulation tests your functions were much faster. I might be wrong here.

 

#define BIT(x) (0x01 << (x))
#define LONGBIT(x) ((unsigned long)0x00000001 << (x))
#define bit_get(p,m) ((p) & (m))
#define bit_set(p,m) ((p) |= (m))
#define bit_clear(p,m) ((p) &= ~(m))

One thing more, can you make functions for uint16_t too in your link https://godbolt.org/z/MSH-a5. I can't use them on 16 bit variable. indecision 

bitset16();
bitclr16();
bitval16();

While you are at in there, throw in those for 64 bit too. I don't know if that is possible but anyway thank you.

Last Edited: Sat. Nov 30, 2019 - 03:35 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am still very new to this but I think I got it. So these bit masks can I use them for the functions that curt showed above, will they reduce the processing time significantly?

 

const __flash uint32_t masks[] = {
	0x00000001,
	0x00000002,
	0x00000004,
	0x00000008,
	0x00000010,
	0x00000020,
	0x00000040,
	0x00000080,
	0x00000100,
	0x00000200,
	0x00000400,
	0x00000800,
	0x00001000,
	0x00002000,
	0x00004000,
	0x00008000,
	0x00010000,
	0x00020000,
	0x00040000,
	0x00080000,
	0x00100000,
	0x00200000,
	0x00400000,
	0x00800000,
	0x01000000,
	0x02000000,
	0x04000000,
	0x08000000,
	0x10000000,
	0x20000000,
	0x40000000,
	0x80000000
};

static bool bitval8(volatile uint8_t* reg, uint8_t bit){
	return *reg & masks[bit];
}
static bool bitval32(volatile uint32_t* reg, uint8_t bit){
	return bitval8((volatile uint8_t*)reg+bit/8, bit%8);
}

is this correct?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I didn't even know what a mask is but I get it now. This still puzzles me. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

You need to read up on bitwise AND and OR operators then. The basic boolean functions are AND,OR and NOT. Learn these and it will make sense.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I had a look at the link in #7

 

And tried to follow the code, but I can't because lines like :

34               brpl 1b

don't make any sense, are there a way to get it to show a label number ?

 

like 

 39            brne .L3

 

I made a new project in stdio7 and get a compile error that it don't like bool ! 

something I need to include, or set in the compiler ? (I made a mega328 project but don't think that matter).

 

 

 

 

 

 

 

Last Edited: Sat. Nov 30, 2019 - 11:19 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
I had a look at the link in #7

I looked at #4 which is similar. I as going to contrast the naive method { (var & 1<<n) where n=0 to 31} against the special cases, but got sidetracked looking at the code for other processors. Several examples did recognise the intention of the special cases but produced quite lengthy code.

 

Anyway in brpl 1b:

1b just refers to a local label. In this case the closest label marked 1: looking backwards from the line.

 

Whereas in rjmp 2f:

The 2f means the closest label marked 2: forwards from the line.

 


{Edit: I forgot your 2nd question}

sparrow2 wrote:
I made a new project in stdio7 [sic] and get a compile error that it don't like bool

Indeed: I got that when modding the code for other processors. I guess it was included via <avr/io.h>

 

Adding #include <stdbool.h> does fix that.

 

Last Edited: Sat. Nov 30, 2019 - 12:25 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

I made a new project in stdio7 and get a compile error that it don't like bool ! 

Assuming we are talking C code, you want

#include <stdbool.h>

 

(this also assumes C99 or later, which will probably be he case by default)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>I am still very new to this but I think I got it. So these bit masks can I use them for the functions that curt showed above, will they reduce the processing time significantly?

 

https://godbolt.org/z/D8pt6X

You can try all 3 versions in 'main' by changing a define. Look at the assembly output. You can count instructions in the loop (some instructions take 1 clock, others 2), but the code size is normally a good indicator in many cases. They all end up similar enough. The downside to the mask code in this case, is it takes 128 bytes of flash, needs to use lpm to read flash, and does the masking on all 4 bytes which on an 8bit mcu is still not a great thing. That may work out better on a 32bit mcu, but you don't know until its tried.

 

>One thing more, can you make functions for uint16_t too

 

You just do the same thing as the 32bit version. You are taking in a var of some bit width (8,16,32,64) in the function argument, then finding/casting the byte of interest to an 8bit volatile by using the /8, and the bit of interest with the %8. In many cases of use, the compiler will have all the info needed so will be a few instructions (or 1 if using on something like &PORTB), and when it cannot know at compile time (like 'count' in the loop), it does a good job of coming up with code.

 

uint16_t a = 0x1234; //in memory as bytes -> 0x34, 0x12

bitset16( &a, 11 ); //we will say &a = 0x100 in this case

so 0x100+(11/8) = 0x101 (address of 0x12)

and 11%8 = 3

*(0x101) |= 1<<3; ->  0x12 |= 0x80, and we have just set bit 11 in a

 

 

static void bitset8(volatile uint8_t* reg, uint8_t bit){

  *reg |= 1<<bit;

}

static void bitset16(volatile uint16_t* reg, uint8_t bit){

  bitset8((volatile uint8_t*)reg+bit/8, bit%8);

}

static void bitset32(volatile uint32_t* reg, uint8_t bit){

  bitset8((volatile uint8_t*)reg+bit/8, bit%8);

}

static void bitset64(volatile uint64_t* reg, uint8_t bit){

  bitset8((volatile uint8_t*)reg+bit/8, bit%8);

}

 

 

>that it don't like bool ! 

 

The compiler explorer online compiler seems to use arduino backend, so my examples may be missing the required stdbool.h, and because it is already included somewhere else behind the scenes it works (I should include it in my examples because that would be the correct thing to do in any case).

 

 

Here is the same example compiled for a cortex-m4-

https://godbolt.org/z/MVPm2y

they all still look similar enough, so would not use the mask version there either as there is no clear advantage.

In most cases, you can create code that is simple to understand and the compiler will take over and translate to efficient code. There seems to be few places left where you have to 'persuade' the compiler to do something in a certain way to see a big benefit.

 

Last Edited: Sat. Nov 30, 2019 - 03:58 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The big question is why does OP want to use 32 bit in the first place (after all a AVR is 8bit).

 

The code in #7 take upto 61clk to check a bit (clk for a for loop and IO output) it sounds of a lot, but I guess that in many cases it don't matter.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>The code in #7 take upto 61clk to check a bit

 

Seems to be in the range of 20's to 40's, but I could be wrong.

 

>The big question is why does OP want to use 32 bit in the first place (after all a AVR is 8bit).

 

There are 32 led's, so I assume there will be 32 bits of data to process in any case so really does not matter if it is 4 uint8_t's or one uint32_t. Probably the bigger problem it what is going on in the code after a decision is made on a bit (to presumably light an led, or turn it off).

 

It may end up one great big set of if's may be best if the led's are scattered about the ports where you have to deal with them one at a time. You would end up with a long line of cbi/sbi/sbrs/rjmp's, but in the end may be the quickest and possibly easiest. The advantage of dealing with a bit/pin at a time is you can easily change pins without having to change the bit patterns wherever they are coming from.

 

Since this is most likely just setting led patterns, time is probably not an issue in any case, and the slowest code you could come up with would probably work just fine. The mcu will probably still be in some loop doing nothing for 95% of the time.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am using atmega328p with 2 pca9685 chips over I2C. So now I have 32 pwm pins to control. I thought it's better to use a bit for an led (1 on) (0 off) instead of using a byte for one led. This loop is in a function with takes 1 uint32_t data which has 1's and 0's for 32 LEDs and further arguments for Start pwm value to end pwm value for each bit. A lot of data I don't know how to handle. Trying to save space as much as possible to store led patterns. I guess I can use 4 uint8_t instead of 1 uint32_t.

I would possibly make a windows applications where I can make patterns which come out like this.

 

1100000000000000000000000000011
0110000000000000000000000000110
0011000000000000000000000001100
0001100000000000000000000011000
0000110000000000000000000110000
0000011000000000000000001100000
0000001100000000000000011000000
0000000110000000000000110000000
0000000011000000000001100000000
0000000001100000000011000000000
0000000000110000000110000000000
0000000000011000001100000000000
0000000000110000000110000000000
0000000001100000000011000000000
0000000011000000000001100000000
0000000110000000000000110000000
0000001100000000000000011000000
0000011000000000000000001100000
0000110000000000000000000110000
0001100000000000000000000011000
0011000000000000000000000001100
0110000000000000000000000000110
1100000000000000000000000000011
1111111111111111111111111111111

So I can just copy and paste it into my code to make very long running different patterns easily. My approach seems very inefficient but I don't how to make it easy.

Last Edited: Sun. Dec 1, 2019 - 12:56 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I find this a particularly interesting "optimization."

The original bitfield/macro-based code lets you check any one bit of 32, very quickly.

Doing something with ALL the bits, one at a time, is a different optimization problem.

Finding "some" one bits is a 32bit string is yet a different problem.

And it tends to vary based on CPU architecture.  Any of the three problems I mentioned would be (best) done on an ARM CPU quite differently, because it's able to shift any number of bits in a single cycle.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>A lot of data I don't know how to handle.

 

Then backup and figure that out first. You have a 12bit pwm value, and actually 32bits that go out for each led (on pwm value, off pwm value). There is also a 13th bit for always on/off also (I only quickly glanced at the datasheet). Many of the bits that go out can probably be fixed values, and the only thing really needed is a pwm value of up to 12bits. It may also be that registers that do not need setting (like the led on pwm value for example) can be skipped but may or may not make sense to do that depending on what the cost is in i2c to skip an address instead of auto increment, etc.

 

First do you need both pwm values? or can you use just the off value (and on always set to 0)? Do you need all 12 bits of pwm, or can you do with less? Do you even use the pwm feature? Do you need to have an on/off value when you already have the pwm value? (either pwm value can just be 0, or can set the off bit when see a pwm of 0) Where is this data coming from? Stored on mcu flash? or coming from some external source?

 

Seems like a number of decisions need to be made, and maybe they are already made and the original question is still mostly valid. Maybe you just want on/off with a common pwm value for all led's? If you want 8bits of pwm for each led (lets say), then that is a byte per led per frame (32 bytes/frame), and a mega328 has 32k flash so can hold 1024 frames less whatever your code needs, so maybe ~800 frames of data. You are then using a single byte per led (pwm value) and no longer need the bit stuff.

 

The above is just mostly thinking out loud, and I'm not an expert on your i2c chip or anything else.

Last Edited: Sun. Dec 1, 2019 - 05:59 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I haven't tried building and examining assembler from all these different ways of doing it, but I I think it will largely depend on what the main usage is.

If the main usage is going to be setting/clearig individual bits one by one, then treating it as uin32_t like

void set_bit32(uint32_t *x, uint8_t bit)

{

    *x |= 1UL << bit;

}

is going to be relatively expensive like clawson said.

The compiler probably constructs the mask by looping around left shifting (or it might use add since you're just doubling the mask each time), so for a high bit num like 31 you do a lot of loops.

In this case, using curtvm method of breaking into 8bit chunks could well be faster (it relies on the endianess being correct but not a big deal since you probably aren't going to use this on other processors).

 

If, however, the main usage is going to be asssign a 32bit value pattern and then loop over all bits, then essentially you need to create a mask for all bit positions anyway.

So as long as you can convince the compiler to create the mask in a sensible manner (ie. just do one shift or add to the mask per iteration),  I think treating as a uint32_t would be pretty quick

eg.

 

uint32_t pattern = some 32 bit value;

uint8_t bit = 0;

for (uint32_t mask = 1; mask != 0; mask <<=1)

{

    if (pattern & mask)

    {

        some_function(bit);

    }

    bit++;

}

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>I haven't tried building and examining assembler

 

You can go to the link in #7 or #16 and see it for yourself, its already done for you. Or can change to test something new.

 

I'm guessing from the latest post, bit testing may be yesterdays news.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

curtvm wrote:
You can go to the link in #7 or #16 and see it for yourself, its already done for you.

Yes, but that's the easy bit, and I can do that locally without having to go to some website and enable a bunch of javascript.

It's quite a neat site though to be fair.

The harder bit is taking the time to work out, in each case, what the assembler is doing, how many times will it loop around, how many cycles is that, how much difference does it make if  it inlines the function compared to if it doesn't inline the function, is it affected by optimisation in a trivial test program that wouldn't apply in the general case etc.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>and I can do that locally without having to go to some website and enable a bunch of javascript.

 

Its a good place to test out ideas without having to clutter up the local pc with various tests. Its not going to be the final answer, but it can point you in the right direction. Its also a good place to send others for simple examples that can be shared and modified so someone interested does not have to copy/paste code in a post, then go through the work of getting the local compiler to work on the example.

 

Compiler Explorer can also be installed locally, and you can use your own compilers. I imagine ide's will eventually be doing something like this as its quite handy to see the compiler produce the asm output and show you errors as you type in code.

 

There is a utube video I saw recently where Matt Godbolt gave a talk at cppcon 2019 (something like that) about why/how he created Compiler Explorer.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The direct way :

    for(uint8_t count=0; count < 32; count++){
        if( data & (1UL<<count)) 
          PORTB = count;
    }
 

This one would make a rather slow code for worst case (237clk), and rather big. It's because it insist on making complete 32bit calculations.  

 

00000063  LDI R24,0x00        Load immediate 
00000064  LDI R25,0x00        Load immediate 
        if( data & (1UL<<count)) 
00000065  LDS R20,0x0100        Load direct from data space 
00000067  LDS R21,0x0101        Load direct from data space 
00000069  LDS R22,0x0102        Load direct from data space 
0000006B  LDS R23,0x0103        Load direct from data space 
0000006D  MOV R0,R24        Copy register 
0000006E  RJMP PC+0x0005        Relative jump 
0000006F  LSR R23        Logical shift right 
00000070  ROR R22        Rotate right through carry 
00000071  ROR R21        Rotate right through carry 
00000072  ROR R20        Rotate right through carry 
00000073  DEC R0        Decrement 
00000074  BRPL PC-0x05        Branch if plus 
00000075  SBRC R20,0        Skip if bit in register cleared 
          PORTB = count;
00000076  OUT 0x05,R24        Out to I/O location 
00000077  ADIW R24,0x01        Add immediate to word 
    for(uint8_t count=0; count < 32; count++){
00000078  CPI R24,0x20        Compare with immediate 
00000079  CPC R25,R1        Compare with carry 
0000007A  BRNE PC-0x15        Branch if not equal 
 

for some reason it make count 16bit! 

 

add

For those that isn't used to the compiler output I guess I have to add that the compiler don't shift 1 up but rather shift the a copy of data down, and then check LSB. 

It give the correct result but is far from optimal on a AVR (I think it's because the new compiler frontends are made for ARM, I think the old AVR compiler would have made a better code)

 

 

Last Edited: Sun. Dec 1, 2019 - 04:13 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I just read #19 and it confused me, do you use PWM or just on/off ? (and don't use colour or it's set somewhere else?)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

From #26

sparrow2 wrote:
This one would make a rather slow code for worst case (237clk)

Yes, if I read the assembler correctly, within each iteration of the for loop it is looping around count times, so effectively 'recreating' the mask on each iteration of the for loop.

You want it to 'remember' the mask from one iteration to the next, so then it just has to double the mask for each iteration. Yes, the mask is a 32 bit value so it has to do 4 shifts (or 4 adds) to double it, but that's not too bad.

 

For what it's worth, this code

int main(void)
{
    uint32_t data = get_32();
    uint8_t bit = 0;
    for (uint32_t mask = 1; mask != 0; mask <<= 1)
    {
        if (data & mask)
        {
            my_bit_func(bit);
        }
        bit++;
    }

    while (1);
}

produces

00000098 <main>:
  98:	0e 94 40 00 	call	0x80	; 0x80 <get_32>
  9c:	4b 01       	movw	r8, r22
  9e:	5c 01       	movw	r10, r24
  a0:	c1 2c       	mov	r12, r1
  a2:	d1 2c       	mov	r13, r1
  a4:	76 01       	movw	r14, r12
  a6:	c3 94       	inc	r12
  a8:	c0 e0       	ldi	r28, 0x00	; 0
  aa:	d7 01       	movw	r26, r14
  ac:	c6 01       	movw	r24, r12
  ae:	88 21       	and	r24, r8
  b0:	99 21       	and	r25, r9
  b2:	aa 21       	and	r26, r10
  b4:	bb 21       	and	r27, r11
  b6:	89 2b       	or	r24, r25
  b8:	8a 2b       	or	r24, r26
  ba:	8b 2b       	or	r24, r27
  bc:	19 f0       	breq	.+6      	; 0xc4 <main+0x2c>
  be:	8c 2f       	mov	r24, r28
  c0:	0e 94 4a 00 	call	0x94	; 0x94 <my_bit_func>
  c4:	cf 5f       	subi	r28, 0xFF	; 255
  c6:	cc 0c       	add	r12, r12
  c8:	dd 1c       	adc	r13, r13
  ca:	ee 1c       	adc	r14, r14
  cc:	ff 1c       	adc	r15, r15
  ce:	c0 32       	cpi	r28, 0x20	; 32
  d0:	61 f7       	brne	.-40     	; 0xaa <main+0x12>
  d2:	ff cf       	rjmp	.-2      	; 0xd2 <main+0x3a>

The loop starts at 0xaa, and there is at least no looping within the loop this time.

Last Edited: Sun. Dec 1, 2019 - 05:09 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But this is not the idea with the code.

It should cover a random bit, so optimise a loop for each bit don't help here! 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

But this is not the idea with the code.

It should cover a random bit, so optimise a loop for each bit don't help here! 

Well, the title of the thread is "How to check all bits in uint32_t" and the very first post shows looping over all 32 bits!

So who knows.

I just thought it's intersting to look at different ways of doing things.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Perhaps I should stay out.

I just laughs (or rather cry) about how stupid C compilers "still" are.

 

I could come up with 10 different ways to do this in ASM with a worst case of less than 20clk.

Last Edited: Sun. Dec 1, 2019 - 09:13 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

For setting individual bits, if you can afford 8 byes of ram for array of 8bit masks, then using curtvm method but with 1u << bit%8  replaced with mask8[bit%8], that's about the best I can come up with.

 

static const uint8_t mask8[] = {1,2,4,8,16,32,64,128};

void setbit32(uint32_t *x, uint8_t b)
{
    /* assumes little endian */
    *((uint8_t *)x + b/8) |= mask8[b%8];
}

The compiler I have here gives

000000aa <setbit32>:
  aa:	26 2f       	mov	r18, r22
  ac:	26 95       	lsr	r18
  ae:	26 95       	lsr	r18
  b0:	26 95       	lsr	r18
  b2:	dc 01       	movw	r26, r24
  b4:	a2 0f       	add	r26, r18
  b6:	b1 1d       	adc	r27, r1
  b8:	67 70       	andi	r22, 0x07	; 7
  ba:	e6 2f       	mov	r30, r22
  bc:	f0 e0       	ldi	r31, 0x00	; 0
  be:	e0 50       	subi	r30, 0x00	; 0
  c0:	ff 4f       	sbci	r31, 0xFF	; 255
  c2:	8c 91       	ld	r24, X
  c4:	90 81       	ld	r25, Z
  c6:	98 2b       	or	r25, r24
  c8:	9c 93       	st	X, r25
  ca:	08 95       	ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

That is actually a good compiler output. And it will also work up to 32 byte (256bits).

 

From asm point of view it use a LOT of registers (8), so in real code I fear that it will end up costing some ekstra push and pop's !  

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
I just laughs (or rather cry) about how stupid C compilers "still" are.

Sure, you can still find corner cases where finely-honed assembler beats 'C'.

 

EDIT: See westfw's comment in #20 - are you optimising for just one case, or being general for any case ... ?

 

But it will certainly require that the programmer has a solid grasp of the basics of bit manipulation - shifts, masks, etc - without that, they're not going to do it in any language!

 

I could come up with 10 different ways to do this in ASM with a worst case of less than 20clk.

This is probably a good example of where writing the function in assembler and calling it from 'C' would be the way to go ...

 

 

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
Last Edited: Mon. Dec 2, 2019 - 06:37 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sure, you can still find corner cases where finely-honed assembler beats 'C'

LOL this thread show how much fiddling people are willing to do with C instead of just write it direct in ASM.  

 

Any one that know about bit programming (which also is needed in C), could have made the code in #32 if you know the AVR instruction set, faster that go thru this thread, and know how good or bad the result would be. Instead of writing all kind of odd C programs to see if the result is better. And at the end you find a nice solution for a short program, just to find out that all the saved clk's cost more in push and pop's i a real program.

 

    This is probably a good example of where writing the function in assembler and calling it from 'C' would be the way to go ...

 

I don't think so, a good ASM code for bit fiddling will be structured different, and not easily interfaced with C. I do believe that the code in #32 is close to optimal for a C compiler.

 

 

EDIT: See westfw's comment in #20 - are you optimising for just one case, or being general for any case ... ?  

In this thread I talk about OP's question, from #1

And it would be stupid to believe that deal with individual check of 32 bit's can be done in 20 clk.

 

I'm a bit surprised that no body have suggested to place the 32 bit variable in registers :)  

 

The real question is the structure of the hole program, and for this thread is how to fetch the 32 bit patterns , and how to shuffle the information to the driver chips, and perhaps you wait for the I2C communication to end, so the time for this don't really matter.

 

OP haven't answered my question in #27

   

 

  

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
And at the end you find a nice solution for a short program, just to find out that all the saved clk's cost more in push and pop's i a real program.

The example in #32 doesn't need any push/pop since it uses only call_used registers (I'm talking about  as part of a C program).

Or are you referring to the calling code?

 

sparrow2 wrote:

I'm a bit surprised that no body have suggested to place the 32 bit variable in registers :)  

The example from #28 has the 32bit value entirely in registers, r22-25, which is then copied to r8-r11.

You maybe can avoid the copy, but either way it is entirely in registers.

 

I'm not interested in saving a handful of clock cycles.

What I did find interesting was recognising that with uint32_t data you could have about an order of magnitude difference between

data |= 1uL << bit

vs writing it (yes still as C code) slightly differently.

It's easy to forget the limitations sometimes.

Compare to for example ARM instruction set where you get shifts for free.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

MrKendo wrote:
Compare to for example ARM instruction set where you get shifts for free.

And the ARM, being a 32-bit (as least) processor, gives you handling of 32-bit data for free - clearly, it's going to be (a lot of) extra work for any 8-bit processor!

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This is not a case for assembly.

When the primary limit is clarity,

C is pretty much always a better choice than assembly.

OP has marked #4 as the solution.

It is good enough.

OP's next problem is avoiding a massive switch in "Do something".

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Reading through all your posts makes me wonder how little I know C, here I am after how to read and write bits to 32 bit integer to save clk's but I might be wasting a lot of them somewhere else, giant switch , if else statements rendering this particular optimization barely any good in the big picture. frown

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>This loop is in a function with takes 1 uint32_t data which has 1's and 0's for 32 LEDs and further arguments for Start pwm value to end pwm value for each bit. A lot of data I don't know how to handle.

 

> but I might be wasting a lot of them somewhere else, giant switch , if else statements 

 

Instead of now heading off in another direction, why not decide what exactly you want to do first. See #21.

 

If you want pwm data for each bit, there is no need for bit testing as the pwm value has all the info and there is most likely no need for any big switch/if either. You would simply be sending out pwm data from some source-

 

uint8_t myData[16384]; //flash, external, wherever led data comes from

void sendLedHFrame(uint8_t* data){

    for(uint8_t i = 0; i < 16; i++){

        //do something with data[i] (pwm value)

        //maybe 8bit pwm value to 12 bit pwm off register, with on register at 0

        //like i2c send- 0 0 data[i]>>4 data[i]<<4

    }

}

void sendLedFrame(uint8_t* data){

    //I2C start, ic1 led0-15

    sendLedHFrame(data);

    //ic2 finish

    //I2C start, ic2 led16-31

    sendLedHFrame(data+16);

    //i2c finish

}

 

int main(){

    //setup

    for( int idx = 0;;){

        sendLedFrame(&myData[idx]);

        _delay_ms(50); //whatever frame rate wanted

        idx += 32;

        if( idx > sizeof(myData) ) idx = 0;

    }

}

 

 

Last Edited: Tue. Dec 3, 2019 - 02:34 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

  

Last Edited: Wed. Dec 4, 2019 - 01:51 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Replied above post.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

‘A little slow’ - what is the time it takes currently and what time is your target? Use the atmel studio simulator to measure the time. Find out where the time is being spent. This information may guide you to where to optimise.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry I asked.

 

If I was doing something like this, I guess i would see if simply dumping data via i2c could be done something like my previous example. If I could store lets say 640 frames of data in flash (32 bytes of 8bit pwm data, per frame), and I ran at 10fps, that would be 64 seconds of display. One minute of time divided up into 4 'movies' would seem to be enough assuming these are mostly repetitive type things. Using the 8 bits of pwm means you can do all the fades you want right in the raw data, and if there is plenty of room in flash there would probably be no reason to do otherwise.

 

Then your main problem is just producing the data, which has to be done in any case. The mcu now just has to dump data to the pwm ic's, and and speed problems are mostly down to i2c speed and what frame rate you want or is acceptable.

Last Edited: Tue. Dec 3, 2019 - 06:21 AM