Speeding up C code!

Go To Last Post
26 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok, so the time has finally come for me to write my first C program which is optimized for as much speed as possible.

I am trying to use an attiny13 driven at 14MHz (but with some hacking on the rest of my hardware I COULD get a 20MHZ signal to it....) to emulate a shift register which is being driven quite fast (about 500ns pulse width).

The master sends out a LATCH pulse to signify that it wants data, followed by 16 CLK pulses. The tiny should present a new bit on the rising edge of each CLK pulse.

Here's my first attempt at code, but my logic analyzer says that the tiny is not responding fast enough....any suggestions as to how I could speed it up would be greatly appreciated!

#include 
#include 
#include 
#include 

#define LATCH PB2
#define CLK PB1
#define DATA PB4

#define setPinForOutput(pin) (DDRB |= byte(pin))
#define setPin(pin) (PORTB |= byte(pin))
#define clearPin(pin) (PORTB &= ~(byte(pin)))

#define testPin(bit) (PINB & (1 << bit))

#define byte(bit) (1 << bit) // byte with bit set

int main(void)
{

	int writing;
	int dataIndex;
	int keyData[16] = {0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1};  //Some test data
	
	
	writing = 0;
	dataIndex = 0;
	
	//set clock divider to /1
	CLKPR = (1 << CLKPCE);
	CLKPR = (0 << CLKPS3) | (0 << CLKPS2) | (0 << CLKPS1) | (0 << CLKPS0);
	
	setPinForOutput(DATA);
	clearPin(DATA);
	
	for(;;)
	{
		
		if(testPin(LATCH))
		{
			setPin(DATA);
			
			writing = 1;
			
			switch(keyData[dataIndex])
			{
					
				case 0:
					clearPin(DATA);
					break;
					
				case 1:
					setPin(DATA);
					break;
					
			}

			dataIndex++;
			
		}
		
		if(testPin(CLK) && writing == 1)
		{

			switch(keyData[dataIndex])
			{
					
				case 0:
					clearPin(DATA);
					break;
					
				case 1:
					setPin(DATA);
					break;
					
			}
			
			dataIndex++;
		}
			
		if (dataIndex >= 16)
		{
			//not sure if i have time to read the last button state here.
			clearPin(DATA);
			writing = 0;
			dataIndex = 0;
		}
		
		
    }
    return 0;   /* never reached */
}
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well start with this and try to spot any opcode sequences that look cumbersome or "heavy":

int main(void) 
{ 
  48:	df 93       	push	r29
  4a:	cf 93       	push	r28
  4c:	cd b7       	in	r28, 0x3d	; 61
  4e:	de b7       	in	r29, 0x3e	; 62
  50:	a0 97       	sbiw	r28, 0x20	; 32
  52:	0f b6       	in	r0, 0x3f	; 63
  54:	f8 94       	cli
  56:	de bf       	out	0x3e, r29	; 62
  58:	0f be       	out	0x3f, r0	; 63
  5a:	cd bf       	out	0x3d, r28	; 61

   int writing; 
   int dataIndex; 
   int keyData[16] = {0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1};  //Some test data 
  5c:	fe 01       	movw	r30, r28
  5e:	31 96       	adiw	r30, 0x01	; 1
  60:	80 e2       	ldi	r24, 0x20	; 32
  62:	df 01       	movw	r26, r30
  64:	1d 92       	st	X+, r1
  66:	8a 95       	dec	r24
  68:	e9 f7       	brne	.-6      	; 0x64 
  6a:	81 e0       	ldi	r24, 0x01	; 1
  6c:	90 e0       	ldi	r25, 0x00	; 0
  6e:	9c 83       	std	Y+4, r25	; 0x04
  70:	8b 83       	std	Y+3, r24	; 0x03
  72:	98 a3       	std	Y+32, r25	; 0x20
  74:	8f 8f       	std	Y+31, r24	; 0x1f
    
   writing = 0; 
   dataIndex = 0; 
    
   //set clock divider to /1 
   CLKPR = (1 << CLKPCE); 
  76:	80 e8       	ldi	r24, 0x80	; 128
  78:	86 bd       	out	0x26, r24	; 38
   CLKPR = (0 << CLKPS3) | (0 << CLKPS2) | (0 << CLKPS1) | (0 << CLKPS0); 
  7a:	16 bc       	out	0x26, r1	; 38
    
   setPinForOutput(DATA); 
  7c:	bc 9a       	sbi	0x17, 4	; 23
   clearPin(DATA); 
  7e:	c4 98       	cbi	0x18, 4	; 24
  80:	20 e0       	ldi	r18, 0x00	; 0
  82:	30 e0       	ldi	r19, 0x00	; 0
  84:	80 e0       	ldi	r24, 0x00	; 0
  86:	90 e0       	ldi	r25, 0x00	; 0
      { 
         setPin(DATA); 
          
         writing = 1; 
          
         switch(keyData[dataIndex]) 
  88:	af 01       	movw	r20, r30
   clearPin(DATA); 
    
   for(;;) 
   { 
       
      if(testPin(LATCH)) 
  8a:	b2 9b       	sbis	0x16, 2	; 22
  8c:	14 c0       	rjmp	.+40     	; 0xb6 <__stack+0x17>
      { 
         setPin(DATA); 
  8e:	c4 9a       	sbi	0x18, 4	; 24
          
         writing = 1; 
          
         switch(keyData[dataIndex]) 
  90:	fc 01       	movw	r30, r24
  92:	ee 0f       	add	r30, r30
  94:	ff 1f       	adc	r31, r31
  96:	e4 0f       	add	r30, r20
  98:	f5 1f       	adc	r31, r21
  9a:	01 90       	ld	r0, Z+
  9c:	f0 81       	ld	r31, Z
  9e:	e0 2d       	mov	r30, r0
  a0:	30 97       	sbiw	r30, 0x00	; 0
  a2:	19 f0       	breq	.+6      	; 0xaa <__stack+0xb>
  a4:	31 97       	sbiw	r30, 0x01	; 1
  a6:	21 f4       	brne	.+8      	; 0xb0 <__stack+0x11>
  a8:	02 c0       	rjmp	.+4      	; 0xae <__stack+0xf>
         { 
                
            case 0: 
               clearPin(DATA); 
  aa:	c4 98       	cbi	0x18, 4	; 24
  ac:	01 c0       	rjmp	.+2      	; 0xb0 <__stack+0x11>
               break; 
                
            case 1: 
               setPin(DATA); 
  ae:	c4 9a       	sbi	0x18, 4	; 24
               break; 
                
         } 

         dataIndex++; 
  b0:	01 96       	adiw	r24, 0x01	; 1
  b2:	21 e0       	ldi	r18, 0x01	; 1
  b4:	30 e0       	ldi	r19, 0x00	; 0
          
      } 
       
      if(testPin(CLK) && writing == 1) 
  b6:	b1 9b       	sbis	0x16, 1	; 22
  b8:	14 c0       	rjmp	.+40     	; 0xe2 <__stack+0x43>
  ba:	21 30       	cpi	r18, 0x01	; 1
  bc:	31 05       	cpc	r19, r1
  be:	89 f4       	brne	.+34     	; 0xe2 <__stack+0x43>
      { 

         switch(keyData[dataIndex]) 
  c0:	fc 01       	movw	r30, r24
  c2:	ee 0f       	add	r30, r30
  c4:	ff 1f       	adc	r31, r31
  c6:	e4 0f       	add	r30, r20
  c8:	f5 1f       	adc	r31, r21
  ca:	01 90       	ld	r0, Z+
  cc:	f0 81       	ld	r31, Z
  ce:	e0 2d       	mov	r30, r0
  d0:	30 97       	sbiw	r30, 0x00	; 0
  d2:	19 f0       	breq	.+6      	; 0xda <__stack+0x3b>
  d4:	31 97       	sbiw	r30, 0x01	; 1
  d6:	21 f4       	brne	.+8      	; 0xe0 <__stack+0x41>
  d8:	02 c0       	rjmp	.+4      	; 0xde <__stack+0x3f>
         { 
                
            case 0: 
               clearPin(DATA); 
  da:	c4 98       	cbi	0x18, 4	; 24
  dc:	01 c0       	rjmp	.+2      	; 0xe0 <__stack+0x41>
               break; 
                
            case 1: 
               setPin(DATA); 
  de:	c4 9a       	sbi	0x18, 4	; 24
               break; 
                
         } 
          
         dataIndex++; 
  e0:	01 96       	adiw	r24, 0x01	; 1
      } 
          
      if (dataIndex >= 16) 
  e2:	80 31       	cpi	r24, 0x10	; 16
  e4:	91 05       	cpc	r25, r1
  e6:	8c f2       	brlt	.-94     	; 0x8a 
      { 
         //not sure if i have time to read the last button state here. 
         clearPin(DATA); 
  e8:	c4 98       	cbi	0x18, 4	; 24
  ea:	20 e0       	ldi	r18, 0x00	; 0
  ec:	30 e0       	ldi	r19, 0x00	; 0
  ee:	80 e0       	ldi	r24, 0x00	; 0
  f0:	90 e0       	ldi	r25, 0x00	; 0
  f2:	cb cf       	rjmp	.-106    	; 0x8a 

For example you twice use:

switch(keyData[dataIndex])

and the genrated case:'s look about as tight as they could possibly be but look at all that overhead in calculating which case: to vector to. Is there a way to do that in a quicker/tighter way?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

optimized for as much speed as possible.

Quote:

present a new bit on the rising edge of each CLK pulse.

But your code, running "as fast as possible", doesn't look for a rising edge but rather a high state and thus would "clock out" n bits while the CLK is high...

So you are simulating a shift register...

For any of these "as fast as possible" cases, let's quantify what that means. What is the fastest CLK rate? What is your AVR clock speed?

If things are slow enough that you can get the 2nd byte loaded then SPI slave will be the fastest--from the description it appears LATCH is SS (or /SS).

In the real world you'd want some kind of a timeout counter in case a CLK is missed, or you'd be out of whack forever.

In this case, you've run into one of the limitations of an AVR--there is no convenient sequence to conditionally set an I/O bit to high/low. Register bit--yes. (BLD/BST) I/O bit--no.

So to drive a shift register as the master, I have something like this for a 75HC595:

  	for (looper = 0; looper < SPI_OUTPUTS; looper++)
  		{
		SPI_MOSI = (work & 256) ? 1 : 0;// present the bit to the data line
		SPI_SCK = 1;					// raise the clock to present the bit to the line
		work <<= 1;						// prepare next bit & take some time
		SPI_SCK = 0;					// drop the clock for the next go-round
  		}

which with my tool chain (unoptimized--I'm not on a quest for absolute speed) generates:

                 ;    8875   	for (looper = 0; looper < SPI_OUTPUTS; looper++)
002544 e020      	LDI  R18,LOW(0)
                 _0x3D2:
002545 3028      	CPI  R18,8
002546 f480      	BRSH _0x3D3
                 ;    8876   		{
                 ;    8877 		SPI_MOSI = (work & 256) ? 1 : 0;// present the bit to the data line
002547 ff10      	SBRS R17,0
002548 c002      	RJMP _0x3D4
002549 e0e1      	LDI  R30,LOW(1)
00254a c001      	RJMP _0x3D5
                 _0x3D4:
00254b e0e0      	LDI  R30,LOW(0)
                 _0x3D5:
00254c 30e0      	CPI  R30,0
00254d f411      	BRNE _0x3D7
00254e 98c5      	CBI  0x18,5
00254f c001      	RJMP _0x3D8
                 _0x3D7:
002550 9ac5      	SBI  0x18,5
                 _0x3D8:
                 ;    8878 		SPI_SCK = 1;					// raise the clock to present the bit to the line
002551 9ac7      	SBI  0x18,7
                 ;    8879 		work <<= 1;						// prepare next bit & take some time
002552 0f00      	LSL  R16
002553 1f11      	ROL  R17
                 ;    8880 		SPI_SCK = 0;					// drop the clock for the next go-round
002554 98c7      	CBI  0x18,7
                 ;    8881   		}
002555 5f2f      	SUBI R18,-1
002556 cfee      	RJMP _0x3D2
                 _0x3D3:

If you present the bit on the CLK rising edge, it is probably actually sampled on the falling edge. So we can't get ahead of ourselves, either.

If the CLK frequency is, say, 1/10 of AVR frequency we have 10 clocks for the whole shebang. [Another parameter to consider is the time from LATCH to first CLK--if that is short then you have to be a Boy Scout and Be Prepared.

And is it LSB first or MSB first?

And does the CLK idle high or low?

Let's do the obvious, with LSB first. Put 16 bits into a register variable, explicitly or implicitly. Depending on your compiler, the mask test may be best done with an 8-bit cast. Absolute speed will be better with unrolled loop. Skip those for now. But I guar-n-tee this will be better than what you posted. ;)

register unsigned int frog; // hopefully register-based 16-bit variable
// Load the data byte
frog = thisdata;
...
... wait for latch
...
for (looper = 0; looper < NUM_BITS; looper++)
   {
   while (CLK); // wait for falling edge
   while (!CLK); // wait for rising edge
   DATA = (frog & 1) ? 1 : 0;
   frog >>= 1;
   }

Then tune the above based on the answers to the questions earlier.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Oh, yeah--this ain't really about "speeding up the C code" as if it were C vs. ASM--it is picking the best algorithm. Once you have the best (or "good enough") approach--e.g., my approach is much different than yours--you should be able to match or very close with C any ASM implementation of it.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
(x & 1) ? 1 : 0

is equivalent to

x & 1

And I would do the calculations while I'm waiting instead of scrambling to do them after the rising edge arrives. So the loop becomes

for (looper = 0; looper < NUM_BITS; looper++) 
   { 
   bit = frog & 1;
   frog >>= 1;
   while (CLK); // wait for falling edge 
   while (!CLK); // wait for rising edge 
   DATA = bit; 
   } 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

Code:
(x & 1) ? 1 : 0
is equivalent to
Code:
x & 1

IME not in this case. It may perhaps be with a particular "context" of code generation.

In this particular case, experimentation showed me that in my particular context the code to move the low bit from a variable to an output port bit location was NOT as good as the code to check that bit and then either set or clear the corresponding port bit.

This may definitely be toolchain dependent. I'll wager it will also be "context dependent" on what is in registers when. I'll repeat that this is one thing the AVR instruction set doesn't help us programmers.

[In fact, the best (least cycles on modern AVRs) way may be to check the bit and then do the pin toggle if it is in the wrong state. Not tested.]

And note that in my 74HC595 example I happen to be outputting 9 bits MSb first so I'm doing the 256 mask--do your shorthand on that.

Yep, I'll defend this one until you show me the clever code sequence that beats my researched results. It probably is out there... ;)

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

And I would do the calculations while I'm waiting instead of scrambling to do them after the rising edge arrives. So the loop becomes

Again, let's see the sequences. Either one may well satisfy the requirements--we don't know what they are. What type is your "bit" variable? How (what instruction sequence) will apply the value of that variable to the port bit?

In extreme cases, the fastest is the unrolled loop and creating a mask for all the bits of the PORT ahead of time then apply the whole port value. That assumes other PORT bits are unchanging during this sequence--not an unreasonable requirement for "fastest".

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
In this case, you've run into one of the limitations of an AVR--there is no convenient sequence to conditionally set an I/O bit to high/low. Register bit--yes. (BLD/BST) I/O bit--no.

If we're talking about GPIO bits specifically I guess you could do it in four instructions:

SBRC SREG, condition
CBI reg, bit
SBRS SREG, condition
SBI reg, bit

However as it's a pretty obvious sequence I suppose it doesn't fit your definition of convenient.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

If we're talking about GPIO bits specifically I guess you could do it in four instructions:
Code:
SBRC SREG, condition
CBI reg, bit
SBRS SREG, condition
SBI reg, bit

However as it's a pretty obvious sequence I suppose it doesn't fit your definition of convenient.


LOL-- You still have to get the SREG set, and there are two conditionals in there, one which will be skipped. It would be constant-time, though. Lessee--SBRC/SBRS is only one bit but probably most tests will end up with the Z flag alone?

2 if not taken; 3 if taken. So on a Mega88 it would be 2+3+2=7 cycles. Plus the test. So about 10 cycles.

(I wonder how to coerce a C toolchain into that?)

The sequence I showed with the 9th bit is about that. For the low bit it is comparable.

Anyway, the point here (besides the optimum sequence) is that it is a fairly painful (~10 cycle) AVR operation to "copy" a register bit to an I/O bit.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
2 if not taken; 3 if taken. So on a Mega88 it would be 2+3+2=7 cycles. Plus the test. So about 10 cycles.

The skip instructions would of course be SBIC/SBIS to test SREG, but on a mega88 it costs 1 cycle if not taken and 2 if a one-word instruction is skipped so 1+2+2=5 cycles plus setup. On XMEGA it's actually cheaper to branch, so a more general sequence would in fact be

BRxx 1
CBI reg, bit
BRyy 1
SBI reg, bit

which costs 4 cycles on XMEGA and ATtiny and 5 cycles on the rest. Feels less clever though.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hmmm--the rough C equivalent is

if ((unsigned char)regvar & 1) setiobit(n);
if (!(unsigned char)regvar & 1) cleariobit(n);

A clever compiler may determine that it doesn't need to do the second test. Even with the second test it might be better than my posted above with the RJMPs...checking...

                 ; 0000 0012 if ((unsigned char)frog & 1) PORTD |= 1;
00007e fd00      	SBRC R16,0
00007f 9a58      	SBI  0xB,0
                 ; 0000 0013 if (((unsigned char)frog & 1)==0) PORTD &= ~1;
000080 ff00      	SBRS R16,0
000081 9858      	CBI  0xB,0

Predicating the data in a register, explicitly or implictly, it is indeed SBRS/SBRC with the corresponding cycle counts. I think that comes out the same as the test, and then the SBIC/SBIS.

So actually there is a fairly short cycle to "move a bit in a register to an I/O bit" on an AVR. And the C compiler can (perhaps not in all contexts) be coerced into generating the desired sequence. And it should be constant time, right?

Bit set: SBRC 1 + SBI 2 + SBRS 2 = 5.
Bit clear: SBRC 2 + SBRS 1 + CBI 2 = 5.

So, C code can be sped up with the help of the 'Freaks. the point, tho, as I mentioned above is it ain't necessarily a C problem but rather picking the best (or good enough) approach.

In this particular case if this little transfer loop is critical enough it can be unrolled with cut-past of a small inline ASM snippet.

Thanks all for finding the better mousetrap than my conditional statement. I tried if/else but can't remember ever trying both if's as we've come up with above. 5 cycles constant ain't bad. ;)

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

When performing shifts, remember to explicitly cast to 8-bits (if 8bits is your desired result), see:
https://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&p=103876

So instead of

define testPin(bit) (PINB & (1 << bit)) 

I do this:

#define testbit(port, bit) (uint8_t)(((uint8_t)port & (uint8_t)_BV(bit)))

Also use

#include 

to be explicit on the size of your variables.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

ganzziani wrote:
When performing shifts, remember to explicitly cast to 8-bits

In your example type promotions should not matter. Even if both the mask and the port bit values are sign extended you will get the same (zero, non-zero) result, i.e. (0x80 & 0x80) has the same truth value as (0xFF80 & 0xFF80).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

anders_m wrote:
you will get the same (zero, non-zero) result
I didn't say you wouldn't get the same result. The only thing would be that the compiler might generate extra code to create a 16-bit result.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

ganzziani wrote:
I didn't say you wouldn't get the same result. The only thing would be that the compiler might generate extra code to create a 16-bit result.

The C language standard mandates that the operands to the shift operator are promoted to ints, and extra casts can't change that. Testing your pin test macro versus one where the superfluous casts have been removed in fact shows that the version with casts produces worse code!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
#define testPin1(port, bit) (port & (1 << bit))
#define testPin2(port, bit) (uint8_t)(((uint8_t)port & (uint8_t)_BV(bit))) 

yields:

	a = testPin1(PINB, b);
  92:	26 b3       	in	r18, 0x16	; 22
  94:	81 e0       	ldi	r24, 0x01	; 1
  96:	90 e0       	ldi	r25, 0x00	; 0
  98:	00 90 68 00 	lds	r0, 0x0068
  9c:	02 c0       	rjmp	.+4      	; 0xa2 
  9e:	88 0f       	add	r24, r24
  a0:	99 1f       	adc	r25, r25
  a2:	0a 94       	dec	r0
  a4:	e2 f7       	brpl	.-8      	; 0x9e 
  a6:	28 23       	and	r18, r24
  a8:	20 93 74 00 	sts	0x0074, r18

and

	a = testPin2(PINB, b);
  92:	26 b3       	in	r18, 0x16	; 22
  94:	81 e0       	ldi	r24, 0x01	; 1
  96:	90 e0       	ldi	r25, 0x00	; 0
  98:	00 90 68 00 	lds	r0, 0x0068
  9c:	02 c0       	rjmp	.+4      	; 0xa2 
  9e:	88 0f       	add	r24, r24
  a0:	99 1f       	adc	r25, r25
  a2:	0a 94       	dec	r0
  a4:	e2 f7       	brpl	.-8      	; 0x9e 
  a6:	28 23       	and	r18, r24
  a8:	20 93 74 00 	sts	0x0074, r18

which turns out to be like one of those "spot the difference" competitions ;-)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My test was of the form "if (testPin(..))", built with the latest release of WinAvr for ATmega88 with -Os and the with-cast version was similar to what you got but the other version was transformed into a slightly more efficient right-shift. In this case the with-cast version was only one instruction longer, but when used inside a slightly more complex function the difference grew to some 3-4 instructions.

Certainly not a major difference, but a difference nevertheless.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This is fast enough to keep up if F_CPU=14MHz
and clk lo time=clk hi time=500ns=7 cycles.

data ^= data>>1;
PORTB &= ~mask;
while(latch_pin) {}
for(uint8_t j=16; j; --j) {
    while(clk_pin) {}    // 3-cycle loop
    // 2:5 cycles after clk_pin lo
    uint8_t hi=data>>8;  // 1 cycle, 0 in asm
    data<<=1;            // 2 cycles, 1 if unrolled
    while(!clk_pin) {}   // 3-cycle loop
    // 2:5 cycles after clk_pin hi
    if(hi & 0x80) PINB=mask; // 2 cycles
    // 3 cycles to dec and branch, 0 if unrolled
} // j

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry, skeeve. Close, though (I mentioned the PINx above but I think the 5-cycle fixed will beat it, 'cause a conditional is needed.)

With your solution, it toggles the pin when the target bit is on. If it is off, it doesn't set it back to low as needed. In fact, two high bits in a row will result in a low. I think the PINx toggle method needs to read PORTx and apply logic to decide whether to toggle or not.

Note that depending on he context the compiler may well be able to "handle" figgerring out the part about "hi" but maybe not in all cases.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
Sorry, skeeve. Close, though (I mentioned the PINx above but I think the 5-cycle fixed will beat it, 'cause a conditional is needed.)
I think that 2 cycles beats 5 cycles.
Quote:
With your solution, it toggles the pin when the target bit is on. If it is off, it doesn't set it back to low as needed. In fact, two high bits in a row will result in a low. I think the PINx toggle method needs to read PORTx and apply logic to decide whether to toggle or not.
Note the first line. Apparently it could use a comment.
Quote:
Note that depending on he context the compiler may well be able to "handle" figgerring out the part about "hi" but maybe not in all cases.
I would hope that uint8_t hi=data>>8 would
be a single MOV instruction in most cases.
It wouldn't be needed at all in assembly because
one could deal with the carry flag explicitly.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I think that 2 cycles beats 5 cycles.

Huh? Either I'm right, or you are WAY too clever for me.

How can you do ANY toggle without knowing if it is wanted?

[...busy working out the possibilities with data == 0xf055 ...]

The light is beginning to dawn--very clever. I don't see what "carry" has to do with it; the previous examples predicated/assumed "data" is in registers, and probably SBRx would be used.

Re: 5 cycles vs. 2: True enough in this repetitive operation. In the general/inline/arbitrary/single operation the test for the possible toggle may take more than the straight 5 cycles. maybe not.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
Quote:

I think that 2 cycles beats 5 cycles.
Huh? Either I'm right, or you are WAY too clever for me.

How can you do ANY toggle without knowing if it is wanted?

The data in the high bit of hi is the answer to "do I need to toggle?".
Quote:
[...busy working out the possibilities with data == 0xf055 ...]

The light is beginning to dawn--very clever. I don't see what "carry" has to do with it; the previous examples predicated/assumed "data" is in registers, and probably SBRx would be used.

In assembly, I wouldn't need hi at all.
A BRCC would do.
The intervening instructions would not affect the carry flag.
Quote:
Re: 5 cycles vs. 2: True enough in this repetitive operation. In the general/inline/arbitrary/single operation the test for the possible toggle may take more than the straight 5 cycles. maybe not.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

In assembly, I wouldn't need hi at all.

I don't need it in my toolchain either (I'd have to verify). And I'm probably dense after poker night but I still have to figger out what is setting carry--the previous shift? And the loop counter increment is guaranteed to be an INC? And the flag setting isn't the one from the loop counter setting?

Anyway, for a loop/multibit version you've added more possibilities to the mix.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

anders_m wrote:
with -Os

I tried different optimizations options on a large project:

testbit(port, bit) (uint8_t)(((uint8_t)port & (uint8_t)_BV(bit)))
testbit(port, bit) ((port & _BV(bit)))

w/cast Program Data    no/cast Program Data
O0:    137340  4073    O0:      137176  4073
O1:     93950  4073    O1:       94132  4073
O2:     96424  4069    O2:       96782  4069
O3:    126038  4063    O3:      127666  4063
Os:     90912  4073    Os:       90904  4073

I don't have any conclusion other than do what works best for you. I usually compile using -O2 to try to gain execution speed, so I'm ok with the cast version.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
Quote:

In assembly, I wouldn't need hi at all.

I don't need it in my toolchain either (I'd have to verify). And I'm probably dense after poker night but I still have to figger out what is setting carry--the previous shift?
I think you've got it right.
In C, I was tempted by if(SREG & _BV(carry))PINB=mask.
The while wouldn't affect SREG.
Edit: PORTB corrected to PINB.
Quote:
And the loop counter increment is guaranteed to be an INC? And the flag setting isn't the one from the loop counter setting?
A DEC would work better.

Iluvatar is the better part of Valar.

Last Edited: Tue. Apr 6, 2010 - 03:20 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for the discussion, everyone! I've learned a lot from watching this dialog....