Browbeating GCC into using SBIS/SBIC to test I/O pins

Go To Last Post
25 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I was looking over a change I'm making to my almost-stable-now bootloader firmware, and was dumbfounded at the lengths the compiler went to in order to squander time and codespace. Boiling it down to the quintessential "simple test case", the sourcecode is:

extern bool i2cSend(uint8_t);
void sneekle(void) {
    uint8_t answer = 0;
    if (PIND &  0x40) {
        answer = 1;
    }
    i2cSend(answer);
}

. I kind of hoped that the compiler would generate something close to what I'd write in assembly:

    ldi r24, 0
    sbic PIND, 6
    ldi r24, 1
    rjmp i2cSend

(well, except that you have to good-naturedly expect that the compiler will use "rCALL i2cSend; ret"), because in other places where my code SETs I/O port pins, constructs like "PORTA |= _BV(2)" generate a nice clean "sbi PORTA, 2" instruction. But my subroutine was instead rendered thusly:

void sneekle(void)
{
    uint8_t answer = 0;
    if (PIND &  0x40) {
    7de6:	89 b1       	in	r24, 0x09	; 9
    7de8:	90 e0       	ldi	r25, 0x00	; 0
    7dea:	36 e0       	ldi	r19, 0x06	; 6
    7dec:	96 95       	lsr	r25
    7dee:	87 95       	ror	r24
    7df0:	3a 95       	dec	r19
    7df2:	e1 f7       	brne	.-8      	; 0x7dec <_Z7sneeklev+0x6>
    7df4:	81 70       	andi	r24, 0x01	; 1
    7df6:	0a df       	rcall	.-492    	; 0x7c0c 
        answer = 1;
    }
    i2cSend(answer);
}

This was with an optimization setting of -Os, and the compiler announces IDs itself as avr-gcc.exe (WinAVR 20100110) 4.3.3. The compiler is so freaking impressed with its discovery that a single-bit result can be generated by a loop to nudges the bit down to the bit0 position that it won't consider a cheaper brute-force compare and branch. Just for giggles, I changed the code so that the port pin selects between values of 0 or 19, and got this:

void sneekle(void)
{
    uint8_t answer = 0;
    if (PIND &  0x40) {
    7de6:	4e 9b       	sbis	0x09, 6	; 9
    7de8:	02 c0       	rjmp	.+4      	; 0x7dee <_Z7sneeklev+0x8>
    7dea:	83 e1       	ldi	r24, 0x13	; 19
    7dec:	01 c0       	rjmp	.+2      	; 0x7df0 <_Z7sneeklev+0xa>
    7dee:	80 e0       	ldi	r24, 0x00	; 0
    7df0:	0d df       	rcall	.-486    	; 0x7c0c 
        answer = 19;
    }
    i2cSend(answer);
    7df2:	08 95       	ret
}

Ordinarily, I'd be annoyed that the compiler throws in two extra jumps just so it can code the skip instruction backwards, but under the circumstances, I'll take it; the receiver's just looking for a nonzero value and 19 fills the bill just fine. And, hey! At least the compiler finally used an SBIS!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
void sneekle(void) {
    uint8_t answer;
    if ((answer = PIND & 0x40) != 0)
		answer = 1;
    i2cSend(answer);
}

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This code:

void sneekle(void) { 
    uint8_t answer = PIND;
    answer = (answer & 0x40) ? 1 : 0;
    i2cSend(answer); 
}

yields this:

LDI        R22,0x00
SBIC      0x09,6
LDI        R22,0x01
LDI        R24,0x00
RCALL   i2cSend
RET

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I neglected to post the code generated by WinAVR_20100110 for my solution. With the compile/link options that I used, it produced this:

0000008e :
void sneekle(void)
{
    uint8_t answer;
    if ((answer = PIND & 0x40) != 0)
		answer = 1;
    i2cSend(answer);
  8e:	80 e0       	ldi	r24, 0x00	; 0
  90:	4e 99       	sbic	0x09, 6	; 9
  92:	81 e0       	ldi	r24, 0x01	; 1
}
  94:	02 c0       	rjmp	.+4      	; 0x9a 

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

dkinzer wrote:
I neglected to post the code generated by WinAVR_20100110 for my solution. With the compile/link options that I used, it produced this:
0000008e :
void sneekle(void)
{
    uint8_t answer;
    if ((answer = PIND & 0x40) != 0)
		answer = 1;
    i2cSend(answer);
  8e:	80 e0       	ldi	r24, 0x00	; 0
  90:	4e 99       	sbic	0x09, 6	; 9
  92:	81 e0       	ldi	r24, 0x01	; 1
}
  94:	02 c0       	rjmp	.+4      	; 0x9a 

Well, that's what I wanted, all right. But I'm missing something, and I'd like to learn what it is. You mention WinAVR20100110. Hmm... that's what I'm using. I would hope that the linker options are irrelevant to what sort of code gets generated, so that leaves the compiler options. Here's what "make"s echo of the command that compiled the file in question looks like in my environment:
Compiling: src/bootloader.cpp
avr-gcc.exe -c -mmcu=atmega324a -Wall -gdwarf-2 -Os -funsigned-char -funsigned-bitfields -fshort-enums -I. -Iinc -I../inc -DF_CPU=20000000UL -MD -MP -MT bootloader.o -MF dep/bootloader.o.d -mshort-calls src/bootloader.cpp -o bootloader.o

I'm home now, but after a bit of updating my home machine's snapshot of the work files, it produces the same results. The most suspicious thing is that the inefficient code came from a C++ compilation...

I'll try this again tomorrow.

Oh yeah; thanks to everyone for trying/comparing the different idioms in their environments. And yes, I'd tried 'most all those ways to restate the problem too; I always got the "giant computed shift loop" output.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't understand this obsession of beating C into this or that.

Why can't you use inline asm if you desire a particular outcome?

JW

[EDIT]

#include 

extern _Bool i2cSend(uint8_t);
void sneekle(void) {
    uint8_t answer = 0;
    if (PIND &  0x40) {
        answer = 1;
    }
    i2cSend(answer);
} 

void sneekle2(void) {
    uint8_t answer = 0;
    __asm__(
      "sbic  %[_PIND], 6 \n\t"
      "inc   %[_answer]\n\t"
      :
      : [_PIND]    "I"  (_SFR_IO_ADDR(PIND))
      , [_answer]  "r"  (answer)
    );
    i2cSend(answer);
} 
00000000 :
   0:	80 e0       	ldi	r24, 0x00	; 0
   2:	4e 99       	sbic	0x09, 6	; 9
   4:	83 95       	inc	r24
   6:	0e 94 00 00 	call	0	; 0x0 
   a:	08 95       	ret

0000000c :
   c:	89 b1       	in	r24, 0x09	; 9
   e:	90 e0       	ldi	r25, 0x00	; 0
  10:	26 e0       	ldi	r18, 0x06	; 6
  12:	96 95       	lsr	r25
  14:	87 95       	ror	r24
  16:	2a 95       	dec	r18
  18:	01 f4       	brne	.+0      	; 0x1a 
  1a:	81 70       	andi	r24, 0x01	; 1
  1c:	0e 94 00 00 	call	0	; 0x0 
  20:	08 95       	ret
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
void sneekle(void) {
    uint8_t answer;
    if ((answer = PIND & 0x40) != 0)
      answer = 1;
    i2cSend(answer);
}

Likewise, I cannot see why anyone would worry about micro-managing C code for the sake of a few cycles.

However, I would find it infinitely preferable to using inline gobbledygook.

The reason for Don's code to be optimised so well is because the Compiler knows the expression only needs 8 bits. By default, a logical expression in C is evaluated as an int.

I could use Don's function with any Compiler. It may not be optimised as well by other compilers, but it will always produce the correct result.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

david.prentice wrote:
However, I would find it infinitely preferable to using inline gobbledygook.
That of course depends on your definition of gobbledygook. For me (the C-hater), C is far worse ;-)

David wrote:
I could use Don's function with any Compiler. It may not be optimised as well by other compilers, but it will always produce the correct result.
So does mine:

#include 

extern _Bool i2cSend(uint8_t);

void sneekle(void) {
  uint8_t answer = 0;
  #if defined(__GNUC__) && defined(__AVR__)
    __asm__(
      "sbic  %[_PIND], 6 \n\t"
      "inc   %[_answer]\n\t"
      :
      : [_PIND]    "I"  (_SFR_IO_ADDR(PIND))
      , [_answer]  "r"  (answer)
    );
  #else
    if (PIND &  0x40) {
      answer = 1;
    }
  #endif
  i2cSend(answer);
} 

Enjoy! :-P

Jan

PS. The REAL solution of course is to write the whole program in asm, but then that requires REAL programmers not you quiche-eaters... :-D

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Jan,

I like ASM.
I like C.
I dislike gobbledygook.

Yes. I know the gobbledygook syntax is useful to the avr-gcc libc writers. I cringe.

I just feel that your solution is the worst of all worlds. I know it works.

We will always differ in our opinions!

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sometimes a factor of 2+ of size and 10+ of speed matter.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

From Don's post earlier.

Quote:

0000008e :
void sneekle(void)
{
    uint8_t answer;
    if ((answer = PIND & 0x40) != 0)
      answer = 1;
    i2cSend(answer);
  8e:   80 e0          ldi   r24, 0x00   ; 0
  90:   4e 99          sbic   0x09, 6   ; 9
  92:   81 e0          ldi   r24, 0x01   ; 1
}
  94:   02 c0          rjmp   .+4         ; 0x9a 


@sparrow2,

I wonder how you get "factor of 2+ of size and 10+ of speed".

The above code looks like 4 instructions and 5 cycles to me. I doubt that you could reduce this to 2 instruction and 0.5 cycles.

Ok. I will be kind. 2 instructions and 1 cycle.

I know that you have posted some pretty neat algorithms in the past.

David.

Last Edited: Fri. Apr 15, 2011 - 09:25 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

David wrote:
I like ASM.
I like C.
I dislike gobbledygook.

OK, then how about this one, from your like/dislike point of view:

sneekle.S:

#include 
  .text
  .global sneekle

sneekle:
  clr   r24
  sbic  _SFR_IO_ADDR(PIND), 6
  inc   r24
  jmp   i2cSend

  .size sneekle, .-sneekle

?

Jan

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yup. I have no problem with regular ASM syntax. I am quite happy with differently named directives too.

I count your solution as 4 instructions, 5 cycles too. (assuming rjmp)

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
@sparrow2,

I wonder how you get "factor of 2+ of size and 10+ of speed".

The above code looks like 4 instructions and 5 cycles to me. I doubt that you could reduce this to 2 instruction and 0.5 cycles.

Ok. I will be kind. 2 instructions and 1 cycle.

I know that you have posted some pretty neat algorithms in the past.

David.


The nice C code OP started with was 8 ASM instructions and about 30 CLK
A "nice" ASM is 3 ASM in 3 clk.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

The nice C code OP started with was 8 ASM instructions and about 30 CLK
A "nice" ASM is 3 ASM in 3 clk.

I would love to see your 3 cycle solution.

Jan may cry too!

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I talk about the the calc of answer, (if the function call is a rjmp jmp rcall call will only depend of the size of the chip and the rest of the structure).

Jens

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I do not understand. Do you have a better ASM sequence?

Re-reading the OP, I now understand the reason for the question. Yes. It really does matter if you can squeeze a bootloader into 256 or 512 words.

The question was about 'helping' gcc with a common C sequence.

It was answered by Don, koschi, Jan ...

If you assign a logical expression to an uint8_t the optimiser can avoid using the whole width of an 'int'.

There is nothing underhand about this. It should work for many C compilers.

I would agree that no C compiler can compete with ASM that uses the T or C flag. This example does not fall into that category. Likewise, a tightly nested loop. It is not that either!

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I do not understand. Do you have a better ASM sequence?
No It can't be better that 3 instructions as OP show and Jan C code generate.
Quote:
If you assign a logical expression to an uint8_t the optimiser can avoid using the whole width of an 'int'.

The problem is that OP's C code use it to!

PS
ASM in 2 instructions:
Depending of how "i2cSend(answer);" use answer perhaps this will do the same:

in   r24, 0x09   
andi r24, 0x40

Now the Z flag holds the correct value, and R24 is TRUE(non zero) or FALSE

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Levenkay wrote:
I would hope that the linker options are irrelevant to what sort of code gets generated, so that leaves the compiler options.
Usually, that's true. However, the --relax linker option allows the linker to substitute an rjmp for the rcall/ret sequence, something the compiler doesn't do.

I whittled down the compile line for the main module to this:

C:\WinAVR_20100110/bin/avr-gcc -c -mmcu=atmega644p -Os -Wa,-adhlns=./t0.lst t0.c -o ./t0.o

which resulted in the .lst file shown below. You could start with this command line and add other options you're using until the generated code deviates from this.

   1               		.file	"t0.c"
   2               	__SREG__ = 0x3f
   3               	__SP_H__ = 0x3e
   4               	__SP_L__ = 0x3d
   5               	__CCP__  = 0x34
   6               	__tmp_reg__ = 0
   7               	__zero_reg__ = 1
   8               		.text
   9               	.global	sneekle
  11               	sneekle:
  12               	/* prologue: function */
  13               	/* frame size = 0 */
  14 0000 80E0      		ldi r24,lo8(0)
  15 0002 4E99      		sbic 41-32,6
  16 0004 81E0      		ldi r24,lo8(1)
  17               	.L2:
  18 0006 0E94 0000 		call i2cSend
  19               	/* epilogue start */
  20 000a 0895      		ret
  22               	.global	main
  24               	main:
  25               	/* prologue: function */
  26               	/* frame size = 0 */
  27 000c 0E94 0000 		call sneekle
  28               	.L5:
  29 0010 00C0      		rjmp .L5
DEFINED SYMBOLS
                            *ABS*:00000000 t0.c
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:2      *ABS*:0000003f __SREG__
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:3      *ABS*:0000003e __SP_H__
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:4      *ABS*:0000003d __SP_L__
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:5      *ABS*:00000034 __CCP__
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:6      *ABS*:00000000 __tmp_reg__
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:7      *ABS*:00000001 __zero_reg__
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:11     .text:00000000 sneekle
C:\DOCUME~1\Don\LOCALS~1\Temp/ccqXgDOa.s:24     .text:0000000c main

UNDEFINED SYMBOLS
i2cSend

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
The problem is that OP's C code use it to!
No it doesn't. The 'if' in the OPs code uses 16 bit ints.
Quote:
Now the Z flag holds the correct value, and R24 is TRUE(non zero) or FALSE
That depends on how you want to use the answer. Since the function being called is i2cSend, I fully expect that what it does is send an 8 bit value over i2c. You can't do that with the Z flag, you can only use the Z flag to select between 2 values, which would certainly take more than the 1 opcode/clock you saved.

If you don't like the '=' inside the 'if', then use my version. It produces the exact same code as Don's (the extra LDI in mine was because I was actually calling a function that takes 2 uint8_t's since I was too lazy to write a new function and instead called an existing one).

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wek wrote:
I don't understand this obsession of beating C into this or that.

Why can't you use inline asm if you desire a particular outcome?

JW

[EDIT]

#include 

extern _Bool i2cSend(uint8_t);
void sneekle(void) {
    uint8_t answer = 0;
    if (PIND &  0x40) {
        answer = 1;
    }
    i2cSend(answer);
} 

void sneekle2(void) {
    uint8_t answer = 0;
    __asm__(
      "sbic  %[_PIND], 6 \n\t"
      "inc   %[_answer]\n\t"
      :
      : [_PIND]    "I"  (_SFR_IO_ADDR(PIND))
      , [_answer]  "r"  (answer)
    );
    i2cSend(answer);
} 


If the inefficiency of compiler-generated code got bad enough, even you would seek ways around it. If it were only bad "in places" - if there were certain programming idioms for which the compiler did an exceptionally bad job - it might be worthwhile to know that so you could avoid using them, assuming the alternatives are halfway close to reasonable.

As for resorting to inline assembler, I agree that would let me express the operation succinctly. The mechanisms that gcc's inline assembler requires you to use in order to reference the C variables, though, are more complex and (to me) obscure than the most bizzare C++ feature ever thought of being on its worst day. In the gcc docs, the "asm constraints" are explained by statements like, "When making an adjunctive covariant reference to an interstitially reflexive variable, prefix a treble-clef character to the intermediate term's constraint". I do use inline asm for cases where access to the carry or T flag(s) is key. But not being able to understand the documentation (which I suspect requires more knowledge about the compiler internals than I'm ever likely to have) makes each use fraught with worry. I try to be sparing. I have to remain in awe of folk like you who at least *seem* comfortable with it.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

To be honest, you only have to follow the common rules of thumb for microcontrollers:

1. use appropriate type of variables to suit your architecture and expressions.

2. write straightforward, unambiguous code.

Armed with this, most compilers do a pretty good job.

IME, the generated ASM is as good as I could write in ASM. Generally better!!

Yes. There are the odd occasions when writing an ISR() in assembly code is worthwhile. And of course, bootloaders. The reason for this thread in the first place.

I also admire Jan's ability to understand avr-gcc inline ASM. However, I have no desire to acquire that skill for myself.

It is easy to produce bloated code by flouting (1).
I would like to see an example of complex C that generates an improvement on (2). Especially with avr-gcc.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Inspire by this thread, I compiled

#include 
#include 

#define bool uint8_t

extern bool i2cSend(uint8_t);

void sneekle(void) {
    uint8_t answer = 0;
    if (PIND &  0x40) {
        answer = 1;
    }
    while(PIND & 0x40) {}
    i2cSend(answer);
}

and got

	.type	sneekle, @function
sneekle:
.LFB2:
.LM1:
/* prologue: function */
/* frame size = 0 */
.LM2:
	in r24,41-32
.L2:
.LM3:
	sbic 41-32,6
	rjmp .L2
.LM4:
	ldi r25,lo8(0)
	ldi r18,6
1:	lsr r25
	ror r24
	dec r18
	brne 1b
	andi r24,lo8(1)
	call i2cSend
/* epilogue start */
.LM5:
	ret
.LFE2:
	.size	sneekle, .-sneekle

The assembly is better with answer=3 than with answer=1,
but still a bit awful.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Adding inline asembler to hide the structure of the code from the compiler will stop mis-optimations like this. For example:

extern bool i2cSend(uint8_t);
void sneekle(void) {
    uint8_t answer = 0;
    asm ("" : "+g"(answer));
    if (PIND &  0x40) {
        answer = 1;
    }
    i2cSend(answer);
}

Compiles to:

.global sneekle
        .type   sneekle, @function
sneekle:
/* prologue: function */
/* frame size = 0 */
        ldi r24,lo8(0)
        sbic 41-32,6
        ldi r24,lo8(1)
.L2:
        call i2cSend
/* epilogue start */
        ret
        .size   sneekle, .-sneekle

This sort of transformation is an important optimisation for typical fast CPUs as it eliminates the branch misprediction penalty, the fact that it is bad for AVRs has obviously been overlooked. Send a bug report.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TimothyEBaldwin wrote:
This sort of transformation is an important optimisation for typical fast CPUs as it eliminates the branch misprediction penalty, the fact that it is bad for AVRs has obviously been overlooked. Send a bug report.

The problem is not a brach prediction. Anyway, as this problem is for gcc 4.3.x, sending a brug report will help *nothing*: The current release series is 4.7, which don't show that problem. For release series older than 4.7, the only modifications allowed are bug fixes (this is /not/ a bug, its an optimization flaw) and documentation updates.

avrfreaks does not support Opera. Profile inactive.