avr-gcc weird (non)optimization

Go To Last Post
28 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This is avr-gcc (AVR_8_bit_GNU_Toolchain_3.6.1_495) 5.4.0

Edit: It's an ATmega4809, with those new-fangled structure-based IO register definitions.

 

Supposed I am "somehow" restricted to using only "-O1" optimization, and I have the following bit of code:

Edit, add: avr-gcc -mmcu=atmega4809 -O1 -g baz.c -c

#include <avr/io.h>

extern int uart_getc();
extern void uart_putc(char);
extern void uart_puts(char*);

int main() {
    int c;

    while (1) {
        PORTB.OUTSET = (1<<5);
        PORTB.OUTCLR = (1<<5);
        c = uart_getc();
        if (c > 0) {
	    uart_puts("\"\r\n");
            uart_putc(c);
	}
    }
}

 

Now, even -O1 (even -O0, for that matter) should know about compile-time computation of constants, right?

And yet it, produces this rather strange code:

int main() {
    int c;

    while (1) {
        PORTB.OUTSET = (1<<5);
   0:   68 94           set
   2:   ee 24           eor     r14, r14
   4:   e5 f8           bld     r14, 5
   6:   ff 24           eor     r15, r15
   8:   f2 f8           bld     r15, 2

   a:   10 e2           ldi     r17, 0x20       ; 32
   c:   f7 01           movw    r30, r14
   e:   15 83           std     Z+5, r17        ; 0x05
        PORTB.OUTCLR = (1<<5);
  10:   16 83           std     Z+6, r17        ; 0x06
        c = uart_getc();
  12:   0e 94 00 00     call    0       ; 0x0 <main>
  16:   ec 01           movw    r28, r24
        if (c > 0) {
  18:   18 16           cp      r1, r24
  1a:   19 06           cpc     r1, r25
  1c:   04 f4           brge    .+0             ; 0x1e <main+0x1e>
            uart_puts("\"\r\n");
  1e:   80 e0           ldi     r24, 0x00       ; 0
  20:   90 e0           ldi     r25, 0x00       ; 0
  22:   0e 94 00 00     call    0       ; 0x0 <main>
            uart_putc(c);
  26:   8c 2f           mov     r24, r28
  28:   0e 94 00 00     call    0       ; 0x0 <main>
  2c:   00 c0           rjmp    .+0             ; 0x2e <__zero_reg__+0x2d>

Does anyone have any idea what it thinks it's doing?  I guess it's stashing the address of PORTB in a "permanent" register, but WHY IS IT BUILDING IT ONE BIT AT A TIME, through the T status bit?

 

If I comment out EITHER line in the if clause, or use ANY other optimization level, I get much more reasonable-looking output:

    while (1) {
        PORTB.OUTSET = (1<<5);
   0:   00 e2           ldi     r16, 0x20       ; 32
   2:   14 e0           ldi     r17, 0x04       ; 4
   4:   c0 e2           ldi     r28, 0x20       ; 32
   6:   f8 01           movw    r30, r16
   8:   c5 83           std     Z+5, r28        ; 0x05
        PORTB.OUTCLR = (1<<5);
   a:   c6 83           std     Z+6, r28        ; 0x06
        c = uart_getc();
   c:   0e 94 00 00     call    0       ; 0x0 <main>
        if (c > 0) {
  10:   18 16           cp      r1, r24
  12:   19 06           cpc     r1, r25
  14:   04 f4           brge    .+0             ; 0x16 <main+0x16>
            uart_puts("\"\r\n");
        //            uart_putc(c);
  16:   80 e0           ldi     r24, 0x00       ; 0
  18:   90 e0           ldi     r25, 0x00       ; 0
  1a:   0e 94 00 00     call    0       ; 0x0 <main>
  1e:   00 c0           rjmp    .+0             ; 0x20 <__zero_reg__+0x1f>
Last Edited: Tue. Oct 9, 2018 - 09:23 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

'Tis amazing alright: It does not understand enough to evaluate the compile-time constant,

but it does understand enough to use set and bld.

 

BTW are you sure you are using avr-gcc ?

I do not recognize the PORTB. syntax.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It look like it ran out of high registers (because of other things), and find an "odd" way (setting bits by using T) on a low register pair , to get to the structure addr. 

 

but I guess all should have done ldi direct to Z (at all levels), but perhaps the pointer is used somewhere else (should still build up in Z and then move the pointer down, then any register pair could be used)

 

to make it clear this is the constant:

 a:   10 e2           ldi     r17, 0x20       ; 32

the code before is to get the addr. (remember there is no "real" clr instruction it eor a register with it self.)

 

 

Last Edited: Mon. Oct 8, 2018 - 05:31 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:

extern void uart_puts(char*);

 

I know this isn't the question, but if you have the ability to change

the signature of the above function to:

 

extern void uart_puts(const char*);

 

I believe the compiler has more optimization options, since it can

assume the pointed-to data is not going to be modified.

 

--Mike

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

BTW are you sure you are using avr-gcc ?

I do not recognize the PORTB. syntax.

 It's an ATmega4809, with those new-fangled structure-based IO register definitions.

I guess I should have mentioned that :-(

avr-gcc -mmcu=atmega4809 -O1 -c -g baz.c

 

It look like it ran out of high registers (because of other things), and find an "odd" way (setting bits by using T) on a low register pair , to get to the structure addr. 

 What "other things"?  That's the whole program!  I guess it decides that it MUST store "c" in r28/r29 to save it while it calls puts(); but that certainly seems to be a poor choice, in addition to using a poor method to set the constant in a low register...  I guess "c is a local variable, so it should be kept in a hand high register!"?  It'd be more understandable if all the other optimization modes didn't make better decisions!

 

uart_puts(const char*);  more optimization options?

 Maybe.  It doesn't help in this case.

 

I guess I never noticed before that gcc only has two double-registers (r16/17 and r28/29) that are "call saved."

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
but I guess all should have done ldi direct to Z (at all levels), but perhaps the pointer is used somewhere else

 

The ABI says Z may be changed when a function is called, so the compiler decided not to ldi the address directly to Z because it's afraid the address might be needed later? But it isn't used again in this case.

 

I compiled this with gcc 8.1.0 at -O1 level and get exactly the same code:

00000000 <main>:
   0:   68 94           set
   2:   ee 24           eor     r14, r14
   4:   e5 f8           bld     r14, 5
   6:   ff 24           eor     r15, r15
   8:   f2 f8           bld     r15, 2
   a:   10 e2           ldi     r17, 0x20       ; 32
   c:   f7 01           movw    r30, r14
   e:   15 83           std     Z+5, r17        ; 0x05
  10:   16 83           std     Z+6, r17        ; 0x06
  12:   0e 94 00 00     call    0       ; 0x0 <main>
  16:   ec 01           movw    r28, r24
  18:   18 16           cp      r1, r24
  1a:   19 06           cpc     r1, r25
  1c:   04 f4           brge    .+0             ; 0x1e <main+0x1e>
  1e:   80 e0           ldi     r24, 0x00       ; 0
  20:   90 e0           ldi     r25, 0x00       ; 0
  22:   0e 94 00 00     call    0       ; 0x0 <main>
  26:   8c 2f           mov     r24, r28
  28:   0e 94 00 00     call    0       ; 0x0 <main>
  2c:   00 c0           rjmp    .+0             ; 0x2e <__zero_reg__+0x2d>

 

But each optimization level generates different code. It's rather interesting, really:

-Og does ldi directly to Z:

00000000 <main>:
   0:   e0 e2           ldi     r30, 0x20       ; 32
   2:   f4 e0           ldi     r31, 0x04       ; 4
   4:   80 e2           ldi     r24, 0x20       ; 32
   6:   85 83           std     Z+5, r24        ; 0x05
   8:   86 83           std     Z+6, r24        ; 0x06
   a:   0e 94 00 00     call    0       ; 0x0 <main>

 

-Os, -O2 and -O3 do a direct store:

00000000 <main>:
   0:   10 e2           ldi     r17, 0x20       ; 32
   2:   10 93 25 04     sts     0x0425, r17     ; 0x800425 <__SREG__+0x8003e6>
   6:   10 93 26 04     sts     0x0426, r17     ; 0x800426 <__SREG__+0x8003e7>
   a:   0e 94 00 00     call    0       ; 0x0 <main>

 

edit: to reproduce the other OP code, I need to use -Os, -O2 or -O3 with -fno-gcse:

00000000 <main>:
   0:   00 e2           ldi     r16, 0x20       ; 32
   2:   14 e0           ldi     r17, 0x04       ; 4
   4:   80 e2           ldi     r24, 0x20       ; 32
   6:   f8 2e           mov     r15, r24
   8:   f8 01           movw    r30, r16
   a:   f5 82           std     Z+5, r15        ; 0x05
   c:   f6 82           std     Z+6, r15        ; 0x06
   e:   0e 94 00 00     call    0       ; 0x0 <main>

How do I know this? I did some experiments with the optimization flags some time ago and found out that -fgcse induces the use of direct load/store instructions, while -fno-gcse inhibits it.

Now I'm curious why -Og induces direct ldi to Z. Time for a bit more gcc flag research.

Last Edited: Mon. Oct 8, 2018 - 09:57 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for looking into it.

 

It turns out that the microchip MPLABx environment will support -Og as well as -O1.

The differences are "entertaining."

(also, if you set optimization to O1, and check "debugging" in the little box, it will generate compiler switches "-O1 -Og", which doesn't operate the way that they seem to think.)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

An interesting project for someone with the skills could be to write

an assembler -> assembler optimizer.  I've never heard of one, but

it seems like something that would be useful especially when your

compiler refuses to do it.

 

--Mike

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avr-mike wrote:
An interesting project for someone with the skills could be to write

an assembler -> assembler optimizer.  I've never heard of one, but

it seems like something that would be useful especially when your

compiler refuses to do it.

IIRC it's called peephole optimization.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avr-gcc's handling of pointers has always been sub-optimal.  For another example, see:

https://www.avrfreaks.net/comment/2562646#comment-2562646

 

But your example is exceptionally bad :(

 

Note that SFRs are defined differently for the m4809 than for previous devices.  From iom4809.h:

 

typedef volatile uint8_t register8_t;
typedef struct PORT_struct
{
    register8_t DIR;  /* Data Direction */
    register8_t DIRSET;  /* Data Direction Set */
    register8_t DIRCLR;  /* Data Direction Clear */
    register8_t DIRTGL;  /* Data Direction Toggle */
    register8_t OUT;  /* Output Value */
    register8_t OUTSET;  /* Output Value Set */
    register8_t OUTCLR;  /* Output Value Clear */
    register8_t OUTTGL;  /* Output Value Toggle */
    register8_t IN;  /* Input Value */
    register8_t INTFLAGS;  /* Interrupt Flags */
    register8_t PORTCTRL;  /* Port Control */
    register8_t reserved_0x0B;
    register8_t reserved_0x0C;
    register8_t reserved_0x0D;
    register8_t reserved_0x0E;
    register8_t reserved_0x0F;
    register8_t PIN0CTRL;  /* Pin 0 Control */
    register8_t PIN1CTRL;  /* Pin 1 Control */
    register8_t PIN2CTRL;  /* Pin 2 Control */
    register8_t PIN3CTRL;  /* Pin 3 Control */
    register8_t PIN4CTRL;  /* Pin 4 Control */
    register8_t PIN5CTRL;  /* Pin 5 Control */
    register8_t PIN6CTRL;  /* Pin 6 Control */
    register8_t PIN7CTRL;  /* Pin 7 Control */
    register8_t reserved_0x18;
    register8_t reserved_0x19;
    register8_t reserved_0x1A;
    register8_t reserved_0x1B;
    register8_t reserved_0x1C;
    register8_t reserved_0x1D;
    register8_t reserved_0x1E;
    register8_t reserved_0x1F;
} PORT_t;
#define PORTB                (*(PORT_t *) 0x0420) /* I/O Ports */

What happens if you use the non-struct macros instead?

        PORTB_OUTSET = (1<<5);
        PORTB_OUTCLR = (1<<5);

 

FWIW, I tried your code (with -c, as I lack your external functions) with -O1 and get the same results.  However, as you noted, all other optimisation levels yield much better results:

$ avr-gcc -Os -mmcu=atmega4809 -B /tmp/Atmel.ATmega_DFP.1.2.272.atpack_FILES/gcc/dev/atmega4809/ -I /tmp/Atmel.ATmega_DFP.1.2.272.atpack_FILES/include/ foo.c -o foo.elf -c
$ avr-objdump -S foo.elf

foo.elf:     file format elf32-avr

Disassembly of section .text.startup:

00000000 <main>:

int main() {
    int c;

    while (1) {
        PORTB.OUTSET = (1<<5);
   0:	10 e2       	ldi	r17, 0x20	; 32
   2:	10 93 25 04 	sts	0x0425, r17	; 0x800425 <__SREG__+0x8003e6>
   6:	10 93 26 04 	sts	0x0426, r17	; 0x800426 <__SREG__+0x8003e7>
        PORTB.OUTCLR = (1<<5);
   a:	0e 94 00 00 	call	0	; 0x0 <main>
        c = uart_getc();
   e:	ec 01       	movw	r28, r24
  10:	18 16       	cp	r1, r24
  12:	19 06       	cpc	r1, r25
        if (c > 0) {
  14:	04 f4       	brge	.+0      	; 0x16 <main+0x16>
  16:	80 e0       	ldi	r24, 0x00	; 0
  18:	90 e0       	ldi	r25, 0x00	; 0
	    uart_puts("\"\r\n");
  1a:	0e 94 00 00 	call	0	; 0x0 <main>
  1e:	8c 2f       	mov	r24, r28
  20:	0e 94 00 00 	call	0	; 0x0 <main>
            uart_putc(c);
  24:	00 c0       	rjmp	.+0      	; 0x26 <__zero_reg__+0x25>

 

$ avr-gcc -Og -mmcu=atmega4809 -B /tmp/Atmel.ATmega_DFP.1.2.272.atpack_FILES/gcc/dev/atmega4809/ -I /tmp/Atmel.ATmega_DFP.1.2.272.atpack_FILES/include/ foo.c -o foo.elf -c
$ avr-objdump -S foo.elf

foo.elf:     file format elf32-avr

Disassembly of section .text:

00000000 <main>:

int main() {
    int c;

    while (1) {
        PORTB.OUTSET = (1<<5);
   0:	e0 e2       	ldi	r30, 0x20	; 32
   2:	f4 e0       	ldi	r31, 0x04	; 4
   4:	80 e2       	ldi	r24, 0x20	; 32
   6:	85 83       	std	Z+5, r24	; 0x05
        PORTB.OUTCLR = (1<<5);
   8:	86 83       	std	Z+6, r24	; 0x06
        c = uart_getc();
   a:	0e 94 00 00 	call	0	; 0x0 <main>
   e:	ec 01       	movw	r28, r24
        if (c > 0) {
  10:	18 16       	cp	r1, r24
  12:	19 06       	cpc	r1, r25
  14:	04 f4       	brge	.+0      	; 0x16 <main+0x16>
	    uart_puts("\"\r\n");
  16:	80 e0       	ldi	r24, 0x00	; 0
  18:	90 e0       	ldi	r25, 0x00	; 0
  1a:	0e 94 00 00 	call	0	; 0x0 <main>
            uart_putc(c);
  1e:	8c 2f       	mov	r24, r28
  20:	0e 94 00 00 	call	0	; 0x0 <main>
  24:	00 c0       	rjmp	.+0      	; 0x26 <__zero_reg__+0x25>

 

-O2 and -O3 yield the same result as -O1.

 

Wondering aloud, what would happen if the macros were changed to include const?:

$ cat foo.c
#include <avr/io.h>

typedef volatile uint8_t register8_t_;

typedef struct PORT_struct_
{
    register8_t_ DIR;  /* Data Direction */
    register8_t_ DIRSET;  /* Data Direction Set */
    register8_t_ DIRCLR;  /* Data Direction Clear */
    register8_t_ DIRTGL;  /* Data Direction Toggle */
    register8_t_ OUT;  /* Output Value */
    register8_t_ OUTSET;  /* Output Value Set */
    register8_t_ OUTCLR;  /* Output Value Clear */
    register8_t_ OUTTGL;  /* Output Value Toggle */
    register8_t_ IN;  /* Input Value */
    register8_t_ INTFLAGS;  /* Interrupt Flags */
    register8_t_ PORTCTRL;  /* Port Control */
    register8_t_ reserved_0x0B;
    register8_t_ reserved_0x0C;
    register8_t_ reserved_0x0D;
    register8_t_ reserved_0x0E;
    register8_t_ reserved_0x0F;
    register8_t_ PIN0CTRL;  /* Pin 0 Control */
    register8_t_ PIN1CTRL;  /* Pin 1 Control */
    register8_t_ PIN2CTRL;  /* Pin 2 Control */
    register8_t_ PIN3CTRL;  /* Pin 3 Control */
    register8_t_ PIN4CTRL;  /* Pin 4 Control */
    register8_t_ PIN5CTRL;  /* Pin 5 Control */
    register8_t_ PIN6CTRL;  /* Pin 6 Control */
    register8_t_ PIN7CTRL;  /* Pin 7 Control */
    register8_t_ reserved_0x18;
    register8_t_ reserved_0x19;
    register8_t_ reserved_0x1A;
    register8_t_ reserved_0x1B;
    register8_t_ reserved_0x1C;
    register8_t_ reserved_0x1D;
    register8_t_ reserved_0x1E;
    register8_t_ reserved_0x1F;
} PORT_t_;

#define PORTB_                (*(PORT_t_ * const) 0x0420) /* I/O Ports */

extern int uart_getc();
extern void uart_putc(char);
extern void uart_puts(char*);

int main() {
    int c;

    while (1) {
        PORTB_.OUTSET = (1<<5);
        PORTB_.OUTCLR = (1<<5);
        c = uart_getc();
        if (c > 0) {
	    uart_puts("\"\r\n");
            uart_putc(c);
	}
    }
}

Apparently nothing:

$ avr-gcc -g -O1 -mmcu=atmega4809 -B /tmp/Atmel.ATmega_DFP.1.2.272.atpack_FILES/gcc/dev/atmega4809/ -I /tmp/Atmel.ATmega_DFP.1.2.272.atpack_FILES/include/ foo.c -o foo.elf -c
$ avr-objdump -S foo.elf > foo.lss

foo.elf:     file format elf32-avr

Disassembly of section .text:

00000000 <main>:

int main() {
    int c;

    while (1) {
        PORTB_.OUTSET = (1<<5);
   0:	68 94       	set
   2:	ee 24       	eor	r14, r14
   4:	e5 f8       	bld	r14, 5
   6:	ff 24       	eor	r15, r15
   8:	f2 f8       	bld	r15, 2
   a:	10 e2       	ldi	r17, 0x20	; 32
   c:	f7 01       	movw	r30, r14
   e:	15 83       	std	Z+5, r17	; 0x05
        PORTB_.OUTCLR = (1<<5);
  10:	16 83       	std	Z+6, r17	; 0x06
        c = uart_getc();
  12:	0e 94 00 00 	call	0	; 0x0 <main>
  16:	ec 01       	movw	r28, r24
        if (c > 0) {
  18:	18 16       	cp	r1, r24
  1a:	19 06       	cpc	r1, r25
  1c:	04 f4       	brge	.+0      	; 0x1e <main+0x1e>
	    uart_puts("\"\r\n");
  1e:	80 e0       	ldi	r24, 0x00	; 0
  20:	90 e0       	ldi	r25, 0x00	; 0
  22:	0e 94 00 00 	call	0	; 0x0 <main>
            uart_putc(c);
  26:	8c 2f       	mov	r24, r28
  28:	0e 94 00 00 	call	0	; 0x0 <main>
  2c:	00 c0       	rjmp	.+0      	; 0x2e <__zero_reg__+0x2d>

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Mon. Oct 8, 2018 - 05:13 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I general how can one be surprised that a compiler at low optimize kind of do what you ask it to do!

 

I have not updated the compiler so I this is compiled to a xmegaE5 so I had to use PORTA (make no difference) 

 

I made this code (with -O1): 

		        PORTA.OUTSET = (1<<5);
		        PORTA.OUTCLR = (1<<5);
		       _SFR_MEM8(0x0605)= (1<<5);

  and it give this output :

		        PORTA.OUTSET = (1<<5);
  c4:	e0 e0       	ldi	r30, 0x00	; 0
  c6:	f6 e0       	ldi	r31, 0x06	; 6
  c8:	80 e2       	ldi	r24, 0x20	; 32
		        PORTA.OUTCLR = (1<<5);
				 _SFR_MEM8(0x0605)= (1<<5);
  ca:	a5 e0       	ldi	r26, 0x05	; 5
  cc:	b6 e0       	ldi	r27, 0x06	; 6
int main(void)
{
    /* Replace with your application code */
    while (1)
    {
		        PORTA.OUTSET = (1<<5);
  ce:	85 83       	std	Z+5, r24	; 0x05
		        PORTA.OUTCLR = (1<<5);
  d0:	86 83       	std	Z+6, r24	; 0x06
			_SFR_MEM8(0x0605)= (1<<5);
  d2:	8c 93       	st	X, r24

some comments:

my compiler will load direct to Z! (perhaps just because this is the only code)

if you give the compiler the correct pointer directly it will use X which it also load direct. (I guess that the pointer is declared I just made it work :)  )

 

so what happen it you use:

				 _SFR_MEM8(0x0625)= (1<<5);
				 _SFR_MEM8(0x0626)= (1<<5);

 

and here is the output with -O2

 

		        PORTA.OUTSET = (1<<5);
  c4:	80 e2       	ldi	r24, 0x20	; 32
  c6:	80 93 05 06 	sts	0x0605, r24	; 0x800605 <__TEXT_REGION_LENGTH__+0x700605>
		        PORTA.OUTCLR = (1<<5);
  ca:	80 93 06 06 	sts	0x0606, r24	; 0x800606 <__TEXT_REGION_LENGTH__+0x700606>
				 _SFR_MEM8(0x0605)= (1<<5);
  ce:	80 93 05 06 	sts	0x0605, r24	; 0x800605 <__TEXT_REGION_LENGTH__+0x700605>

so that is the same as you get

 

here is the map file info:

 

c:/program files (x86)/atmel/studio/7.0/toolchain/avr8/avr8-gnu-toolchain/bin/../lib/gcc/avr/5.4.0/avrxmega2\libgcc.a(_exit.o)
                             

C:/Program Files (x86)/Atmel/Studio/7.0/Packs/atmel/XMEGAE_DFP/1.0.30/gcc/dev/atxmega32e5/avrxmega2/crtatxmega32e5.o (exit)
 

EDIT  this is the corret addr for you:

so what happen it you use:

				 _SFR_MEM8(0x0425)= (1<<5);
				 _SFR_MEM8(0x0426)= (1<<5);
Last Edited: Mon. Oct 8, 2018 - 10:43 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
IIRC it's called peephole optimization.

Yes indeed.  And for -mike's "make a pass at the generated machine language" that is essentially what optimization passes do. 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
I general how can one be surprised that a compiler at low optimize kind of do what you ask it to do!
It didn't.

Blindly following the abstract machine would have produced quite different code.

 

Note that, according to C, 1<<5 is a constant expression.

Sometimes, the compiler pretty much has to evaluate it,

e.g. as an enumeration value.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
skeeve wrote:IIRC it's called peephole optimization. Yes indeed. And for -mike's "make a pass at the generated machine language" that is essentially what optimization passes do.
To be clear, some optimizations, e.g. common sub-expression elimination, can be done at a higher level.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

but all code in this thread deal with 1<<5 as a constant of 32 !!!

 

all the bulgy code is where to put the 32 :)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If there are people that is confused I will comment this from #1 :

 

        PORTB.OUTSET = (1<<5);
   0:   68 94           set
   2:   ee 24           eor     r14, r14
   4:   e5 f8           bld     r14, 5
   6:   ff 24           eor     r15, r15
   8:   f2 f8           bld     r15, 2

   a:   10 e2           ldi     r17, 0x20       ; 32
   c:   f7 01           movw    r30, r14
   e:   15 83           std     Z+5, r17        ; 0x05
        PORTB.OUTCLR = (1<<5);
  10:   16 83           std     Z+6, r17        ; 0x06
 

0 T flag =1

2 R14=0

4 R14=32

6 R15=0

8 R15=4

a R17=32 (this is our 1<<5)

c Z=R15:R14 (now Z point at PORTB)   

e store 32 in outset (of port b)

10 store 32 in outclr (of port b) 

 

0-8 is a way to make a LDI on R14 and R15, (from the compiler point of view a clean way because it don't trash any other registers, but LDI r31,4 LDI r30,32 movw r14,r30 would have solved the same problem :) )

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

In other words, on the mega4809 and other UPDI chips, PORTB is a structure defined as:

#define PORTB                (*(PORT_t *) 0x0420) /* I/O Ports */

so its address is 0x0420. So, at lower optimization levels, the compiler's goal is to load this address in Z, then address the structure members with relative addressing. At -O1 optimization level, it uses an imaginative way using the T flag.

At higher levels, it tends to use direct addressing (but this leads to larger code if you want to access structure members more often).

 

edit: as I mentioned before, you can force the compiler to avoid direct addressing by adding the -fno-gcse option to the compiler (e.g. -Os -fno-gcse). This can be useful if you need to save a few bytes of code size.

Last Edited: Mon. Oct 8, 2018 - 10:40 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

0-8 is a way to make a LDI on R14 and R15, (from the compiler point of view a clean way because it don't trash any other registers, but LDI r31,4 LDI r30,32 movw r14,r30 would have solved the same problem :) )

 Yes.   I misspoke about it not "recognizing constants appropriately"; the "mysterious" instruction sequence is it's creative way of loading the address of the PORTB structure into a local register pair...  part of my confusion was probably caused by looking at the .o file rather than the final .elf.  Hopefully, this sequence is outside of the loop, with only the

"movw r30, r14" inside the loop.

 

I wonder what it would have done if &PORTB had more than two bits set?  Hmm...  PORTD has three bits (0x460):

int main() {
    int c;

    while (1) {
        PORTD.OUTSET = (1<<5);
  b8:   0f 2e           mov     r0, r31
  ba:   f0 e6           ldi     r31, 0x60       ; 96
  bc:   ef 2e           mov     r14, r31
  be:   f4 e0           ldi     r31, 0x04       ; 4
  c0:   ff 2e           mov     r15, r31
  c2:   f0 2d           mov     r31, r0
  c4:   10 e2           ldi     r17, 0x20       ; 32
  c6:   f7 01           movw    r30, r14
  c8:   15 83           std     Z+5, r17        ; 0x05
        PORTD.OUTCLR = (1<<5);
  ca:   16 83           std     Z+6, r17        ; 0x06
        c = uart_getc();
  cc:   0e 94 00 00     call    0       ; 0x0 <__vectors>
  d0:   ec 01           movw    r28, r24
        if (c > 0) {
  d2:   18 16           cp      r1, r24
  d4:   19 06           cpc     r1, r25
  d6:   bc f7           brge    .-18            ; 0xc6 <main+0xe>
            uart_puts("\"\r\n");
  d8:   8c ee           ldi     r24, 0xEC       ; 236
  da:   90 e4           ldi     r25, 0x40       ; 64
  dc:   0e 94 00 00     call    0       ; 0x0 <__vectors>
            uart_putc(c);
  e0:   ce 01           movw    r24, r28
  e2:   0e 94 00 00     call    0       ; 0x0 <__vectors>
  e6:   ef cf           rjmp    .-34            ; 0xc6 <main+0xe>

 

Heh.  I also figure out that you can get a program to link without providing the actual external functions by making them 'weak' symbols (thus the "call 0" statements.)

extern int uart_getc() __attribute__((weak));

The loading of the constant IS outside the loop...

It's still a bit weird, carefully saving/restoring R31 (via R0) before ... overwriting it with something else.  Although that's more in line with "less optimization" than the weird T-bit stuff.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
The loading of the constant IS outside the loop...

 

Yes, the compiler is smart enough to do that. However, in this code (BTW, nice trick with the weak symbols, I was using dummy functions but this way is better):

 

#include <avr/io.h>

extern int uart_getc() __attribute__((weak));
extern void uart_putc(char) __attribute__((weak));
extern void uart_puts(char*) __attribute__((weak));

int main() {
	int c;
	PORTB.DIRSET = (1<<5);
	while (1) {
		PORTB.OUTSET = (1<<5);
		PORTB.OUTCLR = (1<<5);
		c = uart_getc();
		if (c > 0) {
			uart_puts("\"\r\n");
			uart_putc(c);
		}
	}
}

 

The compiler could have used the base address of PORTB for the PORTB.DIRSET outside the loop, but it doesn't. It also doesn't recycle the 0x20 constant already in r24. But, this is just -O1 level, maybe at higher levels it's better?

int main() {
	int c;
	PORTB.DIRSET = (1<<5);
  94:	80 e2       	ldi	r24, 0x20	; 32
  96:	80 93 21 04 	sts	0x0421, r24	; 0x800421 <__TEXT_REGION_LENGTH__+0x700421>
	while (1) {
		PORTB.OUTSET = (1<<5);
  9a:	68 94       	set
  9c:	ee 24       	eor	r14, r14
  9e:	e5 f8       	bld	r14, 5
  a0:	ff 24       	eor	r15, r15
  a2:	f2 f8       	bld	r15, 2
  a4:	10 e2       	ldi	r17, 0x20	; 32
  a6:	f7 01       	movw	r30, r14
  a8:	15 83       	std	Z+5, r17	; 0x05
		PORTB.OUTCLR = (1<<5);
  aa:	16 83       	std	Z+6, r17	; 0x06
		c = uart_getc();
  ac:	0e 94 00 00 	call	0	; 0x0 <__vectors>
  b0:	ec 01       	movw	r28, r24
		if (c > 0) {
  b2:	18 16       	cp	r1, r24
  b4:	19 06       	cpc	r1, r25
  b6:	bc f7       	brge	.-18     	; 0xa6 <main+0x12>
			uart_puts("\"\r\n");
  b8:	8c ec       	ldi	r24, 0xCC	; 204
  ba:	90 e8       	ldi	r25, 0x80	; 128
  bc:	0e 94 00 00 	call	0	; 0x0 <__vectors>
			uart_putc(c);
  c0:	8c 2f       	mov	r24, r28
  c2:	0e 94 00 00 	call	0	; 0x0 <__vectors>
  c6:	ef cf       	rjmp	.-34     	; 0xa6 <main+0x12>

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

By the way remember that the chip have fixed VPORT's so they are like normal AVR's.  I hope at even -O1 it will do SBI and CBI instructions for set and clear a port pin.

 

Add

I will need to update the compiler to check it :) (Read I will not check it, but like to hear the result :) )

 

but it should be something like :

SBI 0x08,5

CBI 0x08,5

Last Edited: Tue. Oct 9, 2018 - 05:32 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yeah, if you use VPORT, sbi and cli are used, even @ -O1 level:

 

#include <avr/io.h>

extern int uart_getc() __attribute__((weak));
extern void uart_putc(char) __attribute__((weak));
extern void uart_puts(char*) __attribute__((weak));

int main() {
	int c;
	while (1) {
		VPORTB.OUT |= (1<<5);
		VPORTB.OUT &= ~(1<<5);
		c = uart_getc();
		if (c > 0) {
			uart_puts("\"\r\n");
			uart_putc(c);
		}
	}
}

generates:

int main() {
	int c;
	while (1) {
		VPORTB.OUT |= (1<<5);
  94:	2d 9a       	sbi	0x05, 5	; 5
		VPORTB.OUT &= ~(1<<5);
  96:	2d 98       	cbi	0x05, 5	; 5
		c = uart_getc();
  98:	0e 94 00 00 	call	0	; 0x0 <__vectors>
  9c:	ec 01       	movw	r28, r24
		if (c > 0) {
  9e:	18 16       	cp	r1, r24
  a0:	19 06       	cpc	r1, r25
  a2:	c4 f7       	brge	.-16     	; 0x94 <main>
			uart_puts("\"\r\n");
  a4:	88 eb       	ldi	r24, 0xB8	; 184
  a6:	90 e8       	ldi	r25, 0x80	; 128
  a8:	0e 94 00 00 	call	0	; 0x0 <__vectors>
			uart_putc(c);
  ac:	8c 2f       	mov	r24, r28
  ae:	0e 94 00 00 	call	0	; 0x0 <__vectors>
  b2:	f0 cf       	rjmp	.-32     	; 0x94 <main>

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Do we know which target OP is -- well -- targeting?  There seems to have been an assumption of some Xmega model.  Do the same symptoms exist for Xtiny and Mega0?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes, it's not easy to find but it's mentioned in #5. It's not a xmega, but close, it's one of those new UPDI chips (ATmega4809).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I just looked at the datasheet and SBI and CBI should only take 1 clk, so that is the optimal solution (on a "normal" AVR with register preloaded  two OUT is faster than SBI/CBI).

 

#22

Yes these are scaled down Xmagas, but since they can't remap virtual ports, they have made a hard wired VPORT with DDR PORT and PIN for the different ports.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

if you use VPORT

 Yeah, yeah.  MANY ways to toggle a pin in the xtiny, xmega, and mega-zero chips.  PORT->OUT, PORT->OUTTGL, PORT->OUTSET/OUTCLR, VPORT->IN (I was pleased that that still works, BTW) VPORT->OUT (via out or via SBI/CBI)...

 

That's not the point here.

 

ALL the other SREGs on mega4809 are in memory space where SBI, CBI, IN, and OUT don't work...

 

 

Do we know which target OP is -- well -- targeting?

(info now added to original post.)

 

Do the same symptoms exist [elsewhere] ?

 Hmm.  Yes, I can get the same behavior for Xmega (requires careful picking of a a peripheral with the right two-bit base address!)

 

I guess that

   0:   68 94           set
   2:   ee 24           eor     r14, r14
   4:   e5 f8           bld     r14, 5
   6:   ff 24           eor     r15, r15
   8:   f2 f8           bld     r15, 2

Is actually somewhat better than

  b8:   0f 2e           mov     r0, r31
  ba:   f0 e6           ldi     r31, 0x60       ; 96
  bc:   ef 2e           mov     r14, r31
  be:   f4 e0           ldi     r31, 0x04       ; 4
  c0:   ff 2e           mov     r15, r31
  c2:   f0 2d           mov     r31, r0

Even if it's so weird looking.   But both seem to be pretty spectacular failures of register allocation, given that this is early in main(), where NONE of the high registers have active contents before the loop.  Sparrow's suggested:

    LDI r31,4

    LDI r30,32

    movw r14,r30

would have worked with ANY high register pair at that point.  (Saving the point in a low register pair is understandable.   How it chose to do that is ... not.)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The $100 question is why would anyone actually use -O1 !?

 

It can't be because of long compile time any more.

 

If the C code don't work at -O2 or -Os I would fix the C code. I have never used -O1.

 

But back to the -O1

Pehaps use :

				 _SFR_MEM8(0x0425)= (1<<5);
				 _SFR_MEM8(0x0426)= (1<<5);

I don't know if they are defined else it's a one time job.

By not have to use displacement (5th and 6th element in the port structure) it can use the X register (You can't get it to use Y :( )

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The $100 question is why would anyone actually use -O1 !?

Microchip has wrapped their "XC8" around a modified avr-gcc, and -O1 is the maximum optimization level it supports without paying.

(Yeah, I know, there are lots of ways to get "original gcc with all the optimization you want."  But...)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
(Yeah, I know, there are lots of ways to get "original gcc with all the optimization you want." But...)

Yes, including a Mac version from Microchip themselves: https://www.microchip.com/mplab/...

You can just add it to MPLAB as an external compiler.