gcc unwanted loop hoisting?


Compiling with -Os, avr-gcc 5.4.0 seems to be making unwanted speed optimizations at the cost of space.  Here's some sample code and the resulting asm:

void sput(char* s)
{
    asm volatile ("nop" : "+e"(s));
}

int main()
{
    do {
        sput("Hi");
        sput("By");
    }while (1);
}

00000030 <main>:
  30:   20 e6           ldi     r18, 0x60       ; 96
  32:   30 e0           ldi     r19, 0x00       ; 0
  34:   83 e6           ldi     r24, 0x63       ; 99
  36:   90 e0           ldi     r25, 0x00       ; 0
  38:   f9 01           movw    r30, r18
  3a:   00 00           nop
  3c:   fc 01           movw    r30, r24
  3e:   00 00           nop
  40:   fb cf           rjmp    .-10            ; 0x38 <main+0x8>

Hoisting the loads outside the loop makes sense for speed, as it results in 2 fewer instructions per loop iteration.  However, I'm using -Os, so the ldi instructions should stay inside the loop, which would drop the two movw instructions and save 4 bytes.  If I remove the while loop, the movw instructions are not used:

00000030 <main>:
  30:   e0 e6           ldi     r30, 0x60       ; 96
  32:   f0 e0           ldi     r31, 0x00       ; 0
  34:   00 00           nop
  36:   e3 e6           ldi     r30, 0x63       ; 99
  38:   f0 e0           ldi     r31, 0x00       ; 0
  3a:   00 00           nop
  

I've read through most of the gcc optimization options, and couldn't find a fix.
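
For readers unfamiliar with the transformation: what the compiler did here is loop-invariant code motion. A minimal source-level sketch in C follows, using hypothetical function names (`fill_naive`, `fill_hoisted`) that are not part of the sample above. It shows the trade: the hoisted form does less work per iteration but has to keep an extra value live across the whole loop, which in the first listing shows up as the two extra movw instructions.

```c
#include <stddef.h>

/* Before: the loop-invariant product scale * offset sits inside the loop. */
long fill_naive(long *out, size_t n, long scale, long offset)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (long)i + scale * offset;  /* invariant recomputed each pass */
        sum += out[i];
    }
    return sum;
}

/* After: roughly what loop-invariant code motion produces. */
long fill_hoisted(long *out, size_t n, long scale, long offset)
{
    const long k = scale * offset;  /* hoisted: computed once, kept live */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (long)i + k;
        sum += out[i];
    }
    return sum;
}
```

Both functions return the same result; only the per-iteration cost and the register pressure differ. Whether the hoisted form is also smaller depends on how cheap the invariant is to recompute, and a two-instruction ldi pair is about as cheap as it gets.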

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


Here are some suggestions:

 

All the -fno-tree-loop-optimize (because -ftree-loop-optimize
decreases performance), the -fno-reorder-blocks (because BB
reordering decreases performance), the -fno-move-loop-invariants
(you guess why) all that asm("":"+r"(var)) hacks to push modern
GCC into the right direction -- it's light years away from the
compiler GCC used to be

 

This is an excerpt from an interesting... rant.

 

(OT: I used the word "rant" for lack of better word in English, I searched, diatribe, tirade, harangue, etc. but I'm not a native speaker and can't find the right word, they are all harsh, which is not the correct meaning. In Portuguese the exact word exists, "desabafo", which is kind of a sad rant where you admit you have become defeated/demoralized/disillusioned/frustrated with something. If such concept/word exists in English, please let me know for future use.) 


Perhaps a fitting word would be lament. (?)

(Or perhaps even that is a bit strong...)

 

David (aka frog_jr)


I was going to suggest wail, but that is probably even stronger. Dirge?

 

Jim

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net

Last Edited: Sun. Mar 15, 2020 - 12:30 AM

El Tangas wrote:

Here are some suggestions:

 

All the -fno-tree-loop-optimize (because -ftree-loop-optimize
decreases performance), the -fno-reorder-blocks (because BB
reordering decreases performance), the -fno-move-loop-invariants
(you guess why) all that asm("":"+r"(var)) hacks to push modern
GCC into the right direction -- it's light years away from the
compiler GCC used to be

 

This is an excerpt from an interesting... rant.

Thanks! -fno-move-loop-invariants did the trick.

Given how bad avr-gcc is, I've actually considered writing an optimizing recompiler for binaries.  While it would be fun, I'm starting to think I'm wasting my skills becoming even more of an expert with 8-bit AVR MCUs.  After I finish a few AVR projects I have on the go I think I'll put more time into honing my ARM skills, and finally pick up a RISC-V dev board to play with.

 

And to add to the rants, I'm looking forward to doing more work on platforms like the STM32 that have consistent register definitions.  Some recent AVR code I was writing that uses the watchdog interrupt wouldn't compile for the t85 despite working perfectly on the t13.  Looking at the datasheets, WDTCR looks the same, and even is at the same address.  Yet bit 6 on the t13 is defined as WDTIE, but on the t85 it's called WDIE!  This is just one of many examples I've seen like this.
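
One common workaround for this kind of naming mismatch is a small compatibility shim. The sketch below is only an illustration: the `#else` branch defines a stand-in value so the fragment also compiles on a host PC, and `WDT_INT_ENABLE_MASK` is a hypothetical name, not something from avr-libc.

```c
/* On a real build the bit names come from <avr/io.h>. The fallback below
   is a stand-in (an assumption) so the sketch also compiles off-target. */
#if defined(__AVR__)
#include <avr/io.h>
#else
#define WDTIE 6  /* stand-in: pretend we are building for the t13 */
#endif

/* Compatibility shim: make both spellings of WDTCR bit 6 available. */
#if defined(WDTIE) && !defined(WDIE)
#define WDIE WDTIE
#elif defined(WDIE) && !defined(WDTIE)
#define WDTIE WDIE
#endif

/* Hypothetical helper constant: mask for the watchdog interrupt enable bit. */
enum { WDT_INT_ENABLE_MASK = 1 << WDIE };
```

With the shim in a shared header, code written against either name should build for both parts, at the cost of maintaining one more list of per-device exceptions.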

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


You can apply the optimization attributes on a per-function basis, when you declare them. I also had this kind of problem:

https://www.avrfreaks.net/commen...

 

Again, that advice was from SprinterSB before he threw in the towel on avr-gcc. But if you start micromanaging gcc code generation like this, it kind of defeats the point of having a compiler.

I think gcc also has some #pragma to apply optimization only to specific segments of code, even finer than the per-function options.
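
A minimal sketch of both mechanisms, assuming GCC (other compilers may ignore or warn about these extensions). The function names `sum_bytes` and `count_set_bits` are made up for illustration. For the `optimize` attribute and pragma, a string that does not start with `O` is treated as an `-f` option, so `"no-move-loop-invariants"` corresponds to `-fno-move-loop-invariants`.

```c
/* Per-function: only this function is built with -fno-move-loop-invariants. */
__attribute__((optimize("no-move-loop-invariants")))
unsigned sum_bytes(const unsigned char *p, unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Region-based: applies to the functions defined between push and pop. */
#pragma GCC push_options
#pragma GCC optimize ("no-move-loop-invariants")
unsigned count_set_bits(unsigned v)
{
    unsigned count = 0;
    while (v) {
        count += v & 1u;  /* add the lowest bit, then shift it out */
        v >>= 1;
    }
    return count;
}
#pragma GCC pop_options
```

As far as I can tell from the GCC documentation, the pragma affects functions defined after it, so its granularity is still whole functions rather than individual statements. The same documentation also describes the `optimize` attribute as a debugging aid rather than something for production code, so passing the flag on the command line is probably the safer route when the whole project wants it.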

 

Regarding my OT, I think "lament" is indeed closer to what I was looking for, thanks.

 

edit: "bemoan" from the next post is close, too. If it could be used as a noun instead of verb, it would be very close. "Outburst" as suggested by google is too strong (I tried it too).

Last Edited: Sun. Mar 15, 2020 - 12:19 PM

I wouldn't have expected this. Compiling with -Og gives:

sput(char*):
        mov r30,r24
        mov r31,r25
        nop
        ret
.LC0:
        .string "Hi"
.LC1:
        .string "By"
main:
.L3:
        ldi r24,lo8(.LC0)
        ldi r25,hi8(.LC0)
        rcall sput(char*)
        ldi r24,lo8(.LC1)
        ldi r25,hi8(.LC1)
        rcall sput(char*)
        rjmp .L3

 

ralphd wrote:
Given how bad avr-gcc is,

I wanted to run your example through some of the other targets on https://godbolt.org/ but because of your AVR-specific constraint, it won't compile. Though changing your example so that it does compile on various targets would defeat the object of the test.
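
For anyone who wants to try anyway, here is one possible sketch that keeps the AVR build identical and only swaps the constraint elsewhere. The `PTR_CONSTRAINT` macro and the `run_forever` name are inventions for this example, and the generic `"r"` fallback is an assumption: it allows any general register, so the register allocation on other targets is not directly comparable. The return value is added purely so the function can be exercised in a test.

```c
/* "e" is an AVR-only constraint (a pointer register pair: X, Y or Z).
   The "r" fallback is an assumption made so that other targets build. */
#if defined(__AVR__)
#define PTR_CONSTRAINT "+e"
#else
#define PTR_CONSTRAINT "+r"
#endif

const char *sput(const char *s)
{
    /* The empty-bodied nop forces s into a register, as in the original. */
    __asm__ __volatile__ ("nop" : PTR_CONSTRAINT (s));
    return s;  /* returned only so the effect is observable */
}

/* Same shape as the original main(): an endless loop with two calls. */
void run_forever(void)
{
    for (;;) {
        sput("Hi");
        sput("By");
    }
}
```

Compiling this with -Os on a few targets and looking at the body of `run_forever` should show whether the address loads get hoisted there too.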

 

I think you're being a bit hard on the compiler here and your whinge (oh, that's a good word for desabafo) should include comparisons against the competing compilers (Codevision, IAR ???).


Oh, and for other translations of desabafo how about:

  • Bemoan
  • Plaint
  • Fret
  • Carp
  • Outburst {from google translate}

 

Last Edited: Sun. Mar 15, 2020 - 11:08 AM

ralphd wrote:
... Given how bad avr-gcc is, I've actually considered writing an optimizing recompiler for binaries.  While it would be fun, I'm starting to think I'm wasting my skills becoming even more of an expert with 8-bit AVR MCUs....

I think you acted wisely. After 15 years of using avr-gcc, I finally had had enough, and wrote a function-level optimizer that refines the assembler code output by the compiler. It worked extremely well with avr-gcc 3.4.6, but may not help much with later gcc versions, which increasingly replace missed optimizations that are easy to spot and fix with more ornery mis-optimizations. It was a rather involved project.

 

For your amusement, here are the changes it makes to the .s files of a ca. 1K ATtiny24 application:

[butler:trunk/Firmware/AVR]% make
avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -x c -S ../flash.c -o - | \
	    ./optimizer.py -c -mcu=attiny24 -d function_args.txt > ../flash.s
======================================================= 82, delete_needless_tst
          or r18,r19
  .L7:
  .LM14:
-         tst r18
          brne .L7
  .L6:
  .LM15:
------
          or r18,r19
  .L7:
  .LM14:
+
          brne .L7
  .L6:
  .LM15:

======================================================== 40-46, use_fancier_lpm
  .LBB2:
  .LM6:
  /* #APP */
-         lpm
-         mov r24,r0

  /* #NOAPP */
  .LBE2:
  .LM7:
-         adiw r30,1
  .LM8:
  /* #APP */
          eor r18,r24
------
  .LBB2:
  .LM6:
  /* #APP */
+
+         

  /* #NOAPP */
  .LBE2:
  .LM7:
+         lpm r24,z+
  .LM8:
  /* #APP */
          eor r18,r24

avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -x c -S ../main.c -o - | \
	    ./optimizer.py -c -mcu=attiny24 -d function_args.txt > ../main.s
=============================================== 83-119, shed_returned_byte_move
          rcall initialize_hardware
  .LM18:
          rcall receive_serial_nibble_finish
-         mov r18,r24
  .LM19:
          clr r25
          sbrs r24,5
          ...
  /* #APP */
          wdr
  /* #NOAPP */
-         ldi r24,lo8(24)
-         out 65-0x20,r24
-         ldi r24,lo8(13)
-         out 65-0x20,r24
  .LM24:
          cbi 59-0x20,4
  .LM25:
          ...
  .LM27:
  .LBE4:
  .LM28:
-         mov r17,r18
          andi r17,lo8(15)
  .LM29:
          rcall receive_serial_nibble
------
          rcall initialize_hardware
  .LM18:
          rcall receive_serial_nibble_finish
+
  .LM19:
          clr r25
          sbrs r24,5
          ...
  /* #APP */
          wdr
  /* #NOAPP */
+         ldi r18,lo8(24)
+         out 65-0x20,r18
+         ldi r18,lo8(13)
+         out 65-0x20,r18
  .LM24:
          cbi 59-0x20,4
  .LM25:
          ...
  .LM27:
  .LBE4:
  .LM28:
+         mov r17,r24
          andi r17,lo8(15)
  .LM29:
          rcall receive_serial_nibble

=============================================== 127-142
          rcall test_flash
  .LM31:
          rcall receive_serial_nibble_finish
-         mov r18,r24
  .LM32:
          clr r25
          sbrs r24,4
          ...
          rjmp .L20
  .L12:
  .LM35:
-         swap r18
-         andi r18,0xf0
  .LM36:
-         or r17,r18
  .LM37:
          subi r28,lo8(-(18947))
          sbci r29,hi8(-(18947))
------
          rcall test_flash
  .LM31:
          rcall receive_serial_nibble_finish
+
  .LM32:
          clr r25
          sbrs r24,4
          ...
          rjmp .L20
  .L12:
  .LM35:
+         swap r24
+         andi r24,0xf0
  .LM36:
+         or r17,r24
  .LM37:
          subi r28,lo8(-(18947))
          sbci r29,hi8(-(18947))

========================================== 85, eliminate_needless_int_promotion
          rcall receive_serial_nibble_finish

  .LM19:
-         clr r25
          sbrs r24,5
          rjmp .L8
  .LM20:
------
          rcall receive_serial_nibble_finish

  .LM19:
+
          sbrs r24,5
          rjmp .L8
  .LM20:

========================================== 129
          rcall receive_serial_nibble_finish

  .LM32:
-         clr r25
          sbrs r24,4
          rjmp .L12
  .LM33:
------
          rcall receive_serial_nibble_finish

  .LM32:
+
          sbrs r24,4
          rjmp .L12
  .LM33:

avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -x c -S ../serial.c -o - | \
	    ./optimizer.py -c -mcu=attiny24 -d function_args.txt > ../serial.s
../serial.c: In function `receive_serial_nibble':
../serial.c:383: warning: comparison is always true due to limited range of data type
../serial.c:395: warning: comparison is always true due to limited range of data type
======================================================== 70-71, movw_for_2x_mov
          rjmp .L5
  .L10:
  .LM12:
-         mov r27,r21
-         mov r26,r20
          sbiw r26,2
  .LM13:
          ldi r23,lo8(47)
------
          rjmp .L5
  .L10:
  .LM12:
+
+         movw r26,r20
          sbiw r26,2
  .LM13:
          ldi r23,lo8(47)

======================================================== 117-118
          ori r30,lo8(1)
  .L19:
  .LM22:
-         mov r25,r19
-         mov r24,r18
          subi r24,lo8(-(-120))
          sbci r25,hi8(-(-120))
          cpi r24,178
------
          ori r30,lo8(1)
  .L19:
  .LM22:
+
+         movw r24,r18
          subi r24,lo8(-(-120))
          sbci r25,hi8(-(-120))
          cpi r24,178

======================================================== 125-126
          cpc r25,r1
          brlo .L44
  .LM23:
-         mov r25,r19
-         mov r24,r18
          subi r24,lo8(-(-443))
          sbci r25,hi8(-(-443))
          subi r24,lo8(365)
------
          cpc r25,r1
          brlo .L44
  .LM23:
+
+         movw r24,r18
          subi r24,lo8(-(-443))
          sbci r25,hi8(-(-443))
          subi r24,lo8(365)

======================================================== 204-205
          sbrc r0,4
          rjmp .L5
  .LM40:
-         mov r18,r20
-         mov r19,r21
          sub r18,r26
          sbc r19,r27
  .LM41:
------
          sbrc r0,4
          rjmp .L5
  .LM40:
+
+         movw r18,r20
          sub r18,r26
          sbc r19,r27
  .LM41:

=================================================== 80-85, movw_for_sub_out_sub
  .LM15:
          ldi r30,lo8(14)
  .LM16:
-         subi r20,lo8(-(924))
-         sbci r21,hi8(-(924))
-         out (74)+1-0x20,r21
-         out 74-0x20,r20
-         subi r20,lo8(-(-924))
-         sbci r21,hi8(-(-924))
          rjmp .L43
  .L13:
  .LM17:
------
  .LM15:
          ldi r30,lo8(14)
  .LM16:
+
+         movw r18,r20
+         subi r18,lo8(-(924))
+         sbci r19,hi8(-(924))
+         out (74)+1-0x20,r19
+         out 74-0x20,r18
          rjmp .L43
  .L13:
  .LM17:

=================================================== 89-94
          rjmp .L43
  .L13:
  .LM17:
-         subi r20,lo8(-(299))
-         sbci r21,hi8(-(299))
-         out (74)+1-0x20,r21
-         out 74-0x20,r20
-         subi r20,lo8(-(-299))
-         sbci r21,hi8(-(-299))
  .L43:
          sbi 43-0x20,1
  .L12:
------
          rjmp .L43
  .L13:
  .LM17:
+
+         movw r18,r20
+         subi r18,lo8(-(299))
+         sbci r19,hi8(-(299))
+         out (74)+1-0x20,r19
+         out 74-0x20,r18
  .L43:
          sbi 43-0x20,1
  .L12:

=================================================== 187-192
          andi r24,lo8(-17)
          out 91-0x20,r24
  .LM37:
-         subi r20,lo8(-(67))
-         sbci r21,hi8(-(67))
-         out (74)+1-0x20,r21
-         out 74-0x20,r20
-         subi r20,lo8(-(-67))
-         sbci r21,hi8(-(-67))
          sbi 43-0x20,1
  .LM38:
  /* #APP */
------
          andi r24,lo8(-17)
          out 91-0x20,r24
  .LM37:
+
+         movw r18,r20
+         subi r18,lo8(-(67))
+         sbci r19,hi8(-(67))
+         out (74)+1-0x20,r19
+         out 74-0x20,r18
          sbi 43-0x20,1
  .LM38:
  /* #APP */

=================================================== 377-382
          in r19,(76)+1-0x20
  .L59:
  .LM71:
-         subi r18,lo8(-(2538))
-         sbci r19,hi8(-(2538))
-         out (74)+1-0x20,r19
-         out 74-0x20,r18
-         subi r18,lo8(-(-2538))
-         sbci r19,hi8(-(-2538))
          sbi 43-0x20,1
  .LM72:
  /* #APP */
------
          in r19,(76)+1-0x20
  .L59:
  .LM71:
+
+         movw r20,r18
+         subi r20,lo8(-(2538))
+         sbci r21,hi8(-(2538))
+         out (74)+1-0x20,r21
+         out 74-0x20,r20
          sbi 43-0x20,1
  .LM72:
  /* #APP */

================================================ 317-348, shed_return_byte_move
          sleep
  .LM61:
  /* #NOAPP */
-         in r19,52-0x20
  .LBB7:
  .LM62:
  /* #APP */
          sbis 25,7
          inc r1
-         mov r24,r1
          rcall delay_10_cycles
          sbis 25,7
-         inc r24
          clr r1
          rcall delay_10_cycles
          sbis 25,7
-         inc r24

  /* #NOAPP */
  .LBE7:
-         sbrc r24,1
  .LM63:
-         ori r19,lo8(8)
  .L47:
  .LM64:
          in r18,51-0x20
-         in r24,74-0x20
-         in r25,(74)+1-0x20
-         add r24,r18
-         adc r25,r1
-         sbiw r24,4
-         out (74)+1-0x20,r25
-         out 74-0x20,r24
  .LM65:
-         mov r24,r19
          clr r25
  /* epilogue: frame size=0 */
          ret
------
          sleep
  .LM61:
  /* #NOAPP */
+         in r24,52-0x20
  .LBB7:
  .LM62:
  /* #APP */
          sbis 25,7
          inc r1
+         mov r26,r1
          rcall delay_10_cycles
          sbis 25,7
+         inc r26
          clr r1
          rcall delay_10_cycles
          sbis 25,7
+         inc r26

  /* #NOAPP */
  .LBE7:
+         sbrc r26,1
  .LM63:
+         ori r24,lo8(8)
  .L47:
  .LM64:
          in r18,51-0x20
+         in r26,74-0x20
+         in r27,(74)+1-0x20
+         add r26,r18
+         adc r27,r1
+         sbiw r26,4
+         out (74)+1-0x20,r27
+         out 74-0x20,r26
  .LM65:
+
          clr r25
  /* epilogue: frame size=0 */
          ret

========================================= 349, eliminate_needless_int_promotion
          out 74-0x20,r26
  .LM65:

-         clr r25
  /* epilogue: frame size=0 */
          ret
  /* epilogue end (size=1) */
------
          out 74-0x20,r26
  .LM65:

+
  /* epilogue: frame size=0 */
          ret
  /* epilogue end (size=1) */

avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -x c -S avr_funcs.c -o - | \
	    ./optimizer.py -c -mcu=attiny24 -d function_args.txt > avr_funcs.s
cc -o ../Util/crcval ../Util/crcval.c
avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -o msi.elf -Wl,--section-start=init0=0x00 -Wl,--section-start=.text=0x3C -Wl,--section-start=crc_value=0x7FA ../flash.o init.o ../main.o ../serial.o avr_funcs.o crc_value.o

 

 

 

 

- John

Last Edited: Sun. Mar 15, 2020 - 11:52 AM

N.Winterbottom wrote:

I wouldn't have expected this. Compiling with -Og gives:

sput(char*):
        mov r30,r24
        mov r31,r25
        nop
        ret
.LC0:
        .string "Hi"
.LC1:
        .string "By"
main:
.L3:
        ldi r24,lo8(.LC0)
        ldi r25,hi8(.LC0)
        rcall sput(char*)
        ldi r24,lo8(.LC1)
        ldi r25,hi8(.LC1)
        rcall sput(char*)
        rjmp .L3

 

ralphd wrote:
Given how bad avr-gcc is,

I wanted to run your example through some of the other targets on https://godbolt.org/ but because of your AVR-specific constraint, it won't compile. Though changing your example so that it does compile on various targets would defeat the object of the test.

 

I think you're being a bit hard on the compiler here and your whinge (oh, that's a good word for desabafo) should include comparisons against the competing compilers (Codevision, IAR ???).


 

Before I read your comment, I was planning to test -Og.  The docs state several optimizations are disabled with -Og including -fmove-loop-invariants.

 

I've been comparing avr-gcc's output to other targets (mainly x86 & ARM) for years.  I think gjlay's post on the gcc list is rather damning.

As for limiting the comparison to other AVR compilers, that sounds like the "it could be worse" logical fallacy.  And by implying that other AVR compilers aren't any better, you are reinforcing my concern that 8-bit AVR development may be a dead-end.

 

 

 

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


Ralph, instead of complaining and fighting with the compiler, why not lend your expertise to making it a better gcc, then we all can benefit from your knowledge!

 

Jim

 

 

 

 

 


jfiresto wrote:

ralphd wrote:
... Given how bad avr-gcc is, I've actually considered writing an optimizing recompiler for binaries.  While it would be fun, I'm starting to think I'm wasting my skills becoming even more of an expert with 8-bit AVR MCUs....

I think you acted wisely. After 15 years of using avr-gcc, I finally had had enough, and wrote a function-level optimizer that refines the assembler code output by the compiler. It worked extremely well with avr-gcc 3.4.6, but may not help much with later gcc versions, which increasingly replace missed optimizations that are easy to spot and fix with more ornery mis-optimizations. It was a rather involved project.

 

For your amusement, here are the changes it makes to the .s files of a ca. 1K ATtiny24 application:

[butler:trunk/Firmware/AVR]% make
avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -x c -S ../flash.c -o - | \
	    ./optimizer.py -c -mcu=attiny24 -d function_args.txt > ../flash.s
======================================================= 82, delete_needless_tst
          or r18,r19
  .L7:
  .LM14:
-         tst r18
          brne .L7
  .L6:
  .LM15:

...

 

Wow, that's impressive!

I've also wondered how much work it would be to write a basic C compiler for embedded systems.  I wouldn't bother with a linker, and just compile all sources straight to executable code.  With no linker and therefore no ABI to follow, the compiler would be free to make much more intelligent choices for allocating registers.  It's rather wasteful on modern RISC-like architectures with large register sets to chain yourself to an ABI that makes half those registers unusable because they are assumed callee-clobbered.

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


If the compiler was really clever, main would have consisted of 4 LDIs, 2 NOPs and a RJMP.

Saves time and space.

Iluvatar is the better part of Valar.


ki0bk wrote:

Ralph, instead of complaining and fighting with the compiler, why not lend your expertise to making it a better gcc, then we all can benefit from your knowledge!

 

Because I know I'd end up just as frustrated as gjlay, if not more so.  To fix all the issues would require a rewrite from scratch, which is too big a task for me.  Even making some small, useful contributions to gcc would mean overcoming lots of hurdles.  Here are just a few:

1) Getting better at custom gcc builds.  I've done some custom gcc builds in the past, and didn't find it fun.  It also would mean getting past my distaste for autoconf/M4.

2) Learning gcc internals like GIMPLE, which I have no interest in doing.

3) Lack of support resources.  When I've worked on large software projects (gcc is supposedly >10M lines of code), I could focus on what I find interesting.  Other people, like testers, documentation writers, etc did the things I don't want to do.

 

Finally, I'm not good at finishing big, long-term projects, be they in software or otherwise.  There are too many other rewarding things to do in my life that will reap bigger benefits in a shorter period of time.

 

 

 

 

 

 

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


skeeve wrote:

If the compiler was really clever, main would have consisted of 4 LDIs, 2 NOPs and a RJMP.

Saves time and space.

 

Which is what it does when I add -fno-move-loop-invariants.  And when my target is a t13 (or any other with <256 bytes of address space), it could dispense with the LDI for the high byte, but it doesn't.

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


Why not write your program in AVR ASM?  It's quick, easy, and then works exactly like you want.  It also maintains the balance of speed/size performance determined by you.

 

 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!


avrcandies wrote:

Why not write your program in AVR ASM?  It's quick, easy, and then works exactly like you want.  It also maintains the balance of speed/size performance determined by you.

 

I'm writing an Arduino core, and am writing some of it in asm.  I also want existing C/C++ code to build and run efficiently.

https://github.com/nerdralph/pic...

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


You didn’t get rich from bitcoin mining?


Can you do MIT license with the Arduino (or, more specifically, Wiring) function names? Is that not one angle of what a derivative is? I am asking because this is an area I do not understand.

my projects: https://github.com/epccs

Debugging is harder than programming - don’t write code you can’t debug! https://www.avrfreaks.net/forum/help-it-doesnt-work


ron_sutherland wrote:

Can you do MIT license with the Arduino (or, more specifically, Wiring) function names? Is that not one angle of what a derivative is? I am asking because this is an area I do not understand.

 

I don't really care if Lord Banzi tried to sue me for copyright infringement, in fact I'd be amused if he tried.  And although it's getting off-topic, I seem to recall a few old court cases in the US where API function names were found not copyrightable.

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


I seem to recall a few old court cases in the US where API function names were found not copyrightable.

 https://en.wikipedia.org/wiki/Go...

Although I believe that was/is WRT duplicating APIs to proprietary code.  I don't know how it'd apply to OSSW licenses.  An interesting question.

It's not something that came up in any of the Wiring vs Arduino.com vs Arduino.cc battles, as far as I know.

Energia has a bunch of code implementing Arduino APIs with a TI (non-gnu) license.

ChipKit seems to have moved their stuff to the Apache license at top level, though many of the core files retain LGPL/etc.

 


ralphd wrote:
skeeve wrote:

If the compiler was really clever, main would have consisted of 4 LDIs, 2 NOPs and a RJMP.

Saves time and space.

 

Which is what it does when I add -fno-move-loop-invariants.  And when my target is a t13 (or any other with <256 bytes of address space), it could dispense with the LDI for the high byte, but it doesn't.

I should have been more explicit:

The time savings would come from keeping the LDIs out of the loop.

Iluvatar is the better part of Valar.

00000030 <main>:
  30:   20 e6           ldi     r18, 0x60       ; 96
  32:   30 e0           ldi     r19, 0x00       ; 0
  34:   83 e6           ldi     r24, 0x63       ; 99
  36:   90 e0           ldi     r25, 0x00       ; 0
  38:   f9 01           movw    r30, r18
  3a:   00 00           nop
  3c:   fc 01           movw    r30, r24
  3e:   00 00           nop
  40:   fb cf           rjmp    .-10            ; 0x38 <main+0x8>

 

An interesting question is whether it still outputs code like this when the surrounding code is more complex (and those extra registers it is using become more useful for other tasks.)

I've noticed a definite trend toward it being very difficult to analyze compiler performance from "simple examples" these days :-(  Even the Arduino code will emit things like "special version of digitalWrite() for the case where the first argument is 13."  Sometimes.  (Until -Os says maybe there should only be one copy of digitalWrite?)  (Unfortunately, there's so much other bloat in the Arduino functions that that level of optimization doesn't help much.)

 

BTW, I'm not sure how "interesting" chips like the Tiny13 and Tiny25 are these days, given the cheaper and more powerful Tiny0 and Tiny1 chips.  Not that those cores are well optimized for size, either.

(and I'm not sure you'd get the same "rate of return" on megaTiny efforts.  Even Very Clever bit-bang routines have trouble competing with hardware peripherals.)

 

 

 

Last Edited: Mon. Mar 16, 2020 - 06:29 AM

ralphd wrote:

jfiresto wrote:
... For your amusement, here are the changes it makes to the .s files of a ca. 1K ATtiny24 application:

[butler:trunk/Firmware/AVR]% make
avr-gcc-3.4.6 -g -Wall -mmcu=attiny24 -nostartfiles -nodefaultlibs -Os -DAVR -DOSCCAL_VAL=`awk '/^001 /{print $2}' osccal.txt` -x c -S ../flash.c -o - | \
	    ./optimizer.py -c -mcu=attiny24 -d function_args.txt > ../flash.s
======================================================= 82, delete_needless_tst
          or r18,r19
  .L7:
  .LM14:
-         tst r18
          brne .L7
  .L6:
  .LM15:

...

Wow, that's impressive!

I've also wondered how much work it would be to write a basic C compiler for embedded systems.  I wouldn't bother with a linker, and just compile all sources straight to executable code.  With no linker and therefore no ABI to follow, the compiler would be free to make much more intelligent choices for allocating registers....

It was an interesting project and a lot of work, but I learned quite a bit. Writing an optimizer is a "baby problem" for writing a compiler, since you have to address some of the same issues, such as live register analysis. Compilers are some of the most complex works of engineering you will find. Managing that complexity is perhaps the biggest challenge and why people like to compile in stages and create intermediate products. I focused on adding just another, pretty simple phase, executed with Python, but even that took some organizing to stay on an even keel. One thing I found very helpful was to express the avr-optimization functions using an "internal" domain specific language (DSL). I would quote you an example but the code editor collapses everything to a single, very long line. :(
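
To give a flavour of what one such rule involves, here is a rough sketch in C of the pair-fusing check behind a rule like movw_for_2x_mov from the log earlier in the thread. This is not the optimizer's actual code (which was written in Python with its own DSL); the `Mov` type and `fuse_mov_pair` function are hypothetical, and the sketch assumes the target core has the movw instruction.

```c
#include <stdbool.h>

/* One "mov rD,rS" instruction, reduced to its register numbers (0..31). */
typedef struct {
    int dst;
    int src;
} Mov;

/*
 * Hypothetical peephole rule: if two adjacent movs copy an even-aligned
 * register pair, report the single movw that could replace them.
 * Returns true and stores the low registers of each pair in *movw.
 */
bool fuse_mov_pair(Mov first, Mov second, Mov *movw)
{
    Mov lo, hi;

    /* The two movs may appear in either order. */
    if (first.dst == second.dst + 1) {
        hi = first;
        lo = second;
    } else if (second.dst == first.dst + 1) {
        hi = second;
        lo = first;
    } else {
        return false;
    }

    /* movw needs even-numbered low registers within r0..r30. */
    if (lo.dst < 0 || lo.dst > 30 || lo.src < 0 || lo.src > 30)
        return false;
    if (lo.dst % 2 != 0 || lo.src % 2 != 0)
        return false;

    /* The high halves must line up with the low halves. */
    if (hi.src != lo.src + 1)
        return false;

    *movw = lo;
    return true;
}
```

For example, `mov r27,r21` followed by `mov r26,r20` fuses to `movw r26,r20`. A real pass would also have to confirm that no label or branch target sits between the two instructions, and rules that rename registers need live register analysis on top of pattern checks like this one.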

 

- John

Last Edited: Mon. Mar 16, 2020 - 03:36 PM

westfw wrote:

An interesting question is whether it still outputs code like this when the surrounding code is more complex (and those extra registers it is using become more useful for other tasks.)

I've noticed a definite trend toward it being very difficult to analyze compiler performance from "simple examples" these days :-(  Even the Arduino code will emit things like "special version of digitalWrite() for the case where the first argument is 13."  Sometimes.  (Until -Os says maybe there should only be one copy of digitalWrite?)  (Unfortunately, there's so much other bloat in the Arduino functions that that level of optimization doesn't help much.)

 

BTW, I'm not sure how "interesting" chips like the Tiny13 and Tiny25 are these days, given the cheaper and more powerful Tiny0 and Tiny1 chips.  Not that those cores are well optimized for size, either.

(and I'm not sure you'd get the same "rate of return" on megaTiny efforts.  Even Very Clever bit-bang routines have trouble competing with hardware peripherals.)

 

Good point regarding the compiler doing different things depending on the code context.  I find it makes explaining compiler optimization bugs more difficult, as it is often not possible to create a small & simple example.

 

As for the popularity of the 8-pin tinies, there's likely some personal bias, as I did most of my early AVR development for the t85 (programming it with a parallel port programmer I wired up for avrdude).  Based on the contributions I've been making to MicroCore, there still seem to be people interested in the 8-pin tinies.  And while I agree there are many more cheap and interesting MCUs to choose from, I find my existing collection of 8-bit AVRs is still quite useful for basic electronics projects.  For example, I built a couple of multi-GPU rigs for doing OpenCL crypto and password cracking work.  I used some old server PSUs that require a PWM signal to control fan speed.  After an hour of work, I had a t13 with a pot hooked up to the PSU control port using some dupont cables.

 

 

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


ralphd wrote:
STM32 that have consistent register definitions

Hmmm ... to an extent.

 

There's only so far you can go before "consistent" becomes an irksome constraint ...

 


 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

BTW, if you want to learn a bit about the optimizers in modern compilers here is an enjoyable, at least to me, blog post.

Static Single Assignment for Functional Programmers

You may want to scroll down about 1/3 of the way to the section "the machine tribe in two sentences".

(FWIW, I do a lot of coding in the Scheme dialect of lisp.)


MattRW wrote:

BTW, if you want to learn a bit about the optimizers in modern compilers here is an enjoyable, at least to me, blog post.

Static Single Assignment for Functional Programmers

You may want to scroll down about 1/3 of the way to the section "the machine tribe in two sentences".

(FWIW, I do a lot of coding in the Scheme dialect of lisp.)

 

I've never been a fan of lisp.  I browsed over the article, and didn't get much from it.  I don't really have any interest in stack machines, as RISC architectures with large register sets are what dominates now.

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 


ralphd wrote:

MattRW wrote:

BTW, if you want to learn a bit about the optimizers in modern compilers here is an enjoyable, at least to me, blog post.

Static Single Assignment for Functional Programmers

You may want to scroll down about 1/3 of the way to the section "the machine tribe in two sentences".

(FWIW, I do a lot of coding in the Scheme dialect of lisp.)

 

I've never been a fan of lisp.  I browsed over the article, and didn't get much from it.  I don't really have any interest in stack machines, as RISC architectures with large register sets are what dominates now.

 

 

I didn't notice anything about stack machines in there.  The article introduces SSA, which is the intermediate form used by most compilers in their optimization stages, including gcc (i.e., its GIMPLE language, as I interpret it).


I think Ralph might be confusing Lisp with Forth.......

 

And the RISC architectures with large register sets have faded into the background - e.g., SPARC and Am29000 come to mind.

 


RISC-V has 32 registers (or 16 for embedded)

 

https://en.wikipedia.org/wiki/RISC-V#Register_sets

 

And the instruction format is oddly familiar, almost friendly looking. ARM fills me with a strong desire to dig a deep hole and bury all my projects in it.

my projects: https://github.com/epccs

Debugging is harder than programming - don’t write code you can’t debug! https://www.avrfreaks.net/forum/help-it-doesnt-work


I'd call 192 registers 'large', 32 registers 'average'.  It was found that managing large numbers of registers became cumbersome and didn't aid performance except in very specific cases.

Last Edited: Sun. Mar 22, 2020 - 04:40 AM