avr-gcc 4.6.1 and Link Time Optimization on Windows

Hi, I just did a Win32-build of avr-gcc-4.6.1 (release candidate) and am playing around with it.

Trying to use link time optimization (LTO), I run into ld complaining about "unrecognized option -plugin", similar to what is reported in
http://sourceware.org/PR12742
which was filed against binutils 2.20.

My build uses binutils 2.21.

Did anyone ever try to use LTO with avr-gcc on Windows?

If anyone feels inclined to play around with Windows avr-gcc-4.6.1, too, I can post a link to get it together with some notes.

The build is just intended to get a feeling for the quality of 4.6.1.

avrfreaks does not support Opera. Profile inactive.


I seem to recall I tried once. Didn't get very far, but now I don't remember why...

So, what's your feeling of quality of 4.6.1?


EW wrote:
I seem to recall I tried once. Didn't get very far, but now I don't remember why...
In my case it's missing dlopen support in the build environment I use. I tried to hack binutils without success (there is a dlopen wrapper on the web), so I will have to extend the build environment.

Quote:
So, what's your feeling of quality of 4.6.1?
It's still outperformed by 3.4.6, similar to any avr-gcc 4 I ever tried so far. For some sources it produces code +15% in size.

One of the main reasons appears to be PR46278 (fake X+const addressing). Theoretically, it's easy to solve but in AVR-practice it's very hard. Seems that Denis is working on it.

Moreover, PR46779/PR45291 (missing reloads for fp subregs) is not yet fixed so that I would not use it in a real application (similar applies to 4.5 and trunk). I have a patch to review as you know but it's got stuck somewhere in the pipe...

To test I am using my "Asteroids+Snake on Scope Clock" application. I have no other larger applications to test. And at the moment I feel more inclined to develop avr-gcc than to develop using avr-gcc. :)


SprinterSB wrote:
If anyone feels inclined to play around with Windows avr-gcc-4.6.1, too, I can post a link to get it together with some notes.

The next week or so is looking pretty crazy, but following that I should be free to do some looking. I've no particularly large projects to test against, but, then, variety helps.

If you are up to getting some notes together I will certainly add my two cents. I am especially interested in the possibilities provided by LTO ( assuming it can be made to work ).

Martin Jay McKee

As with most things in engineering, the answer is an unabashed, "It depends."


mckeemj wrote:
SprinterSB wrote:
If anyone feels inclined to play around with Windows avr-gcc-4.6.1, too, I can post a link to get it together with some notes.
The next week or so is looking pretty crazy, but following that I should be free to do some looking. I've no particularly large projects to test against, but, then, variety helps.

If you are up to getting some notes together I will certainly add my two cents.

OK, here it is:

Sources

"Installing"
  • It's just a zip archive that I dropped at sourceforge: http://sourceforge.net/projects/... . It's not self-extracting, so you need (un)zip. The size of the zip is 26MB; the size on disc is 76MB.
  • Unpack to a place you prefer. The zip will unpack to a directory "gcc-4.6.1-mingw32". You can rename it if you like.
  • To "uninstall" just remove that directory.
Using
  • To use the tools you can simply give the absolute path like c:\gcc-4.6.1-mingw32\bin\avr-gcc[.exe] or change PATH. You can add it to the end of PATH and call avr-gcc-4.6.1, so it does not matter if avr-gcc is hidden behind your favourite avr-gcc.
Please let me know if you get it to work or run into problems.
mckeemj wrote:
I am especially interested in the possibilities provided by LTO ( assuming it can be made to work ).
I'll have to extend the build environment to get LTO to work. At the moment it works only on the GCC side; the linker fails to load the plugin that calls back into GCC at link time. The easiest way to benefit from LTO is a native build on Linux; a native Cygwin build should work, too.

Actually LTO is not an optimization at link time. GCC writes intermediate representation (IR) to the object file. At link time, the linker makes a callback to GCC. GCC then deserializes the IR and does the compilation again, just based on the IR. So it's similar to giving all the sources on the command line - cum grano salis, e.g. a library can also contain LTO info.

Therefore, LTO is rather "recompile at link time" than "optimize at link time". This means that the original assembler s-files and code in o-object files is paper-waste...


Last Edited: Fri. Oct 21, 2011 - 04:38 PM

EW wrote:
So, what's your feeling of quality of 4.6.1?
Moreover, -mint8 is consistently broken in all versions 4.5+


SprinterSB wrote:
EW wrote:
So, what's your feeling of quality of 4.6.1?
Moreover, -mint8 is consistently broken in all versions 4.5+
Doesn't sound like a great loss... of course anything in "production" should issue a legible error message and a clear note in the documentation...

0.02E

JW


mckeemj wrote:
If you are up to getting some notes together I will certainly add my two cents. I am especially interested in the possibilities provided by LTO ( assuming it can be made to work ).

I gave it a second try and added the dlopen wrapper to the build environment. The compiler builds fine but ld fails to operate correctly on plugin, i.e. it cannot correctly dlclose or unload the lib. So I stopped trying to get LTO to work in canadian cross. Perhaps the dlfcn-win32 I found on the web is just garbage...

AFAIK Eric builds under cygwin, so maybe the next WinAVR release (4.6.2?) comes with functional LTO.


SprinterSB wrote:
The compiler builds fine but ld fails to operate correctly on plugin,

Thanks for the short write up, BTW, I should be able to get started with some testing this weekend. As to LTO, well, it's a shame it seems to be something of a pain right now; but if I am having an especially slow day perhaps I'll take a look at the compilation myself.

On a slightly different tack... I have my own list, but does anyone have anything specific they would like me to check in 4.6.1? It sounds like code size hasn't improved ( shame... ) but there may be more subtle things.

SprinterSB wrote:
Moreover, -mint8 is consistently broken in all versions 4.5+

I agree with Jan that this option is unlikely to really matter in most cases. Still, I don't like the idea of something so "available" to break code with. I can see plenty of people complaining about problems when they are "optimizing" - or not, I've been known to be overly pessimistic at times as well.

Martin Jay McKee


SprinterSB wrote:

AFAIK Eric builds under cygwin, so maybe the next WinAVR release (4.6.2?) comes with functional LTO.

Hi Johann,

Actually I build under MinGW for the most part (for sure for binutils, gcc, avr-libc). Right now I'm looking at using gcc 4.6.1, unless there will be a 4.6.2 release in the next month. Also, I'll be using the latest binutils release, 2.21.1.


Findings. Lengthy, boring, and touches various issues, not necessarily relevant to the original post.

I took one of the projects I am working on and tried to run it through 4 different builds: my "production" toolchain is WinAVR20071221 (yes, this project runs for quite some time ;-) ), WinAVR20100110, AVR Toolchain (should be identical to AS5) and the "4.6.1 preview" from above. There are 3 processors in the device, an ATMega2561 and two ATMega8. Various stuff is involved: heavy use of bitfields, extensive use of inline assembler/macros, use of the 64kW+ FLASH memory stuff, various mods on the linker script, algorithmically intensive routines, unusually lengthy (partially generated) functions. On the other hand, library use is minimized and function pointers use (which appears to trigger a couple of problems in the 64kW+ devices) is also minimal.

While doing this, and testing the result for expected functionality, I discovered two flaws in our "standard" inline assembler macros (circular buffer manipulation), which were revealed "thanks" to the more aggressive optimization of the newer versions of avr-gcc, so the exercise was worth doing. (One was an "r" constraint used where "d" should have been; the other was a register assigned through an input parameter, seen by the compiler as constant, while it was changed in the inline asm.)
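Both macro flaws are easy to reproduce in a few lines. The following is only a minimal sketch, assuming a 16-entry circular buffer and a hypothetical helper name `ring_advance` (neither comes from the original macros). `andi` can only encode r16..r31, so its operand needs the "d" constraint, and because the asm changes the register it must be declared read-write ("+d") rather than as a plain input. A plain C fallback is included so the snippet also compiles off-target.

```c
#include <stdint.h>

/* Hypothetical helper: advance a ring-buffer index and wrap it at 16 entries. */
static inline uint8_t ring_advance(uint8_t idx)
{
    idx++;
#ifdef __AVR__
    /* "+d": read-write operand in an upper register (r16..r31).
       "r" could allocate r0..r15, which ANDI cannot encode, and a plain
       input operand would hide the modification from the compiler. */
    __asm__ ("andi %0, 0x0F" : "+d" (idx));
#else
    idx &= 0x0F;   /* host fallback with the same effect */
#endif
    return idx;
}
```

With an "r" constraint the code may still assemble by luck when the register allocator happens to pick an upper register, which is presumably why such a flaw can stay hidden until a newer compiler allocates registers differently.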

I also discovered the flaw in the "Toolchain"'s startup, reported elsewhere.

Findings pertinent to the "4.6.1 preview":

  1. - progmem must be const -- this was major pain in the ***, requiring adding const to hundreds of lines, sometimes requiring investigation how some of the typedefs are used. I understand the reason (some of my PROGMEM-tagged variables were already const, but I was lazy to do that consistently so far); but I would warmly welcome an (interim, perhaps, and accompanied by a nagging warning) switch to alleviate this need for similar situations where existing codebase is to be compiled.

  2. -msize does not work. It probably stopped working earlier; I don't miss it (I don't even know what it does), it just broke my usual make process

  3. -time does not work (is silently ignored). Probably never worked, I just wanted to use it to find out what's wrong - see next item

  4. compilation (-c) of a bigger source (~250k, ~8kLOC) took an inordinate time - ~50 sec instead of the usual 7-8 sec. Further investigation revealed that -S in fact took LESS than previously (ca. 3 sec vs. ca. 5 sec); it was the assembling which lasted much, much longer. I finally narrowed the problem down to the generation of the listing file, using -Wa,-adhlns.

  5. the 'M8-s' firmware is appended to the 'M2561 which is then able to "burn" those two M8-s immediately after being uploaded.

    So, first the M8-s firmware is translated into the respective .hex, which in turn is converted back into a single-section "object" (.elf) using objcopy (binary in, elf out, explicit padding and filling, explicitly renamed section). This is then linked "normally" to the 'M2561's firmware. I find this method logically superior to the "usual" way of converting .hex to .C source which is then compiled/linked in the "usual" way.

    This worked well with the '20071221, but stopped working in the "preview" (maybe it stopped in the meantime, I did not check). The "preview"'s linker complains that the "objects" created this way are not avr:6 architecture, and refuses to link.

    So I learned about the -B switch of objcopy, but it refuses to work when .hex is the input file. This is weird, as .hex is IMHO as good a binary input as .bin. Now my workaround is first converting hex to bin and then bin to obj.

    I would submit a feature request to enable -B for .hex (and other similar) types of input; I just would like to ask the more knowledgeable developers here to comment on whether there is or is not any logical flaw in this request.
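To make the hex-to-bin step of the workaround in item 5 concrete, here is a rough sketch of what turning one Intel HEX record into flat binary involves. It is not how objcopy does it; the function names are hypothetical, and only plain data records (type 00) are stored, so extended-address records (needed for images beyond 64 KB) are deliberately ignored. The caller is assumed to pre-fill the image with the padding byte (e.g. 0xFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one hex digit; returns -1 for anything else. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Read two hex digits at s as one byte; returns -1 on malformed input. */
static int hex_byte(const char *s)
{
    int hi = hex_nibble(s[0]);
    if (hi < 0) return -1;
    int lo = hex_nibble(s[1]);
    if (lo < 0) return -1;
    return (hi << 4) | lo;
}

/* Copy the payload of one Intel HEX data record (type 00) into image[].
   Returns the number of bytes stored, 0 for non-data records, or -1 on a
   malformed record, a bad checksum, or an address outside the image. */
static int ihex_record_to_bin(const char *rec, uint8_t *image, size_t image_size)
{
    if (rec[0] != ':') return -1;

    int fields[4];                    /* length, addr high, addr low, type */
    for (int i = 0; i < 4; i++) {
        fields[i] = hex_byte(rec + 1 + 2 * i);
        if (fields[i] < 0) return -1;
    }
    int len = fields[0];
    size_t addr = ((size_t)fields[1] << 8) | (size_t)fields[2];
    uint8_t sum = (uint8_t)(fields[0] + fields[1] + fields[2] + fields[3]);

    uint8_t data[255];
    for (int i = 0; i < len; i++) {
        int b = hex_byte(rec + 9 + 2 * i);
        if (b < 0) return -1;
        data[i] = (uint8_t)b;
        sum = (uint8_t)(sum + b);
    }

    /* All record bytes including the checksum must sum to 0 modulo 256. */
    int check = hex_byte(rec + 9 + 2 * len);
    if (check < 0 || (uint8_t)(sum + check) != 0) return -1;

    if (fields[3] != 0x00) return 0;  /* EOF or extended-address record */
    if (addr + (size_t)len > image_size) return -1;
    for (int i = 0; i < len; i++) image[addr + i] = data[i];
    return len;
}
```

Calling this once per line of the .hex file and then writing the image out gives the same kind of flat file that the intermediate bin step produces; in practice objcopy remains the tool to use.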

Below are the compiled sizes. I did not experiment with using different command-line switches relevant to the optimizations, all were with "-Os -funsigned-char -funsigned-bitfields -fpack-struct -fshort-enums -std=gnu99". While there is some improvement throughout the years, I don't think I am going to abandon my trusty 2007-vintage WinAVR just for the 2-3% code size decrease.

I have no means to benchmark "speed". (The one difference in data size in one of the 'M8 between the 2007 vintage and others is due to use of alignment on some data to a 0x40 boundary: the 2007 vintage adds the padding into the total data size while others don't.)

Enjoy! ;-)

JW

----------------
                       avr-gcc (GCC) 4.6.1 20110620 (prerelease)

ATMega2561: Program:  144581 bytes (55.2% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    4940 bytes (60.3% Full)   Data:        303 bytes (29.6% Full)
ATMega8/2:  Program:    3084 bytes (37.6% Full)   Data:        141 bytes (13.8% Full)


----------------
AVR Toolchain          avr-gcc.exe (AVR_8_bit_GNU_Toolchain_3.2.3_314) 4.5.1

ATMega2561: Program:  144829 bytes (55.2% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    5010 bytes (61.2% Full)   Data:        303 bytes (29.6% Full)
ATMega8/2:  Program:    3118 bytes (38.1% Full)   Data:        141 bytes (13.8% Full)

----------------
WinAVR20100110         avr-gcc (WinAVR 20100110) 4.3.3

ATMega2561: Program:  148379 bytes (56.6% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    5158 bytes (63.0% Full)   Data:        303 bytes (29.6% Full)
ATMega8/2:  Program:    3162 bytes (38.6% Full)   Data:        141 bytes (13.8% Full)

----------------
WinAVR 20071221        avr-gcc (GCC) 4.2.2 (WinAVR 20071221)

ATMega2561: Program:  148267 bytes (56.6% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    5032 bytes (61.4% Full)   Data:        352 bytes (34.4% Full)
ATMega8/2:  Program:    3198 bytes (39.0% Full)   Data:        141 bytes (13.8% Full)
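For anyone who wants to redo the arithmetic behind the "2-3%" remark, the relative change between two Program sizes can be computed as below. The function name is made up for illustration; a negative result means the newer build is smaller.

```c
/* Relative size change in percent when going from old_bytes to new_bytes;
   negative means the newer build is smaller. */
static double size_change_percent(unsigned long old_bytes, unsigned long new_bytes)
{
    return 100.0 * ((double)new_bytes - (double)old_bytes) / (double)old_bytes;
}
```

Fed with the 4.2.2 and 4.6.1 Program figures from the tables above, this gives roughly -2.5% for the ATMega2561 (148267 to 144581), -1.8% for ATMega8/1 (5032 to 4940) and -3.6% for ATMega8/2 (3198 to 3084).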


ad 1.
There were some inconsistencies in progmem handling that caused PR44643. With just a warning you could run into a "foo causes section type conflict" error, which is a less comprehensible message. I thought I was strict with const+progmem in my programs, too, but also found some complaints ;-)

I'd like to change the progmem stuff even further so that it works properly on types -- or throw error on types because progmem on types has never been documented. But there is stuff like prog_char in avr-libc which even encourages users to use undocumented, undefined features...

Moreover, I'd like to change the progmem internals so that string merging becomes possible and implementing a progmem pragma would be straightforward.

ad 2.
Anatoly removed -msize; just use -dp.

ad 3.
Nothing avr specific ;-)

ad 4.
hmmm. Such hog is maybe a binutils issue and you could asm. Or the generated lss is extremely big because of inlining, macro expansion etc.

ad 5.
Again binutils. You could generate bin out of elf directly, without intermediate ihex.

Maybe ask on the binutils list why they changed it or if there is a replacement with -B, -I, -O, -F, --alt-machine or whatever.

For the size, did you try -fno-split-wide-types or -fno-inline-small-functions?

mckeemj wrote:
SprinterSB wrote:
The compiler builds fine but ld fails to operate correctly on plugin,
but does anyone have anything specific they would like me to check in 4.6.1?
It's already good news if it simply works :-)
Quote:
SprinterSB wrote:
Moreover, -mint8 is consistently broken in all versions 4.5+
Still, I don't like the idea of something so "available" to break code with.
It doesn't break code, it shreds the compiler. I don't know anything on correctness of code with -mint8; the last time I used it was with 3.4.6.

Johann


SprinterSB wrote:
ad 1.
There were some inconsistencies [...]

As I said, I understand the reasons; I just would like to have an option to fall back willingly to the "old order". As a mere user, I of course cannot see how easy or difficult is this to accomplish.

SprinterSB wrote:
I'd like to change the progmem stuff even further so that it works properly on types -- or throw error on types because progmem on types has never been documented.

I use PROGMEM in typedefs quite extensively and never run into problems. Can you please show me an example where it would fail?

Honestly, it did not even occur to me that this would not work in typedefs, so I did not bother to check the documentation - it would be completely illogical. Again, mind, I say this from a user's point of view (the user wants things to work in a logical way, he does not care about the formal difference between qualifier and attribute - the latter being a non-standard and not-that-well-defined extension of gcc anyway).

SprinterSB wrote:
ad 4.
hmmm. Such hog is maybe a binutils issue and you could asm.
Yes, it is a gas issue, but as a user, I don't quite care that binutils and gcc are formally separate projects. I as a user see this as a flaw of the "package".

SprinterSB wrote:
Or the generated lss is extremely big because of inlining, macro expansion etc.
(Just to make sure we are talking about the same thing, the issue is NOT the listing generated by disassembling using objdump (usually suffixed .lss). The issue is, that it takes inordinately long time to assemble if the assembler is commanded to generate a list file, usually with .lst suffix.)

The .lst file - which takes around 50 secs to generate - is 1.5 megabytes long. The '2007 toolchain generates a 0.9mb .lst file in some 3 seconds. I see in the "new" .lst file a lot of garbage - seems like the whole C source appended at the beginning - but I honestly don't intend to investigate the reason, I just see this to be a flaw I am reporting about here and now.

SprinterSB wrote:
ad 5.
Again binutils.
And, again, I am the user of the package... :-)
SprinterSB wrote:
You could generate bin out of elf directly, without intermediate ihex.
Maybe I could, but I need the hex for other purposes too... :-|

SprinterSB wrote:
Maybe ask in binutils list why they changed it
I don't think there is really a reason to ask specifically this.

SprinterSB wrote:
or if there is a replacement with -B, -I, -O, -F, --alt-machine or whatever.
As I said, I do have a solution with -B, except that it annoys me that it does not accept hex. I've looked into objcopy.c, and it's specifically written so that -B with any other input_target than "binary" throws "Warning: input target 'binary' required for binary architecture parameter." and ignores the -B.

I just would like to know whether there might be a specific reason why ihex (or srecord or similar) is not good for -B. If neither you nor anybody else knows of one, I would then perhaps nag the binutils people.

SprinterSB wrote:
For the size, did you try -fno-split-wide-types or -fno-inline-small-functions?

----------------
                       avr-gcc (GCC) 4.6.1 20110620 (prerelease)

ATMega2561: Program:  144581 bytes (55.2% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    4940 bytes (60.3% Full)   Data:        303 bytes (29.6% Full)
ATMega8/2:  Program:    3084 bytes (37.6% Full)   Data:        141 bytes (13.8% Full)

with -fno-split-wide-types

ATMega2561: Program:  144337 bytes (55.1% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    4956 bytes (60.5% Full)   Data:        303 bytes (29.6% Full)
ATMega8/2:  Program:    3114 bytes (38.0% Full)   Data:        141 bytes (13.8% Full)

with -fno-inline-small-functions

ATMega2561: Program:  144743 bytes (55.2% Full)   Data:       2922 bytes (35.7% Full)
ATMega8/1:  Program:    4974 bytes (60.7% Full)   Data:        303 bytes (29.6% Full)
ATMega8/2:  Program:    3100 bytes (37.8% Full)   Data:        141 bytes (13.8% Full)

Jan


wek wrote:
SprinterSB wrote:
ad 1.
There were some inconsistencies [...]

As I said, I understand the reasons; I just would like to have an option to fall back willingly to the "old order". As a mere user, I of course cannot see how easy or difficult is this to accomplish.
With that option you would run into other problems like ICE as mentioned in PR44643. I don't think you prefer ICE over proper error message.

wek wrote:
SprinterSB wrote:
I'd like to change the progmem stuff even further so that it works properly on types -- or throw error on types because progmem on types has never been documented.

I use PROGMEM in typedefs quite extensively and never run into problems. Can you please show me an example where it would fail?
That's PR38342, which was closed just recently as "Won't Fix". The maintainers agreed that nothing will be done about it: neither throwing an error nor supporting progmem in types. If you use that feature it might silently break your code in future versions of gcc.

wek wrote:
Honestly, it even did not occur to me that this would not work in typedefs so I did not bother to check in documentation - it would be completely illogical.
You don't change a type layout like with 'packed', so I don't see a reason for having 'progmem' for types. A char is exactly the same, no matter whether it's placed in flash, in ram or in eeprom or in .noinit or wherever.
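For readers following the const and typedef discussion, the form that 4.6.x accepts without complaint looks roughly like this: the attribute sits on the variable (not on a typedef) and the object is const. This is only a sketch; the table name, its values and the lookup helper are invented, and the `#else` branch is a host-side stand-in so the snippet compiles without avr-libc.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>
#else
/* Host-side stand-ins (assumption, for off-target compilation only). */
#define PROGMEM
#define pgm_read_byte(addr) (*(const uint8_t *)(addr))
#endif

/* 4.6.x requires 'const' here; PROGMEM is attached to the variable itself,
   not hidden inside a typedef. */
static const uint8_t sine_quarter[4] PROGMEM = { 0, 49, 90, 117 };

/* Hypothetical accessor: data in flash must be read via pgm_read_byte(). */
static uint8_t sine_lookup(uint8_t i)
{
    return pgm_read_byte(&sine_quarter[i & 3]);
}
```

Dropping the `const` from the declaration is what triggers the new error in the 4.6.1 build discussed above.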

wek wrote:
SprinterSB wrote:
Or the generated lss is extremely big because of inlining, macro expansion etc.
Just to make sure we are talking about the same thing [...]. The issue is, that it takes inordinately long time to assemble if the assembler is commanded to generate a list file, [...]

The .lst file - which takes around 50 secs to generate - is 1.5 megabytes long. The '2007 toolchain generates a 0.9mb .lst file in some 3 seconds. I see in the "new" .lst file a lot of garbage - seems like the whole C source appended at the beginning

If you request garbage, you get garbage ;-)
Maybe it's more informative to dump .s with -fverbose-asm for your needs (at your option /without/ -g).
And as the .ls-whatever is quite big, how do you see that it's a gas issue and not because of some (file system) cache or buffer limitations in your OS? You might want to try -pipe with programs that big, or omit -g if you don't really need -g (I never needed that garbage on embedded).


I now also confirmed that both linker problems presented here and here persist in both the "AVR Toolchain" and your build.

---
As for PROGMEM in typedefs: I disagree with most of what you wrote about this, both in various threads here and on the mailing list. From the user's viewpoint, I find it just logical that a method to define a storage class (or named space as you call it) works with typedefs as well. Also, this is how these work in other 8-bit compilers I know of. I understand that supporting that in avr-gcc is troublesome; I just repeat that most of what I post here is from the user's point of view, ignoring what that implies for the supporters.

---

The same goes for my request to support a command-line option allowing to willingly revert to the "old", "const-less" state. I understand that this might be a problem for you as the developer, but I assure you that there will be more *users* requesting backward compatibility at the cost of potential problems (which they possibly never experienced in the past and never will in the future either). A nagging warning may be issued when this option is used, notifying the user of the known issues with this.

---

Johann wrote:
If you request garbage, you get garbage
The -Wa,-adhlns=filename.lst is one of the compiler flags "supplied" by the standard makefile template from mfile. Again, as a user I expect certain consistency in what the command-line flags mean. In fact, I quite often inspect the .lst files generated this way and never thought they might be "garbage" so far.

Johann wrote:
And as the .ls-whatever is quite big, how do you see that it's a gas issue and not because of some (file system) cache or buffer limitations in your OS?
I am a mere user, I don't know that; however, I never found significant performance jumps with files/data structures in the order of hundreds of kilobytes to few megabytes on a contemporary PC.

I report behaviour different than expected and am willing to cooperate in whatever corrective action there might be, but I can only do as much as my limited knowledge of the things allows.

JW


wek wrote:
The -Wa,-adhlns=filename.lst is one of the compiler flags "supplied" by the standard makefile template from mfile. Again, as a user I expect certain consistency in what the command-line flags mean. In fact, I quite often inspect the .lst files generated this way and never thought they might be "garbage" so far.

Okay so I played with this a bit and it appears, that while in the older "installations" the -ah portion of that switch (include high-level language listing) simply did not work at all, it does work in your "installation", and that single "sub-flag" is indeed what causes the extremely long assembly times.

So, I'd suggest to Joerg to modify the makefile template in mfile accordingly (namely to use -Wa,-adlns); and to whomever would emit a new package (Eric?) to place a remark to the release notes saying that users who update should remove the "h" from their old makefiles.

I wonder whether any of them is following this thread... especially for mfile there's no "official" bug tracker, even if this time I think I would be able to produce a patch... ;-)

JW


Jorg is on vacation

/bingo


I envy him... :-)


SprinterSB wrote:
Actually LTO is not an optimization at link time. GCC writes intermediate representation (IR) to the object file. At link time, the linker makes a callback to GCC. GCC then deserializes the IR and does the compilation again, just based on the IR. So it's similar to giving all the sources on the command line - cum grano salis, e.g. a library can also contain LTO info.

Therefore, LTO is rather "recompile at link time" than "optimize at link time". This means that the original assembler s-files and code in o-object files is paper-waste...

What is the relationship, if any, between LTO and --relax?

Iluvatar is the better part of Valar.


wek wrote:
5. the 'M8-s' firmware is appended [...]

I would submit a feature request to enable -B for .hex (and other similar) types of input [...]

And I did.

And a big shame on me. Turned out, this is already implemented in binutils 2.21... By mistake, for the "tests" I used the same avr-objcopy of WinAVR20100110, which of course is older... :oops:

JW


I finally got around to doing some tests of my own with 4.6.1 against 4.3.3 in WinAVR20100110. The good news is that there were no code breakages from the compiler ( or hidden code issues... yet! ). For the most part I found similar results to wek in that there was little change in size. There are a few cases, however, where this does not hold true.

   target   |  GCC  | .text| .data|.text %
-------------------------------------------
frequency   | 4.3.3 | 1588 |  184 |  --
meter       | 4.6.1 | 1602 |  184 |  +.9%
-------------------------------------------
xevent      | 4.3.3 | 1194 |   30 |  --
test        | 4.6.1 | 1220 |   30 |  +2.2%
-------------------------------------------
xevent      | 4.3.3 |  564 |    8 |  --
delay       | 4.6.1 |  382 |    8 |  -32%
-------------------------------------------
xio         | 4.3.3 |   84 |    0 |  --
test        | 4.6.1 |  122 |    0 |  +45%
-------------------------------------------
xio         | 4.3.3 |  352 |   26 |  --
test 2      | 4.6.1 |  342 |   30 |  -2.9%
-------------------------------------------
main        | 4.3.3 | 7834 |  370 |  --
controller  | 4.6.1 | 7218 |  406 |  -7.9%

In all cases, these programs proved quite difficult for GCC to optimize. They make heavy use of function pointers and indirection, including multiple indirection. Of particular note are two pairs: xevent_delay and xio_test, and xio_test_2 and main_controller. The first pair, xevent_delay and xio_test, are obviously the outliers with excessively wide size variations. In the first case 4.6.1 found an optimization ( a call through a function pointer ) that 4.3.3 missed, in the second case it missed one. The second pair, xio_test_2 and main_controller, both resulted in a reduced code size at the expense of a larger SRAM footprint ( .data grew in both ).

It is my feeling that 4.6.1 is ready for prime time, though, of course, I would hardly consider the testing I've done to be conclusive. On average the result is the same or, perhaps, slightly smaller. It would still be interesting to see what difference LTO might make, but I have been unable to find the time to attempt a compilation under cygwin.

Martin Jay McKee

P.S. The options I used for the above tests ( 4.3.3 and 4.6.1 ) were ( compiled for an AtMega8 ),

Compile:
-Os -std=gnu99 -ffunction-sections -fno-exceptions -fno-inline-small-functions -funsigned-bitfields -fshort-enums -fno-split-wide-types -fno-tree-scev-cprop -ffreestanding

Link:
-Wl,--gc-sections,--relax


Is it possible to factor out the module of xio_test that contributes most to the size increase and supply it as .i?

Maybe it's related to PR46278.

Did you try if -fno-tree-loop-optimize fixes some size regressions?


Last Edited: Tue. Jul 19, 2011 - 07:30 PM

mckeemj wrote:
They have heavy use of function pointers and indirection, including multiple indirection.

Wow.
Doesn't sound like a typical 8-bit microcontroller program... Nevertheless an interesting basis for comparison!

mckeemj wrote:
The second pair, xio_test_2 and main_controller both resulted in a reduced code size at the expense of reduced SRAM footprint optimization.

This would worry me.

In MCUs, the RAM is a precious resource, more precious than FLASH (my "personal", completely unscientific, factor is 4 :-) ). Unfortunately, we don't know what the stack usage is...

Could you please try to track down where did (at least some of) those extra RAM bytes go?

Jan


SprinterSB wrote:
Is it possible to factor out a module of xio_test that contributes most to size increase an supply it as .i?

wek wrote:
Could you please try to track down where did (at least some of) those extra RAM bytes go?

To both, yes. I'll see what I can do. Unfortunately I wasn't thinking when I did the last set of tests and failed to create map files for each. It's not a major hassle mind you, but I have to be at the right machine.

The results for xio_test were the ones that bothered me the most... 45% size increase is not good; but then, it is 45% of the ( otherwise ) smallest ( trivial actually ) program. So it really isn't as much difference ( code size wise ) as some other cases.

wek wrote:
Wow.
Doesn't sound like a typical 8-bit microcontroller program... Nevertheless an interesting basis for comparison!

No... not exactly standard from an implementation point of view! The project the code was written for is pretty standard application however. Since it was a personal project I decided I wanted to test how well avr-gcc handled optimizations in a "non-standard" environment. I'd say it did reasonably well considering.

And, in regards to increased RAM usage...

wek wrote:
This would worry me.

In mcus, the RAM is a precious resource, more precious than FLASH


I agree that a small decrease in Flash usage at the cost of RAM increase is a poor trade off. Here again, however, it is a case of missed optimizations that the last version ( 4.3.3 ) caught. All of the RAM used is explicitly defined in the code. Some of it, however, can be optimized to constants or register access ( if memory serves -- I'll have to recheck exactly how it removed it before ). I'll see if, while I am doing other analysis, I can find a clean example about what changed here.

Martin Jay McKee

As with most things in engineering, the answer is an unabashed, "It depends."


> So, I'd suggest to Joerg to modify the makefile template in mfile accordingly

The issue with Mfile is that there's no actual release strategy/policy for
it at all. Eric used to maintain a CVS tree for it as part of WinAVR, but
that one eventually got abandoned, and I think he didn't intend to ever
ship Mfile anymore even with a new WinAVR.

So, it's eventually up to the Mfile users to simply edit their template.

Despite, as the option used to work once, if it no longer does, either
something is broken, or the option should be dropped from gas.

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.


Quote:

Eric used to maintain a CVS tree for it as part of WinAVR, but
that one eventually got abandoned, and I think he didn't intend to ever ship Mfile anymore even with a new WinAVR.

That'd be a huge pity if it were true. How many more objcopy's not including .data problems would we see here (for example) if a lot of people weren't using an Mfile template. What about lacking -lm's or ease of selecting printf/scanf support?


dl8dtl wrote:
The issue with Mfile is that there's no actual release strategy/policy for
it at all.
Why can't it be simply attached to avr-libc? It already contains non-strictly-libc-related stuff, e.g. in the documentation. I faintly recall that nobody complained last time this was suggested.

dl8dtl wrote:
[...]I think he didn't intend to ever
ship Mfile anymore even with a new WinAVR.
Why would he want to do that? What would be the replacement?

--------
[in the following, we are talking about the "h" in "-Wa,-adhlns=filename.lst", which is supposed to add "high level language listing" to the .lst file]

dl8dtl wrote:
Despite, as the option used to work once,
Did it? In my 2007 vintage WinAVR it definitely does not work.

dl8dtl wrote:
if it no longer does,

The problem is inverse, namely, that it now DOES work - and results in excessively long assembly times.

Jan


> Why can't it be simply attached to avr-libc?

Well, documentation about the entire toolchain has always been on the
agenda of the avr-libc project, that's why you can find it there.

Hosting completely unrelated projects is another thing though. There's also
the "AVR Super-Project" on savannah where it could be hosted -- yet,
just a hosting place still wouldn't mean there were any kind of release
policy or strategy, and I'm afraid I'm simply out of resources for doing
so.

> The problem is inverse, namely, that it now DOES work

OK, now I can see the issue with it.

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.


More findings.

Hijacking a different thread, I remarked that the compiler outputs the stack size and stack frame size for every function into the asm, in the form of a comment. I asked for two more similar items to be added, namely the "net" register usage of a function, and the stack usage for parameters passed through the stack for each function call. The former is to analyze whether C inadvertently gets in the way of reserved registers (to be used in asm portions of the program); the latter is to enable a crude worst-case stack usage analysis.

Related to the latter, SprinterSB mentioned, that there's a new -fstack-usage switch in the newer versions, outputting some stack usage data into a .su file. While it does *not* do what I requested, I got curious and gave it a few tries. Here are the findings, not all of them related to -fstack-usage.

1.
This is what the .su file looks like for a .c source with 5 functions:

j.c:4:6:foo_with_long_name	34	static
j.c:16:6:bar	2	dynamic
j.c:36:6:used_uninitialized1	0	static
j.c:42:6:used_uninitialized	0	static
j.c:69:5:main	4	dynamic,bounded

Apparently the file:line:column format is aimed at automatic parsers, such as those in IDEs that jump to the given spot on a click; it is the same as the error/warning format. Nice.

It would be a bonus if the numbers could be kept visually in a column, but that is not simple I admit.
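Since each line appears to follow the layout file:line:column:function, TAB, bytes, TAB, qualifiers, a small host-side tool could read it and print the numbers in a column. The following is only a minimal sketch in C under that assumption; the names su_entry, parse_su_line and format_su_entry are made up for illustration, and file names containing a colon (e.g. a Windows drive letter) are not handled.

```c
#include <stdio.h>

/* Hypothetical record for one .su line; the field sizes are arbitrary. */
struct su_entry {
    char file[64];
    int line;
    int column;
    char function[64];
    unsigned bytes;
    char qualifier[32];
};

/* Parse "file:line:column:function<TAB>bytes<TAB>qualifier".
   Returns 1 on success, 0 if the text does not match that layout. */
int parse_su_line(const char *text, struct su_entry *out)
{
    int n = sscanf(text, "%63[^:]:%d:%d:%63[^\t]\t%u\t%31[^\r\n]",
                   out->file, &out->line, &out->column,
                   out->function, &out->bytes, out->qualifier);
    return n == 6;
}

/* Write the entry with the byte count right-aligned in a fixed column. */
void format_su_entry(const struct su_entry *e, char *buf, size_t size)
{
    snprintf(buf, size, "%-20s %5u  %s", e->function, e->bytes, e->qualifier);
}
```

Feeding it the five lines above would give one aligned row per function, independent of the TAB width of the viewer.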

2.

void foo_with_long_name(long a, long b, long c, long d, long e) {
  volatile long l, l1, l2, l3, l4;
  
  l = a;
  l1 = b;
  l2 = c;
  l3 = d;
  l4 = e;
}
  12               	foo_with_long_name:
  13 0000 4F92      		push r4
  14 0002 5F92      		push r5
  15 0004 6F92      		push r6
  16 0006 7F92      		push r7
  17 0008 AF92      		push r10
  18 000a BF92      		push r11
  19 000c CF92      		push r12
  20 000e DF92      		push r13
  21 0010 EF92      		push r14
  22 0012 FF92      		push r15
  23 0014 0F93      		push r16
  24 0016 1F93      		push r17
  25 0018 CF93      		push r28
  26 001a DF93      		push r29
  27 001c CDB7      		in r28,__SP_L__
  28 001e DEB7      		in r29,__SP_H__
  29 0020 6497      		sbiw r28,20
  30 0022 0FB6      		in __tmp_reg__,__SREG__
  31 0024 F894      		cli
  32 0026 DEBF      		out __SP_H__,r29
  33 0028 0FBE      		out __SREG__,__tmp_reg__
  34 002a CDBF      		out __SP_L__,r28
  35               	/* prologue: function */
  36               	/* frame size = 20 */
  37               	/* stack size = 34 */

It appears that the .su file reports the "stack size", i.e. stack frame (mainly local variables) plus pushed registers: here 20 bytes of frame plus 14 pushed registers gives 34 (there is more, see below for correction).

It is sometimes useful to know how much is gulped by local variables, but luckily the .asm/.lst contains this; it might be even possible to add a third number for non-local-variable content of the stack frame (spills, maybe others I don't know of).

3.

#include 

void bar(unsigned char l) {
  volatile unsigned char * p;
  p = alloca(l);
}

alloca dynamically allocates memory on the stack, and thus prevents the stack size from being calculated at compile time. That seems to be indicated by "dynamic" in the respective line of the .su file, which also reports the 2 bytes that are occupied "statically" in bar():

  94               	bar:
  95 0092 CF93      		push r28
  96 0094 DF93      		push r29

4. (not related to -fstack-usage)

typedef struct {
  union {
    unsigned char a;
    unsigned char k;
  };
  unsigned char b;  
} Ts;

volatile unsigned char voln;
void used_uninitialized1(void) {
  Ts s;

  voln = s.a;
}

void used_uninitialized(void) {
  Ts s;

  switch(voln) {
    case 0:
      s.a = 1;
      break;
    case 1:
      s.a = 55;
      break;
  };
  switch(s.a) {
    case 1:
      voln = 5;
      break;
  }
}

When compiled with -Wall, a warning is issued for the second function (for the line with "switch(s.a)"): j.c:53:3: warning: 's.<anonymous>.a' may be used uninitialized in this function [-Wuninitialized]

The new, a bit surprising but nice feature is that the "anonymous" union in the struct here gets a name (<anonymous>) :-)

The other surprising thing is, that the first function does NOT trigger a warning, although it's quite obvious that the s.a variable is certainly used without initialisation. The resulting code indicates that the compiler (rightfully) entirely eliminated the s variable, which I guess is the reason why it does not warn:

 126               	used_uninitialized1:
 127               	/* prologue: function */
 128               	/* frame size = 0 */
 129               	/* stack size = 0 */
 130               	.L__stack_usage = 0
 131 00c0 1092 0000 		sts voln,__zero_reg__
 132               	/* epilogue start */
 133 00c4 0895      		ret

Corollary is, that if you do something stupid, you must do it in a sophisticated way, if you want to get warned... The blunt, plain stupidity is overlooked... ;-)

5.

void not_supported(void) __attribute__((__naked__));
void not_supported(void) {
}

(i.e. any naked function) results in the following warning

j.c:64:1: warning: -fstack-usage not supported for this target [enabled by default]

The warning is OK, just the wording is a little bit strange.

6.

int main(void) {
  foo_with_long_name(1, 2, 3, 4, 5);
  bar(100);
  while(1);
}

foo_with_long_name() was deliberately constructed so that the compiler needs to pass parameters (here, the last one) through stack.

 162               	main:
 163               	/* prologue: function */
 164               	/* frame size = 0 */
 165               	/* stack size = 0 */
 166               	.L__stack_usage = 0
 167 0000 00D0      		rcall .
 168 0002 0F92      		push __tmp_reg__
 169 0004 85E0      		ldi r24,lo8(5)
 170 0006 90E0      		ldi r25,hi8(5)
 171 0008 A0E0      		ldi r26,hlo8(5)
 172 000a B0E0      		ldi r27,hhi8(5)
 173 000c EDB7      		in r30,__SP_L__
 174 000e FEB7      		in r31,__SP_H__
 175 0010 8183      		std Z+1,r24
 176 0012 9283      		std Z+2,r25
 177 0014 A383      		std Z+3,r26
 178 0016 B483      		std Z+4,r27
[... filling registers with the first 4 parameters as per ABI, 
then the function call, purging of the stack,
and the rest of main() ...]

So, apparently, there is 0 stack consumed by the function prologue (register pushes + stack frame), and then 4 bytes used for this instance of the function call. (Footnote: this was compiled for a 'm2561, that's why the dummy "rcall ." is used to "reserve" 3 bytes on the stack. This optimisation is AFAIK the only reason why a binary compiled for 'm64x/'m128x cannot be used on the pin- and feature-compatible 'm256x. A switch allowing for this would be nice, but I am rational enough not to expect that to happen.) Those 4 bytes I would like to have reported in an identifiable comment somewhere around that call.

In the .su file, the respective line says "dynamic,bounded". We see 4 bytes reported, so it apparently adds up the stack usage; the "bounded" indicates that the maximum of all such stack usages is used (tested with the slightly "extended" test file which is attached). So, at the end of the day, the number in the .su file is different from the 2 numbers we already had in the .asm/.lst file... but it is the one that better represents the real stack usage of the function.

7. (not related to -fstack-usage)
In the parameter-into-register-filling portion of main() just before calling foo_with_long_name, the following sequence struck my eye:

 197 003c AA24      		clr r10
 198 003e BB24      		clr r11
 199 0040 6501      		movw r12,r10
 200 0042 6894      		set
 201 0044 A2F8      		bld r10,2

I admit I had to reach for the instruction set list to find out what "set" and "bld" stand for. A cunning way to load constant 4 into a non-ldi-able register indeed! Funny that in the "extended" version which is in the attachment, in an identical situation the same register gets loaded through a "standard" procedure using a "high" register. Those are the ways of a compiler... ;-)
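For readers who also have to look the instructions up: "set" sets the T flag in SREG, and "bld Rd,b" copies the T flag into bit b of a register. The following host-side C sketch simply replays the five instructions on plain variables to show that r13:r12:r11:r10 ends up holding 4; the function names are invented for the illustration and nothing here is AVR code.

```c
#include <stdint.h>

/* BLD Rd,b: copy the T flag into bit b of the register. */
uint8_t avr_bld(uint8_t reg, unsigned bit, unsigned t_flag)
{
    uint8_t mask = (uint8_t)(1u << bit);
    return t_flag ? (uint8_t)(reg | mask) : (uint8_t)(reg & (uint8_t)~mask);
}

/* Replays the listed sequence and returns r13:r12:r11:r10 as one value. */
uint32_t replay_load_sequence(void)
{
    uint8_t r10, r11, r12, r13;
    unsigned t;

    r10 = 0;                  /* clr  r10     */
    r11 = 0;                  /* clr  r11     */
    r12 = r10; r13 = r11;     /* movw r12,r10 */
    t = 1;                    /* set          */
    r10 = avr_bld(r10, 2, t); /* bld  r10,2   */

    return ((uint32_t)r13 << 24) | ((uint32_t)r12 << 16)
         | ((uint32_t)r11 << 8) | r10;
}
```

Bit 2 has the value 4, which is why the sequence loads the constant 4 without needing an ldi-capable register.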

JW

Attachment(s): 


wek wrote:
1. It would be a bonus if the numbers could be kept visually in a column
That text file contains TABs, maybe you are using the wrong tab width? I hate TABs.

Quote:
4. The other surprising thing is, that the first function does NOT trigger a warning, although it's quite obvious that the s.a variable is certainly used without initialisation.
Probably worth a bug report.

Quote:
5. [...] any naked function) results in
j.c:64:1: warning: -fstack-usage not supported for this target [enabled by default]

The warning is OK, just the wording is a little bit strange.

Maybe there are some target hooks to implement in order to help gcc to determine stack usage of functions which are non-standard, i.e. have target specific attributes. Didn't yet look into 4.6 Internals concerning that topic.

Quote:
6. [...] there is 0 stack consumed by the function prologue [...], and then 4 bytes used for the instance of function call [...]. Those 4 bytes I would like to have reported in an identifiable comment somewhere around that call.
The .su file reports the 4 bytes.

Quote:
7.
clr r10
clr r11
movw r12,r10
set
bld r10,2

I admit I had to reach for the instruction set list to find out what "set" and "bld" stand for. A cunning way to load constant 4 into a non-ldi-able register indeed!

Thanks for drawing attention to that, the 4.7 would print
set
clr r10
bld r10,2
clr r11
clr r12
clr r13

in the same situation because of changes to output_reload_insisf. So that function will have to be even more complicated :-(

Quote:
Funny that in the "extended" version which is in the attachment, in an identical situation the same register gets loaded through a "standard" procedure using a "high" register.
That's because a high register is available, whereas in the BLD case above, no d-reg is available.

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
Quote:
6. [...] there is 0 stack consumed by the function prologue [...], and then 4 bytes used for the instance of function call [...]. Those 4 bytes I would like to have reported in an identifiable comment somewhere around that call.
The .su file reports the 4 bytes.
It reports ONLY the highest number if there are multiple function calls in a function. For worst-case tree analysis, I need this number for ALL function calls in the function.

Jan


wek wrote:
It reports ONLY the highest number if there are multiple function calls in a function. For worst-case tree analysis, I need this number for ALL function calls in the function.
You have an example showing that GCC stack analysis is lower than the worst case?

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
wek wrote:
It reports ONLY the highest number if there are multiple function calls in a function. For worst-case tree analysis, I need this number for ALL function calls in the function.
You have an example showing that GCC stack analysis is lower than the worst case?
GCC does not analyze the call *tree* (or at least I don't know of it).

I don't want to end up with a worse-than-worst-case value if it's possible to know it more precisely.
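To illustrate what a call-tree analysis on top of per-function numbers could look like, here is a minimal C sketch. It assumes an acyclic call graph (no recursion, no calls through function pointers), ignores interrupts, and takes the number of return-address bytes per call as a parameter (assumed to be 2 or 3 on AVR, depending on the device). The fn_node type and worst_case_stack() are hypothetical names, not part of any existing tool.

```c
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical node: one function, its own stack bytes (e.g. taken
   from a .su file) and the functions it calls (NULL-terminated). */
struct fn_node {
    const char *name;
    unsigned own_bytes;
    const struct fn_node *callees[MAX_CALLEES];
};

/* Worst-case stack depth of 'fn' including everything it calls.
   'call_cost' is the assumed size of one return address in bytes.
   Only valid for an acyclic call graph: recursion never terminates. */
unsigned worst_case_stack(const struct fn_node *fn, unsigned call_cost)
{
    unsigned deepest = 0;

    for (size_t i = 0; i < MAX_CALLEES && fn->callees[i] != NULL; i++) {
        unsigned below = call_cost
                       + worst_case_stack(fn->callees[i], call_cost);
        if (below > deepest)
            deepest = below;
    }
    return fn->own_bytes + deepest;
}
```

Because own_bytes here is a single number per function, every call site is charged the maximum argument area of that function. That keeps the bound safe but possibly pessimistic, which is the objection raised above: a tighter result would need the per-call-site numbers.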

JW


I tested this version with my project (running on a 168p), it went from 15822 bytes (using GCC4.3.0) to 17710 bytes (GCC4.6.1), a 12% increase.

I used avr-nm to check the size of various functions, one went from 1302 bytes to 2034 bytes (56% increase!), 170 to 258, 340 to 400, 152 to 248, etc
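For anyone repeating such comparisons, the percentages are plain relative growth. A tiny C helper (the name percent_increase is made up here) makes the arithmetic explicit:

```c
/* Relative growth from old_size to new_size, in percent. */
double percent_increase(unsigned long old_size, unsigned long new_size)
{
    return 100.0 * ((double)new_size - (double)old_size) / (double)old_size;
}
```

With the figures quoted above, 15822 to 17710 is about 11.9%, 1302 to 2034 about 56.2%, 170 to 258 about 51.8%, 340 to 400 about 17.6%, and 152 to 248 about 63.2%.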

Analyzing the .lss, it seems every call to some functions (sprintf_P() in my case) adds a lot of bytes.

EDIT: It's about the same problem since 4.3.2


It's hard to tell from a distance what causes the increased code size. (No source code, no compiler options, etc.)

One bug that can cause such an increase is PR46278. Can you tell if that PR is the cause of your problem?

avrfreaks does not support Opera. Profile inactive.


I uploaded a new version of avr-gcc: http://sourceforge.net/projects/...

The "Release Notes" are the same as for https://www.avrfreaks.net/index.p... except that the compiler is generated from SVN 179406 of gcc-4_6-branch.

Compared to the gcc-4.6.1-mingw32, following PRs are fixed:

PR50652, PR50289, PR49764, PR49487, PR46779, PR34734, PR44643, PR39633, PR39386.

Moreover, the compiler is configured with --disable-lto to disable LTO which does not work when building with my build environment. Thus, the zip is a bit smaller: It's 21MB and inflates to 68MB on disc.

avrfreaks does not support Opera. Profile inactive.

Last Edited: Fri. Oct 21, 2011 - 04:40 PM

There is also avr-gcc 4.7 snapshot 179594 that fixes the following AVR-specific PRs:

PR50652, PR50566, PR50465, PR50449, PR50447, PR50446, PR50358, PR49903, PR49881, PR49868, PR49864, PR49687, PR49313, PR47597, PR45099, PR43746, PR42210, PR36467, PR35860, PR34888, PR34790, PR33049, PR29560, PR29524, PR18145, PR17994.

The code for PR49868 is not upstream yet; you find it alongside with that PR. It enables you to compile the following C code with semantics as obvious:

#define _PGM __const __pgm
#define PGM_STR(X) ((_PGM char[]) { X })

int _PGM a = 1;
char _PGM *pstr = PGM_STR ("123");
long _PGM l[] = { 'a', 'b', 'c'};

char get_1 (char _PGM **p)       { return **p; }
char get_2 (char _PGM * _PGM *p) { return **p; }
char get_3 (char * _PGM *p)      { return **p; }
char get_4 (char **p)            { return **p; }

int main (void)
{
    return a + pstr[2] + l[1];
}

Notes:

  1. GCC 4.7 is still work in progress (stage 1)
  2. Implied by PR49687, PR49313 and PR29524, you must use the same version of avr-gcc to compile and link your objects.
  3. Be aware of implications of PR18145 when using section attributes.
  4. -fdata-sections affects data in .progmem, same applies to -fmerge-[all-]constants due to PR43746.
  5. Because of considerable problems, PR46278 is not yet fixed.

avrfreaks does not support Opera. Profile inactive.

Last Edited: Fri. Oct 21, 2011 - 04:41 PM

Just tried a non-trivial application (my Bluetooth stack and explorerbot, code available here) and the results are:

  GCC 4.3: 41476 bytes (.text + .data)
GCC 4.6.1: 41276 bytes

Which means that the new compiler performs marginally better (size-wise) in this instance. I was actually expecting a huge difference, due to the sheer number of changes to GCC between the two versions, but it seems this isn't the case.

I'm hoping the introduction of LTO will dramatically decrease the size of some of my applications, so I'm looking forward to being able to test that.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!


abcminiuser wrote:
Which means that the new compiler performs marginally better (size-wise) in this instance. I was actually expecting a huge difference, due to the sheer number of changes to GCC between the two versions

Why?

If you read the compiled results, do you see some obvious sources of inefficiency?

Have you ever tried also some "historical" version, perhaps some 3.x.x? You might be surprised.

It's quite visible how the newer versions approach things in a very different manner - but I expect no miracles, just the minor improvements both I (in reports above) and you have seen.

I guess there are cases when a certain writing "style" can be optimized better, perhaps in conjunction with certain gcc command-line switches (see Martin's report above) - but in a typical complete real-world embedded program, it's quite unlikely that a large program is written in such a "style".

abcminiuser wrote:
I'm hoping the introduction of LTO will dramatically decrease the size of some of my applications, so I'm looking forward to being able to test that.

I wouldn't hold my breath.

I personally have more expectations towards bugfixes than miraculously reduced code/data sizes.

But the really big thing I am looking forward to trying is the true "named" progmem support - thanks Johan!

Jan


Quote:

Why?

If you read the compiled results, do you see some obvious sources of inefficiency?

I'm not expecting miracles, but I do expect some improvements (otherwise what is the point of the newer versions, other than the obvious device support and bug fixes!) over the old. I'd expect to see improvements on register allocation, better peephole optimizations, smarter pointer usage and the like.

I'm not expecting a 40KB application to suddenly decrease to 20KB or anything, but I would expect the difference to be more than a few tenths of a percent.

Very excited to see the PROGMEM improvements however; that alone will make 4.7 a must have while also making my tutorial on the subject completely obsolete.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!


Quote:

other than the obvious device support and bug fixes

Do you need more than that?


Quote:

Do you need more than that?

Not really, but of course a smarter compiler is always a good thing. LTO and other optimizations should make for smaller binaries, which can lead to faster programs and/or cheaper designs.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!


The thing is that avr-gcc has been pretty optimal for quite a few years, so you can't really expect any massive improvement. Not in the obvious directions(*). Of course, if a program is written so that a newer version "taps" on a certain weakness, the improvement will be massive, but embedded applications typically use a well-mixed set of features of the language, which makes such surprises unlikely.

And, Dean, your tutorial is not going to be obsolete. Rather, it will need to be updated. There's still a need to manually locate variables into FLASH, and it's quite likely that even if we might see "universal" library functions like printf() or memcpy() seamlessly taking pointers to either FLASH or RAM or other memory classes (EEPROM, far FLASH), they will probably turn out to be inefficient enough to warrant the "specialized" functions (hence the need to teach about their proper usage). And there might also be a need for a migration document.

(*) includes the native progmem support. And I can imagine more of that. A couple of years ago, I had the opportunity to betatest the '51 version of HiTech's "omniscient" compiler just before Microchip bought them and scrapped the whole thing. Now THAT was SOMETHING. That thing KNEW which variables should go to which memory and exactly how much space they should occupy. It KNEW exactly how much stack was going to be used and could use the rest, e.g. in the startup it filled up the rest of RAM with the strings and the like which were to be used often. In '51 there are a lot of opportunities like that and that thing used them all, sometimes in a quite surprising manner.

And that is where I would like to see more work to be done, even if it is highly unlikely it will ever happen again given today's state of affairs.

JW


Please don't write tutorials or the like for __pgm. It's just a thing I played around with and I have no idea if AVR maintainers are inclined to have such stuff in the compiler.

You won't see magic in GCC, in particular no magic beyond well defined language specification.

wek wrote:
thanks Johan
Johann :-)

Automatically mapping to printf or printf_P could be done by a built-in function. I don't think a general pointer with compile-time semantics is any good. Notice that *any* dereference of such a pointer would have to test if the reference points to RAM or to Flash or to EEPROM, and that there is no more direct addressing like with LDS or STS or indirect plus offset like LDD Z+1.

Such stuff is simply not appropriate for AVR.
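To make the cost of such a generic pointer concrete, here is a host-side C sketch of what every dereference would imply. It is only a model: the mem_space tag, the generic_ptr struct and the three simulated memory arrays are invented for the illustration. On a real AVR the three branches would correspond to an LD, an LPM, and an EEPROM register access sequence respectively.

```c
#include <stdint.h>

/* Hypothetical tag telling which memory a generic pointer refers to. */
enum mem_space { SPACE_RAM, SPACE_FLASH, SPACE_EEPROM };

struct generic_ptr {
    enum mem_space space;
    uint16_t addr;
};

/* Simulated memories standing in for the real address spaces. */
uint8_t sim_ram[256];
const uint8_t sim_flash[256] = { [0x10] = 'F' };
uint8_t sim_eeprom[256];

/* Every read has to dispatch on the tag at run time; none of the
   cheap direct or displacement addressing modes can be chosen
   at compile time. */
uint8_t generic_read(struct generic_ptr p)
{
    switch (p.space) {
    case SPACE_FLASH:
        return sim_flash[p.addr & 0xFF];
    case SPACE_EEPROM:
        return sim_eeprom[p.addr & 0xFF];
    default:
        return sim_ram[p.addr & 0xFF];
    }
}
```

The run-time switch is the overhead in question: it is paid on each access, even when the programmer knows perfectly well where the data lives.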

avrfreaks does not support Opera. Profile inactive.


abcminiuser wrote:
Which means that the new compiler performs marginally better (size-wise) in this instance.
I see 4.7 performing marginally better than 4.6 on my favourite non-trivial application, but it is still outperformed by 3.4 both in speed and size. Flash size:
4.7  14682
4.6  14861
4.5  14680
4.3  14652
3.4  13904

So 3.4 performs ≈5% better than 4.x here.

Quote:
I'm hoping the introduction of LTO will dramatically decrease the size of some of my applications, so I'm looking forward to being able to test that.
Don't expect too much of LTO, simply because most AVR applications are well written (I assume so): they are not brain-dead code like that from a code generator, and they cannot benefit much from cross-module inlining. Applications that want inlining just do static inline by hand, so LTO cannot do better there.
Quote:
I was actually expecting a huge difference, due to the sheer number of changes to GCC between the two versions, but it seems this isn't the case.
Great deal of effort in GCC development is put into fields like C++11, Go language support, LTO, new DWARF standards, vectorization, better location tracking, ... whatever.

In avr backend there was no work since 4.3 except some bug fixes so that 4.6 still has shortcomings of older versions.

4.7's avr backend comes with some mini-optimizations like

  • Better support of widening multiplication
  • Better multiply with constants
  • Multiply-Add of 8-Bit-Values
  • Better 32-bit operations in lower registers
  • Better built-ins for ffs, ctz, clz, popcount and parity
  • Support tail-calls
  • Allow -fdata-sections or constant merging in .progmem
  • Built-ins for delay-loop, fmul*, etc.
  • Fewer comparisons in binary switch/case decision trees
Thus, if you never use one of these few features, you won't observe improvement.

The hard problems like optimal register allocation are too hard for me and far beyond my skills, and GCC gurus don't really take notice of AVR as we all know.

avrfreaks does not support Opera. Profile inactive.


Johann,

I apologize for misspelling your name, and I owe you one beer for each time I did so (I hope it's still only a reasonable amount :-) ).

SprinterSB wrote:
Please don't write tutorials or
the like for __pgm. It's just a thing I played around with and I have no idea if AVR maintainers are inclined to have such stuff in the compiler.
Now what's wrong with a feature which is sought after, and which implements features present in many if not all other C compilers for 8-bitters?

And you mention that mysterious club of elder AVR maintainers so often - could they perhaps be politely asked to speak up in this matter personally, here or at whatever forum they might choose?

SprinterSB wrote:
Automatically mapping to printf or printf_P could be done by built-in function.
Please, hint.

SprinterSB wrote:
I don't think general pointer with compile-time semantics is any good. Notice that *any* dereference to such a pointer would have to test if the reference points to RAM or to Flash or to EEPROM and that there is no more direct addressing like with LDS or STS or indirect plus offset like LDD Z+1.
I am fully aware of that, as all good pupils learning the insides and outsides of the 8-bitters do. But again, the traditional 8-bit compilers usually do implement generic (all-encompassing, if you don't like the "generic" name) pointers, with all the inefficiencies etc., and I guess they do have a reason to do so.

I know much of this might be beyond the standard (although I can imagine readings of standard allowing for most if not all of it), but the standard (and its committee) quite overtly ignores the needs of this particular group of targets; even the proposed "enhancements" towards "embedded" are quite off this particular target of 8-bitters with severely limited resources. On the other hand, there IS an established practice, even if not quite homogeneous. Implementing features present in other compilers enhances awareness of these problems, and facilitates portability - however dubious the latter in the world of microcontrollers might be.

Jan


wek wrote:
And you mention that mysterious club of elder AVR maintainers so often - could they perhaps be politely asked to speak up in this matter personally, here or at whatever forum they might choose?
A maintainer has his very good reasons for accepting a patch or rejecting it. And it's up to him if he likes to speak up in whatever place or not.
Quote:
SprinterSB wrote:
Automatically mapping to printf or printf_P could be done by built-in function.
Please, hint.
Just as GCC has a notion of what strlen does and optimizes it out for known strings, it could automatically map printf to a call to printf_P if it sees a __pgm pointer as format string.
Quote:
SprinterSB wrote:
I don't think general pointer with compile-time semantics is any good.
I am fully aware of that, as all good pupils learning the insides and outsides of the 8-bitters do. But again, the traditional 8-bit compilers usually do implement generic pointers, with all the inefficiencies etc., and I guess they do have a reason to do so.
You always see the underlying architecture at some point, in particular on a segmented one. Just to see what we are actually talking about: Please make the code explicit under the following assumptions:

Load a 16-bit value V to some Register R. The lower address-word of V is already in Z and V's higher address-byte is in some register Q.

Just write down the code you'd like to see!

Quote:
... but the standard (and its committee) quite overtly ignores the needs of this particular group of targets; even the proposed "enhancements" towards "embedded" are quite off this particular target of 8-bitters with severely limited resources. On the other hand, there IS an established practice, even if not quite homogeneous. Implementing features present in other compilers enhances awareness of these problems, and facilitates portability
The standard is all we have. So you can participate in the next standard committee or go to the bleeding edge and implement features in GCC like user/application-defined named address spaces as already covered by the standard.

The latter will then enable you to plug in whatever code to target your private, favourite address spaces.

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
It's hard to tell from a distance what causes the increased code size. (No source code, no compiler options, etc.)

One bug that can cause such an increase is PR46278. Can you tell if that PR is the cause of your problem?

I tried to answer but the server always returns an error if I try to embed some code :-/

Quote:
400 Bad Request

Your browser sent a request that this server could not understand.


It rejects the percent sign, because of the admins' fear of hacking. Replace it with the &#37; sequence, or if it is some code, zip it up and post it as an attachment.

JW

Last Edited: Tue. Oct 11, 2011 - 07:40 PM

Quote:

I tried to answer but the server always returns an error if I try to embed some code

Change any % signs for the sequence &#37;


Here is one of my routines, which goes from 170 bytes with 4.3.0 to 258 bytes in 4.6.1 (it was 254 bytes in 4.3.4):

void long_to_dec_str(long value, char *decs, uint8_t prec)
{
  uint8_t pos;

  if(prec==0)
  {
    sprintf_P(decs, PSTR("%ld"), value);
    // nothing more to do
    return;
  }
  else if(prec==1)
    sprintf_P(decs, PSTR("%02ld"), value);
  else if(prec==2)
    sprintf_P(decs, PSTR("%03ld"), value);

  pos=strnlen(decs, 16)+1;

  for(uint8_t i=0; i<=prec; i++)
  {
    decs[pos]=decs[pos-1];  // move digit
    pos--;
  }

  // then insert decimal separator
  decs[pos]=decsep_DG;
}

I can see in the .lss that with 4.3.0, nearly every call to sprintf_P is 4 push, 2 ldi, 4 push, so about 20 bytes. Since 4.3.2 each call is a mix of in, out, subi, std Z, ldi, etc. for about 50 bytes.

Another example with 4.3.0:

    155e:	26 e7       	ldi	r18, 0x76	; 118
    1560:	31 e0       	ldi	r19, 0x01	; 1
    1562:	3f 93       	push	r19
    1564:	2f 93       	push	r18
    1566:	9e 01       	movw	r18, r28
    1568:	2c 5f       	subi	r18, 0xFC	; 252
    156a:	3f 4f       	sbci	r19, 0xFF	; 255
    156c:	3f 93       	push	r19
    156e:	2f 93       	push	r18
    1570:	2d e5       	ldi	r18, 0x5D	; 93
    1572:	31 e0       	ldi	r19, 0x01	; 1
    1574:	3f 93       	push	r19
    1576:	2f 93       	push	r18
    1578:	9f 92       	push	r9
    157a:	8f 92       	push	r8
    157c:	0e 94 32 1c 	call	0x3864	; 0x3864 

Same example with 4.6.1, same code, same place:

     d66:	23 e7       	ldi	r18, 0x73	; 115
     d68:	31 e0       	ldi	r19, 0x01	; 1
     d6a:	4d b7       	in	r20, 0x3d	; 61
     d6c:	5e b7       	in	r21, 0x3e	; 62
     d6e:	48 50       	subi	r20, 0x08	; 8
     d70:	50 40       	sbci	r21, 0x00	; 0
     d72:	0f b6       	in	r0, 0x3f	; 63
     d74:	f8 94       	cli
     d76:	5e bf       	out	0x3e, r21	; 62
     d78:	0f be       	out	0x3f, r0	; 63
     d7a:	4d bf       	out	0x3d, r20	; 61
     d7c:	ed b7       	in	r30, 0x3d	; 61
     d7e:	fe b7       	in	r31, 0x3e	; 62
     d80:	31 96       	adiw	r30, 0x01	; 1
     d82:	ad b7       	in	r26, 0x3d	; 61
     d84:	be b7       	in	r27, 0x3e	; 62
     d86:	12 96       	adiw	r26, 0x02	; 2
     d88:	9c 92       	st	X, r9
     d8a:	8e 92       	st	-X, r8
     d8c:	11 97       	sbiw	r26, 0x01	; 1
     d8e:	4b e4       	ldi	r20, 0x4B	; 75
     d90:	51 e0       	ldi	r21, 0x01	; 1
     d92:	53 83       	std	Z+3, r21	; 0x03
     d94:	42 83       	std	Z+2, r20	; 0x02
     d96:	ae 01       	movw	r20, r28
     d98:	4c 5f       	subi	r20, 0xFC	; 252
     d9a:	5f 4f       	sbci	r21, 0xFF	; 255
     d9c:	55 83       	std	Z+5, r21	; 0x05
     d9e:	44 83       	std	Z+4, r20	; 0x04
     da0:	37 83       	std	Z+7, r19	; 0x07
     da2:	26 83       	std	Z+6, r18	; 0x06
     da4:	0e 94 e2 1f 	call	0x3fc4	; 0x3fc4 
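
As a rough cross-check of the two listings above, the size of each argument-setup sequence can be computed from the first and last addresses shown. A minimal sketch (seq_bytes is a hypothetical helper, and it assumes the final call is a 4-byte instruction, as the four opcode bytes in the listings suggest):

```c
#include <assert.h>

/* Bytes covered by a listing excerpt: from the address of the first
   instruction up to the end of the last instruction. */
static unsigned seq_bytes(unsigned first_addr, unsigned last_addr,
                          unsigned last_len)
{
    return last_addr + last_len - first_addr;
}
```

With the addresses above this gives 34 bytes for the 4.3.0 sequence (0x155e to 0x157c) and 66 bytes for the 4.6.1 sequence (0xd66 to 0xda4), call included.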

Options used to compile: -fno-tree-loop-optimize -ffreestanding -morder1 -funsigned-char -funsigned-bitfields -fshort-enums -fpack-struct -ffunction-sections -fdata-sections -fno-split-wide-types -Wl,--relax,--gc-sections -fno-inline-small-functions -mcall-prologues

Since 4.3.2, all calls to such functions look like this. One of my functions, which uses maybe 20 sprintf_P calls in a switch, goes from 1.3K to 2K just because of this.


Magister wrote:
here is one of my routine which goes from 170 bytes with 4.3.0 to 258 bytes in 4.6.1 (it was 254 bytes in 4.3.4)

I could not reproduce your exact results even with all your options (and some assumptions like decsep_DG) but I see a gross increase in size, too.

Note that I cannot do relaxation because there is just a single object file, and avr-size shows the size of the .o and not that of a final .elf.

With just -mmcu=atmega16 -Os and the code below, I get the following sizes in bytes (left column: avr-gcc version):

4.7    232
4.6    274
4.5    260
4.3    278
4.2    180
3.4    176

What is strange is that you observe a big difference between 4.3.4 and 4.3.0, which should not be there because 4.3.4 is just a bugfix release of 4.3.0. Are you sure it's not a 4.2?

Anyway, the effect can be seen clearly. The test case I used factors out the hardware dependency and reads:

extern int sprintf (char*, const char*, ...);
extern unsigned int strnlen (const char*, unsigned int);

typedef unsigned char uint8_t;

void
long_to_dec_str (long value, char *decs, uint8_t prec) 
{ 
    uint8_t pos, i;

    if (prec == 0)
        sprintf (decs, "%ld", value);
    else if (prec == 1)
        sprintf (decs, "%02ld", value);
    else if (prec == 2)
        sprintf (decs, "%03ld", value);

    if (prec == 0)
        return;

    pos = strnlen (decs, 16) + 1;

    for (i = 0; i <= prec; i++)
    {
        decs[pos] = decs[pos-1];
        pos--;
    }

    decs[pos] = '.';
}

It's all about passing arguments to varargs functions like sprintf, which get their arguments on the stack, and then popping the arguments again.

As varargs calls are expensive, all I can suggest is to factor them out like so:

void
long_to_dec_str (long value, char *decs, uint8_t prec) 
{ 
    uint8_t pos, i;
    const char *fmt;

    if (prec == 0)
        fmt = "%ld";
    else if (prec == 1)
        fmt = "%02ld";
    else if (prec == 2)
        fmt = "%03ld";
    else
        return;

    sprintf (decs, fmt, value);

    if (prec == 0)
        return;

    pos = strnlen (decs, 16) + 1;

    for (i = 0; i <= prec; i++)
    {
        decs[pos] = decs[pos-1];
        pos--;
    }

    decs[pos] = '.';
}

With that change, the sizes in bytes are:

4.7    160
4.6    200
4.5    166
4.3    164
4.2    134
3.4    130
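
To check on a PC that the factored variant still formats as intended, here is a host-side sketch of it. It assumes '.' as the separator (as in the test case above) and swaps strnlen for plain strlen to stay within standard C; neither choice comes from the original project.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Host-side copy of the factored variant; '.' stands in for decsep_DG. */
static void long_to_dec_str(long value, char *decs, uint8_t prec)
{
    const char *fmt;
    uint8_t pos, i;

    if (prec == 0)
        fmt = "%ld";
    else if (prec == 1)
        fmt = "%02ld";
    else if (prec == 2)
        fmt = "%03ld";
    else
        return;                     /* unsupported precision: decs untouched */

    sprintf(decs, fmt, value);      /* the only varargs call site */

    if (prec == 0)
        return;

    pos = (uint8_t) (strlen(decs) + 1);  /* one past the terminating NUL */

    for (i = 0; i <= prec; i++) {
        decs[pos] = decs[pos - 1];  /* shift the NUL and the last prec digits */
        pos--;
    }

    decs[pos] = '.';                /* insert the decimal separator */
}
```

For example, value 1234 with prec 2 yields "12.34", and value 5 with prec 1 yields "0.5". The buffer needs room for the extra separator character.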

There is a way to accumulate outgoing push/pop arguments in a function that calls varargs function(s) so that the expensive SP operations occur just once, in the prologue/epilogue.

That should decrease the code size of the original example to a reasonable amount. But it will take time to implement, and I cannot say if it will go into 4.7, whose development will be closed very soon.

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
What is strange is that you observe a big difference between 4.3.4 and 4.3.0 which should not be there because 4.3.4 is just a bugfix release of 4.3.0. Sure is't not a 4.2?
Couldn't that be the consequence of avr-libc version change?

JW


wek wrote:
SprinterSB wrote:
What is strange is that you observe a big difference between 4.3.4 and 4.3.0 which should not be there because 4.3.4 is just a bugfix release of 4.3.0. Sure is't not a 4.2?
Couldn't that be the consequence of avr-libc version change?
Maybe, yes. Dunno how the different versions have been packaged/distributed.

That's one reason why I list the sizes of .o files: they do not depend on the libc implementation because they are not yet linked. Besides, you don't need the complete project to analyze one specific function.

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
wek wrote:
SprinterSB wrote:
Automatically mapping to printf or printf_P could be done by built-in function.
Please, hint.
Just as GCC has a notion of what strlen does and optimizes it out for known strings, it could automatically map printf to a call to printf_P if it sees a __pgm pointer as format string.

Ah, I see. And I don't like it. I don't think the compiler should handle what's basically up to a library.

SprinterSB wrote:
wek wrote:
SprinterSB wrote:
I don't think general pointer with compile-time semantics is any good.
I am fully aware of that, as all good pupils learning the insides and outsides of the 8-bitters do. But again, the traditional 8-bit compilers usually do implement generic pointers, with all the inefficiencies etc., and I guess they do have a reason to do so.
You always see the underlying architecture at some point, in particular on a segmented one. Just to see what we are actually talking about: Please make the code explicit under the following assumptions:

Load a 16-bit value V to some Register R. The lower address-word of V is already in Z and V's higher address-byte is in some register Q.

Just write down the code you'd like to see!


  ;perform whatever movs are needed for ABI 
        call	__gptrget

Actually, I was cheating, and took this from sdcc/avr - see attachment (sorry, there's no actual implementation of __gptrget in there as the avr target of sdcc was abandoned long ago before the library was written; but you surely get the point).

Jan


That's no code that is able to perform the described operation.
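
For what it's worth, here is a sketch of what a generic-pointer read would have to do at run time, written in C against simulated memories so it can be tried on a PC. Everything here is an assumption for illustration: the gptr_t layout, the encoding of the space byte, and the name gptr_read16 (this is not the sdcc __gptrget implementation, which the attachment did not contain).

```c
#include <assert.h>
#include <stdint.h>

/* Simulated memories standing in for the two AVR address spaces. */
static const uint8_t flash_mem[8] = { 0x34, 0x12, 0, 0, 0, 0, 0, 0 };
static uint8_t       ram_mem[8]   = { 0xCD, 0xAB, 0, 0, 0, 0, 0, 0 };

/* Hypothetical generic pointer: `addr` plays the role of Z (low address
   word), `space` the role of Q (high address byte). */
typedef struct {
    uint16_t addr;
    uint8_t  space;   /* assumed encoding: 0 = RAM, non-zero = flash */
} gptr_t;

/* Load a 16-bit little-endian value, choosing the memory at run time. */
static uint16_t gptr_read16(gptr_t p)
{
    const uint8_t *mem = p.space ? flash_mem : ram_mem;

    return (uint16_t) (mem[p.addr] | (mem[p.addr + 1] << 8));
}
```

On a real AVR the two branches would need different instructions (LPM for flash, LD for RAM), so every access through such a pointer pays for the run-time test.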

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:

4.7    232
4.6    274
4.5    260
4.3    278
4.2    180
3.4    176

What is strange is that you observe a big difference between 4.3.4 and 4.3.0 which should not be there because 4.3.4 is just a bugfix release of 4.3.0. Sure is't not a 4.2?

I can confirm that GCC 4.3.4 and also 4.3.2 (from WinAVR20081205) easily add 10% to the code. The change I see from ldi/push to std Z/in/out takes place between 4.3.0 and 4.3.2, so that can be the starting point for checking. I looked at the diff between them, but I'm not a pro in how GCC works (yet!)

Quote:
As varargs calls are expensive, all I can propose you is to factor them out
I tried in some places, but I have multiple arguments that can be strings, integers, etc., so I cannot factor them out :-/

Quote:
But it will take time to imlement it and I cannot say if it will go into 4.7 whose development will be closed very soon.
It's ok. As long as you have a case that reproduces the missed optimization, take your time to work on it. If I can help with other tests, just tell me :)


I tried your avr-gcc 4.7 snapshot 179594 with my project and it went from 15764 to 15888, first time since 4.3.0 that I do not see an almost 2k increase, good job!!


Magister wrote:
I tried your avr-gcc 4.7 snapshot 179594 with my project and it went from 15764 to 15888, first time since 4.3.0 that I do not see an almost 2k increase, good job!
If you are fine with a specific avr-gcc version or distribution package like WinAVR, there is no need to switch to the most up-to-date version on the assumption that "newer = better".

The reason I got involved in avr-gcc development was a gross code size increase and speed penalty of about 10% when I tried to switch from avr-gcc 3.4.6 (WinAVR-20060421) to gcc 4.x.

+10% is not peanuts, and with 4.7 -Os from above it's still +5.6%. Adding

-fno-inline-small-functions
-fno-move-loop-invariants
-fno-tree-loop-optimize
-fno-optimize-sibling-calls
-fno-caller-saves
-fira-algorithm=priority

decreases the project from 14682 to 14056 bytes, i.e. the bloat factor reduces from +5.6% to +1.1% compared against avr-gcc 3.4.6 with its 13904 bytes.
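
The percentages quoted above follow directly from the byte counts. A small sketch of the arithmetic (bloat_tenths is a hypothetical helper; it returns the increase in tenths of a percent, rounded, to avoid floating point):

```c
#include <assert.h>

/* Size increase of `size` over `base` in tenths of a percent, rounded
   to nearest. Uses long so the intermediate product cannot overflow a
   16-bit int. */
static long bloat_tenths(long size, long base)
{
    return ((size - base) * 1000 + base / 2) / base;
}
```

With the 3.4.6 baseline of 13904 bytes, 14682 bytes gives 56 (+5.6%) and 14056 bytes gives 11 (+1.1%).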

My changes to the AVR realm in GCC are just mini-optimizations that try to print things like widening multiplication a bit smarter, but it appears that most of the performance decrease in newer avr-gcc is systemic. It pays to find command line options that reduce that loss and to work out which optimizations not to do.

Given the number of optimization options and parameters GCC offers, this can take some time.

avrfreaks does not support Opera. Profile inactive.


If we are talking about code size: why not change the ABI and move zero_reg (R1) to some other register, e.g. R2? The problem is with newer AVRs, which have a mul instruction that uses the R0:R1 register pair. After a mul, R1 needs to be reloaded, so moving zero_reg to some other register can easily save some bytes. Not too much, but it should be easy to implement, right?


TFrancuz wrote:
If we are talking about code size – why not to change ABI, and move zero_reg (R1) to some other register, e.g. R2? [...] but it should be easy to implement, right?
  • It's an ABI change and it will render assembler programs and inline assembler incorrect. Thus you'd like to have an option like -mabi2.
  • Because it's an ABI change, such an option must be a multilib option for libgcc and avr-libc at least.
  • I doubt it is easy to implement in avr-libc because a great deal of the algorithms, like floating point, are hand-written assembler.
  • In the compiler, you'd have to duplicate, or at least rewrite, the following if you break the old ABI:
    • All patterns that involve multiplication including multiply-add and multiply-subtract and fmul* builtins.
    • All libgcc implementation of multiplications and fmul* builtins
    • Change R0/R1 from fixed to call-clobbered
    • Rewrite ISR pro- and epilogue
    • Review the backend for places that change/clear zero_reg and find a replacement not temporarily changing zero_reg.
  • In (inline) assembler it is no longer sufficient to clear zero_reg if it was used temporarily. Instead, such code parts must be rewritten to be atomic.
Regarding all these points, I wouldn't call it "easy to implement": not for avr-gcc, not for avr-libc, and not for applications that use (inline) assembler with C/C++ and would have to parameterize it depending on some #ifdef __AVR_ABI2 built-in macro.
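
To illustrate the inline-assembler point in the list above: under the current ABI, code that uses MUL overwrites R1 and has to clear it again afterwards. A minimal sketch, assuming avr-gcc's __AVR_HAVE_MUL__ macro and the __zero_reg__ assembler symbol; mul8x8 is a hypothetical name, and the #else branch is only there so the snippet also builds on a host.

```c
#include <assert.h>
#include <stdint.h>

/* 8 x 8 -> 16 bit unsigned multiplication. */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
#if defined(__AVR__) && defined(__AVR_HAVE_MUL__)
    uint16_t result;

    __asm__ ("mul  %1, %2"       "\n\t"  /* product lands in R1:R0 */
             "movw %0, r0"       "\n\t"  /* copy it to the result pair */
             "clr  __zero_reg__"         /* restore R1 == 0 for the ABI */
             : "=r" (result)
             : "r" (a), "r" (b));
    return result;
#else
    return (uint16_t) (a * b);           /* portable fallback for host builds */
#endif
}
```

Under a changed ABI, every snippet of this shape in applications and libraries would have to be reviewed, which is part of why the change is not cheap.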

avrfreaks does not support Opera. Profile inactive.


So it would be interesting to count how often R1 is reloaded in a real application because of mul. That would give us an idea whether the effort you mentioned is balanced by the space savings.
Unfortunately my apps do not use complicated math, so mul is rarely used.
BTW, I noticed that you applied some patches needed to store C++ VTABLES in FLASH. Can you tell me if this feature will finally be implemented in the reasonably near future, or should I rather forget about it?


TFrancuz wrote:
So it is interesting to calculate how often R1 is reloaded in real application because of mul
In my application from above, which occupies about 14k bytes of flash, there are ~60 CLR instructions on zero_reg, as counted by

grep 'clr[ \t]*__zero_reg__' *.s | wc -l

It's not "complicated math" but just (fixed-point)arithmetic using [F]MUL*.

Quote:
so it will give us an idea if the effort you mentioned, is balanced by some space savings.
There will of course be space saving and anyone just a bit familiar with avr-gcc will guess correctly on that.

Quote:
BTW, I noticed that you applied some patches needed to implement VTABLES in C++ to be stored in FLASH. Can you tell me if this feature will be finally implemented in reasonable future, or I should rather forget about that?
No, there are no changes "applied". I just built a version with some patches attached to the PR, but nothing of it is approved or upstream.

Named address spaces are not a feature specified for C++. They are C only.

I cannot say if a "hidden" address space that is not exposed to the user would do the trick by, e.g., tagging the respective pointers with AS information. But as AS is not a C++ feature, it's not very likely that this will work smoothly and that AS information is tracked consistently throughout the C++ part (if at all). GCC is far too complex for me to be familiar with that spot in the compiler world. I know my limits, and at the moment I don't think I should touch it. Besides, it's much more effort than working in the avr sandbox because changing the front/middle-ends has side effects on every platform, not just avr.

I am not interested in C++. The language is utterly ugly IMHO and has so many drawbacks that I avoid it and use Java or Python or whatever on the PC. And on AVR, C is completely reasonable for me. Thus, I am not inclined to get into the C++ mess.

avrfreaks does not support Opera. Profile inactive.


It's a pity that Atmel doesn't actively support avr-gcc development (at least the support is not very big); instead they produce a messy toolchain and a completely crazy AS5. It seems that arm-gcc is more actively developed, so it's probably time to dive into the ARM world. For me, that is the conclusion of the whole topic.


At least in official GCC, ARM gets much more notice than AVR and there are many contributors, e.g. from codesourcery. But I don't know if ARM is still involved in codesourcery.

Alternatively, guys here could improve AVR support in GCC. I think there are many very experienced programmers here — not only with respect to µC but also to host programming — that have excellent programming and analytical skills.

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
Alternatively, guys here could improve AVR support in GCC. I think there are many very experienced programmers here — not only with respect to µC but also to host programming — that have excellent programming and analytical skills.
I wrote some compilers years ago (Pascal to 68000, for instance) using lex/yacc or flex/bison. I understand how that works, but optimization is another story :(

Also, I do not know AVR assembly well enough. I should go deeper into it before attempting to modify the gcc/config/avr/ folder :roll:

GCC 4.7.0 is the first one since 4.3.0 that improves my code size. If I can help you by doing some tests or whatever, just ask!


Magister wrote:
SprinterSB wrote:
Alternatively, guys here could improve AVR support in GCC. I think there are many very experienced programmers here — not only with respect to µC but also to host programming — that have excellent programming and analytical skills.
I wrote some compilers years ago (Pascal to 68000 for instance) using lex/yacc or flex/bison, I understand the way it works, but optimization is another story :(
There are two flavours of optimization. First, there are the optimization algorithms that are already present in GCC. They might not produce the best results for unimportant targets like AVR that are not at the center of the developers' focus. Second, there are mini-optimizations in the AVR-only part, like printing instructions smarter, working out better cost functions, doing code cleanup, writing test cases for AVR-specific features, etc.

The overall compiler infrastructure is already there, so there is no need to bother with lexing/parsing/syntax/language specifications, unless one feels inclined to work in that area.

But no matter what field you pick: GCC is a real-world compiler and not a finger exercise from university. It's not easy to get started and to understand what goes on where and why, and what to change to achieve this or that. The AVR back-end is not linear code but instead a zoo of target hooks called from somewhere at some time.

Quote:
Also I do not know the assembly enough on the AVR, I should go deeper into it before attempting to modify the gcc/config/avr/ folder :roll:
Yes, you need an idea of what good/bad/correct/incorrect code is and learn about GCC's internal representations of code. Assembler is only the very last step of "dumping" the internal information.
Quote:
GCC 4.7.0 is the first one since 4.3.0 that improve my code size. If I can help you by doing some tests or whatever, just ask!
If you like to hunt for errors, you will most likely find one in an area where there have been changes, e.g. some internal cleanup of progmem handling, using multiply-add like instruction sequences to save some ticks/bytes, tweaked code expansion for switch/case, built-ins, ...

Above you find a list of fixed PRs that give you an idea what changed between 4.6 and 4.7. And the bugfixes for 4.6 mentioned above are part of 4.7, too, of course.

Then there is no good test coverage for the __pgm qualifier. There is no code in the GCC test suite for that feature, so the respective lines in the AVR part are dead code from the test suite's perspective.

I have never run code generated with the __pgm feature, not even on a simulator, so you might be the first to try it out 8)

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
The AVR back-end is not linear code but instead a zoo of target hooks called from somewhere at some time.
Saw this, I'm sure it takes weeks to understand how it works :-/ it's a full time job!

Quote:
If you like to hunt for errors, you will most likely find one in an area where there had been changes, e.g. some internal cleanup to progmem handling, using muliply-add like instruction sequences to save some ticks/bytes, tweak code expansion for switch/case, built-ins, ....
Trying some options, I managed to get the hex code smaller than with 4.3.0. It has been almost 3 years since I have seen this (seems related to bug #49881). I uploaded the code to my board and ran every function for a couple of days; everything works fine, so this 4.7 is really not that bad. I have a lot of integer math in it, plus progmem (mainly for string printing).

I'll take a look at the bug list you mentioned and also at __pgm.

I saw that trying -fmerge-all-constants makes the compiler stop, saying

confused by earlier errors, bailing out

I also get a couple of warnings like

warning: uninitialized variable 'blabla' put into program memory area [-Wuninitialized]

First appears to be PR50739.

If the second is not PR50807, could you give an example? A one-liner will probably do already.

The current implementation of argument pushing using PUSHes is fine, but getting rid of these arguments again is tedious and might lead to unpleasant code if there are many functions getting stack arguments.

There is ACCUMULATE_OUTGOING_ARGS which is not implemented and will take some time to do and test, in particular with -mcall-prologues but without frame pointer. Dunno if there is enough time to do it for 4.7.

avrfreaks does not support Opera. Profile inactive.


Yup 1st one seems to be PR50739.

For the second one, I have a mystring.h

extern const prog_char str1[];
extern const prog_char str2[];
extern const prog_char str3[];

then in mystring.c I have:

const prog_char str1[] ="bla1";
const prog_char str2[] ="bla2";
const prog_char str3[] ="bla3";

In main.c I include mystring.h, and at compile time I get the warning wherever str1 (or 2 or 3) is used. I guess the warning is valid, but before 4.7.0 it did not appear.

I also noticed that I cannot use -ffreestanding, otherwise delay() does not compile because of missing fabs and fceil, IIRC.


Following code:

#define PROGMEM __attribute__((progmem))

extern const char PROGMEM str1[];
const char PROGMEM str1[] = "bla1";

compiles correctly and without warning both with avr-gcc and avr-g++.

I did not look into prog_char because its behaviour is unspecified, so you use it at your own risk.

With freestanding you are left alone on the silicon. Most likely you do not want freestanding because you want library support.

avrfreaks does not support Opera. Profile inactive.


SprinterSB wrote:
I never run code generated with __pgm feature — not even on a simulator — so you might be the first to try it out 8)

EDIT: I misunderstood, it does not work as follows...

-----------------------------
It works, basic test shows it is working

const char* str;
if(a)
  str=(const __pgm char*) "0123";
else
  str=(const __pgm char*) "4567";
lcd_print(str);

It outputs the right string on my LCD. I have an lcd_print() and an lcd_print_P() function that I wrote. The second one uses pgm_read_byte(). Using the new address space means I can get rid of it, nice. Next is to see whether duplicate strings are stored only once in flash.

In my project, I changed all the progmem variables to __pgm and removed all the _P references, and it seems to work fine. Only strnlen_P is still linked (used internally by vfprintf). One notable difference is the output of avr-size, which now reports about 1K less text and 1K more data (is this normal, or is it because all my data is copied into RAM?). avr-strings | sort | uniq -d shows no duplicates.

Last Edited: Sat. Oct 22, 2011 - 01:14 PM

Magister wrote:

const char* str;
str = (const __pgm char*) "0123";
lcd_print (str);

Maybe you misunderstood how it works. The code above will put the string literals into RAM and access them from RAM (as your lcd_print accesses RAM). The cast is superfluous and only causes confusion.

To access RAM, you write

char lcd_print (const char *str)
{
   return *str;
}

and to access flash

char lcd_print_P (const char __pgm *str)
{
   return *str;
}

Pointers are still 16 bits wide, but there is now a second flavour of pointer.

Calling these functions is straightforward. But initializing pointers to flash with string literals that shall be located in flash is not straightforward and is error-prone:
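
Here is a host-buildable sketch of the two pointer flavours. Note the assumptions: the snapshot discussed here spells the qualifier __pgm, while, as far as I know, released GCC 4.7 and later spell it __flash and define the macro __FLASH when it is available. FLASH_AS, peek_ram and peek_flash are hypothetical names; on a non-AVR host FLASH_AS expands to nothing, so both functions read ordinary memory.

```c
#include <assert.h>

/* Assumed spelling for released compilers; the snapshot in this thread
   would need __pgm instead. Expands to nothing on a host build. */
#if defined(__AVR__) && defined(__FLASH)
#  define FLASH_AS __flash
#else
#  define FLASH_AS
#endif

/* A string placed in flash when the address space is available. */
static const FLASH_AS char flash_text[] = "abc";

/* Read one character through an ordinary (RAM) pointer. */
static char peek_ram(const char *p)
{
    return *p;
}

/* Read one character through a flash-qualified pointer. The source is
   identical; only the pointer type tells the compiler which memory to read. */
static char peek_flash(const FLASH_AS char *p)
{
    return *p;
}
```

The point of the design is that the address space is part of the pointer type, so passing a flash pointer to peek_ram is a type mismatch the compiler can diagnose on AVR.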

#include 

#define PGM_STR(X) ((const __pgm char[]) { X })

char const __pgm *gstr = PGM_STR ("123");
char const __pgm text[] = "abc";

void foo (const char __pgm *str2)
{
    void lcd_print_P (const char __pgm*);
    const char __pgm *str = PSTR ("0123");
    static const char __pgm stext[] = "abc";
    
    str2 = PSTR ("abc");
    
    lcd_print_P (str);
    lcd_print_P (str2);
    lcd_print_P (gstr);
    lcd_print_P (text);
    lcd_print_P (stext);
}

All this will work.

However, the following does not work (or rather, it works as doomed by the standard):

const char __pgm *str = "foo";

The new standard extension condemns string literals like the one above to live in the default address space, which is .rodata, which is RAM.

String merging should work for strings in .rodata, but merging for strings in .progmem and string literals is not yet complete and is low priority.

avrfreaks does not support Opera. Profile inactive.


*oups* yes I misunderstood, I will redo some tests, thanks for all the info!


Just compiled and ran my first example.

Run on an ATmega168:

#include 

#define _PGM const __pgm
#define PGM_STR(X) ((_PGM char[]) { X })

typedef struct s_tree
{
    char _PGM * val;
    struct s_tree * left;
    struct s_tree _PGM * right;
} tree_t;

tree_t      A  = { PGM_STR ("a"), NULL, NULL };
tree_t _PGM B  = { PGM_STR ("b"), NULL, NULL };
tree_t      C  = { PGM_STR ("c"), NULL, NULL };
tree_t _PGM D  = { PGM_STR ("d"), NULL, NULL };
tree_t      AB = { PGM_STR ("A"), &A,   &B };
tree_t _PGM CD = { PGM_STR ("C"), &C,   &D };
tree_t _PGM H =  { PGM_STR ("*"), &AB,  &CD };

void print_tree_P (tree_t _PGM*);
void print_tree  (tree_t*);

void print_tree_P (tree_t _PGM * t)
{
    if (!t)
        return;
        
    printf ("[%c]", *t->val);
    print_tree   (t->left);
    print_tree_P (t->right);
}

void print_tree (tree_t * t)
{
    if (!t)
        return;
        
    printf ("[%c]", *t->val);
    print_tree   (t->left);
    print_tree_P (t->right);
}

void testit (void)
{
    printf ("\nStart\n");
    print_tree_P (&H);
    printf ("\nDone\n");
}

It's a tree structure that is scattered over flash and RAM: left children (A, AB, C) are in RAM; the head and right children (H, CD, B, D) are located in flash.

printf is set up to print to UART. The output is:

Start
[*][A][a][b][C][c][d]
Done

So at least this small example works fine 8)

At first I got garbage because I wrote

printf ("[%s]", t->val);

However, printf's %s reads from RAM but gets a pointer to flash, so I changed it to

printf ("[%c]", *t->val);
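
An alternative to switching to %c is to copy the flash string into a RAM buffer first and hand that buffer to %s. A minimal sketch under stated assumptions: FLASH_AS and flash_to_ram are hypothetical names, the qualifier is spelled __flash as in released compilers (to my knowledge) rather than the snapshot's __pgm, and on a host the macro expands to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__AVR__) && defined(__FLASH)
#  define FLASH_AS __flash
#else
#  define FLASH_AS
#endif

/* Copy at most n-1 characters from a flash string into the RAM buffer
   dst and always NUL-terminate it (for n > 0). Returns dst. */
static char *flash_to_ram(char *dst, const FLASH_AS char *src, size_t n)
{
    size_t i = 0;

    if (n == 0)
        return dst;

    while (i + 1 < n && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';

    return dst;
}
```

With a small buffer on the stack, the call would then read something like printf ("[%s]", flash_to_ram (buf, t->val, sizeof buf)), at the cost of the copy and the buffer.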

avrfreaks does not support Opera. Profile inactive.


@SprinterSB: as this thread is about 4.6.1 and LTO I posted a new thread about 4.7 and PGM


abcminiuser wrote:
LTO and other optimizations should make for smaller binaries, which can lead to faster programs and/or cheaper designs.

- Dean :twisted:

At the risk of being repetitive, if you want smaller binaries through smarter linking, just use standard C.

That is, #include your source files into a single compilation unit.

This mechanism is standard, cross-platform, portable, and has worked for the past 30+ years: R included the mechanism in the first versions of C, K& wrote about it. It works with all versions of WinAVR.

Personally, I'm a little irritated when I see people using non-standard, platform specific work-arounds for something that is already well supported in standard C: it makes my work as a cross-platform maintenance programmer just that extra bit more difficult.


Quote:

That is, #include your source files into a single compilation unit.

Why do that when --combine -fwhole-program exists?


clawson wrote:
Quote:

That is, #include your source files into a single compilation unit.

Why do that when --combine -fwhole-program exists?
https://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=111466

Stefan Ernst


The avr-gcc currently in widest use is the 4.3.3 in WinAVR so surely it's still there in that? No doubt folks will prefer the LTO solution but only when there's a general distribution widely available that contains it.


Quote:
The avr-gcc currently in widest use is the 4.3.3 in WinAVR so surely it's still there in that?
Yes, but this thread is about 4.6 and LTO.

Melbourne asked: why not use the old-fashioned portable way as an alternative to LTO?
You say: use --combine instead
I say: --combine does not exist as an alternative to LTO (because it is replaced by LTO)
;-)

Stefan Ernst


Oh right I get your point - I was simply responding to the recent posts without taking into account the thread title (or previous contents). Sorry about that :oops:


melbourne wrote:
At the risk of being repetitive, if you want smaller binaries through smarter linking, just use standard C.
LTO/non-LTO does not affect the C standard. From the user's side it is pretty much like any other optimization option.
Quote:
That is, #include your source files into a single compilation unit.

This mechanism is standard, cross-platform, portable, and has worked for the past 30+ years: R included the mechanism in the first versions of C, K& wrote about it. It works with all versions of WinAVR.

No, it does not work for library code. You do not want to drag in libraries like libc or libm as source.

And encapsulation was not a topic 30 years ago but might be one today.

Quote:
Personally, I'm a little irritated when I see people using non-standard, platform specific work-arounds for something that is already well supported in standard C: it makes my work as a cross-platform maintenance programmer just that extra bit more difficult.
It's not non-standard. It's just a feature that doesn't affect the language standard or your build scheme.

Moreover, you need not change the sources (which you will do when dragging them into one source file just to mimic LTO).

You do not want to hack sources to emulate each and every optimization feature.

If by standard you mean "use paradigms popular in the stone age and never think about or change them", you are correct, of course.

avrfreaks does not support Opera. Profile inactive.


I tested zillions of parameters and tricks to reduce the size, and using one big .C or the --combine does not give the best results for my project (~16K binary). So YMMV.


clawson wrote:
Quote:

That is, #include your source files into a single compilation unit.

Why do that when --combine -fwhole-program exists?

Please refer to the original post. If there is something you do not understand, please post here again.

SprinterSB wrote:
Moreover, you need not change the sources (which you will do when dragging them into one source file just to mimic LTO).

I'm staggered at the suggestion that you would drag sources into one source file, and by the accompanying suggestions that you would change/hack sources.

The whole point is to use the standard C "include" mechanism, which was designed for the purpose, to include portable code.

SprinterSB wrote:
If you mean by standard "use paradigms popular in stone age and never think about or change them" you correct, of course.

If you think LTO (and encapsulation) is more modern than "include", you've never understood Pascal. Your ignorance explains, but does not justify, your childish insult.

Thank you both for your thoughtful and insightful responses. It's good to see the level of maturity and care that you bring to your contributions here.


Okay, can someone fill me in with what's going on with avr-gcc? I exchanged an email with Thibault North who maintains the package for the Fedora project, he didn't know why or when the XMEGA support was pulled from the upstream (FSF) release of gcc. He told me other Linux distros use these patches against gcc 4.5.1. I thought the AVR patches were already in the upstream. Fedora's policy is to use upstream code exclusively. I've tried to bring those patches forward into 4.6.1, but so far I am unsuccessful. Does anyone have a working patch set for 4.6.1 they can share?

I like cats, too. Let's exchange recipes.


melbourne wrote:
The whole point is to use the standard C "include" mechanism, which was designed for the purpose, to include portable code.
The original question is *not* about some AVR application; it's about building avr-gcc as a Canadian cross.

How can you think it's even possible to include all files? The compiler gets what it gets. Whether to include all headers is up to the user; the compiler has no choice in the matter.

avrfreaks does not support Opera. Profile inactive.


ninevoltz9 wrote:
I exchanged an email with Thibault North, who maintains the package for the Fedora project; he didn't know why or when the XMEGA support was pulled from the upstream (FSF) release of gcc.
What does "pulled from" mean? "Removed"? If so: xmega support went upstream for 4.7.0; see PR52261 and the GCC 4.7 release notes. At no point was xmega support removed from GCC.

Quote:
I've tried to bring those patches forward into 4.6.1, but so far I am unsuccessful.
What is the problem?

avrfreaks does not support Opera. Profile inactive.


Those patches relate to a previous version of Atmel's AVR Toolchain (3.3.x or earlier), which is derived from avr-gcc 4.5.1.

I wouldn't be surprised at all to discover that patches originally targeting 4.5.1 would be difficult to graft onto 4.6.1 -- especially for a target audience that doesn't have any interest in hacking the internals of the compiler, but rather wants a compiler that "just works".

The newest version of Atmel's AVR Toolchain (3.4.0) is derived from avr-gcc 4.6.x. Currently, it is only possible to obtain Atmel's AVR toolchain 3.4.0 integrated within Atmel Studio 6. All the necessary source code patches are included inside that binary distribution.

I haven't downloaded and installed Atmel Studio 6, so I do not have a copy of those patches myself. Otherwise, I would have been perfectly willing to post a copy of them for you to use.


Thanks for the updates, and for anyone looking for those patches in Studio 6, they are located here: C:\Program Files\Atmel\Atmel Studio 6.0\extensions\Atmel\AVRGCC\3.4.0.65\AVRToolchain\source\avr

I like cats, too. Let's exchange recipes.


lfmorrison wrote:
The newest version of Atmel's AVR Toolchain (3.4.0) is derived from avr-gcc 4.6.x. Currently, it is only possible to obtain Atmel's AVR toolchain 3.4.0 integrated within Atmel Studio 6. All the necessary source code patches are included inside that binary distribution.

I haven't downloaded and installed Atmel Studio 6, so I do not have a copy of those patches myself. Otherwise, I would have been perfectly willing to post a copy of them for you to use.

The patches are supplied here:
https://www.avrfreaks.net/index.p...

avrfreaks does not support Opera. Profile inactive.


Fedora 17 still does not have XMEGA support in the avr-gcc package (I built the Fedora source RPM on EL6):

    avr-gcc (Fedora 4.6.2-1.el6) 4.6.2
    Copyright (C) 2011 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    Known MCU names:
    avr1 avr2 avr25 avr3 avr31 avr35 avr4 avr5 avr51 avr6 at90s1200 attiny11
    attiny12 attiny15 attiny28 at90s2313 at90s2323 at90s2333 at90s2343
    attiny22 attiny26 at90s4414 at90s4433 at90s4434 at90s8515 at90c8534
    at90s8535 attiny13 attiny13a attiny2313 attiny2313a attiny24 attiny24a
    attiny4313 attiny44 attiny44a attiny84 attiny25 attiny45 attiny85
    attiny261 attiny261a attiny461 attiny861 attiny861a attiny87 attiny43u
    attiny48 attiny88 at86rf401 ata6289 at43usb355 at76c711 atmega103
    at43usb320 attiny167 attiny327 at90usb82 at90usb162 atmega8u2 atmega16u2
    atmega32u2 atmega8 atmega48 atmega48p atmega88 atmega88p atmega8515
    atmega8535 atmega8hva atmega4hvd atmega8hvd atmega8c1 atmega8m1 at90pwm1
    at90pwm2 at90pwm2b at90pwm3 at90pwm3b at90pwm81 atmega16 atmega161
    atmega162 atmega163 atmega164p atmega165 atmega165p atmega168 atmega168p
    atmega169 atmega169p atmega16c1 atmega32 atmega323 atmega324p atmega325
    atmega325p atmega3250 atmega3250p atmega328p atmega329 atmega329p
    atmega3290 atmega3290p atmega406 atmega64 atmega640 atmega644 atmega644p
    atmega644pa atmega645 atmega649 atmega6450 atmega6490 atmega16hva
    atmega16hvb atmega32hvb at90can32 at90can64 at90pwm216 at90pwm316
    atmega32c1 atmega64c1 atmega16m1 atmega32m1 atmega64m1 atmega16u4
    atmega32u4 atmega32u6 at90usb646 at90usb647 at90scr100 at94k atmega128
    atmega1280 atmega1281 atmega1284p atmega128rfa1 at90can128 at90usb1286
    at90usb1287 m3000f m3000s m3001b atmega2560 atmega2561

I like cats, too. Let's exchange recipes.

Last Edited: Wed. Jun 6, 2012 - 05:21 PM

You have applied all these?

C:\Program Files\Atmel\Atmel Studio 6.0\extensions\Atmel\AVRGCC\3.4.0.65\AVRToolchain\source\avr\gcc>dir | grep xmega
10/05/2012  16:51            28,668 301-gcc-xmega-v14.patch
10/05/2012  16:51             7,830 404-gcc-atxmega_16_32_a4u.patch
10/05/2012  16:51             3,069 405-gcc-atxmega64_128_192_256a3u.patch
10/05/2012  16:51             2,202 408-gcc-atxmega384c3.patch
10/05/2012  16:51             1,304 410-gcc-atxmega128a4u.patch
10/05/2012  16:51             1,339 411-gcc-atxmega64d4.patch
10/05/2012  16:51             2,343 413-gcc-atxmega64_128_b3.patch
10/05/2012  16:51             1,375 414-gcc-atxmega64b1.patch
10/05/2012  16:51             1,376 416-gcc-atxmega64a4u.patch
10/05/2012  16:51             1,390 417-gcc-atxmega128d4.patch
10/05/2012  16:51             3,745 419-gcc-atxmega16c4_32c4_128c3_256c3.patch
10/05/2012  16:51             1,344 420-gcc-atxmega384d3.patch
10/05/2012  16:51             1,391 424-gcc-atxmega192c3.patch
10/05/2012  16:51             1,335 426-gcc-atxmega64c3.patch
=================================================================================


binutils won't build for me with the patches applied; it fails in gas/config/tc-avr.c and bfd/elf32-avr.c with

Quote:
‘BFD_RELOC_AVR_7_LDS16’ undeclared

Something appears to be wrong with the patch 503-binutils-avrtc193-tiny.patch. It builds fine if I omit that patch.

I like cats, too. Let's exchange recipes.


ninevoltz9 wrote:
binutils won't build for me with the patches applied; it fails in gas/config/tc-avr.c and bfd/elf32-avr.c with
Quote:
‘BFD_RELOC_AVR_7_LDS16’ undeclared
Something appears to be wrong with the patch 503-binutils-avrtc193-tiny.patch. It builds fine if I omit that patch.
The patch is incomplete: it adds new relocs in bfd/reloc.c but does not supply the resulting changes to the auto-generated files. Note that the new relocs are introduced in comments (sic!) that must be parsed by the generator tools!

Run make headers in the bfd directory to regenerate those files.

avrfreaks does not support Opera. Profile inactive.


That fixed it. Here's an updated binutils patch that fixes the issue.

I like cats, too. Let's exchange recipes.


Here are the complete Linux toolchain packages I built for Fedora/RHEL6 with source RPMS and source archives, for anyone who's interested.

https://www.hindleyelectronics.c...

I like cats, too. Let's exchange recipes.