avr32-gcc omit frame pointer optimization related bug


Hi all!

After several days of fruitless bug hunting, I turn to you hoping someone can point out where the behavior I'm seeing comes from.

The problem occurs in a larger, partly interrupt-driven program compiled with avr32-gcc 3.2.3-261; I could narrow it down to a single, rather short source file.

The situation is as follows:

When I turn off the omit-frame-pointer optimization for the attached source (with -fno-omit-frame-pointer), the software runs fine at any optimization level. When I leave it on (by dropping that flag at any optimization level above -O0), the software's built-in self-checks start to detect corruption in the memory shared between the main and interrupt contexts.

I have already ruled out an ever-present synchronization problem: the running times of the various parts are not constant, and while searching for the bug I switched optimizations on and off for several other modules (which obviously changed the timing of the processes considerably) without affecting the situation at all. The bug only seems to occur when I allow the frame pointer to be omitted for the attached source file.

I have also attached two assembly listings generated from this source file at the lowest possible optimization level, one for each frame pointer setting. I tested both: one runs, the other produces the bug. My problem is that, with my limited knowledge of AVR32 assembly, I couldn't spot anything in the failing variant that could be the cause.

I am compiling for the UC3C1512C with the 32-bit Linux version of the above-mentioned avr32-gcc. I tried to update it, but the tool currently seems to be inaccessible on Atmel's site (it gives me an 84-byte, obviously corrupt .tar.bz file). I would also have liked to read a change log for the compiler, but I couldn't find one. Does anyone know if such a thing exists?

Thanks for any help in this matter!



I'd suggest moving this thread to the avr gcc forum.
Is it possible that you have a stack overflow?
I didn't really get what you mean by "memory corruptions".
Please explain in more detail.

Here are some questions that might help:
What kind of memory gets corrupted?
What data gets written to that memory?
What other memory region is next to this memory?
Are you using a modified linker script?
Are you using an operating system?
Did you try to enlarge the stack(s)?

I'm using exactly the same toolchain but there is a new version available:
http://www.atmel.com/tools/ATMEL...

-sb


Thanks for looking at my problem! :)

Are you sure I should move it to the avr-gcc forum? It's under the AVR (8-bit) Technical category, so I thought it only relates to that compiler.

I don't think I could have a stack overflow. The software has self-checks for this in the form of guard patterns at both the beginning and the end of a 4 KB stack, and these have never triggered. There are no recursive call patterns in it, so I really don't think I could get anywhere near overflowing it.
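
For readers unfamiliar with the technique, a guard-pattern check of this kind might look roughly like the sketch below. It is not the poster's code: the names (`stack_region`, `stack_guards_init`, `stack_guards_intact`), the guard size and the pattern value are all assumptions made for illustration, and a plain array stands in for the linker-placed stack so that the sketch compiles on a PC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 1024u        /* 4 KB of 32-bit words, as in the post */
#define GUARD_WORDS 4u           /* hypothetical guard size */
#define STACK_GUARD 0xDEADBEEFu  /* hypothetical guard pattern */

/* Stand-in for the real stack area: guard | usable stack | guard. */
static uint32_t stack_region[GUARD_WORDS + STACK_WORDS + GUARD_WORDS];

/* Write the pattern into both guard zones (done once at start-up). */
void stack_guards_init(void)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        stack_region[i] = STACK_GUARD;
        stack_region[GUARD_WORDS + STACK_WORDS + i] = STACK_GUARD;
    }
}

/* Return true while both guard zones still hold the pattern. */
bool stack_guards_intact(void)
{
    const volatile uint32_t *r = stack_region; /* force real memory reads */
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        if (r[i] != STACK_GUARD)
            return false;
        if (r[GUARD_WORDS + STACK_WORDS + i] != STACK_GUARD)
            return false;
    }
    return true;
}
```

A check like this only catches writes that land in the guard words themselves; an access that jumps past them would go unnoticed.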

The whole software uses the built-in 64 KB SRAM of this UC, if that was the question (so I haven't used the 4 KB HSB RAM yet).

The particular shared data area is quite large; it makes up a major portion of the software's memory allocation. A double-buffering scheme over it shares data between the asynchronous parts of the program, and it is guarded by the above-mentioned (failing) interrupt blocking / unblocking mechanism. The self-check runs after a synchronization point, under the proper blocks, to check the whole area's consistency. In the failing build it occasionally reports failures from all over the area, as if the interrupt blocking mechanism were ineffective.
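
The general shape of such a scheme can be sketched as follows. This is only a PC-compilable illustration under stated assumptions, not the attached source: `shared_block`, `shared_unblock`, `isr_produce` and `main_swap_and_check` are hypothetical names, a flag stands in for the real interrupt mask, and "consistent" is simplified to "all words in the buffer are equal".

```c
#include <stdbool.h>
#include <stdint.h>

#define BUF_WORDS 8u  /* hypothetical buffer size */

static uint32_t buffers[2][BUF_WORDS];
static volatile unsigned front = 0;       /* buffer owned by the main loop */
static volatile bool irq_blocked = false; /* stand-in for the interrupt mask */

/* Hypothetical stand-ins for the target's blocking / unblocking calls. */
void shared_block(void)   { irq_blocked = true;  }
void shared_unblock(void) { irq_blocked = false; }

/* Simulated interrupt handler: fills the back buffer with one value.
 * It does nothing while blocked (on real hardware the interrupt would
 * simply not be serviced during that time). */
bool isr_produce(uint32_t value)
{
    if (irq_blocked)
        return false;
    for (unsigned i = 0; i < BUF_WORDS; i++)
        buffers[front ^ 1u][i] = value;
    return true;
}

/* Main-loop side: swap the buffers under the block, then verify that
 * the new front buffer is consistent before releasing the block. */
bool main_swap_and_check(void)
{
    bool ok = true;
    shared_block();
    front ^= 1u;
    for (unsigned i = 1; i < BUF_WORDS; i++) {
        if (buffers[front][i] != buffers[front][0])
            ok = false;
    }
    shared_unblock();
    return ok;
}
```

If the block works, the check can never observe a half-written buffer, which is why failures reported "from all over the area" point at the blocking step itself or at the reads performed under it.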

I am not using any operating system, nor ASF, for the project, so the whole software down to the last bit is under my control. The linker script was modified, though. Two things come to mind right now: an area to point the EVBA at, and the stack canaries at the ends of the stack. While we are on the subject, I also spotted a small, definitely compiler-related bug with that, which I could work around. I will test it, and if it still stands, I will post it in a separate thread with a proper test case.

As for the compiler version, I will of course try out the new version! (As I said above, so far I only got a corrupt package from Atmel's site. As of this morning they have apparently fixed this and I can download it.)

Meanwhile the download finished, so I did some test compiles and runs. The bug seems to be present with the exact same behavior in 3.4.1-348.

Thanks again, I hope we can sort it out! (Could someone with assembly knowledge please look at those assembly listings? Is there anything in them that I couldn't spot and that might cause the source file to misbehave?)


After tons of further testing, it seems this is more likely a processor bug than a compiler-related one.

First I removed all memory writes to the area in question, and every extra component the software could run without. So effectively all that remained was a communication protocol (to report back the problem) running in the interrupt, and the problematic self-check code with the blocks. The situation persisted.

I experimented with the following:
Replaced the blocking mechanism with a global interrupt disable / enable. Result: the program does not fail.
Removed the blocking mechanism. Result: the program does not fail (as expected, see above: nothing writes to the block any more).
Added instructions after the blocking mechanism. Result: sometimes the failure persisted, sometimes it disappeared entirely.

After some modifications I got to the point where the program failed reliably and frequently enough, and the result of the offending comparison was sent out on the comms line. At first I found that, despite the failures, the reported result was correct (that is, the data returned on the comms line indicated that the two values involved in the comparison were identical). Then, after modifying the program so that each location was read only once, I could get the failure reported on the line.
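
The difference between the two reporting approaches can be illustrated with a small sketch. It is not the poster's code; `compare_record_t` and `compare_once` are hypothetical names. The point is that each location is read exactly once into a local copy, and both the comparison and the report use those copies, so a transient bad read cannot be hidden by a second, correct read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of one comparison: the values actually read, plus the verdict. */
typedef struct {
    uint32_t a;
    uint32_t b;
    bool     equal;
} compare_record_t;

/* Read each location exactly once (volatile prevents the compiler from
 * re-reading or merging the accesses) and compare the captured copies. */
compare_record_t compare_once(const volatile uint32_t *pa,
                              const volatile uint32_t *pb)
{
    compare_record_t rec;
    rec.a = *pa;
    rec.b = *pb;
    rec.equal = (rec.a == rec.b);
    return rec;
}
```

Sending `rec.a` and `rec.b` over the comms line, rather than re-reading `*pa` and `*pb`, reports the values the comparison really saw.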

The result:

It looks like after disabling some interrupts on the USART, if a USART interrupt occurs at the same time somewhere (I have not yet attempted to localize this; maybe it is not even necessary for the fault), a read access will fail, returning 0xFFFFFFFF instead of the real contents of the memory.

I will try to go further along these lines, hoping I can get a reliable test case for it.

EDIT:

I managed to squeeze out a test case demonstrating the processor bug. I posted it here:

https://www.avrfreaks.net/index.p...