[SOLVED] Memory read corruption after disabling interrupts

#1

Hi all!

UPDATE: I think I can say this is "solved" now. You may read more on this here: https://www.avrfreaks.net/index.p... , and of course Catweax's topic: https://www.avrfreaks.net/index.p... as this was discovered to be the same bug.

Following up on one of my previous posts: I have tracked down what looks like a processor bug.

The environment:
Linux + avr32-gcc 3.4.1-348
UC3C1512C part

The processor is used on a custom board. We did not use any evaluation board, so I cannot give a "ready to use" test case. I tested it on several different boards of ours and the failure reproduces on all of them, so I think it really is in the processor (that is, our application is not at fault).

The failure described:

When using a USART (this may also apply to other peripherals, I did not test), if the USART's interrupts are disabled at the peripheral through its 'idr' register, and this register was accessed indirectly, some read accesses a few cycles later will partially fail, apparently reading zero for the bottom or the top halfword of a 32-bit word.

A test case is attached, it works as follows:

It is set up for one of our boards, but I don't think it will be hard to adapt it to another board.
It sets up a USART interrupt and runs a stream of instructions that triggers the bug. On the USART it repeatedly spits out an 8-byte sequence which includes a fault counter and the failing read (this is only updated on a fault, so it can be watched in real time). It seems to work reliably with any optimization setting; if it does not (or after modifying the USART routines), the number of 'nop's in the source generating the fault may be adjusted. The fault seems to slide around depending on various factors I could not identify, probably including the absolute placement of the generating code.

Anyone willing to verify this?


Last Edited: Thu. Oct 11, 2012 - 08:42 AM
#2

I would like to get some help on this matter, in particular on what I could do to get some sensible response from Atmel.

We (I mean the company where I work) sent this bug report in a similar form about two weeks ago (when I originally posted it here), and have had absolutely no response from them since. We had problems with the datasheet previously, and they did not respond to that either, apart from sending back pointers to the exact same sheets we had problems with, again two weeks later.

Would it help if we ordered an eval board from them (we never had any) and modified this test case to run and reproduce the bug on that? (Or again: could someone please verify this using an eval board? I really don't think this test case is so large that trying it out would be a problem. I could port it from one of our boards to another, with a different oscillator and a different USART, in about 10 minutes.)

#3

I just tried your demo program on an AT32UC3C1512CrevC. It took me a bit to get it rolling because I only have ASF version 1.7 installed and I had to copy the newlib into your project.

When running the program I get 0xAA55AA71 as the incorrect result from liocheck() instead of 0. As this is the result from the XOR with 0xAA55AA55, variable r must’ve been 0x24. r is the output from your assembly code and the same register that got fed with the address of r0 before the asm block. Disassembling the binary shows that r0 is located at address 0x24.

// %0 contains 0x24
st.w	%1[0], %2
nop
nop
nop

// %0 should be loaded with whatever its held address points to
ld.w	%0, %0[0]
// %0 is still 0x24

I’d say that the ld.w instruction is skipped and not executed at all, which results in variable r holding the address of r0.

I think this is exactly the same bug that I described in another thread. It occurs when an interrupt line is lowered too quickly after it was unmasked, tripping up the CPU’s interrupt handling so that exactly one assembly instruction is skipped. Atmel confirmed that bug, but offered no solution. So for me it’s a case of “I have to make sure it doesn’t happen”. My solution is in that linked thread. While you’re masking the interrupt instead of unmasking it, I believe the root cause is the same.

So I don’t think the bug you’re observing is caused by an indirect register access at all.

Maybe you could try this:
- add one more nop to the list
- replace the ld.w instruction with "mov %0,0xABCD"
- replace the 0xAA55AA55 in the final XOR with 0xABCD

When I do that, I get 0x0000ABE9 as the failing value, so the content of r is still 0x24.
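The arithmetic behind both readings can be checked on a host machine. This is a minimal sketch, assuming that the check simply XORs the value in r with the expected constant; the helper name recover_r is made up for illustration and is not part of the posted test case.

```c
#include <stdint.h>

/* Hypothetical helper (not from the posted test case): undo the final
 * XOR to recover what the register held when the check ran.
 * XOR is its own inverse, so applying the constant again gives r. */
uint32_t recover_r(uint32_t failing_value, uint32_t expected_constant)
{
    return failing_value ^ expected_constant;
}
```

Both recover_r(0xAA55AA71, 0xAA55AA55) and recover_r(0x0000ABE9, 0xABCD) give 0x24, which is the address of r0 in the disassembly.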

If that instruction is skipped, compiler optimisation level, type of instruction and similar are all irrelevant. Whenever you change something, the instructions in the compiled code tend to move around a bit and a different instruction ends up in the place that’s skipped, so you might be led to believe that whatever change you made is the cause, while it’s actually not. If you try my demo program from the thread I linked above, you might see the error occurring in different places in the output, depending on compiler and optimisation settings.

So yeah, I’m betting on the interrupt trip-up, but I could be wrong, of course.

#4

Huh, thanks for looking into this!

It is strange how this would relate to that problem, but maybe it does! I will try to verify once I get the AVR stuff back here. Originally I tried to think along the lines of the problem you described, but put that aside because of the different environment.

It is plausible, though. Maybe I needed the indirect load before the action because without it the register would somehow already contain the needed value (that could easily be the case if no other code in the main program writes that register).

In your experiment you described the problem as probably a situation where the interrupt signal is too short to properly initiate interrupt handling, and that messes up the PC. There is clearly a possibility for this in my code if the disable comes right after a TxEmpty condition.

The bad part is that this then seems to be a much broader problem than it originally looked, as the DMA controller is not necessary for it at all: just a race condition where an interrupt signal becomes too short, and then bam, an instruction a few cycles later vanishes into thin air.

I will definitely pick this up again in a few days and do some further experiments along these lines. I hope something will turn up from those!

By the way, how come you needed ASF to run it? It's true I used it, but only to extract the header pack providing the definitions for the various AVR32 I/O ports: , not using any actual code from the framework (I compile with just these under Linux, without anything else from ASF in my toolset, only the headers in the avr32/io.h hierarchy).

#5

dev.null wrote:
The bad part is that this then seems to be a much broader problem than it originally looked, as the DMA controller is not necessary for it at all: just a race condition where an interrupt signal becomes too short, and then bam, an instruction a few cycles later vanishes into thin air.
Yes, the DMA controller is probably not needed, but it was the easiest way I could think of for lowering an interrupt request line right after unmasking it. Especially because I needed a few cycles for that nop-slide thing.

dev.null wrote:
By the way, how come you needed ASF to run it? It's true I used it, but only to extract the header pack providing the definitions for the various AVR32 I/O ports: , not using any actual code from the framework (I compile with just these under Linux, without anything else from ASF in my toolset, only the headers in the avr32/io.h hierarchy).
While you don’t use any source code from the ASF in your project, there’s a bit of stuff going on during the initialisation phase, after the MCU jumped to the program entry point at address 0x80000000 and before entering main(). That code takes care of setting the stack, copying initialised global variables from Flash to SRAM and clearing all uninitialised global variables. For some reason my avr32-gcc uses code for that which is a bit lacking, so all my projects need either the newlib binary blob that comes with the ASF because that brings its own working initialisation code along, or a custom assembly snippet to set everything up.

#6

Ah, understood!

I wanted to EDIT my previous post and wondered what happened to the server when it dropped me off with "Bad request" (apparently it can't take the percent character, as I figured out).

So "EDIT":

Uhh, something still smells of dead fish here... Above I wrote "somehow already contain the needed value", which clearly can't be the case if the omitted instruction is that "ld.w 0, 0[0]" (note: two percent chars are missing from this op). Another thing bugging me is that in various tries I also got 0xAA550000 or 0x0000AA55 as results, as if only the top or bottom half of the load had been dropped (maybe these also came from a missing op, but which one?). Anyway, more experiments will surely come (in which I will pay close attention to the generated ASM code, too)!

#7

For the percent sign you need to write its HTML character code (for example &#37;) instead of the literal %.
Also when you edit your post, you’ll have to replace all percent signs with that code again.

The constant you’re using is 0xAA55AA55, so it’s 32 bits long. The AVR32 doesn’t support 32 bit immediates, so instead you’ll find something like this in the generated (or disassembled) code:

80000408:       ee 18 aa 55     eorh    r8,0xaa55
8000040c:       ec 18 aa 55     eorl    r8,0xaa55

If your problem is caused by a skipped instruction and one of these two instructions is skipped, you obviously get only half of your constant.
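The effect of dropping one of the two instructions can be modelled on a host machine. The sketch below assumes the loaded value is the expected 0xAA55AA55 and that exactly one instruction is dropped; eorh and eorl here are plain C models of the AVR32 instructions, and check_with_skip is a hypothetical harness, not code from the test case.

```c
#include <stdint.h>

/* C model of AVR32 eorh: XOR a 16-bit immediate into the upper halfword. */
uint32_t eorh(uint32_t reg, uint16_t imm)
{
    return reg ^ ((uint32_t)imm << 16);
}

/* C model of AVR32 eorl: XOR a 16-bit immediate into the lower halfword. */
uint32_t eorl(uint32_t reg, uint16_t imm)
{
    return reg ^ (uint32_t)imm;
}

/* Hypothetical harness: perform the two-instruction XOR with 0xAA55AA55,
 * optionally dropping one instruction to mimic the suspected skip. */
uint32_t check_with_skip(uint32_t loaded, int skip_eorh, int skip_eorl)
{
    uint32_t r8 = loaded;
    if (!skip_eorh)
        r8 = eorh(r8, 0xAA55);
    if (!skip_eorl)
        r8 = eorl(r8, 0xAA55);
    return r8;
}
```

With a correct load, skipping eorh leaves 0xAA550000 and skipping eorl leaves 0x0000AA55, which are the two half-constant results reported earlier in the thread.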

#8

Huh, nice find! I didn't even consider the part after the load! That eor part comes after the interrupt enable. That's quite weird, since I see no race condition there, and it seems to be relatively far from the disable with its probable race condition.

I checked your code. I see you enable the DMA, wait a variable number of NOPs, globally enable the interrupts, then start copying. Your explanation implies that the first character would arrive at the UART (clearing TxEmpty) just after the interrupt enable, and that the resulting short signal would somehow mess up the internal state of the processor so that it misses an instruction.

It looks like the bug happens (or rather: appears to happen) two or so instructions after the situation (the removal of the interrupt signal), so within a relatively short distance. This could well explain the missing "ld.w" in my code, but not the other one.

So the second situation you pointed out looks weird here, but it definitely happens. There is nothing in my code there that would withdraw the interrupt signal, so that interrupt enable does not look like it could cause a race condition. But something still happens in there.

I think I will be back some days later with probably some fried UC's!

#9

dev.null wrote:
I checked your code. I see you enable the DMA, wait a variable number of NOPs, globally enable the interrupts, then start copying. Your explanation implies that the first character would arrive at the UART (clearing TxEmpty) just after the interrupt enable, and that the resulting short signal would somehow mess up the internal state of the processor so that it misses an instruction.
Yep, exactly what I was thinking.

dev.null wrote:
It looks like the bug happens (or rather: appears to happen) two or so instructions after the situation (the removal of the interrupt signal), so within a relatively short distance. This could well explain the missing "ld.w" in my code, but not the other one.
Well, if there’s one such error in the AVR32 core, why not another related one? Happy debugging. ;-)

dev.null wrote:
I think I will be back some days later with probably some fried UC's!
Fried food is bad for your health! ;-)

#10

Either I am dumb, or there is not just a fish but a huge dead rotting whale hidden in there somewhere...

So now that a more urgent project is over, I picked up the AVR stuff again for some further experimenting.

First I checked that indirection. Why would it have been needed? Without it I get a properly working program (that is, it does not fail), and with it, one which fails. Checking the ASM listing, I found that the version with the indirection is 2 bytes shorter (in that function) than the one without it. So what, let's just pad it with a nop just before the disable and see what it does.

Judging by the transmission the program seemed to work, but the LED stopped blinking!! So the main program "died" after a short while. What?!

I tried adding some more 'nop's in there (just above the disable, in the same ASM block), with the following results:

1 nop: Main program dies, but transmits (zeros)
2 nops: Deadlock (neither blinks nor transmits)
3 nops: Deadlock
4 nops: Deadlock
5 nops: Deadlock
6 nops: Program works (no fault at all)
7 nops: Deadlock
8 nops: Deadlock

As another try, I added an 'or r8,r8' before the interrupt disable. Same result as with a single 'nop'.

(The rest of the program was not altered in any way; it is the same as posted here. There are three 'nop's enabled in the ASM block, and I added these at the top, before the 'st.w' disabling the interrupts.)

Now if I hadn't seen these with my very own eyes, I just wouldn't believe it. Care for a try?

(By the way still no response from Atmel for the company's letter)

#11

During the debugging of my problem I noticed that depending on the amount of NOPs that I inserted, a different instruction would be in the spot that gets skipped. With many NOPs the error occurred in the function where I enable interrupts, but with few or no NOPs the errors would appear all over the code because they would be triggered after the current function returned to whatever was interrupted before. I also used simple LED toggling to narrow down the problem, but sometimes those instructions would be the ones to get skipped, resulting in no LED toggling but a working program.

I found that because the problem is not triggered by what my program does just before it crashes but rather by what it did a few clock cycles before, fiddling with NOPs didn’t really help me.

I suggest that you try to figure out which interrupt status lines give you trouble and then come up with a way to guarantee that they’re always consistent.

In my project I programmed the PDCA to transfer data to the USART, then enabled the TXEMPTY interrupt. The problem was that the USART immediately signalled TXEMPTY once I re-allowed interrupts because the PDCA had not yet transferred any data to it, but did so just when the CPU wanted to handle the TXEMPTY interrupt. I worked around that by manually writing the first byte into the USART, programming the PDCA to transfer the remaining bytes and then enable the TXEMPTY interrupt. Because I already fed the USART with data, TXEMPTY doesn’t rise prematurely any more and everything’s fine.

#12

Yeah, maybe I should correct myself: that's just a walrus, not a whale, and I was a bit dumb there yesterday afternoon :)

The weird situations really could have happened because something after that 'ld.w' got skipped. I checked the assembly, and just after the 'ld.w' a four-byte (2 word) instruction followed (storing the result obtained from the 'ld.w'). It may be that this was somehow partially skipped, completely tipping over the program. At least this behavior would somewhat explain the results I obtained a month ago with the original program compiled with IAR.

So now I put a little thought into it, looked at an instruction sheet, and found what I needed: the 'cbr' instruction. I changed the test case to use it (attachment): it will tell which instruction goes missing down there (and, well, this also verified that we are talking about the same processor problem).

I have never seen an instruction timing reference (is there one?), but it seems that the 'cbr' instruction executes twice as fast as a 'nop' (at least judging from the fact that six of these are needed to reach the skipped instruction, as opposed to three 'nop's).

The very interesting find is that the bug does depend on the preceding instructions in the main program, but it is very unclear how. It does not need the indirect access as I originally thought, but replacing it with straight loads while making sure the 'st.w' disabling the interrupts is at the same address in both programs does not give the same behavior (the behavior of a particular build is quite distinct and reproduces well across power cycles).

It is possible to shuffle the "prolog" in such a way that the bug seemingly vanishes completely, or triggers only about once a minute, while with other builds it can occur as often as forty times a second with this setup.

And now, as for the impact, and why it is very hard to work around...

You also experimented with TxEmpty, but in your topic someone verified that this also happens with the TWI peripheral. We also learned that the global interrupt disable does not circumvent this problem either (as your test case showed). It is also clear that the only thing needed to trigger the problem is an interrupt which could occur at the time we are attempting to disable it. And this is a very serious problem.

Based on our current knowledge, it seems the only safe way to avoid this problem is to not disable an interrupt while it could still occur. This is indeed possible for things we trigger ourselves (like TxEmpty, which happens as a result of our own actions), but is nearly impossible for external sources (if the bug can occur on such lines too). So once a truly asynchronous source has been enabled, it cannot be safely disabled any more. The program must then be set up so that, after enabling the external sources, it uses other means of synchronization than temporary interrupt disables, which could be really tedious (it is always easier to delay the interrupt a little while we hastily copy away a buffer or so).

In my case, the program from which I distilled this test case will have to be completely ditched and rewritten from the ground up (or, using our current knowledge of the bug, I could still pad all interrupt disables with 8 or so 'nop's to be safe).

So for now this is it, maybe I will try one or another thing with some other interrupt sources to get some glimpse on how broad this problem could be.

UPDATE:

I did a short experiment with a timer interrupt, specifically Timer/Counter 0, channel 0, RC compare. I didn't even need to hack around anything: just giving it a go with a 100 kHz interrupt rate, the bug clearly reproduced. Then I tried the asynchronous timer's periodic interrupt, with the same result. What's more, with the AST I now even got skips of the 9th 'cbr' instruction (reading 0x00000100), which I never got before. Probably just "blind luck", but it might also indicate that this thing depends on some other factors (such as the clock on which the interrupt source operates, maybe).
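The reading 0x00000100 for the 9th 'cbr' suggests how a reading maps back to the skipped instruction. The sketch below is a host-side decoder that assumes the n-th 'cbr' clears bit n-1 of a register whose bits all start out set, so a skipped instruction leaves exactly one bit behind. That layout is inferred from the single data point above, not taken from the attached source, and the function name skipped_cbr is made up.

```c
#include <stdint.h>

/* Hypothetical decoder: returns the 1-based position of the skipped cbr,
 * 0 if the reading is zero (nothing skipped), or -1 if more than one bit
 * survived (which the single-skip assumption cannot explain). */
int skipped_cbr(uint32_t reading)
{
    int pos = 1;

    if (reading == 0)
        return 0;
    if (reading & (reading - 1))   /* more than one bit set */
        return -1;

    while ((reading & 1u) == 0) {  /* walk up to the surviving bit */
        reading >>= 1;
        pos++;
    }
    return pos;
}
```

Under this assumption, skipped_cbr(0x00000100) returns 9, matching the AST experiment.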

UPDATE #2:

Apparently my assumption that the bug would reproduce with the global interrupt disable was false. I checked Catweax's code, and yes, he just enabled the interrupts through the status register (in his case the DMA took away the interrupt request). So now it seems that it is completely safe to disable any of these interrupts with either the global disable or the priority-based disable in the status register! That can pretty much solve the majority of the situations this bug would affect.

For example, from this it looks like if one wants to get rid of just one particular interrupt, one could do a global disable, then disable the individual source, and finally do a global re-enable to avoid this problem (and its as yet unknown other impacts) altogether. I will test this though (along with some other stuff that came to my mind just now).

In the end I think I will put together a short article on what we discovered about this behavior and how it can be avoided safely (that is, solutions which do not rely on directly working around a problem that is not fully understood, such as dropping a few nops here or there, but ones which prevent its occurrence in the first place).


#13

Interesting discoveries! I think I am pretty much done with it at this point! :)

I started to experiment with the relation of the global & peripheral interrupt disable / enables, and I think I got really clean results now showing the nature of the bug and how the processor's interrupt logic works.

The attached test cases can be used in the code I supplied in the first post: just replace its "liocheck.c" source with them to try. They use the USART's TxEmpty and the global interrupt mask, but I also tried these out with the asynchronous timer and peripheral level interrupt masks, getting the same behavior. In the sources, what I describe below is also explained in comments.

So what seems to be going on here?

Apparently in the AVR32 UC the interrupt logic has two distinct levels. One is the peripheral level (I do not mean the actual individual peripherals here, but rather what is probably a multiplexing hub where the peripheral interrupt logic is carried out for all peripherals), and the other is the core level, where the interrupt masks in the status register operate.

The peripheral level interrupt handling seems to go its own way without knowing anything about what is going on in the core, including whether interrupts are disabled at core level or not. This shows well in that even if the core disables the interrupts and the peripheral interrupt source is disabled afterwards (such as by masking the USART's TxEmpty), the instruction skip can still happen in exactly the same way if the core enables the interrupts soon enough.

From the behaviors I experienced, it seems that both parts need to work together for the problem to occur. A short interrupt signal is required from the peripheral level (either by using an "idr" register or by any other means, as Catweax's DMA experiment showed), and it must be allowed to pass into the core. But if one just disables and enables interrupts at core level (via the status register), one cannot get such behavior (an instruction skip). The situation has to be set up at the peripheral level, where some preprocessing happens that leads to the skip in the core if the core allows interrupts to pass at that time. The test cases also show this well: the location of the instruction skips does not change with the global interrupt enable; it relates to when the hazard happened at peripheral level.

And now what?

So in conclusion it looks like the peripheral level interrupt handling works on its own and has some kind of problem in it that affects the core if interrupts are enabled.

This cannot be avoided directly: once a peripheral interrupt source has been enabled, one will definitely run into this hazard when trying to disable it, which is the most common scenario, and of course it is likely to happen in any situation where the interrupt signal is "short". Only the consequence can be masked, by masking the interrupts at core level.
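The ordering this implies can be written down as a host-side model. This is only a sketch under stated assumptions: struct irq_model stands in for the status register's global mask and a peripheral's interrupt mask, disable_source_guarded is a made-up name, and SETTLE_CYCLES is a placeholder borrowed from the "8 or so nops" guess earlier in the thread. How long the core mask must stay in place has not been established here.

```c
#include <stdint.h>

/* Host-side stand-ins for hardware state (hypothetical; on a real UC3
 * these would be the status register's global mask bit and the
 * peripheral's idr/imr registers). */
struct irq_model {
    int      global_masked;   /* 1 while the core ignores interrupts */
    uint32_t peripheral_imr;  /* currently enabled peripheral sources */
};

/* Placeholder settle time; an assumption, not a measured value. */
#define SETTLE_CYCLES 8

void disable_source_guarded(struct irq_model *m, uint32_t source_bit)
{
    volatile int i;
    int was_masked = m->global_masked;

    m->global_masked = 1;               /* 1. mask interrupts at core level    */
    m->peripheral_imr &= ~source_bit;   /* 2. idr-style disable of the source  */
    for (i = 0; i < SETTLE_CYCLES; i++) /* 3. let the peripheral level settle  */
        ;
    m->global_masked = was_masked;      /* 4. restore the previous core mask   */
}
```

The point of the model is the order of the four steps: the peripheral-level disable happens entirely inside the core-level mask, and the mask is restored rather than unconditionally cleared so that the helper can be called from code that already runs with interrupts masked.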
