Flash memory erase problem


I've encountered a field failure where ATmega32 chips come back physically undamaged but with their Flash memory erased, while their EEPROM is untouched. The program fuses are set/cleared and program memory is erased, just as though the Studio had been used to erase the part pending reprogramming. We know the EEPROM doesn't get erased because the OSCCAL byte stored there still holds its initial calibration value. The customer's application board does not support any field programming or debugging: no JTAG or debug connections are brought out.

What I am looking for is not so much a diagnosis as any experience with similar Flash (not EEPROM) loss. Atmel has contributed one similar instance, but I need to know if anyone on the forum has seen anything like this and, ideally, if you were able to fix the problem.

Thank you.

Charlie H.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


Could it be that the customer (or someone else) tried to read the FLASH and inadvertently erased it?

Jim Wagner
Oregon Research Electronics

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net


Not likely for various reasons, most of which I cannot go into in specific detail. We're pretty sure the problem is as described, just haven't found the smoking gun.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


It is at least theoretically possible that an AVR running in the twilight region, with Vcc below the level required for its clock speed or simply too low, might do something strange, maybe even erase the FLASH. Are you using the Brown Out Detector, and is the BOD level set correctly?


Also, what type of environment is this in?


The clock is the internal 8MHz clock, trimmed using Atmel's methods and hardware to be reasonably accurate.

BOD is definitely enabled and this was verified. Even if the wrong BOD selection had been made, tests on my bench cycling power from 3.3 to 1.0 V and back don't result in any problems. (Interesting what you can do with an arbitrary function generator.)

The environment is a fan-cooled rack mount. We also have lots of experience with this part in other customers' very similar applications, with no thermal Flash degradation for any other customer. So far as we are informed, if the customer leaves power on, the part is fine. It is the transient AC loss test that erases memory at random.

BTW - if we do discover the cause, I'll post the answer for the benefit of other engineers working with Atmel products.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


SPM used at all?


No SPM during the customer's test. What we are seeing is a full-on erase under internal micro-program control, not corruption. Atmel has told us this has been seen when severe noise is injected into an open reset line, but in this case the customer tried tying the line to VDD and still sees the problem.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


Do you have some external capacitance on the reset line?

Jim



I cannot guess at the cause or solution. I can only say that in dozens of AVR designs since 2000, and many tens of thousands of units in the field, most in "industrial" applications, I have never seen a single case of flash corruption or erasure. (Sure, there could have been an unreported case here and there. But if there were multiples with a particular product we as the designers would hear about it just as with OP.)

So it is a puzzle, especially if no SPM in the program.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


The original circuitry had nothing attached to the RESET. Tying it to VDD didn't seem to help. A suggestion I made was to use a supervisory circuit to hold RESET low when VDD was below spec. This is being tested and tentative feedback is that this clears the problem.

In AVR040, Atmel recommends an external reversed diode to ground, a resistor to VDD and a cap on RESET for very noisy environments. Both that and an active circuit have the effect of making the RESET pin a low-impedance node, and hence less susceptible to noise.
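To get a rough feel for what the resistor-and-cap part of such a network does at power-up, here is a minimal sketch of the RC charging time on RESET. The component values (10 kOhm, 100 nF), the 3.3 V supply and the 0.9 x Vcc release threshold are illustrative assumptions, not figures from this thread or from AVR040; the real RESET threshold should be taken from the datasheet.

```python
import math

def rc_rise_time(r_ohms, c_farads, vcc, v_threshold):
    """Time for an RC node charging from 0 V toward vcc to reach v_threshold."""
    if not 0 < v_threshold < vcc:
        raise ValueError("threshold must lie strictly between 0 V and Vcc")
    tau = r_ohms * c_farads
    return -tau * math.log(1.0 - v_threshold / vcc)

# Assumed example values (hypothetical, not from the thread).
R_PULLUP = 10e3          # pull-up resistor from RESET to VDD, ohms
C_RESET = 100e-9         # capacitor from RESET to ground, farads
VCC = 3.3                # supply voltage, volts
V_RELEASE = 0.9 * VCC    # assumed worst-case level at which RESET is released

tau = R_PULLUP * C_RESET
t_release = rc_rise_time(R_PULLUP, C_RESET, VCC, V_RELEASE)
print(f"tau = {tau * 1e3:.2f} ms, RESET released after roughly {t_release * 1e3:.2f} ms")
```

The same RC that delays release also low-pass filters fast spikes on the pin, which is the "low impedance node" effect described above. It does nothing to clamp a sustained overvoltage.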

As noted in the original post, Atmel is aware of the problem of high energy noise on this pin causing another part to enter into high voltage programming mode - and erasing the part in the fashion noted by us. This was discovered by another Atmel customer in a motor control application (not surprising that it was a very noisy environment.)

Customers have put thousands of our devices using ATmega32 chips into their board without a hint of this happening. So I don't think this problem is necessarily an indictment of the '32 or its family. The lack of reports of similar instances, coupled with the measures that seem to have corrected the behaviour, tends to support the view that this was an EMI problem. With enough noise any controller will get into trouble. It is just the manifestation that was surprising.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


Quote:

The original circuitry had nothing attached to the RESET.

We always have a resistor+cap on /RESET. Certainly more resets would happen without it due to the weak nature of the internal pullup, but resets don't really explain erasure (or similar symptoms).

Quote:

A suggestion I made was to use a supervisory circuit to hold RESET low when VDD was below spec.

So it appears that as with EEPROM "corruption", simply enabling the BOD is the solution? (no "supervisory circuit" needed) That was too obvious to even consider at first read.

Quote:

Customers have put thousands of our devices using ATmega32 chips into their board without a hint of this happening.

With no /RESET circuitry and no BOD enabled? Wow. Scary.

Lee



Not exactly, the BOD WAS enabled. It wasn't able to help as it appears that the problem was over voltage hitting the reset pin. A high enough voltage there pops the uC into high voltage programming mode. The first thing that happens when you enter that mode in that fashion? Erasing the Flash.

Note that I don't know yet if the reset solution is in fact fixing the problem. It may take another day or two before the customer is satisfied by banging away on his system.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


Besides attacking the symptoms, think about finding the root cause. If those kinds of voltages are running amok around your AVR, then the Absolute Maximum Ratings will be routinely violated. Long-term recipe for disaster even if things appear stable in the short term.

Lee



Microsuffer wrote:
BOD is definitely enabled and this was verified. Even if the wrong BOD selection had been made, tests on my bench cycling power from 3.3 to 1.0 V and back don't result in any problems.

Is this the ATmega32 or ATmega32L? If you stay within specification for the ATmega32, then your operating Vcc must be 4.5 volts or higher at all times. In this case BODLEVEL = 0 is the correct setting (4.0 volt typical). A note indicates the part has been tested for correct operation down to Vbot, even though that is lower than the 4.5 volt minimum. If you pick BODLEVEL = 1 instead (2.7 volt typical) on the ATmega32, then you are allowing the chip to operate in that unpredictable twilight zone of Vcc without any BOD protection.
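The comparison in that paragraph can be written out as a small sketch. It uses only the figures quoted in the post (4.5 V rated minimum, 4.0 V and 2.7 V typical trip levels); the function name is made up for illustration, and real designs should use the min/max trip values from the datasheet rather than the typical ones.

```python
def unguarded_span(vcc_min_rated, bod_trip):
    """Volts of supply range below the rated minimum where the BOD has not yet tripped."""
    return max(0.0, vcc_min_rated - bod_trip)

# Figures as quoted in the post above (typical trip levels, ATmega32).
ATMEGA32_VCC_MIN = 4.5
BOD_TRIP_TYP = {0: 4.0, 1: 2.7}   # BODLEVEL setting -> typical trip voltage

for bodlevel, trip in BOD_TRIP_TYP.items():
    span = unguarded_span(ATMEGA32_VCC_MIN, trip)
    print(f"BODLEVEL = {bodlevel}: trips at {trip} V, {span:.1f} V below the rated minimum")
```

For BODLEVEL = 0 the remaining gap is the one the post says is covered by testing down to Vbot. For BODLEVEL = 1 the much wider gap is the unprotected twilight zone. If the part is actually an ATmega32L on a 3.3 V rail, the rated minimum in the sketch has to be replaced with that part's figure and the comparison redone.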

If your power cycling test had reproduced the failure, then all you would have done was verify that failure mode. However, since it did not, you learned nothing. The very nature of random twilight-zone failures means that the only useful information from a negative test result is that you failed to reproduce the failure, not that the failure does not exist in that testing regime. You misinterpreted your test results.
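One way to put numbers on that argument is sketched below. It assumes each power cycle is an independent trial with a fixed failure probability, which is a simplification the thread does not establish, and the example figures (a 1-in-1000 failure rate, 500 bench cycles) are purely hypothetical.

```python
def prob_no_failures(p_fail_per_cycle, n_cycles):
    """Chance of seeing zero failures in n independent cycles."""
    return (1.0 - p_fail_per_cycle) ** n_cycles

def upper_bound_on_p(n_cycles, confidence=0.95):
    """Largest per-cycle failure probability still consistent, at the given
    confidence, with n failure-free cycles."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_cycles)

# Hypothetical numbers, for illustration only.
P_FAIL = 0.001
N_CYCLES = 500

clean = prob_no_failures(P_FAIL, N_CYCLES)
bound = upper_bound_on_p(N_CYCLES)
print(f"Chance of a clean bench run despite a real fault: {clean:.0%}")
print(f"A clean run of {N_CYCLES} cycles only shows p < {bound:.2%} (95% confidence)")
```

Under these assumptions a failure-free bench test bounds the failure rate; it does not show the rate is zero.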

Microsuffer wrote:
A suggestion I made was to use a supervisory circuit to hold RESET low when VDD was below spec. This is being tested and tentative feedback is that this clears the problem.

Isn't this exactly what a correctly set up BOD is supposed to do, except with the internal reset (which is also a chip-wide reset)?


I absolutely agree with Lee about fixing the real problem.

However, if you want to treat the symptom that you believe is causing the problem for testing purposes, then you would need some kind of reset pin voltage clamping/spike suppression. It is possible the loading from an external supervisory circuit might be moderating over voltages on the reset pin, which means the external BOD function itself isn't having any useful effect in fixing the problem.

Another thought. It is possible that a very slow reacting supervisory circuit might be holding the reset pin low through the time period that a high voltage spike used to appear on the reset pin. If it is a timing issue with the external circuit, then I wouldn't feel good about any apparent fix based on this.


The customer's test is one for the Journal of Irreproducible Results. No instrumentation, just a tech cycling the AC switch as he feels like it, and we have to live with it. And we haven't gotten any good data on what transient voltages are appearing on the customer's VDD line.

Snubbing reset when power is coming up seems, from tentative results, to fix the problem for now, either through suppressing EMI at the pin or by holding reset low while VDD is dancing. What concerns me is the latter. If that is the case, the system, as pointed out earlier, is likely to become unreliable much later because our chip and others on that supply line are being abused. I cannot fix that, only note that tying RESET to VDD did NOT fix the problem and suggest the customer still has a systemic problem. (The problem with being a canary is that someone actually has to pay attention when you aren't singing - and figure out if that is significant.)

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


Quote:
A high enough voltage there pops the uC into high voltage programming mode. The first thing that happens when you enter that mode in that fashion? Erasing the Flash.
Well, not exactly (maybe). Assuming you enter hv programming, to get a chip erase you would still have to correctly set XA0, XA1, BS1, DATA, then pulse XTAL1, then pulse /WR. I guess it could be that you found a 'shortcut' to this whole process.

You could find out whether a chip erase is actually occurring by programming the lock bits (or just pick one bit). Then, when 'it' happens, read the AVR lock bits: if they are still programmed (0), there was no chip erase command; if all lock bits are back to unprogrammed (1), there was a chip erase (possibly).
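The lock-bit check above boils down to comparing the lock byte read before and after the event. Here is a minimal sketch of that comparison. It assumes the AVR convention that a programmed lock bit reads 0 and an unprogrammed one reads 1, and a six-bit lock field in the low bits of the byte; the mask, function name and example values are illustrative, and how the byte is read back (programmer, tool) is outside the sketch.

```python
LOCK_BITS_MASK = 0x3F   # assumed: six lock bits in the low bits of the lock byte

def classify_lock_byte(lock_before, lock_after):
    """Interpret lock bytes read before and after a suspected erase event."""
    before = lock_before & LOCK_BITS_MASK
    after = lock_after & LOCK_BITS_MASK
    if before == LOCK_BITS_MASK:
        return "inconclusive: no lock bit was programmed beforehand"
    if after == LOCK_BITS_MASK:
        return "chip erase likely: all lock bits are back to unprogrammed (1)"
    if after == before:
        return "no chip erase: lock bits unchanged"
    return "unexpected: lock bits changed but are not fully erased"

# Example: the two lowest lock bits programmed (read as 0) before the event.
print(classify_lock_byte(0xFC, 0xFF))
print(classify_lock_byte(0xFC, 0xFC))
```

The "likely" is deliberate: as the post says, all-ones lock bits point to a chip erase without proving how it was triggered.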

Quote:
No SPM during the customer's test
Does that mean there is no SPM instruction in the bootloader section, or does it mean it's not used? (Although even if it was present, it would still be quite a feat to get all of flash erased.)


First, the erase problem is a "known issue" with Atmel. It seems that there is an unintended path into parallel programming - it doesn't always follow the rules. The good news here is that stumbling into that problem seems to be difficult. This is only the second time in how many thousand applications?

Next, there were lock bits set to protect the firmware from snooping or downloading. As noted before, these were cleared much as happens when you use the Studio to erase programming.

No boot loader code. The program doesn't update itself, but does modify configuration information in specific blocks of Flash set aside for that. That is only done using a GUI as part of developing an optimum configuration. If you cycle power on the chip in the middle of reprogramming the configuration data you could in theory get corruption, if not a full memory erase. BOD wouldn't help in that case. It is a theory.

More (hopefully accurate) information has surfaced. It seems that the customer's test was unintentional. A software engineer was debugging his code and using a power strip to cycle power on his test system. He seems to have a magic touch. Customer's hardware engineers have been unsuccessful in duplicating his effect.

One might suppose that the programmer was experimenting with our configuration and got impatient and cycled AC in the middle of this. I don't think that is the case as the changes he might contemplate would require instrumentation and a hardware engineer to support. Supposedly neither were involved.

Another theory is that if the AC power is cycled just right, residual charge left on some of the regulator circuits is prebiasing one or more such that the 3.3V line jumps briefly to 12V. Pre-bias of some regulators has interesting effects. No proof this is happening and likely the H engineers have been investigating this.

Bottom line is that I'm still working with blind men futzing around an elephant and reporting by mail. Engineering is fun.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.


Are the failures only in the customer's testing setup or are they being reported from the field?

What else is switched when the tester flips the AC switch? Maybe that is generating EMI.

Have they tried replacing the switch itself? A badly corroded switch (nearing the end of its life) might not be switching cleanly. If it does the trick, tell them not to throw it away. Maybe you could reproduce the error at your works where you have access to a digital storage scope.

If you think education is expensive, try ignorance.


A wrap up.

No field failures as this pair of products is going into production.

No other system anomalies have been reported and this was only found in the software lab, not in the customer's engineering lab, which has attempted to instrument and reproduce the problem.

A reset circuit that holds the uC's reset pin at ground until the power supply that feeds it is stable has been added to both affected systems. No further failures have been observed, even in the software lab where the code monkeys have really tried to kill their hardware. (The original programmer cum part breaker has no idea how close he has come to finding a new career playing whack-a-mole on the production line.)
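The behaviour described, holding reset at ground until the supply is stable, can be modelled in a few lines. This is only a sketch of the idea: the thread does not name the supervisor part, so the 2.9 V threshold, the hold time of three samples and the VDD trace are all hypothetical.

```python
def supervisor_reset(vdd_samples, v_threshold, hold_samples):
    """Return one boolean per VDD sample: True while RESET is held low.

    RESET is released only after VDD has stayed at or above v_threshold
    for hold_samples consecutive samples; any dip restarts the count.
    """
    held = []
    good_run = 0
    for v in vdd_samples:
        good_run = good_run + 1 if v >= v_threshold else 0
        held.append(good_run < hold_samples)
    return held

# Hypothetical power-up trace with a glitch in the middle (volts per sample).
VDD_TRACE = [0.0, 1.5, 3.3, 3.3, 1.0, 3.3, 3.3, 3.3, 3.3]

for v, low in zip(VDD_TRACE, supervisor_reset(VDD_TRACE, 2.9, 3)):
    print(f"VDD = {v:.1f} V -> RESET {'held low' if low else 'released'}")
```

In this model the part stays in reset through the glitch and is released only on the third consecutive good sample after it, so the core never runs while VDD is still bouncing.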

And no, neither they nor we feel that the root cause has been proven, only implied. As to ATmegas, I think that prudence dictates not leaving the reset pin un-snubbed, data sheets notwithstanding.

Much thanks to all who have attempted to help.

"It's easier to ask forgiveness than it is to get permission" - Admiral "Amazing" Grace Hopper.