EEPROM data corruption

I have a design that saves data to EEPROM. Very occasionally, I have noticed that the data has been corrupted. It has taken me ages to find the cause, but I believe it is due to the ramp up/down speed of the power supply to the chip. Even if all program code that refers to the EE is stripped out (to avoid inadvertent read/write access to EE), the EE can still be messed up.
There is a 500A capable (!!!) thyristor driven crowbar trip circuit across the 5V rail to protect the safety critical design. The power can go from 5V to 0 in about 10us. Fast. Are there any documented failure modes of the EE due to power cycling? Is there an easy workaround?
We've also noticed that if there are huge (100+Amp) switching currents near the chip (but not connected anywhere in the same circuit), the EE can also be corrupted, although the power supply 'looks' clean. This happens within about a 1m radius. Blasting the circuit with 1000W ERP (RF) through a log-periodic at various frequencies on a bench has no effect.
I remember the good old days of 'tricking' petrol station pumps with CB radios... I can't believe that the AVR is still susceptible to this. The flash program memory, which is of similar silicon design/construction on the die, is not affected at all. Even if there is no program code in the chip at all to read or write the EE, the data still gets corrupted (reading back through AVRISP). SRAM is not affected. Data registers are not affected. (Brown-out detection is active. I suspect that it's not fast enough, as it takes 2 clks.)
Ideas anyone?
-f

[cliff: NB this has been moved by the OP (rightly) from GCC forum - but there was already one reply there:

https://www.avrfreaks.net/index.p... ]

I can confirm the unreliability of the AVR EEPROM. It caused A LOT of grief for me a few years ago. I was storing board configurations in the EEPROM. Our warranty returns skyrocketed and that's when I discovered the configuration was getting lost. I store the parameters in flash now, I don't even use the EEPROM anymore. I just can't trust it.

I like cats, too. Let's exchange recipes.

Quote:

I can confirm the unreliability of the AVR EEPROM. It caused A LOT of grief for me a few years ago. I was storing board configurations in the EEPROM. Our warranty returns skyrocketed and that's when I discovered the configuration was getting lost. I store the parameters in flash now, I don't even use the EEPROM anymore. I just can't trust it.

LOL. I guess then that our 100+ production designs (mostly in industrial environments) that use EEPROM for storing all configurable parameters, and error logs, and run totals/duty cycles don't really work?!? Geez, I'd better tell the users of control boards that are shipping several hundred per month. [I've got one app with 600+ 16-bit Modbus "registers" on a Mega64 that is running twin Modbus slave stacks, and either Modbus master can adjust the parameters at any time.]

EEPROM corruption can occur when you run the AVR below its rated supply voltage. The symptoms are often that location 0 gets clobbered as the address bus drops first, but if you go through enough slowly-dying power-off cycles you might find other locations trashed as well, typically the one where EEAR is sitting at.

Now, enable the brownout detector at a legal level. Once you stop trying to run the AVR below the speced supply voltage level I'll challenge you to cause an EEPROM "corruption".

IMO the byte-addressable EEPROM parameter storage is one of the AVR's strongest features, having many advantages over primary/secondary flash pages and similar schemes.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Quote:

Is there an easy work around?

To OP: I'll repeat the challenge: Set the brown-out detector to an appropriate level, and you will not be able to reproduce "corruption".

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Sorry to ask this but if the OP's ID, "ex_atmel_FAE", really does mean he was an FAE for Atmel shouldn't HE be the one telling US about this problem and the BOD solution? :lol:

Re-reading the original post

Quote:

...huge (100+Amp) switching currents near the chip (but not connected anywhere in the same circuit), the EE can also be corrupted, although the power supply 'looks' clean. ...

When you hit the AVR with enough noise you can certainly drive it nutz. Even with our "cheap" TDS 'scopes (60MHz, 1GS), though, we can trap spikes of many dozens of volts on various signals--Gnd, Vcc, commo signals. You get rid of the spikes (which violate absolute chip maximums) and the nutzo behaviour goes away. Again, with brownout detector enabled all but the narrowest of spikes will engage it when there is ground bounce. (Ground bounce is either the most common symptom, or it just is the easiest to trap. Dunno.)

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Quote:
LOL. I guess then that our 100+ production designs (mostly in industrial environments) that use EEPROM for storing all configurable parameters, and error logs, and run totals/duty cycles don't really work?!? Geez, I'd better tell the users of control boards that are shipping several hundred per month. [I've got one app with 600+ 16-bit Modbus "registers" on a Mega64 that is running twin Modbus slave stacks, and either Modbus master can adjust the parameters at any time.]

I didn't realize this would be taken as a personal attack on EEPROM evangelists!

Quote:

Now, enable the brownout detector at a legal level. Once you stop trying to run the AVR below the speced supply voltage level I'll challenge you to cause an EEPROM "corruption".

My brownout detect was set at 4.5V on a 5V board. All of the EEPROM data was corrupted. Since I am not the only person who has observed this problem, I think it might be real no matter what anyone says. Could it be a board layout problem? Maybe, but my problem is solved now.

What happens if you hang a 24LC02 or similar off the AVR and hit the circuit with EMI? Does the 24LC02 get corrupted? I bet not.

I like cats, too. Let's exchange recipes.

Quote:

I didn't realize this would be taken as a personal attack on EEPROM evangelists!


Personal? No, but I'll repeat the LOL. If AVR EEPROM was indeed as direly incompetent as your post indicates, my organization would no longer be in business, would it? After all, I use it for all configurable parameters in every app, and a typical main loop would read several of these parameters, and nearly all are changeable via onboard menu or outside [PC setup/monitoring] programs.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Yes it is true! :) But hey guess what- the guys in the factory don't tell us everything- only what is good enough for field apps guys! Strangely enough, you learn a lot more about a product from the customers who've put devices through environments that the semi' design guys often never encounter (as all good designers know...). In a lab is very different to the real analog noisy world. :)
-f

clawson wrote:
Sorry to ask this but if the OP's ID, "ex_atmel_FAE", really does mean he was an FAE for Atmel shouldn't HE be the one telling US about this problem and the BOD solution? :lol:

It was meant as "tongue in cheek" but it does seem a bit sad that Atmel don't train their FAE's with the "known issues". Or is each one left to reinvent the wheel when it comes to problems/solutions? While I guess that'd be a good learning experience to more fully understand the problem/solution it does seem a bit of an inefficient way to go about things. Maybe Atmel need something like M$' "Knowledge Base" both for FAEs and end users? (perhaps 'Freaks is as close to that they've got so far? ;-) )

Just had a thought: besides all the usual spikes and stuff (which can't be identified), I wonder if perhaps the internal DC charge pump circuit is activating, causing the power rails to the EE matrix to spike. That could cause the corruption. It's always the same bit in any of the matrix. Yes, I know there is built-in bit redundancy at the factory, but on average it looks like a data line is hovering...
Well-grounded board, no ground bounce, no long tracks, brown-out is on (4.5V), and the power ramp is generally monotonic. It still does it when a program is not even in the device (erase device leaving EE alone).

Quote:

you learn a lot more about a product from the customers who've put devices through environments that the semi' design guys often never encounter (as all good designers know...). In a lab is very different to the real analog noisy world.

Then wouldn't my experience of building production apps since 2000 using AVRs, with models from AT90S through many Mega and Tiny models, in well over 100 different boards (many systems have more than one AVR; some "modular" systems have 5-20), and many/most in industrial and other high noise environments, and all using EEPROM for parameter storage that is expected to keep values "forever" and it does, give lie to any blanket "AVR EEPROM is crap" proclamations?

I know that js has similar designs as mine and many are production. We'll have to ask him how he uses the EEPROM. I think Cliff has some high-volume products; do they use the EEPROM, Cliff? I don't really have a roster of others that might have several/many production designs.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Nope, no EEPROM use in ours as we're just using AVRs as front panel controllers and the "brains" of the design (including its non-volatility) lives elsewhere.

Our boards are subjected to very frequent power cycling. Some are also sometimes used in the vicinity of arc-welding equipment. God only knows what was causing the corruption. I was never given time or the equipment to trace down the cause. I would like to know if this would happen with an external EEPROM IC, or if it is a hardware bug of the AVR.

I like cats, too. Let's exchange recipes.

Yo Lee... you've said you run at 3.86 or 7.37 a lot. We have heard the eeprom is flaky at hi speeds. Time for some experiments, folks? I sent out a bunch of 128s a couple of years ago that had a user-enterable eth and ip addr saved in eeprom, and those came back with some clobbered locs. I think this particular prob got solved with some atomic cli-sei additions, but I recall it caused me some discomfort at the time.

Imagecraft compiler user

My 'few hundred k' experience is that the EEPROM can be used reliably in all parts where EEAR isn't reset to 0. This means everything publicly available today. Also, this does not implicitly mean that any part that resets to 0 will corrupt other cells than 0, only that I don't have the volume to draw a conclusion.

There are two popular ways to wreak destruction other than trying to execute when VCC is out of spec (which enabling BOD fixes). One is insufficient crystal oscillator stabilization delay, either because the fuses are set wrong or because the crystal is incorrectly loaded or of poor quality.

Another is eeprom read after a reset without waiting for the previous write to complete. Yes Alice, there IS a time before reset, and the eeprom wisely does not care about the reset pin. It completes the write if possible. Which means that blindly mirroring eeprom into sram after reset will result in grief. Edit: This is most likely what's happening when the AVR is reset by ESD or other transients.

/Kasper

Quote:

We have heard the eeprom is flaky at hi speeds.

As I recall, newer AVR generations (starting with Mega? can't remember) use a fixed internal clock for EEPROM operations so system clock speed shouldn't matter. But I don't remember when that change was made.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

The guys in Atmel Norway (Trondheim) set up this web site as a means to reduce the work load for the FAEs (as I believe). Of course all the known issues are passed to FAEs, but most of the time, an FAE is the first source of feedback to the design guys at the factory if there is a problem. Curiously, it's often an FAE that identifies the failure pattern amongst a number of customers that eventually leads to the 'issue' being turned into an 'errata'. That's why I thought it'd be a good idea to field out the question about EEPROM corruption to all you guys out there, to see if I can spot a pattern!

Sadly I don't have internal contacts within the Atmel factory any longer. NDA's and corporate security etc... I still think that AVRs are a fabulous design though, and this is after learning the hard way on the 1st gen of 90s1200.

I always perceive an EE cell as nothing more than a FET with a really well insulated, low leakage gate capacitor on. Flash is fundamentally just an array of those cells.
-f

clawson wrote:
It was meant as "tongue in cheek" but it does seem a bit sad that Atmel don't train their FAE's with the "known issues". Or is each one left to reinvent the wheel when it comes to problems/solutions? While I guess that'd be a good learning experience to more fully understand the problem/solution it does seem a bit of an inefficient way to go about things. Maybe Atmel need something like M$' "Knowledge Base" both for FAEs and end users? (perhaps 'Freaks is as close to that they've got so far? ;-) )

Quote:

I always perceive an EE cell as nothing more than a FET with a really well insulated, low leakage gate capacitor on.

I always perceive an EE as nothing more than an FAE with a really well insulated, low leakage fur cap on.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Anybody that uses NutOs may have seen this (I hope not) but it was driving us crazy. NutOs stores the MacAddress in a structure in low eeprom (somewhere below address 100h, don't have the code in front of me right now). There is a function NutNetLoadConfig() that reads this structure from eeprom and performs a sanity check (just a string compare on a device name string supplied). If the sanity check fails (due to a read error on the eeprom) then default values are used and the eeprom is re-written with the default values.

The result was that once in a while (maybe 1 out of 100 power up cycles) our product 'lost' its MacAddress when the eeprom was overwritten. I eventually proved this was a 'soft' read error by removing the code from NutOs that overwrote the eeprom with the default values, we just ran with them out of ram. On the next power up cycle the correct MacAddress was back, since it was never really corrupted in the eeprom in the first place. The error was a soft read error.

We have never been able to figure out why the read errors happen, but the eeprom is being read shortly after the "C" initialization is finished and NutOs starts the main thread. In that main thread the ethernet device is initialized (that's when the MacAddress is read from eeprom). The device is an ATmega128 at 16 MHz, run at 5 V.
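A rough sketch of the workaround described above (use the defaults from RAM when the sanity check fails, but never write them back) might look like the following. This is not the NutOs code: `net_config_t`, `net_config_load()`, `NET_CFG_ADDR` and the default values are hypothetical names made up for illustration, and the EEPROM is simulated with an array (plus a fault-injection flag) so the logic can be run on a PC. On a real AVR the two `ee_*` helpers would presumably wrap avr-libc's `eeprom_read_block()` and `eeprom_update_block()`.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simulated EEPROM (assumption: stands in for the real part). */
#define EE_SIZE 256u
static uint8_t ee_sim[EE_SIZE];
static bool ee_soft_fault;      /* test hook: reads return garbage when set */

static void ee_read(uint16_t addr, void *dst, size_t n)
{
    memcpy(dst, &ee_sim[addr], n);
    if (ee_soft_fault)
        memset(dst, 0xFF, n);   /* emulate a "soft" read error */
}

static void ee_write(uint16_t addr, const void *src, size_t n)
{
    memcpy(&ee_sim[addr], src, n);
}

/* Hypothetical configuration record: device name plus MAC address. */
typedef struct {
    char    name[8];
    uint8_t mac[6];
} net_config_t;

#define NET_CFG_ADDR 0x40u      /* hypothetical location below 0x100 */

static const net_config_t net_defaults = {
    "eth0", { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
};

typedef enum { CFG_FROM_EEPROM, CFG_DEFAULTS_IN_RAM } cfg_source_t;

void net_config_save(const net_config_t *cfg)
{
    ee_write(NET_CFG_ADDR, cfg, sizeof *cfg);
}

/* Read the record and sanity-check the name. On failure, fall back to
 * the defaults in RAM only; the EEPROM contents are left untouched so a
 * transient read error cannot destroy a good stored MAC address. */
cfg_source_t net_config_load(net_config_t *cfg, const char *expected_name)
{
    ee_read(NET_CFG_ADDR, cfg, sizeof *cfg);
    if (strncmp(cfg->name, expected_name, sizeof cfg->name) == 0)
        return CFG_FROM_EEPROM;

    *cfg = net_defaults;
    return CFG_DEFAULTS_IN_RAM;
}
```

The caller can use the return value to flag the event (LED, log entry, serial message) instead of silently running on defaults.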

I always have a delay, as long as the app can stand, right after basic AVR init setting port directions and the like. The usual value is 100ms. This simple precaution saves a lot of racing at power-up; for example the AVR comes out of reset but an LCD display is not yet really "alive". Or a wimpy power supply ramps up and then a power-sucker kicks in and the supply V drops for a while before climbing again.

It is hard to list all the weird things that can happen; I'm just saying that a decent delay, or a cursory initial loop looking for critical conditions, sure saves a lot of head-scratching and occasional false "trips" at power-on. [How many threads have you seen: "My AVR won't start after power-on but then I give it another reset and it runs fine."?]

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

The call to NutNetLoadConfig() is near the beginning of the main module, but not the first thing. The "C" library initialization code and that inside NutOs run first, and there is A LOT of init code in NutOs itself. For one thing, there is a big loop that zeros out the entire heap space (over 32k bytes). I haven't timed it, but there could be almost 100ms of execution in the existing init code right now. I guess adding another delay of 100ms wouldn't hurt, but there are other issues here.

First of all, the problem only seems to happen in one of our products, and all use the same NutOs network init code. The product that does have the problem actually has TWO megaAVR's joined with a common clock oscillator and a parallel port interface so the two can send messages to each other. The clocks of the two cpu's are 180 degrees out of phase, which seemed to fix a problem with communication over that parallel interconnect. One processor releases the other from reset somewhere in its own main loop initialization. The master processor is the one with the eeprom problem. Clearly our case is more complicated than I first explained. However as I said we don't really have a corruption problem, rather a soft read problem. Only thing is a retry loop doesn't work. Once the eeprom is read bad, it STAYS bad, until the cpu is reset and restarted!

Would the ex_atmel_FAE be now working for Microchips perhaps?

I seem to remember the original post almost word for word from a while ago, I had to look twice at the posting date. :roll:

John Samperi

Ampertronics Pty. Ltd.

https://www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly

How true :) LOL!
-f

theusch wrote:
Quote:
I always perceive an EE as nothing more than an FAE with a really well insulated, low leakage fur cap on.

Although this is an old thread, it is still relevant.

Recently, we received complaints about our products failing in the field with a particular customer. Analysis of the failed units showed EEPROM corruption at address 0. As the thread suggests, this isn't that unusual with BOD disabled. However, all of our designs have BOD ENABLED. EEPROM address 0 is used as a board ID which is programmed at our manufacturing location, and is not programmable in the field.

I've used EEPROM address 0 for many years without any problems. My belief was that as long as BOD was enabled, this problem wouldn't occur. However, it appears that this is not fail-safe either. Somehow, this particular customer is creating a condition that corrupts EE address 0. It's probably a power up/power down issue (slow rate, teasing, rapid cycle, etc). The power supply has a linear 5V regulator and quite a bit of capacitance on the front end. BOD level is set at 4.3V.

I can't explain why corruption occurs with BOD enabled, but apparently there is still some condition where there is a hole in the protection. From now on, I will not use EEPROM address 0 for useful information. Also, I will always leave EEAR at address 0 after any EEPROM access. "Parking" the EEPROM address pointer at address 0 will leave address 0 as the sacrificial location.

If you have sensitive values in EEPROM consider keeping two copies with CRC protection.
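One way the two-copy idea could be sketched is shown below. Everything here is an assumption for illustration: the `config_t` layout, the two addresses, the choice of CRC-16/CCITT, and the simulated EEPROM array (so the logic runs on a PC). On an AVR the `ee_read()`/`ee_write()` helpers would presumably wrap avr-libc's `eeprom_read_block()` and `eeprom_update_block()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated EEPROM (assumption: stands in for the real part). */
#define EE_SIZE 512u
static uint8_t ee_sim[EE_SIZE];

static void ee_read(uint16_t addr, void *dst, size_t n)        { memcpy(dst, &ee_sim[addr], n); }
static void ee_write(uint16_t addr, const void *src, size_t n) { memcpy(&ee_sim[addr], src, n); }

/* Hypothetical parameter record (byte fields only, so no padding). */
typedef struct {
    uint8_t board_id;
    uint8_t mac[6];
} config_t;

#define COPY_A_ADDR 0x010u   /* hypothetical; deliberately not address 0 */
#define COPY_B_ADDR 0x100u   /* hypothetical; far away from copy A */

/* Bitwise CRC-16/CCITT (poly 0x1021, init 0xFFFF). */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    while (n--) {
        crc ^= (uint16_t)((uint16_t)*p++ << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static void store_copy(uint16_t addr, const config_t *cfg)
{
    uint8_t buf[sizeof(config_t) + 2];
    memcpy(buf, cfg, sizeof(config_t));
    uint16_t crc = crc16(buf, sizeof(config_t));
    buf[sizeof(config_t)]     = (uint8_t)(crc >> 8);
    buf[sizeof(config_t) + 1] = (uint8_t)crc;
    ee_write(addr, buf, sizeof buf);
}

static int load_copy(uint16_t addr, config_t *cfg)
{
    uint8_t buf[sizeof(config_t) + 2];
    ee_read(addr, buf, sizeof buf);
    uint16_t stored = (uint16_t)((buf[sizeof(config_t)] << 8) | buf[sizeof(config_t) + 1]);
    if (crc16(buf, sizeof(config_t)) != stored)
        return 0;                       /* CRC mismatch: copy is bad */
    memcpy(cfg, buf, sizeof(config_t));
    return 1;
}

void config_save(const config_t *cfg)
{
    store_copy(COPY_A_ADDR, cfg);
    store_copy(COPY_B_ADDR, cfg);
}

/* Returns 0 if both copies were good, 1 if one copy was bad and has been
 * repaired from the other, -1 if neither copy passed its CRC. When both
 * are valid, copy A is preferred. */
int config_load(config_t *cfg)
{
    config_t a, b;
    int a_ok = load_copy(COPY_A_ADDR, &a);
    int b_ok = load_copy(COPY_B_ADDR, &b);

    if (a_ok) {
        *cfg = a;
        if (!b_ok) { store_copy(COPY_B_ADDR, &a); return 1; }
        return 0;
    }
    if (b_ok) {
        *cfg = b;
        store_copy(COPY_A_ADDR, &b);
        return 1;
    }
    return -1;
}
```

A return value of -1 is the point where the application has to decide what to do: refuse to run, or fall back to defaults held in RAM.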

Quote:

From now on, I will not use EEPROM address 0 for useful information.

Quote:

"Parking" the EEPROM address pointer at address 0 will leave address 0 as the sacrificial location.

We had a similar situation--until we finally found out that the keyfob programmer being used had a bug in fuse setting, and the BOD fuses were not actually being set properly. So based on experience, get a couple of "failed" units and actually check the fuses. If locked, you have to do it indirectly--lower the supply V until the AVR goes into reset.

Now, brown-out is the most common cause for EEPROM corruption that I've seen over the years. Not the only one, though. Severe noise spikes can cause a number of AVR subsystems to have erratic behaviour, and spurious EEPROM writes are indeed one of them.

Parking the EEAR to a sacrificial location does indeed sometimes help--but I do EEPROM reads all the time for parameter settings! So it isn't a "cure". The sacrificial location can help sometimes, 'cause you see what is written there and (in my case) I could then divine from the content more of the conditions.

It could also be a rogue EEPROM pointer with value 0?

-- Reserve a few locations at address 0.
-- Put a signature in those locations: 0x55aa or 0x1234 or whatever
-- Periodically, monitor that signature and if it changes, "log" the event. Some kind of time stamp, the value, etc. and then replace it. Log to unused EEPROM area or serial or wherever.
-- Also make your parking spot. I suggest putting it way out in unused EEPROM space, and put a signature that is checked as well. I'm suggesting addr 0 and addr "PARK" to help in later analysis. E.g., if location 0 is still getting hammered it may well be rogue code. And the value and width may give you a hint.
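The list above could be sketched roughly as follows. The addresses, the `guard_*` names, the event record and the RAM log are all hypothetical, and the EEPROM is simulated with an array so the logic can be tried on a PC. On a real AVR the reads and writes would go through the EEPROM routines, and "parking" would mean setting `EEAR` to `PARK_ADDR` after each access.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulated EEPROM (assumption: stands in for the real part). */
#define EE_SIZE 512u
static uint8_t ee_sim[EE_SIZE];

#define SIG_ADDR  0x000u    /* signature in the reserved locations at 0 */
#define PARK_ADDR 0x1F0u    /* hypothetical parking spot in unused space */
#define SIG_VALUE 0x55AAu

/* One logged event: where, what was found, and a caller-supplied time stamp. */
typedef struct {
    uint16_t addr;
    uint16_t found;
    uint32_t when;
} guard_event_t;

#define GUARD_LOG_MAX 8u
static guard_event_t guard_log[GUARD_LOG_MAX];  /* RAM log for this sketch */
static size_t guard_log_count;

static uint16_t ee_read16(uint16_t a)
{
    return (uint16_t)(ee_sim[a] | (ee_sim[a + 1] << 8));
}

static void ee_write16(uint16_t a, uint16_t v)
{
    ee_sim[a]     = (uint8_t)v;
    ee_sim[a + 1] = (uint8_t)(v >> 8);
}

/* Write the signature to both watched locations. */
void guard_init(void)
{
    ee_write16(SIG_ADDR, SIG_VALUE);
    ee_write16(PARK_ADDR, SIG_VALUE);
}

/* Check one location; if it changed, log what was found and restore it. */
static int check_one(uint16_t addr, uint32_t now)
{
    uint16_t found = ee_read16(addr);
    if (found == SIG_VALUE)
        return 0;
    if (guard_log_count < GUARD_LOG_MAX)
        guard_log[guard_log_count++] = (guard_event_t){ addr, found, now };
    ee_write16(addr, SIG_VALUE);
    return 1;
}

/* Call periodically. Bit 0 set: address 0 was clobbered.
 * Bit 1 set: the parking spot was clobbered. */
unsigned guard_check(uint32_t now)
{
    unsigned hit = 0;
    if (check_one(SIG_ADDR, now))  hit |= 1u;
    if (check_one(PARK_ADDR, now)) hit |= 2u;
    return hit;
}
```

Which bit gets set is the diagnostic: hits at address 0 with the pointer parked elsewhere would point towards rogue code or a rogue pointer, while hits at the parking spot would point towards power-down or noise events.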

I feel for you. Been there, done that.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

I also have a similar problem. Only location 0 becomes corrupted, usually with value 0, sometimes with a random non-zero value. It happens on a few tens of percent of power cycles.
The BOD is set to 4.3 V. The ATmega168 is running on a 20 MHz crystal.

First, I found out that I had improper oscillator settings: Low power 8 - 16 MHz. Setting it to Full swing solved the problem.

Another workaround was to keep writing to EEAR (the value can be zero). Writing it immediately after EEPROM activity had no effect, but writing it after a 1 ms delay solved the issue. Interestingly, this also helped when executed only once rather than after every subsequent EEPROM access.