Detecting RAM errors, anyone got experience?


I was just writing some code to implement a MARCH C- test for RAM errors on an XMEGA. I see there are a number of different strategies for testing RAM and MARCH C- is supposed to give good coverage with reasonable execution time. Mine takes about 3 million cycles if doing one bit at a time, or about 400k cycles if doing one byte at a time.

 

Has anyone got experience with this kind of thing? Really I'm wondering about two things.

 

1. Is MARCH C- a good choice for the internal SRAM in AVRs and similar MCUs?

2. Is it worth testing one bit at a time or is there little to be gained?

 

FWIW Atmel's application note tests whole bytes at a time.
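For reference, here is a minimal sketch of a byte-wide MARCH C- in C. It is not the code from the linked repository or from Atmel's application note; the function names `march_c_minus_pass` and `march_c_minus_bytes` are made up for illustration. It assumes the region under test is an ordinary buffer. A real power-on test of the whole SRAM would have to keep all of its state in registers, because the stack lives in the RAM being overwritten. A byte-wide march with only 0x00/0xFF may miss coupling between bits in the same byte, so the sketch repeats the march with several data backgrounds, a common middle ground between one pass per byte and one pass per bit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One MARCH C- pass over buf[0..len) using background bg and its inverse.
 * M0: any order (w bg)
 * M1: ascending  (r bg,  w ~bg)
 * M2: ascending  (r ~bg, w bg)
 * M3: descending (r bg,  w ~bg)
 * M4: descending (r ~bg, w bg)
 * M5: any order (r bg)
 * Returns true if every read matched the expected value. */
static bool march_c_minus_pass(volatile uint8_t *buf, size_t len, uint8_t bg)
{
    const uint8_t inv = (uint8_t)~bg;
    size_t i;

    for (i = 0; i < len; i++)            /* M0 */
        buf[i] = bg;

    for (i = 0; i < len; i++) {          /* M1 */
        if (buf[i] != bg) return false;
        buf[i] = inv;
    }
    for (i = 0; i < len; i++) {          /* M2 */
        if (buf[i] != inv) return false;
        buf[i] = bg;
    }
    for (i = len; i-- > 0; ) {           /* M3 */
        if (buf[i] != bg) return false;
        buf[i] = inv;
    }
    for (i = len; i-- > 0; ) {           /* M4 */
        if (buf[i] != inv) return false;
        buf[i] = bg;
    }
    for (i = 0; i < len; i++)            /* M5 */
        if (buf[i] != bg) return false;

    return true;
}

/* Repeat the march with log2(8) + 1 = 4 data backgrounds so that every
 * pair of bits within a byte sees both equal and opposite values. */
bool march_c_minus_bytes(volatile uint8_t *buf, size_t len)
{
    static const uint8_t backgrounds[] = { 0x00, 0x55, 0x33, 0x0F };

    for (size_t b = 0; b < sizeof backgrounds; b++)
        if (!march_c_minus_pass(buf, len, backgrounds[b]))
            return false;
    return true;
}
```

Each pass costs 10 accesses per byte, so the four-background version is 40 accesses per byte. Calling `march_c_minus_bytes(buffer, sizeof buffer)` on a scratch array is enough to try it on a PC.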

 

Edit: My code if anyone is interested https://github.com/kuro68k/avr_r...

Last Edited: Thu. May 6, 2021 - 02:42 PM

Yes - I have used Michael Barr's: https://barrgroup.com/embedded-systems/how-to/memory-test-suite-c - not just code; it also explains the types of faults that are likely to occur, and why some simplistic tests fail to detect them.

 

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

It's been a long time since I ran memory tests after building a 4k ram board for my 6800 consisting of 32 2102 ram chips.  The test would verify all of the address and data pins were properly soldered and working.
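For that kind of board-level check (are the address pins all connected and independent?), a rough sketch of an address-line test is below. It is an illustration only: the name `address_line_test` and the 0xAA/0x55 patterns are assumptions, and it only touches offset 0 and the power-of-two offsets, so it says nothing about the health of individual cells.

```c
#include <stddef.h>
#include <stdint.h>

/* Checks that each address line can select a distinct location.
 * base: start of the memory region, len: region size in bytes.
 * Returns 0 on success, otherwise the power-of-two offset at which
 * a stuck or shorted address line was noticed. */
size_t address_line_test(volatile uint8_t *base, size_t len)
{
    const uint8_t pattern = 0xAA;
    const uint8_t antipattern = 0x55;
    size_t off, probe;

    /* Put the pattern at every power-of-two offset. */
    for (off = 1; off < len; off <<= 1)
        base[off] = pattern;

    /* A write to offset 0 must not disturb any of them. If it does,
     * the corresponding address line is probably stuck. */
    base[0] = antipattern;
    for (off = 1; off < len; off <<= 1)
        if (base[off] != pattern)
            return off;
    base[0] = pattern;

    /* A write to one power-of-two offset must not disturb offset 0 or
     * any other power-of-two offset. If it does, lines are shorted. */
    for (probe = 1; probe < len; probe <<= 1) {
        base[probe] = antipattern;
        if (base[0] != pattern)
            return probe;
        for (off = 1; off < len; off <<= 1)
            if (off != probe && base[off] != pattern)
                return probe;
        base[probe] = pattern;
    }
    return 0;
}
```

The test overwrites the locations it touches, so it belongs before the memory holds anything worth keeping.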

Without knowing what type of memory failures you're testing for, and the internal layout of the memory cells, I'm not sure we can say which test is best at detecting each failure mode.  So I would assume the app note is your best bet.

 

Jim

edit spelling

Are you testing pre-production, or while the app is running?

 

(Possum Lodge oath) Quando omni flunkus, moritati.

"I thought growing old would take longer"

 

Last Edited: Thu. May 6, 2021 - 02:11 PM

The last RAM test I did was for 80286/80386, in 80386 asm, using protected mode with IDT/GDTs and A20 line support to get beyond 1MB.

 

But this was for testing DRAM not SRAM as (a) DRAM was more likely to fail and (b) this was about 25-30 years ago when sometimes silicon was faulty.

 

I did various things like "walking 1s" but to a certain extent you almost need to know the physical layout of the cells and where "neighbour sensitivities" and stuff like that could lie.
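As a small illustration of the "walking 1s" idea, here is a sketch for a single byte-wide location. The name `walking_ones_test` is hypothetical. It mainly exercises the data path (stuck or shorted data bits); it knows nothing about physical cell layout, so it cannot target neighbour sensitivities.

```c
#include <stdint.h>

/* Walk a single 1 bit through all eight positions of one location.
 * Returns 0 if every pattern read back correctly, otherwise the
 * first pattern that failed. */
uint8_t walking_ones_test(volatile uint8_t *cell)
{
    /* bit goes 0x01, 0x02, ... 0x80, then wraps to 0 and ends the loop */
    for (uint8_t bit = 1; bit != 0; bit <<= 1) {
        *cell = bit;
        if (*cell != bit)
            return bit;
    }
    return 0;
}
```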

 

So is this for external DRAM on the Xmega EBI? Or are you really concerned about internal SRAM (and why)?


Code here if anyone is interested: https://github.com/kuro68k/avr_r...

 

This is for the internal SRAM of the MCU. I was basing it off this paper: https://github.com/kuro68k/avr_r...

 

As you can see from that PDF the MARCH C- algo is supposed to cover all types of fault, and the overhead isn't too bad.

 

As for why, mainly just to try and root out any bad parts. As well as natural failures we are worried about grey market stuff, especially with the current shortages.

 

This code tests before main() executes in .init1 so that it can test the entire RAM easily, otherwise whatever algorithm you use has to preserve some of the contents of RAM and/or not test all of it.


clawson wrote:
The last RAM test I did was for 80286/80386, in 80386 asm, using protected mode with IDT/GDTs and A20 line support to get beyond 1MB.
Similar (80486DX, IIRC 32MHz, real mode, Ada, late '90s)

Before that was RCA CDP1802 in assembly language (mid '80s, external SRAM)

clawson wrote:
... when sometimes silicon was faulty.
The context was solder joints (PCBA manufacturing, shake qualification testing, vetronics with the chassis mounted on a bulkhead that coupled somewhat with the vehicle's drive train).

 

P.S.

Memory tolerance to GCR varies by memory type; SRAM is best, flash eventually fails, MRAM has internal ECC, ECC for DRAM especially by AMD northbridges.

Some system requirements specify a continual checksum, CRC, or ECC on flash.

RAM intermittent defects are best handled by defensive programming (exceptions, assertions, etc)

By preventive maintenance, bit rot is dealt with (scrubbing, or iow, resilvering)
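As an illustration of the "continual CRC" idea, here is a sketch of a software CRC-16 (CCITT-FALSE parameters: polynomial 0x1021, initial value 0xFFFF) that can be fed a chunk at a time, so a background task could scan a little memory on each pass. The function name `crc16_ccitt_update` is an assumption, and this is plain software, not the CRCSCAN peripheral. On an AVR, reading flash would additionally need the program-memory access routines rather than a plain pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Feed len bytes into a running CRC-16 (poly 0x1021, MSB first).
 * Start with crc = 0xFFFF and pass the previous return value back in
 * for each subsequent chunk. */
uint16_t crc16_ccitt_update(uint16_t crc, const uint8_t *data, size_t len)
{
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int i = 0; i < 8; i++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

Scanning in chunks gives the same result as one call over the whole range; the final value is compared against a reference computed at build time.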

 


GCR - Galactic Cosmic Rays

RIP Opportunity | The Embedded Muse 368 by Jack Ganssle

MRAM Makes a Move into the Embedded Space | Electronic Design (two instances of ECC)

CRCSCAN - Cyclic Redundancy Check Memory Scan | Migration from the megaAVR® to AVR® Dx Microcontroller Families

 

"Dare to be naïve." - Buckminster Fuller


mojo-chan wrote:
... especially with the current shortages.
XMEGA Lead Time, Dec'20 | AVR Freaks

 

"Dare to be naïve." - Buckminster Fuller


This feels like déjà vu all over again. Testing AVR SRAM was, as I recall, a recurring discussion theme on these forums a couple decades ago.

When y'all run the tests on your production batches of 100 or 1000 or 10000 chips as mentioned, report back the results. The nature of the errors found will be interesting.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Quis custodiet ipsos custodes (who watches the watchmen)

In other words - the registers in the AVR are little more than SRAM so if the RAM test runs on the AVR itself and uses its registers why do you have more confidence in the integrity of the registers performing the test than the SRAM it is testing?

 

Probably the only true test would be something external that could read/write memory locations - so perhaps over the OCD interface?


theusch wrote:
This feels like déjà vu all over again.

Indeed.

 

Testing AVR SRAM was, as I recall, a recurring discussion theme on these forums a couple decades ago.

Not just AVRs, and not just this forum.

 

And, as Cliff says, if you don't trust the microcontroller's RAM - how can you trust it to run a test ... ?

 

In my case, mentioned above, the test was for external memory on prototype boards - as we were seeing some "issues"

 

and that was 20-odd years ago.

 

The nature of the errors found will be interesting

it turned out to be timing errors in the design.

 

and we did suffer from simplistic tests not picking it up.

 

EDIT

 

Even longer ago, a colleague was developing a memory board with  discrete memory chips.

He did a simple write-and-read-back test - which passed even when the memory chips weren't fitted!
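A toy model shows how that can happen. The sketch below assumes, as a simplification, that an empty socket stores nothing and that a read simply returns whatever value was last driven onto the data bus (held there by stray capacitance). All names (`floating_bus_t`, `naive_readback_test`, `two_pass_test`) are hypothetical. Real buses will not behave this cleanly, but the model is enough to show why write-then-immediately-read is a weak test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an unpopulated memory socket: no storage at all, only the
 * last value that was driven onto the data bus. */
typedef struct {
    uint8_t bus_latch;
} floating_bus_t;

static void bus_write(floating_bus_t *m, size_t addr, uint8_t value)
{
    (void)addr;              /* no chip fitted, the address selects nothing */
    m->bus_latch = value;
}

static uint8_t bus_read(const floating_bus_t *m, size_t addr)
{
    (void)addr;
    return m->bus_latch;     /* the bus still "remembers" the last write */
}

/* Naive test: write a value and read the same location straight back. */
bool naive_readback_test(floating_bus_t *m, size_t len)
{
    for (size_t a = 0; a < len; a++) {
        bus_write(m, a, 0xA5);
        if (bus_read(m, a) != 0xA5)
            return false;
    }
    return true;
}

/* Better: fill every location with address-dependent data first,
 * then verify in a separate pass. */
bool two_pass_test(floating_bus_t *m, size_t len)
{
    for (size_t a = 0; a < len; a++)
        bus_write(m, a, (uint8_t)(a ^ (a >> 8)));
    for (size_t a = 0; a < len; a++)
        if (bus_read(m, a) != (uint8_t)(a ^ (a >> 8)))
            return false;
    return true;
}
```

Against this model the naive test passes with no memory present, while the two-pass test fails on the first read for any region of two or more locations.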

 

Last Edited: Fri. May 7, 2021 - 09:02 AM

clawson wrote:

Quis custodiet ipsos custodes (who watches the watchmen)

In other words - the registers in the AVR are little more than SRAM so if the RAM test runs on the AVR itself and uses its registers why do you have more confidence in the integrity of the registers performing the test than the SRAM it is testing?

 

Probably the only true test would be something external that could read/write memory locations - so perhaps over the OCD interface?

 

Well, yes, but I think if any of the registers used to test were faulty then the test would fail. The failure report would be misleading, but it would fail.


clawson wrote:
In other words - the registers in the AVR are little more than SRAM so if the RAM test runs on the AVR itself and uses its registers why do you have more confidence in the integrity of the registers performing the test than the SRAM it is testing?

I suppose that if clever people put their minds to it and considered all the critical what-ifs, then a useful/good test could be devised.  Like deciding which door has the tiger.  A multi-step process to verify one useful subset, and so on.  Now, what if the previously-verified whatever goes bad...?

 



awneil wrote:

Even longer ago, a colleague was developing a memory board with  discrete memory chips.

He did a simple write-and-read-back test - which passed even when the memory chips weren't fitted!

 

That's capacitance for you!

 

Neil

 

p.s. anyone remember external cache memory for early x86 systems which was frequently faked so that all cache reads just read from the main memory anyway?


theusch wrote:

clawson wrote:

In other words - the registers in the AVR are little more than SRAM so if the RAM test runs on the AVR itself and uses its registers why do you have more confidence in the integrity of the registers performing the test than the SRAM it is testing?

 

I suppose that if clever people put their minds to it and considered all the critical what-ifs, then a useful/good test could be devised.  Like deciding which door has the tiger.  A multi-step process to verify one useful subset, and so on.  Now, what if the previously-verified whatever goes bad...?

 

 

Someone already has!

 

Atmel did an app note on it, for compliance: http://ww1.microchip.com/downloa...

 

I'm not totally sold on their RAM test method though. There is also xmburner which seems to have decent coverage: http://ww1.microchip.com/downloa...

 

I thought about implementing xmburner, or at least the CPU core test parts because the RAM test doesn't seem to be as comprehensive as MARCH C-.


clawson wrote:
Probably the only true test would be something external that could read/write memory locations - so perhaps over the OCD interface?
PDI instructions do access all memory spaces.

AVR XMEGA memtypes | Embedded Debugger-Based Tools Protocols User's Guide (EDBG)

pyedbglib/avr8protocol.py at master · mraardvark/pyedbglib · GitHub

 

"Dare to be naïve." - Buckminster Fuller