Help needed with suspected silicon bug on ATMega1280


In one of our products based on the ATMega1280, the bootloader failed to work on a new batch. After extensive investigation we concluded that there may be a bug in the rev.B silicon of the ATMega1280 (detailed description below). Before submitting a report to Atmel, I'd like to ask those who have ATMega1280-based hardware at hand to help rule out the influence of our particular hardware.

Attached is a simple assembler program which should reveal the problem. The program sits in the bootloader area and repeatedly reprograms one sector in FLASH. The only hardware requirement is an LED to "display" the status: it blinks about 3-4 times per second while everything is OK and starts to blink much more rapidly when the error occurs (usually there are 1-20 "normal" blinks before the rapid blinking starts). Our hardware has a 14.7456MHz crystal, but I believe that's not important and the test would work with any other suitable crystal frequency (with the LED blinking faster or slower accordingly). I am ready to adapt the program to any particular hardware.

Details on the error:

- it appears that while the programming is in progress (i.e. HV/pump is on), a relative jump (in a loop) sometimes jumps to a different address than expected, one word before the target. In the given source, it's the rjmp Wait_spm, which at times jumps to the rjmp Fault located just before Wait_spm

- it appears that this happens only for jumps located in a certain part of the program memory - see the commented-out nop before rjmp Fault

- supply voltage does not appear to have impact

- the error occurs only on the rev.B silicon (the revision is the last letter in the second line on the bottom of the chip). All rev.A chips we have (hundreds) don't exhibit the problem; all rev.B chips we have (two different timestamps) do exhibit the problem. I don't know when the transition from rev.A to rev.B happened - the newest rev.A I have is timecode 07xx, the oldest rev.B I have is 11xx
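
For anyone who wants to check the arithmetic of the jump: below is a minimal host-side sketch in C (not AVR code) of how an rjmp target is computed from its opcode, following the instruction set's rule PC <- PC + k + 1 with a signed 12-bit k, all in word addresses. The function names and the word address 0xF007 are made up for illustration; only the 3-word shape of the Wait_spm loop (in, sbrc, rjmp) comes from the source.

```c
#include <stdint.h>

/* Sign-extend the 12-bit displacement field k of an AVR RJMP opcode
   (bit layout: 1100 kkkk kkkk kkkk). */
int16_t rjmp_displacement(uint16_t opcode)
{
    int16_t k = (int16_t)(opcode & 0x0FFF);
    if (k & 0x0800)
        k -= 0x1000;
    return k;
}

/* Word address that an RJMP located at word address pc should jump to:
   PC + k + 1. The address passed in is a hypothetical example value. */
uint32_t rjmp_target(uint32_t pc, uint16_t opcode)
{
    return (uint32_t)((int32_t)pc + rjmp_displacement(opcode) + 1);
}
```

With the rjmp sitting two words after the Wait_spm label, its opcode is 0xCFFD (k = -3), so rjmp_target(0xF007, 0xCFFD) gives 0xF005, i.e. Wait_spm. The fault described above corresponds to ending up one word below that value, where the rjmp Fault sits.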

Thanks to all who are willing to help.

Jan Waclawek

[EDIT] removed the erroneously added, unrelated an*.* files from the zipfile [/EDIT]


Last Edited: Fri. Nov 1, 2013 - 09:36 AM

- all (A)VCC,GND connected and every pair decoupled with 100nF
- Crystal in full swing mode
- no watchdog active
- stable 5V

Peter


Peter wrote:
- all (A)VCC,GND connected

Yes.

Peter wrote:
and every pair decoupled with 100nF

Not quite *every* pair is decoupled: if I recall correctly (I am not at work now), one of them is not, the one on the pins 1..25 side, but that one is connected directly to another pair through a short line under the chip. Nevertheless, I doubt this is the root of the problem. I can try to put some extra C as directly to the pins as possible tomorrow. I'll send you the layout through PM tomorrow, please kindly review it and comment.

Peter wrote:
Crystal in full swing mode
Yes. Please have a look at the fuses' setting (as comment in the source).

Peter wrote:
no watchdog active
No. Again, please review the given source and the fuses.

Peter wrote:
stable 5V
As stable as it gets. It's sourced from a bog standard 7805 with ample filtering before and after, which in turn is powered from a linear lab power source. Same result when the power to the 7805 is removed and the board is powered from an external source (STK500 - not that I'd consider that a perfect power source, but I had it connected because of programming, and IMO it's reasonable enough for the given task).

There's no other significant consumption on that board; there's no significant source of noise on that board (save a MAX232, maybe, I can remove it tomorrow, but that would REALLY surprise me if it would be the source of problem). The trace of VCC on the scope is as perfect as it gets, with no visible dip around the "programming events" (measured directly at the mcu's pins).

----

These questions are exactly the reason why I called for help from somebody who has different 'M1280-based hardware (and I still believe somebody will chime in and volunteer to experiment :-) ). While I believe our design is sound enough, it surely is not perfect, and I admit that it may be the source of the above problem.

----

Some "progress": while previously I tried a lower clock through setting the /8 fuse (i.e. 1.8432MHz), and that still exhibited the fault; I today tried to set the clock to the internal RC oscillator (8MHz, no /8 ), and that did *not* result in fault.

Jan


The good news is that I do indeed have a '1280 app.

The bad news is that there is no indicator LED as it is the main board for a big controller nest. :( And the unit on my bench has a date code of 0832--probably an A?

I'll look in stock and see if newer builds have a rev B. You think perhaps 11xx or later?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


I use a bunch of 1280s at work, but I don't use a bootloader. If you could dream up a test program that runs from program flash, I'll try it. C would be good.

Imagecraft compiler user


Lee,

As I said, I have no idea when the transition from rev.A to rev.B happened. 11xx would most probably be rev.B and 08xx sounds quite likely to be rev.A. I know of no way to find out the revision without looking at the underside of the chip :-(

The LED is needed only to indicate whether the problem/fault occurred. If you have any other way to observe a state - serial output, maybe? - I could try to modify the code to fit that. Maybe it could be captured also through an attached debugger, but I don't quite know how to go for that (there's one gathering dust on the shelf, and honestly I don't remember how to fire it up - the chip on our board does not have the JTAG pins exposed...)

Bob,

In my first post, there's a zipfile containing simple code to try. A single LED needs to be connected to a pin - in the .hex which is there, it's assumed the LED is on PH4. If you have an LED (or can attach one) on PH4, that's all that's needed, plus the fuse settings, which are described in a comment at the beginning of the .S source.

Otherwise it needs to be modified in source and reassembled. It's asm of the gcc variety, but if you tell me where you have an LED on your hardware, I'll provide the hex. Then you'll program it together with the fuses, and observe the way the LED blinks.

Thanks both for your help,

Jan


I dl'd the 2012 revB 1280 DS and there is some errata for the 2560 somewhat related to this problem... non read while write area of flash non functional.... cant test till Thurs at least... out of town tomorrow.



wek wrote:
Some "progress": while previously I tried a lower clock through setting the /8 fuse (i.e. 1.8432MHz), and that still exhibited the fault; I today tried to set the clock to the internal RC oscillator (8MHz, no /8 ), and that did *not* result in fault.
Hmmm... the internal RC times flash and eeprom writes. This may implicate the stability of the oscillator in that usage case, however I can't envision a mechanism that would corrupt the PC, and certainly not in such a specific manner.

Perhaps OSCCAL is not getting correctly loaded at reset when the system clock source is a crystal oscillator. You could check by comparing the value of OSCCAL with the calibration byte. Again, can't see how this could corrupt the PC...

Perhaps the flash/eeprom timing register is proximal to the PC on the silicon? Have you tested device stability with EEPROM writes?

None of this is helpful, I'm sure ;)

"c:\PROGRA~1\Atmel\AVRTools\Wavr-gcc-4.7.2-mingw32\bin\avr-gcc" an1.c -S -Os -DF_CPU=14745600UL -mmcu=atmega2561 -o an1.s
sed an1.s -e 's:\.byte.*hh8:.byte hlo8:g' > an12.s 
"c:\PROGRA~1\Atmel\AVRTools\Wavr-gcc-4.7.2-mingw32\bin\avr-gcc" an12.s -mmcu=atmega2561 -Wa,-adhlns=an1.lst -Wl,-Map=an1.map,--cref,--section-start,.himem=0x12345 -o an1.elf

-mmcu=atmega2561 ... ??

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 


Jan,

I do not have a 1280 unfortunately, so can not help in that way, but...
With another chip we have seen that after changing fuses during programming we needed to reset the controller (a real power-down, not just pressing the reset button) for it to work in normal mode.
Do you automatically also program the fuse bytes each time you program the chip? If yes, what happens if you do not do that?


We use ATmega2560 with Ethernet-Bootloader.
I don't know the revision.
The crystal was also 14.7456MHz.
We have no problems with it.

Peter


Quote:
Maybe it could be captured also through an attached debugger, but I don't quite know how to go for that

That would be the very first thing I would do and I bet that is going to be the first question of Atmel's support :twisted:
I would have used a program counter breakpoint with range and additionally a single data breakpoint on SPMCSR register, just to be sure nothing goes wrong.
You have mentioned that this could be RC related - the EEPROM and Flash timing is not controlled by a quartz so sweep the RC frequency through nominal up to max to see when/if the write fails.
Did you try with slower quartz? I do not mean the /8 but physically mount an <8MHz quartz.

No RSTDISBL, no fun!


I don't have a 1280 either but here is my input.
1:
are you sure that the bootloader is correctly programmed?
2:
do you program BOD, and reset timing? If it's a new process for the B, the boot of the chip (timing and power) could be different.
3:
how clean is the power at start? Again, a B could be more sensitive to noise at boot, or just faster to boot.


bobgardner wrote:
I dl'd the 2012 revB 1280 DS and there is some errata for the 2560 somewhat related to this problem... non read while write area of flash non functional....
I've seen that, but IMO it's not related. Not only does the description indicate a completely different problem; the ATMega2560 was also apparently the "test vehicle" for the whole 64x/128x/256x line, and said Rev.A were probably early engineering samples. The product where we encountered the problem actually had to use the ATMega2560 during development and in the first production batch, as the ATMega1280 was still not available at that time. Those were rev.C, timestamp 0602. As we encountered the problem on ATMega1280 rev.B, which is apparently a die-shrink (or other process-induced redesign, e.g. because of a fab change) of the ATMega1280 rev.A and did not appear before 2008, I believe the problem you cite is not relevant for this case.

Bob wrote:
cant test till Thurs at least... out of town tomorrow.
That's absolutely OK. Please let me know if you will need anything from my side. Thanks for your help.

Jan


Thanks all for your comments. (Sorry for the late answer - turns out that the bug hunt does not relieve me from the usual daily workload... :-| )

joeymorin wrote:
Perhaps OSCCAL is not getting correctly loaded at reset when the system clock source is a crystal oscillator.
I tried explicitly loading OSCCAL with the value of the calibration byte, and also tried lower and higher values (up to about 8.8MHz, as I am not interested in known pathologic states). No change, the problem is still present.

joeymorin wrote:
Again, can't see how this could corrupt the PC...
+1 . This is what I would like Atmel to explain to us; I just don't want to pester them until I know the bug is reproducible at other place/hardware.

joeymorin wrote:
Perhaps the flash/eeprom timing register is proximal to the PC on the silicon?
The fact that the problematic jump fails only if it is located within some 16 bytes of the beginning of the bootloader area, IMO may be indicative of some spatial correlation between the FLASH and PC-related circuitry. Again, it's up to Atmel to give an authoritative answer... although I doubt we would ever learn about the real root cause, should this ever be confirmed as a genuine bug.

joeymorin wrote:
Have you tested device stability with EEPROM writes?
No. Good idea, but I need some time to devise a test.

joeymorin wrote:

"c:\PROGRA~1\Atmel\AVRTools\Wavr-gcc-4.7.2-mingw32\bin\avr-gcc" an1.c -S -Os -DF_CPU=14745600UL -mmcu=atmega2561 -o an1.s
sed an1.s -e 's:\.byte.*hh8:.byte hlo8:g' > an12.s 
"c:\PROGRA~1\Atmel\AVRTools\Wavr-gcc-4.7.2-mingw32\bin\avr-gcc" an12.s -mmcu=atmega2561 -Wa,-adhlns=an1.lst -Wl,-Map=an1.map,--cref,--section-start,.himem=0x12345 -o an1.elf

-mmcu=atmega2561 ... ??

I don't understand, please explain.

meslomp wrote:
With another chip we have seen that after changing fuses during programming we needed to reset the controller (a real power-down, not just pressing the reset button) for it to work in normal mode.
While our production programming process burns all flash/eeprom/fuses/locks in one go, once we started to investigate the bug, I resorted to STK500/AVRStudio4's ISP facility, where in most of the hundreds of tests we performed I reprogrammed only the FLASH. And I also re-powered often. So I don't think I made an error of this kind. But thanks for reminding.

danni wrote:
We use ATmega2560 with Ethernet-Bootloader.
I don't know the revision.
The crystal was also 14.7456MHz.
We have no problems with it.
For the bug to show itself, there must be a jump (best in a loop, as it happens only here and there) after the SPM that starts the actual sector programming; this jump must be within the first 16 bytes or so of the bootloader area (maybe there are other such critical areas, but we did not find any and it would be a tiresome search, so I leave it to Atmel :-) ), and the erroneous jump target must be such that its effect is visible. For example, in the Arduino bootloader for the ATMega1280, the loop in question is immediately after the SPM, thus the erroneous jump would land on that SPM, which I believe is simply a NOP while the programming is in progress (that loop is not within the first 16 bytes or so of the bootloader either).

But you inspired me indeed :-) . As I mentioned, we have the same hardware with ATMega2560, although of rev.C (I believe ATMega1280 rev.B is equivalent in process to ATMega2560 rev.E, but I don't have a newer ATMega2560). I tried, and the bug is not there on that one. We also have a different product based on ATMega2561 (and have rev.E of them), and tried that, and the bug is not present on that one either.

Brutte wrote:
Quote:
Maybe it could be captured also through an attached debugger, but I don't quite know how to go for that

That would be the very first thing I would do and I bet that is going to be the first question of Atmel's support :twisted:
I doubt so. IMO, JTAG/OCD *is* intrusive, even if in a minor way. Thus, I'd expect otherwise: in case of suspected hardware bug, a JTAG-less test method to be more indicative (unless the bug is in the OCD-related circuitry itself, of course).

Brutte wrote:
Did you try with slower quartz? I do not mean the /8 but physically mount an <8MHz quartz.
Good point! So I removed the 14.7456MHz crystal, mounted a 6MHz one, shortened the loop-delay accordingly - and the result is the same, the bug is still present.

While the soldering iron was on, I also soldered a 100nF cap to a ground via close to the pin I said above to have no direct decoupling (on the 'scope, I've seen some 50mVpp noise on it, with no correlation to the periods of time when the FLASH programming happened) and carefully ran a short thin wire to it - no change either.

sparrow2 wrote:
are you sure that the bootloader is correctly programmed?
Yes. Several colleagues assisted, with two different hardware and two different programming software; we always verify after program.

sparrow2 wrote:
do you program BOD,
Yes, please review the fuses settings (in a comment at the beginning of the source I gave above)
sparrow2 wrote:
and reset timing,
The longest possible, again, please review the fuses settings.

sparrow2 wrote:
how clean is the power at start
A clean transition from 0 to 5V within a couple of ms, as observed on oscilloscope.

-------

Again, thanks all for help.... and more volunteers having an ATMega1280-based board at hand are welcome to try the simple test... ;-)

Jan


Perhaps someone else can devise a separate program to test the problem if we have a hypothesis.... something like "a jump located in the first 16 bytes of the bootloader fails unless there is a nop after an spm instruction in the write loop" or something like that. There are 3 flash areas... code, read while write, and bootloader... try moving the test program one page address higher past the start of the bl section with an org and a jmp?



wek wrote:
The fact that the problematic jump fails only if it is located within some 16 bytes of the beginning of the bootloader area, IMO may be indicative of some spatial correlation between the FLASH and PC-related circuitry.
Yet the behaviour doesn't manifest when running under the internal RC.

Three-way correlation between flash, RC, and PC?

wek wrote:
I don't understand, please explain.
Your subject refers to ATmega1280. In noboot.zip, the file noboot.S starts with:
Quote:
; assemble with:
; avr-gcc noboot.S -mmcu=atmega1280 ...
... yet the file an1.bat builds with:
-mmcu=atmega2561

What am I missing?

JJ

 


The "an1*" files don't seem to be related to the test program; just assemble the noboot.S as per the comments therein...


Wek,

just had a quick look in the noboot.s
I have a few things I do not fully understand; they might be completely OK to do, btw.

.equ temp1, 16

should those 'defines' for names not be:

.equ temp1, R16

also your code says:

  nop
  nop
;  nop  ;uncomment this nop, and it get stable


  rjmp Fault

Wait_spm:
  in   temp1, SPMCSR - 0x20
  sbrc temp1, SPMEN
  rjmp Wait_spm
  ret

I assume this has to do with your problem, but counting the number of bytes in the current code tells me that, even without that specific nop uncommented, the rjmp is already at byte 18, so above the 16-byte boundary talked about.
Or did the comment mean to remove all the nops to get the code within the first 16 bytes?

also you do:

.equ PAGESIZEB, SPM_PAGESIZE
.section .bootloader,"ax"

while the datasheet (it has an example of a bootloader ) does:

.equ PAGESIZEB = PAGESIZE*2 ;PAGESIZEB is page size in BYTES, not words
.org SMALLBOOTSTART

also I see you doing the SPMCSR address with the '-0x20' while the example code in the datasheet does not do that

also

  ldi  ZL, lo8(FLASH_ADDR)
  ldi  ZH, hi8(FLASH_ADDR)

should that not be FLASH_ADDR*2?

also you changed the Do_spm routine. In the datasheet, a check is first done whether the flash is ready (wait_spm) before the operation is started. You first do the operation and then wait for it to complete, but what if during the first operation the flash was not ready? As far as I see, you did not check then, but just started the operation.

also not sure but pagesize....
the datasheet says 128WORDS being 256BYTES. Then the example program states a couple of times

Quote:
;not required for PAGESIZEB<=256

thus this being true, so can the problem be there?

again I am not an expert certainly not in assembler programming and bootloaders(only played once with that)
but those might give some clues on what might be going wrong. certainly as indeed another poster said that in the batch file you refer to the mega2561...

regards


I am running a suitably-adjusted noboot on my 2008-vintage Arduino-MEGA (with m1280 datecode 0833) and it does not seem to be failing.
(Can I tell whether I have a Rev B, without JTAG?)


bobgardner wrote:
Perhaps someone else can devise a separate program to test the problem if we have a hypothesis....
That would be ideal of course, but I thought that starting with testing a simple provided example would be easier.
bobgardner wrote:
try moving the test program one page address higher past the start of the bl section with an org and a jmp?
That's exactly what we did with the production bootloader, as we needed to move on. I am not comfortable with that, though, until I am sure that this is a real fix and there is no other related problem lurking... Am I too paranoid?

joeymorin wrote:
Three-way correlation between flash, RC, and PC?
Rather, between flash, crystal oscillator and PC...?

westfw wrote:
The "an1*" files don't seem to be related to the test program;
Indeed - I zipped up more than needed from my "snippets" directory... :-( sorry...

meslomp wrote:
just had a quick look in the noboot.s
I have a few things I do not fully understand,
Oh yes - most of them related to the peculiarities of the GNU assembler (gas), contained with the suite of avr-gcc.

meslomp wrote:

.equ temp1, 16

Yes, this refers to r16, and this works in the given context OK.

meslomp wrote:

I assume this has to do with your problem, but counting the number of bytes in the current code tells me that, even without that specific nop uncommented, the rjmp is already at byte 18, so above the 16-byte boundary talked about.
I know. I said 16 just because it's a "nice number". I don't know the exact mechanism of the failure, just speculating. One scenario may be that it's not the addition which fails, but the increment before the addition, and in that case the PC would be <= 16 bytes off at that moment. But there may be other mechanisms, too. This all is up to Atmel to determine, though.

meslomp wrote:

.equ PAGESIZEB, SPM_PAGESIZE

while the datasheet (it has an example of a bootloader ) does:

.equ PAGESIZEB = PAGESIZE*2 ;PAGESIZEB is page size in BYTES, not words


Yes, I initially stole that piece of code from the datasheet, but had to modify it for the avr-gcc environment. There, things are systematically in bytes, almost nothing in words. Thus, the pre-defined SPM_PAGESIZE (from the respective header file, which is included indirectly) is already in bytes.

meslomp wrote:
also I see you doing the SPMCSR address with the '-0x20'
Again, this has to do with the way the SFR addresses are defined in the device header in avr-gcc.
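
To illustrate the '-0x20': the in/out instructions take an I/O address in the range 0x00..0x3F, while the same register's data-memory address is 0x20 higher. As far as I know, the avr-gcc device headers hand out the latter when used from assembly, hence the subtraction (avr-libc also offers the _SFR_IO_ADDR() macro for this). A small host-side sketch in C of that conversion follows; the function name is hypothetical, and 0x57 is the data-memory address the datasheet lists for SPMCSR on this family.

```c
#include <stdint.h>

/* Offset between an I/O address and the matching data-memory address. */
#define SFR_OFFSET 0x20u

/* Convert the data-memory address of a low I/O register into the operand
   that IN/OUT expect. Returns -1 if the register lies outside the
   0x00..0x3F I/O range and so cannot be reached with IN/OUT at all. */
int io_operand_from_mem_addr(uint16_t mem_addr)
{
    if (mem_addr < SFR_OFFSET || mem_addr > SFR_OFFSET + 0x3Fu)
        return -1;
    return (int)(mem_addr - SFR_OFFSET);
}
```

So io_operand_from_mem_addr(0x57) yields 0x37, which is the value "SPMCSR - 0x20" evaluates to in the source.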
meslomp wrote:

  ldi  ZL, lo8(FLASH_ADDR)
  ldi  ZH, hi8(FLASH_ADDR)

should that not be FLASH_ADDR*2?

Again, it's already a byte address.
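
The word/byte distinction behind these questions, as a small host-side C sketch: the datasheet example counts flash in words, while the avr-gcc side works in bytes, so a word address or word count is doubled once and must not be doubled again. The helper names are hypothetical (the byte extractors merely mirror what the lo8()/hi8()/hh8() operators of gas select), and the word address 0xF000 used in the note is only an example value, not taken from noboot.S.

```c
#include <stdint.h>

/* Convert a flash word address (or a size in words) to bytes. */
uint32_t word_to_byte_addr(uint32_t word_addr)
{
    return word_addr << 1;
}

/* Byte 0 (least significant) of a byte address, like lo8() in gas. */
uint8_t addr_lo8(uint32_t byte_addr)
{
    return (uint8_t)(byte_addr & 0xFFu);
}

/* Byte 1 of a byte address, like hi8() in gas. */
uint8_t addr_hi8(uint32_t byte_addr)
{
    return (uint8_t)((byte_addr >> 8) & 0xFFu);
}

/* Byte 2 of a byte address, like hh8() in gas. */
uint8_t addr_hh8(uint32_t byte_addr)
{
    return (uint8_t)((byte_addr >> 16) & 0xFFu);
}
```

A 128-word page comes out as word_to_byte_addr(128) = 256 bytes, matching the datasheet figure quoted above. An example word address of 0xF000 becomes byte address 0x1E000, which no longer fits in two bytes, so its third byte (addr_hh8) matters as well.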

meslomp wrote:
In the datasheet, a check is first done whether the flash is ready (wait_spm) before the operation is started. You first do the operation and then wait for it to complete,
That does not really matter as far as functionality goes - the only difference is, that the latter takes more time. Btw., I tried both ways.

meslomp wrote:
but what if during the first operation the flash was not ready, as far as I see then you did not check, but just started the operation.
The first operation is after reset, and FLASH is ready (the tested bit is zeroed) at that moment.

meslomp wrote:
Quote:
;not required for PAGESIZEB<=256

thus this being true, so can the problem be there?
No. It's not required, but also harmless.

Thanks all for the comments.

Jan


westfw wrote:
I am running a suitably-adjusted noboot on my 2008-vintage Arduino-MEGA (with m1280 datecode 0833) and it does not seem to be failing.
(Can I tell whether I have a Rev B, without JTAG?)
Thanks for your time.

No, unfortunately, I know of no way to find out the rev. number (not even with JTAG, as I don't use it :-| ) other than looking at the underside of the chip. And as I have neither a 08xx nor a 09xx datecode chip, I can't tell... :-(

Jan

[captchad... for WHAT???]

[EDIT] Out for hunting datasheets: the M of 09/2010 does not know about the rev.B ATMega1280 and the N of 05/2011 does, so I'd say datecodes up to 09xx are surely not rev.B, 10xx-11xx maybe, 12xx surely are.
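
The datecode reasoning above, written down as a tiny host-side C helper so it can be applied to a list of chips. This only encodes the estimate derived from the datasheet revisions, not anything Atmel has stated; the function and enum names are invented, and it assumes a 4-digit YYWW datecode from the 20xx years.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum {
    REV_NOT_B,       /* datecode up to 09xx */
    REV_MAYBE_B,     /* datecode 10xx-11xx */
    REV_B_EXPECTED,  /* datecode 12xx or later */
    REV_UNKNOWN      /* datecode could not be parsed */
} rev_guess;

/* Guess whether an ATMega1280 is rev.B from the two year digits of its
   YYWW datecode. A rough estimate only; the chip marking is authoritative. */
rev_guess guess_rev_b(const char *datecode)
{
    int yy;

    if (datecode == NULL ||
        !isdigit((unsigned char)datecode[0]) ||
        !isdigit((unsigned char)datecode[1]))
        return REV_UNKNOWN;

    yy = (datecode[0] - '0') * 10 + (datecode[1] - '0');
    if (yy <= 9)
        return REV_NOT_B;
    if (yy <= 11)
        return REV_MAYBE_B;
    return REV_B_EXPECTED;
}
```

For the 0833 chip mentioned above this returns REV_NOT_B, which fits the test not failing on it.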


wek wrote:
Rather, between flash, crystal oscillator and PC...?
Have you tried running with an external clock?

 


joeymorin wrote:
wek wrote:
Rather, between flash, crystal oscillator and PC...?
Have you tried running with an external clock?
Just tried that. Bug occurred. And tried also the low power oscillator modes. Bug occurred.

Jan


Could anybody here please make an AS6.1 project from the source in the zip in the first post?

The guy at Atmel's support needs that to move on, and I would like to avoid installing AS6...

Thanks,

Jan


wek wrote:
Could anybody here please make an AS6.1 project from the source in the zip in the first post?

The guy at Atmel's support needs that to move on, and I would like to avoid installing AS6...


I managed to borrow a laptop with AS6.1 installed from a friend - the project sent to Atmel's support is attached, in the hope there still might be somebody willing to try it on his ATMega1280-based hardware... :-)

Jan



Hi jan,

We have a product that uses the ATmega1280 and have started seeing some strange problems with the bootloader that appear to be batch-related.
Did you ever resolve this with Atmel?

Thanks,
Gordo


Hi Gordo,

It took several weeks to guide the first-line support guy to reproduce the problem. He finally confirmed he could reproduce it; then it took several more weeks or months until he came back with a response, saying that I should either avoid using the first couple of bytes or use a lower frequency, e.g. 8MHz (which is apparently the result of him trying the internal RC oscillator, which I know and reported to be one way to avoid the problem, and not actually trying a lower frequency crystal or clock input, which I know to still result in the problem occurring). I was disappointed - I have invested quite some energy in explaining the problem to them and helping them to reproduce it, and even though I stressed several times that this is NOT a trivial problem and is probably a result of a changed layout, obviously nobody knowledgeable over there tried to understand the root of the problem, not to mention tracing it down to the silicon and finding out whether it may have other adverse effects. So I did not pursue the issue further, and can't tell you more on the nature of the problem than I already said in the previous postings of this thread.

I am very limited in time at the moment, but if you want me to take a look at your bootloader, PM me, I will have more time after 1.July.

Jan


For a new batch of the product in question, we purchased ATMega2560 rev.E timestamp 1344. I put them under examination and they don't exhibit the problem.

I was not surprised, as at this moment I am quite sure this is a silicon-layout-related problem (that's why I am also not surprised Atmel is reluctant to go deeper in the investigation - this is far from being a trivial problem and I am the only one complaining and don't represent a $M buying power and found a fix myself).

JW


Jan, good to hear the problem has gone.

Too bad that Atmel will not tell you whether the problem indeed was a silicon bug. Now the root cause stays inconclusive and might turn up in a next batch of chips/new revision silicon.

I find it hard to believe that Atmel did not fully investigate. That should be independent of buying power. You were the first to find this problem and they should have checked it, especially with the amount of time you have already spent on finding the root cause.
I do hope they in the end will tell what they have found....


The problem is with the ATMega1280, so the problem did not go away; I just checked whether it is also present on the ATMega2560.

> I find it hard to believe that Atmel did not fully investigate.

They did what they saw as adequate: they reproduced the problem and they also reproduced my workaround. It's unlikely that you'd find, in the support line of a company like Atmel, an engineer knowledgeable and prudent enough to feel that a more thorough investigation is needed. Thus, I was disappointed, but not surprised.

Due to the nature of the problem it will occur in the "wild" only rarely (the majority of users IMO uses library functions, which mask the problem, as they inline the waiting loop after the SPM, thus don't execute code from the few affected bytes at the beginning of bootloader area during programming), so it's unlikely anybody will complain on the same thing again.

JW