[solved] dying mega8, no reset or wdt

#1

Hey,

I'm stuck with a rather funny situation that has never happened to me before. Here I have an ATmega8, a 12MHz XTAL (CKOPT programmed, full swing) and a few other components that mostly blink. Everything is kind of OK, until at a certain moment it dies.

The death is very curious:
*) oscillator is on
*) WDT (previously verified to work reliably) does not reset the MCU
*) pulling RESET down does not reset the MCU
*) at the same time, it answers to avrdude (via ISP) and can be reprogrammed and verified correctly
*) the only way to make it tick again is by cycling the power
*) I had the same problem with the builtin RC oscillator but then I suspected that it's the oscillator that is the problem and added a crystal.. alas

I suspected poor grounds so I added more wires around GND and +5V just to make sure, in case there's a broken trace where I can't see it. Also, it appears that this fault is temperature-related: when I heat the board to some +30-40ºC it fails almost instantly. When cooled down, which is easy at this time of the year, it keeps going on as if there's nothing wrong.

Obvious problems that I'm aware of:
*) lack of proper ground plane (it's a single-sided PCB)
*) lack of load caps - I know, bad, but the oscillator is working; besides, see above about RC

What would you suspect here? I'm rather frustrated.

The Dark Boxes are coming.

Last Edited: Thu. Mar 8, 2012 - 11:31 AM
#2

Looks like a PCB-related problem.

Quote:
*) pulling RESET down does not reset the MCU
*) at the same time, it answers to avrdude (via ISP) and can be reprogrammed and verified correctly

This is odd. Can it actually be reprogrammed?
Could be a cracked chip, too (but only if the chip is soldered down rather than socketed).
A thorough inspection of the pins with a scope can give an answer, since it fails instantly.

Could be parasitic oscillation or latch-up, or flux residue on the PCB - a scope can help uncover these, too.

#3

Kas wrote:
Looks like a PCB-related problem.
Quote:
*) pulling RESET down does not reset the MCU
*) at the same time, it answers to avrdude (via ISP) and can be reprogrammed and verified correctly

This is odd. Can it actually be reprogrammed?

Yes, this is the funny part. Like I said, it programs and verifies well.

Quote:
Could be a cracked chip, too (but only if the chip is soldered down rather than socketed).
A thorough inspection of the pins with a scope can give an answer, since it fails instantly.

Could be parasitic oscillation or latch-up, or flux residue on the PCB - a scope can help uncover these, too.


I suspected these too. The PCB is rather poor indeed, homemade and stuff. It's a TQFP package, and with such a strange reaction to such a small temperature change I thought it could be mechanical. I scrutinized it under a microscope, and I even replaced the chip with another one. Same effect.

It also bothers me that the chip appears to get warmer with time. It's not warm as in warm, maybe a couple degrees warm: I don't have a good thermometer to make sure, only a fingertip.. But it doesn't do anything that could make it sink or drain current. I also have some experience with mega8 pumping some real milliamperes and not having any problems with that.

Well, I keep on searching. Thanks for the input!

The Dark Boxes are coming.

#4

I assume you have by-pass caps on all of the V+/Gnd pin pairs. Odd things can happen without them.

What is the power supply? Is it oscillating? I would expect it to be more temperature sensitive than the micro itself.

The fact that you changed micros and had the same failure certainly points to the PCB or power supply as the likely source of the problem.

Can you post a photo of the PCB?

JC

#5

DocJC wrote:
I assume you have by-pass caps on all of the V+/Gnd pin pairs. Odd things can happen without them.

What is the power supply? Is it oscillating? I would expect it to be more temperature sensitive than the micro itself.

The fact that you changed micros and had the same failure certainly points to the PCB or power supply as the likely source of the problem.

Can you post a photo of the PCB?


I would rather not, because it's really messy now, after all the rage and added wiring. But it's nothing extraordinary, the grounds and VDDs are connected under the chip and routed out to the source points. I made at least a dozen layouts like this and never had any problems with them.

I have bypass caps (10uF and 0.1uF, on either side) located very close to the supply pins. The supply is USB power, by design, and in theory it could be sucked dry by the displays, so I experimented with a different supply that puts out 1A (which is 2x of the projected) of clean non-USB juice. Same result. And no, I didn't really notice anything wrong on the power lines. Power is all I've been messing around with for the last 4 or so days and I can't find anything wrong with it. But I guess it's time to take a more systematic approach. Measure this and that, write down and weigh in and so on..

The Dark Boxes are coming.

#6

Quote:
... supply that puts out 1A (which is 2x of the projected

500 mA is a lot of current.

I'm definitely NOT a USB expert, but I believe:

I assume you know that a USB port is only required to supply 100 mA, unless you start up at <= 100 mA and then negotiate for the higher current capacity.

If you don't do this, is the USB controller shutting down the port, or at least not giving you the current you expect (and hence perhaps not regulating well if you are trying to draw more current)?

Is this a Nixie Tube display?

What are you driving with the chip, and is there a lot of RF or on-the-bus noise from the output device? Perhaps you are just overwhelming the device with noise?

Last thought: when a micro works for a bit and then reproducibly stops, it is often a stack overflow problem. The program works until the stack is corrupted. How long this takes depends, of course, on the ISRs, user interface interaction, etc.

This could also explain why it failed with two different chips.

JC

#7

As Doc said, the most likely cause is a stack problem, although a hardware reset should also solve that, as it should reset the stack pointer and restart the program.

It is really strange that that is not working.

I do not know about avrdude, but are you perhaps working in some sort of debugWIRE mode, with something going wrong there?

#8

Hey JC, meslomp, thanks for the comments.

I'm currently away from home, just taking a break from other (also very frustrating!) things so I can't add new research results. Your arguments are valid, but:
1) the USB is negotiated to put out 500mA, the host happily confirms that but:
1.1) the irrelevance of this is proven by using an external power supply
2) it's a Numitron display, the bulbs are switched by constant-current drivers with shift registers. Each segment eats around 18mA assured by constant-current driver. So all together they can overwhelm 500mA. Luckily, they never really light up all together.
3) RF Noise: hmm. I can't really think of any special source of noise in such a simple circuit: the oscillator is probably the noisiest thing there. Sure, there are wires and even some mildly inductive loads but they should not be able to beat down a healthy atmega just like that. It's not an x-ray tube, just some tiny incandescent lightbulbs. In fact, I have built 3 very compact nixie clocks and never had any RF problems with them. This one just switches currents.
4) If it was a dead loop or stack overflow, it would just keep on resetting, either by overrunning the addresses or by WDT. Reset inertness can't be explained by software; that would mean that the AVR actually has an HCF instruction :D
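
The current budget from point 2 can be checked with a few lines of integer arithmetic. This is only a sketch using the two figures given in the post (about 18 mA per lit segment, 500 mA negotiated from USB); the helper names are made up for illustration, and the MCU and driver chips' own consumption is not counted.

```c
#include <stdint.h>

/* Figures taken from the post: ~18 mA per lit segment, 500 mA USB budget. */
#define SEGMENT_CURRENT_MA 18u
#define SUPPLY_BUDGET_MA   500u

/* Hypothetical helper: display current for a given number of lit segments. */
static uint32_t display_current_ma(uint32_t lit_segments)
{
    return lit_segments * SEGMENT_CURRENT_MA;
}

/* Largest number of simultaneously lit segments that fits in the budget. */
static uint32_t max_lit_segments(void)
{
    return SUPPLY_BUDGET_MA / SEGMENT_CURRENT_MA;  /* integer division rounds down */
}
```

By this count 27 lit segments draw 486 mA and the 28th takes the display alone past 500 mA. Assuming seven-segment digits (the post does not say how many digits there are), that corresponds to four digits all showing an 8.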

@meslomp: just a regular ISP programmer, nothing fancy. It's easily detachable, and it makes no difference whether it's plugged in or not. I don't find it that strange though. In programming mode, the CPU core is not functioning. The programmer talks to a rather primitive machine, which only needs a clock to sample SPI SCK and pump up the flash programming voltage, and which responds to simple commands over SPI. In fact, while this is happening the CPU *must* be in a coma, or it would conflict with the process. Normally this is assured by a mundane reset. Here I have something more divine.

The Dark Boxes are coming.

#9

I have been pondering this intriguing problem this afternoon, and given the evidence provided, I dismissed most of the possible causes. So what are the common factors? I think you are using the mega8 to talk USB. Correct? And so you use OSCCAL to bring the 8MHz internal RC oscillator to 12 MHz. Could it be that running it 50% above normal compromises the flash and EEPROM write times?

A GIF is worth a thousand words. They are called Rosa, Sylvia and Tessa; you can find them at https://www.linuxmint.com/

Dragon broken ? http://aplomb.nl/TechStuff/Dragon/Dragon.html for how-to-fix tips

#10

Sounds to me like the AVR is being held in reset during the failure, it might just be heat sensitive. Can you swap out the AVR?

#11

Plons wrote:
I have been pondering this intriguing problem this afternoon, and given the evidence provided, I dismissed most of the possible causes. So what are the common factors? I think you are using the mega8 to talk USB. Correct? And so you use OSCCAL to bring the 8MHz internal RC oscillator to 12 MHz. Could it be that running it 50% above normal compromises the flash and EEPROM write times?

No, it's running off a 12MHz crystal. I started to use it because like you did, I suspected that the internal RC is giving in at marginal modes. Unfortunately this doesn't seem to be the case. I should try adding load caps just in case, but I think if they were necessary, the oscillator just would not start, no? I can't observe any deficiencies at the oscillator pins.

Just in case: hfuse=0xcc, lfuse=0xef
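
For anyone wanting to double-check what those two fuse bytes mean, here is a small decoder sketch. The bit positions are as I read them from the ATmega8 datasheet fuse tables (worth verifying against your own copy), and the helper and enum names are invented for this example. Remember that an AVR fuse bit reads 0 when it is programmed.

```c
#include <stdbool.h>
#include <stdint.h>

/* ATmega8 high fuse byte bit positions (per the datasheet fuse table). */
enum { HF_RSTDISBL = 7, HF_WDTON = 6, HF_SPIEN = 5, HF_CKOPT = 4,
       HF_EESAVE = 3, HF_BOOTSZ1 = 2, HF_BOOTSZ0 = 1, HF_BOOTRST = 0 };

/* ATmega8 low fuse byte: BODLEVEL, BODEN, SUT1..0 and CKSEL3..0. */
enum { LF_BODLEVEL = 7, LF_BODEN = 6 };

/* Hypothetical helper: a fuse bit is "programmed" when it reads 0. */
static bool fuse_programmed(uint8_t fuse_byte, uint8_t bit)
{
    return ((fuse_byte >> bit) & 1u) == 0u;
}

/* Clock source selection, CKSEL3..0 (low nibble of the low fuse). */
static uint8_t lfuse_cksel(uint8_t lfuse)
{
    return (uint8_t)(lfuse & 0x0Fu);
}

/* Start-up time selection, SUT1..0 (bits 5..4 of the low fuse). */
static uint8_t lfuse_sut(uint8_t lfuse)
{
    return (uint8_t)((lfuse >> 4) & 0x03u);
}
```

Applied to hfuse=0xcc and lfuse=0xef, this gives RSTDISBL unprogrammed (so the RESET pin is still a reset pin), WDTON unprogrammed (the watchdog only runs if the firmware enables it), SPIEN and CKOPT programmed, BODEN unprogrammed (brown-out detector off), and CKSEL=1111, which is in the external crystal/resonator range.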

I like dksmall's idea about AVR being held in reset. It's logical that it is. The reset pin is pulled up to +5V through a 10K resistor. Going to sniff around that wire.

The Dark Boxes are coming.

#12

He already did. And changed the power source from USB to separate PS.

Puzzling huh ?

Edit: Mustard after dinner as we tend to say in Dutch :)

#13

Quote:
I should try adding load caps just in case, but I think if they were necessary, the oscillator just would not start, no?
Yes, I think you should add 18 or 22pF load capacitors, and no, not just so that it starts reliably: they are needed to achieve the necessary phase shift.

#14

If you can hot plug in the ISP and reprogram the chip after it has failed without a power cycle then the fact that it doesn't respond to reset is strange because the ISP uses the reset line to enter program mode.

#15

Can you reset it by pulling the reset pin on the ISP header? Can you measure the reset pullup resistor between the reset pin on the micro and Vcc? Could be a broken trace there.

What's Aref tied to? I once had all sorts of strange problems at higher temps when I had an external reference feeding Aref and accidentally set the chip to internal reference (not sure if it was the bandgap or AVcc, however).

/mike

#16

Is a capacitor connected between reset and ground, especially an SMD capacitor? It's a strange problem, but as a test try removing or replacing the reset capacitor.

Quote:
It also bothers me that the chip appears to get warmer with time. It's not warm as in warm, maybe a couple degrees warm: I don't have a good thermometer to make sure, only a fingertip.. But it doesn't do anything that could make it sink or drain current. I also have some experience with mega8 pumping some real milliamperes and not having any problems with that.

Is the PCB absolutely correct in the pin connections of VCC, AVCC and AREF, and is the 5V supply stable?

#17

Great guess n1ist, indeed I had AREF connected to +5V and not configured: a mistake I have kept repeating for years. I disconnected it. Unfortunately it still fails.

Observation: the CPU does not fall into coma, at least this is good. I didn't previously notice because it was only visible on a couple of pins, but it WDT-resets itself. What causes WDT resets? For example, waiting for SPIF to be set after writing to SPDR. But if I ignore SPIF, other things come into play. It becomes radically unstable.
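
A wait on SPIF that can never complete is exactly the kind of loop that ends in a watchdog reset. One way to make such a hang visible instead of fatal is to bound the wait and report the failure. The following is only a sketch: `spi_transfer_checked` and `SPI_WAIT_LIMIT` are made-up names, the limit is an arbitrary number, and the `#else` branch just provides stand-in registers so the code also compiles on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host stand-ins so the sketch compiles off-target (not real hardware). */
static volatile uint8_t SPSR, SPDR;
#define SPIF 7
#endif

/* Arbitrary poll limit, an assumption; tune it to the SPI clock in use. */
#define SPI_WAIT_LIMIT 1000u

/* Hypothetical helper: send one byte and wait a bounded time for SPIF.
 * Returns true and stores the received byte on success, false on timeout. */
static bool spi_transfer_checked(uint8_t out, uint8_t *in)
{
    SPDR = out;                          /* writing SPDR starts a transfer in master mode */
    for (uint16_t n = 0; n < SPI_WAIT_LIMIT; n++) {
        if (SPSR & (1u << SPIF)) {       /* transfer complete */
            *in = SPDR;                  /* reading SPDR after SPSR clears SPIF */
            return true;
        }
    }
    return false;                        /* SPIF never came up: report instead of hanging */
}
```

On a timeout the caller can light an error LED or dump SPCR somewhere visible, which narrows the search down to the SPI peripheral rather than "the chip dies".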

I still can't trace down the cause of heating. After reading n1ist's post I really hoped that AREF is going to solve it, but it appears that there is something else.

Sigh. Good night.

The Dark Boxes are coming.

#18

Update: the problem is solved but not really understood.

After converting the board into an authentic battlefield, cutting every trace and rewiring everything, scrutinizing every solder joint under a microscope and just dancing around with a tambourine, I discovered that there's a way to touch the board in such a manner that it suddenly starts to behave. It had to be a finger, a dielectric wouldn't do.

I touched and touched and then I made a "pullup probe", a 15K resistor to Vdd, which I applied to every pin in sequence. Turned out that when PIN14/PORTB2/SS/OC1B is pulled up, the problem disappears. Everything is ticking and I can heat the chip with a soldering iron and it does not notice.

Now this is interesting because I only use SPI as a master, I do not use PORTB2 anywhere in the code, and I do not use Timer1. I tried to find something about this in the datasheet but there's nothing that says that the SS pin can be used to implement a theremin.

I have set that pin to be an output, by luck I don't use it or care, and now it works. I only regret ruining a board that once was very pretty and wasting a week in fruitless frustration.

The Dark Boxes are coming.

#19

There we go:
http://www.ustream.tv/channel/sh...
It's standing on the central heating radiator.

The Dark Boxes are coming.

#20

ANY pin which is not used MUST be either an output or, if input, pulled high or low.

If you are using the SPI as a master and the /SS pin is floating and it happens to go low then the SPI will go nuts and revert to SLAVE.
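
In code, the fix amounts to making /SS an output before enabling the SPI as master. A minimal sketch for the ATmega8 follows, where PB2 is /SS, PB3 MOSI, PB4 MISO and PB5 SCK. The function name and the clock divider choice (SPR0, i.e. fosc/16) are assumptions for illustration, and the `#else` branch only provides stand-in registers so the snippet compiles off-target.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host stand-ins so the sketch compiles off-target; bit numbers per the ATmega8 datasheet. */
static volatile uint8_t DDRB, PORTB, SPCR;
#define PB2  2   /* /SS  */
#define PB3  3   /* MOSI */
#define PB4  4   /* MISO */
#define PB5  5   /* SCK  */
#define SPE  6
#define MSTR 4
#define SPR0 0
#endif

/* Hypothetical helper: SPI master setup with /SS driven as an output. */
static void spi_master_init(void)
{
    /* Set the PORTB bit first so the pin goes high as soon as it becomes an output. */
    PORTB |= (uint8_t)(1u << PB2);

    /* /SS, MOSI and SCK are outputs; MISO stays an input. */
    DDRB |= (uint8_t)((1u << PB2) | (1u << PB3) | (1u << PB5));
    DDRB &= (uint8_t)~(1u << PB4);

    /* Enable SPI in master mode, clock = fosc/16 (an arbitrary choice here). */
    SPCR = (uint8_t)((1u << SPE) | (1u << MSTR) | (1u << SPR0));
}
```

If /SS has to remain an input for some reason, the alternative is to hold it high, for example with the internal pull-up or an external resistor, so that it can never float low.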

John Samperi

Ampertronics Pty. Ltd.

www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly

#21

You cracked it once again :) Congratulations.

#22

Quote:
If some pins are unused, it is recommended to ensure that these pins have a defined level. Even though most of the digital inputs are disabled in the deep sleep modes as described above, floating inputs should be avoided to reduce current consumption in all other modes where the digital inputs are enabled (Reset, Active mode and Idle mode).

Searched the datasheet for nuts, no results. Recommended, yes, but not absolutely dictated. Oh well.

Thanks Plons ;)

The Dark Boxes are coming.

#23

Quote:

Recommended, yes, but not absolutely dictated.

It is indeed absolutely dictated in the datasheet. Doesn't your datasheet have a section in the SPI chapter...
Quote:
SS Pin Functionality
Slave Mode

When the SPI is configured as a Slave, the Slave Select (SS) pin is always input. When
SS is held low, the SPI is activated, and MISO becomes an output if configured so by
the user. All other pins are inputs. When SS is driven high, all pins are inputs, and the
SPI is passive, which means that it will not receive incoming data. Note that the SPI
logic will be reset once the SS pin is driven high.
The SS pin is useful for packet/byte synchronization to keep the Slave bit counter synchronous
with the master clock generator. When the SS pin is driven high, the SPI Slave
will immediately reset the send and receive logic, and drop any partially received data in
the Shift Register.

Master Mode
When the SPI is configured as a Master (MSTR in SPCR is set), the user can determine
the direction of the SS pin.
If SS is configured as an output, the pin is a general output pin which does not affect the
SPI system. Typically, the pin will be driving the SS pin of the SPI Slave.
If SS is configured as an input, it must be held high to ensure Master SPI operation. If
the SS pin is driven low by peripheral circuitry when the SPI is configured as a Master
with the SS pin defined as an input, the SPI system interprets this as another Master
selecting the SPI as a Slave and starting to send data to it. To avoid bus contention, the
SPI system takes the following actions:
1. The MSTR bit in SPCR is cleared and the SPI system becomes a Slave. As a
result of the SPI becoming a Slave, the MOSI and SCK pins become inputs. ...
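
When SS really must stay an input, the same datasheet section goes on to say that the software should check that MSTR is still set and set it again if a Slave Select has cleared it. A sketch of such a check, with a made-up function name and host stand-ins in the `#else` branch so it compiles off-target:

```c
#include <stdbool.h>
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host stand-ins so the sketch compiles off-target (not real hardware). */
static volatile uint8_t SPCR;
#define MSTR 4
#endif

/* Hypothetical helper: call before each transfer.
 * Returns true if master mode had been lost and was re-enabled. */
static bool spi_ensure_master(void)
{
    if (SPCR & (1u << MSTR)) {
        return false;                    /* still master, nothing to do */
    }
    SPCR |= (uint8_t)(1u << MSTR);       /* re-enable master mode */
    return true;
}
```

This only papers over the symptom: if the SS input is still low, the hardware can clear MSTR again, so a defined level on the pin (output, or input held high) remains the real fix. A true return value is still handy as a diagnostic counter.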

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

#24

@Lee: you are absolutely right, shame on me. I guess I was lucky to use that pin as output in previous projects and this is how I never noticed this before.

The Dark Boxes are coming.