ISP an AVR with external WDT

I am just wondering if there is any way to program an AVR via the ISP interface while its reset pin is connected to an external WDT IC.

I am using the MAX823 watchdog IC and the AT90CAN32 uC. I have connected the MAX823 open drain output to the uC's reset pin and the WDI (WDT timer reset) pin to the uC's PB1 (SCK) pin. Also, between the MAX823 reset output and the uC's reset input, there is a jumper that is left open while the microcontroller is being programmed via ISP.

I have tried programming the uC while its reset pin is connected to the external WDT, but it does not work.

I have also looked for an app note from Atmel, but found nothing.

Do you have any experience with this?

Thank you.

Michael.

User of:
IAR Embedded Workbench C/C++ Compiler
Altium Designer

Quote:

Also, between the MAX823 reset output and the uC's reset input, there is a jumper that is left open while the microcontroller is being programmed via ISP.

If the jumper exists to isolate the AVR during ISP then what's the problem?

No, there is no problem. I am just wondering if there is any simple way to eliminate the jumper in future products.

Thanks.

Michael.

The MAX823 data sheet lists a Voh for the MAX82_ RESET output, so it doesn't look like an open drain anyway. This would explain why you have to remove the jumper to the external WDT for the ISP reset to work. Even if it was an open drain, it might have an external pull up that is too small a value for the ISP reset to overcome.

Did you try connecting the MAX823 external MR reset input (pin 3) to your ISP reset line and leaving the jumper in? It changes the external reset timing to the AVR reset pin slightly, but it might work.

One of the reasons we "like" AVRs is the high integration for our type of apps. A Mega48/88 schematic is near-single-chip with typically only signal conditioning added (and power supply/regulation as needed). There are fuses for various configurations, internal BOD, byte-addressable EEPROM for parameter settings and simple logging (e.g., fault history), ... and a configurable watchdog timer.

A glance at the MAX823 datasheet makes me curious. At the price of an entire Mega88 (nearly US$3/qty. 1; nearly $2/qty. 100) what is the need/advantage/use of adding a not-inexpensive extra component to your design? Admittedly, you are starting with a $10 micro and not a $2 model so the proportion isn't quite as high. But the AVR model you mentioned has 7 BODLEVELs to choose from, and the "modern" watchdog with interrupt capability. A "manual reset" button can be added to the AVR's /RESET circuit just as easily as to the MAX823. The '823's nominal timeout figures are coarse, and also quite long to protect e.g. a moving apparatus.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Lee,

I cannot disagree with you. Although I have used uCs from other manufacturers, I vote for AVRs because I love their very high integration.

But..... after all, I would like to inform you that there are ways to stuck the cpu even if you have the WDT or BOD enabled. I have spent enough time trying to protect the ATmega48PV from this phenomenon, which occurs when I obsessively touch the XTAL lines, with an external crystal connected and the CPU in active mode.

That's why I used an external WDT, and only for a specific product that is not enclosed in a plastic case, in order to protect it.

Also a button to the reset pin costs more than a MAX823 that automatically resets the AVR.

Mike B,

Yes, you are right, there is no open drain output. I was wrong. No, I didn't try to use the MR' pin; I have connected it to Vcc. I'll try it.

There must be some easy way. I don't believe that I am the only one having this question.

Michael.

Quote:

I would like to inform you that there are ways to stuck the cpu even if you have the WDT

Then you aren't doing the "WDR" only when all conditions are correct to continue operation. With the external watchdog, if you tickle that watchdog at the same spot you do the AVR's WDR you would get the same results.
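A minimal sketch of what "only when all conditions are correct" can look like in C. Everything here is hypothetical and not from the thread: the task names, `task_check_in()`, and `watchdog_service()` are illustrative, and the actual kick (a `wdr` instruction or a pulse on an external supervisor's input) is passed in as a hook so the same gate serves either watchdog.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task bits; each task sets its bit after one healthy pass. */
enum {
    TASK_SENSORS = 1u << 0,
    TASK_CONTROL = 1u << 1,
    TASK_COMMS   = 1u << 2,
    ALL_TASKS    = TASK_SENSORS | TASK_CONTROL | TASK_COMMS
};

static uint8_t checked_in;   /* bits collected since the last kick */

/* Called by a task when it has completed one pass successfully. */
void task_check_in(uint8_t task_bit)
{
    checked_in |= task_bit;
}

/*
 * Called once per main-loop pass. Kicks the watchdog through the supplied
 * hook only if every task has checked in. Returns true when the gate opened.
 * A NULL hook is allowed for dry runs.
 */
bool watchdog_service(void (*kick)(void))
{
    if ((checked_in & ALL_TASKS) != ALL_TASKS) {
        return false;        /* something has stalled: let the watchdog expire */
    }
    checked_in = 0;          /* start a new supervision round */
    if (kick != NULL) {
        kick();
    }
    return true;
}
```

With a gate like this, the choice between internal and external watchdog only changes which function is passed as `kick`; the decision about when to kick stays the same.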

Lee

No,

Even if your application is the simplest one like:

Fuse bits:
WDTON enabled,
No interrupt is used,
BOD enabled,
External Crystal Osc. is used.

I/O setup.

while(1){
    __watchdog_reset();
}

you can do it. I have also done this test on several types of micros, such as the NEC 78K0, Renesas 6N24, Freescale and PIC16. The result was the same. I never get it using an external WDT.

Why don't you try it???

Michael.

Quote:

I never get it using an external WDT.

You never get >>what<<?

If this is indeed the entire program, the results are going to be exactly the same if you have:

// Using AVR's watchdog
while (1)
    {
    #asm("wdr")
    }

or

// Using external watchdog
while (1)
    {
    // Pulse the external watchdog
    WATCHDOG_PULSE_HIGH();
    WATCHDOG_PULSE_LOW();
    }
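The `WATCHDOG_PULSE_HIGH()` and `WATCHDOG_PULSE_LOW()` macros are not defined anywhere in the thread. A minimal sketch of how they might look, assuming the supervisor's WDI input sits on PB1 as described in the first post and assuming avr-gcc register names; the host-side stand-ins exist only so the fragment also compiles off-target. Check the supervisor's data sheet for its minimum pulse width before relying on back-to-back writes.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host-side stand-ins so the sketch compiles off-target (assumption). */
static volatile uint8_t DDRB, PORTB;
#define PB1 1
#endif

/* Assumed wiring: supervisor WDI input on PB1 (SCK). */
#define WDI_MASK ((uint8_t)(1u << PB1))

/* Make the WDI pin an output; call once during I/O setup. */
static inline void watchdog_pin_init(void)
{
    DDRB |= WDI_MASK;
}

#define WATCHDOG_PULSE_HIGH() (PORTB |= WDI_MASK)
#define WATCHDOG_PULSE_LOW()  (PORTB &= (uint8_t)~WDI_MASK)

/* One kick: a high-going pulse on WDI, leaving the pin low afterwards. */
static inline void external_watchdog_kick(void)
{
    WATCHDOG_PULSE_HIGH();
    WATCHDOG_PULSE_LOW();
}
```

Because the same pin doubles as SCK, the ISP programmer also drives it during programming, which is worth keeping in mind when the reset lines are shared.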

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Quote:

that there are ways to stuck the cpu even if you have the WDT or BOD enabled

Quote:

You never get >>what<<?

Michael seems to be saying that (in his experience) the AVR can get into a latched-up state from which even its own WDT will not free it.

(assuming that's what "stuck the cpu" means?)

A really good watchdog is also sensitive to the timing of the kicks. It should not be too fast or too slow.

For one board I used a tiny13 as multi-input window watchdog :oops:
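A minimal sketch of the window check such a watchdog performs, written as plain C. This is not the tiny13 firmware mentioned above; the type and function names, and the tick-based timing, are assumptions for illustration. A kick is accepted only if it arrives neither sooner than `min_ticks` nor later than `max_ticks` after the previous accepted kick.

```c
#include <stdint.h>

typedef enum { KICK_OK, KICK_TOO_EARLY, KICK_TOO_LATE } kick_result_t;

typedef struct {
    uint32_t min_ticks;   /* kicks closer together than this are rejected */
    uint32_t max_ticks;   /* kicks further apart than this are rejected */
    uint32_t last_kick;   /* tick count at the previous accepted kick */
} window_wdt_t;

/*
 * Classify a kick arriving at tick `now`. Unsigned subtraction keeps the
 * elapsed time correct across a wrap of the tick counter. The caller is
 * expected to assert reset on anything other than KICK_OK.
 */
kick_result_t window_wdt_kick(window_wdt_t *w, uint32_t now)
{
    uint32_t elapsed = now - w->last_kick;

    if (elapsed < w->min_ticks) {
        return KICK_TOO_EARLY;   /* e.g. runaway code hammering the kick */
    }
    if (elapsed > w->max_ticks) {
        return KICK_TOO_LATE;    /* main loop is running far too slowly */
    }
    w->last_kick = now;
    return KICK_OK;
}
```

This only classifies kicks that actually arrive. A real window watchdog also needs a free-running timeout so that a complete absence of kicks is caught.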

Quote:

Michael seems to be saying that (in his experience) the AVR can get into a latched-up state from which even its own WDT will not free it.

1) I don't believe it.
2) The posted sample program certainly will not prove that theory, as the watchdog is constantly being reset.

Lee

icarus1 wrote:
But..... after all I would like to inform you that there are ways to stuck the cpu even if you have the WDT or BOD, enabled. I have spend enough time trying to protect the ATmega48PV from this phenomenon, that occurs when I obsessively touch the XTAL lines, with an external crystal connected and the CPU in active mode.
If Michael completely stalls the external AVR clock by messing with the crystal, it wouldn't be the fault of the BOD or WDT that it cannot recover. It would probably only recover and restart the AVR clock after cycling the AVR power. He should at least put his circuit inside a case!

Something like the new Xmega external oscillator failure detector should recover from an external clock stall automatically, but no current ATtiny or ATmega parts have this feature.

Michael should read and use section 5 Using crystal and ceramic resonators in AVR042: AVR Hardware Design Considerations and the entire AVR186: Best practices for the PCB layout of Oscillators.
http://www.atmel.com/dyn/resourc...
http://www.atmel.com/dyn/resourc...

BTW, if his AVR has the CLKO (clock output) pin, it could be enabled with its CKOUT fuse for easy AVR oscillator stall testing. This would avoid loading the XTAL pins with a test instrument, which would affect the testing and might stall the oscillator itself. If the oscillator is stalling, he should try different XTAL capacitor values as per AVR042.

Quote:
Michael seems to be saying that (in his experience) the AVR can get into a latched-up state from which even its own WDT will not free it.
theusch wrote:
1) I don't believe it.
I don't find it so unbelievable.

The claim seems to be that the internal, integrated WDT is NOT wholly independent of the CPU and is not as credible as one that IS wholly independent of the CPU. They might be sequestered by design, but they do live in the same house.

The argument is that the mechanisms that go into the WDT functioning can be compromised by whatever spurious event might compromise the CPU - even if the clock source is different.

If an event could clobber the program counter of the CPU and cause erratic code execution, why couldn't it also potentially disrupt the WDT operation? Wouldn't an off-chip WDT be less susceptible to such a concurrent disruption?

I have personally seen an NEC part's integrated WDT fail to bring the micro out of latchup. That chip claimed to have an independent WDT oscillator. Yet it hung with its I/O forever latched in some-on, some-off fashion, happily pumping product until POR. NEC quickly acknowledged that true watchdog protection requires a standalone supervisor circuit. That was less than 5 years ago.

Maybe Atmel is smarter than NEC. I'm not willing to take that chance. When safety is an issue, I use a standalone supervisor (I don't care who the uC MFG is). When not, I use the integrated WDT, BOD.

Quote:

If Michael completely stalls the external AVR clock by messing with the crystal, it wouldn't be the fault of the BOD or WDT that it cannot recover. It would probably only recover and restart the AVR clock after cycling the AVR power.

And in that case, the external watchdog doing a reset shouldn't be any different than the internal watchdog doing a reset. The clock either recovers and it comes out of reset, or it doesn't.

Quote:

If an event could clobber the program counter of the CPU and cause erratic code execution, why couldn't it also potentially disrupt the WDT operation? Wouldn't an off-chip WDT be less susceptible to such a concurrent disruption?


Well, icarus and Class seem to think it's possible. How am I to prove something is impossible? Is an off-chip WDT less susceptible to the chain of events that you describe? Perhaps; maybe; I can't say no; I think reason says yes. But what are those possibilities? Is one a few parts per trillion more likely?

Let's put out another scenario: Given these dastardly conditions with no-holds-barred, eye-poking (at least finger-poking) and gouging, etc., isn't the internal WDT design >>less<< likely to suffer spurious resets because the external WDT design has more traces to/from the AVR etc. etc.? I'll guess more ppt there.

I don't do medical designs. (IIRC AVRs are specifically disallowed from life-support designs? Or am I thinking of some other part(s)?) But I do a lot of industrial designs. We just blasted a multi-module, multi-AVR design with so much noise that we were damaging peripheral chips such as DS485 drivers. I never "stucked" an AVR in the testing such that the symptoms might have been a failure to WD. We bring the ISP header, including the /RESET signal, out to where it is accessible in the "real world". Yes, if you short those wires you are going to hold the AVR in reset. In many cases if you short an AVR output, or several of them, you will drag down the AVR to an unusable condition.

No, Virginia, I don't believe it. Impossible? Ha. Is it "impossible" with the external WDT? (Be careful of your answer--if the AVR's output that tickles the external WDT is no longer an output because of rogue code but rather becomes a floating input, mains hum could be tickling the watchdog. Impossible? Can't happen with the internal WDT, so that makes the internal better.)

Let's discard these hit-by-a-meteor possibilities for a moment. If icarus says it can be replicated, then I'd like to see the program and test conditions. Now we are hearing about stopping the AVR's main, external clock. What does that have to do with the internal WDT?

Lee

theusch wrote:
And in that case, the external watchdog doing a reset shouldn't be any different than the internal watchdog doing a reset. The clock either recovers and it comes out of reset, or it doesn't.
I never read that Michael made any case that the external watchdog worked when the internal watchdog didn't (it is only implied). Since this is an internal silicon type problem, testing/reproducing failures isn't a sure bet. And since we are getting implied information, it appears he added an external watchdog after experiencing induced failures in the original circuit. Sounds like a different PC, circuit, whatever. Did he test the new/modified circuit for the exact same problem with only the internal AVR WDT enabled (with the external WDT jumper pulled)? Maybe the crystal was operating under different circumstances in the new/modified environment and never failed when being touched? Did he verify an external watchdog actually fired when touching the crystal traces or did he simply assume it fired because the failure didn't occur? This is a testing/investigation issue and I have yet to see any information that positively rules out anything.

If the AVR internal WDT is really failing, then it is implied there is something in the WDT that is upset and fails when the AVR clock is messed with. According to the data sheet, the AVR external reset is much the same as the WDT reset path with only a minor source difference and timeout counter difference. Maybe there is something synchronous in the internal WDT circuit if only an external WDT can get the job done, or more likely the problem testing/characterization is incomplete and we are chasing ghosts.

Maybe there is a problem with AVR WDT, even after all these years. I'd consider it very unlikely when the AVR is operating within Absolute Maximum Ratings. (If you hit the AVR hard enough with noise spikes especially ground bounce, you can indeed get it to do weird things. But at that time the Abs Max ratings are violated.)

I thought of a way to do some checking on stopping the external clock/crystal. Have a simple looping program as above. Let's say we toggle an output each loop (or periodically using a timer) to indicate the AVR is still running. Optional: CLKO monitoring.

Have another output pin. Drive it low with the AVR program, and have a weak external pullup. (Or drive it high with the AVR, and have a weak external pulldown.)

Enable the watchdog at a convenient interval. With a fast loop containing a WDR a short interval may work best for 'scope work.

Run that program and 'scope the outputs. When you stop the AVR clock, the toggling output should go to a state and stay there. One watchdog period later, the "aux" output should go to the pulled state, indicating that the WDT kicked in, the AVR went into reset, and the I/O pin became a floating input.

From what I can glean from the discussion, this is the situation that icarus claims gets the AVR stuck. I'll bet a cold one it doesn't. Whether it starts again when the clock is resumed may well depend on the clock circuitry.
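The test program described above could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: it assumes avr-gcc with avr-libc (`wdt_enable()`, `wdt_reset()`, `WDTO_15MS` from `<avr/wdt.h>`), and the pin choices (PB0 as the toggling output, PB2 as the "aux" output with a weak external pull-up) are arbitrary. The host-side stand-ins exist only so the fragment also compiles off-target.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#include <avr/wdt.h>
#else
/* Host-side stand-ins so the sketch compiles off-target (assumption). */
static volatile uint8_t DDRB, PORTB;
#define PB0 0
#define PB2 2
#define WDTO_15MS 0
static inline void wdt_enable(uint8_t timeout) { (void)timeout; }
static inline void wdt_reset(void) { }
#endif

#define HEARTBEAT_MASK ((uint8_t)(1u << PB0))  /* assumed pin: toggles while code runs */
#define AUX_MASK       ((uint8_t)(1u << PB2))  /* assumed pin: driven low, weak external pull-up */

/* Drive both pins low as outputs and start the watchdog. */
void stall_test_init(void)
{
    PORTB &= (uint8_t)~(HEARTBEAT_MASK | AUX_MASK);
    DDRB  |= (uint8_t)(HEARTBEAT_MASK | AUX_MASK);
    wdt_enable(WDTO_15MS);   /* short interval, convenient on a scope */
}

/* One pass of the main loop: show we are alive, then reset the watchdog. */
void stall_test_loop_once(void)
{
    PORTB ^= HEARTBEAT_MASK;
    wdt_reset();
}

#ifdef __AVR__
int main(void)
{
    stall_test_init();
    for (;;) {
        stall_test_loop_once();
    }
}
#endif
```

If the prediction in the post holds, stopping the clock freezes the PB0 toggle, and about one watchdog period later PB2 is released and pulled high by the external resistor. That is the expected outcome of the proposed experiment, not a measured result.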

Lee

theusch wrote:
We just blasted a multi-module, multi-AVR design with so much noise that we were damaging peripheral chips such as DS485 drivers. I never "stucked" an AVR in the testing such that the symptoms might have been a failure to WD.
As theusch points out, testing to the point of failure is tricky business. You can blast away for indefinite periods of time. The trouble is, there is likely a sweet spot (in reality, it's a sour spot) that you won't stumble on in a designed test - even if you spend 10's or 100's of hours testing.

Testing to failure can be telling, even if the test conditions are way beyond the worst case expected environment conditions. It will tell you how your device behaves if there is a failure. Does it fail safe? Sometimes, your device cannot cause serious harm, no matter what happens. That is a good position to design from. Sometimes, your device is absolutely critical (life support, aircraft control, etc). I would be a basket case if I was responsible for those designs. A lot of times, the potential hazards are more subtle.

How about a thermostat that could turn the furnace on forever? There could be some serious property damage if it happened when a home/office is unoccupied. But what if it is the thermostat in an assisted living apartment? The ambient temperature could become fatal if the occupant is not able to react for whatever reason. How about a commercial laundry dispenser pumping 40 gallons of bleach or builder into an empty washing machine? The cost of that wasted laundry product is huge. But what if that washer has its door open because the operator is adding or removing linen? It is near fatal, certainly permanently disfiguring. I'm sure there are countless examples.

It might take 1/2 million parts running for 1000's of hours for the sour spot to show itself. Then, what is the severity? If it is medium or lower (no death or serious injury), then the risk rank is low and you can sleep at night. If the severity is high, the risk rank might be high even if the occurrence probability is low.

Perhaps there are limited cases where internal WDT is more reliable and limited cases where external WDT is more reliable. Maybe when you feel compelled to use an external WDT to mitigate that possibility, you could put on the belt and suspenders - implement both.

I don't disagree with any of that. I just questioned the $2 external WDT on an AVR design. I read that the justification is stucked Mega48s, and with the simple posed program "you can do it". I don't see how to do it, and cannot see how stopping the clock would affect it. Thus, I'd like to know how "you can do it". And I'll try it!

Re the catastrophic failures: If you are going to jump signals then there is nothing to say that the live supply wire can't be crossed directly to the output.

Our grizzled crew has been doing industrial electronic devices for decades, and nowadays nearly all have a micro. I'd summarize the end result as "do it good enough". Does it operate reliably under normal conditions? Is it graceful under foreseeable conditions? No, we don't guard against a mouse crossing the hot and the output (which just happened across screw terminals in one of our installations. The mouse is now quite hollow.).

I count on the AVR's watchdog. If there is a hole, I want to know how to verify it--and icarus implied it was straightforward to do.

Lee

This starts with the basic premise that touching external crystal traces causes the AVR to get "stuck". Stalling the AVR external clock is documented in application note AVR042. However, there are temporary stalls where the oscillator recovers when conditions return to normal, and there are long-term stalls where AVR power must be cycled to recover and restart the crystal oscillator.

The conclusion is that, in order to do any testing of the described problem, you must be able to reproduce getting an AVR "stuck" when touching external crystal traces. This is where icarus1 could be a big help by providing information like the crystal type (a data sheet would be ideal), the crystal capacitor type/values (C0G/NP0, tolerance, etc.?) and a brief description of the physical circuit (PC board, surface mount, through hole, breadboard, point-to-point wiring, etc.).

Only after you can reproduce the problem can you start testing internal vs external WDT effects. You will need to know the state of the AVR clock (the CLKO pin would be ideal) to evaluate which type of stall occurred. If you artificially kill the AVR XTAL clock without using touching, then how will you answer the question of how an internal WDT can't restart it vs how an external WDT can restart it? This whole stalling is only a best guess based on known behavior in weak designs. Maybe something even stranger is going on? Can icarus1 get an AVR stuck using the internal RC (i.e. just human contact/ESD on the XTAL pins with no stalling possible)?

As an alternative, if you are confident in your human ESD model, try obsessively touching your AVR external crystal lines and see if you can get your AVR "stuck". If you can't, then it's probably safe to assume this is a design/implementation problem icarus1 created for himself.

Why would you want an exposed circuit board where any internal trace/pin/connection may be subjected to ESD (especially any not connected pins)? The external WDT itself might even get zapped under these conditions. It is not a high reliability design to begin with.

One other thing just came to mind. An enabled CLKO pin should be able to check for asymmetrical duty cycle. This is documented as something you shouldn't subject an AVR to. Maybe there is a marginal duty cycle problem in play. As in maybe X7R +- 30 % tolerance XTAL capacitors were not the right choice :).

I agree that good enough is indeed good enough. For sure experience and proven history are good tools to use in determining when it's good enough. When you have to break ground though, sometimes you don't know you missed the mark until it is too late.

I've heard enough embedded guys say:

Quote:
you can't trust the internal BOD and WDT like you can a dedicated supervisor circuit
to think there could be something to it. I don't consider myself qualified to say that about the AVR. I would say it about the NEC 78K0 though.

I too have used the internal WDT with all of the AVR stuff that I have done. I've used it with some PICs. All without any worries - because of the nature of the products.

It has been a couple of years since I've had the kind of safety concerns that would cause me to second guess the internal WDT. If I do find myself there again, a dedicated supervisor circuit is for certain. Until then, internal is good enough.

In my opinion, if something happens that causes your system to stall or get stucked, the WDT (internal or external) better at least reset the I/O to default. If it is an external supervisor, it will/should generate a reset pulse. If your micro is so stucked that it completely ignores the reset pulse, then your micro is very, very poorly designed. Sure, if it is electrically damaged, it might not be able to shut outputs off. But if it is just stucked because of a loss of system clock, I would not forgive it for ignoring a reset.

I think one part of the debate has turned into a question of how likely it is that a stucked part would generate the internal reset. One would hope that it would be just as likely as with the external circuit, and one would hope that the internal reset would be as effective as if a pulse came in on the pin.

icarus1 said that it did get stucked, from the little info that was given--at least as I read it. I claim the WDT would fire.

I'd be happy to test his method of recreating the symptoms, but the airwaves have gone quiet in that respect. This isn't idle curiosity--as mentioned we have many, many industrial designs using AVRs with WDT enabled. If there is a hole there I want to know about it. At least how to recreate the symptoms so courses of action can be considered.

Lee

For years I have heard enough embedded guys say:

Quote:
Connect a RC pull up and capacitor to the reset line for power on reset
This used to be true in the old days. However, the AVR has an internal built in circuit to do this. It is actually bad advice with the modern AVR, except it doesn't do any real harm, so it must be correct after all :lol:.

What enough embedded guys say might be good information, or it might be old wives' tales or a cover for user-induced design problems. I prefer something that can be reproduced over unsupported sayings. On the other hand I suppose I could take the blind faith approach and make myself "feel better" by making design decisions based on sayings and rumors :wink:.

Quote:

However, the AVR has an internal built in circuit to do this.

... but the pull-up is too wimpy for real-world work. Besides the old wives, Atmel recommends it as well.

Repeating, I'm not worried about extreme stress conditions that are a really crazy set of circumstances and signal levels. But I am concerned if there is a sizable hole in the AVR's WDT. Let's hear more about how to reproduce the stucking, icarus1.

Lee

Mike B wrote:
On the other hand I suppose I could take the blind faith approach and make myself "feel better" by making design decisions based on sayings and rumors :wink:.
You can call it sayings and rumors, or you can call it experience talking. But, you have to consider the source. I might be inclined to believe those who have lived through the hard knocks more than I am to believe the marketing folks at Atmel, Microchip, Freescale, or NEC.

Bottom line is that I wouldn't recommend "blind faith" in crossing the street in either direction. MFGs can exaggerate the performance of their parts just like they exaggerate the availability of their parts. Just like an old wife can exaggerate the benefit of smearing butter on a burn.

I've seen issues with one MFG's internal WDT. That might make me less inclined to discredit an otherwise credible source telling me that other MFG's have the same issue. It's all about feeling better (or feeling good enough) with the information you have.

The internal WDT is there. Use it. I'm sure the majority does (I do). I would wager that you will never get the opportunity to see it perform the way you expect it to. Not because it can't or doesn't, but because the conditions will never be right for it to happen in your presence. The same goes for an external WDT. BOD is readily testable. ESD immunity is readily testable. But, how could you create the conditions that show the performance (good or bad) of the WDT? I agree with theusch, icarus1's test doesn't do it. "Blind faith" in the ATMEL designers isn't "good enough" when it really counts. Fortunately, it rarely really counts.

theusch wrote:
... but the pull-up is too wimpy for real-world work. Besides the old wives, Atmel recommends it as well.
Sorry, I didn't make my point clearly (my bad). I meant that the stated reason, providing a power-up reset, doesn't apply. The recommended ATMEL capacitor is for noise filtering, not to provide a power-up reset, and the external resistor is only there to compensate for the weak pull-up. Even if the AVR reset pin is tied to Vcc it should still get a power-up reset. The old wives' tale is just about the power-up reset part.

You made an excellent point about ATMEL reset pin recommendations, which actually get just a little more complex when you consider HV, ISP and debugWIRE applications.

Class88 wrote:
"Blind faith" in the ATMEL designers isn't "good enough" when it really counts. Fortunately, it rarely really counts.
I think we agree that blind faith isn't a good way to design. I just don't see how following an unsupported "they say" is anything but blind faith. They may not be correct or actually may be correct, but they can't tell you why.... Problems like this are actually quite complex, specific to individual products (which rules out general experience) and implementation dependent. Maybe ATMEL has an unpleasant surprise waiting if this could be reproduced and tested.

Believe me, other things "really count" when you have to fix an unreproducible in-the-field product failure, like before yesterday.

To go off into la-la land, let's say the design is absolutely sound and follows all guidelines with perfect construction. What if the touching causes unbalanced capacitive loads on the AVR clock, resulting in corrupted, asymmetrical AVR clock duty cycles? This duty cycle is used in the metastability input pin sampling. So, what if it causes the metastability protection to fail and the AVR circuitry is latching up internally? Well, latch-up is usually only correctable with power cycling. So, maybe it's just an abnormal AVR CPU state caused by corrupted clock duty cycles? On and on....

I would love to get my hands on icarus1's hardware which has the best chance of being repeatable. Toss in a scope, logic analyzer, other test equipment, design evaluation, construction checking and time spent will likely get a usable answer.

Mike B wrote:
I think we agree that blind faith isn't a good way to design. I just don't see how following an unsupported "they say" is anything but blind faith.
Agreed. I contend that the "they" part of our mutual concerns includes the uC MFG as well as the old timers. Neither has offered solid proof of their claims.

Going to a tie on the lack of solid proof, I use reasoning to decide that a dedicated supervisor circuit (that has no alternate, programmable functions, that is not running micro-code, etc) is more likely to perform its intended function through the worst of it all.

Quote:
Believe me, other things "really count" when you have to fix an unreproducible in-the-field product failure, like before yesterday.
I've been there. At the time, it sure felt like it really mattered. The CFO sure thought it did.

I've also walked through the soggy wet basement of a home to take photos of a piece of equipment that set that home on fire. To get there, I had to walk past the sooty remains of a family's home. No one was injured in that or the other 7 fires attributed to that design. But that was by pure luck.

It's a matter of perspective what constitutes "really" mattering.

Adding an external watchdog certainly couldn't hurt. But don't fool yourself into thinking that it is any more robust than the internal one.

Simply put, if an event causes the silicon to "latch up", there is no event, short of removing power, that will reset the chip.

I'd say if you feel you need to use an external one for reliability... use both, so you have a level of redundancy. It is highly unlikely that both WDT's will fail at the same time.

In highly critical apps I have done the dual WDT approach, with the external one actually used to cycle power to the control unit on a secondary time-out.
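The post does not say how the secondary time-out was implemented. A minimal sketch of one way such a two-stage supervisor could decide what to do, with all names and the tick-based timing being assumptions: the first missed deadline yields an ordinary reset pulse, and a second missed deadline with no kick in between yields a power cycle.

```c
#include <stdint.h>

typedef enum { SUP_IDLE, SUP_RESET_PULSE, SUP_POWER_CYCLE } sup_action_t;

typedef struct {
    uint32_t timeout_ticks;   /* allowed gap between kicks */
    uint32_t last_event;      /* tick of the last kick or the last action taken */
    uint8_t  resets_tried;    /* reset pulses issued since the last kick */
} supervisor_t;

/* Record a kick from the supervised controller. */
void supervisor_kick(supervisor_t *s, uint32_t now)
{
    s->last_event = now;
    s->resets_tried = 0;
}

/* Poll periodically; returns the action the supervisor hardware should take. */
sup_action_t supervisor_poll(supervisor_t *s, uint32_t now)
{
    if ((uint32_t)(now - s->last_event) < s->timeout_ticks) {
        return SUP_IDLE;
    }
    s->last_event = now;             /* restart the timer for the next stage */
    if (s->resets_tried == 0) {
        s->resets_tried = 1;
        return SUP_RESET_PULSE;      /* first time-out: ordinary reset */
    }
    s->resets_tried = 0;
    return SUP_POWER_CYCLE;          /* secondary time-out: remove power */
}
```

The escalation covers the latch-up case mentioned above, where a reset pulse alone is not expected to recover the part.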

Also, don't take failures by other manufacturers to be a sign that all integrated designs are flawed. Many WDTs are implemented using the same clock source as the CPU, meaning that if the CPU clock halts, so does the watchdog. This is not the case in the AVR.

The only thing that could hang the WDT on the AVR would be a bad power condition, where the internal RC (or the controlling logic) is not able to meet the required thresholds. (It is fairly hard to hang an RC oscillator.) But if it does happen, the rest of the AVR core isn't going to be able to run either. So not even an external WDT is going to help you.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.


Quote:

adding an external watchdog certainly couldn't hurt.

Well, yes it does. Why not add fully-redundant components for ALL aspects of your design? Can't hurt, right?

That's (high integration for simple microcontroller apps) the reason we (my organization) keep designing in AVRs to our suitable apps. Not expensive; high integration; fast for the clock speed; reliable; operate according to the datasheets.

Add the cost and board space for a $2 supervisor to a $1.50 Mega48 app? (where is that eye roll smilie?)

If there is indeed a problem with the AVR WDT I want to know about it. But to this point I still claim it is a solution in search of a problem.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


: roll : (without the spaces) gives you :roll:

John Samperi

Ampertronics Pty. Ltd.

www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly


theusch wrote:
Quote:

adding an external watchdog certainly couldn't hurt.

Well, yes it does. Why not add fully-redundant components for ALL aspects of your design? Can't hurt, right?

If you need ultra high reliability, absolutely... look at any NASA specification for spaceflight, nothing is left to a single system, architecture, or design.

The redundancy does add cost and complexity, but it doesn't "hurt" and by that I mean it does not make the design any less reliable.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.


I certainly agree with Glitch that the external supervisor circuit can have many of the vulnerabilities that an internal one has.

I do still hold some suspicion that the internal one carries a few more though. Most notably, the alternate functions and firmware enable/disable. [Yes, I know about the WDTON fuse bit. Yes, I know about the strict timing and sequential requirements of disabling WDT in firmware. But, because no WDT protection is a valid state of the part, the vulnerability does exist. The mitigating factors against unintentionally entering that state are just that - mitigating factors, not guarantees.]

I agree, there is no solid proof of the internal WDT being flimsy. Just like there is no solid proof that it is inherently robust. We have to all use our best judgment in the context of our particular situations.


Quote:

The redundancy does add cost and complexity, but it doesn't "hurt" and by that I mean it does not make the design any less reliable.

Obviously, I was extrapolating. Where do you stop? Have triple and vote on each step like a Minuteman missile controller?

Glitch mentions ultra-high reliability. I haven't done those apps; I think few of us do; and I don't think icarus1 is either. If I >>was<< doing that and this watchdog issue came up, then I'd know exactly what the "problem" is with the AVR WDT.

I suspect most of us want reliable AVR apps but don't have the applications/resources/knowledge to do the medical-class. My outfit couldn't afford a lot of failures/returns/warranty claims. I can't think of a single AVR "failure" over the years. I talked to our board houses once and asked them about DOA/infant mortality on AVRs, and they couldn't remember any either. I think that bringing "mission control" into this is reaching.

glitch, re doesn't hurt and doesn't make the design less reliable: I thought there was a rule of thumb that the more components the less reliable?

But are we really talking about reliability as the failure of the part to operate? I thought we were talking about a suspected flaw in a subsystem. Isn't that different than a "failure"? Or is the "failure" that the entire app doesn't operate as intended? It seems we are now trying to find out how many angels can fit through the eye of a needle.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Quote:

I agree, there is no solid proof of the internal WDT being flimsy. Just like there is no solid proof that it is inherently robust.

Do you not think that any vulnerability might have raised its ugly head on this very message board some time in the last 8-9 years? AVRs have known vulnerabilities (like EEPROM location 0 sometimes being corrupted when power is slow to die/come up and BOD is not enabled), but there have been enough reports of that here, sporadically, to give plenty of evidence of it. I cannot remember (well, not in the time I've been reading) any prior suggestion of a vulnerability in the WDT. I guess it's something that might have crept in only in recently released devices that haven't been widely enough tested (maybe as a result of process shrinkage or other engineering changes in Atmel's design)? It all seems very unlikely, and there have to be literally tens of millions of shipped units out there with WDT enabled and working reliably. Someone said Microchip have shipped > 1 billion PICs; while the AVR is not so prolific, it must be into 100's of millions, and a lot of those must run with WDT enabled.


clawson wrote:
Do you not think that any vulnerability might have raised its ugly head on this very message board some time in the last 8-9 years?
I do think the install base is big enough that if a vulnerability exists, it has shown itself many times over. But it would be a tree falling in the woods. (Absence of evidence is not always evidence of absence.)

For sure, if the WDT flat out didn't work, it would be detected. I think it's got to be one of the most difficult features to test (for someone outside of Atmel's R&D lab that doesn't have access to the AVR IP).

When you blast your design with power surge, dips, sags, interrupts, conducted and coupled EFTs, radiated and conducted RF, air and contact ESD, etc, and your system wisely does a reset, do you know why? Could be BOD, could be the program counter looped back to main(), could be the WDT. You don't care, you're just happy the EMC engineer is going to sign the compliance report :) and you can grab lunch on the expense report before you head back to the office for your pats on the back for a job well done.

Like you say, new die designs flow pretty freely these days. What works today might not work tomorrow.

I'm still not saying I think it doesn't work. I'm still just saying I personally don't trust it as much as an external supervisor circuit. I'm still saying that I believe it to be adequate for many applications.


The first AVR design I did was with the Mega8 (using an ICE50). It was probably in the 2001 time frame.

It was a total redesign of a device that was using a Zilog Z8. That design had an external supervisor circuit.

I thought I was clever with how I used the leftover EE (couldn't use addr 00 at all on that part). I logged output run time (in 10s of minutes), I counted POR resets and WDT resets (using the built-in WDT). I mostly did this because one of the reasons for the redesign of that product was so many no-fault-found warranty returns. I wanted to give the product support engineer some tools to try to determine how long a device was installed before it "failed." I thought it would also be useful for field testing.

I never saw any counts on the WDT reset (yes, I did test it by inserting endless loops in various modules). That doesn't mean it wouldn't have worked if need be. I would like to think that with over 1M of those out there (it sold at 200k/yr for 7 years), there would surely be some counts in the EE for WDT reset.

We ended up using the accumulated run time to activate a maintenance reminder LED. I should have gotten a cut of the consumable sales that came from that.


Quote:

do you know why? Could be BOD, could be the program counter looped back to main(), could be the WDT.

Yep--in general I trap reset cause and dump it in EEPROM. For "big" controllers that have an error log, the "startup" record is part of the "fault history" and can be pulled up on the display. On smaller apps (little timers e.g.) it can be pulled out of the EEPROM on a "problem" unit via ISP.

I guess I've said my piece and am out unless icarus1 gives a way to recreate the problem.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


theusch wrote:
I trap reset cause and dump it in EEPROM.

As have I. See previous post.

Quote:
I guess I've said my piece and am out

Before you go, give a generalization on anything you have learned from trapping reset cause. Any counts on WDT reset? From field returns? From lab test units?


I'd call WDT traps "rare". But also, we/customers don't poke at the info on "normal" working units.

I described the way I race shutdown in another recent thread. When raw supply power loss is detected, critical tasks are done such as writing cycle counts to EEPROM, and then I sit in a loop tickling the watchdog waiting to die. If power is restored, I find it easier in most apps to just do a forced reset with the watchdog rather than do all the housekeeping for a "soft restart". Also, in some apps when a major reconfiguration is done (module added or removed) I'll use the watchdog spin to force a reset. So I will get those during "normal" operation.

There was one problem app where the customer didn't do a good job (put politely) about noise and spike handling. There were massive spikes on the Gnd when motor starters and the like were engaged. (I don't think even Leon's PICs would have ridden through unscathed.) Really weird symptoms would happen inside the AVR. Some of them were missed interrupts. In particular, my ADC ISR starts the next conversion.

In apps like this, I do the WDR in the mainline in the area that handles my timer tick flag that is set in a timer ISR. So I then know the mainline is running and interrupts are running and the timekeeping timer is running. In addition, there are other critical things that may have to be "running" before I'll do the WDR. For example, I've got a motor drive module that gets commands to move via USART. If moving, it needs a new (or repeated) command during a fixed time, or I'll stop all outputs and watchdog-spin. Getting back to the ADC app, I had a conversion count set in the ADC ISR that had to be incrementing before I do the WDR. So I trapped a number there.
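The gating described above can be sketched roughly as follows. This is a host-testable model, not Lee's actual code: the `health_t` structure and `health_ok_to_kick()` are hypothetical names, and it assumes the ADC ISR completes at least one conversion per timer tick. On a real AVR the mainline would call `wdt_reset()` from avr-libc's `<avr/wdt.h>` only when the function returns true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical health record shared between the ISRs and the mainline. */
typedef struct {
    volatile bool     tick_flag;       /* set by the timekeeping timer ISR      */
    volatile uint16_t adc_conversions; /* incremented by the ADC ISR            */
    uint16_t          last_adc_seen;   /* mainline's copy from the previous tick */
} health_t;

/*
 * Called from the mainline loop. Returns true only when a timer tick is
 * pending AND the ADC conversion count has moved since the previous tick.
 * That shows the mainline, the timer interrupt and the ADC interrupt are
 * all still running. Assumption: at least one conversion finishes per tick.
 * Note: on an 8-bit AVR the 16-bit read below should be done with
 * interrupts briefly disabled so it cannot be torn by the ADC ISR.
 */
bool health_ok_to_kick(health_t *h)
{
    if (!h->tick_flag) {
        return false;               /* no tick since the last check */
    }
    h->tick_flag = false;           /* consume the tick */

    uint16_t now = h->adc_conversions;
    bool adc_alive = (now != h->last_adc_seen);
    h->last_adc_seen = now;
    return adc_alive;
}
```

If the ADC ISR stops firing, the function keeps returning false, the WDR is never executed, and the watchdog times out on its own. Further conditions (for example a command timeout on the USART) could be added as extra terms in the same check.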

I have a modular app with modules that control outputs of different types. These are all RS485. One of the situations in RS485 is that if one unit gets stuck in TE (Transmit Enable), everything comes to a grinding halt. Every module does as I described above in the drive module example: WDT set short at 65ms; WDR only when the timer is firing and a poll has been received. Otherwise the WDT kicks in.

So in this app it looks like your VCR after a power outage flashing 88:88--all the modules are doing their startup indicator LED test every 65ms+startup time when the master is removed. You might say that WD resets are an integral part of this multi-AVR app.

That's why I want to know if there is a hole, and how to recreate the symptoms.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Class88 wrote:
I would like to think that with over 1M of those out there (it sold at 200k/yr for 7 years), there would surely be some counts in the EE for WDT reset.

I would like to think NOT, for any of my apps. Frankly if you see a WDT reset, that means your code is misbehaving and has a problem. A WDT reset should only happen in the case of an unhandled exception. (or when used to perform a software reset) The lack of WDT resets indicates that the code is solid, and the problem is elsewhere. (may still be a code problem, but not one where it runs off into never-never land [which could also be caused by a power glitch causing the CPU to execute some code incorrectly, and lock up])

Writing code is like having sex.... make one little mistake, and you're supporting it for life.


glitch wrote:
Frankly if you see a WDT reset, that means your code is misbehaving and has a problem.
I disagree.
Quote:
glitch causing the CPU to execute some code incorrectly, and lock up
For me, that is the only expected reason for the WDT to generate the reset. The internal registers of the CPU can get corrupted, the program counter can go to never-never land. That's when my watchdog barks out a reset.

For some applications, I put an endless loop at the top of code space. For some applications, I fill all unused code with an endless loop. That way if the program counter goes awry, WDT reset will happen rather than a rollover.

I have nothing against using the WDT for planned resets. I don't do it though.


I imagine that some freaks will like the WDT "window" feature of the XMega.

For those unfamiliar, you configure a count "window" in which to clear the WDT. If you clear too late, you get a reset, just like you are all used to. However, if you clear too soon, you also get a reset. That addresses the concern of renegade code continuously clearing the WDT.
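To make the window idea concrete, here is a small host-side model of the behaviour. It is only a sketch: the names `window_wdt_t`, `wwdt_tick()` and `wwdt_clear()` are made up for illustration, it does not touch the XMega registers, and it assumes the timeout is a "closed" period followed by an "open" period, counted in arbitrary ticks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a windowed watchdog. */
typedef struct {
    uint32_t closed_ticks; /* a clear earlier than this causes a reset      */
    uint32_t open_ticks;   /* length of the open window after closed period */
    uint32_t elapsed;      /* ticks since the last accepted clear           */
    bool     reset_fired;  /* latched once the watchdog has "barked"        */
} window_wdt_t;

/* Advance the watchdog clock by one tick; too late means reset. */
void wwdt_tick(window_wdt_t *w)
{
    if (w->reset_fired) {
        return;
    }
    w->elapsed++;
    if (w->elapsed >= w->closed_ticks + w->open_ticks) {
        w->reset_fired = true;      /* window expired without a clear */
    }
}

/* The application's attempt to clear the watchdog. */
void wwdt_clear(window_wdt_t *w)
{
    if (w->reset_fired) {
        return;
    }
    if (w->elapsed < w->closed_ticks) {
        w->reset_fired = true;      /* cleared too soon: window still closed */
    } else {
        w->elapsed = 0;             /* accepted: restart the closed period */
    }
}
```

A runaway loop that hammers the clear instruction trips the "too soon" branch on its first early clear, which is exactly the case a plain watchdog cannot catch.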


theusch wrote:
we/customers don't poke at the info on "normal" working units.
I imagine that you scrutinize that data following a barrage of lab tests. Have you seen evidence of the WDT causing a reset not related to the falling power scenario where the WDT reset was expected?


Class88 wrote:
glitch wrote:
Frankly if you see a WDT reset, that means your code is misbehaving and has a problem.
I disagree.

I added the allowance for CPU corruption later, as you saw. Code is still misbehaving, but not necessarily due to a coding error. Either way you have a problem, be it in software or hardware.

As for logging, I also log reset causes, including software resets and unknown resets. The startup routine logs counts based on which flag bits are set, with priority masking (for example, EXTRF and BORF are ignored when PORF=1); the flags are then cleared. If no bits are set, an unknown reset condition is logged. Software resets are logged before the WDT time-out loop. Ideally the software reset count and the WDT reset count should be the same (and so far they have been for any units that I have examined).
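A rough sketch of that startup logging, written so it can be compiled and tested on a PC. The bit positions follow the usual megaAVR MCUSR layout (PORF, EXTRF, BORF, WDRF in bits 0 to 3), but that is an assumption to check against the datasheet of the part in use. `reset_log_t` and `log_reset_cause()` are illustrative names, not glitch's code. On the target, the caller would read MCUSR once at startup, write 0 back to clear the flags, pass the saved value in, and then store the counters in EEPROM.

```c
#include <stdint.h>

/* Assumed flag positions, as in the MCUSR register of typical megaAVR parts. */
#define RC_PORF  (1u << 0)  /* power-on reset */
#define RC_EXTRF (1u << 1)  /* external reset */
#define RC_BORF  (1u << 2)  /* brown-out reset */
#define RC_WDRF  (1u << 3)  /* watchdog reset  */

/* Hypothetical counter block that would live in EEPROM on the target. */
typedef struct {
    uint16_t power_on;
    uint16_t external;
    uint16_t brown_out;
    uint16_t watchdog;
    uint16_t unknown;
} reset_log_t;

/*
 * Count the reset cause(s) in 'flags' (a snapshot of the reset flag
 * register). Priority masking: after a power-on reset the external and
 * brown-out flags are ignored. Returns the flags that were counted,
 * or 0 when no flag was set and the reset was logged as unknown.
 */
uint8_t log_reset_cause(uint8_t flags, reset_log_t *log)
{
    flags &= (uint8_t)(RC_PORF | RC_EXTRF | RC_BORF | RC_WDRF);

    if (flags & RC_PORF) {
        flags &= (uint8_t)~(RC_EXTRF | RC_BORF);
    }
    if (flags == 0) {
        log->unknown++;
        return 0;
    }
    if (flags & RC_PORF)  { log->power_on++; }
    if (flags & RC_EXTRF) { log->external++; }
    if (flags & RC_BORF)  { log->brown_out++; }
    if (flags & RC_WDRF)  { log->watchdog++; }
    return flags;
}
```

A separate software-reset counter, incremented just before entering the watchdog spin loop, would sit alongside these so the two totals can be compared as described.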

Writing code is like having sex.... make one little mistake, and you're supporting it for life.


glitch wrote:
Code is still misbehaving, but not necessarily due to a coding error. Either way you have a problem, be it in software or hardware.
Agreed. The system is misbehaving - not something to normally hope for.

I would, however, find more comfort than concern in knowing that the WDT got me through some rough times. Like 2 or 3 sites (out of 1.5M) having up to 10 WDT resets in 10 years. Or, maybe 100 sites having 1-2 WDT resets in that same amount of time. As it is, only blind faith tells me I have some protection.

Sure, it would be good if the CPU never gets scrambled and the WDT never gets a chance to show what it's made of. But, you're left wondering if your watchdog is just a paper tiger.


1) No news is good news. ;)
2) A "good seller" for our stuff is 100/month. A different market area than, say, Cliff's set-top boxes and similar.
3) Many of these industrial timers and similar are short-run semi-custom--a variation on a proven design purpose-built to fit a particular need. (that is pretty much our old-guy niche: provide a tailored device that has more application-specific features than a generic timer/monitor/controller from say Red Lion at a value price. e.g., we can provide units display and setpoint limits that are tailored. maybe form factor. often we can combine a series of these off-the-shelf devices into a single package. it isn't unusual to replace say three OTS devices at $100 each with an integrated AVR solution at maybe $100 plus sensors.)

But there are enough units out there in many app areas that if they were either hanging up (WDT failure or other) or resetting (via WDT or other) we'd hear something about it.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Class88 wrote:
glitch wrote:
Code is still misbehaving, but not necessarily due to a coding error. Either way you have a problem, be it in software or hardware.
Agreed. The system is misbehaving - not something to normally hope for.

I would, however, find more comfort than concern in knowing that the WDT got me through some rough times. Like 2 or 3 sites (out of 1.5M) having up to 10 WDT resets in 10 years. Or, maybe 100 sites having 1-2 WDT resets in that same amount of time. As it is, only blind faith tells me I have some protection.

Sure, it would be good if the CPU never gets scrambled and the WDT never gets a chance to show what it's made of. But, you're left wondering if your watchdog is just a paper tiger.

Well, if it makes you feel better, I have disabled the BOD and introduced glitches into the system to cause the core to misstep, resulting in a WDT timeout. (I have not been able to glitch the system with the BOD enabled.) In a lab setting, over the years, I have introduced all manner of failures, including latch-up conditions where even an external reset could not get the unit running. I cannot say that I have ever had a scenario where I would say the WDT (or BOD) failed to reset the CPU, except those cases where nothing could reset it short of a power cycle.

My earlier statistics were for field units running production code.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.


glitch wrote:
I have introduced all manner of failures, including latch-up conditions where even an external reset could not get the unit running.
I think that would bother me to the point of not using the micro. If I cannot rely on a reset pulse to clear its head, I wouldn't trust it.

If it latched with outputs on and ignored a valid reset from its supervisor circuit, that part would hit the bricks quick. No looking back.


The latch-up conditions were on old AVRs of the 90S era (though I have not repeated the experiment); they were very difficult to produce and would never happen under normal circumstances. This condition was not introduced from a running state, but rather during power-on. As far as I could tell, the pins were all tri-stated. (Note that this condition caused other logic on the board to latch up as well, so it was not unique to the AVR.)

Writing code is like having sex.... make one little mistake, and you're supporting it for life.
