Very Long Debug Run?


Greetings, Freaks

 

I am having problems with my accelerometer. It does some strange things after operating for a long time (sometimes more than 30 days!). At this point, I have no idea what state it is in when that happens, as some aspects continue in more or less normal operation. What I want to do is put it on the debugger and "break all" once it gets into this mode, to determine what states the port pins are in and what the pertinent variables indicate. There has never been any indication of stack overflow or buffer overruns, but I DO need to check those once an event occurs.

 

The problem I need to figure out is how to keep the debugger running continuously for that time.

 

First issue is the dreaded Micro$oft update. I think that turning off WiFi should prevent that. Other ideas or Comments?

 

Second is sleeping. Using Win10, most recent update. Not a Windows person so not sure how to prevent sleep?

 

The host is a laptop so it should effectively be in a UPS. Will connect the Dragon directly to the laptop to avoid power drops on the externally powered USB hub. We get lots of power drops, here.

 

Anything else that I need to watch out for and possibly change, temporarily?

 

And a debugger question: Is there some way to do a (RAM) memory dump early in the operation then compare that with a memory dump after the fault occurs?

 

Thanks

Jim

 

Oregon Research Electronics

 

Until Black Lives Matter, we do not have "All Lives Matter"!

 

 


You can attach the debugger to a running target - so you don't actually need to keep the debugger running.

 

ka7ehk wrote:
First issue is the dreaded Micro$oft update.

What Windows version?

 

Win10-Pro lets you control when updates happen; lesser versions not so much

 

ka7ehk wrote:
I think that turning off WiFi should prevent that.

Yes, it will (so long as Ethernet is also disconnected).

 

But make sure it hasn't got anything "pending" - do a restart after you've turned the WiFi off.

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
Last Edited: Tue. Jan 26, 2021 - 06:59 PM

Not sure what you mean by "attach your debugger to the target" in this context. Does not the debugger have to be physically attached to the target at all times?

 

Appreciate the comment about pending update and restart. No ethernet.

 

Thanks

Jim

 


 

 

Last Edited: Tue. Jan 26, 2021 - 07:04 PM

You know I'm a strong proponent of debuggers, but I find this kind of situation is one where UART output can be more useful - as it's much easier to get long-term logs.

 

Another thing is to set up an "event buffer" in RAM, where you write an "event" at strategic points in the code - then you can read that and see what the system has been doing.

 

I use a circular buffer, and record an event ID and maybe a timestamp and a byte or 2 of "status".
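A minimal sketch of such an event buffer, in case it helps. The names (evlog_put, EVLOG_SIZE) and the field sizes are made up for illustration, not taken from any real project; the idea is just a fixed array where the newest entry overwrites the oldest.

```c
#include <stdint.h>

/* Hypothetical in-RAM event log; names and sizes are illustrative only. */
#define EVLOG_SIZE 32u               /* power of two, so the index wraps with a mask */

typedef struct {
    uint8_t  id;                     /* which point in the code was reached */
    uint8_t  status;                 /* one byte of context, e.g. a port or flag value */
    uint16_t stamp;                  /* low bits of a tick counter */
} evlog_entry_t;

static volatile evlog_entry_t evlog[EVLOG_SIZE];
static volatile uint8_t evlog_head;  /* next slot to be overwritten */

/* Record one event; once the buffer is full the oldest entry is overwritten. */
static void evlog_put(uint8_t id, uint8_t status, uint16_t stamp)
{
    uint8_t i = evlog_head;
    evlog[i].id = id;
    evlog[i].status = status;
    evlog[i].stamp = stamp;
    evlog_head = (uint8_t)((i + 1u) & (EVLOG_SIZE - 1u));
}
```

After a "break all", evlog and evlog_head can be inspected in the debugger's memory or watch window; the entry just before evlog_head is the most recent one. If events are recorded from both ISRs and the main loop, the call would need to be protected against interruption, which this sketch does not do.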


ka7ehk wrote:
Does not the debugger have to be physically attached to the target at all times?

No, I don't think it has to be physically attached.

 

But, also, when it's physically attached it is not necessarily in an active debug state - starting the debug session is what I was thinking about as "attaching" the debugger in this case.


Problem is that this is "production" code with fairly tight timing constraints. I like debug prints, but that would seriously alter the program flow. If I were to add debug prints, I could no longer be certain that the system being observed is consistent with the system with the problems.

 

The observed problem is that power consumption goes up by 8X or so, but it continues to read the sensor, update the RTC, and write, apparently correctly, to the SD memory card. At this point, I can't even tell if it is something happening in the SD card or in my hardware!

 

Jim

 


ka7ehk wrote:
I like debug prints, but that would seriously alter the program flow.
Advantage OCD; a logic analyzer should have near minimal effect.

ka7ehk wrote:
The observed problem is that power consumption goes up by 8X or so, ...
Ideally, you can patch in small-valued resistors to measure where the current is flowing.

ka7ehk wrote:
... but it continues to read the sensor, ...
Is the post-failure sensor data complete, precise, and correct?

ka7ehk wrote:
... the SD card ...
(there's an MCU in an SD memory card) (pardon the obvious)

A logic analyzer can log SD's SPI; can quickly get a "big picture".

 


Troubleshooting real-time software issues using a logic analyzer - Embedded.com

 

The Art of Electronics 3rd Edition | by Horowitz and Hill

Download a sample chapter

10.7.1 Keeping CMOS low power 754

[Figure 10.93. A "current spy" ...]

10.8 Logic pathology 755

 

"Dare to be naïve." - Buckminster Fuller


ka7ehk wrote:
If I were to add debug prints, I could no longer be certain that the system being observed is consistent with the system with the problems.

Ah yes - that can be a problem

 

Maybe time to look at doing in-RAM logging, then?

 

Yes, that's the one.

 

I was going to get a screenshot, but it seems that my Microchip Studio doesn't want to acknowledge the presence of my debugger.

 

:(

 

 


ka7ehk wrote:
The observed problem is that power consumption goes up by 8X or so,

Do you have a logger that can log the current consumption?

 

I've done this before to see where such things happen.

 

If you're lucky, you might even be able to tell what the system is doing from looking at the current waveform - a kind of side-channel debugging

 

https://en.wikipedia.org/wiki/Side-channel_attack

 

https://searchsecurity.techtarge...(SCA,engineer%20the%20device's%20cryptography%20system.

 

 


ka7ehk wrote:
I am having problems with my accelerometer.
An accelerometer has significant digital signal processing.

If the processing is done in an ASIC, it should be reasonably reliable, since ASIC tools support some formal analysis.

If it is done in a DSP, then there is firmware; formal analysis is still possible, though the decision may have been not to do it.

FIR filters are unconditionally stable; IIR filters are conditionally stable.

Does the accelerometer have product change notices?

ka7ehk wrote:
The host is a laptop so it should effectively be in a UPS.
Thought so myself until the charger browned-out (outlet strip's very leaky MOV, lightning is common here)

 




Second is sleeping. Using Win10, most recent update

 

John Samperi

Ampertronics Pty. Ltd.

https://www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly


Does the system know if the SD card is installed, (are you watching for any feedback from the card)?

 

The point being, can you run one unit on the bench with the SD card, as per usual, and one unit on the bench without the SD card inserted?

 

I think you have mentioned before that your code pretty much fills the memory available.

Are you re-using the same variables in different parts of the program?

 

One of the hardest bugs I ever remember tracking down was on a GPS and "map" display project.

It would run for hours, or days, or occasionally a week or two, and then crash.

 

After years of debugging with an LED, and often an LCD, and an O'scope, I finally broke down and purchased a logic analyzer to help find the problem.

 

It turns out if one (of several) ISRs fired during one particular line of code, (of many), the ISR trashed a variable and subsequently crashed the program.
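For anyone who has not been bitten by this yet, here is a minimal sketch of the usual guard for a multi-byte variable shared with an ISR. It is not the code from that project. irq_disable() and irq_restore() are hypothetical stand-ins so the sketch builds on a PC; on an AVR one would typically use cli()/sei() or ATOMIC_BLOCK from avr-libc's <util/atomic.h> instead.

```c
#include <stdint.h>

/* Hypothetical hooks, stubbed out so the sketch builds on a PC.
   On a real AVR these would mask and restore the global interrupt flag. */
static void irq_disable(void) { }
static void irq_restore(void) { }

static volatile uint32_t tick_count;   /* written by a (hypothetical) timer ISR */

/* Body of the timer ISR: a 32-bit update takes several instructions
   on an 8-bit MCU, so it is not atomic as seen from the main loop. */
static void timer_isr(void)
{
    tick_count++;
}

/* Copy the multi-byte value with interrupts masked, so the ISR cannot
   change it between the individual byte reads. */
static uint32_t ticks_read(void)
{
    irq_disable();
    uint32_t copy = tick_count;
    irq_restore();
    return copy;
}
```

The main loop then works only on the value returned by ticks_read() and never touches tick_count directly. The same pattern applies to read-modify-write sequences on any variable an ISR also writes.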

 

Obviously you need to make a list of possible problems, (such as stack overflow, etc.), and then devise your testing strategy to help rule in or rule out each potential problem one at a time.

 

Projects beget more projects.

 

There are chips that can do the high side current sense resistor monitoring for you, and you can read that signal into a Nano and then beep a piezo when the current increases 8X.

 

You will at least then have an easy way of knowing when the project has entered its faulty state.
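A rough sketch of the detection logic for such a monitor, assuming the sense amplifier output is sampled by an ADC. The names (imon_t, imon_update) and the choice of a 1/16 averaging filter are assumptions for illustration. The averaging is there so that short SD-write peaks do not trip the alarm, while a sustained rise does.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical monitor state for a current-sense ADC reading. */
typedef struct {
    uint32_t avg_x16;   /* running average of the samples, scaled by 16 */
    uint16_t limit;     /* alarm when the average exceeds this many ADC counts */
} imon_t;

/* Feed one ADC sample; returns true while the averaged current is above the limit. */
static bool imon_update(imon_t *m, uint16_t sample)
{
    /* First-order low-pass in fixed point: avg += (sample - avg) / 16 */
    m->avg_x16 = m->avg_x16 - (m->avg_x16 >> 4) + sample;
    return (m->avg_x16 >> 4) > m->limit;
}
```

The return value would drive the piezo or an LED. The limit and the filter constant would have to be chosen from the real sample rate and the real normal and faulty current levels.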

 

Also, make sure to read the SD card data and make sure it didn't miss a data entry or two when the faulty state was entered, (if possible).

This is to help make sure that the SD card wasn't stuck doing some internal bad sector remapping, etc. as a trigger for the problem.

 

Any tropical lightning storms triggering the failure?

Does it fail on a very clean power supply on the bench?

 

I assume you have good reason to believe that the devices are failing while stuck up in a tree, acting alone, and that the failures are not triggered by the user messing with the device to try to download the data, (static electricity, not powering the device off before messing with it, etc.).

 

Gotta ask, how many versions of the software are out in the wild?

Is there any correlation between the failures and the SW version?

 

Does the chip family allow you to assemble one testbed on a similar uC with more memory?

And if so then assign more stack spaces, (whatever your compiler uses), as a poor man's test for a stack overflow?

i.e. It overflows on the production model, but not on the 'almost' identical model with more memory and stack spaces.
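A different check, one that works on the production part itself, is "stack painting": fill the unused RAM below the stack with a known pattern at start-up, then later (for example after a break-all) see how much of the pattern is left. The sketch below is only an illustration. The function names and the pattern value are made up, and the region is passed in as a plain buffer; on a real part it would be the RAM between the end of static data and the initial stack pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define PAINT 0xC5u   /* arbitrary fill pattern */

/* Fill a RAM region with the pattern; done once at start-up. */
static void region_paint(volatile uint8_t *base, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        base[i] = PAINT;
    }
}

/* Count the bytes that still hold the pattern, scanning up from the low end.
   Assumes the stack grows downward toward `base`, as it does on AVR. */
static size_t region_untouched(const volatile uint8_t *base, size_t len)
{
    size_t n = 0;
    while (n < len && base[n] == PAINT) {
        n++;
    }
    return n;
}
```

If region_untouched() ever returns 0, or a very small number, the stack has reached (or nearly reached) the data below it. The painted region can also simply be viewed in the debugger's memory window without calling the second function at all.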

 

Does every unit eventually fail, when whatever the trigger is hits it, or are there some units that fail, and will fail again, and others that seem to never fail?

 

It is very frustrating trying to find an intermittent bug that you can't reproduce on command.

But it is very rewarding when you definitively find the bug, (and then you kick yourself for making the error...).

That said, do you have a Plan B?

 

Is it time to design and build Version N+1, with a larger-memory micro (one of the DB micros with a 2-level priority interrupt controller, or an Xmega with a 3-level priority interrupt controller) and with spare memory to build in some of the debugging code, so that its addition or removal doesn't alter the operation of the device?

 

Recall that the DB series runs at 24 MHz, while the Xmega will give you 32 MHz, so you have more processing power to work with, both for the native application as well as for the addition of the debug / self monitoring data integrity code.

 

I feel your pain.

 

JC


DocJC wrote:
It turns out if one (of several) ISRs fired during one particular line of code, (of many), the ISR trashed a variable and subsequently crashed the program.
Would that specific defect have been found by a code reviewer?

DocJC wrote:
... (and then you kick yourself for making the error...).
Don't ... defects are as common as weeds in your spring lawn ;)

It has been said that engineering is the study of failures.

 


SEI CERT C Coding Standard - SEI CERT C Coding Standard - Confluence

Concurrency [rules and recommendations]

There are linters that can detect a relatively small number of concurrency defects, though that is usually the job of static analyzers.

A few sound static analyzers exist, though they are not easy to operate and can take a lot of effort.

 

Fatal Defect by Ivars Peterson: 9780679740278 | PenguinRandomHouse.com: Books

 


The observed problem is that power consumption goes up by 8X or so, but it continues to read the sensor, update the RTC, and write, apparently correctly, to the SD memory card

Can you try activating/turning on different items (port pins, chip functions, leds, etc), or combos of things, to try to match the observed current? The closer you can get a match, the better...if you can figure out what is activated or running, it could significantly narrow down the source and help pinpoint where to look.   Have you scoped around---for example maybe an "off" backlight is pulsing on for 2us---too short to see, but enough to raise current.   Floating inputs that float around can suddenly start drawing "large" current.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Tue. Jan 26, 2021 - 10:11 PM

There is one point in the program that it drops into if there is a write failure. I have one debug pin that I COULD wiggle at that point to trigger the LA. I may need to do that, though I am a bit loth to do so because it would mean changing the program. Was hoping to avoid that just because any change means it is no longer the same system.

 

There is a little, but scant, evidence that it MAY be related to starting or stopping sampling or removing and re-inserting the SD card (latter can only properly be done when not sampling and users say that they are very careful about this).

 

Well, I THINK that what I am going to do is set up the debugger and, at the very least, set a breakpoint at the write-fault trap point, then just hammer away at starting sampling, stopping, removing the card, reading it, replacing it, and repeating. I can monitor current manually with a DVM, via the charging current to a super-cap battery substitute. I cannot monitor it directly because the current is quite high during SD card writes and the required dynamic range (and speed) exceeds my instrumentation capability. Then, if I find that the current has increased, I can break-all and attempt to see what is not the way it should be.

 

That's the plan for now. Thanks for the great ideas and suggestions.

 

Hmm, DocJC, thanks for the insights. All of this is happening in two versions of software (difference SHOULD only be cosmetic). It is likely happening with other users who have not contacted me. Not likely environmental triggers (one site is in northern Japanese forests, one site is on broadleaf trees near Ottawa). All are battery powered. In fact the way users find this is that battery life is around 6 days instead of expected 45 days. Otherwise, things seem normal. I have the 5 units from Japan in my lab for testing right now. 

 

As an aside to this, there is somewhat bizarre behavior involved. Units can fail (e.g. fast battery discharge). If you replace batteries quickly, this persists. BUT, if the faulty unit is unpowered an hour or more (maybe it does not have to be this long? Hard to tell because so few instances to check on), it returns to normal. In a few cases, in the lab (only) SD cards have turned up not able to reformat. BUT, if I let these also sit, they can be formatted again !! This MAY point to some short term data retention in control registers, probably unrelated to the problem at hand but also not inconsistent with persistence of the problem across battery changes.

 

Jim

 


 

 

Last Edited: Tue. Jan 26, 2021 - 10:41 PM

Maybe you can also remove certain components (LEDs, resistors, buzzer, etc.) one by one to see if that helps narrow down the culprit. Of course, that assumes you can get the problem to repeat often enough to notice a difference. Something like a floating pin could be hard to find, since it could be semi-random (not sure how much current increase you are chasing down: 10 mA? 10 µA?).


and the required dynamic range (and speed) exceeds my instrumentation capability.

Ah, but you have extensive experience in analog circuit design!

 

You could easily build a current monitor with either windows, or just (simpler) levels, and turn on an LED when the current is in that given range.

 

Your buffered input signal can drive several op-amp level detectors. Remember that it truly doesn't matter if you saturate the higher-gain stages.

 

With a pulse extender you can see even brief episodes of different current ranges.

 

When working on projects it is very reasonable to build your own test equipment.

 

Likewise, the starting and stopping sampling can be done with a nano and a relay across the user switch.

Who cares if you use a relay and it fires once every 4 seconds for X number of days.

It is a cheap test instrument, albeit that test doesn't include the withdraw and re-insertion of the SD card, but it lets you test 1000's of sample On/Off cycles over a day or two.

 

JC 

 


Normal is about 1.6mA average battery current at Vbatt = 1.6V into a 3.3V micropower boost converter. When it's bad, it climbs to around 12mA average. I dare not measure between the battery and the converter or between the converter and the rest of the system because of high peak SD card write currents. The battery cutoff point is about 1.1V, at which the converter input current is around 2.4mA (when operating correctly).
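As a rough cross-check, these two averages line up with the battery life reported earlier in the thread (about 6 days instead of the expected 45). The sketch below simply scales the expected life by the ratio of the average currents. It assumes the usable battery capacity is the same at both currents and ignores the rise in converter input current as the battery voltage falls, so it is only a first-order estimate; the function name is made up.

```c
/* Scale a known battery life by the ratio of average currents.
   Assumes the same usable capacity at both currents, which real cells
   only approximate. */
static double scaled_life_days(double normal_days, double normal_ma, double fault_ma)
{
    return normal_days * normal_ma / fault_ma;
}
```

With the figures above, scaled_life_days(45.0, 1.6, 12.0) comes out at about 6 days, and 12 / 1.6 = 7.5 matches the "8X or so" increase mentioned earlier.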

 

Jim

 


 

 

Last Edited: Wed. Jan 27, 2021 - 12:39 AM

So for now you could replace the battery with a stable voltage source, within which you can measure the current.

Not ideal, as you are changing the system, and if the fault is related to the power supply you won't detect that.

But, the power supply is only one (small) (but important) part of the system, and besides, if you don't get a failure with a high capacity, low impedance 1.5 V source, then that actually helps point to the power supply as part of the problem.

 

Anyway, fire up a heavy duty, low impedance 1.5 V source, with a small current sense resistor.

Recall that the current sense resistor could be on the INPUT to the power supply itself, so it doesn't impact the output impedance, or step response of the supply.

 

Then put your op-amp across it...  and you have your notification signal when the board goes into fault mode.

 

JC


What is the batt voltage during the problem?   Of course, as Vbatt falls, its current has to increase to deliver the load power.  Also, at low Vbatt, transistors may not be full on/off, wasting energy (thereby requiring more current).   So, wondering if the issue could be in the boost portion (such as getting into a mode where the transistors are not switching well).   You could verify it is NOT this section by taking a failing unit and running it from a 3.3V bench supply, then checking whether it still fails.


I have seen a graph of one failure where the battery voltage was still about 1.3V. The others are less distinct as they were discovered after the battery had fully discharged, early, which could be then seen in the voltage field of the CSV data records.

 

Jim

 


FYI, further to John's image of the sleep settings, remember that laptops have two sets of sleep settings. One for when on mains power and one for when on battery. To get to these settings click on the search magnifying glass in the bottom left corner next to the Windows logo and type in power. In the results under the Settings heading you will see Power & Sleep Settings.

 

The laptop power and sleep settings look like the attachment image.

 

 


Wayne

East London
South Africa

 

  • No, I am not an Electronics Engineer, just a 54 year old hobbyist/enthusiast
  • Yes, I am using Proteus to learn more about circuit design and electronics
  • No, I do not own a licensed copy of Proteus, I am evaluating it legitimately
  • No, I do not believe in software or intellectual property piracy or theft

Got it!

 

Thanks

Jim

 


My small input.

I would use a logic analyzer (today it would be a USB one) connected to all the needed I/Os, probably through 10k resistors to avoid interference, so that the two share a common ground but not power (though at the same voltage).

This way you can keep it close to the correct operating conditions.

 

About the SD card: is there some extra garbage collection when it is nearly full, or when the same file gets fragmented?

(I remember that on a C64 there was a way to build strings so that it stopped for up to half an hour doing garbage collection; the solution was to run the command that asks for free memory once in a while to force a cleanup.)

  


The SD cards are far from full. 10Meg files on an 8Gig memory card. The triggers for this behavior seem very inconsistent, which leads me to believe that it is an odds-thing, two infrequent events, like changing the SD card combined with some running event in the software, maybe. 

 

I do not think that an LA is much help. It MIGHT be pulling up on the Vio pin of an unpowered FT232R, or maybe setting the UART Tx pin high into the same unpowered FT232R, or maybe one of several LEDs on at a very low duty cycle that is hard to see. I think that I can only see what the source of the increased current is by breaking the code execution after it is drawing too much current and laboriously going through everything that is possible for it to do. And, of course, at the same time, check for buffer and stack overflows, etc. (which an LA will not do).

 

Those are my thoughts -

Jim

 


I just fear that sleep and debugger is a bad combo.

 

But perhaps write a lot of extra data to the SD card, like:

  1. stackpointer
  2. reset counter (bod etc.)
  3. sleepcounter
  4. time in SD write routine
  5. batt voltage 
  6. temperature ! thunder 
  7. time since last sleep
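A sketch of how such a record might look if it were appended to the existing CSV data, assuming one extra line per logging interval. The struct, its field names and widths, and diag_to_csv() are all hypothetical; where each value comes from is hardware-specific and not shown.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical diagnostic record; field names and widths are illustrative. */
typedef struct {
    uint16_t stack_ptr;        /* stack pointer sampled at log time */
    uint8_t  reset_count;      /* resets seen so far (brown-out etc.) */
    uint16_t sleep_count;      /* number of sleep entries */
    uint16_t sd_write_ms;      /* time spent in the last SD write */
    uint16_t batt_mv;          /* battery voltage in millivolts */
    int8_t   temp_c;           /* temperature in degrees C */
    uint16_t since_sleep_ms;   /* time since the last wake-up */
} diag_t;

/* Format one record as a CSV line; returns the length reported by snprintf. */
static int diag_to_csv(const diag_t *d, char *buf, size_t len)
{
    return snprintf(buf, len, "%u,%u,%u,%u,%u,%d,%u\n",
                    (unsigned)d->stack_ptr, (unsigned)d->reset_count,
                    (unsigned)d->sleep_count, (unsigned)d->sd_write_ms,
                    (unsigned)d->batt_mv, (int)d->temp_c,
                    (unsigned)d->since_sleep_ms);
}
```

Writing the extra line does change the timing and the SD traffic, so it runs into the same "no longer the same system" concern raised earlier in the thread.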

 

I have just had so many errors that I couldn't reproduce in the lab.

Once we had to put a camera up. (There were a lot of status LEDs and relays, and then a real meter; it was an automatic concrete mixer, and sometimes the mix was too wet.) But it never happened when I was there.

It turned out that it only happened when I added zero water, and this way we could prove it :)   (they had denied that there was a problem with rain water in the sand!!!)



 

ka7ehk wrote:
Not a Windows person so not sure how to prevent sleep?

One of these: 

https://null-byte.wonderhowto.com/how-to/create-usb-mouse-jiggler-keep-target-computer-from-falling-asleep-prank-friends-too-0236798/

 

EDIT

 

Or, more seriously, open the 'Search' on the Taskbar and search for "Power":

 

 

Last Edited: Thu. Feb 4, 2021 - 12:13 AM

OK, that is what I finally figured out. It has now been running non-stop for about 8 days sleepless.

 

Thanks everyone!

 

Jim

 
