Is there a way to programmatically detect interrupt starvation?


Hello,
I am working with an AVR32 AT32UC3A0512, where three GPIOs are connected to a hall sensor (it's a sensor with 6 states depending on the direction of a motor). When one of the 3 GPIOs changes state, an IRQ is generated and I save the value of the 3 GPIOs as the "previous hall state". The motor keeps turning, and the next time an IRQ is generated I can compare the new value of the 3 GPIOs with the saved one to determine in which direction the motor is turning, among other things.
 

This "works" 99% of the time, however sometimes, I can see that the IRQ is triggered and the value of 3 GPIOs compared with the last state is such that it seems that one IRQ was missed (for instance current state is "3" and previous state is "1", and there should have been an IRQ for state "2"). For this reason, I suspect that something is keeping the interrupt handler busy and that one IRQ is missed.

 

I tried to debug this by setting a static boolean to true when entering the IRQ and setting to false when exiting the IRQ and checking if sometimes the IRQ is entered while the boolean is already true, but I found out later this is not a valid test, because as long as the IRQ handler (declared with compiler directive "interrupt" so it has special assembly at the end) is not exited, the interrupts are not re-enabled.

 

To give some context the IRQ is triggered with a frequency of 1kHz, the MCU is clocked at 48MHz, FreeRTOS is running on it, and the IRQ handler contains about a thousand assembly instructions with -O0, (the problem is much easier to reproduce when compiling with -O0, but it can be reproduced after several hours with -O3). There is no obvious other interrupt which would starve the system.

 

Is there some way just with software changes to detect such an IRQ starvation? If not I will debug this with some GPIOs and an oscilloscope.

Thanks.

Last Edited: Tue. Jul 16, 2019 - 07:46 PM

Counter debug variables might help.  Define three variables, one for each interrupt.  Initialize them all to zero.  Inside each interrupt have them increment just their counter.  Run the equipment for a period of time then inspect the three counters.  If they are off by more than +/- one count from each other then an interrupt is being missed.

 

Note that the increment itself does add processing time to the interrupt handlers which could change the behavior of the system.  Be sure to consider this side effect.
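A minimal sketch of the counter idea, assuming one interrupt handler per hall line; the names `hall_edge_count`, `hall_count_edge` and `hall_counts_diverged` are made up for illustration. The "within one count" rule only holds while the motor turns in one direction, since a rotor rocking back and forth across a single edge toggles one line repeatedly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: one counter per hall line, incremented in its ISR. */
static volatile uint32_t hall_edge_count[3];

/* Call from the interrupt handler of hall line `line` (0..2). */
void hall_count_edge(unsigned line)
{
    hall_edge_count[line]++;
}

/* Returns true when the largest and smallest counter differ by more
 * than one, which suggests that an edge was not serviced.
 * Counter wrap-around is ignored in this sketch. */
bool hall_counts_diverged(void)
{
    uint32_t lo = hall_edge_count[0];
    uint32_t hi = hall_edge_count[0];
    for (unsigned i = 1; i < 3; i++) {
        uint32_t c = hall_edge_count[i];
        if (c < lo) lo = c;
        if (c > hi) hi = c;
    }
    return (hi - lo) > 1u;
}
```

Inspecting the counters while the motor is stopped avoids reading them halfway through an update.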


etienne.cor wrote:
Is there some way just with software changes to detect such an IRQ starvation?
Yes

AVR32 have performance counters; the cycle counter register (COUNT) is first to mind.

When such a starvation is detected, an exception can be raised to recover (I am assuming you don't want an assertion, as that would stop the application).

etienne.cor wrote:
If not I will debug this with some GPIOs and an oscilloscope.
A logic analyzer will typically have deeper memory than an oscilloscope.

Some logic analyzers roll the capture memory into the PC such that the memory depth is very large.

 

P.S.

etienne.cor wrote:
... IRQ ... FreeRTOS ...
Can assure lack of starvation via RMA.

Will need to measure the worst case timing for interrupt handlers and task processing so can completely, precisely, and correctly perform RMA.

The fix may simply be to correctly set the interrupt priorities and the task priorities.
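As a rough illustration of the RMA step, the sketch below applies the classic Liu and Layland utilization bound, U <= n(2^(1/n) - 1), to a set of measured worst-case times. The struct and function names are invented for this example, and the bound is only a sufficient test: a task set that fails it may still be schedulable and would need an exact response-time analysis.

```c
#include <stdbool.h>
#include <stddef.h>

/* One periodic activity: worst-case execution time and period,
 * both in the same unit (for example CPU cycles). */
struct rma_task {
    double wcet;
    double period;
};

/* n-th root of 2 by bisection on [1, 2]; avoids a libm dependency. */
static double nth_root_of_two(size_t n)
{
    double lo = 1.0, hi = 2.0;
    for (int iter = 0; iter < 60; iter++) {
        double mid = 0.5 * (lo + hi);
        double p = 1.0;
        for (size_t i = 0; i < n; i++) p *= mid;
        if (p < 2.0) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Total utilization: sum of wcet / period over all tasks. */
double rma_utilization(const struct rma_task *t, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++) u += t[i].wcet / t[i].period;
    return u;
}

/* Sufficient (not necessary) test: U <= n * (2^(1/n) - 1). */
bool rma_passes_bound(const struct rma_task *t, size_t n)
{
    if (n == 0) return true;
    double bound = (double)n * (nth_root_of_two(n) - 1.0);
    return rma_utilization(t, n) <= bound;
}
```

With a 48 MHz clock and a 1 kHz interrupt the period is 48000 cycles, so the measured worst-case handler time in cycles can be entered directly as `wcet`.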

 


32-bit Atmel Architecture Manual (AVR32)

(page 57)

7. Performance counters

RMA - Rate Monotonic Analysis

 

"Dare to be naïve." - Buckminster Fuller


@ScottMN: Thanks for your answer. Is there no way to directly detect an IRQ getting triggered while another IRQ is already being handled? At first I was trying to re-enable IRQs as soon as an IRQ handler is entered, but this doesn't work because the special assembly at the end of the interrupt function (the "rete" instruction) has not run yet at that point. I cannot call "rete" manually with inline assembly in the middle of the IRQ handler, because "rete" ends the function.

@gchapman: we cross-posted at the same time

Last Edited: Tue. Jul 16, 2019 - 08:05 PM

@gchapman I am not sure I understood what you mean. I checked the datasheet and the performance counter cannot count the number of IRQs, or even detect an event such as an IRQ being triggered while another IRQ is being handled. According to table 7-2 on page 59 it can rather count events such as cache-misses, etc. What were you suggesting that I do with the performance counter?


etienne.cor wrote:
What were you suggesting that I do with the performance counter?
performance counters (plural) one of which is COUNT in the AVR32 architecture.

page 16

[COUNT is register 66]

 

page 19

COUNT - Cycle Counter Register

 

page 57

7. Performance counters

7.1 Overview

....

Two configurable event counters are provided in addition to a clock cycle counter.

...

An idle task overrun event occurs when the idle task consumes less than a certain fraction of a periodic super-frame.

A task overrun event occurs when a task doesn't meet its deadline within a frame.

An interrupt overrun event occurs when an interrupt's processing doesn't complete within its deadline or sub-frame.

Overrun events are failures that result in exceptions (faults) that indicate a defect (defect -> fault -> failure)

Two of the many possible defects :

  • incorrect priorities (interrupt, task)
  • incorrect task sequencing (is there more than one way to not overrun the frame? (margin),  state machine adjustment?)

 

"Dare to be naïve." - Buckminster Fuller


How "clean" are your sensor signals?  A false IRQ trigger on one of your signals could also cause the symptoms you describe.

Letting the smoke out since 1978

 

 

 

 


I re-read the datasheet, and according to table 7-3 the monitorable events of the COUNT register and of the two event counters do not include any kind of interrupt overrun, task overrun, etc. but rather only CPU performance information like cache-misses.
The architecture manual (http://ww1.microchip.com/downloads/en/DeviceDoc/doc32000.pdf) does not contain the word "overrun", so unfortunately I don't understand what you are suggesting that I do with those registers, and I am not aware of any event triggered when a GPIO interrupt is starved.

Thanks for the help anyway.


When one of the 3 GPIOs change state, an IRQ is generated

Sounds like possibly a very bad scenario...are these signals debounced by some electro-mechanical hysteresis?  As you creep close to the switching edge do you generate a boatload of edges?  If the motion waves/vibrates back & forth as it closes in on the edge, does that generate a bunch of pulses? Unless fully debounced, try doing it without IRQ's---you will save yourself a lot of headaches.

 

What sensor is this that has built-in states?  That might be helping your cause--provide a link:

 

When one of the 3 GPIOs change state, an IRQ is generated

Who says all 3 will change at the same moment (they likely won't)? So reading one might happen while the others are still updating, causing mass confusion (unless it is Gray code, where only one can ever change per event).

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!


COUNT is a monotonic counter of CPU clocks.

Deadlines are in units of CPU clocks.

Zero COUNT and set COMPARE to the current deadline; an interrupt occurs when COUNT == COMPARE, which therefore signals an overrun event.
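A software-only variant of the same idea is to sample the cycle counter at handler entry and exit and compare the difference with the deadline. The sketch below is an assumption-laden illustration: the struct and function names are invented, and the counter value is passed in as `now` so the logic can be tried on a host, whereas on the target it would be the value read from COUNT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one interrupt handler. */
struct deadline_monitor {
    uint32_t deadline_cycles;  /* budget for one handler invocation */
    uint32_t start;            /* counter value at handler entry */
    uint32_t worst_case;       /* longest duration seen so far */
    uint32_t overruns;         /* invocations exceeding the budget */
};

/* Call first thing in the handler with the current cycle count. */
void deadline_enter(struct deadline_monitor *m, uint32_t now)
{
    m->start = now;
}

/* Call last thing in the handler. Returns true if this invocation
 * overran its budget. Unsigned subtraction stays correct across one
 * 32-bit counter wrap, but not if the counter is reset in between. */
bool deadline_exit(struct deadline_monitor *m, uint32_t now)
{
    uint32_t elapsed = now - m->start;
    if (elapsed > m->worst_case) m->worst_case = elapsed;
    if (elapsed > m->deadline_cycles) {
        m->overruns++;
        return true;
    }
    return false;
}
```

One thing to check before relying on this: if the RTOS tick is also generated from COUNT and COMPARE, the counter may be reset behind your back, in which case a separate timer would be a safer time base.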

 

You're welcome!

Take care!

 

"Dare to be naïve." - Buckminster Fuller


Have you tried enabling the GPIO glitch filter ?
In the interrupt handler you could check the IFR to see if there is another pending interrupt.
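A sketch of how the IFR check could be used, assuming the three hall lines sit on one port and that `hall_mask` (a made-up name) selects their bits. Because hall lines change one at a time, finding more than one flag set in a snapshot of the IFR hints that an earlier edge was not serviced on its own. It cannot reveal two edges on the same line, since each line has a single flag bit.

```c
#include <stdint.h>

/* Number of hall lines whose interrupt flag is set in a snapshot of
 * the port's interrupt flag register. `hall_mask` selects the three
 * hall pins; both values are supplied by the caller. */
unsigned hall_pending_lines(uint32_t ifr_snapshot, uint32_t hall_mask)
{
    uint32_t flags = ifr_snapshot & hall_mask;
    unsigned count = 0;
    while (flags != 0u) {
        flags &= flags - 1u;  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

In the handler, a result above 1 at entry, or above 0 just before returning, would be worth logging in a debug counter.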


@ avrcandies : a hall sensor only has one bit changing between two neighboring states (for instance the 6 possible states can be in the order 451326), so it is guaranteed that only one GPIO changes at a given time.

I am quite sure debouncing is not relevant here: there are already low-pass filters in place, and there is no reason why the error would happen more often with -O0 than with -O3 if it was due to missing debouncing. I don't have the reference of the hall sensor (it's integrated with the motor), but such digital hall sensors have an integrated hysteresis, so there should not be a need for debouncing. Also, the fact that I systematically get two successive interrupts with 2 states which are not neighbors indicates that this is not a debouncing problem: for instance, if the states are in the order 451326, I see that I am getting state 4 followed by state 1, which is why I suspect that I missed the interrupt with state 5.
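The neighbor check described here can be written as a small lookup, using the example order 451326 from this post. This is only a sketch: the enum and function names are invented, and which direction counts as "forward" is an arbitrary choice.

```c
#include <stdint.h>

enum hall_step {
    HALL_SAME,      /* no change */
    HALL_FORWARD,   /* one step along 4,5,1,3,2,6 */
    HALL_BACKWARD,  /* one step in the other direction */
    HALL_SKIPPED,   /* valid states, but not neighbors */
    HALL_INVALID    /* 0 or 7: not a legal hall pattern */
};

/* Position of each 3-bit state in the example order 451326;
 * -1 marks the two patterns that never occur. */
static const int8_t hall_pos[8] = { -1, 2, 4, 3, 0, 1, 5, -1 };

/* Classify the transition from the saved state to the new one. */
enum hall_step hall_classify(uint8_t prev, uint8_t cur)
{
    if (prev > 7u || cur > 7u) return HALL_INVALID;
    int p = hall_pos[prev];
    int c = hall_pos[cur];
    if (p < 0 || c < 0) return HALL_INVALID;
    int diff = (c - p + 6) % 6;
    if (diff == 0) return HALL_SAME;
    if (diff == 1) return HALL_FORWARD;
    if (diff == 5) return HALL_BACKWARD;
    return HALL_SKIPPED;
}
```

The case from this post, state 4 followed by state 1, comes out as `HALL_SKIPPED`, so counting those results gives a direct measure of how often a step goes unserviced.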


I kept debugging this, and I can see that in some of the cases, the issue is caused by some of the hall-sensor GPIOs toggling unexpectedly (blue line) as if the brushless DC motor was turning in the wrong direction (the hall sensor has a Schmitt trigger, so this can't be a glitch). I can't really explain this behavior.


Hi,

 

Which hall sensor are you using? How fast is it (tcycle)? Try adding a small 100nF cap from Vdd to GND and a small 10nF cap from the output to GND. Check with the scope after this and let us know.

 

Is it a CMOS output?

What is the Bop sensitivity?

 

Regards,

Moe

 

Edit: typos

Last Edited: Thu. Jul 25, 2019 - 11:07 AM

the issue is caused by some of the hall-sensor GPIO toggling unexpectedly (blue line) 

I already mentioned you should apply some debouncing...ALWAYS assume there will be some in a mechanical system.

 

 (the hall sensor has a Schmitt trigger, so this can't be a glitch)

I have some gold to sell you, for only $2 a pound...give me your credit card & I'll send as much as you want.

 

Who says the motor doesn't vibrate some as it is turning?  Do you think a sawblade would bounce back & forth a bit as it cuts a log (dynamic load change as teeth cut)?

 

small cap in  the output 10nF (to GND). check after this with the osci and let us know.

The "glitch" looks somewhat too long for it to be a "cap fix"  or electrical glitch...if the scope is really 200us/div, looks like a rather slow 50us wide output...if it were 100ns long, I'd be all for a cap.

 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Thu. Jul 25, 2019 - 09:22 PM

See if the ‘glitch’ lines up with your pwm signals.


Well, I guess catching the "glitch" is a good thing.

It identifies the problem as an unexpected input signal.

 

Your current software didn't expect this, so you get your "failure" 1 % of the time.

 

Perhaps a change in approach is needed.

USARTS these days often read the data signal multiple times per bit interval, and then the majority vote, (high or low), wins.

 

Perhaps you need another input signal data processing layer to your project.

You "oversample" the 3 GPIO signals and essentially filter out the glitches in software, prior to passing Hall Sensor state info on to the higher level processing.

This is similar to "debouncing" a signal, but isn't the same.

Debouncing (in my mind) refers to a valid state transition, such as a mechanical switch changing states, that has some transition noise.

The transition noise is ill-defined in terms of the number and duration of the "bounces".

In the Hall Sensor case, perhaps it is a debounce problem with an insufficient Schmitt Trigger hardware filter, or perhaps the signal is valid and the shaft's rotation is not as clean as expected ("vibrating", as mentioned above).
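To make the oversampling idea concrete, here is a minimal sketch of a 3-sample majority vote applied to all three hall bits at once. The function name is invented, and how far apart the three samples are taken is left open; it would have to be longer than the glitch being rejected.

```c
#include <stdint.h>

/* Bitwise majority of three samples of the hall pins: each output
 * bit is 1 when at least two of the three samples had it set. */
uint8_t hall_majority3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}
```

A single sample disturbed by a short pulse is outvoted by the other two, so only the filtered state is passed on to the higher-level processing.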

 

JC


etienne.cor wrote:
, for instance if the states are in the order 451326, I see that I am getting state 4 followed by state 1,

Surely recognising this means you are already close to a solution, where you simply disregard any glitches in the inputs that would take you to an invalid state.

 


At the relatively low rates mentioned (1KHz), it seems like some sort of debouncing would be easily doable.   Since the runt pulses don't seem numerous, you might be able to keep your IRQ's & debounce in them as part of a state machine.  However, in case of fast numerous unexpected transients, IRQ's can give an earful of headache. 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!


etienne.cor wrote:

This "works" 99% of the time, however sometimes, I can see that the IRQ is triggered and the value of 3 GPIOs compared with the last state is such that it seems that one IRQ was missed (for instance current state is "3" and previous state is "1", and there should have been an IRQ for state "2"). For this reason, I suspect that something is keeping the interrupt handler busy and that one IRQ is missed.

 

That type of skipped state sanity-check is a good way to catch interrupt starvation.

 

etienne.cor wrote:

 

To give some context the IRQ is triggered with a frequency of 1kHz, the MCU is clocked at 48MHz, FreeRTOS is running on it, and the IRQ handler contains about a thousand assembly instructions with -O0, (the problem is much easier to reproduce when compiling with -O0, but it can be reproduced after several hours with -O3). There is no obvious other interrupt which would starve the system.

I would be wary of "FreeRTOS is running on it"  - check also for any code that disables interrupts globally.

If time is critical, many MCUs allow interrupt priorities, so that even other interrupts are paused for the really important one. Not sure about UC3.

 

Worst case, if you must run FreeRTOS, and cannot force a high priority, you may need to add a small MCU to manage the Hall sensors alone.

 


etienne.cor wrote:

I kept debugging this, and I can see that in some of the cases, the issue is caused by some of the hall-sensor GPIOs toggling unexpectedly (blue line) as if the brushless DC motor was turning in the wrong direction (the hall sensor has a Schmitt trigger, so this can't be a glitch). I can't really explain this behavior.

 

A 'Schmitt trigger' does not guarantee against glitches; it merely sets a threshold below which smaller disturbances are ignored. If a large enough disturbance arrives, you will still get glitches.

How fast is your interrupt handler on this ?

Last Edited: Fri. Jul 26, 2019 - 10:23 PM

There should be a low pass filter ahead of the Schmitt trigger, both for the time domain and for ESD immunity (the UC3 data sheets don't give an ESD spec) (plus, see the last sentence in DocJC's post 17)

A digital gate's input is a differential amplifier; typical is a filter before, in, and/or after an amplifier.

UC3C GPIO input have an optional Schmitt trigger.

 

METHOD 4. HARDWARE DEBOUNCE FOR SPST SWITCHES. - LogiSwitch

...

Hysteresis assures a single transition with no oscillation when the switch is activated or released.  

...

AT32UC3A Datasheet (UC3A0, UC3A1)
AT32UC3C datasheet

Earth Ground

by Dr. Howard Johnson

...

In my opinion the most important point to make with regard to grounding is that the input to every digital logic gate is a DIFFERENTIAL amplifier. That's right--a differential amplifier. This differential amplifier compares the digital input signal to some local reference (often generated inside the chip), and decides which is bigger (more positive).

...

 

"Dare to be naïve." - Buckminster Fuller


I investigated this more, and this "glitch" is unfortunately not relevant, because it is impossible to reproduce when compiled with -O1 even when testing for hours (I was previously testing with -O0 on purpose in order to increase the reproduction rate). The UC3 port of FreeRTOS disables all interrupts when portENTER_CRITICAL is called, unlike the more recent ports of FreeRTOS for ARM, which disable interrupts only up to some level, so this could be the real issue.

Do you know if there were already some attempts to modify the AVR32 port of FreeRTOS to make portENTER_CRITICAL / portEXIT_CRITICAL work like the ARM port, disabling for instance only interrupts of priority INT0?
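For what such a change would have to compute, here is a hedged sketch of the status-register mask for "disable interrupts up to level N". The bit positions (GM at bit 16, I0M to I3M at bits 17 to 20) are stated from memory of the AVR32 architecture manual and must be checked against the manual and the device headers. The function name is invented, and this is not taken from any existing FreeRTOS port.

```c
#include <stdint.h>

/* Assumed AVR32 status register layout; verify before use. */
#define SR_GM_BIT   16u  /* global interrupt mask */
#define SR_I0M_BIT  17u  /* INT0 mask; I1M..I3M follow at 18..20 */

/* Mask bits that block interrupt levels 0..max_level while leaving
 * higher levels enabled. Levels above 3 are clamped to 3. */
uint32_t sr_mask_up_to(unsigned max_level)
{
    if (max_level > 3u) max_level = 3u;
    uint32_t bits = (1u << (max_level + 1u)) - 1u;
    return bits << SR_I0M_BIT;
}
```

The sketch only builds the mask; setting and restoring it in the status register on the target is left out. As with the ARM ports, a handler running at a level that the critical section does not mask must not call any FreeRTOS API function.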


Have you read what was written before?


Yes I read all of it, thank you for your answers. Regarding your suggestion to add a capacitor, there are already capacitors in place as part of a low-pass filter, and the glitch is a clear flank lasting 22us with "perfect" edges, so I don't think that can be pure charge/discharge noise. Anyway, I was able to confirm that I can't reproduce this glitch with -O1 even when letting the software run for hours, so I am assuming that the real issue I am debugging (sometimes an interrupt seems to be missed) is not due to this. It could maybe be due to interrupts being disabled with bad timing, causing a missed interrupt, which is why I am investigating if modifying FreeRTOS's "portENTER_CRITICAL" would be a viable option.


etienne.cor wrote:
... unlike the more recent ports of FreeRTOS for ARM, which disable interrupts only up to some level, ...
appears to be likewise for PIC32.

Couldn't locate UC3 in Amazon FreeRTOS.

https://github.com/aws/amazon-freertos/blob/master/vendors/microchip/manifest.cmake (PIC32MZ EF)

http://www.freertos.org/History.txt

 

"Dare to be naïve." - Buckminster Fuller


etienne.cor wrote:
this "glitch" is unfortunately not relevant, because it is impossible to reproduce when compiled with -O1 even when testing for hours 

Eh ?

How does the software optimisation level affect glitching in the O/P of the hall sensor we saw on your scope in #13 ? Surely all the software does is read those GPIO pins.

 


@gchapman: the AVR32 port is available in the freertos.org version (see https://www.freertos.org/portAVR32.html) as well as in the ASF releases from Atmel.

@N.Winterbottom: the software is also controlling the motor, so the optimisation level can also change the timing of the motor's coil switching, which could explain why this does not happen with -O1 (not 100% sure).


Adding caps to your data lines can be a bad idea...it may eliminate extremely fast pulses, but create other, slower ones if it slows the rise time too much.  Schmitt trigger inputs can help.   Your IRQ handler must be set up with a state machine that can recognize and handle these unexpected edges.  At a minimum it can be set to ignore something out of sequence.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!


Ah ... Software controlled motor drive.

 

Going back then to your scope screen-shot in #13;

What is the CH1 20V/div yellow trace; is that the software control of a motor drive coil you mentioned ?

What type of motor are you driving ? Is it simple PWM type coil drive or something more timing critical ?

 


22us suggests the glitch comes from pwm. Maybe a particular case where the rotor position is not where it should be.


etienne.cor wrote:

... the glitch is a clear flank lasting 22us with "perfect" edges, I don't think that can be pure charge/discharge noise, but anyway I was able to confirm that I can't reproduce this glitch with -O1 even when letting the software run hours, so I am assuming the real issue I am debugging which is that sometimes an interrupt seems to be missed is not due to this, and could maybe be due to interrupts being disabled with a bad timing causing a missed interrupt, which is why I am investigating if modifying FreeRTOS's "portENTER_CRITICAL" would be  a viable option.

 

What is the highest expected edge rate on the sensor, in practical operation ?

 

You could make this more edge-noise tolerant by disabling the sensor interrupts for a fixed short time after they fire.

eg Sensor Edge -> Interrupt  -> DisableThatInt -> StartBlankingTimer (eg 100us) -> TimerInt re-enables Sensor INT, and disables timer.

The timer could do a simple second sanity check of an expected-sensor-order 

Thus a simple ping-pong digital filter or debounce.
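The ping-pong sequence above can be modelled as a tiny state machine. The sketch below is a host-side simulation under stated assumptions: the struct and function names are invented, timestamps are in microseconds, and the `sensor_irq_enabled` flag stands in for whatever enable bit the real hardware uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the blanking scheme. */
struct hall_blanker {
    bool     sensor_irq_enabled;  /* stands in for the GPIO IRQ enable */
    uint32_t blank_until;         /* time at which blanking ends */
    uint8_t  latched_state;       /* hall pins sampled after blanking */
    uint32_t ignored_edges;       /* edges that arrived while blanked */
};

/* Sensor edge: accept it, then disable the sensor interrupt and start
 * the blanking timer. Returns false for an edge inside the blanking
 * window (on hardware that interrupt would simply not fire). */
bool blanker_on_edge(struct hall_blanker *b, uint32_t now_us,
                     uint32_t blank_us)
{
    if (!b->sensor_irq_enabled) {
        b->ignored_edges++;
        return false;
    }
    b->sensor_irq_enabled = false;
    b->blank_until = now_us + blank_us;
    return true;
}

/* Timer expiry: re-sample the pins, then re-enable the sensor
 * interrupt. Returns true once the blanking window has ended. */
bool blanker_on_timer(struct hall_blanker *b, uint32_t now_us,
                      uint8_t pins)
{
    if (b->sensor_irq_enabled) return false;
    if ((int32_t)(now_us - b->blank_until) < 0) return false;
    b->latched_state = pins & 0x7u;
    b->sensor_irq_enabled = true;
    return true;
}
```

Re-sampling the pins when the timer expires means a real transition that happened during the blanking window is still seen, only slightly later, and the latched state is where the expected-sensor-order check could be applied.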

 

Quadrature counters are in theory, edge noise self correcting, but that theory assumes they can follow all the edges, and do not miss any, so they can do the ++/--/++/--/++ following.

Asking an Operating system to not blank interrupts during that short edge time, seems optimistic, so you may be better to ensure rapid re-interrupts cannot occur.

 


@ N.Winterbottom  The yellow line only marks the start/end of an interrupt-handler, it stays high on the picture I uploaded in this case because a breakpoint triggered. It's a simple PWM-type coil drive (BLDC motor).

@ Who-me: "Asking an Operating system to not blank interrupts during that short edge time, seems optimistic, so you may be better to ensure rapid re-interrupts cannot occur" agreed. In FreeRTOS the way to implement this would be to modify portENTER_CRITICAL the way I described, because it is not only the operating system that calls portENTER_CRITICAL and disables interrupts, which is why I was interested to know if someone had already modified FreeRTOS that way for AVR32.

Generally, I already have logic in place to handle this error-case gracefully, that is not an issue. I am rather trying to find the root-cause to fix it completely.