Missing third byte with buffered interrupt driven UART, but NOT when debugging


Morning all,

 

After a couple of days trying to debug an issue, I'm resorting to asking for help!

 

So, the summary is:

 

  • We have an application which transmits data over UART, and is fully interrupt driven. This data is streamed to a PC and displayed in a graphing system or terminal.
  • When executing the program at full speed, consistently - every time, the third data byte in each frame that is loaded into the transmit FIFO goes missing
  • Data buffering is handled by a lower layer FIFO system, such that head/tail management is fully abstracted
  • We are using all 3 UART interrupts: RXc, TXc and UDRe
    • RXc just puts the new RX'd data into the RX FIFO
    • UDRe gets the next byte for transmission and puts it into UDR
    • TXc simply disables the relevant interrupts and disables the transmitter
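For reference, the three handlers as described might look something like this - a minimal sketch only, using ATmega1284P USART0 register and vector names; `fifo_push`/`fifo_pop` are hypothetical stand-ins for the abstracted FIFO layer, not the poster's actual API:

```c
#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdint.h>

/* Hypothetical stand-ins for the abstracted FIFO layer. */
extern void fifo_push(uint8_t b);
extern int  fifo_pop(uint8_t *b);   /* 0 on success, -1 when empty */

ISR(USART0_RX_vect)                 /* RXc: just queue the received byte */
{
    fifo_push(UDR0);
}

ISR(USART0_UDRE_vect)               /* UDRe: hand the next byte to the hardware */
{
    uint8_t b;
    if (fifo_pop(&b) == 0)
        UDR0 = b;
    else
        UCSR0B &= ~(1 << UDRIE0);   /* nothing left: stop UDRE refiring */
}

ISR(USART0_TX_vect)                 /* TXc: shift register drained, release the bus */
{
    UCSR0B &= ~((1 << TXCIE0) | (1 << TXEN0));
}
```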

 

So, the TX FIFO is loaded up with the data (usually lots of numerical data, but for purposes of debugging - an ASCII string of "hello").

72
101
108
108
111

 

But here's the problem:

 

  • If the application is just run at full speed (12 MHz, ATmega1284P) - we receive "helo", i.e. the third char goes missing every time. This happens regardless of UART speed, buffer contents, or length of data frame. Note also that polled operation transmits the data without losing the third byte, on the same hardware. The hardware has been in use for over a year without issue.
  • With a breakpoint on the TXc interrupt, and executing the interrupt by clicking 'Continue' in AS7, then "hello" is transmitted out over the UART and received just fine on a PC.
  • Note, whether run at full speed or single-stepped, I can confirm the buffer loading is correct and does contain the data we want to send.

 

For some reason, code tags aren't working today - please ask if you need to see the interrupts - it's more of a 'conceptual' question I suppose, as to how I appear to have a race condition of sorts with such a simple task.

 

All the best

 

This topic has a solution.
Last Edited: Thu. Jan 21, 2021 - 10:07 AM

Reeks of trying to process too many interrupts too fast.  The AVR won't queue two of the same interrupt. 

 

Also, is it every third, or only just the third?  If you (try to) send 1,2,3,4,5,6,7,8,9 do you get:

1,2,4,5,6,7,8,9 or

1,2,4,5,7,8?

 

And just off the top of my head, I'd suggest putting TXc where you've got UDRe - don't disable the transmitter until the whole frame is gone.  Hope that helps some.  S.

 

edited for clarity.

Last Edited: Fri. Jan 15, 2021 - 10:37 AM

Just the third byte, so I get 1,2,4,5,6,7,8,n...

 

Scroungre wrote:

 

And just off the top of my head, I'd suggest putting TXc where you've got UDRe - don't disable the transmitter until the whole frame is gone.  Hope that helps some.  S.

 

What do you mean by this, sorry? Isn't what you're suggesting the opposite of not disabling the TX until the whole frame is gone?

 

In UDRe, the TX remains enabled but we just add data to UDR.

In TXc, the TX hardware is disabled along with the TXCIE bit.


A bit more thought and some surfing around a fairly generic AVR spec sheet confirms this.  Here is what I think is happening:

 

Chronologically:

1) Transmitter enabled, waiting.

2) First byte put to transmit

3) First byte transmits, UDRE fires and loads another byte into UDR

4) TXC fires, and turns off the transmitter - but NOT until the 2nd byte is sent!

5) UDRE fires again, loads another byte into UDR, but the UART is disabled!

6) UART re-enabled, byte in UDR (transmit) lost

7) Repeat. 

 

The important bits being that UDRE will fire before TXC (UDR is empty when the stop bit is still being sent)

and TXC will not disable the transmitter until all pending data has been sent (including stop bits).

 

Anyhow, I may be completely in left field here, but that's what I think might be happening.  Have fun!  S.


Yah, in a way.  TXC should be disabled while you're using UDRE to keep loading more data to send - otherwise it'll fire, and turn off your UART.  S.


jtw_11 wrote:

Just the third byte, so I get 1,2,4,5,6,7,8,n...

 

Scroungre wrote:

 

And just off the top of my head, I'd suggest putting TXc where you've got UDRe - don't disable the transmitter until the whole frame is gone.  Hope that helps some.  S.

 

What do you mean by this, sorry? Isn't what you're suggesting the opposite of not disabling the TX until the whole frame is gone?

 

In UDRe, the TX remains enabled but we just add data to UDR.

In TXc, the TX hardware is disabled along with the TXCIE bit.

 

Not quite.  What I had in mind was:

"load the next data byte into UDR when TXC fires, not when UDRE fires.  Don't disable the UART until your entire data packet has been sent."

 

If nothing else, it might be an interesting experiment.  S.


I'm confused. Why would you use both UDRE and TXC. It would usually be "either"/"or" ? In fact in "normal" usage it's almost always going to be UDRE. The use of TXC is generally for when you are doing half-duplex RS485 and after the last bit of a transmit has finished you need to switch direction.


Hi both, thanks for the replies

 

clawson wrote:

I'm confused. Why would you use both UDRE and TXC. It would usually be "either"/"or" ? In fact in "normal" usage it's almost always going to be UDRE. The use of TXC is generally for when you are doing half-duplex RS485 and after the last bit of a transmit has finished you need to switch direction.

 

This application is MODBUS over half-duplex RS-485, which is why we're using TXC (I'm just debugging using a UART-USB cable pre-485 transceiver). However, why the concern over using UDRE and TXC together? I've also assumed the logic goes:

 

  1. Load UDR, thus send first byte
  2. UDRe fires
  3. Load UDR, thus send second byte
  4. UDRe fires, no third byte to add so disable UDR interrupt (otherwise it'll execute again on return, as we haven't loaded UDR)
  5. Then TXc fires, disabling TXCIE and TXEN
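Step 1 of that sequence might be sketched as follows - an assumption about how the kick-off could look (ATmega1284P register names), not the poster's actual code. The TXC flag is cleared up front so a stale completion from a previous frame can't fire the TXc ISR immediately:

```c
#include <avr/io.h>

/* Hypothetical kick-off: assumes the TX FIFO has already been loaded. */
void uart_start_tx(void)
{
    UCSR0A |= (1 << TXC0);    /* writing 1 clears any stale TXC flag */
    UCSR0B |= (1 << TXEN0)    /* enable the transmitter */
            | (1 << TXCIE0)   /* end-of-frame interrupt */
            | (1 << UDRIE0);  /* UDRE is already set, so the UDRE ISR
                                 fires at once and loads the first byte */
}
```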

 

If we used just TXc by itself, isn't that just a huge time waster - given the AVR UART is double-buffered? I.e. to transmit a byte on TXc would mean loading UDR, waiting for it to shift into the transmit register, then for that to shift out onto the bus?

 

Scroungre wrote:

Not quite.  What I had in mind was:

"load the next data byte into UDR when TXC fires, not when UDRE fires.  Don't disable the UART until your entire data packet has been sent."

 

If nothing else, it might be an interesting experiment.  S.

 

Agreed, that's what we're doing - we don't disable the UART until all the data has been transmitted and TXc fires. All the loading of UDR is handled by the UDRe interrupt.

 

As an experiment, I've taken the function out of the scheduler that initially loads the TX FIFO and starts the transfer - and put it in the RXc interrupt. Doing this, the data frame transmits reliably all the time. This is making me think I have some horrible race condition between a task running in the scheduler and the interrupts.

 

But just so I'm totally clear, are we saying you can see an issue with using both UDRe and TXc? My logic is:

 

  1. Start
  2. UDRe
  3. UDRe
  4. UDRe
  5. UDRe
  6. TXc
  7. Finish
Last Edited: Fri. Jan 15, 2021 - 12:49 PM

Scroungre wrote:

Chronologically:

1) Transmitter enabled, waiting.

2) First byte put to transmit

3) First byte transmits, UDRE fires and loads another byte into UDR

4) TXC fires, and turns off the transmitter - but NOT until the 2nd byte is sent!

5) UDRE fires again, loads another byte into UDR, but the UART is disabled!

6) UART re-enabled, byte in UDR (transmit) lost

7) Repeat. 

 

I've also confirmed step 4 isn't the case - if the only breakpoint I set is the line where the TX is disabled, it halts only once - at the very end of the data frame, not near the third byte. It's enabled at the start of the frame, and disabled at the end of the frame.

 

I.e. the only time TXC fires is when the whole frame has been sent, but of course still with the missing third byte.

Last Edited: Fri. Jan 15, 2021 - 12:52 PM

First thing is to find out if the 3rd char never gets loaded, or the 4th is loaded too fast!

 

I would have a relatively slow timer running and log the time when every byte is loaded into UDR.


sparrow2 wrote:

First thing is to find out if the 3rd char never gets loaded, or the 4th is loaded too fast!

 

I would have a relatively slow timer running and log the time when every byte is loaded into UDR.

 

Yep - going through a similar process now, but exporting the contents of all the structs that form the serial and FIFO modules, and comparing the changes. What I can see is that the serial and FIFO buffers get filled and emptied no problem at all. But what I have totally convinced myself of is that I'm fighting a timing dependency.

 

If I move the task called by the scheduler that goes and interprets the received data, and kicks off the relevant data transmit, into the RXc interrupt itself - it works fine. Put it back in the scheduler, missing third byte. And this is even though the contents of the FIFO buffers are the same regardless of how I call it.

 

...that and as per OP, single-stepping through everything works exactly as expected until it executes at speed.


It is probably not entirely coincidental that the UART buffer is effectively two bytes deep. UDR can hold one byte for TX and the Transmit shift register can hold one. So you write the first, it goes to UDR then almost immediately (as there's nothing there yet) moves to transmit holding. UDRE is now set. So you write the next. It will now hold in UDR while the previous one in transmit shift is being pumped down the line. But if a third attempt is made to write UDR it won't succeed. However I would have thought that what would actually be seen is that the 3rd overwrites the 2nd that has not moved out of the way yet.

 

However all this does not make total sense if you are awaiting UDRE interrupts before loading each byte to UDR as you simply should not get the interrupt to be able to load the 3rd until the 1st finishes Tx, the 2nd then moves from UDR to TX hold and only then should UDRE be set to trigger the interrupt to allow the next to be written to UDR.
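This is also why the polled version mentioned in the opening post can't lose bytes - each write waits on the UDRE flag itself. A sketch, again assuming ATmega1284P-style register names:

```c
#include <avr/io.h>
#include <stdint.h>

/* Polled transmit: the UDRE flag gates every write, so the two-deep
 * buffer (UDR + transmit shift register) can never be overwritten. */
void uart_put_polled(uint8_t b)
{
    while (!(UCSR0A & (1 << UDRE0)))
        ;                 /* spin until UDR can accept another byte */
    UDR0 = b;
}
```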

 

It'd be interesting to see the buffering and ISR parts of your code to see how this can be circumvented.

 

(must go and remind myself in a datasheet as to exactly what circumstances clear UDRE...)

 

EDIT: the datasheet implication seems to be it is simply the action of it internally moving a byte from UDR to transmit shift that then clears UDRE. (I was looking at 328P in fact).

Last Edited: Fri. Jan 15, 2021 - 02:38 PM

Do your protocols on both ends match: parity bit, stop bit(s), etc?


jtw_11 wrote:

But just so I'm totally clear, are we saying you can see an issue with using both UDRe and TXc? My logic is:

 

  1. Start
  2. UDRe
  3. UDRe
  4. UDRe
  5. UDRe
  6. TXc
  7. Finish

That should work, assuming the UDRe interrupt is not delayed by scheduling, as it must fill the TX buffer of the USART before the current character is shifted out, if not then the TXC interrupt will fire prematurely, ie. before the complete data packet is sent.   So it looks like you may have a scheduling issue with the UDRe interrupt. 

 

Jim

 

 

(Possum Lodge oath) Quando omni flunkus, moritati.

"I thought growing old would take longer"

 

Last Edited: Fri. Jan 15, 2021 - 03:32 PM

MattRW wrote:

Do your protocols on both ends match: parity bit, stop bit(s), etc?

 

Yes, I can transfer 100s of bytes no problem at all - it's just this 3rd byte not playing ball.


ki0bk wrote:

jtw_11 wrote:

But just so I'm totally clear, are we saying you can see an issue with using both UDRe and TXc? My logic is:

 

  1. Start
  2. UDRe
  3. UDRe
  4. UDRe
  5. UDRe
  6. TXc
  7. Finish

That should work, assuming the UDRe interrupt is not delayed by scheduling, as it must fill the TX buffer of the USART before the current character is shifted out, if not then the TXC interrupt will fire prematurely, ie. before the complete data packet is sent.   So it looks like you may have a scheduling issue with the UDRe interrupt. 

 

Jim

 

 

Thanks Jim - agree I've got a scheduling problem somewhere. Though, I've confirmed it's not that TXC is firing too early - a breakpoint on this ISR will fire only once, at the end of my frame - even when the 3rd byte has gone missing.

 

Will pop some code snippets up shortly - I won't delve into the FIFO buffers, as everything is pointer based, so there is some fairly terse syntax going on there, as there are buffers within structs, within structs, referenced by pointers - so triple de-referenced pointers etc. However, I've confirmed FIFO management is working just fine (as is also used by totally unrelated aspects of the application such as data filtering).

 

The obvious question is why it's written so complex? Simply because the FIFO 'module' can be re-used by multiple aspects of the application, with only a few really simple API calls, without needing to know about/worry about/care about head, tail, overflow and wrapping management etc.

 

Cheers


That is why I suggest logging the time of each write to UDR (there should not be "big" holes).

 

Or flip a pin and see it on a scope (and with the same code flip another pin in the scheduler).

 

 

edit: forgot a 'not'

Last Edited: Fri. Jan 15, 2021 - 04:18 PM

Welp, if you've confirmed that TXC is not firing when it shouldn't, then that's ruled out.  Must be something else then...

 

I would like to make a minor editorial remark here, though - the AVR spec sheets refer to a 'data frame' as "One byte plus start and stop bits" and I think you should describe your sending of multiple bytes as a 'data packet' (or something) to avoid confusion.  Note that the AVR TXC will fire at the end of what the AVR calls a 'data frame', not a 'data packet' (if it's enabled).

 

clawson's point may also stand, in that the third byte is getting overwritten by the fourth, as well.

 

If nothing else, we're thinking up things to rule out!  Have fun.  S.


Check the UART status  bits. OVERRUN means that a new byte reception was completed when there was no room in the (hardware) receive buffer system. As noted, an overrun COULD be the source of your problem.

 

Jim

 

Until Black Lives Matter, we do not have "All Lives Matter"!

 

 

This reply has been marked as the solution. 

jtw_11 wrote:
As an experiment, I've taken the function out of the scheduler that initially loads the TX FIFO, and starts the transfer - and put it in the RXc interrupt. Doing this, the data frame transmits reliably all the time.

Ah ha - can you confirm that the FIFO code is thread safe? Commonly done by disabling interrupts while modifying head/tail pointers/indexes. Now, as you've told us the FIFO code is generic, this disabling must be done globally or by the calling code a level above the FIFO.
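One common way to get that on AVR is avr-libc's ATOMIC_BLOCK around any FIFO call made from mainline (non-ISR) code. A sketch only - `fifo_t` and `fifo_push` are hypothetical stand-ins for the real module's API:

```c
#include <stdint.h>
#include <util/atomic.h>            /* avr-libc atomic sections */

typedef struct fifo fifo_t;                    /* opaque handle (hypothetical) */
extern void fifo_push(fifo_t *f, uint8_t b);   /* existing FIFO API (hypothetical) */

/* Wrapper for calls made from mainline code: interrupts are masked while
 * head/tail move, so the UART ISRs can't see a half-updated FIFO. */
static inline void fifo_push_atomic(fifo_t *f, uint8_t b)
{
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
        fifo_push(f, b);
    }
}
```

ATOMIC_RESTORESTATE puts SREG back exactly as it was, so the wrapper is safe to call whether or not interrupts were enabled on entry.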

 

jtw_11 wrote:
Will pop some code snippets up shortly - I won't delve into the FIFO buffers

May be useful for the bug-hunt though.

 


N.Winterbottom wrote:

jtw_11 wrote:

As an experiment, I've taken the function out of the scheduler that initially loads the TX FIFO, and starts the transfer - and put it in the RXc interrupt. Doing this, the data frame transmits reliably all the time.

 

Ah ha - can you confirm that the FIFO code is thread safe? Commonly done by disabling interrupts while modifying head/tail pointers/indexes. Now, as you've told us the FIFO code is generic, this disabling must be done globally or by the calling code a level above the FIFO.

 

Yes, that does suggest an issue with the FIFO, since it works within an ISR() where the code cannot be "interrupted" (unless you enable interrupts within your ISR()s - not a recommended practice on RAM-limited AVRs). Or your scheduler needs more "task stack space" for each thread allocated! Check how much stack space the task above your FIFO task is taking - perhaps it's walking down into the FIFO local storage space, i.e. exceeding its stack space.

 

Jim

 

 

(Possum Lodge oath) Quando omni flunkus, moritati.

"I thought growing old would take longer"

 


Good morning all,

 

Thanks for all the constructive feedback - couple of thoughts/responses:

 

ki0bk wrote:

Yes, that does suggest an issue with the FIFO, since it works within an ISR() where the code cannot be "interrupted" (unless you enable interrupts within your ISR()s - not a recommended practice on RAM-limited AVRs). Or your scheduler needs more "task stack space" for each thread allocated! Check how much stack space the task above your FIFO task is taking - perhaps it's walking down into the FIFO local storage space, i.e. exceeding its stack space.

 

Jim

 

This is what I thought I'd proven too - however, if I send the same command many, many times to the device and have it print a string back to me, this all works OK - until eventually I get to a point where the data stream ends up corrupted again. Interestingly, the buffers have successfully wrapped around themselves several times without error before this happens. Digging into what's actually happening, my tail pointer overtakes the head pointer. The tail increment function is never called by itself, only within the head increment function - so I'm baffled at the moment how this is happening. I will think of how/what to share; the project is ~100k SLOC, the buffer modules contain LIFO as well as FIFO etc.

 

ka7ehk wrote:

Check the UART status  bits. OVERRUN means that a new byte reception was completed when there was no room in the (hardware) receive buffer system. As noted, an overrun COULD be the source of your problem.

 

Jim

 

Thanks for this - I checked the overrun bit this morning at the point in time where this data corruption occurs, and it is not set - so no overrun by the looks of things.


The behaviour of the double-buffered receive can be "unexpected":

 

https://www.avrfreaks.net/forum/what-happens-if-you-read-udr0-multiple-times

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

awneil wrote:

The behaviour of the double-buffered receive can be "unexpected":

 

https://www.avrfreaks.net/forum/what-happens-if-you-read-udr0-multiple-times

 

Interesting read, that's for sure. I think I've come across something similar before (and also was never quite sure how the AVR seems to go back in time)... doesn't this mean that when reading UDR, the data gets put back into the RX shift register after the read? Effectively the RX shift register and UDR just swap their contents. Odd. I'd have thought reading UDR just zeroed UDR.

 

Thankfully in this case, that's not what I'm up against (I don't think!). In post #22, where I said:

 

jtw_11 wrote:
however, if I send the same command many many times to the device

 

At the point in time where the TX FIFO corruption occurs, if I check the RX buffer (the 'big' buffer, in my serial/FIFO modules, not the UART buffer), the data itself is correct. It's the head/tail management by the looks of things that's got an issue.


jtw_11 wrote:

At the point in time where the TX FIFO corruption occurs, if I check the RX buffer (the 'big' buffer, in my serial/FIFO modules, not the UART buffer), the data itself is correct. It's the head/tail management by the looks of things that's got an issue.

 

Sounds like stack/heap collision, as Jim suggested, is still a possibility. Check the variables near the top of the heap; these are obviously more vulnerable (assuming the memory layout is similar to avr-libc's memory map).


Make sure that an RX ISR can't come between reading from the buffer and handling the tail (or that the code is robust enough to handle it).

 

 

 

 


So, sorted.

 

Two problems:

 

  1. FIFO module not thread safe - this was the cause of the missing third byte.
  2. Seemingly random data corruption was not random at all. After much, much investigation: the tail wrapping was not handled correctly ONLY in the case where the first data byte of the next message was placed in the last buffer index.
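For anyone hitting the same wrap bug: keeping an explicit fill count, rather than inferring fullness from head/tail alone, makes the last-index case fall out naturally. A minimal, portable sketch (not the poster's code; the size is illustrative):

```c
#include <stdint.h>

#define BUF_SIZE 8          /* illustrative size, not the poster's */

typedef struct {
    uint8_t data[BUF_SIZE];
    uint8_t head;           /* next free slot (writes) */
    uint8_t tail;           /* next unread slot (reads) */
    uint8_t count;          /* explicit fill level: no full/empty ambiguity */
} fifo_t;

/* Returns 0 on success, -1 if full. */
int fifo_put(fifo_t *f, uint8_t b)
{
    if (f->count == BUF_SIZE)
        return -1;
    f->data[f->head] = b;
    f->head = (uint8_t)((f->head + 1) % BUF_SIZE);  /* wrap on every increment */
    f->count++;
    return 0;
}

/* Returns 0 on success, -1 if empty. */
int fifo_get(fifo_t *f, uint8_t *b)
{
    if (f->count == 0)
        return -1;
    *b = f->data[f->tail];
    f->tail = (uint8_t)((f->tail + 1) % BUF_SIZE);  /* count gates this, so the
                                                       tail can never pass the head */
    f->count--;
    return 0;
}
```

Writing the first byte of a "next message" into the last buffer index is then just another increment-and-wrap, with no special case.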

 

Thank you very much indeed for everybody's thoughts - explaining the issue, as always, really helped.