TWI module seems buggy in multi-master communications


I have a project with two ATmega48s, both configured as TWI master and slave. I noticed that after a couple of hours the TWI modules stop working, and I have to reset the AVRs to resume operation.
I stripped my application down to the bare minimum to test this issue and increased the communication rate on the bus. I can now crash the modules within seconds. Both AVRs run the same program, except that the slave and target addresses are swapped. There's a random delay before issuing a start condition so that both have an equal chance of becoming master.

After analysis, it seems to me that there's a bug in the TWI hardware module when both masters are set up to send a start at the same time (TWSTA set in TWCR).

The program can run for a couple of seconds; during this period both AVRs are master and slave by turns. Most of the time both set TWSTA at around the same time, but because one is slightly ahead of the other, it becomes master and the other becomes slave.

From repeated measurements, I determined that the bus crashes when both TWSTA bits are set within approximately 5 µs of each other. Two glitches then appear after the stop condition and seem to crash the TWI module, because the bus freezes from that point on.
I also monitored TWSR during all communications, and everything is normal there.

I attached a more detailed report of this problem with screenshots of the signals. There's also the test code I used in the rar file.

Has anybody else experienced this too? I've seen on different posts that people had problems in this mode but without good analysis of the problem. Maybe this is related.
I'd like to have your opinion before I contact Atmel.

David


You are not alone; this was my experience too.

I found that one master holds the SDA line permanently low, without a clock and without setting the interrupt bit.

I found only one working dirty hack against this strange behaviour:

Feed the SDA line to an external interrupt pin and set up a timeout that is reset by the external interrupt. When SDA has been low for a certain time, disable the I2C and re-enable it after about 10 µs.

I'm not happy with it, but nothing else works.
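
For readers who want to try this workaround, here is a minimal sketch of the timeout logic only, written so it compiles without any AVR headers. The names `sda_watchdog_t`, `sda_watchdog_edge`, `sda_watchdog_tick` and the value of `SDA_STUCK_TICKS` are assumptions for illustration, not taken from anyone's code in this thread. On the real part, the reset itself would presumably mean clearing TWEN in TWCR, waiting, and then writing the TWI configuration back.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timeout, in timer ticks, after which a low SDA counts as stuck. */
#define SDA_STUCK_TICKS 50u

typedef struct {
    uint16_t low_ticks;   /* consecutive ticks during which SDA was sampled low */
} sda_watchdog_t;

/* Call from the external interrupt on any SDA edge: bus activity restarts the timeout. */
static void sda_watchdog_edge(sda_watchdog_t *wd)
{
    wd->low_ticks = 0;
}

/* Call from a periodic timer tick with the sampled SDA level.
 * Returns true when the caller should disable and re-enable the TWI. */
static bool sda_watchdog_tick(sda_watchdog_t *wd, bool sda_is_high)
{
    if (sda_is_high) {
        wd->low_ticks = 0;
        return false;
    }
    if (++wd->low_ticks >= SDA_STUCK_TICKS) {
        wd->low_ticks = 0;   /* start a fresh timeout after requesting the reset */
        return true;
    }
    return false;
}
```

The tick period and threshold have to be chosen so that the timeout is comfortably longer than the longest legitimate low time on SDA during normal traffic.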

Peter


That's also how I worked around that problem: if there's no communication on the bus for a certain time, I reset the bus. That seems to work fine and no data is lost.

In my case, the bus is not kept low; it's released but freezes. Did you check that the ISR for a STOP condition had completed before a new start occurred on the bus? If you're doing too much in the STOP interrupt, the hardware will miss the next start condition and can crash the bus as you describe.


I've had a quick look at this code, and it raised a few more questions for me:

When a master loses arbitration and becomes a slave, what happens to the start command issued when it initially tried to become master? I assumed that it just went away, and that the code needs to be smart enough to recognize this case and issue a new start at the end of the slave cycle. I did not see any flag set to accomplish this in David's code, but the start glitch at his failing cycle makes me wonder what the hardware does.

Another hint/tool that I have tried to use is that some of the TWCR bits have meaning when read back, for instance the stop bit remains high until it is actually sent on the bus, so this can be used to avoid stepping on your own back to back master cycles.

I wrote everything in asm to control latency better, so I wonder why all of the C ISRs that I've seen put the STOP/REPEAT START case as the last thing in the case statement, since it must have the fastest response. Do the compilers really work this way? I've never looked.

The code that I have does not, as far as I've been able to tell, ever hang (or watchdog), but it does send bad data every once in a while. Because we are seeing 3 different behaviors, I suspect there are pieces of the solution hidden in each of our implementations. I think we need to push on Atmel for help...

Craig



rc_campbell wrote:
When a master loses arbitration and becomes a slave, what happens to the start command issued when it initially tried to become master?

Hi Craig,

In my small test code, the start bit is simply cleared as soon as the ISR is entered, which is when the slave address has been recognized.
But if you keep that bit set, a start is sent as soon as the bus is released normally.

rc_campbell wrote:
I wrote everything in asm to control latency better, so I wonder why all of the C ISRs that I've seen put the STOP/REPEAT START case as the last thing in the case statement, since it must have the fastest response. Do the compilers really work this way? I've never looked.

Good point. Normally I have a better optimized version with the case replaced by a lookup table so the order doesn't matter. And with this test version you can see on the signals that the stop interrupt is finished well before the start occurs, so this should not be the problem here.

rc_campbell wrote:
The code that I have does not, as far as I've been able to tell, ever hang (or watchdog), but it does send bad data every once in a while. Because we are seeing 3 different behaviors, I suspect there are pieces of the solution hidden in each of our implementations. I think we need to push on Atmel for help...

In your case, can both masters try to send a start at the same time (within 5 µs)? Don't you have any problems in that case?

David


Hi David

After looking at things and thinking a bit more, I have a different spin on your hanging cycle:

Looking at your last timing diagram, the top part is the slave, and it gets a data-ack irq, the master does a stop, then the slave gets the stop irq (which takes noticeably longer), then the slave issues the delayed start. During the slave transaction and certainly in the I2CReset routine, the start bit of the TWCR is cleared without the start having been sent. This may be the reason for the hang: the twi state machine starts the start, but then quits and does not generate the interrupt. The other master is hung because a cycle is active on the bus.

This requires a bunch of assumptions about what the TWI does when the start bit is cleared in the TWCR before the start is actually sent, and I can not find anything in the datasheet that indicates what happens in this case.

My code, like yours, does not set the start bit during the slave cycle, but when there is a pending master cycle at the end of a slave cycle, it does set the start bit rather than just going to the idle state. You could do this in the I2Creset routine with a simple if-else.

While I suspect that this may address your "glitch" and hang, I expect that it will only move you to the next level of issues. I certainly recommend that you move the stop/repeat start case to the top.

I'm also concerned that your debug code uses a couple of pins in portC, and it's dangerous to do the |= and &= operations on the port in the isr, but I guess if the TWI is never disabled and you don't do anything else on those ports outside of the ISR that it's OK.

I will go through my code one more time considering the issues that you have pointed out. I don't like reverse engineering things like this, I've seen a lot of silly folklore generated this way, but if it's the only way we have, I guess we need to do it.

ATMEL, WHERE ARE YOU????


rc_campbell wrote:

I'm also concerned that your debug code uses a couple of pins in portC, and it's dangerous to do the |= and &= operations on the port in the isr, but I guess if the TWI is never disabled and you don't do anything else on those ports outside of the ISR that it's OK.

Why would the |= and the &= be dangerous in the ISR? I haven't read the code, but I would expect interrupts to be disabled there.

I would expect those operations to be dangerous outside the ISR if not made "atomic".

/Bingo


Hi Bingo

Well, I guess it depends on how you look at the world. My practice is to do I/O things outside of ISRs unprotected, and not do any inside ISRs. If you protect things outside by disabling IRQs around I/O changes, then it is of course safe to mess with them inside, I just don't do it that way because the code looks messier and I'm afraid I might forget somewhere.

Craig


Hi Craig,

In my case things are actually a little different from what you describe. I attached an annotated diagram. The first trace for each of the slave and the master shows when the TWSTA bit is set. The slave sets TWSTA slightly after the master, so the bit remains set until the first interrupt, where I clear it by overwriting TWCR.

I only set TWSTA when the TWI module isn't busy anymore so there's never a start pending when the slave receives a stop. During all the communications I have, the start happens around 400µs after the stop and the stop ISR only takes around 10µs. I also monitored all TWSR values, they are normal and during the glitches there isn't any TWI interrupt generated.

And I noticed that if the TWSTA bits of the two AVRs are set more than 5 µs apart, the signals still look the same and all TWSR statuses are the same, but the glitches don't happen and the bus never hangs. Those few microseconds really seem to change everything here. I couldn't find anything in my code that could explain the glitches.

David


Hi David

My interpretation of the glitches is that they are a start, sent by the (former) slave, due to the TWSTA bit being set and then cleared. I suspect that this is a TWI state machine feature.

I think you can eliminate this problem by keeping track of "aborted" master cycles and setting TWSTA in I2CReset if one is pending, so that the start is actually issued and the cycle that was originally attempted is carried out.

The reason the bus is hanging is that the glitches are a valid start, but with no clocks and no timeout, the cycle never finishes. The fact that the chip that generated the glitches fails to do a new start must indicate that its TWI state machine is locked up waiting for something.

Of course this is speculation, but the code I have that does this does not hang or generate the "glitches", which I'm guessing are just a left-over start from the slave.
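
A minimal sketch of the "simple if-else" idea, assuming the application keeps a `master_pending` flag of its own. The helper name `twcr_after_slave_cycle` is hypothetical, and the TWCR bit positions are redefined locally (following the ATmega48 datasheet) so that the snippet compiles without `<avr/io.h>`. It only computes the value that would be written to TWCR at the end of a slave transaction.

```c
#include <stdbool.h>
#include <stdint.h>

/* TWCR bit positions as listed in the ATmega48 datasheet, redefined here
 * so the sketch builds on a host compiler. */
#define TWINT 7
#define TWEA  6
#define TWSTA 5
#define TWEN  2
#define TWIE  0

/* Hypothetical helper: value to write to TWCR when a slave transaction ends.
 * master_pending is an application flag, set when a master cycle was wanted
 * but its START never went out because this node was addressed as slave. */
static uint8_t twcr_after_slave_cycle(bool master_pending)
{
    /* Clear the interrupt flag, keep acknowledging our own address,
     * keep the TWI and its interrupt enabled. */
    uint8_t v = (uint8_t)((1u << TWINT) | (1u << TWEA) | (1u << TWEN) | (1u << TWIE));
    if (master_pending) {
        v |= (uint8_t)(1u << TWSTA);   /* re-request the START instead of dropping it */
    }
    return v;
}
```

Whether the flag is cleared when the START status arrives or when the whole master cycle completes is a design choice left open here.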

Craig


Understood now Craig :-)
I'm going to try what you suggest and let you know.

What's strange though is that I have exactly the same thing during most of the other frames (that a TWSTA is set and cleared during slave receiving mode) and I get no glitches there.

Also, that start is really close to the stop, which wouldn't follow the I2C specification, but I've heard on this forum that this is a known problem of the TWI module.


I look forward to your results.

Be careful what you listen to; I may be the source of the "too quick start" rumor. Your waveform actually looks very good to me. If you want to see a really quick start, add a third/fourth independent master and slave doing cycles on the bus, and you will see the TWI issuing a start before the pullup gets SDA to 2.5V.

Craig


I'm getting further with this problem. As you suggested, I tried not clearing TWSTA in the slave receive interrupts, so the start remains pending in the slave until the bus is released. The start is then issued by the slave (which becomes master at that point) once the STOP interrupt has completed. This completely fixed the problem of the glitches; the bus is now very stable.

The START is generated very quickly after the end of the STOP interrupt, around 2 µs in my case. There is also a slight delay between the STOP condition and the STOP interrupt, and the duration of the STOP interrupt itself must be added as well to get the complete bus free time between the STOP and the START.

In this case there's no real problem getting at least the 4.7 µs bus free time specified by the I2C protocol, but only because my STOP interrupt is nearly that long, which is not good. This CPU was the slave, so no other slave could be executing a STOP interrupt at that time, and the pending START can be issued very quickly. If you add a systematic 10 µs delay between a STOP and a START in the master, this is a rather safe situation when you have only 2 CPUs on the bus.
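
The bus-free-time budget described above is just a sum of three intervals compared against the 4.7 µs standard-mode minimum. A small sketch, with hypothetical function names and all times in nanoseconds to stay in integer arithmetic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum bus free time between STOP and START in I2C standard mode. */
#define T_BUF_STANDARD_NS 4700u

/* Bus free time seen on the wire when a pending START follows a STOP:
 * latency from the STOP condition to the STOP interrupt, plus the ISR
 * duration, plus the time from the end of the ISR to the START. */
static uint32_t bus_free_time_ns(uint32_t stop_to_isr_ns,
                                 uint32_t isr_ns,
                                 uint32_t isr_end_to_start_ns)
{
    return stop_to_isr_ns + isr_ns + isr_end_to_start_ns;
}

/* True when the measured free time meets the standard-mode requirement. */
static bool bus_free_time_ok(uint32_t free_ns)
{
    return free_ns >= T_BUF_STANDARD_NS;
}
```

For example, with made-up figures of 1000 ns latency, 2000 ns of ISR and 2000 ns to the START, the total is 5000 ns and the check passes; a START that appears only 2000 ns after the STOP does not.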

If you have more than 2 masters on the bus, a CPU that is neither slave nor master may have a START pending for the CPU that is currently the slave. The START condition will then probably be issued less than 4.7 µs after the STOP condition, and there's little chance that the slave can complete its STOP interrupt in so short a time. This can lead to another problem which is quite well known and described in this forum. Basically, if a slave is executing the STOP interrupt at the moment another master sends a start on the bus, the TWI hardware of the slave doesn't seem to monitor the bus, and once the interrupt has completed, if the slave wants to send a start, it will attempt to drive the bus as if it were free. This makes the modules crash completely, with the SCL line clocking indefinitely.
There's a snapshot of this problem attached.

Hope that helps getting further with these problems. In my case as I have only 2 CPU's on the bus, I have now a reliable solution but there's still the problem with 3 or more masters which has no good solution.

My recommendations, though, would be to:
- deal with the STOP interrupt at the very beginning of your ISR and exit the ISR as quickly as possible;
- if a master can send a stop followed immediately by a start, use a repeated start instead.
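
The first recommendation can be sketched as a status check that tests for the slave STOP case before anything else. `TW_STATUS_MASK` and `TW_SR_STOP` use the same values as avr-libc's `<util/twi.h>` but are redefined here so the snippet builds on a host compiler; `twi_action_t` and `twi_classify` are hypothetical names.

```c
#include <stdint.h>

#define TW_STATUS_MASK 0xF8u  /* upper five bits of TWSR hold the status code */
#define TW_SR_STOP     0xA0u  /* STOP or repeated START received while addressed as slave */

typedef enum {
    ACT_REARM_FAST,  /* STOP seen: re-arm the TWI at once and leave the ISR */
    ACT_GENERAL      /* everything else: go through the normal state handling */
} twi_action_t;

/* Decide what to do from a raw TWSR value, checking the STOP case first
 * so that its path through the ISR is as short as possible. */
static twi_action_t twi_classify(uint8_t twsr)
{
    uint8_t status = (uint8_t)(twsr & TW_STATUS_MASK);  /* drop the prescaler bits */
    if (status == TW_SR_STOP) {
        return ACT_REARM_FAST;
    }
    return ACT_GENERAL;
}
```

In a real ISR the fast path would write TWCR immediately and return, leaving any bookkeeping for the main loop.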

David

Last Edited: Mon. Oct 9, 2006 - 04:38 PM

I noticed that the images I attached are resized in the html so to view them correctly, you need to right click on them and choose "View Image"


One more data point: I measured the delay between the STOP condition and the STOP interrupt. Here's my assembly code:

@000000B1: __vector_24
217:      {
+000000B1:   921F        PUSH    R1               Push register on stack
+000000B2:   920F        PUSH    R0               Push register on stack
+000000B3:   B60F        IN      R0,0x3F          In from I/O location
+000000B4:   920F        PUSH    R0               Push register on stack
+000000B5:   2411        CLR     R1               Clear Register
+000000B6:   938F        PUSH    R24              Push register on stack
+000000B7:   939F        PUSH    R25              Push register on stack
+000000B8:   93EF        PUSH    R30              Push register on stack
+000000B9:   93FF        PUSH    R31              Push register on stack
225:          ISR_PT |= ISR_MK; /* 'I2C ISR' debug signal */
+000000BA:   9A43        SBI     0x08,3           Set bit in I/O register

Just saving those registers, on top of the interrupt latency, takes 4.2 µs, and I've seen that a pending start can be issued around 2 µs after the stop. This leaves us very little time to do anything reliably.


Hi David

I'm very pleased that your issue is solved, but unfortunately mine remains. The stop to start timing problem exists when there are at least 2 masters and any other device, master or slave, on the bus. In my case I have a variety of slaves as well as two masters.

The timing between the stop IRQ and the TWCR write for my code is now 20 cycles, 27 to the RETI. The latency of the worst-case non-TWI ISR needs to be added to this to determine the possible dead time during which the TWI is not looking for its own address. In my case, I have an A/D ISR that can take up to 40 cycles, so my worst case is 60 clock cycles, which means 7.5 usec at 8 MHz. I know of no way to make these routines faster.
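
The cycle-to-time conversion used here is easy to check. A small sketch, with the hypothetical helper `cycles_to_ns` working in nanoseconds to avoid floating point:

```c
#include <stdint.h>

/* Convert a CPU cycle count to nanoseconds at a given clock frequency.
 * The 64-bit intermediate avoids overflow for realistic cycle counts;
 * the result is truncated toward zero. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ull) / f_cpu_hz);
}
```

With the figures from the post, 20 cycles to the TWCR write plus 40 cycles of A/D ISR is 60 cycles, and `cycles_to_ns(60, 8000000)` gives 7500 ns, that is 7.5 µs.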

My biggest problem remains something different, though: occasionally the wrong data is sent from the AVR when it is read by another master. I think this happens when the addressed slave decides to send a start right after its slave address is recognized but before the ISR is entered. Because there is no way to see that the bus is busy, and in this case both SDA and SCL will be high since it's a read, the write to the TWCR to issue a start is taken as a go-ahead to send the data that happens to be in the TWDR, which is bad.

In any case, I continue to play and hack in hopes that I will stumble on something or Atmel will step up. I'm pleased that you have a solution, and if you learn more, please let the rest of us know.

Craig


Hi David

In looking at your pictures and the original code, I notice that your ISR does not have cases to handle any of the TW_MR statuses, and there is no default case. While the failing cycle looks like a good write, I wonder if it's being seen as something else, for example if the first clock was missed, making this a read.
It looks like the start-sent irq happens and is correctly handled with the sending of the slave address on the master. It may be that the master is trying to send a stop, since it got a NACK on the SLA, but the slave is holding the SDA line low for some reason, so the master's TWI continues to attempt the stop, hence the odd looking SCL waveform.

I'm sure your code is different now, but I have never seen this infinite clocking behavior. I suspect there may be some unhandled case occurring and that a default case might fix things.

Craig


Hi Craig,

I'm using a ring buffer to store all values of the status register when an interrupt occurs and I also usually put breakpoints on all cases I'm not expecting and I must say that I've never seen something strange there. Everything was just running normal and when the hardware fails, I could never see something wrong on the status there.
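
A status ring buffer of this kind can be very small. The sketch below is an assumption about how such a log might look, not David's actual code; `twsr_log_t`, `twsr_log_push`, `twsr_log_recent` and the depth of 16 are all made up for illustration.

```c
#include <stdint.h>

/* Hypothetical depth; a power of two so the index wraps with a simple mask. */
#define TWSR_LOG_SIZE 16u

typedef struct {
    uint8_t entries[TWSR_LOG_SIZE];
    uint8_t head;    /* next slot to write */
    uint8_t count;   /* number of valid entries, saturates at TWSR_LOG_SIZE */
} twsr_log_t;

/* Record one status value; intended to be called first thing in the TWI ISR. */
static void twsr_log_push(twsr_log_t *log, uint8_t status)
{
    log->entries[log->head] = status;
    log->head = (uint8_t)((log->head + 1u) & (TWSR_LOG_SIZE - 1u));
    if (log->count < TWSR_LOG_SIZE) {
        log->count++;
    }
}

/* Read back an entry by age (0 = most recent).
 * Returns 0xFF, used here as a "no such entry" marker, if age is out of range. */
static uint8_t twsr_log_recent(const twsr_log_t *log, uint8_t age)
{
    if (age >= log->count) {
        return 0xFF;
    }
    return log->entries[(uint8_t)(log->head - 1u - age) & (TWSR_LOG_SIZE - 1u)];
}
```

After a hang, the last few entries can be inspected from a debugger to see which status preceded the failure.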

You're right about the non-TWI ISRs; in my case some of them can be very long, and then I can still get the problem. The only good solution would be for the hardware to acknowledge a newly recognized address but hold the bus until the related interrupt has been processed. What happens now is that the hardware is not able to recognize its address until the stop interrupt is completed. I can't see any solution at all to this problem.

Does anyone have experience with this on CPUs other than AVR, or with external TWI modules?


jaguarondi wrote:
Hi Craig,
Does someone have any experience with this with other CPU's than AVR or external TWI modules?

It is known that:
Multi-master mode does NOT always work.
If you are using power-save mode and the device was woken by something other than the TWI (external interrupt or timer), you cannot transmit anything.
Atmel has admitted those bugs.


brberie wrote:
Atmel has admitted those bugs.

Then they'd better update their errata. At least I hope we pointed out most of the multi-master issues in this thread.

I'm now using repeated start which works pretty well.
By mistake, I had software that was sending a repeated start, no data, and then a stop. I didn't dig deeper into this, but it seems like it could be a good solution, because the slave still holds the bus while its ISR is handling the data, and the master only sends the stop once the slave is done with that interrupt. Of course there's still another interrupt when the stop is received, but that one doesn't need to do anything, so it can be very short. Just an idea, which may not conform to the I2C standard though.

David


Well, you're right, but I have to reject using multi-master TWI at all: too much time and blood have been lost.

Last Edited: Wed. Oct 11, 2006 - 10:00 PM

Hi David and brberie

I was aware of the power save issue, but I can't remember off the top of my head where I saw it.

I have not ever seen a statement from Atmel that multi-master does not or can not be made to work, but I have seen support for it claimed many places. I have heard Atmel and others say it's complicated, but that doesn't bother me much. What I have an issue with is the lack of support.
Perhaps I'm wrong, it may be that they claim master-slave capability rather than multi-master, but of course that makes no sense, what's the point of being a slave if there isn't another master?

I have continued to make some progress, and have reduced my TWI isr latency for the stop case to 16 cycles, and just set flags in all other ISRs and handle the work outside, which requires 14 cycles for memory flags, so my total latency on a stop irq is down to 3.5 usec, which should be good enough.

My issues still revolve around small time windows before starts are sent, and the proper way to recover from losing arbitration and to deal with a pending start while handling a slave transaction.

I can add another bug to the list for the TWI: it is possible for the TWI to lose arbitration in master mode to another master that is sending a higher address. Since I don't have a big hangup on arbitration priority, I don't really care about this, but it indicates there is something else out of spec in the TWI hardware.

I thought that AVRs were a good choice for use as IPMI controllers, but I guess I'm just going to give up and use a Philips or Renesas part.

Craig


rc_campbell wrote:
The latency of the worst-case non-TWI ISR needs to be added to this to determine the possible dead time during which the TWI is not looking for its own address. In my case, I have an A/D ISR that can take up to 40 cycles, so my worst case is 60 clock cycles, which means 7.5 usec at 8 MHz. I know of no way to make these routines faster.

You can enable interrupts as soon as you enter any interrupt handler except the TWI one. Use this "trick" with care, as you must make sure, somehow, that there won't be any resource contention problems.

Embedded Dreams
One day, knowledge will replace money.


Hi All,

I'll be starting shortly (~ 2 weeks or so) on a multi-master application. It's just a small part of a much larger project which is largely complete, and now it's time to do the MM TWI bit. Seems like it wasn't such a good idea, but when the project was planned 18 months ago I wasn't aware of the hassles you guys have raised.

Just over a month ago I wrote to Atmel support asking them to expand on several aspects of MM operation, including a phrase in the description of TWSTA. In part, that description says

"The TWI hardware checks if the bus is available, and generates a START condition on the bus if it is free. However, if the bus is not free, the TWI waits until a STOP condition is detected, and then generates a new START condition to claim the bus Master status. TWSTA must be cleared by software when the START condition has been transmitted."

The phrase I questioned was "if it is free", and specifically asked exactly how the hardware determined whether the bus is free or not. The reply that I received said, in part:

"A multi master system is usually a big challenge. The AVR TWI module will not wait and check for the bus to be idle before starting a transmission, but try sending immediately after the module is initialised to send one.

This means you'll in software check if SDA and SCK is both low before starting by sending a start. If they are you'll have to wait until the bus is IDLE. The hardware will however see if the start failed, and if so retry. This may cause noise on the lines.

...snip

Some recommendations:

After a stop condition, make sure the bus free time is long enough to meet the bus requirement. This is not handled by the hardware, but is the softwares responsibility. As you're having a multi-master application, I suggest a random wait-time to make sure not all masters get synchronised and try at the same time.

Note however that the responsibility of following the I2C protocol is entirely a job for the application it self. The TWI module is only a set of tools assisting you in this process. Especially in a multi master system this becomes very complicated and puts a very high demand on error event handling in the application code. This is due to the likelihood a large amount of protocol errors being generated on the bus. Everybody is chatting along and disturbing everybody else."

There are several interesting statements in that reply, and the bottom line seems to be:

1. The application must ensure that the bus is free before requesting (via TWSTA) that a START be generated.

2. It is the application's responsibility to ensure that no START is attempted before some minimum time has elapsed after a STOP.

Do you consider that these statements encompass the reasons for the problems you have discussed in this thread, even if they don't provide a clean solution?

I'm beginning to wish now that I had not decided to use the TWI when the system was designed, but at the time it looked like an ideal solution. It looks like a decision that is coming back to bite me before I even start implementing it!

Roger


Roger,
Software can obviously solve most of the known problems, but the question becomes: is fixing the TWI in software worth the time and effort? That question is entirely up to you. Good luck.


Thanks for your contribution Roger, this is pretty clear and explains some of our issues.

From my point of view, I spent way too much time trying to get my multi-master bus working reliably, and if I had to redo it from the beginning, I would choose a single-master system. Of course I learnt a lot by doing so, but I really think the TWI module should handle more than it does right now. If you have to monitor in software everything Atmel described, you'll never get a reliable system. For example, if you check that the bus is IDLE for 5 µs and then try to send a start, but another master does the same at the same time (within the same microsecond or so), you'll still have the same problem, because that small difference can't be checked by software. Random delays can help reduce the probability that conflicts will occur but will never reduce it to 0.
I really think this module is not well suited to multi-master operation in general. But in special cases, depending on what's going on on your bus, and if you know the hardware well and deal with it correctly, it should be possible.
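
As an illustration of the random-delay idea (which, as noted, only lowers the collision probability), here is a minimal sketch using a 16-bit Galois LFSR as a cheap pseudo-random source. The tap mask 0xB400 is a well-known maximal-length choice; `lfsr16_step` and `backoff_us` are hypothetical names, and each node is assumed to start from a different non-zero seed, for instance one derived from its own slave address.

```c
#include <stdint.h>

/* One step of a 16-bit Galois LFSR with taps 0xB400 (maximal length).
 * The state must never be zero, or the sequence stays at zero. */
static uint16_t lfsr16_step(uint16_t state)
{
    uint16_t lsb = (uint16_t)(state & 1u);
    state >>= 1;
    if (lsb) {
        state ^= 0xB400u;
    }
    return state;
}

/* Hypothetical backoff: advances the generator and returns a delay in
 * microseconds in the range [min_us, min_us + span_us).
 * span_us must be non-zero and min_us + span_us must fit in 16 bits. */
static uint16_t backoff_us(uint16_t *state, uint16_t min_us, uint16_t span_us)
{
    *state = lfsr16_step(*state);
    return (uint16_t)(min_us + (*state % span_us));
}
```

A master would wait `backoff_us(&seed, 10, 100)` microseconds (figures chosen arbitrarily) after seeing the bus idle before setting TWSTA.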

Anyway, if you continue in multi-master mode, you know where to report your new discoveries :-)

David


Hi David,

Yes, I'm sure that I'll be back here a lot in the next couple of months - there's nothing like talking to someone who has already been there, done that. Your problems have now alerted me to the dangers at the STOP condition. I always had concerns about the START, but I hadn't made the connection to the STOP/START situation.

I agree, the TWI is a very cpu-intensive peripheral and could be greatly improved with a bit more hardware in there, although even just making this bus control area bullet-proof would go a long way to easing the pain.

I guess they wanted it to be identical to the standard I2C bus. Philips appreciated the danger after a STOP, and made a (very small) point about it in their spec. I had missed it entirely until I re-read it a few times following your investigations.

Forums are wonderful things!

Roger


Hi All, and welcome

Boy, it's nice to have this discussion.
I have implemented the wait-for-idle solution, but as David points out it is not an absolute solution because of the remaining time hole just before sending the start. There is another issue as well: SDA and SCL both high on I2C does not necessarily mean the bus is idle, as this condition exists for the duration of the SCL high time for all "1" databits. Since I happen to know by inspection the clock high time of all of my bus masters, I can make up a minimum time before I assume idle, and have done this. It's interesting to see how Philips addressed this issue in their hotswap buffers (PCA9511). This is an area that makes SMB a better choice than I2C, since it defines a timeout.
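
A minimal sketch of the wait-for-idle check described above: the bus is only treated as idle after SDA and SCL have both been sampled high for a minimum number of consecutive samples, chosen to be longer than the longest clock high time of any master on the bus. The names `bus_idle_t` and `bus_idle_sample` are hypothetical, and the pin sampling itself is left to the caller. As discussed, a window remains between the last sample and the write to TWCR.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t high_samples;   /* consecutive samples with SDA and SCL both high */
} bus_idle_t;

/* Feed one sample of the two bus lines, taken at a fixed period.
 * Returns true once both lines have been high for at least min_samples
 * consecutive samples; any low level restarts the count. */
static bool bus_idle_sample(bus_idle_t *b, bool sda_high, bool scl_high,
                            uint16_t min_samples)
{
    if (sda_high && scl_high) {
        if (b->high_samples < UINT16_MAX) {
            b->high_samples++;
        }
    } else {
        b->high_samples = 0;
    }
    return b->high_samples >= min_samples;
}
```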

My current issues are related to what the TWI hardware does when a start is issued and it:
1. detects a bus busy due to the small, unavoidable time window between checking SDA/SCL and issuing the start
2. loses arbitration to some other master addressing some other device
3. loses arbitration to some other master addressing this device's slave address, where there are multiple slave statuses and transactions before the bus can be mastered again.

In each of these cases, there is a pending start in the TWI state machine that may or may not go away and need to be re-issued. Roger identified this issue when he asked Atmel about the start behavior paragraph, but unfortunately I don't see the answer given.

If this can be explained clearly, I will post code that implements it. I have recently rewritten my TWI ISR to clean up after months of experiments and hacking, doing my best to reduce latency, and will share this if desired, but it still doesn't work in all cases. If it is simply not possible due to the limits of the TWI hardware implementation, please put me out of my misery. It is not possible to implement a multi-master + slave interface purely in software: there are too many races and windows that cannot be closed even if the processor has nothing else to do. Start detection, own-address detection, and contention/arbitration need some hardware support. If there were a way to ask the TWI whether the bus is busy, it would help. If the TWI state machine were completely and accurately documented, it would help. If the details of a pending start were documented, it would help.

Craig


To whom it may concern... (italics are Atmel's responses)
June 30, 2005
Can you confirm that there is only 3 ATmega128 devices on the I2C bus, or are there other devices interfering at this point? Are all devices running at the same speed and with the same software? Could you send us a copy of the TWI drivers with a description, to enable us to reproduce the issue?

Yes, there are only 3 devices on the bus with the same speed and the same driver. We have only 3 devices connected physically to the bus.
Our driver involves our in-house OS, and because of its complexity it is not possible for me to send the whole driver. Instead, let me explain in detail how we reach the failure condition in our tests.

In our test case there are 3 chips. All that chips 2 and 3 do is wait for a general call from chip 1 and immediately reply with a message to chip 1.
Chip 1 starts by sending the general call to chips 2 and 3. After that it immediately tries to send a transmission to a non-existent device but fails (SLA+W will NACK), so it will STOP and retry later. Chips 2 and 3 would have tried to send a reply to chip 1 immediately after receiving, but since chip 1 is trying to transmit to the non-existent device, both of them have to wait. Then, during the STOP after the NACK that chip 1 received, chip 2 and chip 3 will arbitrate; one of them will lose and have to wait for the line to be free to initiate a START. Let's say chip 2 won the arbitration. When chip 2 has finished transmitting to chip 1 it will make a STOP (the start of the plots). Seeing this STOP, chip 3 will immediately make a START to send its message. Chip 1, which has just finished receiving from chip 2, will want to try to send again to the non-existent device. As chip 3's SLA+W has not yet completed, chip 1 does not yet recognize that chip 3 is going to talk to it, so it considers itself free to retry talking to the non-existent device and initiates a START (point 2 on the graphs). The hardware should have noticed from the START that chip 3 initiated that the bus is busy, and made it impossible for chip 1 to make a START, but we find this not to be true: there is a START, or a non-deterministic situation. This is where our problems start. In the picture below, it can be seen that there are two START conditions.

D7: SCK
D6: SDA
D5: Pin triggered when chip 1 sets TWSTA in TWCR.

The START condition at point 1 is made by chip 3 and the START condition at point 2 is made by chip 1. The 3 clocks in between are chip 3 sending SLA+W. Correct me if I'm wrong, but from my understanding it should not be possible for a device to successfully initiate a START when the bus is being used by another device that has initiated a START and has not yet made a STOP condition.

As a recap, here are the two other plots of the same code on the same setup with three chips. The first being how I believe it should be working where the START from chip one is not done till chip 3 finished its transmission and STOPs. The second picture is the failure picture again.

By looking at the plots in the LastUpdate file I can see that on plot 2 & 3 the time between STOP and START is about 5us. Which is within the spec, and as you state also gives an acceptable behavior.

Yes, I believe that when the STOP-to-START time is within spec things work fine. But when this condition is not met, that doesn't seem to be the case. The initiation of the START is beyond our control once we have set a START, which lets the hardware wait for a STOP and then initiate the START again, as done in the Atmel application notes. What can be done from the software side to ensure the problem does not occur?

In plot 1 however the time is much shorter. You have measured it to be 1.6us. The spec for standard mode is 4.7us and for fast mode 1.3us. If one device is running at fast mode (unintentionally or not) then the standard mode devices will have trouble detecting it. Then you will have 2 masters talking at the same time.

We are running all chips at 99.6kHz. At what speed would you define fast mode to be?

Could there even be more masters trying to take the bus? The most likely cause of the glitches is that a master thinks the bus is free and he is trying to send a START.

We do know which device is trying to take over the bus (chip 1 as described in the test above). Yes, I believe that the problem is that a master (chip 1) thinks the bus is free and is trying to take over the bus. From a software standpoint, how would we know whether the bus is busy or not? The hardware should have recognized that a START condition has been made and the bus is busy, and prevented the chip from initiating a START.

Yes, upon losing arbitration the losing master will try to initiate a start again. However if the bus is busy shouldn't it be waiting till a STOP is received and the bus is freed?

Attachment(s): 

Last Edited: Thu. Oct 12, 2006 - 05:38 PM

8-30-05
I took the time to reproduce the problem using simple code. The code was written using AVRSide and compiled using avr-gcc. Included is a compressed folder with code for 3 different chips. The folder also contains .txt files which detail the build process for the files. The flow of the programs is the same: chip 1 first sends a Master Transmit general call to all chips (chip 2 and chip 3). Then it tries 3 times to talk to a non-existent device and gets NACKs at SLA+W. During the same time chip 2 and chip 3 will try to Master Transmit to chip 1 once they have received the first general call from chip 1. The code starts out fine, but if you let it run long enough (typically less than 30 seconds), you will start seeing the recurring clocks problem again.

Attachment(s): 


Hi brberie

This TWI ISR has more than 70 cycles of latency when handling a slave stop irq. Add the 25 cycles from the overflow ISR and you have more than 12 usec of dead time where the TWI state machine is not paying attention to the bus. I would recommend moving the 0xA0 case to the top of the switch statement, but I'm afraid that I don't think it's possible to ever meet spec with a C ISR.
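
The arithmetic behind that dead-time figure can be sketched as below. The CPU clock is not stated in the thread, so the 7.3728 MHz value is only an assumption for illustration, and the helper names are hypothetical. With 70 + 25 cycles the result comes out a little under 13 us, well above the 4.7 us standard-mode figure quoted earlier in the thread.

```c
#include <stdint.h>

/* ASSUMPTION: the thread never states the CPU clock. 7.3728 MHz is a
   common crystal value and is used here purely for illustration. */
#define F_CPU_HZ 7372800UL

/* Convert a number of CPU cycles to nanoseconds at the given clock. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / f_cpu_hz);
}

/* Worst-case dead time: the TWI ISR latency plus another ISR that
   happens to be running when the TWI interrupt fires. */
static uint32_t twi_dead_time_ns(uint32_t twi_isr_cycles,
                                 uint32_t other_isr_cycles,
                                 uint32_t f_cpu_hz)
{
    return cycles_to_ns(twi_isr_cycles + other_isr_cycles, f_cpu_hz);
}
```

Calling `twi_dead_time_ns(70, 25, F_CPU_HZ)` reproduces the "more than 12 usec" estimate under the assumed clock; a faster clock shortens it proportionally.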

I am also interested to see the use of all the |= operators on the TWCR. I've wondered about this and the effect it has on pending starts, but I'm afraid of it because of the stop bit read-back behavior. I think it can result in sending multiple stops, and I have a theory that when you send a stop while someone else is holding SDA low, the result is the endless clocking that you're seeing.

I also see stop being set in many cases, and I'm not sure what that does. The multiple, back-to-back TWCR writes are also something that I haven't done, but I have no idea whether there is an issue with doing them.

I suspect it is dangerous to do |= into the TWCR outside of the ISR without disabling interrupts, and maybe not even a good idea then, since things can change if you get an irq between the read and the write.
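
A small host-side model may make the |= concern concrete. It is not AVR code: it only computes which "action" bits a read-modify-write would write back beyond the one requested. The bit positions are the datasheet ones; the helper name `twcr_unintended_bits` is hypothetical. Two datasheet facts drive it: TWINT is cleared by writing a one to it, and TWSTO reads back as one until the STOP has been executed on the bus. Whether re-writing TWSTO really produces a second STOP is Craig's theory above, not something this sketch can confirm.

```c
#include <stdint.h>

/* TWCR bit positions (ATmega128 / ATmega48 datasheets). */
#define TWINT 7
#define TWEA  6
#define TWSTA 5
#define TWSTO 4
#define TWWC  3
#define TWEN  2
#define TWIE  0

/* Bits that trigger an action when written as one. */
#define TWCR_ACTION_BITS ((1u << TWINT) | (1u << TWSTA) | (1u << TWSTO))

/* Model of `TWCR |= mask`: the value written is (readback | mask).
   Returns the action bits that get written even though the caller
   did not ask for them. */
static uint8_t twcr_unintended_bits(uint8_t readback, uint8_t mask)
{
    uint8_t written = (uint8_t)(readback | mask);
    return (uint8_t)(written & ~mask & TWCR_ACTION_BITS);
}
```

For example, if TWCR reads back with TWINT and TWSTO still set and the code does `TWCR |= (1 << TWSTA)`, the model reports that TWINT and TWSTO are written as ones too. A plain assignment that names every bit explicitly avoids this.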

I think in the end that Atmel must tell us what the TWI state machine does, in gory detail. I'm afraid that I've spent too much time on this and have developed some silly theories and folklore about how it may or may not work, and I may be infecting others with my erroneous concerns. Please take this as a plea for help and an invitation to ignore my suggestions.

Craig


Craig,
I'm sorry, but I'm not going to do anything more with TWI... I've already wasted a lot of time and effort. With the existing HW, TWI multimaster will never run 100% reliably.
Just a few words about the delay calculations: I had a simple code example where I was polling for a START immediately after issuing a STOP condition inside the ISR; it did not help. I can't recollect more details, but the point is NO MORE TWI multimaster with existing hardware. PERIOD.
PS: The code example was intended only as the simplest possible way to recreate the problem; it is not production code. Therefore it may be far from ideal.
PPS: One more thing. I believe that the HW state machine was working fine if the TWI was reset every time after sending STOP:

  TWCR &= ~( 1 << TWEN );
  TWCR |=  ( 1 << TWEN );

Hi David, Craig, brberie,

Between the three of you there has been some pretty solid investigation of these problems, and ultimately I think that we have to agree with brberie's stance i.e. scratch MM TWI until Atmel revise their implementation. But they've got it embedded in a lot of parts, even in the FPSLIC core, I believe, so.... don't hold your breath.

You are right Craig, you don't see an answer from Atmel re the START behaviour in my earlier post because there wasn't one. Essentially, their reply was a cop-out, although they did admit "The AVR TWI module will not wait and check for the bus to be idle before starting a transmission...".

I'm between a rock and a hard place right now, because all the hardware is built and bolted in place and "all" that I have to do is put the link in place. However, this is the main processing/control equipment in an autonomous underwater vehicle that is intended to be off for a month or so, doing its own thing, and there's no way that buggy hardware can be allowed. Software alone will be enough to keep me sleepless!

Just talking off the top of my head, one work-around (very crude, but might work) could be to implement a token-passing system, where chip1 does whatever it has to do, then sends a message (token) to chip2, which does whatever it needs to, then it passes the token to chip3, which passes it back to chip1, etc. There is never any contention for the bus, and it could be made a co-operative system where a chip will relinquish its ownership after a certain period of time if it has a lot to do, just to allow the others to have a bit of the action. Each chip could have a timeout running, and do a reset on its TWI if it determines that either (1) it hasn't relinquished the bus, or (2) it hasn't regained the bus (received the token on the next time around the loop) by the time the timer expires. Chips 2 and 3, say, could just do a reset, and then wait for a general call from chip 1 to re-initialize them, or perhaps just simply wait for the new token to arrive. Chip1 (sort-of a system 'master') would start things off again, the same as it does at system initialization time.
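
As a rough illustration of the token-passing idea, here is a hardware-independent sketch of the per-chip bookkeeping. Everything in it (the type names, the tick-based timeouts, the `ACT_*` actions) is hypothetical and not taken from anyone's code in this thread. It also simplifies one point: instead of resetting the TWI when a chip fails to relinquish the bus, the hold timeout simply forces the chip to pass the token.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CHIPS 3              /* chips are numbered 1..NUM_CHIPS */

typedef enum { TOKEN_IDLE, TOKEN_HELD } token_state_t;
typedef enum { ACT_NONE, ACT_PASS_TOKEN, ACT_RESET_TWI } token_action_t;

typedef struct {
    uint8_t       id;
    token_state_t state;
    uint16_t      ticks_left;    /* counts down in either state */
    uint16_t      hold_ticks;    /* longest time a chip may keep the token */
    uint16_t      wait_ticks;    /* longest time to wait for the token */
} token_node_t;

/* Next chip in the ring: 1 -> 2 -> 3 -> 1. */
static uint8_t token_next_id(uint8_t id)
{
    return (uint8_t)(id % NUM_CHIPS + 1);
}

static void token_init(token_node_t *n, uint8_t id,
                       uint16_t hold_ticks, uint16_t wait_ticks)
{
    n->id = id;
    n->hold_ticks = hold_ticks;
    n->wait_ticks = wait_ticks;
    /* Chip 1 acts as the system 'master' and starts with the token. */
    n->state = (id == 1) ? TOKEN_HELD : TOKEN_IDLE;
    n->ticks_left = (id == 1) ? hold_ticks : wait_ticks;
}

/* Call when a token message addressed to this chip has arrived. */
static void token_received(token_node_t *n)
{
    n->state = TOKEN_HELD;
    n->ticks_left = n->hold_ticks;
}

/* Call once per timer tick; work_done means nothing is left to send. */
static token_action_t token_tick(token_node_t *n, bool work_done)
{
    if (n->ticks_left > 0) {
        n->ticks_left--;
    }
    if (n->state == TOKEN_HELD) {
        if (work_done || n->ticks_left == 0) {
            n->state = TOKEN_IDLE;
            n->ticks_left = n->wait_ticks;
            return ACT_PASS_TOKEN;   /* send token to token_next_id(n->id) */
        }
        return ACT_NONE;
    }
    if (n->ticks_left == 0) {        /* token never came back */
        n->ticks_left = n->wait_ticks;
        if (n->id == 1) {            /* chip 1 restarts the ring */
            token_received(n);
        }
        return ACT_RESET_TWI;
    }
    return ACT_NONE;
}
```

The caller would run `token_tick()` from a periodic timer, reset its TWI on `ACT_RESET_TWI`, and send the token message to `token_next_id()` on `ACT_PASS_TOKEN`. Chip 1's wait timeout would need to be long enough to cover a full trip around the ring.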

I don't have a critical real-time requirement; a few hundred milliseconds hiccup won't be a problem as long as I can completely recover the bus and the data transactions, and that should 'just' be a matter of software.

I would appreciate all input that you might have to offer on this idea, or some similar work-around.

Thanks guys,

Roger


Quote:

but I'm afraid that I don't think it's possible to ever meet spec with a C ISR.

Maybe/possibly/probably with GCC. :)

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


theusch wrote:
Quote:

but I'm afraid that I don't think it's possible to ever meet spec with a C ISR.

Maybe/possibly/probably with GCC. :)

Lee


Very helpful, highly informative, extremely useful as usual


Blow off, brberie. Yes, GCC does have the "naked" ISR capability, but then you must construct your own saving/restoring sequence and might as well craft the ISR in ASM--hence rc's comment.

With the [not for professional work] CodeVision, ISRs have "smart" saving & restoring. With careful thought about what is actually being referenced in the ISR, the equivalent of a hand-crafted ASM save/restore can be obtained in straight C.

You knew darn well what I meant.

There, is that any more informative?

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


theusch wrote:

With the [not for professional work] CodeVision, ISRs have "smart" saving & restoring. With careful thought about what is actually being referenced in the ISR, the equivalent of a hand-crafted ASM save/restore can be obtained in straight C.
Lee

Even with the "not very professional work" GCC, ISRs have "smart" saving & restoring of what is actually being referenced in the ISR, which in over 99% of cases eliminates the necessity of even thinking about a hand-crafted ASM save/restore and can be easily obtained in straight C.


Hi All

I'll avoid the religious wars, and just say that any public contributions of code that I make in this area will be in straight AVR Studio asm. In this way the relationship between the hardware, code, and timing are unambiguous and not affected by the development environment.

Roger, I have had good results using the watchdog timer to address the occasional lockup problem. This is not a trivial approach, as it requires you to think long and hard about the placement of every WDR and understand the approximate worst case timing for each function and loop, but I think that it's much easier and more reliable than attempting to put timeouts everywhere and handle every error case perfectly. My problem with the timeout/error handling solution is that I simply don't have the ability to test it well enough in complex asynchronous environments that need to work even if there are some part or code failures. I have opted for redundancy, very limited single points of failure, and careful use of the watchdog timer.

Because it is fairly easy to tell in the reset routine the difference between a WD reset and the pushbutton or power-up reset, it is not a big problem to do different initialization for a watchdog. In my case, I need to avoid glitching some I/O pins on a watchdog reset so that the things that the AVR controls don't see the event. While the AVR unavoidably tri-states its outputs on a WD reset, it is a quick operation and I have had good results by doing fast I/O initialization to "safe" values and have reliably avoided the reset of the AVR affecting the system. It's remarkable what you learn when you start injecting errors and hardware failures into a system...
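
Telling the reset causes apart can be sketched as follows. The flag bit positions are those of the ATmega128 MCUCSR register (the ATmega48 calls the register MCUSR); the enum and the `classify_reset` helper are hypothetical names, and the priority order among flags is my own choice rather than anything Craig described. The function takes the register value as an argument so the logic can be exercised off-target.

```c
#include <stdint.h>

/* Reset flag bit positions in MCUCSR (ATmega128 datasheet). */
#define PORF  0   /* power-on reset  */
#define EXTRF 1   /* external reset  */
#define BORF  2   /* brown-out reset */
#define WDRF  3   /* watchdog reset  */

typedef enum {
    RESET_POWER_ON,
    RESET_WATCHDOG,
    RESET_BROWN_OUT,
    RESET_EXTERNAL,
    RESET_UNKNOWN
} reset_cause_t;

/* Classify the reset from a snapshot of the flag register. Power-on is
   checked first because other flags may also be set at power-up. */
static reset_cause_t classify_reset(uint8_t mcucsr)
{
    if (mcucsr & (1u << PORF))  return RESET_POWER_ON;
    if (mcucsr & (1u << WDRF))  return RESET_WATCHDOG;
    if (mcucsr & (1u << BORF))  return RESET_BROWN_OUT;
    if (mcucsr & (1u << EXTRF)) return RESET_EXTERNAL;
    return RESET_UNKNOWN;
}
```

On the target, the startup code would read the register once, pass the value to `classify_reset()`, and then clear the flags so that a stale flag is not seen on the next reset. A `RESET_WATCHDOG` result would select the "safe" fast I/O initialization path.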

Craig


A final note on this TWI stuff...

I give up. brberie is right, life is too short and the TWI implementation, I am now convinced, is not capable of supporting multi-master with slave. I have done hundreds of hacks, resetting after every slave transaction, keeping start bits around on arbitration loss, clearing start bits, waiting for the bus to be not busy by polling SDA/SCL for a period, etc., and there are just too many wrong and uncontrollable behaviors. I have pictures, but I don't even want to take the time to explain them. They are ugly, and include signal corruption while waiting to output a start, sub-usec bus idle, and arbitration loss to higher device addresses. These things can not be addressed by software, they are simply design errors (features) of the TWI implementation that prohibit it from working in a general multi-master environment.

I do believe that exactly two AVRs can talk to each other over TWI in limited, controlled environments, where it is OK to disable the slave interface for a while, there are only the two devices, and preferably you are only doing writes. I do not believe that it is possible to reliably do more than this.

It is clearly possible to implement reliable master or reliable slave operation separately, although meeting timing specs for slave mode requires very tight control of system irq latency and a very optimized TWI ISR for the STOP/REPEAT START case.

Anyone tempted to use an AVR for an IPMC implementation should forget about using the TWI and put a real bus interface on the board.

Craig


brberie wrote:
Craig,
I believe that the HW state machine was working fine if the TWI was reset every time after sending STOP:

  TWCR &= ~( 1 << TWEN );
  TWCR |=  ( 1 << TWEN );


Have you tried that?


Hi brberie

I have code to send the stop, wait for the TWCR stop bit to go away since there is no irq on this event (see pg 207 of the mega128 datasheet), and reset the TWI as you suggest. I have no trouble completing master cycles if I really get the bus. I have problems responding to slave cycles after losing arbitration as master, and with the bus inactive time of master cycles where the start is requested when someone else has the bus. It is not clear to me if there needs to be some delay after a TWI reset before sending a start, or if there is some period where it will not see a slave access when doing these resets.
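
The stop / wait / reset sequence described above might look roughly like this. It is a sketch, not Craig's or brberie's actual code: the function name, the bounded poll count, and the choice to re-enable with TWEA and TWIE set (so the slave side listens again) are all assumptions. The register is passed as a pointer so the logic can also be exercised on a host; on the target one would pass `&TWCR`.

```c
#include <stdbool.h>
#include <stdint.h>

/* TWCR bit positions (ATmega128 datasheet). */
#define TWINT 7
#define TWEA  6
#define TWSTA 5
#define TWSTO 4
#define TWEN  2
#define TWIE  0

/* Send STOP, poll until the hardware clears TWSTO (or give up after
   max_polls reads), then disable and re-enable the TWI.
   Returns true if the STOP was seen to complete. */
static bool twi_stop_and_reset(volatile uint8_t *twcr, uint16_t max_polls)
{
    bool stop_seen = false;

    /* Plain assignment, no read-modify-write. */
    *twcr = (uint8_t)((1u << TWINT) | (1u << TWSTO) | (1u << TWEN));

    /* No interrupt is raised when the STOP completes, so poll TWSTO. */
    while (max_polls--) {
        if ((*twcr & (1u << TWSTO)) == 0) {
            stop_seen = true;
            break;
        }
    }

    /* Reset the module: disable, then enable again ready for slave use. */
    *twcr = 0;
    *twcr = (uint8_t)((1u << TWEA) | (1u << TWEN) | (1u << TWIE));
    return stop_seen;
}
```

The bounded poll is there so a stuck bus cannot hang the caller; whether a delay is needed after the re-enable before requesting a new START is exactly the open question raised above, and the sketch does not answer it.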

I have now learned that others have successfully implemented a fully functional IPMC on an AVR using the TWI, so my previous statement on this not being possible is clearly wrong, I'm just too stupid to do it without more help than Atmel provides. There is a possibility that my starting point is wrong, since I started by making a fully I2C multi-master compliant master/slave interface, and this is a bit more than an IPMC requires.

We are both breaking our commitments to stop spending time on this. It's a beautiful fall, life is short, no more TWI work...

Craig


This note appears on page 209 of the 7679B–CAN–11/06 AT90CAN32/64/128 data sheet:

Note: TWBR should be 10 or higher if the TWI operates in Master mode. If TWBR is lower than 10, the
master may produce an incorrect output on SDA and SCL for the reminder of the byte. The problem
occurs when operating the TWI in Master mode, sending Start + SLA + R/W to a slave (a
slave does not need to be connected to the bus for the condition to happen).

I have not seen this note in other data sheets, so I do not know if this is limited to the AT90CAN32/64/128 parts or not.

It also turns out that the (1/Fscl - 2/Fck) formula for checking Tlow, from the Two-wire Serial Bus Requirements in the Electrical Characteristics section of the data sheet, is a typo. The formula should be (1/(2*Fscl) - 2/Fck). The old typo version would make incorrectly configured TWI setups appear to pass the minimum Tlow requirement when they actually failed.

https://www.avrfreaks.net/index.p...
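
To see what the two points above mean for a given setup, here is a small host-side check. It uses the datasheet bit-rate formula f_SCL = f_CK / (16 + 2 * TWBR * 4^TWPS), the corrected Tlow expression from this post, and the TWBR >= 10 note. The function names are hypothetical, and the clock values in the usage note are illustrative assumptions, not figures confirmed by anyone in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCL frequency from the datasheet bit-rate formula:
   f_SCL = f_CK / (16 + 2 * TWBR * 4^TWPS) */
static uint32_t twi_scl_hz(uint32_t f_ck_hz, uint8_t twbr, uint8_t twps)
{
    uint32_t prescale = 1ul << (2u * twps);          /* 4^TWPS */
    return f_ck_hz / (16ul + 2ul * twbr * prescale);
}

/* SCL low period in ns, corrected form: 1/(2*f_SCL) - 2/f_CK. */
static double twi_tlow_ns(double f_scl_hz, double f_ck_hz)
{
    return 1e9 / (2.0 * f_scl_hz) - 2e9 / f_ck_hz;
}

/* The typo form, 1/f_SCL - 2/f_CK, kept only for comparison. */
static double twi_tlow_ns_typo(double f_scl_hz, double f_ck_hz)
{
    return 1e9 / f_scl_hz - 2e9 / f_ck_hz;
}

/* Check a master configuration against the TWBR >= 10 note and a
   minimum Tlow (4700 ns for standard mode, 1300 ns for fast mode). */
static bool twi_master_config_ok(uint32_t f_ck_hz, uint8_t twbr,
                                 uint8_t twps, double tlow_min_ns)
{
    if (twbr < 10) {
        return false;
    }
    uint32_t scl = twi_scl_hz(f_ck_hz, twbr, twps);
    return twi_tlow_ns((double)scl, (double)f_ck_hz) >= tlow_min_ns;
}
```

As an illustration, an assumed 7.3728 MHz clock with TWBR = 29 gives about 99.6 kHz and passes the 4.7 us standard-mode check, while an assumed 8 MHz clock with TWBR = 28 (about 111 kHz) fails it with the corrected formula yet would appear to pass with the typo version.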

I was wondering if any of this information may be contributing to the TWI misbehavior noted in this thread.