Failure of "XOR" Instruction?

Go To Last Post
22 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This relates to a post I made awhile back thinking that I was having a stack overflow issue (https://www.avrfreaks.net/index.p...). I have some more info that has me more confused than ever.

On a product I developed based on the MEGA88, I do a simple periodic comparison of two variables stored in RAM, one of which is the complement of the other:

if((cpData_OK ^ cpData_OK_DR) != 0xFF)   //Double RAM
{
ProcessorFault(11);
}

I was getting an intermittent (once every 1-2 weeks) fault at a customer site, and some diagnostic firmware indicated that this was the fault. Originally, I thought this pointed to some data stack overflow corrupting one of the these variables. However, I added some more diagnostics and had the customer re-run the unit over the holiday break. This new firmware outputs both the value of the variable cpData_OK and its complement via the LED if this fault should occur. I received a video from the customer showing the LED sequence, and the two variables values are 1 and 254. In other words, this fault should NOT be occurring, as the "OR"'d value of these two variables IS 0xFF.

I am stumped now. Anyone have any ideas? As mentioned in my other post, this is an EMI-heavy environment, but the fault occurs at the same point in a superloop-type loop of code each time. However, it takes anywhere from days to a month for it to occur, which obviously makes troubleshooting difficult. If anyone has any other suggestions for how to diagnose this or what to look for, I would appreciate it.

Last Edited: Tue. Jan 3, 2012 - 04:05 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You are using XOR rather than OR.

If you XOR something with itself, you get 0.
If you XOR something with its complement you get 0xFF.

Using OR on something with its complement also gives 0xFF. However, so do several other combinations.

What exactly do you want to do?

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well firstly your operation is XOR, not OR. but in this case the result will indeed be 255. However, due to integer promotion rules you may be getting bit. What are the data-types for cpData_OK and cpData_OK_DR? [hint: Are they signed types?]

Another potential problem is if the values are being altered by an ISR. If so they need to be declared as volatile to prevent a cached value from being used.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Also, show the generated code for your fragment! And tell toolchain and version.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

glitch wrote:
Well firstly your operation is XOR, not OR. but in this case the result will indeed be 255. However, due to integer promotion rules you may be getting hit. What are the data-types for cpData_OK and cpData_OK_DR? Another potential problem is if the values are being altered by an ISR, if so they need to be declared as volatile to prevent a cached value from being used.

Sorry guys, of course you are right, it was a nice long vacation and my head is still stuck back there.

The datatypes for both cpData_OK and its complement are both unsigned chars. These values are indeed modified inside an ISR, and I do NOT use the volatile keyword in their definition. This clearly could be the cause of my issue (though it seems interesting that the behavior would not be consistent and might take tens or hundreds of thousands of repititions to cause an issue).

BTW, I am using IAR Embedded Workbench as my compiler (vs 5.51).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

BTW, I am using IAR Embedded Workbench as my compiler (vs 5.51).

Quote:

Also, show the generated code for your fragment!

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

Also, show the generated code for your fragment!

Sorry, missed that request. Here is the assembler that is generated:

0000179A:   9100022B    LDS       R16,0x022B     Load direct from data space
+0000179C:   9110022C    LDS       R17,0x022C     Load direct from data space
+0000179E:   2701        EOR       R16,R17        Exclusive OR
+0000179F:   3F0F        CPI       R16,0xFF       Compare with immediate
+000017A0:   F019        BREQ      PC+0x04        Branch if equal

Looks correct.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What happens if the interrupt that modifies the value hits during the test? Potential atomicity problem methinks.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Kartman wrote:
What happens if the interrupt that modifies the value hits during the test? Potential atomicity problem methinks.

Normally, you would be 100% correct. However, in my case, my ISR is enabled/disabled during a finite period, during which I should never be running this test. Essentially, I sit doing nothing while I wait for my ISR to gather some data. Once the data buffer is full, or a timeout hits, I turn the ISR off, and then carry on my way.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

nobbyv wrote:
These values are indeed modified inside an ISR, and I do NOT use the volatile keyword in their definition.
I don't get it. You have identified a real problem with variable(s) used in an ISR not being declared volatile. Yet in the face of a known problem you appear to be in an ongoing discussion about how it should work while ignoring this already known problem. DUH!

What happens if you fix the known problem and declare the variable(s) volatile?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Assembly code that generated seems different from C.

if((cpData_OK ^ cpData_OK_DR) != 0xFF)   //Double RAM 
{ 
ProcessorFault(11); 
} 
0000179A:   9100022B    LDS       R16,0x022B     Load direct from data space 
+0000179C:   9110022C    LDS       R17,0x022C     Load direct from data space 
+0000179E:   2701        EOR       R16,R17        Exclusive OR 

+0000179F:   3F0F        CPI       R16,0xFF       Compare with immediate 
+000017A0:   F019        BREQ      PC+0x04        Branch if equal 

If the result after EOR between registers or unsigned char is not equal with 0XFF in C,in assembly the comparison between them expects equality with 0XFF,but this is not an extended part of code to see what is following next.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

That is why C is not assembly language. Even without compiler optimization high level language statements do not always exactly correspond to equivalent assembly code. In this case the C != is the opposite of ==. This would tell me the assembly = compare BREQ PC+4 should skip over the ProcessorFault(11) function call code, therefore only calling ProcessorFault(11) if the C != is true (exactly what it should be). It probably needs the PC+4 bytes to setup the 11 value passed parameter and make the function() call.

So, am I supposed to believe the C ^ exclusive OR which correctly compiles to an assembly EOR is not working or the LDS instructions are broken? I don't think so simply because this AVR silicon has been around way too long for problems of this magnitude to only be noticed today. Do you think that just maybe, possibly, on a wild guess, could it be the variable values stored at 0x022B and 0x022C might not be what the OP thinks they are???? Has anyone pointed out a known problem that could make them not be the expected values :roll:!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Others have pointed out the volatile issue, could there be some case where in some other interrupt, there's an infrequently used exit path that does not restore SREG?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

dbvanhorn wrote:
Others have pointed out the volatile issue...
I was being sarcastic in my last post :). The OP has never said this issue was fixed in the source code and/or never reported any results after the fix was applied and tested. The OP appears to be ignoring fixing this known issue in favor of discussing the I don't think it is a problem because blah blah blah..... Well, blah, blah, blah doesn't cut it in this case or advance solving the problem, especially when fixing the issue is so easy to try out and test.

Any C compiler that fails to save SREG (when needed) has lots more serious problems with the generated code than what the OP reported. I would think such an interrupt would have to have been coded in assembler and linked into the C source, or the C compiler has a big bug. Still, logically how can any useful debugging proceed without first correctly fixing the known issue :wink:?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Mike:

If the ISR (or any non-inline function, for that matter) manages to return without first storing the final results of any computations it has performed on any global variables, into those variables' designated SRAM locations, then the compiler is broken, plain and simple. This is true regardless of whether the variables in question are declared as volatile.

The volatile declaration is all about:
1) Forcing the compiler to save all intermediate steps to SRAM, even before the function has had a chance to return, and
2) Forcing the compiler to reload the current contents out of SRAM every time a function needs to read the current value of that variable, rather than storing it in a register cache. For example, while running a tight loop, the compiler is forbidden to use the simple fact that nothing in the loop makes modifications to a variable as evidence that the variable's contents must therefore remain constant throughout the life of the loop.

If the variables are only ever changed in the ISR and never in the mainline, then case (1) will not be a problem in this case. And if it is a problem, then the compiler is fundamentally broken, and inserting a "volatile" would not do anything to restore my confidence.

The OP has shown a contiguous block of generated assembly code that proves that the variables are being fetched out of SRAM immediately before the XOR is computed -- so I am inclined to believe that case (2) is not a problem in this case either. (Of course, we'd need to check to make sure that the XOR itself is actually being performed each time through the main loop, and there are no conditions where the result of the XOR, itself, is being cached.)

The OP can insert a volatile if he wishes; my crystal ball tells me that it will not generate any appreciable benefit in this particular case.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'd actually like to see the preceding and following few lines of generated code... I believe the key part is missing. Most notably I am not seeing the integer promotion I would expect [though this could have been optimized away]

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is that SRAM location the real permanent home of these variables or is it setup by the compiler (perhaps a temporary software variable stack or some such compiler creation) for internal compiler processes (sorry, I do not know IAR generated code conventions)? For example, WinAVR is not generically broken, but it absolutely fails to keep the correct synchronization between global variable values modified both inside ISRs and outside ISRs, unless these global variables are all declared volatile. Perhaps a WinAVR C optimizer side effect? IAR also has its own aggressive optimizer and it could even be bypassing the int promotion as an optimization. I agree with Glitch, it would be nice to see more.

I'm not saying your crystal ball is wrong, I said trying volatile declarations is the next step. It's not like I'm asking the OP to totally rewrite his code or something burdensome. The OP has shown so little of his actual code there is not much for us to go on. Which is better our endless assumptions about what we are not told by the OP or a simple test the OP could do and settle the question? I have no idea if the OP modifies the variables outside the ISR since he did not say. So, who among us can assume he does not?

For me it all comes down to am I going to believe the AVR EOR is broken or assume it is working, meaning the data is not what the OP thought it was or the compiler is broken and loading bad data...?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well given that it is an explicate address [LDS], it cannot be on the stack, as the compiler cannot predict how big the stack will be each and every time the function is called. So I think it is safe to say that the value is being fetched from it's permanent location in RAM. If this was a stack variable, you would see it as a load with an offset from a base pointer register.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If we rule out the obvious, what are we left with? If we assume the compiler has done it's job correctly and the noted code behaves as it should, then that suggests something outside that code is doing something wrong and it would seem it does it rarely and the values get fixed up - a transient occurence if you will. I would normally suspect a problem with atomicity in this instance. If this is not the case, a critical code review would be a start - look carefully at arrays and ensuring indexes are bound. I would also have the error detect code save the errant values for later inspection.
Whilst I have had dud cpus that worked 'most' of the time, I don't think we've seen enough evidence to implicate a bad instruction execution at the moment. It sounds like a case of dodgy code and whilst we hate to admit it, it is the most likely cause.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Mike B wrote:
It's not like I'm asking the OP to totally rewrite his code or something burdensome. The OP has shown so little of his actual code there is not much for us to go on. Which is better our endless assumptions about what we are not told by the OP or a simple test the OP could do and settle the question?

While I agree that adding the volatile keyword is a worthwhile thing to try, as I mentioned in my original post, this issue only occurs at a customer site, and takes weeks to occur. I am sorry I could not respond back within hours as to whether this suggestion worked. I was not ignoring it.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I mentioned in my original post, this issue only occurs at a customer site, and takes weeks to occur.

I haven't read the rest of this thread so apologies but issues that only happen every few weeks often talk of either (a) electrical noise issue or (b) an atomicity problem.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

dbvanhorn wrote:
Others have pointed out the volatile issue, could there be some case where in some other interrupt, there's an infrequently used exit path that does not restore SREG?

Another good suggestion, however, other than the watchdog ISR, there are no other ISRs enabled in this code. And if the WDT was kicking in, I go into a lock-out loop requiring a reset (and outputting a different diagnostic via the LED).