Stack overflow? Stack being overwritten?

Go To Last Post
15 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Freaks,

As you may remember, last year I was working on a liquid dispensing unit project late last year. That was put on hold for 6 months, because the company (owned by my dad's friend) did not have enough time between filling large orders to discuss and develop. Finally, the orders have died down enough to permit me to continue working with them and the code.

The problem are is in the following routine:

unsigned char SelectLiquid(char Mode)	// 0 = Delete Liquid, 1 = Select Liquid
{
	uint8_t CurrLiquid   = 1;	// Variable to hold the currently selected liquid
	uint8_t SelectionDir = 1;	// Flag to hold the liquid selection direction (< or >)
	uint8_t LiquidsExist = 0;	// Flag, 1 if there is at least 1 liquid in memory

	if(!(Mode))	// If in delete mode, ask for passkey
	{
		if(!(CheckPassKey())) { return 0; }	// If an incorrect passkey is entered, exit the function
	}

	while(1)	// Eternal loop, takes the place of recursive programming for refreshing the LCD
	{
		Macro_WrapHigh(CurrLiquid, 12);	// Wrap if value is 12
		Macro_WrapLow(CurrLiquid, 11);	// Wrap if value is 0

		uint8_t CurrLiqZeroIndex = (CurrLiquid - 1);									// Variable holds corrected value for current liquid (starts at 0 instead of CurrLiquid's 1)
		char TempByte = eeprom_read_byte(&UsedLiquidIDs[CurrLiqZeroIndex]);	// Load liquid exists byte from EEPROM into a temp variable

		if(TempByte != BlankEB)	// Liquid does exist (EEPROM byte is not 0xFF which means blank/no data)
		{
			tlcd_cls(); 											// Clear the LCD
			tlcd_write_mem_string(Macro_LCDMessage(12));	// Write to the top of the LCD - this is inside the loop because it must be
																		// persistent to get rid if the "Really Delete?" text after/if it is shown

			LiquidsExist = 1;	// Flag that there is at least 1 liquid in memory

			WriteCurrLiquid(CurrLiqZeroIndex, 2, 1);	// Get and write the name of the currently selected liquid to the LCD
			tlcd_goto(2, 15);									// Go to the second last character of the bottom line
			tlcd_write_string("<>");						// Show the <> chars onto the bottom-right of the LCD

			ScanKeyPad();	// Function to wait until a key is pressed, and store it in the global variable "ScannedKey"

			if(ScannedKey == KEY_LEFT)  { CurrLiquid--; SelectionDir = 0; }	// If < key pressed, decrement the selected liquid variable
			if(ScannedKey == KEY_RIGHT) { CurrLiquid++; SelectionDir = 1; }	// If > key pressed, increment the selected liquid variable
			if(ScannedKey == KEY_STOP)  { return 0; }								// If stop key pressed, return to main menu		

			if(ScannedKey == KEY_START)	// The start key accepts the currently selected liquid
			{
				if(Mode) { return CurrLiquid; }	// Select Liquid mode, return selected liquid and exit function

				// Delete liquid mode:

				tlcd_cls();												// Clear the LCD
				tlcd_write_mem_string(Macro_LCDMessage(11));	// Write "Really Delete?" to the LCD
				WriteCurrLiquid(CurrLiqZeroIndex, 2, 1);		// Get and write the name of the currently selected liquid back onto the LCD
				Beep(1);													// Warning beep for 1/2 sec
				
				tlcd_goto(1, 1);	// Just return to top-left of LCD but don't clear it (keep the bottom line as it is)

				while(1)	// Another eternal loop, only for handling the delete confirm/cancel
				{
					ScanKeyPad();	// Function to wait until a key is pressed, and store it in the global variable "ScannedKey"	

					if(ScannedKey == KEY_STOP)		// Cancel Liquid data delete
					{
						tlcd_write_mem_string(Macro_LCDMessage(15));	// Show "Del Cancelled:" message onto the LCD
						Beep(0);		// Make a beeping sound and wait 2 seconds (approx)
						break;		// Back to delete liquid
					}

					if(ScannedKey == KEY_START)	// Confirm Delete
					{
						eeprom_write_byte(&UsedLiquidIDs[CurrLiqZeroIndex], BlankEB);	// Clear the Used ID for the current liquid in EEPROM				
						tlcd_write_mem_string(Macro_LCDMessage(16));	// Show "Del Confirmed:" message onto the LCD
						Beep(0);		// Make a beeping sound and wait 2 seconds (approx)
						return 0;	// Back to main menu
					}
				}
			}

			if(ScannedKey == KEY_ENTER)	// Enter key pressed, display liquid info
			{
				tlcd_cls();												// Clear the LCD
				tlcd_write_mem_string(Macro_LCDMessage(28));	// Write "Info:" text to the LCD
				WriteCurrLiquid(CurrLiqZeroIndex, 1, 7);		// Write the name of the selected liquid to the top of the LCD
				tlcd_goto(2, 1);										// Go to the bottom of the LCD
				tlcd_write_string("Ml/s: ");						// Write ml/s text to the bottom of the LCD

				uint32_t AmountData = 0;
				char AmountStr[6]   = "";
				
				eeprom_read_block((void *)&AmountData, &LiquidData[CurrLiqZeroIndex], SizeULong);	// Get ml/s data from EEPROM for the current liquid

				itoa((AmountData >> FixedPointBits), AmountStr, 10);	// Convert the whole number ml/s to a string
				tlcd_write_string(AmountStr);									// Write the whole number ml/s data to the LCD
				tlcd_write_byte(DATA, '.');									// Write the decimal character onto the LCD				
				itoa((AmountData & 0xFFFF), AmountStr, 10);				// Convert the decimal number ml/s data to the LCD
				tlcd_write_string(AmountStr);									// Write the decimal number ml/s data to the LCD

				while(1)				// Eternal Loop
				{
					ScanKeyPad();	// Scan the keypad, wait until a key is pressed (or pump status changes) and store it in the global variable "ScannedKey"

					if((ScannedKey == KEY_START) || (ScannedKey == KEY_STOP))	{ break; }	// Either Start or Stop key pressed, exit loop and return to liquid selection
				}
			}
		}
		else	// Liquid does not exist
		{
			SelectionDir ? CurrLiquid++ : CurrLiquid--;		// Select the next liquid depending on cycle direction

			if((CurrLiquid == 12) && !(LiquidsExist))			// No liquids in memory after cycling through all of them in either direction
			{
				tlcd_cls(); 											// Clear the LCD
				tlcd_write_mem_string(Macro_LCDMessage(5));	// Write "No Liquids" message to the LCD		
				tlcd_goto(2, 1);										// Go to the bottom-left of the LCD
				tlcd_write_mem_string(Macro_LCDMessage(6));	// Write "In Memory!" to the bottom of the LCD
				Beep(0); 												// Make a beeping sound and wait 2 seconds (approx)

				return 0;
			}
		}
	}
}

This routine loads liquids from EEPROM, and (depending on the flag parameter) return the selcted liquid once the ENTER key is pressed, or delete it. This works great, except for the deletion. When the above routine is executed via my main menu with mode set to 0 (delete), everything works fine - I can scroll through avaliable liquids and use the ENTER key to view their stats or the STOP key to return to the main menu. The problem lies when I press the ENTER key. As you would expect, the correct "Confirm Delete?" message is displayed on the LCD. Pressing START deletes the liquid, but then the whole device resets. Same for the STOP button, the "Delete Cancelled" message is displayed and then it resets.

I can hear the beeps, so the problem is when either button code tries to "Return 0" to the main menu. Strangly this is not a problem elsewhere in the same routine, "Return 0" works perfectly. Does anyone have an explanation and/or a fix?

**** Stats ****:
AT90S8535 @ 8MHz
AVR-GCC

All the LCD strings are stored in the program memory and are read into the SRAM. Could this be overwriting the stack pointer for the main menu? Seems possible, but since the stack is written bottom->top, the string would have to overright the entire stack to cause a problem. That, and I have 512 bytes of SRAM to play with. The problem occurs even when I restart the unit and choose delete liquid immediatly (thus minimising stack size).

I'm stumped so i'm turning this over to the experts.
- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Additional: Notice i've used "return" for one button and "break" for the other. Both cause problems; the break one breaks back into the select liquid loop correctly but then the device resets when I try to go back the the main menu.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Dean,

You may be correct about the stack being overwritten.
Could you please do the following:
1.
Build your project to generate .map file (-Wl,-Map=Your_Project.map)
2.
avr-objdump -h Your_Project.elf > Your_Project.seg
3.
zip Your_Project.map + Your_Project.seg
4.
Attach the created zip file to your next post

Best Regards,
Michal

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm pretty sure it is the stack, it's just suprising because I have so much SRAM and not that many stack levels (I believe).

In any case, .zip file will the files you requested is attached.

- Dean :twisted:

Attachment(s): 

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A sidenote: is there a way I can progmatically see the stack size? I want to check to see that there are no leaks in my code by sending the current stack size through the UART.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Dean,

Quote:
is there a way I can progmatically see the stack size?
No, you can not. However, you can see the Stack Pointer and calculate the stack size. When doing that be careful not to use any function calls, as they will change the Stack Pointer. Use the macros instead. On at90s8535 the macro should look something like this:

#define SEND_SP while ((USR & 32) == 0); \
                UDR = SPH;               \
                while ((USR & 32) == 0); \
                UDR = SPL

In the next 2 days I have to do some traveling. I will look into the posted files after that.

Michal

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I have so much SRAM and not that many stack levels

A glance at the .map indicates that your globals/statics take up about half of
your SRAM. This should leave about 250 bytes for stack, which should be
plenty unless you have many (admittedly vague number) local arrays.

Perhaps the easiest way to damage the stack is to overrun a local array.
It's "easiest" since you don't have to overrun by very much to destroy your
(saved) return address. I don't see that happening here, but maybe one of
the subfunctions that's being called does this?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It's wierd. That routine is absolutly rock-solid, until I enter the following section:

tlcd_cls();												// Clear the LCD
				tlcd_write_mem_string(Macro_LCDMessage(11));	// Write "Really Delete?" to the LCD
				WriteCurrLiquid(CurrLiqZeroIndex, 2, 1);		// Get and write the name of the currently selected liquid back onto the LCD
				Beep(1);													// Warning beep for 1/2 sec
				
				tlcd_goto(1, 1);	// Just return to top-left of LCD but don't clear it (keep the bottom line as it is)

				while(1)	// Another eternal loop, only for handling the delete confirm/cancel
				{
					ScanKeyPad();	// Function to wait until a key is pressed, and store it in the global variable "ScannedKey"	

					if(ScannedKey == KEY_STOP)		// Cancel Liquid data delete
					{
						tlcd_write_mem_string(Macro_LCDMessage(15));	// Show "Del Cancelled:" message onto the LCD
						Beep(0);		// Make a beeping sound and wait 2 seconds (approx)
						break;		// Back to delete liquid
					}

					if(ScannedKey == KEY_START)	// Confirm Delete
					{
						eeprom_write_byte(&UsedLiquidIDs[CurrLiqZeroIndex], BlankEB);	// Clear the Used ID for the current liquid in EEPROM				
						tlcd_write_mem_string(Macro_LCDMessage(16));	// Show "Del Confirmed:" message onto the LCD
						Beep(0);		// Make a beeping sound and wait 2 seconds (approx)
						return 0;	// Back to main menu
					}
				}
			}

Somthing in there is causing a major problem. It can't be the keypad scanning routine, and the WriteCurrLiquid works fine everywhere else - including elsewhere in the SelectLiquid routine. I think i'll do an experiment; i'll put a "return 0;" command in after the "tlcd_goto(1, 1);" and before the second while(). That should narrow down the glitching area.

Most likely it is just insufficient SRAM. I'm using an AT90S8535 (will be a MEGA8535 for production) but may upgrade to the MEGA16 for the double SRAM and more codespace.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You mentioned staging strings Flash->SRAM; is that what happens in
tlcd_write_mem_string, e.g.? Are you guarding against overrun as you
do that?

Keep in mind that for a symptom like this the fact that "function X works
everywhere else" is evidence, but not proof. If something is whacking
(quasi-)random storage, sometimes it gets away with it and sometimes
it doesn't.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry to ask in another thread but there is something mckenney told that maybe solves my problem. Can an overrun of an array damage the stack?(Mine is global). Cause i have strange stucks of my program and didn't know what it might be?
Kostas

It's better to keep your mouth shut and think you a fool, than open it and move out the doubts!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Globals/statics are allocated down at the low-addressed section of memory.
The stack grows from the high addresses down towards the low addresses.
In between is the stack's "slop" space.

Starting from the address of a global array, you have to overrun pretty far --
through the "slop" space -- to start overwriting the stack, but it's certainly
do-able, particularly if an array index goes "wild". (By contrast, for a local
array being off-by-1 may be enough to walk on something important in
the stack.)

At the risk of drifting off into "War Stories": Secondary and Tertiary effects
are also possible: Damaged local index -> Overrun global array -> Damaged
stack variable -> ... where it's the last link in the chain that actually produces
the symptom. (I've written "causal-chain" analyses that went on for pages.)

But yes: It's all the same memory, so anything can happen.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Here's the "writecurrliquid" routine. Looking back at it, I feel sure the problem exists in here.

void WriteCurrLiquid(unsigned char CurrLiquid, unsigned char y, unsigned char x)
{
	char NameBuffer[11] = "";	// Make a new array to hold the liquid name as it is read from EEPROM

	eeprom_read_block((void *)&NameBuffer, &LiquidNames[CurrLiquid], 10);

	tlcd_goto(y, x);						// Go to the place on the LCD specified in the sub parameters
	tlcd_write_string(NameBuffer);	// Write the liquid name onto the LCD
}

Would an invalid array index passed as the "CurrLiquid" parameter cause a) the 11-byte buffer to fill the entire SRAM and thus overwrite the stack, but still allow the main code to run correctly? This sounds strange because if the problem existed here it would overwrite the function's return pointer and the program would fail. As important information, the absolute maximum liquid name is 10 characters (like convention dictates i've made the buffer to hold it 11 bytes).

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

abcminiuser wrote:
Would an invalid array index passed as the "CurrLiquid" parameter cause a) the 11-byte buffer to fill the entire SRAM and thus overwrite the stack, but still allow the main code to run correctly?

I doubt it. An invalid array index would cause you to read the wrong EEPROM locations but will still only write 10-bytes to NameBuffer.

I think you may want to add:

NameBuffer[ 10 ] = 0;

after the call to eeprom_read_block, however. My guess is that tlcd_write_string is looking for a zero terminated string. Since you're only reading 10-bytes from EEPROM, unless one of those 10-bytes is a zero, you're not going to get a zero terminated string.

Don

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I put your line in, DonBlake as a precaution but so far that hasn't given me any problems. The variable passed as the CurrChannel parameter is EXACTLY the same as the other calls earlier in the function so that discounts that, I suppose. This really IS a mystery.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Gaaah! Found the &%@&%*&^$##@ cause - as always my stupidity :(.

I placed all the strings in flash a while ago to save SRAM. I then made a custom routine to read the flash strings into a buffer, and then write that buffer to the LCD. Because my LCD is 16x2, I made the buffer length long enough to hold an entire line of data; 16 bytes.

Anyone else see a problem here? I twigged when the system restarted after I made a small change to one of the strings. Turns out I forgot to accomodate the extra null terminator byte of the string, so each time a very long string - of which there was only two and one of those was the "Confirm Delete?" string - the termination byte would be overwritten. The program would work ok, but little did I know that behind the scenes the routines had overwritten my entire SRAM after not finding the required termination byte. Changing the buffer length to 17 bytes solved my problem.

Ah well, file that under "dumb mistakes". Look like i'll need anither mental filing cabinet to hold the entirety of that metaphorical file. *Sigh*.
- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!