startup code reads SP then subtracts 0x40?

I'm looking at the disassembled startup code for C code compiled for an ATmega32U4, using AS4. For some reason, SPH and SPL are read, decremented by 0x40 and then written back to SP. I don't understand why; it leaves 0x40 bytes of unused stack space. I noticed this while playing around with stack monitoring code, thought it was an odd thing to do, and was hoping someone could explain the reason behind it.

IN        R28,0x3D       In from I/O location
IN        R29,0x3E       In from I/O location
SUBI      R28,0x40       Subtract immediate
SBCI      R29,0x00       Subtract immediate with carry
IN        R0,0x3F        In from I/O location
CLI                      Global Interrupt Disable
OUT       0x3E,R29       Out to I/O location
OUT       0x3F,R0        Out to I/O location
OUT       0x3D,R28       Out to I/O location

Do you have 64 bytes of local variables in main?

Either it is home-brew startup code, in which case only you know the answer to the question.

Or it is not startup code at all; it certainly isn't the common startup code from avr-libc. There is no reason to CLI in startup code.

Looks much more like a function prologue.

Read the assembler code, e.g. preserved with -save-temps in *.s, instead of a disassembly to see more.

avrfreaks does not support Opera. Profile inactive.

I think he means "prologue" not "startup code", hence the confusion. As snigelen suggests this is normal stack frame setup for 64 bytes of automatics:

int main(void) {
	int numbers[16];
	char buff[32];
	
	for (int i=0; i < 32; i++) { buff[i] = PINB; }
	for (int i=0; i < 16; i++) { numbers[i] = buff[i] + (buff[i+16]<<8); }
	for (int i=0; i < 16; i++) { PORTB = numbers[i] >> 5; }

	while(1) {};
}

.global	main
	.type	main, @function
main:
//==> int main(void) {
	push r28	 ; 
	push r29	 ; 
	in r28,__SP_L__	 ; 
	in r29,__SP_H__	 ; 
	subi r28,lo8(-(-64))	 ; ,
	sbci r29,hi8(-(-64))	 ; ,
	in __tmp_reg__,__SREG__
	cli
	out __SP_H__,r29	 ; 
	out __SREG__,__tmp_reg__
	out __SP_L__,r28	 ; 
/* prologue: function */
/* frame size = 64 */
/* stack size = 66 */
.L__stack_usage = 66

or to put it another way:

00000092 <main>:
#include

int main(void) {
  92:	cf 93       	push	r28
  94:	df 93       	push	r29
  96:	cd b7       	in	r28, 0x3d	; 61
  98:	de b7       	in	r29, 0x3e	; 62
  9a:	c0 54       	subi	r28, 0x40	; 64
  9c:	d0 40       	sbci	r29, 0x00	; 0
  9e:	0f b6       	in	r0, 0x3f	; 63
  a0:	f8 94       	cli
  a2:	de bf       	out	0x3e, r29	; 62
  a4:	0f be       	out	0x3f, r0	; 63
  a6:	cd bf       	out	0x3d, r28	; 61
	int numbers[16];
	char buff[32];

or even:

int main(void) {
 PUSH R28		Push register on stack 
 PUSH R29		Push register on stack 
 IN R28,0x3D		In from I/O location 
 IN R29,0x3E		In from I/O location 
 SUBI R28,0x40		Subtract immediate 
 SBCI R29,0x00		Subtract immediate with carry 
 IN R0,0x3F		In from I/O location 
 CLI 		Global Interrupt Disable 
 OUT 0x3E,R29		Out to I/O location 
 OUT 0x3F,R0		Out to I/O location 
 OUT 0x3D,R28		Out to I/O location 

As we all know, the CLI in that one makes the two-byte SP update atomic.
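
To make that concrete, here is a small host-side C model (not AVR code; the struct, the function names and the addresses are all made up for illustration) of what SP looks like when only SPH has been written. With the old SP at 0x0A10 and a 64-byte frame, the half-updated value is 0x0910, which is neither the old nor the new stack pointer, and that is the value an ISR would push onto if it fired between the two OUTs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: SP seen as the two 8-bit I/O registers SPH and SPL. */
typedef struct {
    uint8_t sph;
    uint8_t spl;
} sp_regs;

static uint16_t sp_read(const sp_regs *r)
{
    return (uint16_t)(((uint16_t)r->sph << 8) | r->spl);
}

/* Write only the high byte, as the first OUT of the pair does. */
static void sp_write_high(sp_regs *r, uint16_t new_sp)
{
    r->sph = (uint8_t)(new_sp >> 8);
}

/* Write only the low byte, as the second OUT of the pair does. */
static void sp_write_low(sp_regs *r, uint16_t new_sp)
{
    r->spl = (uint8_t)(new_sp & 0xFF);
}

/* Returns the SP value that is visible between the two writes. */
static uint16_t torn_sp(uint16_t old_sp, uint16_t frame_size)
{
    sp_regs r = { (uint8_t)(old_sp >> 8), (uint8_t)(old_sp & 0xFF) };
    uint16_t new_sp = (uint16_t)(old_sp - frame_size);
    uint16_t between;

    sp_write_high(&r, new_sp);
    between = sp_read(&r);            /* an ISR firing here would use this */
    sp_write_low(&r, new_sp);
    assert(sp_read(&r) == new_sp);    /* after both writes SP is consistent */
    return between;
}
```

Calling torn_sp(0x0A10, 0x40) gives 0x0910. When the subtraction does not borrow into the high byte the intermediate value is simply the old SP, so the problem only shows up for some stack positions, which is exactly the kind of bug you don't want to chase.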

There is -mno-interrupts, but there are hardly any use cases for it because most applications /do/ use IRQs.

On xmega, that CLI is gone:

void g (void)
{
    extern void f (void*);
    char c[64];
    f (c);
}
g:
	push r28
	push r29
	in r28,__SP_L__
	in r29,__SP_H__
	subi r28,64
	sbc r29,__zero_reg__
	out __SP_L__,r28
	out __SP_H__,r29
/* prologue: function */

Quote:
I think he means "prologue" not "startup code", hence the confusion
You are correct. Sorry.

I do have 64+ bytes of local variables in main. So, why does that make this the normal stack frame setup if you have 64 bytes of automatics? I'm not questioning anyone's wisdom, just want to know how/why it is used. Seems like that's 64 bytes that never get used in the stack, right? Watching in the debugger, they never get used or updated. Nothing got pushed there, it just got skipped over.

You do know how automatics work, don't you? Study my little example above:

   for (int i=0; i < 32; i++) { buff[i] = PINB; }
   for (int i=0; i < 16; i++) { numbers[i] = buff[i] + (buff[i+16]<<8); }
   for (int i=0; i < 16; i++) { PORTB = numbers[i] >> 5; }

Where in memory do you think those 32 bytes in buff[] or those 32 bytes in numbers[] are being stored if not in the 64 bytes of space that have been reserved on the stack to hold them by the code sequence we're discussing here?

main() is no different from any other C function. If you create automatics then, with the GCC compiler, every function gets a prologue on entry that first PUSHes any registers that must be preserved, then reads SP, subtracts however many bytes of automatics are needed, and writes the result back. What's more, it does this in such a way that the Y register pair (that is, R29:R28) holds the base address of the created data space. The Y register is known as the stack frame pointer and the space that's been opened up on the stack is the "stack frame". Often the generated code will use LDD Rd,Y+n and STD Y+n,Rr style opcodes to read/write the n'th byte of data within the stack frame.
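
If it helps to see the bookkeeping, here is a rough host-side C model of that prologue (a sketch only; cpu_model, push, prologue, local_addr and the RAM size are names and values I made up, not anything from avr-libc). It assumes the AVR convention that SP points at the next free byte, so after the prologue the frame occupies Y+1 up to Y+frame_size.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_SIZE 0x0B00u   /* hypothetical: enough to cover addresses up to 0x0AFF */

typedef struct {
    uint8_t  ram[RAM_SIZE];
    uint16_t sp;   /* stack pointer: address of the next free byte */
    uint16_t y;    /* frame pointer, i.e. R29:R28 */
} cpu_model;

/* PUSH stores at SP and then decrements SP. */
static void push(cpu_model *c, uint8_t v)
{
    c->ram[c->sp--] = v;
}

/* Model of: push r28 / push r29 / in / subi / sbci / out / out */
static void prologue(cpu_model *c, uint16_t frame_size)
{
    push(c, (uint8_t)(c->y & 0xFF));          /* push r28 (caller's Y, low)  */
    push(c, (uint8_t)(c->y >> 8));            /* push r29 (caller's Y, high) */
    c->y = (uint16_t)(c->sp - frame_size);    /* Y = SP - frame_size         */
    c->sp = c->y;                             /* write Y back to SP          */
}

/* Address of the n'th byte of the frame, as reached via Y+n. */
static uint16_t local_addr(const cpu_model *c, uint16_t n)
{
    assert(n >= 1);   /* Y+0 is the free byte just below the frame */
    return (uint16_t)(c->y + n);
}
```

Starting from a made-up SP of 0x0AFD, prologue(&c, 64) leaves SP and Y at 0x0ABB, and the 64 bytes for buff[] and numbers[] sit at 0x0ABC to 0x0AFB, directly below the two saved registers. Nothing is "pushed" into that gap; the code reaches it through Y instead.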

If I'd removed the while(1) at the end of my little example so that main() could return to where it was called from, then the function would also have an epilogue:

/* epilogue start */
	subi r28,lo8(-(64))	 ; ,
	sbci r29,hi8(-(64))	 ; ,
	in __tmp_reg__,__SREG__
	cli
	out __SP_H__,r29	 ; 
	out __SREG__,__tmp_reg__
	out __SP_L__,r28	 ; 
	pop r29	 ; 
	pop r28	 ; 
	pop r17	 ; 
	pop r16	 ; 
	pop r15	 ; 
	pop r14	 ; 
	ret

Here you can see the SP being adjusted back by the 64 bytes that had been used for buff[] and numbers[] as the stack frame is destroyed.
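
Note the lo8(-(64)) / hi8(-(64)) operands: the AVR has no add-immediate for a single 8-bit register, so the compiler adds 64 by subtracting -64 with the same SUBI/SBCI pair. A quick host-side C sketch of the byte-wise arithmetic (subi_sbci is my own name for it, and the SP values in the comments are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Model of "subi lo,lo8(k)" followed by "sbci hi,hi8(k)" on a register pair. */
static uint16_t subi_sbci(uint16_t reg, int16_t k)
{
    uint8_t lo  = (uint8_t)(reg & 0xFF);
    uint8_t hi  = (uint8_t)(reg >> 8);
    uint8_t klo = (uint8_t)((uint16_t)k & 0xFF);   /* lo8(k) */
    uint8_t khi = (uint8_t)((uint16_t)k >> 8);     /* hi8(k) */
    uint8_t borrow = (uint8_t)(lo < klo);          /* carry flag after SUBI */
    uint16_t result;

    lo = (uint8_t)(lo - klo);                      /* SUBI */
    hi = (uint8_t)(hi - khi - borrow);             /* SBCI */
    result = (uint16_t)(((uint16_t)hi << 8) | lo);

    /* Cross-check against plain 16-bit arithmetic. */
    assert(result == (uint16_t)(reg - (uint16_t)k));
    return result;
}
```

So subi_sbci(0x0AFB, 64) gives 0x0ABB (the prologue direction) and subi_sbci(0x0ABB, -64) gives 0x0AFB again (the epilogue direction), with lo8(-64) = 0xC0 and hi8(-64) = 0xFF doing the work.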

True, main() doesn't normally return, but buff[] and numbers[] still need to be created somewhere in memory even if they'll never be destroyed. If the function had been called foo() and was a normal, called function then perhaps it would have made more sense? Obviously a called function that creates automatics has to destroy them on exit or there'd be a huge memory leak in the system after a few variable-using functions had been called a few times.

If you have a smaller stack frame (up to 6 bytes) the compiler is clever enough to do the allocation another way. If you have this function

int main(void)
{
    volatile char a = 1;
}

you get this prologue

	push r28
	push r29
	push __tmp_reg__
	in r28,__SP_L__
	in r29,__SP_H__

where the "push __tmp_reg__" is the allocation for a on the stack. The assignment becomes

	ldi r24,lo8(1)
	std Y+1,r24

So a is stored at Y+1. The stack is restored in the epilogue with

	pop __tmp_reg__
	pop r29
	pop r28
	ret

For two bytes

int main(void)
{
    volatile char a = 1, b = 2;
}

the prologue is

	push r28
	push r29
	rcall .
	in r28,__SP_L__
	in r29,__SP_H__

The allocation for a and b is done with "rcall .", which simply pushes two bytes on the stack and "calls" the next instruction. The assignments become

	ldi r24,lo8(1)
	std Y+2,r24
	ldi r24,lo8(2)
	std Y+1,r24

so a is now at Y+2 and b at Y+1. In the epilogue there is

	pop __tmp_reg__
	pop __tmp_reg__
	pop r29
	pop r28
	ret

two pop __tmp_reg__ that cancel out the fake "rcall .".

For three bytes you'll get one "rcall ." and one "push __tmp_reg__" in the prologue and three "pop __tmp_reg__" in the epilogue.
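
The pattern in those three cases can be written down as a tiny host-side C sketch. This only generalises the 1, 2 and 3 byte examples above; it is not GCC's actual selection logic, the names are mine, and it assumes a device where "rcall ." pushes a 2-byte return address.

```c
#include <assert.h>

/* Hypothetical summary of a small-frame prologue/epilogue. */
typedef struct {
    unsigned rcalls;   /* number of "rcall ." in the prologue (2 bytes each)      */
    unsigned pushes;   /* number of "push __tmp_reg__" in the prologue (1 byte)   */
    unsigned pops;     /* number of "pop __tmp_reg__" in the epilogue (1 byte)    */
} small_frame_plan;

/* Assumed pattern: as many 2-byte rcalls as fit, one push for an odd byte. */
static small_frame_plan plan_small_frame(unsigned n)
{
    small_frame_plan p;

    p.rcalls = n / 2;
    p.pushes = n % 2;
    p.pops   = n;      /* every allocated byte is popped individually */
    assert(2u * p.rcalls + p.pushes == n);
    return p;
}

/* Bytes of stack the prologue part of the plan allocates. */
static unsigned bytes_allocated(small_frame_plan p)
{
    return 2u * p.rcalls + p.pushes;
}
```

For n = 3 this gives one rcall, one push and three pops, matching the description above. Where the compiler gives up on this and goes back to the SP arithmetic is something to check in your own -save-temps output rather than take from this sketch.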

Try 4, 5, 6 and 7 bytes and see that the compiler changes strategy at 6 and 7 bytes.

Note that if you have non-volatile variables, then some or all of them will be located in registers and not on the stack. I used volatile here to force the compiler not to optimize them away and to store them on the stack.

I really appreciate the replies. It will be a little while before I can sit down with the debugger and pen/paper. Just wanted to say thanks.