Another serious reordering problem, misoptimization and gcc

Go To Last Post
20 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I would like to access external memory hidden by internal SRAM. To be able to do that I masked AD15 address line, read a byte, and switch AD15 back to normal address line:

char Read_4kB(size_t adres)
{
	DDRC=0xFF;
	PORTC=0x00;
	XMCRB|=_BV(XMM0);
	char byte=*((char*)(adres | 0x8000));
	XMCRB&=~_BV(XMM0);
	return byte;
}

But let's see generated assembly code:

char Read_4kB(size_t adres)
{
	DDRC=0xFF;
  e0:	2f ef       	ldi	r18, 0xFF	; 255
  e2:	24 bb       	out	0x14, r18	; 20
	PORTC=0x00;			
  e4:	15 ba       	out	0x15, r1	; 21
	XMCRB|=_BV(XMM0);
  e6:	ec e6       	ldi	r30, 0x6C	; 108
  e8:	f0 e0       	ldi	r31, 0x00	; 0
  ea:	20 81       	ld	r18, Z
  ec:	21 60       	ori	r18, 0x01	; 1
  ee:	20 83       	st	Z, r18
	char byte=*((char*)(adres | 0x8000));
	XMCRB&=~_BV(XMM0);
  f0:	20 81       	ld	r18, Z
  f2:	2e 7f       	andi	r18, 0xFE	; 254
  f4:	20 83       	st	Z, r18
  f6:	fc 01       	movw	r30, r24
  f8:	f0 68       	ori	r31, 0x80	; 128
	return byte;
}
  fa:	80 81       	ld	r24, Z
  fc:	08 95       	ret

We can clearly see the misoptimization "“ acces to XMCRB register is done via Z register, LDS R18,0x006C would be faster and better. But it is only minor problem. The serious problem is with char byte=*((char*)(adres | 0x8000)); - it is moved after XMCRB&=~_BV(XMM0); so the read obviously is done from wrong location "“ AD15 line while read is not switched to normal IO pin. How to prevent gcc from reordering? To me it seems to be a serious bug in the compiler.
What is interesting, very similar function:

void Write_4kB(size_t adres, char byte)
{
	DDRC=0xFF;
	PORTC=0x00;			
	XMCRB|=_BV(XMM0);
	*((char*)(adres |0x8000))=byte;
	XMCRB&=~_BV(XMM0);
}

Is compiled ok (without reordering):

void Write_4kB(size_t adres, char byte)
{
	DDRC=0xFF;
  fe:	2f ef       	ldi	r18, 0xFF	; 255
 100:	24 bb       	out	0x14, r18	; 20
	PORTC=0x00;
 102:	15 ba       	out	0x15, r1	; 21
	XMCRB|=_BV(XMM0);
 104:	ac e6       	ldi	r26, 0x6C	; 108
 106:	b0 e0       	ldi	r27, 0x00	; 0
 108:	2c 91       	ld	r18, X
 10a:	21 60       	ori	r18, 0x01	; 1
 10c:	2c 93       	st	X, r18
	*((char*)(adres |0x8000))=byte;
 10e:	fc 01       	movw	r30, r24
 110:	f0 68       	ori	r31, 0x80	; 128
 112:	60 83       	st	Z, r22
	XMCRB&=~_BV(XMM0);
 114:	8c 91       	ld	r24, X
 116:	8e 7f       	andi	r24, 0xFE	; 254
 118:	8c 93       	st	X, r24
}
 11a:	08 95       	ret

Above program is compiled on ATMega128, with WinAVR 20100110.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Then make the access volatile:

char Read_4kB(size_t adres) 
{ 
   DDRC=0xFF; 
   PORTC=0x00; 
   XMCRB|=_BV(XMM0); 
   char byte=*((volatile char*)(adres | 0x8000)); 
   XMCRB&=~_BV(XMM0); 
   return byte; 
}

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why are you turning the the memory i/f on and off?

If you have external memory, you might just as well turn it on at the start of your program, and leave it on.

From memory, the 'hidden' area gets mapped automatically. At least it does with an 8515.

I would agree. avr-gcc appears to generate the sequence in the wrong order.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I tried it with volatile and got this:

char Read_4kB(size_t adres) 
{ 
   DDRC=0xFF; 
  ce:	2f ef       	ldi	r18, 0xFF	; 255
  d0:	24 bb       	out	0x14, r18	; 20
   PORTC=0x00; 
  d2:	15 ba       	out	0x15, r1	; 21
   XMCRB|=_BV(XMM0); 
  d4:	ac e6       	ldi	r26, 0x6C	; 108
  d6:	b0 e0       	ldi	r27, 0x00	; 0
  d8:	2c 91       	ld	r18, X
  da:	21 60       	ori	r18, 0x01	; 1
  dc:	2c 93       	st	X, r18
   char byte=*((volatile char*)(adres | 0x8000)); 
  de:	fc 01       	movw	r30, r24
  e0:	f0 68       	ori	r31, 0x80	; 128
  e2:	80 81       	ld	r24, Z
   XMCRB&=~_BV(XMM0); 
  e4:	9c 91       	ld	r25, X
  e6:	9e 7f       	andi	r25, 0xFE	; 254
  e8:	9c 93       	st	X, r25
   return byte; 
}
  ea:	08 95       	ret

That looks right to me. Notice the use now of X and Z. X->XMCRB, Z->adres|0x8000

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

So the generated code depends on some mysterious gcc behavior. And it seems that it can be ok (like in your compilation) or wrong (like in mine). But obviously it is another reordering problem, fortunately the volatile did the trick  Thanks Clawson.
david.prentice – I’m not turning the memory on anf off, I’m only switching the function of AD15 address line. It need to be 0, when accessing addresses above 0x8000 to get access to memory normally hidden by internal SRAM memory.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Interesting idea - do you access all of your external SRAM through memory I/O routines? It seems to me to be a heavy performance price for accessing an extra 4K, since you would need to examine every external SRAM access for this 4K section. Of course, if you had a 128K external SRAM you would need to do paging anyway, so maybe it's not that bad.

Stu

Engineering seems to boil down to: Cheap. Fast. Good. Choose two. Sometimes choose only one.

Newbie? Be sure to read the thread Newbie? Start here!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Of course no. I have attached 64kB of external SRAM to ATMega128. Because ATMega already has 4kB SRAM, so the lowest 4kB of external SRAM are hidden. To access it I need to temporary disable AD15, so I can control it manually. So only access to the lowest 4kB is realized through this routines. Access to the rest of memory is through normal XMEM interface.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

So the generated code depends on some mysterious gcc behavior. And it seems that it can be ok (like in your compilation) or wrong (like in mine). But obviously it is another reordering problem, fortunately the volatile did the trick  Thanks Clawson.

But 'volatile' isn't just an accident here. Its the same reason why:

PORTB = 0x55;
PORTB = 0xAA;
PORTB = 0x55;

will never be re-ordered by the optimiser - because the PORTB macro is a volatile destination. Volatile generally means "Oi optimiser - sod off". It works to quell reordering for the compiler's benefit as much as it does for making sure variables aren't cached into machine registers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
But obviously it is another reordering problem
But it is not a "reordering problem", it simply has no way of knowing that

   XMCRB|=_BV(XMM0);

has anything at all to do with

   char byte=*((char*)(adres | 0x8000));

And frankly, I think that you are begging the compiler to reorder the statements by putting the variable definition between the two accesses to XMCRB.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok., I have a solution, volatile in his case did the nice job. I think it’s worth to mention about this function of volatile. Typically we only use it in variables shared between main code and interrupts, and personally I forgot that it can be used to prevent reordering.
But still there is a second question – why gcc generates a strange access to XMCRB register through Z register. LDS and STS is far more efficient.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TFrancuz wrote:
LDS and STS is far more efficient.
Is it?

JW

PS. I wonder how the snippet from OP, or the snippet in the datasheet (which does not contain any "volatility" either), would compile in the "preferred compiler" of Atmel (IAR, is it?)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I think it is. Let’s compare – code generated by compiler:

d4:   ac e6          ldi   r26, 0x6C   ; 108 
  d6:   b0 e0          ldi   r27, 0x00   ; 0 
  d8:   2c 91          ld   r18, X 
  da:   21 60          ori   r18, 0x01   ; 1 
  dc:   2c 93          st   X, r18
  e4:   9c 91          ld   r25, X 
  e6:   9e 7f          andi   r25, 0xFE   ; 254 
  e8:   9c 93          st   X, r25

Optimal code:

LDS R18,0x6C
ORI R18,1
STS 0x6C,R18
LDS R25,0x6C
ANDI R25,0xFE
STS 0x6C,R25

Total 8 vs. 6 cycles.Code bloat and unnecessary register usage.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Then you must have a very strange processor, surely not an AVR.
LD, ST, LDS, STS are all 2 cycle instructions. LDS/STS also occupy 2 words of code memory, while LD and ST occupy only 1 word.

Thus, the compiler generated code is 8 words and 12 cycles, while yours is 10 words and 10 cycles.

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

According to AVR instruction set:

LDS Rd,k 16 ≤ d ≤ 31, 0 ≤ k ≤ 127
Words: 1 (2 bytes)
Cycles: 1

The same STS. Probably you consider LDS/STS with 16-bit address, which is unnecessary here, as argument is only 7-bit long.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You are looking at the new copy of the opcode manual aren't you? I actually keep two copies. The original tiny/mega one (0856E-AVR-11/05) and then the later abortion in which Atmel tried (and failed majorly) to incorporate the details for Tiny4/5/9/10 and Xmega as well as the originals - 0856H-AVR-07/09).

If you look in the latter then LDS is confusing. For a start there are two entries - "LDS" and "LDS (16bit)". For the first it says:

Words: 2 (4 bytes)
Cycles: 2
Cycles XMEGA: 2 If the LDS instruction is accessing internal SRAM, one extra cycle is inserted.

for the other it says:

Words: 1 (2 bytes)
Cycles: 1
Note:Registers r0..r15 are remapped to r16..r31.

This latter one *ONLY* applies to the "brain dead" Tiny4/5/9/10.

Going back to the original opcode manual it has only one entry that says:

Words: 2 (4 bytes)
Cycles: 2

THAT is correct for tiny and mega. (and the mega128 in particular being discussed in this thread).

I can post a copy of the "old" (but more reliable) opcode manual if it would help to avoid the braindead and Xmega nonsense.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks Clawson for clarifying things. So we have 12 cycles/10words vs. 10cycles/10words. Seems to be very comparable, but… compiler version destroys content of two additional registers, and what is more important it is X,Y or Z register which is widely used. So after that its content must be reloaded or restored. So it generates some additional instructions in adjacent code. So my code gives us significant improvements in memory and cycle terms.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Suggest you read this:

http://www.nongnu.org/avr-libc/u...

So it's only Y (the frame pointer), not X or Z that must be preserved.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok., but still 16-bit pointer must be reloaded, so At least we loose 2 words/2 cycles. Of course not in every situation but in general it should give us some memory and time savings.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TFrancuz wrote:
Thanks Clawson for clarifying things. So we have 12 cycles/10words vs. 10cycles/10words.
It's 12/8 vs. 10/10.

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But if you store it in a temp the code should be faster!

   temp=XMCRB;
   XMCRB=temp | _BV(XMM0); 
   char byte=*((char*)(adres | 0x8000)); 
   XMCRB=temp; 

I just made a smalle test

15:       temp=a;
+00000062:   91900061    LDS     R25,0x0061       Load direct from data space
16:       a=temp | 0x01;
+00000064:   2F89        MOV     R24,R25          Copy register
+00000065:   6081        ORI     R24,0x01         Logical OR with immediate
+00000066:   93800061    STS     0x0061,R24       Store direct to data space
17:       b=100;
+00000068:   E684        LDI     R24,0x64         Load immediate
+00000069:   93800060    STS     0x0060,R24       Store direct to data space
18:       a=temp;
+0000006B:   93900061    STS     0x0061,R25       Store direct to data space

So if you not using all the reg. it should work.

edit: changed to a char version