Efficient Uint16 return value GCC

Go To Last Post
16 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm generating a CRC by passing the initial value, length of array and a pointer to the array then returning the CRC from the CRC generator module which is split across 4 8 bit wide registers in memory. I'm getting some differing output code results when returning the CRC value (16 bit):

1

	return CRC.CHECKSUM1 << 8 | CRC.CHECKSUM0;
    5132:	30 91 d5 00 	lds	r19, 0x00D5
    5136:	20 91 d4 00 	lds	r18, 0x00D4
    513a:	93 2f       	mov	r25, r19
    513c:	80 e0       	ldi	r24, 0x00	; 0
    513e:	ac 01       	movw	r20, r24
    5140:	42 2b       	or	r20, r18
    5142:	9a 01       	movw	r18, r20
}
    5144:	c9 01       	movw	r24, r18
    5146:	08 95       	ret

2

	uint16_t result;
	uint8_t * res8 = (uint8_t*)&result;
	
	res8[0] = CRC.CHECKSUM0;
    513e:	80 91 d4 00 	lds	r24, 0x00D4
    5142:	89 83       	std	Y+1, r24	; 0x01
	res8[1] = CRC.CHECKSUM1;
    5144:	90 91 d5 00 	lds	r25, 0x00D5

	return result;

3

	uint8_t res8[2];
	uint16_t * result = (uint16_t*)&res8;
	
	res8[0] = CRC.CHECKSUM0;
    5132:	80 91 d4 00 	lds	r24, 0x00D4
	res8[1] = CRC.CHECKSUM1;
    5136:	90 91 d5 00 	lds	r25, 0x00D5
	
	return *result;
}
    513a:	08 95       	ret

Easily 3 is the best efficiency but a round the houses way of doing it. is there an expression like in 1 that would produce code like 3?

I'm not understanding why Y is being written to in 2 either.

The function looks like this:

uint16_t
calcCRC_CCITT(uint16_t crc, uint8_t len, uint8_t * data)

I've just put the code that matters to try and make it readable.

I'm trying the GCC compiler that comes with studio 6.1 and haven't run this through WINAVR yet.

Apologies if this has been answered before I did a quick search but did not return anything that seemed to fit.

Thanks.

John.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I'm not understanding why Y is being written to in 2 either.

Y is the stack frame pointer. It's used to access locals that have been created on the stack.

BTW if I write:

uint16_t
calcCRC_CCITT(uint16_t crc, uint8_t len, uint8_t * data) {
	 return 0x1234;
}

the compiler generates:

.global	calcCRC_CCITT
	.type	calcCRC_CCITT, @function
calcCRC_CCITT:
//==>  {
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
//==>  }
	ldi r24,lo8(52)
	ldi r25,lo8(18)
	ret

that's pretty much what you'd hope for isn't it?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

CRC appears to be global or static. Why?
In case 1, optimization appears to be off.
In case 2, result appears to be local and optimized away. res8 should have been.
In case 3, result and res8 appear to be local and optimized away.

I would have expected case 1 to do what you want.
try return *(uint16_t)&CRC.CHECKSUM0;
Unlike case 1, it is endian-speecific.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why is it so important to you to save a few microseconds?

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Koshchi wrote:
Why is it so important to you to save a few microseconds?
Maybe it's not the microseconds.
Maybe it's the ugliness of the asm.
In that case, OP should probably stick with CRC.CHECKSUM1 << 8 | CRC.CHECKSUM0 .
Better ugly result than ugly source.
Try to use expressions that are obviously correct.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Doesn't avr-gcc come with library functions in the first place?

I presume that they work.
I bet they are fairly efficient.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This shortcoming is known as PR41076, see

http://gcc.gnu.org/PR41076

If you want to stay informed about its progress, CC to that report.

In almost all cases, the generated code is not a real problem and just offends the assembler reader's aesthetic feelings.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
This shortcoming is known as PR41076, see

http://gcc.gnu.org/PR41076

If you want to stay informed about its progress, CC to that report.

In almost all cases, the generated code is not a real problem and just offends the assembler reader's aesthetic feelings.

Even in cases where one cares little about beauty or speed, the code can be a problem.
If one is trying to find a bug, part of that search might be examining the asm to see whether it does what it is supposed to.
Fat code requires more examining than would otherwise be necessary.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The GCC developers are well aware of the small optimization hickup. If the avr-gcc progress it too slow for you, you can speed up things by proposing solutionsin the project mailing list. Thanks.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for the response guys.

Please see comments, I'm relatively new to C programming and come from years of asm writing so I always look to see how the c compiler outputs code and I try to optomise on this basis. Sometimes I write code and wonder how other people would write the same code to see if I could write it better. Hence the thread.

Quote:

In case 1, optimization appears to be off.

Optomisation is set to -Os (size), it was my thought shorty after I posted this and thought it would be easy until I checked this morning. I haven't tried others yet.

Quote:

Y is the stack frame pointer. It's used to access locals that have been created on the stack.

The stack is only written to and never read then the pointer is discarded on exit causing the Y register to be saved then restored for no reason at all adding more bloat to the code.

Quote:

Maybe it's not the microseconds.
Maybe it's the ugliness of the asm.
In that case, OP should probably stick with CRC.CHECKSUM1 << 8 | CRC.CHECKSUM0 .
Better ugly result than ugly source.

I guess this is down to programmer preference and mine is when I'm not programming in windows I care about what the processor is doing as it is a problem that I have created when I run out of resources! I'm also happy with the extended source but I would like to write it better to get the same result. Besides if every routine did the same I could make a significant saving just by writing in a different style.

return *(uint16_t)&CRC.CHECKSUM0;

Spot on... the result is identical. I did write it like this though:

return *(uint16_t*)&CRC.CHECKSUM0;

Quote:

Doesn't avr-gcc come with library functions in the first place?

yes it does, it is really good too and I have been using it just fine for some time but I wanted to see the performance increase by using the new hardware CRC generator in the XMega au chips as I plan to replace a processor with this new version of it, the library function as far as I am aware does not use the hardware, although I haven't done my research as it was merely a quick test!. I also do like to write my own libraries where possible (usually time restrained) so I understand fully the code I am using.

Quote:

This shortcoming is known as PR41076, see

http://gcc.gnu.org/PR41076

Thanks, I'll have a look!

Cheers guys, I'm sure there is no right or wrong answer to any of this as long as the result is usable in its context.

John.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Another example in the same routine:

	while(--len) CRC.DATAIN = *(data++);
    511e:	05 c0       	rjmp	.+10     	; 0x512a 
    5120:	fa 01       	movw	r30, r20
    5122:	81 91       	ld	r24, Z+
    5124:	af 01       	movw	r20, r30
    5126:	80 93 d3 00 	sts	0x00D3, r24
    512a:	61 50       	subi	r22, 0x01	; 1
    512c:	c9 f7       	brne	.-14     	; 0x5120 

Any suggestions to stop this line occurring?

5124:	af 01       	movw	r20, r30

Cheers.

John.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

j0n90 wrote:
Quote:

In case 1, optimization appears to be off.

Optomisation is set to -Os (size), it was my thought shorty after I posted this and thought it would be easy until I checked this morning. I haven't tried others yet.

At the time of writing, I was unaware of the shortcoming.
That said, I stand by my statement: In case 1, optimization appears to be off.
Quote:

return *(uint16_t)&CRC.CHECKSUM0;

Spot on... the result is identical. I did write it like this though:

return *(uint16_t*)&CRC.CHECKSUM0;

Ah. I missed an asterisk and one more thing:

return *(uint16_t*)&CRC.CHECKSUM0; /* requires little endian */

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

Any suggestions to stop this line occurring?


I'd guess that "data" is used after the loop, so the value needs to be updated after the increment? Hard to tell out of context.

MOVW is one cycle, isn't it? I guess, over a loop, it might add up.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
Any suggestions to stop this line occurring?
Yes, stop looking at the asm code. You are, quite frankly, wasting your time. It is an affliction of asm programmers that they feel the need to control what the compiler outputs. Yes, the code can often be tweaked to generate more efficient asm, but it is extremely rare that it is necessary to do this. Wait until you actually run out of resources before you start optimizing. And even then the most gains will be in changing your high level algorithms, not tweaking low level code. And when you really do need to pay attention to microseconds, then it is better to do the section of code in asm rather than try to coerce the compiler to produce the most efficient solution.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

Wait until you actually run out of resources before you start optimizing.

I don't know if I'd come down quite that hard on John. From the description given it would appear to me that the focus is indeed on this one routine. We don't know how many bytes are to be CRCed, or how often, or the desired time constraints. Let's say that it is a thousand-byte buffer. If one can save say four cycles per loop, that is 4000 cycles. 100us to 200us, perhaps.

But (as I hinted earlier) a bit more context is needed to skinny down "the routine". You gurus will be able to apply your experience, then, to suggest alternate approaches as was done in the earlier "return value" snippet.

(Now, I'd think there would be some discussion on "always inline"...)

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You're right I am being a little bit picky but it is just for fun. I do not mean to rub anyone up the wrong way!

I do use the CRC quite heavily though as there is a lot of things to talk to with different uarts so it is nice to have an efficient routine for this.

Dean, data is not reused in this function and effectively this makes it 2 instructions as the Z pointer is reloaded every cycle too.

The fastest way is to shuffle the data from one memory location to another using DMA and hooking the CRC generator up to the DMA channel which is probably the solution I will eventually use.

I did look at the inline result but it didn't seem that beneficial!

return *(uint16_t*)&CRC.CHECKSUM0; /* requires little endian */

Yup, seems to be fine. I think the checksum uses contiguous IO addresses starting with .CHECKSUM0

Thanks again. I will leave it now, I should do some real work.

John.

BTW final routine looks like this:

uint16_t
updateCRC(uint16_t crc, uint8_t len, uint8_t * data)
{
	*((uint16_t*)&CRC.CHECKSUM0) = crc;
 22a:	80 93 d4 00 	sts	0x00D4, r24
 22e:	90 93 d5 00 	sts	0x00D5, r25
	
	CRC.CTRL = CRC_SOURCE_IO_gc | CRC_RESET_NO_gc;
 232:	81 e0       	ldi	r24, 0x01	; 1
 234:	80 93 d0 00 	sts	0x00D0, r24
	
	while(--len) CRC.DATAIN = *(data++);
 238:	05 c0       	rjmp	.+10     	; 0x244 
 23a:	fa 01       	movw	r30, r20
 23c:	81 91       	ld	r24, Z+
 23e:	af 01       	movw	r20, r30
 240:	80 93 d3 00 	sts	0x00D3, r24
 244:	61 50       	subi	r22, 0x01	; 1
 246:	c9 f7       	brne	.-14     	; 0x23a 
	
	return *(uint16_t*)&CRC.CHECKSUM0;
}
 248:	80 91 d4 00 	lds	r24, 0x00D4
 24c:	90 91 d5 00 	lds	r25, 0x00D5
 250:	08 95       	ret