Getting the compiler to properly optimize this code

Go To Last Post
24 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi. I'm trying to get GCC to properly optimize some code. The particular file (and entire project) is here: https://github.com/mojo-chan/Sup...

 

The platform is XMEGA (128A3U), GCC 3.4.1061 as supplied with Atmel Studio 6.2.1563. Compiler is set to -O3 optimization level.

 

The bit I'm not happy with is this:

 

	// fire buttons
	if (logical_inputs[LBUTTON1])		report.buttons |= (1<<0);
	if (logical_inputs[LBUTTON2])		report.buttons |= (1<<1);
	if (logical_inputs[LBUTTON3])		report.buttons |= (1<<2);
	if (logical_inputs[LBUTTON4])		report.buttons |= (1<<3);
	if (logical_inputs[LBUTTON5])		report.buttons |= (1<<4);
	if (logical_inputs[LBUTTON6])		report.buttons |= (1<<5);
	if (logical_inputs[LBUTTON7])		report.buttons |= (1<<6);
	if (logical_inputs[LBUTTON8])		report.buttons |= (1<<7);
	if (logical_inputs[LBUTTON9])		report.buttons |= (1<<8);
	if (logical_inputs[LBUTTON10])		report.buttons |= (1<<9);
	if (logical_inputs[LBUTTON11])		report.buttons |= (1<<10);
	if (logical_inputs[LBUTTON12])		report.buttons |= (1<<11);
	if (logical_inputs[LBUTTON13])		report.buttons |= (1<<12);
	if (logical_inputs[LBUTTON14])		report.buttons |= (1<<13);
	if (logical_inputs[LBUTTON15])		report.buttons |= (1<<14);
	if (logical_inputs[LBUTTON16])		report.buttons |= (1<<15);

Basically I run through an array of uint8s (logical_inputs) and if each byte is non-zero, set a bit in report.buttons which is a uint16. Nothing is marked as volatile. To my mind the compiler should simply set up one of the pointer registers, step through the array and set some bits in a couple of registers, before writing the registers back to SRAM. Instead, it does this:

 

	// fire buttons
	if (logical_inputs[LBUTTON1])		report.buttons |= (1<<0);
    269a:	80 91 31 25 	lds	r24, 0x2531
    269e:	88 23       	and	r24, r24
    26a0:	09 f4       	brne	.+2      	; 0x26a4 <RPT_refresh+0x200>
    26a2:	b2 c0       	rjmp	.+356    	; 0x2808 <RPT_refresh+0x364>
    26a4:	81 e0       	ldi	r24, 0x01	; 1
    26a6:	90 e0       	ldi	r25, 0x00	; 0
    26a8:	80 93 ab 25 	sts	0x25AB, r24
    26ac:	90 93 ac 25 	sts	0x25AC, r25
    26b0:	39 e0       	ldi	r19, 0x09	; 9
    26b2:	20 e0       	ldi	r18, 0x00	; 0
    26b4:	7d e0       	ldi	r23, 0x0D	; 13
    26b6:	60 e0       	ldi	r22, 0x00	; 0
    26b8:	55 e0       	ldi	r21, 0x05	; 5
    26ba:	40 e0       	ldi	r20, 0x00	; 0
    26bc:	ab e0       	ldi	r26, 0x0B	; 11
    26be:	fa 2e       	mov	r15, r26
    26c0:	00 e0       	ldi	r16, 0x00	; 0
    26c2:	1f e0       	ldi	r17, 0x0F	; 15
    26c4:	d1 2c       	mov	r13, r1
    26c6:	b7 e0       	ldi	r27, 0x07	; 7
    26c8:	eb 2e       	mov	r14, r27
    26ca:	b0 e0       	ldi	r27, 0x00	; 0
    26cc:	f3 e0       	ldi	r31, 0x03	; 3
    26ce:	e0 e0       	ldi	r30, 0x00	; 0
	if (logical_inputs[LBUTTON2])		report.buttons |= (1<<1);
    26d0:	a0 91 32 25 	lds	r26, 0x2532
    26d4:	aa 23       	and	r26, r26
    26d6:	61 f0       	breq	.+24     	; 0x26f0 <RPT_refresh+0x24c>
    26d8:	f0 93 ab 25 	sts	0x25AB, r31
    26dc:	e0 93 ac 25 	sts	0x25AC, r30
    26e0:	3f 2d       	mov	r19, r15
    26e2:	20 2f       	mov	r18, r16
    26e4:	71 2f       	mov	r23, r17
    26e6:	6d 2d       	mov	r22, r13
    26e8:	5e 2d       	mov	r21, r14
    26ea:	4b 2f       	mov	r20, r27
    26ec:	8f 2f       	mov	r24, r31
    26ee:	9e 2f       	mov	r25, r30
	if (logical_inputs[LBUTTON3])		report.buttons |= (1<<2);
    26f0:	e0 91 33 25 	lds	r30, 0x2533
    26f4:	ee 23       	and	r30, r30
    26f6:	39 f0       	breq	.+14     	; 0x2706 <RPT_refresh+0x262>
    26f8:	50 93 ab 25 	sts	0x25AB, r21
    26fc:	40 93 ac 25 	sts	0x25AC, r20
    2700:	9b 01       	movw	r18, r22
    2702:	85 2f       	mov	r24, r21
    2704:	94 2f       	mov	r25, r20
	if (logical_inputs[LBUTTON4])		report.buttons |= (1<<3);
    2706:	40 91 34 25 	lds	r20, 0x2534
    270a:	44 23       	and	r20, r20
    270c:	31 f0       	breq	.+12     	; 0x271a <RPT_refresh+0x276>
    270e:	30 93 ab 25 	sts	0x25AB, r19
    2712:	20 93 ac 25 	sts	0x25AC, r18
    2716:	83 2f       	mov	r24, r19
    2718:	92 2f       	mov	r25, r18
	if (logical_inputs[LBUTTON5])		report.buttons |= (1<<4);
    271a:	20 91 35 25 	lds	r18, 0x2535
    271e:	22 23       	and	r18, r18
    2720:	29 f0       	breq	.+10     	; 0x272c <RPT_refresh+0x288>
    2722:	80 61       	ori	r24, 0x10	; 16
    2724:	80 93 ab 25 	sts	0x25AB, r24
    2728:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON6])		report.buttons |= (1<<5);
    272c:	20 91 36 25 	lds	r18, 0x2536
    2730:	22 23       	and	r18, r18
    2732:	29 f0       	breq	.+10     	; 0x273e <RPT_refresh+0x29a>
    2734:	80 62       	ori	r24, 0x20	; 32
    2736:	80 93 ab 25 	sts	0x25AB, r24
    273a:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON7])		report.buttons |= (1<<6);
    273e:	20 91 37 25 	lds	r18, 0x2537
    2742:	22 23       	and	r18, r18
    2744:	29 f0       	breq	.+10     	; 0x2750 <RPT_refresh+0x2ac>
    2746:	80 64       	ori	r24, 0x40	; 64
    2748:	80 93 ab 25 	sts	0x25AB, r24
    274c:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON8])		report.buttons |= (1<<7);
    2750:	20 91 38 25 	lds	r18, 0x2538
    2754:	22 23       	and	r18, r18
    2756:	29 f0       	breq	.+10     	; 0x2762 <RPT_refresh+0x2be>
    2758:	80 68       	ori	r24, 0x80	; 128
    275a:	80 93 ab 25 	sts	0x25AB, r24
    275e:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON9])		report.buttons |= (1<<8);
    2762:	20 91 39 25 	lds	r18, 0x2539
    2766:	22 23       	and	r18, r18
    2768:	29 f0       	breq	.+10     	; 0x2774 <RPT_refresh+0x2d0>
    276a:	91 60       	ori	r25, 0x01	; 1
    276c:	80 93 ab 25 	sts	0x25AB, r24
    2770:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON10])		report.buttons |= (1<<9);
    2774:	20 91 3a 25 	lds	r18, 0x253A
    2778:	22 23       	and	r18, r18
    277a:	29 f0       	breq	.+10     	; 0x2786 <RPT_refresh+0x2e2>
    277c:	92 60       	ori	r25, 0x02	; 2
    277e:	80 93 ab 25 	sts	0x25AB, r24
    2782:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON11])		report.buttons |= (1<<10);
    2786:	20 91 3b 25 	lds	r18, 0x253B
    278a:	22 23       	and	r18, r18
    278c:	29 f0       	breq	.+10     	; 0x2798 <RPT_refresh+0x2f4>
    278e:	94 60       	ori	r25, 0x04	; 4
    2790:	80 93 ab 25 	sts	0x25AB, r24
    2794:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON12])		report.buttons |= (1<<11);
    2798:	20 91 3c 25 	lds	r18, 0x253C
    279c:	22 23       	and	r18, r18
    279e:	29 f0       	breq	.+10     	; 0x27aa <RPT_refresh+0x306>
    27a0:	98 60       	ori	r25, 0x08	; 8
    27a2:	80 93 ab 25 	sts	0x25AB, r24
    27a6:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON13])		report.buttons |= (1<<12);
    27aa:	20 91 3d 25 	lds	r18, 0x253D
    27ae:	22 23       	and	r18, r18
    27b0:	29 f0       	breq	.+10     	; 0x27bc <RPT_refresh+0x318>
    27b2:	90 61       	ori	r25, 0x10	; 16
    27b4:	80 93 ab 25 	sts	0x25AB, r24
    27b8:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON14])		report.buttons |= (1<<13);
    27bc:	20 91 3e 25 	lds	r18, 0x253E
    27c0:	22 23       	and	r18, r18
    27c2:	29 f0       	breq	.+10     	; 0x27ce <RPT_refresh+0x32a>
    27c4:	90 62       	ori	r25, 0x20	; 32
    27c6:	80 93 ab 25 	sts	0x25AB, r24
    27ca:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON15])		report.buttons |= (1<<14);
    27ce:	20 91 3f 25 	lds	r18, 0x253F
    27d2:	22 23       	and	r18, r18
    27d4:	29 f0       	breq	.+10     	; 0x27e0 <RPT_refresh+0x33c>
    27d6:	90 64       	ori	r25, 0x40	; 64
    27d8:	80 93 ab 25 	sts	0x25AB, r24
    27dc:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON16])		report.buttons |= (1<<15);
    27e0:	20 91 40 25 	lds	r18, 0x2540
    27e4:	22 23       	and	r18, r18
    27e6:	29 f0       	breq	.+10     	; 0x27f2 <RPT_refresh+0x34e>
    27e8:	90 68       	ori	r25, 0x80	; 128
    27ea:	80 93 ab 25 	sts	0x25AB, r24
    27ee:	90 93 ac 25 	sts	0x25AC, r25

 

This seems kind of crazy... The preamble is ridiculous and I'm not sure what it's supposed to do. Each if statement writes the registers back to SRAM immediately instead of just doing it once at the end.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

With -Os the function gets inlined and that cuts out the preamble nonsense. Why -O3 doesn't do this is something I'm not very clear on... But this seems to be normal with GCC, -Os is much much faster than -O3 in many cases.

 

	// fire buttons
	if (logical_inputs[LBUTTON1])		report.buttons |= (1<<0);
    124a:	80 91 31 25 	lds	r24, 0x2531
    124e:	88 23       	and	r24, r24
    1250:	31 f0       	breq	.+12     	; 0x125e <RPT_refresh+0x9e>
    1252:	81 e0       	ldi	r24, 0x01	; 1
    1254:	90 e0       	ldi	r25, 0x00	; 0
    1256:	80 93 ab 25 	sts	0x25AB, r24
    125a:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON2])		report.buttons |= (1<<1);
    125e:	80 91 32 25 	lds	r24, 0x2532
    1262:	88 23       	and	r24, r24
    1264:	49 f0       	breq	.+18     	; 0x1278 <RPT_refresh+0xb8>
    1266:	80 91 ab 25 	lds	r24, 0x25AB
    126a:	90 91 ac 25 	lds	r25, 0x25AC
    126e:	82 60       	ori	r24, 0x02	; 2
    1270:	80 93 ab 25 	sts	0x25AB, r24
    1274:	90 93 ac 25 	sts	0x25AC, r25
	if (logical_inputs[LBUTTON3])		report.buttons |= (1<<2);
    1278:	80 91 33 25 	lds	r24, 0x2533
    127c:	88 23       	and	r24, r24
    127e:	49 f0       	breq	.+18     	; 0x1292 <RPT_refresh+0xd2>
    1280:	80 91 ab 25 	lds	r24, 0x25AB
    1284:	90 91 ac 25 	lds	r25, 0x25AC
    1288:	84 60       	ori	r24, 0x04	; 4
    128a:	80 93 ab 25 	sts	0x25AB, r24
    128e:	90 93 ac 25 	sts	0x25AC, r25
...

 

But this is still not all that good, all those sts instructions should be done once at the end. Let's try a loop...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

make a temp for    report.buttons    so it only gets written one time.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
    1254:	87 e0       	ldi	r24, 0x07	; 7
    1256:	90 e0       	ldi	r25, 0x00	; 0
	for (uint8_t i = LBUTTON1; i <= LBUTTON16; i++)
		if (logical_inputs[i])		report.buttons |= (1<<i);
    1258:	41 e0       	ldi	r20, 0x01	; 1
    125a:	50 e0       	ldi	r21, 0x00	; 0
    125c:	61 91       	ld	r22, Z+
    125e:	66 23       	and	r22, r22
    1260:	49 f0       	breq	.+18     	; 0x1274 <RPT_refresh+0xb6>
    1262:	ba 01       	movw	r22, r20
    1264:	08 2e       	mov	r0, r24
    1266:	02 c0       	rjmp	.+4      	; 0x126c <RPT_refresh+0xae>
    1268:	66 0f       	add	r22, r22
    126a:	77 1f       	adc	r23, r23
    126c:	0a 94       	dec	r0
    126e:	e2 f7       	brpl	.-8      	; 0x1268 <RPT_refresh+0xaa>
    1270:	26 2b       	or	r18, r22
    1272:	37 2b       	or	r19, r23
    1274:	01 96       	adiw	r24, 0x01	; 1
	for (uint8_t i = LBUTTON1; i <= LBUTTON16; i++)
    1276:	87 31       	cpi	r24, 0x17	; 23
    1278:	91 05       	cpc	r25, r1
    127a:	81 f7       	brne	.-32     	; 0x125c <RPT_refresh+0x9e>
    127c:	20 93 ab 25 	sts	0x25AB, r18
    1280:	30 93 ac 25 	sts	0x25AC, r19
		if (logical_inputs[i])		report.buttons |= (1<<i);

 

Hmm, it's an improvement I think... Seems like GCC doesn't understand unrolled loops, so it's better to just use an actual loop. But even then it's missing some obvious tricks. It shifts the bit into position from scratch every time, and doesn't split the process into two byte accesses.

 

This is still with -Os, by the way. Let's see if we can give it any more hints.

Last Edited: Fri. Apr 15, 2016 - 09:14 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
The preamble is ridiculous and I'm not sure what it's supposed to do

It's loading bit mask contents that are used in later code - the compiler does not necessarily order the code in the way you write it - it can split parts of a statement if it sees a benefit.

 

Anyway I doubt anyone can help much until you show a little more context such as the types of logical_inputs[], report{}, report.buttons and what LBUTTON1..LBUTTON16 are defined to be.

 

For example one cannot help but notice the increasing (1 << n) pattern which may lend itself to iteration. Also, if LBUTTON1..LBUTTON16 follow some pattern then that would help too. In fact I can see a pattern:

    26d0:	a0 91 32 25 	lds	r26, 0x2532
    26f0:	e0 91 33 25 	lds	r30, 0x2533
   2706:	40 91 34 25 	lds	r20, 0x2534
    271a:	20 91 35 25 	lds	r18, 0x2535

which suggests they are adjacent, incrementing locations - so this very much lends itself to iteration.

 

Also is it really the case that multiple buttons may be pressed? If not would a conceptual:

        if (logical_inputs[LBUTTON1])		report.buttons |= (1<<0);
	else if (logical_inputs[LBUTTON2])		report.buttons |= (1<<1);
	else if (logical_inputs[LBUTTON3])		report.buttons |= (1<<2);
	else if (logical_inputs[LBUTTON4])		report.buttons |= (1<<3);
	else if (logical_inputs[LBUTTON5])		report.buttons |= (1<<4);
	else if (logical_inputs[LBUTTON6])		report.buttons |= (1<<5);
	else if (logical_inputs[LBUTTON7])		report.buttons |= (1<<6);
	else if (logical_inputs[LBUTTON8])		report.buttons |= (1<<7);
	else if (logical_inputs[LBUTTON9])		report.buttons |= (1<<8);

make better code - as it stands it performs every LDS/AND/BREQ as it falls through.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson, the link in the first post goes to GitHub where you can see the entire source file, and indeed the entire project can be viewed and downloaded for context.

 

sparrow2, I'll give that a try.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Okay, this looks better. With -Os:

 

	uint8_t mask0 = 0;
	uint8_t mask1 = 0;
	uint8_t bit = 1;
	for (uint8_t i = LBUTTON1; i <= LBUTTON8; i++)
	{
		if (logical_inputs[i]) mask0 |= bit;
		bit <<= 1;
	}
	bit = 1;
	for (uint8_t i = LBUTTON1; i <= LBUTTON8; i++)
	{
		if (logical_inputs[i]) mask1 |= bit;
		bit <<= 1;
	}
	report.buttons = mask0 | (mask1 << 8);
    1248:	e1 e3       	ldi	r30, 0x31	; 49
    124a:	f5 e2       	ldi	r31, 0x25	; 37

	uint8_t bit = 1;
    124c:	91 e0       	ldi	r25, 0x01	; 1

	uint8_t mask0 = 0;
    124e:	80 e0       	ldi	r24, 0x00	; 0

	for (uint8_t i = LBUTTON1; i <= LBUTTON8; i++)
	{
		if (logical_inputs[i]) mask0 |= bit;
    1250:	21 91       	ld	r18, Z+
    1252:	21 11       	cpse	r18, r1
    1254:	89 2b       	or	r24, r25
		bit <<= 1;
    1256:	99 0f       	add	r25, r25

    1258:	25 e2       	ldi	r18, 0x25	; 37
    125a:	e9 33       	cpi	r30, 0x39	; 57
    125c:	f2 07       	cpc	r31, r18
    125e:	c1 f7       	brne	.-16     	; 0x1250 <RPT_refresh+0x92>

    1260:	e1 e3       	ldi	r30, 0x31	; 49
    1262:	f5 e2       	ldi	r31, 0x25	; 37
    1264:	91 e0       	ldi	r25, 0x01	; 1
    1266:	20 e0       	ldi	r18, 0x00	; 0
		bit <<= 1;
	}
	bit = 1;
	for (uint8_t i = LBUTTON1; i <= LBUTTON8; i++)
	{
		if (logical_inputs[i]) mask1 |= bit;
    1268:	31 91       	ld	r19, Z+
    126a:	31 11       	cpse	r19, r1
    126c:	29 2b       	or	r18, r25
		bit <<= 1;
    126e:	99 0f       	add	r25, r25
	{
		if (logical_inputs[i]) mask0 |= bit;
		bit <<= 1;
	}
	bit = 1;
	for (uint8_t i = LBUTTON1; i <= LBUTTON8; i++)
    1270:	35 e2       	ldi	r19, 0x25	; 37
    1272:	e9 33       	cpi	r30, 0x39	; 57
    1274:	f3 07       	cpc	r31, r19
    1276:	c1 f7       	brne	.-16     	; 0x1268 <RPT_refresh+0xaa>
	{
		if (logical_inputs[i]) mask1 |= bit;
		bit <<= 1;
	}
	report.buttons = mask0 | (mask1 << 8);
    1278:	90 e0       	ldi	r25, 0x00	; 0
    127a:	92 2b       	or	r25, r18
    127c:	80 93 ab 25 	sts	0x25AB, r24
    1280:	90 93 ac 25 	sts	0x25AC, r25

Nice tight loops. 1278/127a are looking a little half baked but generally it's pretty good.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
the link in the first post goes to GitHub where you can see the entire source file

Your the one asking for help here - it's up to you to do the work to provide the information people will need.

 

Anyway, as you can't be bothered, for the benefit of others these are the important details:

REPORT_t	report;
uint8_t		physical_inputs[128];
uint8_t		logical_inputs[128];
typedef struct
{
	uint8_t		udlr_sscc;
	uint16_t	buttons;
	uint8_t		rot_mode;
} REPORT_t;
// logical inputs
enum LOGICAL_INPUTS_enum
{
	LNONE = 0,
	
	LJOY_UP,
	LJOY_DN,
	LJOY_LF,
	LJOY_RT,
	
	LSTART,
	LCOIN,
	
	LBUTTON1,
	LBUTTON2,
	LBUTTON3,
	LBUTTON4,
	LBUTTON5,
	LBUTTON6,
	LBUTTON7,
	LBUTTON8,
	LBUTTON9,
	LBUTTON10,
	LBUTTON11,
	LBUTTON12,
	LBUTTON13,
	LBUTTON14,
	LBUTTON15,
	LBUTTON16,
	
	LROTARY1,
	LROTARY2,
	LROTARY3,
	LROTARY4,
etc.

personally I'm out - I had to go scouring through your files to find these vital details in report.c, report.h and config.h (how was I supposed to know the enum was in config.h for goodness sake?)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Interestingly the -O3 code is still crazy even with the loop. -O1 and -O2 are identical to -Os. It seems like unrolling the loop backfires massively at -O3.

 

sparrow2, your idea is the clear winner. With -Os it spits out:

 

	uint16_t b = 0;
    1248:	31 e0       	ldi	r19, 0x01	; 1
    124a:	20 e0       	ldi	r18, 0x00	; 0
    124c:	80 91 31 25 	lds	r24, 0x2531
    1250:	81 11       	cpse	r24, r1
    1252:	01 c0       	rjmp	.+2      	; 0x1256 <RPT_refresh+0x98>
    1254:	30 e0       	ldi	r19, 0x00	; 0

	uint16_t b = 0;
    1256:	83 2f       	mov	r24, r19
    1258:	92 2f       	mov	r25, r18
	if (logical_inputs[LBUTTON1])		b |= (1<<0);
	if (logical_inputs[LBUTTON2])		b |= (1<<1);
    125a:	20 91 32 25 	lds	r18, 0x2532
    125e:	21 11       	cpse	r18, r1
    1260:	82 60       	ori	r24, 0x02	; 2
	if (logical_inputs[LBUTTON3])		b |= (1<<2);
    1262:	20 91 33 25 	lds	r18, 0x2533
    1266:	21 11       	cpse	r18, r1
    1268:	84 60       	ori	r24, 0x04	; 4
	if (logical_inputs[LBUTTON4])		b |= (1<<3);
    126a:	20 91 34 25 	lds	r18, 0x2534
    126e:	21 11       	cpse	r18, r1
    1270:	88 60       	ori	r24, 0x08	; 8
	if (logical_inputs[LBUTTON5])		b |= (1<<4);
    1272:	20 91 35 25 	lds	r18, 0x2535
    1276:	21 11       	cpse	r18, r1
    1278:	80 61       	ori	r24, 0x10	; 16
	if (logical_inputs[LBUTTON6])		b |= (1<<5);
    127a:	20 91 36 25 	lds	r18, 0x2536
    127e:	21 11       	cpse	r18, r1
    1280:	80 62       	ori	r24, 0x20	; 32
	if (logical_inputs[LBUTTON7])		b |= (1<<6);
    1282:	20 91 37 25 	lds	r18, 0x2537
    1286:	21 11       	cpse	r18, r1
    1288:	80 64       	ori	r24, 0x40	; 64
	if (logical_inputs[LBUTTON8])		b |= (1<<7);
    128a:	20 91 38 25 	lds	r18, 0x2538
    128e:	21 11       	cpse	r18, r1
    1290:	80 68       	ori	r24, 0x80	; 128
	if (logical_inputs[LBUTTON9])		b |= (1<<8);
    1292:	20 91 39 25 	lds	r18, 0x2539
    1296:	21 11       	cpse	r18, r1
    1298:	91 60       	ori	r25, 0x01	; 1
	if (logical_inputs[LBUTTON10])		b |= (1<<9);
    129a:	20 91 3a 25 	lds	r18, 0x253A
    129e:	21 11       	cpse	r18, r1
    12a0:	92 60       	ori	r25, 0x02	; 2
	if (logical_inputs[LBUTTON11])		b |= (1<<10);
    12a2:	20 91 3b 25 	lds	r18, 0x253B
    12a6:	21 11       	cpse	r18, r1
    12a8:	94 60       	ori	r25, 0x04	; 4
	if (logical_inputs[LBUTTON12])		b |= (1<<11);
    12aa:	20 91 3c 25 	lds	r18, 0x253C
    12ae:	21 11       	cpse	r18, r1
    12b0:	98 60       	ori	r25, 0x08	; 8
	if (logical_inputs[LBUTTON13])		b |= (1<<12);
    12b2:	20 91 3d 25 	lds	r18, 0x253D
    12b6:	21 11       	cpse	r18, r1
    12b8:	90 61       	ori	r25, 0x10	; 16
	if (logical_inputs[LBUTTON14])		b |= (1<<13);
    12ba:	20 91 3e 25 	lds	r18, 0x253E
    12be:	21 11       	cpse	r18, r1
    12c0:	90 62       	ori	r25, 0x20	; 32
	if (logical_inputs[LBUTTON15])		b |= (1<<14);
    12c2:	20 91 3f 25 	lds	r18, 0x253F
    12c6:	21 11       	cpse	r18, r1
    12c8:	90 64       	ori	r25, 0x40	; 64
	if (logical_inputs[LBUTTON16])		b |= (1<<15);
    12ca:	20 91 40 25 	lds	r18, 0x2540
    12ce:	21 11       	cpse	r18, r1
    12d0:	90 68       	ori	r25, 0x80	; 128
	report.buttons = b;
    12d2:	80 93 ab 25 	sts	0x25AB, r24
    12d6:	90 93 ac 25 	sts	0x25AC, r25

 

Congratulations, you win ;-)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

An issue that often arises is how much one should be willing
to change the source to get the optimization desired.
Buttons suggest that speed is not an issue in this case.
Code size might be, but I cannot tell.
sparrow2's solution is small enough that it
might be justified even for aesthetic reasons.

A program design issue here is the 16 buttons
with 16 names apparently treated uniformly.
Given their names and their use as indices,
I expect they have values 0..15 .
Why not just LBUTTONS_NUM=16 and process your buttons with loops?

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A little background. This is a video game controller. It takes inputs from many buttons, a joystick and a rotary position encoder. It applies autofire and remapping, and then sends out a report over USB, RS232 or SPI.

 

The button names are supposed to be human readable, so I started at 1 because most people don't count from zero (they should, but, well...) It's somewhat arbitrary anyway, it all goes through the same remapping matrix. Physic button 1 could be remapped to joystick left if you wanted.

 

I agree that in practice optimization of the code isn't likely to have any noticeable impact, it's just me being obsessive about it.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
A little background. This is a video game controller. It takes inputs from many buttons, a joystick and a rotary position encoder. It applies autofire and remapping, and then sends out a report over USB, RS232 or SPI.

 

The button names are supposed to be human readable, so I started at 1 because most people don't count from zero (they should, but, well...) It's somewhat arbitrary anyway, it all goes through the same remapping matrix. Physic button 1 could be remapped to joystick left if you wanted.

Do you really have a button labelled "LBUTTON1"?

If not, I'd say that LBUTTON1 is not human-readable.

If so, I'd ask why.

Your generic and collective button code should just use numbers.

For the start button-specific code, a constant named LBUTTON_START is good.

Likewise LBUTTON_JOY for the joystick button-specific code.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well, it's not just buttons. There are even some control lines in there, like autofire. And they match up to the PCB, which is the user facing bit.

There will be a PC app for configuration. For the sake of my sanity I decided to keep the naming consistent everywhere. Windows numbers the buttons from 1, for example.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I was revisiting this code and noticed something:

 

    125a:	20 91 32 25 	lds	r18, 0x2532
    125e:	21 11       	cpse	r18, r1
    1260:	82 60       	ori	r24, 0x02	; 2

 

The A3U manual states that LDS takes 2+1 cycles. An LD r18, X+ would only take 1+1 cycles. So the compiler seems to have missed the opportunity to optimize consecutive address accesses.

 

Anyway, I'm adding more more code that just converts 128 bytes to 128 bits:

 

	uint8_t b = 0;
	if (input_matrix[0])	b |= (1<<0);
	if (input_matrix[1])	b |= (1<<1);
	if (input_matrix[2])	b |= (1<<2);
	if (input_matrix[3])	b |= (1<<3);
	if (input_matrix[4])	b |= (1<<4);
	if (input_matrix[5])	b |= (1<<5);
	if (input_matrix[6])	b |= (1<<6);
	if (input_matrix[7])	b |= (1<<7);
	res->data[0] = b;

 

In actual fact only bit 0 of input_matrix[] matters, but anyway... The compiler with -Os or -O3 produces:

 

	uint8_t b = 0;
	if (input_matrix[0])	b |= (1<<0);
     95e:	91 e0       	ldi	r25, 0x01	; 1
     960:	80 91 fe 24 	lds	r24, 0x24FE	; 0x8024fe <input_matrix>
     964:	81 11       	cpse	r24, r1
     966:	01 c0       	rjmp	.+2      	; 0x96a <kbus_long_report+0x12>
     968:	90 e0       	ldi	r25, 0x00	; 0
	if (input_matrix[1])	b |= (1<<1);
     96a:	80 91 ff 24 	lds	r24, 0x24FF	; 0x8024ff <input_matrix+0x1>
     96e:	81 11       	cpse	r24, r1
     970:	92 60       	ori	r25, 0x02	; 2
	if (input_matrix[2])	b |= (1<<2);
     972:	80 91 00 25 	lds	r24, 0x2500	; 0x802500 <input_matrix+0x2>
     976:	81 11       	cpse	r24, r1
     978:	94 60       	ori	r25, 0x04	; 4

 

Maybe GCC just isn't aware that it can save a cycle using an indirect load with post increment. It clearly knows that the memory locations are consecutive.

 

It could also save a cycle using BLD and BST and thus avoid clearing the temporary register.

 

A loop would be pretty efficient here too, if you could get GCC to behave sensibly with it. I tried and GCC produced crap again. If you want something done right... code it in assembler.

 

	asm volatile(
		"ldi	r18, 16"		"\n\t"
		"loop%=:"
		"ld	__tmp_reg__, X+"	"\n\t"	// 0
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 0"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 1
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 1"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 2
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 2"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 3
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 3"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 4
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 4"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 5
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 5"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 6
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 6"			"\n\t"
		"ld	__tmp_reg__, X+"	"\n\t"	// 7
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 7"			"\n\t"
		"st	Z+, r19"		"\n\t"
		"dec	r18"			"\n\t"
		"brne	loop%="
	:
	: [input] "x" (input_matrix), [output] "z" (&res->data)
	: "r18", "r19"
	);

 

Note the selection of registers to use. They are ones that GCC considers clobbered when calling a function.

Last Edited: Mon. Jan 23, 2017 - 04:25 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:

... It clearly knows that the memory locations are consecutive. ...

 

I had to refresh myself on the thread topic.

 

...code that just converts 128 bytes to 128 bits...

I'll have to think a bit more about the way I'd do that "optimally".  (and perhaps have a bit [pun intended]more interest if I learned why it is so important to the app -- if only happens rarely then a few more cycles per bit is usually not important)

 

Apparently, we started with an 8-bit destination "b".  Now it is a 16-bit destination?  And an array of such?

 

And originally, Cliff did the digging and found that the LBUTTONn values started at 6, and the new code has it starting at 0 (to index into input_matrix[]).

 

I was going to suggest hard-indexing to avoid interpreting the enum values, but the last code seems to do that.  GCC is usually pretty good about pointer handling with index registers; I wonder if that would make any difference.

 

With my toolchain I'd be tempted to try some type of shift-mask loop, perhaps unrolled, and force register vars where needed.

 

Is your input_matrix really 128 bytes?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch, I need to send 128 bits over USB and some other interfaces. USB in particular has a 64 byte packet size limit on interrupt endpoints (which are required for guaranteed timing), so to send the data as-is I would need to split it over two packets and thus halve the update rate. Since interrupt packets are sent once every millisecond I will be running this code 1000 times per second, so I feel an uncontrollable urge to optimize it. I'm running at 24MHz so it might be unnecessary, but it's a hobby project and I enjoy this stuff.

 

I apologise for not explaining the whole project in detail earlier, I had thought it would be of little interest and that I should concentrate on the relevant code. Anyway, for those interested...

 

It's a video game controller system. There is a controller board, usually installed in a joystick but it could also be something like a converter that reads a Playstation gamepad. It can communicate over USB or over a low voltage RS232 bus. The USB bus uses HID but also has a custom interrupt endpoint that allows updates at 1000Hz, rather than the 120Hz maximum supported by HID.

 

The controller have various convenience features like button remapping and autofire. The button remapping stuff is actually quite powerful; it uses a matrix of 128 physical and 128 logical inputs and allows them to be arbitrarily wired together using a simple mapping script with a "source -> destination" format. These are stored in a 256 element array of bytes. So the remapping is basically:

 

	for (uint8_t i = 0; i < number_of_mappings; i++)
		input_matrix[destination[i]] = input_matrix[source[i]];

 

This of course only works on arrays of intrinsic types like uint8s, as C does not support bit arrays. While it would be possible to do this remapping on an array of bits, it would be less efficient because AVR, like most architectures, isn't really designed for operating on 256 bit words.

 

As I outlined above, to transmit all 128 logical inputs to the receiver it is helpful to pack them down to 16 bytes. This will likely be faster than trying to do the remapping on a bit array and also the byte array system simplifies other code like autofire.

 

Of course, if you have a fast way of selecting one bit from 256 I'd love to hear it :-)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
1000 times per second, so I feel an uncontrollable urge to optimize it. I'm running at 24MHz so it might be unnecessary, but it's a hobby project and I enjoy this stuff.
first rule of optimisation - don't! Even at 1MHz you have 1,000 cycles in a millisecond. At 24MHz you have 24,000 cycles. What on earth could you possibly be doing that requires 24,000 cycles ?!?

 

If you ever do find a job that requires cycle accurate optimisation then the best suggestion is to do that part in hand crafted Asm (possibly outlining it first in C then taking its output code as the basis for what you want to optimise).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you ever do find a job that requires cycle accurate optimisation then the best suggestion is to do that part in hand crafted Asm (possibly outlining it first in C then taking its output code as the basis for what you want to optimise).

That is a bad way of thinking, because then you think in C, and that way you don't get optimal ASM code.  

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

True dat. But often C can give you a "starting point" if you are not so familiar with AVR Asm code. (if you were you'd presumably be doing the whole thing in Asm from the start ;-)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
first rule of optimisation - don't! Even at 1MHz you have 1,000 cycles in a millisecond. At 24MHz you have 24,000 cycles. What on earth could you possibly be doing that requires 24,000 cycles ?!

 

But... But... It's not optimal! :-)

 

I've learned not to worry about it most of the time, especially when doing desktop stuff in C# or Java, but I like to know things are being done efficiently in C.

 

The other reason to look at this stuff is that I like to understand how the compiler generates and optimizes code, because sometimes it helps with debugging.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The ASM code you show can't be made to a loop but if you use shift it can and then it's only the loop counter that is a overhead 

 

so : (in normal ASM)

 

  ;init X

  LDI r19,0x80 ; init so this bit is in carry when done

L0:

  LD r24,X+

  LSR r24

  ROR r19

  BRCC L0

 

After 8 times all bit's is placed correct. (not tested but the idea is correct :) ) and if you want less overhead then perhaps do 2 bits 4 times.

Last Edited: Tue. Jan 24, 2017 - 03:35 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I checked out your code to see what all this fuss was about. Actually I was surprised at how invasive ASF was, and the 20 or so include directories.

 

Anyway once you've sorted out the report generation which seems to occur in two places BTW: { report.c and hid.c } you have to look at the loading of input_matrix from the I/O Ports:

 

This code must be worthy of similar inspection.

void rpt_physical_inputs_refresh(void)
{
	// PORTA
	uint8_t p = PORTA.IN;
	if (!(p & JOY_UP_PIN_bm))		input_matrix[PJOY_UP] = 1;
	if (!(p & JOY_DOWN_PIN_bm))		input_matrix[PJOY_DN] = 1;
	if (!(p & JOY_LEFT_bm))			input_matrix[PJOY_LF] = 1;
	if (!(p & JOY_RIGHT_bm))		input_matrix[PJOY_RT] = 1;
	if (!(p & START_PIN_bm))		input_matrix[PB16] = 1;
	if (!(p & COIN_PIN_bm))			input_matrix[PB15] = 1;
	if (!(p & AUTO_LOW_5_PIN_bm))	input_matrix[PA5] = 1;
	if (!(p & AUTO_LOW_6_PIN_bm))	input_matrix[PA6] = 1;

	// PORTB
	p = PORTB.IN;
	if (!(p & BUTTON_1_PIN_bm))		input_matrix[PB1] = 1;
	if (!(p & BUTTON_2_PIN_bm))		input_matrix[PB2] = 1;
	if (!(p & BUTTON_3_PIN_bm))		input_matrix[PB3] = 1;
	if (!(p & BUTTON_4_PIN_bm))		input_matrix[PB4] = 1;
	if (!(p & BUTTON_5_PIN_bm))		input_matrix[PB5] = 1;
	if (!(p & BUTTON_6_PIN_bm))		input_matrix[PB6] = 1;
	if (!(p & BUTTON_7_PIN_bm))		input_matrix[PB7] = 1;
	if (!(p & BUTTON_8_PIN_bm))		input_matrix[PB8] = 1;

etc
etc
etc

I did wonder why input_matrix wasn't a packed bit array to start with but found processing like this: which could be quite messy if a packed bits were used.

** Map physical to logical inputs and apply forced inputs
*/
void RPT_refresh_input_matrix(void)
{
	rpt_physical_inputs_refresh();

	input_matrix[LOFF] = 0;
	input_matrix[LON] = 1;

	for (uint8_t i = 0; i < map->count; i++)
		input_matrix[map->mapping[i][0]] = input_matrix[map->mapping[i][1]];
}

Also is there a bug here if you allow any arbitrary mapping ? You could overwrite a source with a destination before it has been read.

 

PS

I did giggle at this volatile qualifier in joystick.c:

// data embedded in firmware image so that the bootloader program can read it
volatile const __flash FW_INFO_t firmware_info =	{	{ 0x59, 0x61, 0x6d, 0x61, 0x4e, 0x65, 0x6b, 0x6f },		// "YamaNeko" magic identifier

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

N.Winterbottom, it's a work-in-progress so yes there is some redundancy and it needs a little refactoring. I'm thinking about how I might rework the HID stuff.

 

I have a nice big I/O matrix table that is generated by a macro chain, used for selecting outputs. I was thinking that I could create another similar macro that generated the RPT_refresh_input_matrix() code. That would allow the code to support different PCBs with only changes to that one file.

 

The arbitrary mapping thing could, as you say, go wrong but I'm not too worried about that right now. At the moment you have to write the config files by hand but eventually I might make something in C# to do it for you, and all the limiting logic would be in there. I prefer to allow invalid/glitch configs in projects like this because sometimes they can be used to create interesting side-effects or experiment with ideas. Obviously commercial code would fix all this.

 

The Iriomote mountain cat is my favourite cat, known informally as yamaneko (mountain cat). I hope to see one in person one day.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

> how the compiler generates and optimizes code

 

* Add -fdump-tree-all -fdump-rtl-all -fdump-ipa-all -save-temps to the compiler command options and read the dumps

 

* Re-build the compiler with -O0 -g3 and debug the compiler proper (cc1, cc1plus or lto1).

avrfreaks does not support Opera. Profile inactive.