Should I be doing something different to avoid this inefficient code?

Go To Last Post
26 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am setting up a peripheral like so:

 

const __flash struct {
	USART_t		*usart;
	struct {
		DMA_CH_t	*ch;
		uint8_t		trigsrc_rxc;
		uint8_t		trigsrc_dre;
		volatile uint8_t *rx_buffer_AT;
	} dma;
} channels[] = {
	{	.usart = &BUS1_USART,
		.dma = {	.ch = &BUS1_DMA_CH,
					.trigsrc_rxc = BUS1_DMA_TRIGGER_SRC_RXC,
					.trigsrc_dre = BUS1_DMA_TRIGGER_SRC_DRE,
					.rx_buffer_AT = bus1_rxbuffer_AT, },
	},

	{	.usart = &BUS2_USART,
		.dma = {	.ch = &BUS2_DMA_CH,
					.trigsrc_rxc = BUS2_DMA_TRIGGER_SRC_RXC,
					.trigsrc_dre = BUS2_DMA_TRIGGER_SRC_DRE,
					.rx_buffer_AT = bus2_rxbuffer_AT, },
	},
};

 

		channels[ch].dma.ch->CTRLA = 0;
		channels[ch].dma.ch->CTRLA = DMA_RESET_bm;
		channels[ch].dma.ch->CTRLB = 0;
		channels[ch].dma.ch->ADDRCTRL =	DMA_CH_SRCRELOAD_BURST_gc | DMA_CH_SRCDIR_FIXED_gc |
										DMA_CH_DESTRELOAD_TRANSACTION_gc | DMA_CH_DESTDIR_INC_gc;
		channels[ch].dma.ch->TRIGSRC = channels[ch].dma.trigsrc_rxc;
		channels[ch].dma.ch->TRFCNT = BUFFER_SIZE;
		channels[ch].dma.ch->REPCNT = 0;
		channels[ch].dma.ch->SRCADDR0 = (( (uint16_t) &channels[ch].usart->DATA) >> 0) & 0xFF;
		channels[ch].dma.ch->SRCADDR1 = (( (uint16_t) &channels[ch].usart->DATA) >> 8) & 0xFF;
		channels[ch].dma.ch->SRCADDR2 = 0;
		channels[ch].dma.ch->DESTADDR0 = (( (uint16_t)channels[ch].dma.rx_buffer_AT) >> 0) & 0xFF;
		channels[ch].dma.ch->DESTADDR1 = (( (uint16_t)channels[ch].dma.rx_buffer_AT) >> 8) & 0xFF;
		channels[ch].dma.ch->DESTADDR2 = 0;
			
		channels[ch].dma.ch->CTRLA = DMA_CH_ENABLE_bm | DMA_CH_SINGLE_bm | DMA_CH_BURSTLEN_1BYTE_gc;

 

GCC generates code like this:

 

     5aa:	32 97       	sbiw	r30, 0x02	; 2
     5ac:	18 96       	adiw	r26, 0x08	; 8
     5ae:	8c 93       	st	X, r24
     5b0:	18 97       	sbiw	r26, 0x08	; 8
     5b2:	19 96       	adiw	r26, 0x09	; 9
     5b4:	9c 93       	st	X, r25
     5b6:	19 97       	sbiw	r26, 0x09	; 9
     5b8:	1a 96       	adiw	r26, 0x0a	; 10
     5ba:	1c 92       	st	X, r1

 

The sbiw followed by an adiw seems unnecessary. It's backtracking to the base address of the peripheral every time, rather than just doing a single add/subtract or better yet using the Y register with displacement (it's an XMEGA).

 

Interestingly if I remove the __flash attribute from the struct the compiler produces much better code:

 

     55a:	83 83       	std	Z+3, r24	; 0x03
     55c:	8a e1       	ldi	r24, 0x1A	; 26
     55e:	90 e0       	ldi	r25, 0x00	; 0
     560:	84 83       	std	Z+4, r24	; 0x04
     562:	95 83       	std	Z+5, r25	; 0x05
     564:	16 82       	std	Z+6, r1	; 0x06

 

It's a small thing I know, but I'm wondering if I should do something with my C code to make the compiler produce better code. It's not a lot of RAM to waste but still...

This topic has a solution.
Last Edited: Thu. Jun 10, 2021 - 03:14 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
The sbiw followed by an adiw seems unnecessary.
That's maybe true for examples like:

     5b0:	18 97       	sbiw	r26, 0x08	; 8
     5b2:	19 96       	adiw	r26, 0x09	; 9

which I agree could have been an adiw r26,1 but the first example:

     5aa:	32 97       	sbiw	r30, 0x02	; 2
     5ac:	18 96       	adiw	r26, 0x08	; 8

involves two different register pairs. As R30 is not subsequently used in the quoted segment one assumes this is "set up" for something that happens beyond this?

 

BTW just out of interest - does it help to reorder the lines of C so that maybe the field offsets are in ascending/descending order ?

 

I think you'll have to wait for SprinterSB to comment on the others as he did the work on __flash.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well spotted. It's already in address order as I usually just go through the register in the datasheet and set them up one by one.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can throw code in the online compiler to see if you can work out what is the 'cause'-

https://godbolt.org/z/58xcxch6s

 

It may be faster than doing the modify/compile/look-at-asm routine on the pc. For instance, change that init function to static and it will do something different.

 

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

curtvm wrote:

You can throw code in the online compiler to see if you can work out what is the 'cause'-

https://godbolt.org/z/58xcxch6s

 

It may be faster than doing the modify/compile/look-at-asm routine on the pc. For instance, change that init function to static and it will do something different.

 

Thanks, that's super useful! The output is different to what I get so I need to figure out why first. Fun for tomorrow.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

IIRC there are flags that can control how many times each optimization step is run.

My guess is that if you increase the right one, sbiw and adiw will be merged.

 

My guess is that the problem does not need to be fixed.

If it did, you would not have needed to ask.

 

Supposing that the compiler's assembly is ugly and inefficient enough that you need to fix it,

one possibility is to do just that: fix the assembly.

Take the compiler's output and hand-optimize it.

Keep both the .c file and the original .s file around as documentation.

There are potential pitfalls, e.g.  branches and rjmps with explicit offsets.

Moderation in all things. -- ancient proverb

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Okay, I think your example is getting optimized away because it doesn't actually use an LPM instructions. I made a minimal example to demonstrate the problem:

 

https://godbolt.org/z/YWon8cMaW

 

Without __flash it generates good code, using Z with displacement. With __flash it on line 59 it uses the X register, which doesn't support displacement.

 

I can see why this is happening. It wants to use Z as the pointer into flash to load some constant values, and Y is the stack pointer so it only has X left. It should still optimize the addition/subtraction better, but that's the reason why it doesn't use the displacement feature.

 

I tried pre-loading the values into autos but it didn't help. I guess it's a limitation of GCC's ability to see a way to optimize this one. There are other issues even with the data in RAM, for example line 140:

 

channels[ch].dma.ch->TRIGSRC = channels[ch].dma.trigsrc_rxc;

 

Compiles to

 

        ldi r25,lo8(19)
        mul r25,r18
        movw r26,r0
        mul r25,r19
        add r27,r0
        clr __zero_reg__
        subi r26,lo8(-(channels))
        sbci r27,hi8(-(channels))
        adiw r26,13
        ld r25,X
        sbiw r26,13
        std Z+3,r25

 

So it's re-calculating the offset into the array every time, even though it has already done that for channels[ch].dma.

 

Maybe there is a better way to handle this. My goal is to have common code handle any of 3 channels without duplicating the code, because duplication can lead to errors and makes maintenance harder work. Other than using a janky macro I'm not sure what else there is.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
So it's re-calculating the offset into the array every time  ... Maybe there is a better way to handle this.

Does it make any difference if you use a pointer rather than an index  ... ?

 

 

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
I made a minimal example to demonstrate the problem:
In which case consider raising a report on the GCC Bugzilla. We can talk about this kind of thing here as much as we like but the only way to be sure that the developers of the compiler know about it and then consider doing something about it is to tell them.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:

mojo-chan wrote:

So it's re-calculating the offset into the array every time  ... Maybe there is a better way to handle this.

 

Does it make any difference if you use a pointer rather than an index  ... ?

 

Doesn't seem to.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:

mojo-chan wrote:
I made a minimal example to demonstrate the problem:
In which case consider raising a report on the GCC Bugzilla. We can talk about this kind of thing here as much as we like but the only way to be sure that the developers of the compiler know about it and then consider doing something about it is to tell them.

 

I will do that.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I tried avr-gcc 11.1.0 and on -O2 and -O3 it does optimise this a lot better, although -Og is still the same. Unfortunately with 5.4.0 that ships with Atmel Studio all the settings are the same.

 

Question is do I want to go down the path of requiring a custom compiler to build the firmware now.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
although -Og is still the same.
As that is only for debugging I wouldn't have thought it matters - when you build "Release" you will surely use one of the aggressive optimization settings?

mojo-chan wrote:
Question is do I want to go down the path of requiring a custom compiler to build the firmware now.
I guess that depends how tight of space/time your application is? Does this apparently one time initialization of UART/DMA really make that much different to the operation of the code that the odd additional opcode matters? Or do you want to use a tool that is currently widely used/supported (which would, for example, make it easy for others to help out if things like this came up in future) ?

 

BTW it could be interesting to look at AVR/__flash specific issues that were fixed between 5.4 and 11.1 to se if there's anything "lethal" lurking there that you can't do with out.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:

As that is only for debugging I wouldn't have thought it matters - when you build "Release" you will surely use one of the aggressive optimization settings?

 

Have not decided yet. The problem with changing the settings for release is that you change the behaviour of the program. Timing is different. In theory it shouldn't matter, and -Og should be worst case for timing, but in practice... Unless you are 101% sure you carefully tested everything with the release settings... Well, especially in early versions the debug settings tend to be a lot better tested. I wish I had time to do a full test suite for this one, but I don't.

 

In the past I've just settled on using -O2 or -Os for debugging and release (before -Og was available). I normally leave all the debug output etc. in, maybe just disable the UART GPIO pin so it can't be read but the timing is still the same. Aside from anything else it's easier to do a test suite with the debug output enabled.

 

clawson wrote:

I guess that depends how tight of space/time your application is? Does this apparently one time initialization of UART/DMA really make that much different to the operation of the code that the odd additional opcode matters? Or do you want to use a tool that is currently widely used/supported (which would, for example, make it easy for others to help out if things like this came up in future) ?

 

BTW it could be interesting to look at AVR/__flash specific issues that were fixed between 5.4 and 11.1 to se if there's anything "lethal" lurking there that you can't do with out.

 

Yeah, I might be worrying about nothing. This is replacing an older application, and the older one has major timing problems on an XMEGA at 32MHz. It uses interrupts for all UART stuff, rather than DMA, but the new one also has 4 UARTs instead of 1. It's only MODBUS at 19200 baud, but instead of the usual 1 second timeout for a response it's 10ms. I can't change it either.

 

I suppose in the scheme of things it's a few microseconds tops, as long as the code isn't called often. Having grown up hand optimising 68k assembler it bugs me knowing it's there, but I'll just have to live with it.

 

Diving into GCC code sounds like a rabbit hole worth of Alice, but maybe I'll take a peak.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

BTW -O3 -g is allowed.

It can make debugging more interesting,

but if you need -O3, you should use it.

 

As I typed the previous paragraph,

it occurred to me that if the last code you debug is good code,

it must run correctly with the debug flags.

Why not ship with the debug flags?

 

BTW gcc does not guarantee that -Owhatever will produce

the same machine instructions as -Owhatever -g .

It might actually do so,

but somewhere in the documentation is the above denial of a guarantee.

Moderation in all things. -- ancient proverb

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

If one wants to ship code in a production version rather than a debug version, you must have a test regime which adequately tests both versions with the same result in each case.

 

As an example, you may recall I discussed a problem with code that ran fine for minutes or hours and then stopped - unless it was the debug version. That turned out to be that the time taken to output a debug message was long enough to conceal a wrong-sense check for an incoming character. Most of the time when the debug wasn't there, the character was, but every now and then it hadn't quite arrived...

 

Neil

 

edited to add: the test process is nothing to do with how the code works, but everything to do with what it does. What you care about is that the thing operates as it should, providing inputs outputs and actions. This might be a software framework but it's hard to beat a physical user test; they're great for finding ways through interface code you didn't know existed, aka 'why on earth would anyone do that?'.

Last Edited: Wed. Jun 9, 2021 - 05:12 AM
This reply has been marked as the solution. 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Try -mstrict-X

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:

Try -mstrict-X

 

Thanks. That makes GCC use the Y register with displacement... But won't that break interrupts?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

No. Y can be used in ISRs, of course.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
Why not ship with the debug flags?

for a start, that may leave massive security holes.

 

As Neil suggested,

Test what you fly; fly what you test.

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
Last Edited: Thu. Jun 10, 2021 - 01:40 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:

No. Y can be used in ISRs, of course.

 

Ah yes, of course. I'm trying to understand why this option isn't the default, seems to be a case of there simply being some uncertainty over its utility overall.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
As I typed the previous paragraph,

it occurred to me that if the last code you debug is good code,

it must run correctly with the debug flags.

Why not ship with the debug flags?

awneil wrote:
skeeve wrote:

Why not ship with the debug flags?

 

for a start, that may leave massive security holes.

Then it does not run correctly.

Moderation in all things. -- ancient proverb

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My last place of employment had a firm policy that code with debug modes enabled could not be checked into the version control system. Anything that was regression tested was always a production version.

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mojo-chan wrote:
I tried avr-gcc 11.1.0 and on -O2 and -O3 it does optimise this a lot better, although -Og is still the same.

 

So, all this time you were actually looking at code produced in `-Og` mode? What is the point of doing this? What is the point of asking optimization questions about `-Og` code?

Dessine-moi un mouton

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
but if you need -O3, you should use it.

 

You always need `-O3`. Barring genuine compiler bugs (which do happen), if your code doesn't work with `-O3`, your code is broken.

Dessine-moi un mouton

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Okay, one last time, and can we then stop asking about this please? I know what I'm doing, I have decades of experience thanks.

 

Time is limited. I'm not going to be able to build a full test suite. It's a low volume product. I'll use defensive programming, static analysis and as much testing as I can manage, but I'm going to keep the debug output enabled because some of the debugging will doubtless happen after installation of the system. There's no benefit at all to disabling it, performance will be adequate and I'm not worried about someone connecting to the UART or USB and seeing some data spew out. I am concerned about installers needing that info to get the job done on-site.

 

In an idea world this would not be the case, but we don't live in an ideal world. I can't even go to site to help out, because of the human malware situation.