Unrolling memset()


Hi. I'm trying to clear a 256 byte array as quickly as possible. At the moment I have:

 

memset(input_matrix, 0, 256);

 

With -O3 this produces:

 

	memset(input_matrix, 0, 256);
    1306:	20 e0       	ldi	r18, 0x00	; 0
    1308:	31 e0       	ldi	r19, 0x01	; 1
    130a:	e6 ef       	ldi	r30, 0xF6	; 246
    130c:	f3 e2       	ldi	r31, 0x23	; 35
    130e:	df 01       	movw	r26, r30
    1310:	a9 01       	movw	r20, r18
    1312:	1d 92       	st	X+, r1
    1314:	41 50       	subi	r20, 0x01	; 1
    1316:	50 40       	sbci	r21, 0x00	; 0
    1318:	e1 f7       	brne	.-8      	; 0x1312 <RPT_refresh_input_matrix+0x16>

 

In fact -O3 is slower than -O2 overall (the main loop executes at about 5.5kHz with -O3 and 7.2kHz with -O2). In any case -O2 produces the same code.

 

So clearly memset() is not very optimized. I tried a loop to see if GCC would unroll it:

 

	uint8_t i = 0;
	do { input_matrix[i++] = 0; } while(i);
	
	
    1308:	e8 2f       	mov	r30, r24
    130a:	f0 e0       	ldi	r31, 0x00	; 0
    130c:	ea 50       	subi	r30, 0x0A	; 10
    130e:	fc 4d       	sbci	r31, 0xDC	; 220
    1310:	10 82       	st	Z, r1
    1312:	8f 5f       	subi	r24, 0xFF	; 255
    1314:	c9 f7       	brne	.-14     	; 0x1308 <RPT_refresh_input_matrix+0xc>

 

I'm going to write my own assembler code for this, but I'd really like to know what GCC needs in order to decide that unrolling this is a good idea.


mojo-chan wrote:
With -O3 this produces:
memset() is a precompiled library function, so your optimisation setting is not going to change it - you get whatever fixed optimisation was used when the library was built, though I suspect a function like memset() is actually implemented in Asm anyway. The only thing your optimisation level may affect is the setup of the parameters before the call.

 

EDIT: yup, the source of memset() is Asm here:

 

http://svn.savannah.gnu.org/view...

 

The curious thing is that in your example there is no (R)CALL to lib code, so maybe the front end of GCC recognises memset and provides an inline implementation?

Last Edited: Mon. Jun 19, 2017 - 08:44 AM

Curious when I build:

#include <string.h>
#include <avr/io.h>
#include <avr/interrupt.h>

uint8_t buff[256];

int main(void) {
	memset(buff, 0x55, 256);
}

I get:

0000007c <main>:
  7c:   40 e0           ldi     r20, 0x00       ; 0
  7e:   51 e0           ldi     r21, 0x01       ; 1
  80:   65 e5           ldi     r22, 0x55       ; 85
  82:   70 e0           ldi     r23, 0x00       ; 0
  84:   80 e6           ldi     r24, 0x60       ; 96
  86:   90 e0           ldi     r25, 0x00       ; 0
  88:   0e 94 49 00     call    0x92    ; 0x92 <memset>
  8c:   80 e0           ldi     r24, 0x00       ; 0
  8e:   90 e0           ldi     r25, 0x00       ; 0
  90:   08 95           ret

00000092 <memset>:
  92:   dc 01           movw    r26, r24
  94:   01 c0           rjmp    .+2             ; 0x98 <memset+0x6>
  96:   6d 93           st      X+, r22
  98:   41 50           subi    r20, 0x01       ; 1
  9a:   50 40           sbci    r21, 0x00       ; 0
  9c:   e0 f7           brcc    .-8             ; 0x96 <memset+0x4>
  9e:   08 95           ret

which has included the library code and has called it.


Weird... I am compiling for XMEGA but it shouldn't make any difference. The code is definitely inlined, no call.

 

A long string of "st X+, r1" seems to be the fastest possible implementation. The only possible optimization I can think of is to use DMA while the CPU is doing something else, but it's probably marginal since DMA and the CPU share bus access, and even the I/O ports are on that bus.

 

256 store-with-post-increment instructions to SRAM should take 512 cycles according to the 128A3U datasheet, which is about 20uS at 24MHz. But when I measure it on a 'scope it only takes 10uS... The datasheet says one cycle, plus an extra one if accessing SRAM, which is what I'm doing. I verified that the CPU frequency is 24MHz.
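
For reference, the arithmetic behind those two figures:

	256 stores x 2 cycles = 512 cycles -> 512 / 24MHz = ~21.3uS
	256 stores x 1 cycle  = 256 cycles -> 256 / 24MHz = ~10.7uS

so the 10uS I measure only adds up if each "st X+" is actually completing in a single cycle here.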


Can you try with 0x00 as your set value? I'm wondering if it triggers an optimization using r1 and some inlining.


See, my problem is nonsense like this:

 

	for (uint8_t i = 0; i < map->count; i++)
		input_matrix[map->mapping[i][0]] |= input_matrix[map->mapping[i][1]];
	for (uint8_t i = 0; i < map->count; i++)
    13f8:	c0 91 02 20 	lds	r28, 0x2002	; 0x802002 <map>
    13fc:	d0 91 03 20 	lds	r29, 0x2003	; 0x802003 <map+0x1>
    1400:	8b 81       	ldd	r24, Y+3	; 0x03
    1402:	88 23       	and	r24, r24
    1404:	f9 f0       	breq	.+62     	; 0x1444 <RPT_refresh_input_matrix+0x164>
		input_matrix[map->mapping[i][0]] |= input_matrix[map->mapping[i][1]];
    1406:	20 e0       	ldi	r18, 0x00	; 0
    1408:	82 2f       	mov	r24, r18
    140a:	90 e0       	ldi	r25, 0x00	; 0
    140c:	fc 01       	movw	r30, r24
    140e:	32 96       	adiw	r30, 0x02	; 2
    1410:	ee 0f       	add	r30, r30
    1412:	ff 1f       	adc	r31, r31
    1414:	ec 0f       	add	r30, r28
    1416:	fd 1f       	adc	r31, r29
    1418:	e0 81       	ld	r30, Z
    141a:	f0 e0       	ldi	r31, 0x00	; 0
    141c:	ea 50       	subi	r30, 0x0A	; 10
    141e:	fd 4d       	sbci	r31, 0xDD	; 221
    1420:	88 0f       	add	r24, r24
    1422:	99 1f       	adc	r25, r25
    1424:	de 01       	movw	r26, r28
    1426:	a8 0f       	add	r26, r24
    1428:	b9 1f       	adc	r27, r25
    142a:	15 96       	adiw	r26, 0x05	; 5
    142c:	ac 91       	ld	r26, X
    142e:	b0 e0       	ldi	r27, 0x00	; 0
    1430:	aa 50       	subi	r26, 0x0A	; 10
    1432:	bd 4d       	sbci	r27, 0xDD	; 221
    1434:	9c 91       	ld	r25, X
    1436:	80 81       	ld	r24, Z
    1438:	89 2b       	or	r24, r25
    143a:       80 83           st    Z, r24
    143c:       2f 5f           subi    r18, 0xFF    ; 255
    143e:       8b 81           ldd    r24, Y+3    ; 0x03
    1440:       28 17           cp    r18, r24
    1442:       10 f3           brcs    .-60         ; 0x1408 <RPT_refresh_input_matrix+0x128>

 

 

 

I don't really want to re-write half my application in assembler, but GCC's optimizer for AVR8 really seems to leave a lot to be desired. As an experiment I'm going to try porting this code to an ARM Cortex M0+ just to see what GCC does with it. ARM probably gets a lot more love, although it is THUMB only on M0...

Last Edited: Mon. Jun 19, 2017 - 09:41 AM

WHY?

 

memset() is generally an optimised ASM function.   The Compiler might even inline it.

 

I can't see much point in worrying about the speed of memset() for 10 bytes.   Yes,  it would be noticeable for 64k bytes.

As a general rule,  it is worth unrolling 16 writes to reduce the loop overhead.   There is little point in unrolling much further.
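
For illustration, a minimal C sketch of unrolling 16 writes per pass (the function name and the assumption that len is a non-zero multiple of 16 are mine, just to show the shape of it):

#include <stdint.h>

static void clear_by_16(uint8_t *p, uint16_t len)
{
	/* 16 stores per loop pass, so the decrement/branch overhead is paid
	   once per 16 bytes instead of once per byte. */
	do {
		p[0] = 0;  p[1] = 0;  p[2] = 0;  p[3] = 0;
		p[4] = 0;  p[5] = 0;  p[6] = 0;  p[7] = 0;
		p[8] = 0;  p[9] = 0;  p[10] = 0; p[11] = 0;
		p[12] = 0; p[13] = 0; p[14] = 0; p[15] = 0;
		p += 16;
		len -= 16;
	} while (len);
}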

 

If you use DMA,  the loop overhead disappears completely.    And for anything more than 16 bytes it is worth setting up DMA.

 

David.


Tried with 0x00. I added "-v" to show the compiler version (and a whole bunch of other stuff). As you can see, the build command I use is about as cut-down and simple as they come. To get RCALL rather than CALL I guess I should have used -mrelax, and there are probably other options I would have benefited from:

C:\SysGCC\avr\bin>avr-gcc -v -mmcu=atmega16 -Os avr.c -o avr.elf
Using built-in specs.
Reading specs from c:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/device-specs/specs-atmega16
COLLECT_GCC=avr-gcc
COLLECT_LTO_WRAPPER=c:/sysgcc/avr/bin/../libexec/gcc/avr/5.3.0/lto-wrapper.exe
Target: avr
Configured with: ../gcc-5.3.0/configure --target avr --enable-win32-registry=SysGCC-avr-5.3.0 --enable-languages=c,c++ --disable
-nls --without-libiconv-prefix --prefix /q/gnu/auto/bu-2.25+gcc-5.3.0+gmp-5.1.3+mpfr-3.1.2+mpc-1.0.2-avr/ --host i686-pc-mingw32
 --disable-shared --with-avrlibc=yes
Thread model: single
gcc version 5.3.0 (GCC)
COLLECT_GCC_OPTIONS='-v'  '-Os' '-o' 'avr.elf' '-specs=device-specs/specs-atmega16' '-mmcu=avr5'
 c:/sysgcc/avr/bin/../libexec/gcc/avr/5.3.0/cc1.exe -quiet -v -imultilib avr5 -iprefix c:\sysgcc\avr\bin\../lib/gcc/avr/5.3.0/ -
D__AVR_ATmega16__ -D__AVR_DEVICE_NAME__=atmega16 avr.c -mn-flash=1 -mno-skip-bug -quiet -dumpbase avr.c -mmcu=avr5 -auxbase avr
-Os -version -o C:\Users\uid23021\AppData\Local\Temp\ccWswYQa.s
GNU C11 (GCC) version 5.3.0 (avr)
        compiled by GNU C version 4.7.3, GMP version 5.1.3, MPFR version 3.1.2, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "c:\sysgcc\avr\bin\../lib/gcc/avr/5.3.0/../../../../avr/sys-include"
ignoring duplicate directory "c:/sysgcc/avr/lib/gcc/../../lib/gcc/avr/5.3.0/include"
ignoring duplicate directory "c:/sysgcc/avr/lib/gcc/../../lib/gcc/avr/5.3.0/include-fixed"
ignoring nonexistent directory "c:/sysgcc/avr/lib/gcc/../../lib/gcc/avr/5.3.0/../../../../avr/sys-include"
ignoring duplicate directory "c:/sysgcc/avr/lib/gcc/../../lib/gcc/avr/5.3.0/../../../../avr/include"
#include "..." search starts here:
#include <...> search starts here:
 c:\sysgcc\avr\bin\../lib/gcc/avr/5.3.0/include
 c:\sysgcc\avr\bin\../lib/gcc/avr/5.3.0/include-fixed
 c:\sysgcc\avr\bin\../lib/gcc/avr/5.3.0/../../../../avr/include
End of search list.
GNU C11 (GCC) version 5.3.0 (avr)
        compiled by GNU C version 4.7.3, GMP version 5.1.3, MPFR version 3.1.2, MPC version 1.0.2
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: ea6563fa55f568b989f2a9b10690f8a6
COLLECT_GCC_OPTIONS='-v'  '-Os' '-o' 'avr.elf' '-specs=device-specs/specs-atmega16' '-mmcu=avr5'
 c:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/../../../../avr/bin/as.exe -mmcu=avr5 -mno-skip-bug -o C:\Users\uid23021\AppData\Local\T
emp\cc0aNPeF.o C:\Users\uid23021\AppData\Local\Temp\ccWswYQa.s
COMPILER_PATH=c:/sysgcc/avr/bin/../libexec/gcc/avr/5.3.0/;c:/sysgcc/avr/bin/../libexec/gcc/;c:/sysgcc/avr/bin/../lib/gcc/avr/5.3
.0/../../../../avr/bin/
LIBRARY_PATH=c:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/avr5/;c:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/../../../../avr/lib/avr5/;c:/sy
sgcc/avr/bin/../lib/gcc/avr/5.3.0/;c:/sysgcc/avr/bin/../lib/gcc/;c:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/../../../../avr/lib/
COLLECT_GCC_OPTIONS='-v'  '-Os' '-o' 'avr.elf' '-specs=device-specs/specs-atmega16' '-mmcu=avr5'
 c:/sysgcc/avr/bin/../libexec/gcc/avr/5.3.0/collect2.exe -plugin c:/sysgcc/avr/bin/../libexec/gcc/avr/5.3.0/liblto_plugin-0.dll
-plugin-opt=c:/sysgcc/avr/bin/../libexec/gcc/avr/5.3.0/lto-wrapper.exe -plugin-opt=-fresolution=C:\Users\uid23021\AppData\Local\
Temp\ccyifI7b.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lm -plugin-opt=-pass-through=-lc -plugin-opt=-pass-
through=-latmega16 -mavr5 -o avr.elf c:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/../../../../avr/lib/avr5/crtatmega16.o -Lc:/sysgcc/a
vr/bin/../lib/gcc/avr/5.3.0/avr5 -Lc:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/../../../../avr/lib/avr5 -Lc:/sysgcc/avr/bin/../lib/gc
c/avr/5.3.0 -Lc:/sysgcc/avr/bin/../lib/gcc -Lc:/sysgcc/avr/bin/../lib/gcc/avr/5.3.0/../../../../avr/lib C:\Users\uid23021\AppDat
a\Local\Temp\cc0aNPeF.o --start-group -lgcc -lm -lc -latmega16 --end-group

That build creates:

0000007c <main>:
  7c:   80 e0           ldi     r24, 0x00       ; 0
  7e:   91 e0           ldi     r25, 0x01       ; 1
  80:   e0 e6           ldi     r30, 0x60       ; 96
  82:   f0 e0           ldi     r31, 0x00       ; 0
  84:   df 01           movw    r26, r30
  86:   9c 01           movw    r18, r24
  88:   1d 92           st      X+, r1
  8a:   21 50           subi    r18, 0x01       ; 1
  8c:   30 40           sbci    r19, 0x00       ; 0
  8e:   e1 f7           brne    .-8             ; 0x88 <main+0xc>
  90:   80 e0           ldi     r24, 0x00       ; 0
  92:   90 e0           ldi     r25, 0x00       ; 0
  94:   08 95           ret

so, yes, the compiler has recognized 0x00 as a "special case" and inlined an implementation rather than calling the library. The core of the loop here is:

  88:   1d 92           st      X+, r1
  8a:   21 50           subi    r18, 0x01       ; 1
  8c:   30 40           sbci    r19, 0x00       ; 0
  8e:   e1 f7           brne    .-8             ; 0x88 <main+0xc>

That is the bit repeated 256 times. Are you suggesting it has missed a more efficient way to do this, then? Are you thinking that, because the count is 256, it should have been able to run an 8-bit counter until rollover or something?


Thanks Cliff. Yes, it could have unrolled the loop or used an 8 bit counter. There just doesn't seem to be a way to give the compiler that hint...
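
For what it's worth, the 8-bit-counter version is easy enough to write directly in inline asm; a sketch, with the "+x" operand and the "memory" clobber declared the way I believe they need to be so that GCC knows both the pointer and the array contents change:

	uint8_t *p = input_matrix;
	uint8_t cnt = 0;	/* 0 gives 256 passes of DEC before it reaches zero again */

	asm volatile(
		"loop%=:"					"\n\t"
		"st		X+, __zero_reg__"	"\n\t"
		"dec	%[cnt]"				"\n\t"
		"brne	loop%="
	: [cnt] "+r" (cnt), "+x" (p)
	:
	: "memory"
	);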

 

If it were just a minor thing I wouldn't care, but take my code above: GCC's version takes 35uS to execute over about 50 iterations, and a fair chunk of flash, when all it needs is this:

 

	asm volatile(
		"loop%=:"					"\n\t"
		"ld		__tmp_reg__, X"		"\n\t"
		"ld		r18, Z+"			"\n\t"
		"or		__tmp_reg__, r18"	"\n\t"
		"st		X+, __tmp_reg__"	"\n\t"
		"dec	%[count]"			"\n\t"
		"brne	loop%="
	:
	: [count] "r" (map->count), [input] "x" (map->mapping[0][0]), [output] "z" (map->mapping[0][1])
	: "r18"
	);

 

Is there some C rule I'm missing that makes this solution invalid? The counter and the arrays are all bytes.

 

I was thinking I could actually save a cycle by just using a 0x00 "end of array" marker, which would be loaded into r18 anyway, but I digress...


mojo-chan wrote:
Thanks Cliff. Yes, it could have unrolled the loop or used an 8 bit counter. There just doesn't seem to be a way to give the compiler that hint...
The count in memset is size_t which is 16 bit for AVR. It would be a VERY clever compiler that recognized the count <256 and used an 8 bit counter.

 

EDIT: OK, so what we have on our hands is a very clever compiler:

int main(void) {
	memset(buff, 0, 255);
}

(note 255 not 256) produces:

0000007c <main>:
  7c:   8f ef           ldi     r24, 0xFF       ; 255
  7e:   e0 e6           ldi     r30, 0x60       ; 96
  80:   f0 e0           ldi     r31, 0x00       ; 0
  82:   df 01           movw    r26, r30
  84:   1d 92           st      X+, r1
  86:   8a 95           dec     r24
  88:   e9 f7           brne    .-6             ; 0x84 <main+0x8>

So: do the memset() with a count of 255, then do the last byte as a separate write.

int main(void) {
	memset(buff, 0, 255);
	buff[255] = 0;
}

That produces:

  7c:   8f ef           ldi     r24, 0xFF       ; 255
  7e:   e0 e6           ldi     r30, 0x60       ; 96
  80:   f0 e0           ldi     r31, 0x00       ; 0
  82:   df 01           movw    r26, r30
  84:   1d 92           st      X+, r1
  86:   8a 95           dec     r24
  88:   e9 f7           brne    .-6             ; 0x84 <main+0x8>
  8a:   10 92 5f 01     sts     0x015F, r1

Which just adds an STS on the end.

Last Edited: Mon. Jun 19, 2017 - 10:21 AM

LOL, nice catch Cliff.


With an AVR, you can do a 16-bit decrement very easily, e.g. with SBIW.

But nested 8-bit loops can be a little faster.

 

50 iterations are well worth using DMA.   You lose the loop overhead.   And you can be doing other things at the same time.

Not only that, but you do not need to learn any asm gobbledygook at all.

 

Mind you,   I would write an ASM function in an .S file as a question of principle.

 

David.

Last Edited: Mon. Jun 19, 2017 - 11:00 AM

Just re-writing the memset() and other loop I mentioned halved the execution time of the main loop. Assembler is the only option for this project to reach the performance level it needs :(


I don't believe you.

 

What size of memory do you want to set?

What F_CPU and what is an acceptable time?

 

David.


mojo-chan wrote:
Assembler is the only option for this project to reach the performance level it needs
I don't believe you either. Unless you are a total magician how could the core of the memset loop be any tighter than:

  84:   1d 92           st      X+, r1
  86:   8a 95           dec     r24
  88:   e9 f7           brne    .-6             ; 0x84 <main+0x8>

that we saw the C compiler generate above? I would love to see your Asm solution that halves the cycles in that!

 

Or are you talking about:

		input_matrix[map->mapping[i][0]] |= input_matrix[map->mapping[i][1]];
    1406:	20 e0       	ldi	r18, 0x00	; 0
    1408:	82 2f       	mov	r24, r18
    140a:	90 e0       	ldi	r25, 0x00	; 0
    140c:	fc 01       	movw	r30, r24
    140e:	32 96       	adiw	r30, 0x02	; 2
    1410:	ee 0f       	add	r30, r30
    1412:	ff 1f       	adc	r31, r31
    1414:	ec 0f       	add	r30, r28
    1416:	fd 1f       	adc	r31, r29
    1418:	e0 81       	ld	r30, Z
    141a:	f0 e0       	ldi	r31, 0x00	; 0
    141c:	ea 50       	subi	r30, 0x0A	; 10
    141e:	fd 4d       	sbci	r31, 0xDD	; 221
    1420:	88 0f       	add	r24, r24
    1422:	99 1f       	adc	r25, r25
    1424:	de 01       	movw	r26, r28
    1426:	a8 0f       	add	r26, r24
    1428:	b9 1f       	adc	r27, r25
    142a:	15 96       	adiw	r26, 0x05	; 5
    142c:	ac 91       	ld	r26, X
    142e:	b0 e0       	ldi	r27, 0x00	; 0
    1430:	aa 50       	subi	r26, 0x0A	; 10
    1432:	bd 4d       	sbci	r27, 0xDD	; 221
    1434:	9c 91       	ld	r25, X
    1436:	80 81       	ld	r24, Z
    1438:	89 2b       	or	r24, r25
    143a:       80 83           st    Z, r24
    143c:       2f 5f           subi    r18, 0xFF    ; 255
    143e:       8b 81           ldd    r24, Y+3    ; 0x03
    1440:       28 17           cp    r18, r24
    1442:       10 f3           brcs    .-60         ; 0x1408 <RPT_refresh_input_matrix+0x128>

I cannot help thinking that the C can probably be "massaged" to generate something more efficient here. However, this is using two lots of double indirection, so I'm not sure exactly how much could be cut out of that.
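
One way the C might be massaged, as a sketch (this assumes map->mapping is a table of uint8_t index pairs, i.e. uint8_t mapping[][2], which is what the single-byte loads above suggest):

	uint8_t (*m)[2] = map->mapping;	/* walk the table with a pointer */
	uint8_t n = map->count;		/* read the count once, not every pass */

	while (n--)
	{
		input_matrix[(*m)[0]] |= input_matrix[(*m)[1]];
		m++;
	}

No guarantee GCC turns that into a tight LD/OR/ST loop, but caching map->count in a local at least removes the reload of it that the listing shows on every iteration.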


clawson wrote:

  84:   1d 92           st      X+, r1
  86:   8a 95           dec     r24
  88:   e9 f7           brne    .-6             ; 0x84 <main+0x8>

that we saw the C compiler generate above? I would love to see your Asm solution that halves the cycles in that!

 

#define STX		"st		X+, r1"				"\n\t"
#define STX16	STX STX STX STX STX STX STX STX STX STX STX STX STX STX STX STX

	asm volatile(
		STX16
		STX16
		STX16
		STX16	// 64
		STX16
		STX16
		STX16
		STX16	// 128
		STX16
		STX16
		STX16
		STX16	// 192
		STX16
		STX16
		STX16
		STX16	// 256
	:
	: [input] "x" (input_matrix)
	);

 

I take the point about memset being some kind of built-in so not necessarily optimizing well, but surely the while() loop I posted right at the top could be unrolled...

 

Quote:
I cannot help thinking that the C can probably be "massaged" to generate something more efficient here. However, this is using two lots of double indirection, so I'm not sure exactly how much could be cut out of that.

 

	asm volatile(
		"loop%=:"					"\n\t"
		"ld		__tmp_reg__, X"		"\n\t"
		"ld		r18, Z+"			"\n\t"
		"or		__tmp_reg__, r18"	"\n\t"
		"st		X+, __tmp_reg__"	"\n\t"
		"dec	%[count]"			"\n\t"
		"brne	loop%="
	:
	: [count] "r" (map->count), [input] "x" (map->mapping[0][0]), [output] "z" (map->mapping[0][1])
	: "r18"
	);

 

I'd like to think you are right that the compiler can be convinced to generate better code somehow, but no-one seems to know how. You may remember that I broke down and coded this up in assembler after realizing that the compiler was never going to be able to optimize bit stuffing this well: http://www.avrfreaks.net/forum/g...

 

	asm volatile(
		"ldi	r18, 16"			"\n\t"
		"loop%=:"					"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 0
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 0"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 1
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 1"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 2
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 2"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 3
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 3"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 4
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 4"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 5
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 5"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 6
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 6"				"\n\t"
		"ld		__tmp_reg__, X+"	"\n\t"	// 7
		"bst	__tmp_reg__, 0"		"\n\t"
		"bld	r19, 7"				"\n\t"
		"st		Z+, r19"			"\n\t"
		"dec	r18"				"\n\t"
		"brne	loop%="
	:
	: [input] "x" (input_matrix), [output] "z" (buffer)
	: "r18", "r19"
	);

 

Maybe I'm expecting too much... I'm going to port some of this code to ARM this afternoon, just to see what GCC does with it.

 

Or maybe my compiler-whispering skills are lacking.


Apparently on ARM it seems GCC can't even optimize memset:

 

	memset(input_matrix, 0, 256);
 384:	2280      	movs	r2, #128	; 0x80
 386:	b570      	push	{r4, r5, r6, lr}
 388:	4c0f      	ldr	r4, [pc, #60]	; (3c8 <RPT_refresh_input_matrix+0x44>)
 38a:	0052      	lsls	r2, r2, #1
 38c:	2100      	movs	r1, #0
 38e:	4b0f      	ldr	r3, [pc, #60]	; (3cc <RPT_refresh_input_matrix+0x48>)
 390:	0020      	movs	r0, r4
 392:	4798      	blx	r3

...

00000420 <memset>:
 420:	b5f0      	push	{r4, r5, r6, r7, lr}
 422:	0783      	lsls	r3, r0, #30
 424:	d043      	beq.n	4ae <memset+0x8e>
 426:	1e54      	subs	r4, r2, #1
 428:	2a00      	cmp	r2, #0
 42a:	d03f      	beq.n	4ac <memset+0x8c>
 42c:	b2ce      	uxtb	r6, r1
 42e:	0002      	movs	r2, r0
 430:	2503      	movs	r5, #3
 432:	e002      	b.n	43a <memset+0x1a>
 434:	001a      	movs	r2, r3
 436:	3c01      	subs	r4, #1
 438:	d338      	bcc.n	4ac <memset+0x8c>
 43a:	1c53      	adds	r3, r2, #1
 43c:	7016      	strb	r6, [r2, #0]
 43e:	422b      	tst	r3, r5
 440:	d1f8      	bne.n	434 <memset+0x14>
 442:	2c03      	cmp	r4, #3
 444:	d92a      	bls.n	49c <memset+0x7c>
 446:	22ff      	movs	r2, #255	; 0xff
 448:	400a      	ands	r2, r1
 44a:	0215      	lsls	r5, r2, #8
 44c:	4315      	orrs	r5, r2
 44e:	042a      	lsls	r2, r5, #16
 450:	4315      	orrs	r5, r2
 452:	2c0f      	cmp	r4, #15
 454:	d914      	bls.n	480 <memset+0x60>
 456:	0027      	movs	r7, r4
 458:	001a      	movs	r2, r3
 45a:	3f10      	subs	r7, #16
 45c:	093e      	lsrs	r6, r7, #4
 45e:	3601      	adds	r6, #1
 460:	0136      	lsls	r6, r6, #4
 462:	199e      	adds	r6, r3, r6
 464:	6015      	str	r5, [r2, #0]
 466:	6055      	str	r5, [r2, #4]
 468:	6095      	str	r5, [r2, #8]
 46a:	60d5      	str	r5, [r2, #12]
 46c:	3210      	adds	r2, #16
 46e:	4296      	cmp	r6, r2
 470:	d1f8      	bne.n	464 <memset+0x44>
 472:	220f      	movs	r2, #15
 474:	4397      	bics	r7, r2
 476:	3710      	adds	r7, #16
 478:	19db      	adds	r3, r3, r7
 47a:	4014      	ands	r4, r2
 47c:	2c03      	cmp	r4, #3
 47e:	d90d      	bls.n	49c <memset+0x7c>
 480:	001a      	movs	r2, r3
 482:	1f27      	subs	r7, r4, #4
 484:	08be      	lsrs	r6, r7, #2
 486:	3601      	adds	r6, #1
 488:	00b6      	lsls	r6, r6, #2
 48a:	199e      	adds	r6, r3, r6
 48c:	c220      	stmia	r2!, {r5}
 48e:	42b2      	cmp	r2, r6
 490:	d1fc      	bne.n	48c <memset+0x6c>
 492:	2203      	movs	r2, #3
 494:	4397      	bics	r7, r2
 496:	3704      	adds	r7, #4
 498:	19db      	adds	r3, r3, r7
 49a:	4014      	ands	r4, r2
 49c:	2c00      	cmp	r4, #0
 49e:	d005      	beq.n	4ac <memset+0x8c>
 4a0:	b2c9      	uxtb	r1, r1
 4a2:	191c      	adds	r4, r3, r4
 4a4:	7019      	strb	r1, [r3, #0]
 4a6:	3301      	adds	r3, #1
 4a8:	429c      	cmp	r4, r3
 4aa:	d1fb      	bne.n	4a4 <memset+0x84>
 4ac:	bdf0      	pop	{r4, r5, r6, r7, pc}
 4ae:	0014      	movs	r4, r2
 4b0:	0003      	movs	r3, r0
 4b2:	e7c6      	b.n	442 <memset+0x22>

I'm really surprised, I expected GCC to do a really good job on ARM.


mojo-chan wrote:
I posted right at the top could be unrolled...
Did you ask the compiler to attempt that?

 

https://gcc.gnu.org/onlinedocs/g...

 

Search "unroll". You are probably looking for -funroll-loops

 

As for memset, I said it before and I'll say it again: your C compiler comes with the functions in libc (printf, strlen, memset, sin, etc.) precompiled, so nothing you do can change their optimisation level. However, as we've discovered in this thread, GCC clearly has the ability to spot memset() being used with special values (0x00 here - I wonder if 0xFF is also treated specially?), so it may generate specific code in that case, and that code may benefit from the optimisation options in place. But otherwise, for memset:
 

C:\SysGCC\avr\avr\lib\avr5>avr-ar x libc.a memset.o

C:\SysGCC\avr\avr\lib\avr5>avr-objdump -S memset.o

memset.o:     file format elf32-avr


Disassembly of section .text.avr-libc:

00000000 <memset>:
   0:   dc 01           movw    r26, r24
   2:   00 c0           rjmp    .+0             ; 0x4 <memset+0x4>
   4:   6d 93           st      X+, r22
   6:   41 50           subi    r20, 0x01       ; 1
   8:   50 40           sbci    r21, 0x00       ; 0
   a:   00 f4           brcc    .+0             ; 0xc <memset+0xc>
   c:   08 95           ret

I extracted the memset code from libc.a (avr5 = mega16 etc) and disassembled. Nothing I do will change this code. It will either be added to my program and called or it won't.

 

It is, of course, possible to take any libc source, add it to your own program and use a local compilation of it rather than the extracted member from libc. That would then be subject to your chosen optimisation options.
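
For example, something as trivial as this dropped into your own project would then be built with whatever -O level and unrolling options you choose (a sketch, not the avr-libc source; the name is made up to avoid colliding with the real memset):

#include <stddef.h>

void *my_memset(void *dst, int c, size_t n)
{
	unsigned char *p = dst;

	while (n--)
		*p++ = (unsigned char)c;

	return dst;
}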


If you gave your cause for concern over the "stock" memset, then I missed it.

 

But if indeed this construct ends up being a bottleneck in your app, then construct your own at C level or lower.  Generally when that makes sense, you know more about your particular situation than the compiler writers and library writers can know about the "general" implementation.  For example, you might know that the bottleneck invocation always has exactly the same number of bytes, and perhaps that number is exactly [say] 256 or another power of two, which might lead to cycle-shaving.

 

Just for fun, I dug up the memset() invocation that I use in CodeVision in a series of apps with SD card interface, where I clear a 512-byte sector buffer before each build of the sector.  The routine is called and the parameters are stacked so indeed if I tried to inline there would be some cycles of overhead saved.  The loop looks tight to me.

 

Any serious unrolling would bloat the flash usage, wouldn't it?  Instead of a 3-word loop there would then be a 1-word ST X+ repeated say 512 times.  Perhaps not a problem in a particular app.  Something analogous to Duff's device could be used to jump into the instruction train at the right spot, right?

; 0000 25D0 					memset (tbuffptr, 0, sizeof(Sector));
00254d 91e0 1cd8 	LDS  R30,_tbuffptr
00254f 91f0 1cd9 	LDS  R31,_tbuffptr+1
002551 93fa      	ST   -Y,R31
002552 93ea      	ST   -Y,R30
002553 e0e0      	LDI  R30,LOW(0)
002554 93ea      	ST   -Y,R30
002555 e0a0      	LDI  R26,LOW(512)
002556 e0b2      	LDI  R27,HIGH(512)
002557 940e d274 	CALL _memset
...
                 _memset:
00d274 93ba      	ST   -Y,R27
00d275 93aa      	ST   -Y,R26
00d276 81b9          ldd  r27,y+1
00d277 81a8          ld   r26,y
00d278 9610          adiw r26,0
00d279 f031          breq memset1
00d27a 81fc          ldd  r31,y+4
00d27b 81eb          ldd  r30,y+3
00d27c 816a          ldd  r22,y+2
                 memset0:
00d27d 9361          st   z+,r22
00d27e 9711          sbiw r26,1
00d27f f7e9          brne memset0
                 memset1:
00d280 81eb          ldd  r30,y+3
00d281 81fc          ldd  r31,y+4
                 _0x20A0001:
00d282 9625      	ADIW R28,5
00d283 9508      	RET

 

 



clawson wrote:

mojo-chan wrote:
I posted right at the top could be unrolled...
Did you ask the compiler to attempt that?

 

https://gcc.gnu.org/onlinedocs/g...

 

Search "unroll". You are probably looking for -funroll-loops

 

Fair point. Let's try -O3 and -funroll-loops.

 

	for (uint16_t i = 0; i < 256; i++)
		input_matrix[i] = 0;
   283e:	e6 ef       	ldi	r30, 0xF6	; 246
    2840:	f2 e2       	ldi	r31, 0x22	; 34
    2842:	80 e0       	ldi	r24, 0x00	; 0
    2844:	91 e0       	ldi	r25, 0x01	; 1
    2846:	df 01       	movw	r26, r30
    2848:	9c 01       	movw	r18, r24
    284a:	1d 92       	st	X+, r1
    284c:	21 50       	subi	r18, 0x01	; 1
    284e:	30 40       	sbci	r19, 0x00	; 0
    2850:	e1 f7       	brne	.-8      	; 0x284a <RPT_refresh_input_matrix+0x12>

 

Could have saved 2 bytes of flash by using SBIW, but whatever... It didn't unroll the loop. Let's try something else.

 

	uint8_t i = 0;
	do { input_matrix[i++] = 0; } while(i != 0);
	uint8_t i = 0;
    283a:	80 e0       	ldi	r24, 0x00	; 0
	do { input_matrix[i++] = 0; } while(i != 0);
    283c:	e1 e0       	ldi	r30, 0x01	; 1
    283e:	e8 0f       	add	r30, r24
    2840:	a8 2f       	mov	r26, r24
    2842:	b0 e0       	ldi	r27, 0x00	; 0
    2844:	aa 50       	subi	r26, 0x0A	; 10
    2846:	bd 4d       	sbci	r27, 0xDD	; 221
    2848:	1c 92       	st	X, r1
    284a:	ae 2f       	mov	r26, r30
    284c:	b0 e0       	ldi	r27, 0x00	; 0
    284e:	aa 50       	subi	r26, 0x0A	; 10
    2850:	bd 4d       	sbci	r27, 0xDD	; 221
    2852:	1c 92       	st	X, r1
    2854:	ef 5f       	subi	r30, 0xFF	; 255
    2856:	f0 e0       	ldi	r31, 0x00	; 0
    2858:	ea 50       	subi	r30, 0x0A	; 10
    285a:	fd 4d       	sbci	r31, 0xDD	; 221
    285c:	10 82       	st	Z, r1
    285e:	23 e0       	ldi	r18, 0x03	; 3
    2860:	28 0f       	add	r18, r24
    2862:	e2 2f       	mov	r30, r18
    2864:	f0 e0       	ldi	r31, 0x00	; 0
    2866:	ea 50       	subi	r30, 0x0A	; 10
    2868:	fd 4d       	sbci	r31, 0xDD	; 221
    286a:	10 82       	st	Z, r1
    286c:	34 e0       	ldi	r19, 0x04	; 4
    286e:	38 0f       	add	r19, r24
    2870:	e3 2f       	mov	r30, r19
    2872:	f0 e0       	ldi	r31, 0x00	; 0
    2874:	ea 50       	subi	r30, 0x0A	; 10
    2876:	fd 4d       	sbci	r31, 0xDD	; 221
    2878:	10 82       	st	Z, r1
    287a:	45 e0       	ldi	r20, 0x05	; 5
    287c:	48 0f       	add	r20, r24
    287e:	e4 2f       	mov	r30, r20
    2880:	f0 e0       	ldi	r31, 0x00	; 0
    2882:	ea 50       	subi	r30, 0x0A	; 10
    2884:	fd 4d       	sbci	r31, 0xDD	; 221
    2886:	10 82       	st	Z, r1
    2888:	56 e0       	ldi	r21, 0x06	; 6
    288a:	58 0f       	add	r21, r24
    288c:	e5 2f       	mov	r30, r21
    288e:	f0 e0       	ldi	r31, 0x00	; 0
    2890:	ea 50       	subi	r30, 0x0A	; 10
    2892:	fd 4d       	sbci	r31, 0xDD	; 221
    2894:	10 82       	st	Z, r1
    2896:	67 e0       	ldi	r22, 0x07	; 7
    2898:	68 0f       	add	r22, r24
    289a:	e6 2f       	mov	r30, r22
    289c:	f0 e0       	ldi	r31, 0x00	; 0
    289e:	ea 50       	subi	r30, 0x0A	; 10
    28a0:	fd 4d       	sbci	r31, 0xDD	; 221
    28a2:	10 82       	st	Z, r1
    28a4:	88 5f       	subi	r24, 0xF8	; 248
    28a6:	51 f6       	brne	.-108    	; 0x283c <RPT_refresh_input_matrix+0x4>

 

Why is the compiler too dumb to just use "st X+, r1" since that's clearly what the loop is doing? How do I give it the necessary hints to do that?


theusch wrote:
Something analogous to Duff's device could be used to jump into the instruction train at the right spot, right?

 

I think that might be a good generic solution if you can spare a little flash memory. I might look at writing a little "fast_memx.c" kind of library.
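
Something along these lines, maybe (a rough, untested sketch; fast_clear and the 8-way unroll factor are just placeholders):

#include <stdint.h>

static void fast_clear(uint8_t *p, uint16_t n)
{
	if (n == 0)
		return;

	/* Duff-style: jump into the run of eight clears at the right offset,
	   then keep looping eight at a time. */
	uint16_t passes = (n + 7) / 8;

	switch (n % 8)
	{
	case 0:	do {	*p++ = 0;
	case 7:		*p++ = 0;
	case 6:		*p++ = 0;
	case 5:		*p++ = 0;
	case 4:		*p++ = 0;
	case 3:		*p++ = 0;
	case 2:		*p++ = 0;
	case 1:		*p++ = 0;
		} while (--passes > 0);
	}
}

Whether the switch dispatch is worth it on AVR for small counts would need measuring against the plain three-instruction loop.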


I've no idea what "input_matrix[]" is?

 

Is it a buffer of uint8_t ?

 

If I try this:

uint8_t input_matrix[256];

int main(void) {
	uint8_t i = 0;
	do { input_matrix[i++] = 0; } while(i != 0);
}

I get this:

 int main(void) {
        uint8_t i = 0;
  7c:   80 e0           ldi     r24, 0x00       ; 0
        do { input_matrix[i++] = 0; } while(i != 0);
  7e:   e8 2f           mov     r30, r24
  80:   f0 e0           ldi     r31, 0x00       ; 0
  82:   e0 5a           subi    r30, 0xA0       ; 160
  84:   ff 4f           sbci    r31, 0xFF       ; 255
  86:   10 82           st      Z, r1
  88:   8f 5f           subi    r24, 0xFF       ; 255
  8a:   c9 f7           brne    .-14            ; 0x7e <main+0x2>

which I agree is perhaps a bit disappointing, but it's not the litany you are able to generate, so there's something else going on.

EDIT: and this time with source annotation (forgot -g !)

Last Edited: Mon. Jun 19, 2017 - 03:34 PM

Your code is the same as mine... I used -funroll-loops and -O3 though. Without -funroll-loops I get the same output as you.

 

By the way, I screwed up the remapping code, so just in case anyone cares here is the corrected code, which is still over 50% faster than GCC's:

 

//	for (uint8_t i = 0; i < map->count; i++)
//		input_matrix[map->mapping[i][0]] |= input_matrix[map->mapping[i][1]];

	asm volatile(
		"loop%=:"					"\n\t"
		"ld		r18, Z+"			"\n\t"	// dest
		"ld		r19, Z+"			"\n\t"	// src
		"movw	Y, X"				"\n\t"
		"add	YL, r18"			"\n\t"
		"adc	YH, r1"				"\n\t"
		"ld		r18, Y"				"\n\t"
		"movw	Y, X"				"\n\t"
		"add	YL, r19"			"\n\t"
		"adc	YH, r1"				"\n\t"
		"ld		r19, Y"				"\n\t"
		"or		r18, r19"			"\n\t"
		"st		X+, __tmp_reg__"	"\n\t"
		"dec	%[count]"			"\n\t"
		"brne	loop%="
	:
	: [count] "r" (map->count),
	  [matrix] "x" (input_matrix),
	  [mapping] "z" (map->mapping[0][0])
	: "r18", "r19", "r28", "r29"
	);

15uS assembler vs 35uS GCC.

uint16_t clob0;
asm (
   // 20 words, 45 cycles + 256 ST instructions
   " MOV __tmp_reg__, %a1\n"
   " DEC __tmp_reg__\n"
   " 1:\n"
   " .rept 15  ;  15*17=255   could mess up asm size calculations\n"
   "   ST %a1+, %[byte]  ; syntax not obvious, see cookbook\n"
   " .rend\n"
   " CPSE %a1, __tmp_reg__\n"
   " RJMP 1b\n"
   " ST %a1+, %[byte]  ; 255+1=256\n"
   : "=e"(clob0)
   : "0"(start), [byte]"r"(byte)
   ) ;
// Not sure how to tell it start[0..255] might be affected
// Are we stuck with a memory clobber?
// 16 iterations of 16 ST's would be 18 words and 48 cycles + 256 ST's
// 8 iterations of 32 ST's would be 34 words and 24 cycles + 256 ST instructions

Edit: // comments


Last Edited: Wed. Jun 21, 2017 - 02:25 AM

Quote:
50 iterations are well worth using DMA.

Not clear.  DMA can take a lot of instructions to set up, and it probably adds contention to the memory bus...

 

 

Quote:
Apparently on ARM it seems GCC can't even optimize memset

 Are you sure?  Looks like fancy "code size isn't an issue, let's see if I can't speed things up by storing longs instead of bytes", perhaps with duff's device added as well.

It looks ugly, but it might run like a bat out of hell...  (Hmm:  https://chromium.googlesource.co... is apparently the algorithm.  arm-gcc (at least from the launchpad site) has .S files for various ARM cores, but they don't have comments and I can't tell whether they've actually been optimized over what the C code would/does produce.)

 


I think an unrolled sequence of "st X+, r1" is the fastest possible clear on XMEGA, unless you can run the clear on DMA in parallel with other CPU activity. In fact I can do that, so I might give it a try and see how much difference it makes.

 

Can anyone shed some light on the datasheet's claim that "st X+, r" takes an extra cycle if accessing SRAM? In my test it seems to take only one cycle, not two.


You can time your Xmega sequence with a hardware timer.   Or simply use a Logic Analyser.
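
For example (a sketch; it assumes the peripheral clock equals the 24MHz CPU clock so one TCC0 tick is one CPU cycle, and input_matrix is your 256-byte array):

#include <avr/io.h>
#include <string.h>

extern uint8_t input_matrix[256];

uint16_t time_clear(void)
{
	TCC0.CTRLA = TC_CLKSEL_DIV1_gc;	/* count every peripheral clock */
	TCC0.CNT = 0;

	memset(input_matrix, 0, 256);	/* code under test */

	return TCC0.CNT;				/* elapsed ticks */
}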

 

The Xmega can do some operations in one cycle when an AVR takes two cycles.   The Xmega GPIO performance is impressive.

Run your code through the Simulator.   It should give an accurate answer.    (then use a Logic Analyser if you are a sceptic)

 

David.


If you are not satisfied with GCC's built-in code as generated for memset, use -fno-builtin-memset or -fno-builtin to disable built-in expansion altogether.
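
e.g. something like (a sketch of the command line only; adjust the MCU and file names to your project):

avr-gcc -mmcu=atxmega128a3u -O2 -fno-builtin-memset main.c -o main.elf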

 

ARM: Some libc implementations come with highly optimized versions of memset and friends.  They look horrible when disassembled because there are quite a few lines of code, but they perform really fast, in particular on large chunks of memory.  The problem with ARM is that alignment has to be inferred at run-time, and the head / tail handling needed to satisfy alignment contributes to code size and run-time for small sizes.  The upside is that 32-bit loads / stores can be used.

 

For example, here is Newlib's memset:

http://sourceware.org/git/gitweb...

 

For ARM, there's even hand-coded asm:

http://sourceware.org/git/gitweb...

 

GCC handles this by built-in expanding small sizes / known alignments and only issuing libcalls otherwise.

 

For AVR-LibC, memset codes in asm, of course:

http://svn.savannah.nongnu.org/v...

 

However, standard distributions build with -Os hence OPTIMIZE_SPEED is false (-O2 is not a multilib option).  If you want to benefit from the partial unrolling, build your private variant of AVR-LibC with flags set as you like.

 


Last Edited: Wed. Jun 21, 2017 - 07:38 PM

To me, the ARM code looks very much like it was imported from "big systems", and makes assumptions that are probably not valid on most Cortex microcontrollers:

  1. Code is cached.
  2. accessing actual RAM memory is really slow.
  3. In particular, one line of code in a tight-ish loop is likely to be much faster than one write to RAM.

 

That's not really awful, if your chip can store one memory cell per flash instruction, but it's getting up to the point where it might not be worth the obfuscation...

(And I'd worry that there is no separate library for the "small flash" CM0 chips with the __OPTIMIZE_SIZE__ preprocessor variable set...)

 


AVR is Harvard architecture so the fastest way to clear RAM is likely to be the CPU, since instruction fetch happens on a different bus. ARM is Von Neumann, so instruction fetch consumes memory bandwidth and it's possible that peripherals are faster. The old Amiga architecture was like that - there was a memory manipulation chip called the Blitter that was primarily for graphics but could also do things like memory clear/copy and even floppy disk bitstream decoding faster than the CPU because it didn't need to fetch instructions from RAM.

 

Back to AVR, I think a hand optimized assembler library would be useful. I'll have to play around with the pre-processor, see how feasible it is. While I'm moaning about stuff, I wish the GCC asm() function syntax was a bit nicer.


Some interesting results!

 

static inline void * fmemset(void *str, uint16_t c, size_t n)
{
	void *p = str;

#define ms1		"st		X+, __zero_reg__"	"\n\t"
#define ms8		ms1 ms1 ms1 ms1 ms1 ms1 ms1 ms1

	if (n > 32)
	{
		asm volatile(
			ms8 ms8 ms8 ms8
		: // no outputs
		: "x" (p)
		: // no clobbers
		);
		p += 32;
		n -= 32;
	}

	if (n > 16)
	{
		asm volatile(
			ms8 ms8
		: // no outputs
		: "x" (p)
		: // no clobbers
		);
		p += 16;
		n -= 16;
	}

	return str;
}
uint8_t buffer[1024];
fmemset(buffer, 0, 64);

 

Results in:

 

	if (n > 32)
	{
		asm volatile(
  d2:	de 01       	movw	r26, r28
  d4:	11 96       	adiw	r26, 0x01	; 1
  d6:	1d 92       	st	X+, r1
  d8:	1d 92       	st	X+, r1
...
 112:	1d 92       	st	X+, r1
 114:	1d 92       	st	X+, r1
		n -= 32;
	}

	if (n > 16)
	{
		asm volatile(
 116:	90 96       	adiw	r26, 0x20	; 32
 118:	1d 92       	st	X+, r1
 11a:	1d 92       	st	X+, r1
...
 134:	1d 92       	st	X+, r1
 136:	1d 92       	st	X+, r1

 

So... GCC doesn't seem to notice that I already incremented X, and adds another 32 to it, which produces incorrect code. Not sure if this is a bug or something I'm doing wrong.

 

In any case, not incrementing p in the C code fixes it, and you get a nice unbroken string of "st X+" instructions. So it looks like it is possible to produce optimal code in a general way with a combination of C compiler optimization and inline assembler.

 

I just worry that if this is a bug and it gets fixed in GCC, it will break this code.


Ah, wait a minute. It says in the cookbook that input operands are read-only. I assumed that meant that the C compiler would treat them that way, i.e. restoring them, but actually it seems to mean that the assembly code isn't supposed to modify them.

 

You need to add X as an output as well.
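
i.e. something like this for the fmemset fragment above (a sketch; the pointer moves to the output list so GCC knows X has been advanced, and the C-level p += 32 goes away):

	if (n > 32)
	{
		asm volatile(
			ms8 ms8 ms8 ms8
		: "+x" (p)	// X is consumed and advanced by the stores
		: // no inputs
		: "memory"
		);
		n -= 32;	// no p += 32 needed, the asm already moved it
	}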


Enjoy: https://github.com/kuro68k/fastm...

 

I am a little concerned about the need for the "memory" clobber, but I have verified that it is required and can't see any other way of doing it except for forcing the user to mark their array as volatile. In practice it doesn't seem to have too terrible consequences in my test code.


Don't worry.  They will never admit you were right, and that they were wrong.

S.

 


Quote:
ARM is Von Neumann

ARM is more complicated than that :-(

Most ARMs are von Neumann, but CM3, CM4, and CM7 have separate instruction and data buses, and are considered "modified Harvard architectures."

Even in the vN case, there is a "bus matrix" and separate memory controllers, and trying to figure out actual timing of an instruction sequence is daunting :-(

 


> I am a little concerned about the need for the "memory" clobber

 

As an alternative, you can make the effect explicit by mentioning that memory area as an "=m" output operand.  You don't need to actually use it in the asm template, but it tells GCC which chunk of memory will be changed.  IIRC the GCC manual has an example of that.
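
i.e. something along these lines, I believe (a sketch based on the manual's pointer-cast idiom, reusing the ms8 macro from the fmemset fragment above; the 256 stands for whatever the real region size is):

	asm volatile(
		ms8 ms8	/* ...the stores... */
	: "=m" (*(uint8_t (*)[256]) p),	/* the whole 256-byte region may be written */
	  "+x" (p)
	:
	);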
 



SprinterSB wrote:
> I am a little concerned about the need for the "memory" clobber

 

As an alternative, you can make the effect explicit by mentioning that memory area as an "=m" output operand.  You don't need to actually use it in the asm template, but it tells GCC which chunk of memory will be changed.  IIRC the GCC manual has an example of that.

I am unclear about what the "m" constraint actually does.

My understanding is that, as an input, "r" will cause registers to be loaded if the corresponding expression is not already in registers,

and that, as an output, "r" will cause a store to memory if the corresponding lvalue is not a register variable.

My understanding is that these will occur whether or not the "r" operand is actually used in the assembly.

I think that can be obtained from a careful reading of the documentation.

I've not been able to get similar information about "m".



SprinterSB wrote:

As an alternative, you can make the effect explicit by mentioning that memory area as an "=m" output operand.  You don't need to actually use it in the asm template, but it tells GCC which chunk of memory will be changed.  IIRC the GCC manual has an example of that.

 

Interesting... "=m" is not mentioned in the AVR inline assembler cookbook (http://www.nongnu.org/avr-libc/u...) but is in the GCC manual. 

 

I'm not sure how it would help though. You have a pointer and a size... "m" just seems to specify that the operand must be a memory reference. In my tests GCC does realize that the first element of the array may have changed. I should have checked the second element too, because a void* pointer probably behaves like an int, but in any case the 5th element is assumed by GCC to be unchanged and is not reloaded unless you specify "memory" as a clobber.