Problem with GCC inline assembly of AVR204

Go To Last Post
37 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi all!

 

I've been playing with an inline assembly implementation of the binary-to-BCD algorithm presented in AVR204.

The code works, it does not use multiplication/division/modulo thus it's short and blazing fast. I like that. :-)

It goes like this:

u32 bin2bcd16(const u16 bin) {
	u32 bcd = 0;
	u08 tmp = 0;
	u08 cnt = 16;
	asm volatile (
"            clr    r31               ;clear ZH (not needed for AT90Sxx0x)\n"
"bcd1_%=:    lsl    %A[bin]           ;shift input value\n"
"            rol    %B[bin]           ;through all bytes\n"
"            rol    %A[bcd]           \n"
"            rol    %B[bcd]           \n"
"            rol    %C[bcd]           \n"
"            dec    %[cnt]            ;decrement loop counter\n"
"            breq   bcd3_%=           ;exit if ==0\n"
"            ldi    r30, 14+1         ;Z points to result MSB + 1\n"
"bcd2_%=:    ld     %[tmp], -Z        ;get (Z) with pre-decrement\n"
"            subi   %[tmp], -3        ;add 0x03\n"
"            sbrc   %[tmp], 3         ;if bit 3 not clear\n"
"            st     Z, %[tmp]         ;    store back\n"
"            ld     %[tmp], Z         ;get (Z)\n"
"            subi   %[tmp], -0x30     ;add 0x30\n"
"            sbrc   %[tmp], 7         ;if bit 7 not clear\n"
"            st     Z, %[tmp]         ;    store back\n"
"            cpi    r30, 12           ;done all three?\n"
"            brne   bcd2_%=           ;loop again if not\n"
"            rjmp   bcd1_%=           \n"
"bcd3_%=:                             \n"
		: [bcd] "+r" (bcd)
		: [bin] "r" (bin), [cnt] "r" (cnt), [tmp] "d" (tmp)
		: "r30", "r31"
	);
	return bcd;
}

However, there's a problem: the two highlighted instructions have references to the addresses of the LSB (12) and MSB (14) of bcd. Here, these hard-linked addresses correspond to r12 (%A[bcd]) and r14 (%C[bcd]), because somehow GCC has decided to fit bcd in registers r12, r13 and r14. Of course, it could change its mind anytime and break this piece of code…

So my question is: is there a way to "dynamically" refer to these addresses from within the inline assembly (i.e. some equivalent of &bcd) ?

 

In case it makes things clearer, here is the corresponding code compiled:

	mov r12,__zero_reg__
	mov r13,__zero_reg__
	movw r14,r12
	ldi r19,0
	ldi r18,lo8(16)
/* #APP */
            clr    r31           ;clear ZH (not needed for AT90Sxx0x)
bcd1_237:   lsl    r24           ;shift input value
            rol    r25           ;through all bytes
            rol    r12
            rol    r13
            rol    r14
            dec    r18           ;decrement loop counter
            breq   bcd3_237      ;exit if ==0
            ldi    r30,14+1      ;Z points to result MSB + 1
bcd2_237:   ld     r19, -Z       ;get (Z) with pre-decrement
            subi   r19, -3       ;add 0x03
            sbrc   r19, 3        ;if bit 3 not clear
            st     Z, r19        ;    store back
            ld     r19, Z        ;get (Z)
            subi   r19, -0x30    ;add 0x30
            sbrc   r19, 7        ;if bit 7 not clear
            st     Z, r19        ;    store back
            cpi    r30, 12       ;done all three?
            brne   bcd2_237      ;loop again if not
            rjmp   bcd1_237
bcd3_237:
/* #NOAPP */

 

This topic has a solution.

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Tue. Mar 22, 2016 - 01:47 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You could just unroll the inner loop, and refer to the operands by name.  It would cost you only 12 words of flash, and it would be 11-13 cycles faster.  With 16 passes of the outer loop, that's 176-208 cycles faster overall.  And no need to clobber Z.  That can be a big deal when using this in other code.

 

You can also use __uint24 instead of u32.  Might help the optimiser is some cases.

 

Also, you should constrain [bcd] with '&'.

 

#include <avr/io.h>

__uint24 bin2bcd16(const uint16_t bin) {
    __uint24 bcd = 0;
    uint8_t tmp = 0;
    uint8_t cnt = 16;
    asm volatile (
"bcd1_%=:    lsl    %A[bin]           ;shift input value              \n\t"
"            rol    %B[bin]           ;through all bytes              \n\t"
"            rol    %A[bcd]                                           \n\t"
"            rol    %B[bcd]                                           \n\t"
"            rol    %C[bcd]                                           \n\t"
"            dec    %[cnt]            ;decrement loop counter         \n\t"
"            breq   bcd3_%=           ;exit if ==0                    \n\t"
"                                                                     \n\t"
"            mov    %[tmp], %C[bcd]   ;get MSB                        \n\t"
"            subi   %[tmp], -3        ;add 0x03                       \n\t"
"            sbrc   %[tmp], 3         ;if bit 3 not clear             \n\t"
"            mov    %C[bcd], %[tmp]   ;    store back                 \n\t"
"            mov    %[tmp], %C[bcd]   ;get MSB                        \n\t"
"            subi   %[tmp], -0x30     ;add 0x30                       \n\t"
"            sbrc   %[tmp], 7         ;if bit 7 not clear             \n\t"
"            mov    %C[bcd], %[tmp]   ;    store back                 \n\t"
"                                                                     \n\t"
"            mov    %[tmp], %B[bcd]   ;get middle byte                \n\t"
"            subi   %[tmp], -3        ;add 0x03                       \n\t"
"            sbrc   %[tmp], 3         ;if bit 3 not clear             \n\t"
"            mov    %B[bcd], %[tmp]   ;    store back                 \n\t"
"            mov    %[tmp], %B[bcd]   ;get middle byte                \n\t"
"            subi   %[tmp], -0x30     ;add 0x30                       \n\t"
"            sbrc   %[tmp], 7         ;if bit 7 not clear             \n\t"
"            mov    %B[bcd], %[tmp]   ;    store back                 \n\t"
"                                                                     \n\t"
"            mov    %[tmp], %A[bcd]   ;get LSB                        \n\t"
"            subi   %[tmp], -3        ;add 0x03                       \n\t"
"            sbrc   %[tmp], 3         ;if bit 3 not clear             \n\t"
"            mov    %A[bcd], %[tmp]   ;    store back                 \n\t"
"            mov    %[tmp], %A[bcd]   ;get LSB                        \n\t"
"            subi   %[tmp], -0x30     ;add 0x30                       \n\t"
"            sbrc   %[tmp], 7         ;if bit 7 not clear             \n\t"
"            mov    %A[bcd], %[tmp]   ;    store back                 \n\t"
"                                                                     \n\t"
"            rjmp   bcd1_%=                                           \n"
"bcd3_%=:                                                             \n\t"
        : [bcd] "+&r" (bcd)
        : [bin] "r" (bin),
          [cnt] "r" (cnt),
          [tmp] "d" (tmp)
        :
    );
    return bcd;
}

EDIT: cycle counts

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Mar 17, 2016 - 02:17 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I do not understand why one would want Z to point at location 15.

Isn't that R15?

In any case, I think that all the asm parameters should be input-output parameters.

 

Also, IIRC there is an asm constraint that specifies the X, Y *or* Z register.

I suggest using it instead of a clobber.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:

I do not understand why one would want Z to point at location 15.

Isn't that R15?

To allow the construction of the inner loop, operating on each of the three bytes of interest in [bcd] by addressing the GP registers via their SRAM mappings (ld and st).

 

The OP has already identified the brittleness of this approach:

 

netizen wrote:

GCC has decided to fit bcd in registers r12, r13 and r14. Of course, it could change its mind anytime and break this piece of code…

The question is moot with an unrolled inner loop.

 

skeeve wrote:

In any case, I think that all the asm parameters should be input-output parameters.

I don't see why?  Only [bcd] is used after the asm.  It is however important to ensure that it is constrained to prevent it's use by another input operand.

 

skeeve wrote:

Also, IIRC there is an asm constraint that specifies the X, Y *or* Z register.

I suggest using it instead of a clobber.

You can specify Z with "z" (similarly for X and Y), or any of the three with "e".  To exclude X, use "b".  In all cases, the pair can be referenced by name (i.e. X, Y, or Z) instead of by GP register name (i.e. rN) by using %a (i.e. %a0 instead of %0).

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Mar 16, 2016 - 11:34 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
I do not understand why one would want Z to point at location 15. Isn't that R15?

It is r15 indeed. That's actually the trick of the AVR204 implementation: Z points to the BCD byte being worked on (X or Y would do as well).

 

skeeve wrote:
Also, IIRC there is an asm constraint that specifies the X, Y *or* Z register. I suggest using it instead of a clobber.

I could use that "e" constraint, but then I suppose I'd need to define a pointer to assign this constraint to, wouldn't I?

Correct me if I'm wrong: as I see it this clobbering is local, and if the compiler need Z it knows it just needs to push it to the stack before calling bin2bcd16(). I could do the push/pop myself (and not declare Z as clobbered), but then it would be done every time whereas the compiler would do it only if it needs to. Or did I get that clobber thing wrong?

 

joeymorin wrote:
Also, you should constrain [bcd] with '&'.

I'm not sure I understand what this does. I read: "&: Register should be used for output only". As opposed to what, input/output? But if it's only listed as an output operand, isn't that already clear?

Also, I'm initializing bcd in C code. I was afraid that, by declaring it "output only", the compiler might optimize away this initialization. However it does not: it compiles the same with and without & (using "=&r" does optimize initialization away).

 

joeymorin wrote:
You could just unroll the inner loop, and refer to the operands by name.

Thank you for this practical suggestion (and code!). I really want to do it the AVR204 way though, that's the proper speed/size compromise for me. ^^

More importantly: this code is the context in which I've been confronted to the inline assembly problem this thread aims to solve, so I pasted it as an example. However what I'd like a solution to is the general question, not this particular binary-to-BCD instance.

 

 

It seems there is no way to get some kind of dynamic reference to where bcd was allocated to (getting the address of a register variable does not work in C either with GCC, actually). I wish someone more knowledgeable would confirm this though…

 

In the meantime, a solution would be to assign the bcd variable to a specific set of registers (thus we'd know where it is), as in:

register u24 bcd asm("...") = 0;

This is usually not guaranteed to be enforced, unless one uses this variable as an operand — which hopefully is our case:

This option does not guarantee that GCC generates code that has this variable in the register you specify at all times. You may not code an explicit reference to this register in the assembler instruction template part of an asm statement and assume it always refers to this variable. However, using the variable as an asm operand guarantees that the specified register is used for the operand.

However such a declaration does not work either because bcd is 3 bytes long, and it seems it's not possible to assign a set of registers to such a variable, as in for example:

register u24 bcd asm("r12", "r13", "r14") = 0;

If it is indeed possible to do something like that, please enlighten me!

 

Thus we need another workaround, like using 3 independent variables:

register u08 bcdL asm("r12") = 0;
register u08 bcdM asm("r13") = 0;
register u08 bcdH asm("r14") = 0;

This works fine and solves the original issue. However… :-p

However these u08 variables know need to be bundled back into one u24. I don't know any way to declare that they already are…

Something like the following does not work, because &bcdL implies getting the address of a register:

return *((u24*)&bcdL);

So the only way I know to make this work is to use something like:

return (u24)(bcdH<<16 | bcdM<<8 | bcdL);

…which makes the output assembly unnecessarily bigger (by 7 words). :-(

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Wed. Mar 16, 2016 - 12:32 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:
However such a declaration does not work either because bcd is 3 bytes long, and it seems it's not possible to assign a set of registers to such a variable

Ok, solved that with a register union:

u24 bin2bcd16(const u16 bin) {
	register union {
		u24 u24;
	} bcd asm("r12");
	bcd.u24 = 0;
	u08 tmp = 16;
	asm volatile (
"            mov    __tmp_reg__, %[tmp]  \n"
"            clr    r31                  ;clear ZH (not needed for AT90Sxx0x)\n"
"bcd1_%=:    lsl    %A[bin]              ;shift input value\n"
"            rol    %B[bin]              ;through all bytes\n"
"            rol    %A[bcd]              \n"
"            rol    %B[bcd]              \n"
"            rol    %C[bcd]              \n"
"            dec    __tmp_reg__          ;decrement loop counter\n"
"            breq   bcd3_%=              ;exit if ==0\n"
"            ldi    r30,14+1             ;Z points to result MSB + 1\n"
"bcd2_%=:    ld     %[tmp], -Z           ;get (Z) with pre-decrement\n"
"            subi   %[tmp], -0x03        ;add 0x03\n"
"            sbrc   %[tmp], 3            ;if bit 3 not clear\n"
"            st     Z, %[tmp]            ;    store back\n"
"            ld     %[tmp], Z            ;get (Z)\n"
"            subi   %[tmp], -0x30        ;add 0x30\n"
"            sbrc   %[tmp], 7            ;if bit 7 not clear\n"
"            st     Z, %[tmp]            ;    store back\n"
"            cpi    r30, 12              ;done all three?\n"
"            brne   bcd2_%=              ;loop again if not\n"
"            rjmp   bcd1_%=              \n"
"bcd3_%=:                                \n"
		: [bcd] "+&r" (bcd)
		: [bin] "r" (bin), [tmp] "d" (tmp)
		: "r30", "r31"
	);
	return bcd.u24;
}

Also, I now use __tmp_reg__ as my counter (initializing it via tmp), which saves a register.

It all compiles into:

	mov r12,__zero_reg__
	mov r13,__zero_reg__
	mov r14,__zero_reg__
        ldi r18,lo8(16)
/* #APP */
            mov    __tmp_reg__, r18
            clr    r31               ;clear ZH (not needed for AT90Sxx0x)
bcd1_236:   lsl    r24               ;shift input value
            rol    r25               ;through all bytes
            rol    r12
            rol    r13
            rol    r14
            dec    __tmp_reg__       ;decrement loop counter
            breq   bcd3_236          ;exit if ==0
            ldi    r30,14+1          ;Z points to result MSB + 1
bcd2_236:   ld     r18, -Z           ;get (Z) with pre-decrement
            subi   r18, -0x03        ;add 0x03
            sbrc   r18, 3            ;if bit 3 not clear
            st     Z, r18            ;    store back
            ld     r18, Z            ;get (Z)
            subi   r18, -0x30        ;add 0x30
            sbrc   r18, 7            ;if bit 7 not clear
            st     Z, r18            ;    store back
            cpi    r30, 12           ;done all three?
            brne   bcd2_236          ;loop again if not
            rjmp   bcd1_236
bcd3_236:
/* #NOAPP */

 

Now, to the clobber part! How do you guys think I should solve it?

 

 

EDIT: Fixed it, changing [bcd] "+&r" (bcd.u24) to [bcd] "+&r" (bcd)

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Wed. Mar 16, 2016 - 05:53 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I think the whole thing should be in a .S file.

No need to guess where avr-gcc stores variables.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well, there is no need to guess anything here: GCC stores them where you tell it to. I tried several and checked the .s temp file.

When it can't, compilation fails with:

error: register specified for ‘*r11’ isn’t suitable for data type

How would putting the whole thing in a .S file make things any better ?

 

Or did you mean what compiler use-class (like call-used registers, call-saved registers, etc) ?

I'm not quite sure which I should be selecting. In doubt, I went for call-used:

Call-used registers (r18-r27, r30-r31): May be allocated by gcc for local data. You may use them freely in assembler subroutines. Calling C subroutines can clobber any of them - the caller is responsible for saving and restoring.

 

Also, nobody answered my clobber questions, so I went for the pointer solution with a "=&e" constraint, so it can pick between X, Y and Z (FWIW it ends up picking Z).
Code now looks like this:

u24 bin2bcd16(const u16 bin) {
	register union {
		u24 u24;
	} bcd asm("r18");
	bcd.u24 = 0;
	u08 tmp = 16;
	u08 *ptr;
	asm volatile (
"            mov    __tmp_reg__, %[tmp]  \n"
"            clr    %B[ptr]              ;clear ZH (not needed for AT90Sxx0x)\n"
"bcd1_%=:    lsl    %A[bin]              ;shift input value\n"
"            rol    %B[bin]              ;through all bytes\n"
"            rol    %A[bcd]              \n"
"            rol    %B[bcd]              \n"
"            rol    %C[bcd]              \n"
"            dec    __tmp_reg__          ;decrement loop counter\n"
"            breq   bcd3_%=              ;exit if ==0\n"
"            ldi    %A[ptr], 20+1        ;Z points to result MSB + 1\n"
"bcd2_%=:    ld     %[tmp], -%a[ptr]     ;get (Z) with pre-decrement\n"
"            subi   %[tmp], -0x03        ;add 0x03\n"
"            sbrc   %[tmp], 3            ;if bit 3 not clear\n"
"            st     %a[ptr], %[tmp]      ;    store back\n"
"            ld     %[tmp], %a[ptr]      ;get (Z)\n"
"            subi   %[tmp], -0x30        ;add 0x30\n"
"            sbrc   %[tmp], 7            ;if bit 7 not clear\n"
"            st     %a[ptr], %[tmp]      ;    store back\n"
"            cpi    %A[ptr], 18          ;done all three?\n"
"            brne   bcd2_%=              ;loop again if not\n"
"            rjmp   bcd1_%=              \n"
"bcd3_%=:                                \n"
		: [bcd] "+&r" (bcd), [ptr] "=&e" (ptr)
		: [bin] "r" (bin), [tmp] "d" (tmp)
	);
	return bcd.u24;
}

 

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Wed. Mar 16, 2016 - 06:30 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I noticed it'd be easy to change the base register later on for some reason and forget to update the corresponding addresses in the inline assembly…

So it's better to make them all relative to a unique definition:

#define str(x)  _str(x)
#define _str(x)  #x

/**
 * Binary to packed BCD (AVR204 implementation).
 */
u24 bin2bcd16(const u16 bin) {
// Code needs a pointer to the BCD registers, thus this static definition:
#define REG  18
	register union {
		u24 u24;
	} bcd asm("r" str(REG));
	bcd.u24 = 0;
	u08 tmp = 16;
	u08 *ptr;
	asm volatile (
"            mov    __tmp_reg__, %[tmp]  \n"
"            clr    %B[ptr]              ;clear ZH (not needed for AT90Sxx0x)\n"
"bcd1_%=:    lsl    %A[bin]              ;shift input value\n"
"            rol    %B[bin]              ;through all bytes\n"
"            rol    %A[bcd]              \n"
"            rol    %B[bcd]              \n"
"            rol    %C[bcd]              \n"
"            dec    __tmp_reg__          ;decrement loop counter\n"
"            breq   bcd3_%=              ;exit if ==0\n"
"            ldi    %A[ptr], %[br2]+1    ;Z points to result MSB + 1\n"
"bcd2_%=:    ld     %[tmp], -%a[ptr]     ;get (Z) with pre-decrement\n"
"            subi   %[tmp], -0x03        ;add 0x03\n"
"            sbrc   %[tmp], 3            ;if bit 3 not clear\n"
"            st     %a[ptr], %[tmp]      ;    store back\n"
"            ld     %[tmp], %a[ptr]      ;get (Z)\n"
"            subi   %[tmp], -0x30        ;add 0x30\n"
"            sbrc   %[tmp], 7            ;if bit 7 not clear\n"
"            st     %a[ptr], %[tmp]      ;    store back\n"
"            cpi    %A[ptr], %[br0]      ;done all three?\n"
"            brne   bcd2_%=              ;loop again if not\n"
"            rjmp   bcd1_%=              \n"
"bcd3_%=:                                \n"
		: [bcd] "+&r" (bcd), [ptr] "=&e" (ptr)
		: [bin] "r" (bin), [tmp] "d" (tmp),
		  [br0] "I" (REG), [br2] "I" (REG+2)
	);
	return bcd.u24;
#undef REG
}

I guess that's the best I can do without dynamic reference within the inline assembly to the registers allocated by GCC.

 

 

PS: How do people get their code syntax highlighted? Mine gets it until I press "OK" and get back to the commenting box. :-/

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Wed. Mar 16, 2016 - 07:41 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

With a .S file, the intellectual effort would be less,

both for writing and for reading.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

I could use that "e" constraint, but then I suppose I'd need to define a pointer to assign this constraint to, wouldn't I?

Correct me if I'm wrong: as I see it this clobbering is local, and if the compiler need Z it knows it just needs to push it to the stack before calling bin2bcd16(). I could do the push/pop myself (and not declare Z as clobbered), but then it would be done every time whereas the compiler would do it only if it needs to. Or did I get that clobber thing wrong?

Also, nobody answered my clobber questions, so I went for the pointer solution with a "=&e" constraint, so it can pick between X, Y and Z (FWIW it ends up picking Z).

The compiler makes heavy use of X, Y, and Z.  Allowing it to select one for you instead of forcing it to relinquish Z can help the optimiser with surrounding code.  Your use of "e" is correct.

 

netizen wrote:

Thank you for this practical suggestion

Note that my cycle counts were wrong.  The results are actually somewhat faster, between 176 and 208 cycles faster, and less register pressure on the optimiser, for a cost of 12 words of flash.  Seems a good compromise.

 

netizen wrote:

More importantly: this code is the context in which I've been confronted to the inline assembly problem this thread aims to solve, so I pasted it as an example.

The GNU assembler (for AVR) will accept a GP register's integer number.  That is ldi 24, 0x55 is the same as ldi r24, 0x55.  To my knowledge however there is no inline assembler constraint, modifier, or other mechanism which will transform a register operand from its 'r'-prefixed proper name into its register number.

 

netizen wrote:

I'm not sure I understand what this does. I read: "&: Register should be used for output only". As opposed to what, input/output? But if it's only listed as an output operand, isn't that already clear?

If you haven't already, have a thorough read of these:

http://www.nongnu.org/avr-libc/user-manual/inline_asm.html

https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html

 

The compiler may choose the same registers for input and output, even if not told to do so. This is not a problem in most cases, but may be fatal if the output operator is modified by the assembler code before the input operator is used. In the situation where your code depends on different registers used for input and output operands, you must add the & constraint modifier to your output operand. The following example demonstrates this problem:

 

asm volatile("in %0,%1" "\n\t"

"out %1, %2" "\n\t"

: "=&r" (input)

: "I" (_SFR_IO_ADDR(port)), "r" (output)

);

 

In this example an input value is read from a port and then an output value is written to the same port. If the compiler would have choosen the same register for input and output, then the output value would have been destroyed on the first assembler instruction. Fortunately, this example uses the & constraint modifier to instruct the compiler not to select any register for the output value, which is used for any of the input operands.

 

netizen wrote:

In the meantime, a solution would be to assign the bcd variable to a specific set of registers (thus we'd know where it is):

Again, this forces the compiler's hand.  The point of (or one, at lease) inline assembler is to allow the compiler to make as many choices on your behalf as possible, and to give it as much information as possible, in order to allow the optimiser to make better decisions.

 

netizen wrote:

However such a declaration does not work either because bcd is 3 bytes long, and it seems it's not possible to assign a set of registers to such a variable, as in for example:

register u24 bcd asm("r12", "r13", "r14") = 0;

Specify only the first register.  Register variables are assigned to contiguous registers.  You cannot split a 16-bit variable across two separated registers i.e. r2 and r5.  Furthermore, 16-bit and wider register variables must be word-aligned.

 

No need to mess around with unions.

 

netizen wrote:

Also, I now use __tmp_reg__ as my counter (initializing it via tmp), which saves a register.

You're still using a register for tmp, and it costs a mov instruction.  If you want to use __tmp_reg__ and (possibly) save a register, use the "t" constraint (although curiously, it doesn't seem to work for me even though it should be supported at least as far back as 4.6.0)

 

netizen wrote:

I noticed it'd be easy to change the base register later on for some reason and forget to update the corresponding addresses in the inline assembly…

joeymorin wrote:

skeeve wrote:

In any case, I think that all the asm parameters should be input-output parameters.

I don't see why?

Ah yes, for those input operands which are modified by the inline assembler.  In this case, all of them (except the newly-added 'br' operand).  Use of __tmp_reg__ would also allow cnt to be an input only operand, but at the cost of a mov instruction.  Might be fine if the compiler needs the same constant elsewhere, but loading an 8-bit constant is no more expensive than moving it.

 

#include <avr/io.h>

#define str(x)  _str(x)
#define _str(x)  #x

#define REG  18

__uint24 bin2bcd16(uint16_t bin) {
    register __uint24 bcd asm("r" str(REG))= 0;
    __uint24 *ptr;
    uint8_t cnt = 16;
    uint8_t tmp;
    asm volatile (
"bcd1_%=:    lsl    %A[bin]           ;shift input value              \n\t"
"            rol    %B[bin]           ;through all bytes              \n\t"
"            rol    %A[bcd]                                           \n\t"
"            rol    %B[bcd]                                           \n\t"
"            rol    %C[bcd]                                           \n\t"
"            dec    %[cnt]            ;decrement loop counter         \n\t"
"            breq   bcd3_%=           ;exit if ==0                    \n\t"
"                                                                     \n\t"
"            ldi    %A[ptr], %
+3 ;Z points to result MSB + 1 \n\t" "bcd2_%=: ld %[tmp], -%a[ptr] ;get (Z) with pre-decrement \n\t" " subi %[tmp], -0x03 ;add 0x03 \n\t" " sbrc %[tmp], 3 ;if bit 3 not clear \n\t" " st %a[ptr], %[tmp] ; store back \n\t" " ld %[tmp], %a[ptr] ;get (Z) \n\t" " subi %[tmp], -0x30 ;add 0x30 \n\t" " sbrc %[tmp], 7 ;if bit 7 not clear \n\t" " st %a[ptr], %[tmp] ; store back \n\t" " cpi %A[ptr], %
;done all three? \n\t" " brne bcd2_%= ;loop again if not \n\t" " \n\t" " rjmp bcd1_%= \n" "bcd3_%=: \n\t" : [bcd] "+&r" (bcd), [bin] "+&r" (bin), [cnt] "+&r" (cnt), [tmp] "+&d" (tmp), [ptr] "+&e" (ptr) :
"I" (REG) ); return bcd; }

 

Curiously, ptr and tmp are initialised to zero in the generated assembly, even though they were not initialised in C, and warnings are still issued when compiling.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Mar 17, 2016 - 05:02 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I still don't like hard-coding of a register for bcd since it may not play well with the optimiser.  The GNU Assembler allows macros:

 

    asm volatile (
           ".macro ldr regr, regi                                   \n"
           "  .if \\regi == r0                                      \n"
           "    ldi \\regr, 0                                       \n"
           "  .elseif \\regi == r1                                  \n"
           "    ldi \\regr, 1                                       \n"
           "  .elseif \\regi == r2                                  \n"
           "    ldi \\regr, 2                                       \n"
           "  .elseif \\regi == r3                                  \n"
           "    ldi \\regr, 3                                       \n"
           "  .elseif \\regi == r4                                  \n"
           "    ldi \\regr, 4                                       \n"
           "  .elseif \\regi == r5                                  \n"
           "    ldi \\regr, 5                                       \n"
           "  .elseif \\regi == r6                                  \n"
           "    ldi \\regr, 6                                       \n"
           "  .elseif \\regi == r7                                  \n"
           "    ldi \\regr, 7                                       \n"
           "  .elseif \\regi == r8                                  \n"
           "    ldi \\regr, 8                                       \n"
           "  .elseif \\regi == r9                                  \n"
           "    ldi \\regr, 9                                       \n"
           "  .elseif \\regi == r10                                 \n"
           "    ldi \\regr, 10                                      \n"
           "  .elseif \\regi == r11                                 \n"
           "    ldi \\regr, 11                                      \n"
           "  .elseif \\regi == r12                                 \n"
           "    ldi \\regr, 12                                      \n"
           "  .elseif \\regi == r13                                 \n"
           "    ldi \\regr, 13                                      \n"
           "  .elseif \\regi == r14                                 \n"
           "    ldi \\regr, 14                                      \n"
           "  .elseif \\regi == r15                                 \n"
           "    ldi \\regr, 15                                      \n"
           "  .elseif \\regi == r16                                 \n"
           "    ldi \\regr, 16                                      \n"
           "  .elseif \\regi == r17                                 \n"
           "    ldi \\regr, 17                                      \n"
           "  .elseif \\regi == r18                                 \n"
           "    ldi \\regr, 18                                      \n"
           "  .elseif \\regi == r19                                 \n"
           "    ldi \\regr, 19                                      \n"
           "  .elseif \\regi == r20                                 \n"
           "    ldi \\regr, 20                                      \n"
           "  .elseif \\regi == r21                                 \n"
           "    ldi \\regr, 21                                      \n"
           "  .elseif \\regi == r22                                 \n"
           "    ldi \\regr, 22                                      \n"
           "  .elseif \\regi == r23                                 \n"
           "    ldi \\regr, 23                                      \n"
           "  .elseif \\regi == r24                                 \n"
           "    ldi \\regr, 24                                      \n"
           "  .elseif \\regi == r25                                 \n"
           "    ldi \\regr, 25                                      \n"
           "  .elseif \\regi == r26                                 \n"
           "    ldi \\regr, 26                                      \n"
           "  .elseif \\regi == r27                                 \n"
           "    ldi \\regr, 27                                      \n"
           "  .elseif \\regi == r28                                 \n"
           "    ldi \\regr, 28                                      \n"
           "  .elseif \\regi == r29                                 \n"
           "    ldi \\regr, 29                                      \n"
           "  .elseif \\regi == r30                                 \n"
           "    ldi \\regr, 30                                      \n"
           "  .elseif \\regi == r31                                 \n"
           "    ldi \\regr, 31                                      \n"
           "  .else                                                 \n"
           "    .message \"ERROR: unrecognised register name\"      \n"
           "  .endif                                                \n"
           ".endm                                                   \n"
.
.
.

This achieves the goal of transforming a register's name into it's number i.e. SRAM address.  The macro definition is a bit lengthy, but with only 32 GP registers it is not too onerous.  Since assembler macros are handled in the same way as the C preprocessor handle C macros (i.e. text substitution), it is necessary to have a macro for every instruction we wish to handle.  In this case, one for ldi (named ldr) and one for cpi (named cpr).  The use of a guard symbol permits placing these macros into their own .h, and including them anywhere they're needed without fear of redefining the assembler macros:

$ cat netizenassemblermacros.h
__asm__ (
       ".ifndef REG_MACROS_DEFINED                              \n"
       "                                                        \n"
       ".set REG_MACROS_DEFINED, 0                              \n"
       "                                                        \n"
       ".macro ldr regr, regi                                   \n"
       "  .if \\regi == r0                                      \n"
       "    ldi \\regr, 0                                       \n"
       "  .elseif \\regi == r1                                  \n"
       "    ldi \\regr, 1                                       \n"
       "  .elseif \\regi == r2                                  \n"
       "    ldi \\regr, 2                                       \n"
       "  .elseif \\regi == r3                                  \n"
       "    ldi \\regr, 3                                       \n"
       "  .elseif \\regi == r4                                  \n"
       "    ldi \\regr, 4                                       \n"
       "  .elseif \\regi == r5                                  \n"
       "    ldi \\regr, 5                                       \n"
       "  .elseif \\regi == r6                                  \n"
       "    ldi \\regr, 6                                       \n"
       "  .elseif \\regi == r7                                  \n"
       "    ldi \\regr, 7                                       \n"
       "  .elseif \\regi == r8                                  \n"
       "    ldi \\regr, 8                                       \n"
       "  .elseif \\regi == r9                                  \n"
       "    ldi \\regr, 9                                       \n"
       "  .elseif \\regi == r10                                 \n"
       "    ldi \\regr, 10                                      \n"
       "  .elseif \\regi == r11                                 \n"
       "    ldi \\regr, 11                                      \n"
       "  .elseif \\regi == r12                                 \n"
       "    ldi \\regr, 12                                      \n"
       "  .elseif \\regi == r13                                 \n"
       "    ldi \\regr, 13                                      \n"
       "  .elseif \\regi == r14                                 \n"
       "    ldi \\regr, 14                                      \n"
       "  .elseif \\regi == r15                                 \n"
       "    ldi \\regr, 15                                      \n"
       "  .elseif \\regi == r16                                 \n"
       "    ldi \\regr, 16                                      \n"
       "  .elseif \\regi == r17                                 \n"
       "    ldi \\regr, 17                                      \n"
       "  .elseif \\regi == r18                                 \n"
       "    ldi \\regr, 18                                      \n"
       "  .elseif \\regi == r19                                 \n"
       "    ldi \\regr, 19                                      \n"
       "  .elseif \\regi == r20                                 \n"
       "    ldi \\regr, 20                                      \n"
       "  .elseif \\regi == r21                                 \n"
       "    ldi \\regr, 21                                      \n"
       "  .elseif \\regi == r22                                 \n"
       "    ldi \\regr, 22                                      \n"
       "  .elseif \\regi == r23                                 \n"
       "    ldi \\regr, 23                                      \n"
       "  .elseif \\regi == r24                                 \n"
       "    ldi \\regr, 24                                      \n"
       "  .elseif \\regi == r25                                 \n"
       "    ldi \\regr, 25                                      \n"
       "  .elseif \\regi == r26                                 \n"
       "    ldi \\regr, 26                                      \n"
       "  .elseif \\regi == r27                                 \n"
       "    ldi \\regr, 27                                      \n"
       "  .elseif \\regi == r28                                 \n"
       "    ldi \\regr, 28                                      \n"
       "  .elseif \\regi == r29                                 \n"
       "    ldi \\regr, 29                                      \n"
       "  .elseif \\regi == r30                                 \n"
       "    ldi \\regr, 30                                      \n"
       "  .elseif \\regi == r31                                 \n"
       "    ldi \\regr, 31                                      \n"
       "  .else                                                 \n"
       "    .message \"ERROR: unrecognised register name\"      \n"
       "  .endif                                                \n"
       ".endm                                                   \n"
       "                                                        \n"
       ".macro cpr regr, regi                                   \n"
       "  .if \\regi == r0                                      \n"
       "    cpi \\regr, 0                                       \n"
       "  .elseif \\regi == r1                                  \n"
       "    cpi \\regr, 1                                       \n"
       "  .elseif \\regi == r2                                  \n"
       "    cpi \\regr, 2                                       \n"
       "  .elseif \\regi == r3                                  \n"
       "    cpi \\regr, 3                                       \n"
       "  .elseif \\regi == r4                                  \n"
       "    cpi \\regr, 4                                       \n"
       "  .elseif \\regi == r5                                  \n"
       "    cpi \\regr, 5                                       \n"
       "  .elseif \\regi == r6                                  \n"
       "    cpi \\regr, 6                                       \n"
       "  .elseif \\regi == r7                                  \n"
       "    cpi \\regr, 7                                       \n"
       "  .elseif \\regi == r8                                  \n"
       "    cpi \\regr, 8                                       \n"
       "  .elseif \\regi == r9                                  \n"
       "    cpi \\regr, 9                                       \n"
       "  .elseif \\regi == r10                                 \n"
       "    cpi \\regr, 10                                      \n"
       "  .elseif \\regi == r11                                 \n"
       "    cpi \\regr, 11                                      \n"
       "  .elseif \\regi == r12                                 \n"
       "    cpi \\regr, 12                                      \n"
       "  .elseif \\regi == r13                                 \n"
       "    cpi \\regr, 13                                      \n"
       "  .elseif \\regi == r14                                 \n"
       "    cpi \\regr, 14                                      \n"
       "  .elseif \\regi == r15                                 \n"
       "    cpi \\regr, 15                                      \n"
       "  .elseif \\regi == r16                                 \n"
       "    cpi \\regr, 16                                      \n"
       "  .elseif \\regi == r17                                 \n"
       "    cpi \\regr, 17                                      \n"
       "  .elseif \\regi == r18                                 \n"
       "    cpi \\regr, 18                                      \n"
       "  .elseif \\regi == r19                                 \n"
       "    cpi \\regr, 19                                      \n"
       "  .elseif \\regi == r20                                 \n"
       "    cpi \\regr, 20                                      \n"
       "  .elseif \\regi == r21                                 \n"
       "    cpi \\regr, 21                                      \n"
       "  .elseif \\regi == r22                                 \n"
       "    cpi \\regr, 22                                      \n"
       "  .elseif \\regi == r23                                 \n"
       "    cpi \\regr, 23                                      \n"
       "  .elseif \\regi == r24                                 \n"
       "    cpi \\regr, 24                                      \n"
       "  .elseif \\regi == r25                                 \n"
       "    cpi \\regr, 25                                      \n"
       "  .elseif \\regi == r26                                 \n"
       "    cpi \\regr, 26                                      \n"
       "  .elseif \\regi == r27                                 \n"
       "    cpi \\regr, 27                                      \n"
       "  .elseif \\regi == r28                                 \n"
       "    cpi \\regr, 28                                      \n"
       "  .elseif \\regi == r29                                 \n"
       "    cpi \\regr, 29                                      \n"
       "  .elseif \\regi == r30                                 \n"
       "    cpi \\regr, 30                                      \n"
       "  .elseif \\regi == r31                                 \n"
       "    cpi \\regr, 31                                      \n"
       "  .else                                                 \n"
       "    .message \"ERROR: unrecognised register name\"      \n"
       "  .endif                                                \n"
       ".endm                                                   \n"
       "                                                        \n"
       ".endif                                                  \n"
);

I have several assembler macros which I use in this way.  My collection has just grown by two ;-)

$ cat netizen3.c
#include <avr/io.h>
#include "netizenassemblermacros.h"

__uint24 bin2bcd16(uint16_t bin) {

  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t cnt = 16;
  uint8_t tmp;

  __asm__ __volatile__ (
    "bcd1_%=:    lsl    %A[bin]           ;shift input value              \n\t"
    "            rol    %B[bin]           ;through all bytes              \n\t"
    "            rol    %A[bcd]                                           \n\t"
    "            rol    %B[bcd]                                           \n\t"
    "            rol    %C[bcd]                                           \n\t"
    "            dec    %[cnt]            ;decrement loop counter         \n\t"
    "            breq   bcd3_%=           ;exit if ==0                    \n\t"
    "                                                                     \n\t"
    "            ldr    %A[ptr], %D[bcd]  ;Z points to result MSB + 1     \n\t"
    "bcd2_%=:    ld     %[tmp], -%a[ptr]  ;get (Z) with pre-decrement     \n\t"
    "            subi   %[tmp], -0x03     ;add 0x03                       \n\t"
    "            sbrc   %[tmp], 3         ;if bit 3 not clear             \n\t"
    "            st     %a[ptr], %[tmp]   ;    store back                 \n\t"
    "            ld     %[tmp], %a[ptr]   ;get (Z)                        \n\t"
    "            subi   %[tmp], -0x30     ;add 0x30                       \n\t"
    "            sbrc   %[tmp], 7         ;if bit 7 not clear             \n\t"
    "            st     %a[ptr], %[tmp]   ;    store back                 \n\t"
    "            cpr    %A[ptr], %A[bcd]  ;done all three?                \n\t"
    "            brne   bcd2_%=           ;loop again if not              \n\t"
    "                                                                     \n\t"
    "            rjmp   bcd1_%=                                           \n"
    "bcd3_%=:                                                             \n\t"
    : [bcd] "+&r" (bcd),
      [bin] "+&r" (bin),
      [cnt] "+&r" (cnt),
      [tmp] "+&d" (tmp),
      [ptr] "+&e" (ptr)
    :
  );

  return bcd;

}
$ avr-gcc -Wall -save-temps -c -g -O1 -mmcu=atmega328p netizen3.c -o netizen3.elf
netizen3.c: In function 'bin2bcd16':
netizen3.c:11:3: warning: 'tmp' is used uninitialized in this function [-Wuninitialized]
   __asm__ __volatile__ (
   ^
netizen3.c:11:3: warning: 'ptr' is used uninitialized in this function [-Wuninitialized]
$ cat netizen3.s
	.file	"netizen3.c"
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__SREG__ = 0x3f
__tmp_reg__ = 0
__zero_reg__ = 1
	.text
.Ltext0:
	.cfi_sections	.debug_frame
/* #APP */
	.ifndef REG_MACROS_DEFINED                               

.set REG_MACROS_DEFINED, 0                               

.macro ldr regr, regi
  .if \regi == r0
    ldi \regr, 0
  .elseif \regi == r1
    ldi \regr, 1
  .elseif \regi == r2
    ldi \regr, 2
  .elseif \regi == r3
    ldi \regr, 3
  .elseif \regi == r4
    ldi \regr, 4
  .elseif \regi == r5
    ldi \regr, 5
  .elseif \regi == r6
    ldi \regr, 6
  .elseif \regi == r7
    ldi \regr, 7
  .elseif \regi == r8
    ldi \regr, 8
  .elseif \regi == r9
    ldi \regr, 9
  .elseif \regi == r10
    ldi \regr, 10
  .elseif \regi == r11
    ldi \regr, 11
  .elseif \regi == r12
    ldi \regr, 12
  .elseif \regi == r13
    ldi \regr, 13
  .elseif \regi == r14
    ldi \regr, 14
  .elseif \regi == r15
    ldi \regr, 15
  .elseif \regi == r16
    ldi \regr, 16
  .elseif \regi == r17
    ldi \regr, 17
  .elseif \regi == r18
    ldi \regr, 18
  .elseif \regi == r19
    ldi \regr, 19
  .elseif \regi == r20
    ldi \regr, 20
  .elseif \regi == r21
    ldi \regr, 21
  .elseif \regi == r22
    ldi \regr, 22
  .elseif \regi == r23
    ldi \regr, 23
  .elseif \regi == r24
    ldi \regr, 24
  .elseif \regi == r25
    ldi \regr, 25
  .elseif \regi == r26
    ldi \regr, 26
  .elseif \regi == r27
    ldi \regr, 27
  .elseif \regi == r28
    ldi \regr, 28
  .elseif \regi == r29
    ldi \regr, 29
  .elseif \regi == r30
    ldi \regr, 30
  .elseif \regi == r31
    ldi \regr, 31
  .else
    .message "ERROR: unrecognised register name"
  .endif
.endm                                                   

.macro cpr regr, regi
  .if \regi == r0
    cpi \regr, 0
  .elseif \regi == r1
    cpi \regr, 1
  .elseif \regi == r2
    cpi \regr, 2
  .elseif \regi == r3
    cpi \regr, 3
  .elseif \regi == r4
    cpi \regr, 4
  .elseif \regi == r5
    cpi \regr, 5
  .elseif \regi == r6
    cpi \regr, 6
  .elseif \regi == r7
    cpi \regr, 7
  .elseif \regi == r8
    cpi \regr, 8
  .elseif \regi == r9
    cpi \regr, 9
  .elseif \regi == r10
    cpi \regr, 10
  .elseif \regi == r11
    cpi \regr, 11
  .elseif \regi == r12
    cpi \regr, 12
  .elseif \regi == r13
    cpi \regr, 13
  .elseif \regi == r14
    cpi \regr, 14
  .elseif \regi == r15
    cpi \regr, 15
  .elseif \regi == r16
    cpi \regr, 16
  .elseif \regi == r17
    cpi \regr, 17
  .elseif \regi == r18
    cpi \regr, 18
  .elseif \regi == r19
    cpi \regr, 19
  .elseif \regi == r20
    cpi \regr, 20
  .elseif \regi == r21
    cpi \regr, 21
  .elseif \regi == r22
    cpi \regr, 22
  .elseif \regi == r23
    cpi \regr, 23
  .elseif \regi == r24
    cpi \regr, 24
  .elseif \regi == r25
    cpi \regr, 25
  .elseif \regi == r26
    cpi \regr, 26
  .elseif \regi == r27
    cpi \regr, 27
  .elseif \regi == r28
    cpi \regr, 28
  .elseif \regi == r29
    cpi \regr, 29
  .elseif \regi == r30
    cpi \regr, 30
  .elseif \regi == r31
    cpi \regr, 31
  .else
    .message "ERROR: unrecognised register name"
  .endif
.endm                                                   

.endif                                                  

/* #NOAPP */
.global	bin2bcd16
	.type	bin2bcd16, @function
bin2bcd16:
.LFB0:
	.file 1 "netizen3.c"
	.loc 1 4 0
	.cfi_startproc
.LVL0:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	.loc 1 11 0
	ldi r18,0
	ldi r19,0
	ldi r20,0
	ldi r21,lo8(16)
	ldi r22,0
	ldi r30,0
	ldi r31,0
/* #APP */
 ;  11 "netizen3.c" 1
	bcd1_13:    lsl    r24           ;shift input value
	            rol    r25           ;through all bytes
	            rol    r18
	            rol    r19
	            rol    r20
	            dec    r21            ;decrement loop counter
	            breq   bcd3_13           ;exit if ==0                    

	            ldr    r30, r21  ;Z points to result MSB + 1     
	bcd2_13:    ld     r22, -Z  ;get (Z) with pre-decrement
	            subi   r22, -0x03     ;add 0x03
	            sbrc   r22, 3         ;if bit 3 not clear
	            st     Z, r22   ;    store back
	            ld     r22, Z   ;get (Z)
	            subi   r22, -0x30     ;add 0x30
	            sbrc   r22, 7         ;if bit 7 not clear
	            st     Z, r22   ;    store back
	            cpr    r30, r18  ;done all three?                
	            brne   bcd2_13           ;loop again if not              

	            rjmp   bcd1_13
bcd3_13:                                                             

 ;  0 "" 2
.LVL1:
	.loc 1 44 0
/* #NOAPP */
	mov r24,r20
	movw r22,r18
	ret
	.cfi_endproc
.
.
.
$ avr-objdump -S netizen3.elf

netizen3.elf:     file format elf32-avr

Disassembly of section .text:

00000000 <bin2bcd16>:
  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t cnt = 16;
  uint8_t tmp;

  __asm__ __volatile__ (
   0:	20 e0       	ldi	r18, 0x00	; 0
   2:	30 e0       	ldi	r19, 0x00	; 0
   4:	40 e0       	ldi	r20, 0x00	; 0
   6:	50 e1       	ldi	r21, 0x10	; 16
   8:	60 e0       	ldi	r22, 0x00	; 0
   a:	e0 e0       	ldi	r30, 0x00	; 0
   c:	f0 e0       	ldi	r31, 0x00	; 0

0000000e <bcd1_13>:
   e:	88 0f       	add	r24, r24
  10:	99 1f       	adc	r25, r25
  12:	22 1f       	adc	r18, r18
  14:	33 1f       	adc	r19, r19
  16:	44 1f       	adc	r20, r20
  18:	5a 95       	dec	r21
  1a:	01 f0       	breq	.+0      	; 0x1c <bcd1_13+0xe>
  1c:	e5 e1       	ldi	r30, 0x15	; 21

0000001e <bcd2_13>:
  1e:	62 91       	ld	r22, -Z
  20:	6d 5f       	subi	r22, 0xFD	; 253
  22:	63 fd       	sbrc	r22, 3
  24:	60 83       	st	Z, r22
  26:	60 81       	ld	r22, Z
  28:	60 5d       	subi	r22, 0xD0	; 208
  2a:	67 fd       	sbrc	r22, 7
  2c:	60 83       	st	Z, r22
  2e:	e2 31       	cpi	r30, 0x12	; 18
  30:	01 f4       	brne	.+0      	; 0x32 <bcd2_13+0x14>
  32:	00 c0       	rjmp	.+0      	; 0x34 <bcd3_13>

00000034 <bcd3_13>:
    :
  );

  return bcd;

}
  34:	84 2f       	mov	r24, r20
  36:	b9 01       	movw	r22, r18
  38:	08 95       	ret

This may seem excessive to some (even me), but it does achieve the objective outlined in the OP, without tying the hands of the optimiser.  The one remaining mystery is why ptr and tmp are initialised in the generated code (orange) despite the absence of an initialisation in the C code.

 

I still prefer the unrolled loop ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Mar 17, 2016 - 05:00 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Woow, thanks a lot, Joey! I believe that's the most thorough answer I ever got, all topics and forums/mailing lists/etc included. Ever.

My English skills do not allow me to properly express how grateful I am, but I sincerely hope you'll still be able to get a feel of it. :-)

And I'm confident it will be helpful to others in the future if not as much as it has been for me. Thanks again!

I'm not going to thank you for each and every hint (that would be too cumbersome for other readers), but please assume I do when I don't mention it explicitly.

 

joeymorin wrote:
Note that my cycle counts were wrong. The results are actually somewhat faster, between 176 and 208 cycles faster, and less register pressure on the optimiser, for a cost of 12 words of flash. Seems a good compromise.

It sure is. Best implementation if speed is paramount, while the original code is optimized for size. Two implementations for two use cases, one algo to rule them all… ^^

 

joeymorin wrote:
Again, this forces the compiler's hand. The point of (or one, at lease) inline assembler is to allow the compiler to make as many choices on your behalf as possible, and to give it as much information as possible, in order to allow the optimiser to make better decisions.

I understand and agree with the objective, yet I'm wondering whether we're not worrying too much about it. If this code was in a separate .S file, wouldn't all the registers it uses be clobbered?

 

joeymorin wrote:
netizen wrote:
it seems it's not possible to assign a set of registers to such a variable, as in for example: register u24 bcd asm("r12", "r13", "r14") = 0;

Specify only the first register. Register variables are assigned to contiguous registers. (...)

No need to mess around with unions.

I swear I tried that first! It didn't work: the compiler would (silently) fail to respect my wishes and allocate it to the registers it fancied. Yet I tried it again and it does indeed work. Not sure how I managed to mess that up the first time…

Thanks! It's much cleaner that way.

 

joeymorin wrote:
netizen wrote:
Also, I now use __tmp_reg__ as my counter (initializing it via tmp), which saves a register.

You're still using a register for tmp, and it costs a mov instruction. If you want to use __tmp_reg__ and (possibly) save a register, use the "t" constraint (although curiously, it doesn't seem to work for me even though it should be supported at least as far back as 4.6.0)

Just in case: the first code used one GP register for tmp (r19) and one for the counter (r18). Using __tmp_reg__ instead saved one of them.

I first tried to use __tmp_reg__ for tmp, but IIRC it failed at ld __tmp_reg_, -Z because it needs a destination register>15. So I used __tmp_reg__ as a counter instead.

I also tried to initialize it directly, but IIRC it failed at ldi __tmp_reg__, 0x10 (or similar) because it also needs a dreg>15. So I used tmp as a middle-man (code n°2).

Perhaps that's the reason why the "t" constraint did not work for you?

 

EDIT: I remembered wrong, ld is fine, the problem with tmp was  subi __tmp_reg__, -0x03 (for the reasons mentioned above).

 

joeymorin wrote:
skeeve wrote:
In any case, I think that all the asm parameters should be input-output parameters.

Ah yes, for those input operands which are modified by the inline assembler. In this case, all of them (except the newly-added 'br' operand).

As soon as I think I finally understand that input/output thing I read something like that and get confused. :-/
Why would the compiler care about what is done with an input in (inline assembly) code it's not responsible for? Is that because it could try to reuse the same register later on assuming its content has not changed?

 

joeymorin wrote:
Curiously, ptr and tmp are initialised to zero in the generated assembly, even though they were not initialised in C, and warnings are still issued when compiling.

Indeed. Furthermore, neither tmp nor ptr_L need initialization (because they're both written to before they're read from anyway). The end result is one word larger than the last code I posted. Also this code uses one more GP register (for cnt) as it does not use __tmp_reg__ (I don't count r0 as one because of its special role for the C compiler,  i.e. using r0 is free because it does not clobber anything).

I used the "=&e" constraint for ptr to make sure it wouldn't be automatically initialized (and then clr %B[ptr] to initialize its high byte). That's where that extra word comes from. If I'm not mistaken, you're carrying that extra word all along the next code examples.

I also notice you're using asm volatile (as I was earlier). But we're not touching anything outside of the registers that have (hopefully) been properly declared. Why would we need volatile?

 

joeymorin wrote:
I still don't like hard-coding of a register for bcd since it may not play well with the optimiser. The GNU Assembler allows macros:

I understand and agree, event though as mentioned earlier I believe we're being perfectionists here. This being said, I like perfection!

Having conditions within macros really is handy! I wish the C preprocessor would allow that too.

 

joeymorin wrote:
I have several assembler macros which I use in this way. My collection has just grown by two ;-)

I'm glad to see you've at least taken something out of all this. :-]

I need to test all this but I like it: everything I wanted from the beginning!

Thanks again, Joey. :-)

 

EDIT:

joeymorin wrote:
The one remaining mystery is why ptr and tmp are initialised in the generated code (orange) despite the absence of an initialisation in the C code.

As hinted above, I suppose that is because they're not constrained as "output only": if they are used as input they need to be initialized to prevent random results. It seems the compiler automatically does it for you instead of just giving a warning/error. This way you can use any (input or input/output) variable in an inline assembly block without worrying about C initialization.

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Thu. Mar 17, 2016 - 03:37 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I didn't notice it at first, but:

joeymorin wrote:

0000001e <bcd2_13>:
  1e:	62 91       	ld	r22, -Z
  20:	6d 5f       	subi	r22, 0xFD	; 253
  22:	63 fd       	sbrc	r22, 3
  24:	60 83       	st	Z, r22
  26:	60 81       	ld	r22, Z
  28:	60 5d       	subi	r22, 0xD0	; 208
  2a:	67 fd       	sbrc	r22, 7
  2c:	60 83       	st	Z, r22
  2e:	e2 31       	cpi	r30, 0x12	; 18
  30:	01 f4       	brne	.+0      	; 0x32 <bcd2_13+0x14>
  32:	00 c0       	rjmp	.+0      	; 0x34 <bcd3_13>

00000034 <bcd3_13>:

Talk about optimization, GCC! :-D

ɴᴇᴛɪᴢᴇᴎ

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
; __uint24_t bin2bcd16(u16 bin) ;

; low byte of 3-byte return value
#define Rbcd0 22
#define Rbcd1 (Rbcd0+1)
#define Rbcd2 (Rbcd1+1)

#define Rbin0  18
#define Rbin1  19

#define Rcnt 20
#define Rtmp 25

  movw Rbcd0, R24   ; argument in R25:24

  ldi R31, 0
  ldi Rbcd0, 0
  ldi Rbcd1, 0
  ldi Rbcd3, 0
  ldi Rtmp, 0

  ldi Rcnt, 16
1:
  lsl Rbin0  ; shift input value
  rol Rbin1  ; through all bytes
  rol Rbcd0
  rol Rbcd1
  rol Rbcd2
  dec Rcnt   ; decrement loop counter
  breq  3f   ; exit if ==0
  ldi r30, Rbcd2+1  ; Z points to result MSB + 1
2:
  ld     Rtmp, -Z        ; get (Z) with pre-decrement
  subi   Rtmp, -3        ; add 0x03
  sbrc   Rtmp, 3         ; if bit 3 not clear
  st     Z, Rtmp  ; store back
  ld     Rtmp, Z         ; get (Z)
  subi   Rtmp, -0x30     ; add 0x30
  sbrc   Rtmp, 7         ; if bit 7 not clear
  st     Z, Rtmp         ;    store back
  cpi    r30, Rbcd0      ; done all three?
  brne   2b              ; loop again if not
  rjmp   1b
3:
  ret

 

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

For my money Michael wins if only in the readability/maintainability stakes (though I'm sure it's a work of genius too! :)

 

Even an AVR Asm beginner like me could read that and instantly see what is going on. That is never going to be true of code written using inline Asm syntax!

 

(then again I guess it depends how highly you rate code maintainability ;)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

I understand and agree with the objective, yet I'm wondering whether we're not worrying too much about it. If this code was in a separate .S file, wouldn't all the registers it uses be clobbered?

Indeed they would.  Even as a C function entirely written using inline assembler as we have done, this is true.  However, there are a couple of reasons to 'do the best that we can' when writing inline assembler, and use the facilities present.

 

One is that you don't know what the future holds.  You or someone else may revisit this code later and make changes.  Were the function to grow significantly, the trouble we put ourselves through now may help out.  It also means that changes to the compiler in the future w.r.t register allocation won't cause our code to break.  Using a hard-coded register range via #define REG might.  All of this is 'future-proofing', both against changes to the toolchain, and against changes to your own code.

 

Secondly, and this is more relevant I think, is if you decide to make this function inline.  A function so used loses its call/return and epilogue/prologue overhead, and the code contained therein is fully accessible to the optimiser (save for the restrictions imposed by volatile... see later in this post).  In this situation you want to provide as much information to the optimiser as is possible.

 

As for 'not worrying too much about it'... well, I would have stopped at the unrolled loop ;-) ... but your goal was to find a mechanism to extract a register's address.  My goal was to do that without hindering the toolchain.

 

netizen wrote:

I swear I tried that first! It didn't work: the compiler would (silently) fail to respect my wishes and allocate it to the registers it fancied. Yet I tried it again and it does indeed work. Not sure how I managed to mess that up the first time…

Probably this:

 

joeymorin wrote:

16-bit and wider register variables must be word-aligned.

If you tried to use an odd-numbered register, you would have got this:

error: register specified for ‘*r11’ isn’t suitable for data type

... just like you noted in post #8.

 

netizen wrote:

I first tried to use __tmp_reg__ for tmp, but IIRC it failed at ld __tmp_reg_, -Z because it needs a destination register>15. So I used __tmp_reg__ as a counter instead.

I also tried to initialize it directly, but IIRC it failed at ldi __tmp_reg__, 0x10 (or similar) because it also needs a dreg>15. So I used tmp as a middle-man (code n°2).

Oh, yes, I hadn't caught that you'd eliminated cnt, and used tmp instead.  That is what I was trying to achieve with the "t" constraint, but couldn't get it to work as advertised.  I'm probably missing some key piece of information about its limitations.

 

netizen wrote:

Perhaps that's the reason why the "t" constraint did not work for you?

No, that should be the point of using "t".  The compiler would arranges for __tmp_reg__ to contain the given value, whatever the mechanism.  If it can mov it from another register it knows already contains that value, it's a win.  Otherwise it would ldi that value into a high register first without the need for us to do it ourselves.  It could do so with a register which it will subsequently reuse as another input and/or output operand.  As I say, I'm possibly missing some other detail w.r.t. restrictions on the use of "t".

 

netizen wrote:

Why would the compiler care about what is done with an input in (inline assembly) code it's not responsible for? Is that because it could try to reuse the same register later on assuming its content has not changed?

Precisely.  If you change a register associated with an input operand, you >>must<< make it an output operand as well so the compiler knows it can no longer rely upon previous knowledge of its contents.  I expect that is what @skeeve was pointing out earlier.

 

netizen wrote:

neither tmp nor ptr_L need initialization (because they're both written to before they're read from anyway).

The compiler can't know that.  It does not see the opcodes within the asm statement (that's what 'volatile' means in this context.  Again, see later in this post.).  That why, if the compiler is to know what it can and can't do, we must tell it through the use of appropriate constraints. 

 

netizen wrote:

I used the "=&e" constraint for ptr to make sure it wouldn't be automatically initialized (and then clr %B[ptr] to initialize its high byte). That's where that extra word comes from. If I'm not mistaken, you're carrying that extra word all along the next code examples.

Ah yes.  I'd never noticed that specifying an operand as an input operand (or input/output) would invariably result in its initialisation (to zero, apparently) even in the absence of an initial C value for the associated variable.

 

I could also use the '=' modifier for tmp in my above code, but with the elimination of cnt, tmp must carry 16 into the asm statement.

 

netizen wrote:

It seems the compiler automatically does it for you instead of just giving a warning/error. This way you can use any (input or input/output) variable in an inline assembly block without worrying about initialization.

That behaviour is contrary to the warning issued by the compiler, which is what troubles me.  If the compiler identifies an uninitialised variable and warns me about it, it should not then proceed to initialise it on my behalf.  Outside of inline asm, it never would.  Ultimately, though, the real fault lay with my use of the '+' modifier to make an input/output operand out of what should have been an output-only operand.  With '=', the warnings and the needless initialisation disappear.

 

netizen wrote:

Also this code uses one more GP register (for cnt) as it does not use __tmp_reg__

Easily fixed by eliminating cnt and using tmp to carry 16 into the asm as you have done.

 

netizen wrote:

 

I also notice you're using asm volatile (as I was earlier). But we're not touching anything outside of the registers that have (hopefully) been properly declared. Why would we need volatile?

 

You haven't completed all of your reading ;-)  :

... comments have been added by the compiler to inform the assembler that the included code was not generated by the compilation of C statements, but by inline assembler statements. The compiler selected register r24 for storage of the value read from PORTD. The compiler could have selected any other register, though. It may not explicitely load or store the value and it may even decide not to include your assembler code at all. All these decisions are part of the compiler's optimization strategy. For example, if you never use the variable value in the remaining part of the C program, the compiler will most likely remove your code unless you switched off optimization. To avoid this, you can add the volatile attribute to the asm statement:

6.44.2 Extended Asm - Assembler Instructions with C Expression Operands

 

With extended asm you can read and write C variables from assembler and perform jumps from assembler code to C labels. Extended asm syntax uses colons (‘:’) to delimit the operand parameters after the assembler template:

     asm [volatile] ( AssemblerTemplate
                      : OutputOperands
                      [ : InputOperands
                      [ : Clobbers ] ])

     asm [volatile] goto ( AssemblerTemplate
                           :
                           : InputOperands
                           : Clobbers
                           : GotoLabels)

The asm keyword is a GNU extension. When writing code that can be compiled with -ansi and the various -std options, use __asm__ instead of asm (see Alternate Keywords).

 

Qualifiers

 

volatile

The typical use of extended asm statements is to manipulate input values to produce output values. However, your asm statements may also produce side effects. If so, you may need to use the volatile qualifier to disable certain optimizations. See Volatile.

6.44.2.1 Volatile

 

GCC's optimizers sometimes discard asm statements if they determine there is no need for the output variables. Also, the optimizers may move code out of loops if they believe that the code will always return the same result (i.e. none of its input values change between calls). Using the volatile qualifier disables these optimizations. asm statements that have no output operands, including asm goto statements, are implicitly volatile.

 

This i386 code demonstrates a case that does not use (or require) the volatile qualifier. If it is performing assertion checking, this code uses asm to perform the validation. Otherwise, dwRes is unreferenced by any code. As a result, the optimizers can discard the asm statement, which in turn removes the need for the entire DoCheck routine. By omitting the volatile qualifier when it isn't needed you allow the optimizers to produce the most efficient code possible.

     void DoCheck(uint32_t dwSomeValue)
     {
        uint32_t dwRes;

        // Assumes dwSomeValue is not zero.
        asm ("bsfl %1,%0"
          : "=r" (dwRes)
          : "r" (dwSomeValue)
          : "cc");

        assert(dwRes > 3);
     }

The next example shows a case where the optimizers can recognize that the input (dwSomeValue) never changes during the execution of the function and can therefore move the asm outside the loop to produce more efficient code. Again, using volatile disables this type of optimization.

     void do_print(uint32_t dwSomeValue)
     {
        uint32_t dwRes;

        for (uint32_t x=0; x < 5; x++)
        {
           // Assumes dwSomeValue is not zero.
           asm ("bsfl %1,%0"
             : "=r" (dwRes)
             : "r" (dwSomeValue)
             : "cc");

           printf("%u: %u %u\n", x, dwSomeValue, dwRes);
        }
     }

The following example demonstrates a case where you need to use the volatile qualifier. It uses the x86 rdtsc instruction, which reads the computer's time-stamp counter. Without the volatile qualifier, the optimizers might assume that the asm block will always return the same value and therefore optimize away the second call.

     uint64_t msr;

     asm volatile ( "rdtsc\n\t"    // Returns the time in EDX:EAX.
             "shl $32, %%rdx\n\t"  // Shift the upper bits left.
             "or %%rdx, %0"        // 'Or' in the lower bits.
             : "=a" (msr)
             :
             : "rdx");

     printf("msr: %llx\n", msr);

     // Do other work...

     // Reprint the timestamp
     asm volatile ( "rdtsc\n\t"    // Returns the time in EDX:EAX.
             "shl $32, %%rdx\n\t"  // Shift the upper bits left.
             "or %%rdx, %0"        // 'Or' in the lower bits.
             : "=a" (msr)
             :
             : "rdx");

     printf("msr: %llx\n", msr);

GCC's optimizers do not treat this code like the non-volatile code in the earlier examples. They do not move it out of loops or omit it on the assumption that the result from a previous call is still valid.

 

Note that the compiler can move even volatile asm instructions relative to other code, including across jump instructions. For example, on many targets there is a system register that controls the rounding mode of floating-point operations. Setting it with a volatile asm, as in the following PowerPC example, does not work reliably.

     asm volatile("mtfsf 255, %0" : : "f" (fpenv));
     sum = x + y;

The compiler may move the addition back before the volatile asm. To make it work as expected, add an artificial dependency to the asm by referencing a variable in the subsequent code, for example:

     asm volatile ("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
     sum = x + y;

Under certain circumstances, GCC may duplicate (or remove duplicates of) your assembly code when optimizing. This can lead to unexpected duplicate symbol errors during compilation if your asm code defines symbols or labels. Using ‘%=’ (see AssemblerTemplate) may help resolve this problem.

In short, there are very few circumstances where the use of volatile is unwarranted or incorrect.  One of them is in the header file I proposed above (netizenassemblermacros.h), where two things are different.  One is that Basic Asm is used (no operands) instead of Extended Asm, the other is that the asm statement is at file scope rather than function scope.  At file scope, only Basic Asm is permitted, and the volatile keyword cannot be used (even though all Basic Asm statements are implicitly volatile i.e. the compiler will never optimise them in any way).

 

$ cat netizen3.c
#include <avr/io.h>
#include "netizenassemblermacros.h"

inline __attribute__ ((__always_inline__)) __uint24 bin2bcd16(uint16_t bin) {

  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t tmp = 16;

  __asm__ __volatile__ (
    "            mov    __tmp_reg__, %[tmp]                               \n\t"
    "            clr    %B[ptr]           ;clear (XYZ)H                   \n"
    "bcd1_%=:    lsl    %A[bin]           ;shift input value              \n\t"
    "            rol    %B[bin]           ;through all bytes              \n\t"
    "            rol    %A[bcd]                                           \n\t"
    "            rol    %B[bcd]                                           \n\t"
    "            rol    %C[bcd]                                           \n\t"
    "            dec    __tmp_reg__       ;decrement loop counter         \n\t"
    "            breq   bcd3_%=           ;exit if ==0                    \n\t"
    "                                                                     \n\t"
    "            ldr    %A[ptr], %D[bcd]  ;XYZ points to result MSB + 1   \n\t"
    "bcd2_%=:    ld     %[tmp], -%a[ptr]  ;get (XYZ) with pre-decrement   \n\t"
    "            subi   %[tmp], -0x03     ;add 0x03                       \n\t"
    "            sbrc   %[tmp], 3         ;if bit 3 not clear             \n\t"
    "            st     %a[ptr], %[tmp]   ;    store back                 \n\t"
    "            ld     %[tmp], %a[ptr]   ;get (XYZ)                      \n\t"
    "            subi   %[tmp], -0x30     ;add 0x30                       \n\t"
    "            sbrc   %[tmp], 7         ;if bit 7 not clear             \n\t"
    "            st     %a[ptr], %[tmp]   ;    store back                 \n\t"
    "            cpr    %A[ptr], %A[bcd]  ;done all three?                \n\t"
    "            brne   bcd2_%=           ;loop again if not              \n\t"
    "                                                                     \n\t"
    "            rjmp   bcd1_%=                                           \n"
    "bcd3_%=:                                                             \n\t"
    : [bcd] "+&r" (bcd),
      [bin] "+&r" (bin),
      [tmp] "+&d" (tmp),
      [ptr] "=&e" (ptr)
    :
  );

  return bcd;

}

int __attribute__ ((__OS_main__)) main(void) {

  TCNT1 = bin2bcd16(TCNT1);
  while(1) {}

}
$ avr-gcc -Wall -save-temps -g -Os -ffunction-sections -fdata-sections -Wl,--gc-sections -mmcu=atmega328p netizen3.c -o netizen3.elf
$ avr-objdump -S netizen3.elf
00000080 <main>:

}

int __attribute__ ((__OS_main__)) main(void) {

  TCNT1 = bin2bcd16(TCNT1);
  80:	20 91 84 00 	lds	r18, 0x0084
  84:	30 91 85 00 	lds	r19, 0x0085

  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t tmp = 16;

  __asm__ __volatile__ (
  88:	80 e0       	ldi	r24, 0x00	; 0
  8a:	90 e0       	ldi	r25, 0x00	; 0
  8c:	a0 e0       	ldi	r26, 0x00	; 0
  8e:	40 e1       	ldi	r20, 0x10	; 16
  90:	04 2e       	mov	r0, r20
  92:	ff 27       	eor	r31, r31

00000094 <bcd1_36>:
  94:	22 0f       	add	r18, r18
  96:	33 1f       	adc	r19, r19
  98:	88 1f       	adc	r24, r24
  9a:	99 1f       	adc	r25, r25
  9c:	aa 1f       	adc	r26, r26
  9e:	0a 94       	dec	r0
  a0:	61 f0       	breq	.+24     	; 0xba <bcd3_36>
  a2:	eb e1       	ldi	r30, 0x1B	; 27

000000a4 <bcd2_36>:
  a4:	42 91       	ld	r20, -Z
  a6:	4d 5f       	subi	r20, 0xFD	; 253
  a8:	43 fd       	sbrc	r20, 3
  aa:	40 83       	st	Z, r20
  ac:	40 81       	ld	r20, Z
  ae:	40 5d       	subi	r20, 0xD0	; 208
  b0:	47 fd       	sbrc	r20, 7
  b2:	40 83       	st	Z, r20
  b4:	e8 31       	cpi	r30, 0x18	; 24
  b6:	b1 f7       	brne	.-20     	; 0xa4 <bcd2_36>
  b8:	ed cf       	rjmp	.-38     	; 0x94 <bcd1_36>

000000ba <bcd3_36>:

}

int __attribute__ ((__OS_main__)) main(void) {

  TCNT1 = bin2bcd16(TCNT1);
  ba:	90 93 85 00 	sts	0x0085, r25
  be:	80 93 84 00 	sts	0x0084, r24
  c2:	ff cf       	rjmp	.-2      	; 0xc2 <bcd3_36+0x8>

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Mar 17, 2016 - 05:43 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

Talk about optimization, GCC! :-D

You missed the fact that I was building with -c.  This skips the link stage.  When using avr-objdump on unlinked object code, all symbols are resolved to zero.  You can ignore that 'optimisation', because it isn't one.

 

clawson wrote:

Even an AVR Asm beginner like me could read that and instantly see what is going on. That is never going to be true of code written using inline Asm syntax!

 

(then again I guess it depends how highly you rate code maintainability ;)

I wouldn't expect an asm beginner to maintain >>any<< asm code, whether 'proper' asm or inline assembler.  FWIW, I have little difficulty understanding GCC's inline assembler (at least for AVR), although it has taken some time to get to that point.  I like the ability it has to 'play nice' with the compiler/optimiser.

 

In any event, the OP was specifically related to the handling of inline assembler operands, so I have tried to answer that.

 

EDIT:  I tend to be a bit more meticulous about the formatting and commenting of my inline assembler code.  In the case of the OP's example:

 

  __asm__ __volatile__ (
  /* loop counter    */ "mov    __tmp_reg__,  %[tmp]              \n\t" // 1
  /* clear ptr MSB   */ "clr    %B[ptr]                           \n"   // 1

                      "bcd1_%=:                                   \n\t"
  /* shift input     */ "lsl    %A[bin]                           \n\t" // 1
  /*  value through  */ "rol    %B[bin]                           \n\t" // 1
  /*  all bytes      */ "rol    %A[bcd]                           \n\t" // 1
  /*                 */ "rol    %B[bcd]                           \n\t" // 1
  /*                 */ "rol    %C[bcd]                           \n\t" // 1
  /* loop counter--  */ "dec    __tmp_reg__                       \n\t" // 1
  /* done            */ "breq   bcd3_%=                           \n\t" // 1/2

  /* MSB + 1         */ "ldr    %A[ptr],      %D[bcd]             \n"   // 1
                      "bcd2_%=:                                   \n\t"
  /* start with MSB  */ "ld     %[tmp],       -%a[ptr]            \n\t" // 2
  /* add 0x03        */ "subi   %[tmp],       -0x03               \n\t" // 1
  /* if bit 3 set    */ "sbrc   %[tmp],       3                   \n\t" // 1/2
  /*  store back     */ "st     %a[ptr],      %[tmp]              \n\t" // 2
  /* get again       */ "ld     %[tmp],       %a[ptr]             \n\t" // 2
  /* add 0x30        */ "subi   %[tmp],       -0x30               \n\t" // 1
  /* if bit 7 set    */ "sbrc   %[tmp],       7                   \n\t" // 1/2
  /*  store back     */ "st     %a[ptr],      %[tmp]              \n\t" // 2
  /* done?           */ "cpr    %A[ptr],      %A[bcd]             \n\t" // 1
  /* no, next byte   */ "brne   bcd2_%=                           \n\t" // 1/2
  /* next outer loop */ "rjmp   bcd1_%=                           \n"   // 2

  /* done          */ "bcd3_%=:                                   \n\t"

                      : /* output operands */
  /* BCD output      */ [bcd] "+&r" (bcd),
  /* binary input    */ [bin] "+&r" (bin),
  /* brings loop cnt */ [tmp] "+&d" (tmp),
  /* inner loop ptr  */ [ptr] "=&e" (ptr)

                      : /* input operancds */

                      : /* clobbers        */

                       );

In my editor, syntax highlighting really helps:

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Mar 17, 2016 - 06:40 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I agree that my code could be broken by an ABI change.

Inline assembly could improve that.

Everything except bcd could readily be made an asm argument.

bcd could be made a register variable with a specified set of registers.

OP could give its number as an integer argument to the asm statement.

An incompatible ABI change would be flagged by the compiler.

 

The big macro method would seem to work,

but I think that one per instruction type is more than necessary:

.macro opr op, regr, regi, post
  .if \regi == r0
    \op \regr, 0 \post
  .elseif \regi == r1
    \op \regr, 1 \post
  ...
  .elseif \regi == r31
    \op \regr, 31 \post
  .else
    .message "oops"
  .endif

post would allow tacking on a +1 or +3 .

No need to refer to the fourth byte of a 3-byte variable.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:

The big macro method would seem to work,

but I think that one per instruction type is more than necessary:

I hadn't considered that the opcode could also be a macro argument.  Much better.

 

skeeve wrote:

post would allow tacking on a +1 or +3 .

I can't get that to work.  Specifying nothing or +0 works (either appended directly to regi, or separated by a space as a fourth argument), but specifying +N where N is non-zero gives an error: 'Error: unknown pseudo-op: `.message''.  This suggests some kind of conflation with the .else clause.  Removing the else clause loses the error, but then also results in the omission of the macro's emitted opcode.

 

Can you give an example of how that fourth macro argument should be used?

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Mar 17, 2016 - 07:26 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
netizen wrote:
I understand and agree with the objective, yet I'm wondering whether we're not worrying too much about it. If this code was in a separate .S file, wouldn't all the registers it uses be clobbered?

Indeed they would. Even as a C function entirely written using inline assembler as we have done, this is true.

Alright. So why is skeeve "complaining" about the clobbering of Z in #3 and arguing for a separate .S file in #7 ? That seems contradictory. Note: I am not looking for a fight, skeeve, really just wondering what I'm missing. ^^

Are your sure about your second sentence? Because I seem to observe the contrary in my program: bin2bcd16 gets merged with other parts:

avr-objdump:

00000636 <drawDec>:
 636:	cf 93       	push	r28
 638:	df 93       	push	r29
 63a:	40 93 8a 00 	sts	0x008A, r20
 63e:	10 92 8b 00 	sts	0x008B, r1
 642:	24 e0       	ldi	r18, 0x04	; 4
 644:	26 1b       	sub	r18, r22
 646:	20 93 8c 00 	sts	0x008C, r18
 64a:	20 e0       	ldi	r18, 0x00	; 0
 64c:	30 e0       	ldi	r19, 0x00	; 0
 64e:	40 e0       	ldi	r20, 0x00	; 0
 650:	50 e1       	ldi	r21, 0x10	; 16
 652:	05 2e       	mov	r0, r21
 654:	ff 27       	eor	r31, r31

00000656 <bcd1_15>:
 656:	88 0f       	add	r24, r24
 658:	99 1f       	adc	r25, r25
 65a:	22 1f       	adc	r18, r18
 65c:	33 1f       	adc	r19, r19
 65e:	44 1f       	adc	r20, r20
 660:	0a 94       	dec	r0
 662:	61 f0       	breq	.+24     	; 0x67c <bcd3_15>
 664:	e5 e1       	ldi	r30, 0x15	; 21

00000666 <bcd2_15>:
 666:	52 91       	ld	r21, -Z
 668:	5d 5f       	subi	r21, 0xFD	; 253
 66a:	53 fd       	sbrc	r21, 3
 66c:	50 83       	st	Z, r21
 66e:	50 81       	ld	r21, Z
 670:	50 5d       	subi	r21, 0xD0	; 208
 672:	57 fd       	sbrc	r21, 7
 674:	50 83       	st	Z, r21
 676:	e2 31       	cpi	r30, 0x12	; 18
 678:	b1 f7       	brne	.-20     	; 0x666 <bcd2_15>
 67a:	ed cf       	rjmp	.-38     	; 0x656 <bcd1_15>

0000067c <bcd3_15>:
 67c:	e9 01       	movw	r28, r18
 67e:	84 2f       	mov	r24, r20
 680:	90 e0       	ldi	r25, 0x00	; 0
 682:	54 d1       	rcall	.+680    	; 0x92c <_drawDec.2076>

So I suspect LTO might pretty much treat this part as any other. Admittedly, I'm not using __attribute__ ((__always_inline__) precisely because that's the kind of optimization behavior I expect when it makes sense — and it does in this case. Or did I misunderstand you?

 

joeymorin wrote:
As for 'not worrying too much about it'... well, I would have stopped at the unrolled loop ;-) ... but your goal was to find a mechanism to extract a register's address. My goal was to do that without hindering the toolchain.

I'm perfectly fine with that, and that is definitely the code I'll be using. :-)

Again: just wondering about the apparent double standards when it gets to clobbering and inline vs external asm…

 

joeymorin wrote:
netizen wrote:
I swear I tried that first! It didn't work: the compiler would (silently) fail to respect my wishes and allocate it to the registers it fancied. Yet I tried it again and it does indeed work. Not sure how I managed to mess that up the first time…

Probably this: 16-bit and wider register variables must be word-aligned.

No, it did compile fine. It just wouldn't allocate the registers I had asked for. Nevermind, it doesn't really matter how I f*cked up.
 

joeymorin wrote:
That is what I was trying to achieve with the "t" constraint, but couldn't get it to work as advertised. I'm probably missing some key piece of information about its limitations.

I've tried to use it as well and couldn't make it work. I even googled a bit for an example but couldn't find one. But it doesn't seem to matter: using it as we do seems totally legit. That's the most important.

Yet I'd be curious to know how it's supposed to be used, if you ever get it to work.

 

joeymorin wrote:
netizen wrote:
neither tmp nor ptr_L need initialization (because they're both written to before they're read from anyway).

The compiler can't know that. It does not see the opcodes within the asm statement (that's what 'volatile' means in this context. Again, see later in this post.). That why, if the compiler is to know what it can and can't do, we must tell it through the use of appropriate constraints.

Indeed, but we do know that, so we want to take care of that ourselves. Precisely my point. ^^

 

joeymorin wrote:
In short, there are very few circumstances where the use of volatile is unwarranted or incorrect.

Yet I'm afraid this is one of them. :-)

This code has no side-effects, it only transforms inputs into outputs. If ever it was in a loop with a constant input, we do want it to get optimized out of it. If it's acting on input that won't be used, we do want it to be optimized away.

That's a lot of quoting, but you haven't made your case as to why we'd need that asm to be volatile. ^^

 

joeymorin wrote:
In any event, the OP was specifically related to the handling of inline assembler operands, so I have tried to answer that.

You did great. Don't let anyone suggest otherwise. :-)

 

I don't agree with all the complaints about the inline asm style. I think it's alright, could be better, but far from being as bad as people say it is. Once you read a couple, the %[var] notation does not slow you down a bit, clawson. And it references variables that are just a few lines down, not a #define that could be in another file.

Also, this code was quite a tricky one because of how Z is used, thus not a fair assessment of inline asm. And the end result is way more efficient than the .S file solution is. If you're OK with clobbering all the registers as in a .S file, the inline asm looks just as good. Even the version only clobbering Z looks just as good, according to me.

I suppose the hardest bit is the input/output/clobber parts. That takes some thinking at first, and documentation isn't that great.

 

I like your formatting, Joey. Not sure I'd transfer the comments on the C side of things though: it's handy to look at the temp .s file with the comments still there.

 

EDIT: Missed the last comments while typing this one, sorry if it sounds anachronistic.

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Thu. Mar 17, 2016 - 07:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

So why is skeeve "complaining" about the clobbering of Z in #3

Clobbering should always be avoided when possible.  This is the same argument I made before.  Let the compiler make as many register choices for you as possible.  Clobbering is a last resort and can tie the hands of the compiler.

 

netizen wrote:

and arguing for a separate .S file in #7

I can't speak for @skeeve.  Inline assembler is in many respects a 'kludge', and it differs from straight assembler enough to make it a less obvious choice for any given task.  I suppose it's also partly a personal choice.  I prefer inline assembler for C projects.  Using plain assembler in a .S requires abiding by the ABI.  That doesn't offend me per se, but it puts me on edge.  As with everything, 'it depends'.  Do as you see fit.

 

netizen wrote:

Are your sure about your second sentence? Because I seem to observe the contrary in my program: bin2bcd16 gets merged with other parts:

Note how I have been compiling.  No inlining of the code occurred.  Several factors can lead to inlining.  When it happens, the compiler doesn't have to worry about call-saved vs. call used registers.  They are all local to the code into which the function was inlined, and the register allocator and optimiser will shuffle things around as they see fit.

 

netizen wrote:

Again: just wondering about the apparent double standards when it gets to clobbering and inline vs external asm…

I don't understand.  Or you don't.  I don't perceive anyone has contradicted themselves or each other w.r.t. external ASM v.s. clobbering.  Clobbering is a non-issue for external ASM.  You must abide by the ABI, and preserve/restore any of the call-saved registers your function uses.  The caller forfeits the contents of the call-used registers, regardless of whether or not the called function modified them.  Yes, LTO can make up some of the difference.  Again, do whatever you feel most comfortable with.  In the end, it is much ado about nothing.  While this is all an interesting intellectual exercise, I generally don't get hung up about a few cycles here, or a few flash words there, >>unless<< there is a real-world need to trim things down.  Outside of critical loops, it just doesn't happen often enough to worry about.  I have other things to do ;-)

 

netizen wrote:

I've tried to use it as well and couldn't make it work. I even googled a bit for an example but couldn't find one. But it doesn't seem to matter: using it as we do seems totally legit. That's the most important.

Yet I'd be curious to know how it's supposed to be used, if you ever get it to work.

The advantage of "t", if it worked, would be that we wouldn't have to worry about >>how<< __tmp_reg__ got its value.  In your case, you were able to use the tmp variable for double duty.  In the absence of such an opportunity, you'd have to provide a "d" constrained register variable anyway.  If the value we needed was already in an existing register (which the compiler would know), then the ldi could be skipped altogether in favour of a more general mov, with no need for an additional high register.

 

netizen wrote:

Indeed, but we do know that, so we want to take care of that ourselves. Precisely my point. ^^

That approach is inconsistent with the use case for inline assembler.  If you want to take care of everything, use Basic ASM, or use external .S.  Extended ASM is intended to let the compiler do that kind of work for you.

 

In any event, the 'default zero-init' behaviour >>is<< inconsistent with the compiler's warnings.  That is bad.  Ultimately, though, the error was mine by specifying the operand as both input/output.

 

netizen wrote:

Yet I'm afraid this is one of them. :-)

It isn't.  The output operands bin, tmp, and ptr are never used outside of the asm statement.  The compiler is free to optimise away any code within the statement which operates on those operands.  The fact that it may not today doesn't mean you should rely upon a future version to be so kind.  If you're writing in assembler (whether inline or .S), you're doing it because you want control of the algorithm and implementation of that algorithm.  Yes, use the constraints and other mechanisms to feed the compiler as much data about your code as possible in order to allow the register allocator the chance to make the best choices for the optimiser, but if you want to integrity of the code in the Assembler Template string, you really should be using volatile.

 

netizen wrote:

I like your formatting, Joey. Not sure I'd transfer the comments on the C side of things though: it's handy to look at the temp .s file with the comments still there.

While the .s is handy at times, I find it much easier to clearly see the comments in the C source, since that's what I'll be editing, and code highlighting doesn't work on the comments within the AssemblerTemplate string.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My .S file assumed a fixed ABI.

At the time of its writing, allowing for an ABI change had not occurred to me.

Inline assembly can deal with ABI changes.

 

I'd intended the #define-s to be a convenient mechanism for giving useful names to registers,

hence they would be in the .S file.

Once they are trustworthy, one would not need to look at them to understand the rest of the code.

I should have commented that all named registers need be distinct, that some need to be upper registers

and that some need to be call-used (clobberable) registers.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
Can you give an example of how that fourth macro argument should be used?
Not reliably.

Dead reckoning here.

I don't actually have avr toolchain handy.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

This reply has been marked as the solution. 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:

Not reliably.

 

Dead reckoning here.

I don't actually have avr toolchain handy.

Fair enough.  Until then, here's the generic macro without the fourth argument, and the OP's function which uses it:

 

$ cat netizenassemblermacros.h
__asm__ (
       ".ifndef OPR_MACRO_DEFINED                               \n"
       "                                                        \n"
       ".set OPR_MACRO_DEFINED, 0                               \n"
       "                                                        \n"
       ".macro opr op, regr, regi                               \n"
       "  .if \\regi == r0                                      \n"
       "    \\op \\regr, 0                                      \n"
       "  .elseif \\regi == r1                                  \n"
       "    \\op \\regr, 1                                      \n"
       "  .elseif \\regi == r2                                  \n"
       "    \\op \\regr, 2                                      \n"
       "  .elseif \\regi == r3                                  \n"
       "    \\op \\regr, 3                                      \n"
       "  .elseif \\regi == r4                                  \n"
       "    \\op \\regr, 4                                      \n"
       "  .elseif \\regi == r5                                  \n"
       "    \\op \\regr, 5                                      \n"
       "  .elseif \\regi == r6                                  \n"
       "    \\op \\regr, 6                                      \n"
       "  .elseif \\regi == r7                                  \n"
       "    \\op \\regr, 7                                      \n"
       "  .elseif \\regi == r8                                  \n"
       "    \\op \\regr, 8                                      \n"
       "  .elseif \\regi == r9                                  \n"
       "    \\op \\regr, 9                                      \n"
       "  .elseif \\regi == r10                                 \n"
       "    \\op \\regr, 10                                     \n"
       "  .elseif \\regi == r11                                 \n"
       "    \\op \\regr, 11                                     \n"
       "  .elseif \\regi == r12                                 \n"
       "    \\op \\regr, 12                                     \n"
       "  .elseif \\regi == r13                                 \n"
       "    \\op \\regr, 13                                     \n"
       "  .elseif \\regi == r14                                 \n"
       "    \\op \\regr, 14                                     \n"
       "  .elseif \\regi == r15                                 \n"
       "    \\op \\regr, 15                                     \n"
       "  .elseif \\regi == r16                                 \n"
       "    \\op \\regr, 16                                     \n"
       "  .elseif \\regi == r17                                 \n"
       "    \\op \\regr, 17                                     \n"
       "  .elseif \\regi == r18                                 \n"
       "    \\op \\regr, 18                                     \n"
       "  .elseif \\regi == r19                                 \n"
       "    \\op \\regr, 19                                     \n"
       "  .elseif \\regi == r20                                 \n"
       "    \\op \\regr, 20                                     \n"
       "  .elseif \\regi == r21                                 \n"
       "    \\op \\regr, 21                                     \n"
       "  .elseif \\regi == r22                                 \n"
       "    \\op \\regr, 22                                     \n"
       "  .elseif \\regi == r23                                 \n"
       "    \\op \\regr, 23                                     \n"
       "  .elseif \\regi == r24                                 \n"
       "    \\op \\regr, 24                                     \n"
       "  .elseif \\regi == r25                                 \n"
       "    \\op \\regr, 25                                     \n"
       "  .elseif \\regi == r26                                 \n"
       "    \\op \\regr, 26                                     \n"
       "  .elseif \\regi == r27                                 \n"
       "    \\op \\regr, 27                                     \n"
       "  .elseif \\regi == r28                                 \n"
       "    \\op \\regr, 28                                     \n"
       "  .elseif \\regi == r29                                 \n"
       "    \\op \\regr, 29                                     \n"
       "  .elseif \\regi == r30                                 \n"
       "    \\op \\regr, 30                                     \n"
       "  .elseif \\regi == r31                                 \n"
       "    \\op \\regr, 31                                     \n"
       "  .else                                                 \n"
       "    .message \"ERROR: unrecognised register name\"      \n"
       "  .endif                                                \n"
       ".endm                                                   \n"
       "                                                        \n"
       ".endif                                                  \n"
);
$ cat netizen3.c
#include <avr/io.h>
#include "netizenassemblermacros.h" //

inline __attribute__ ((__always_inline__)) __uint24 bin2bcd16(uint16_t bin) {

  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t tmp = 16;

  __asm__ __volatile__ (
  /* loop counter    */ "mov      __tmp_reg__,  %[tmp]            \n\t" // 1
  /* clear ptr MSB   */ "clr      %B[ptr]                         \n"   // 1

                      "bcd1_%=:                                   \n\t"
  /* shift input     */ "lsl      %A[bin]                         \n\t" // 1
  /*  value through  */ "rol      %B[bin]                         \n\t" // 1
  /*  all bytes      */ "rol      %A[bcd]                         \n\t" // 1
  /*                 */ "rol      %B[bcd]                         \n\t" // 1
  /*                 */ "rol      %C[bcd]                         \n\t" // 1
  /* loop counter--  */ "dec      __tmp_reg__                     \n\t" // 1
  /* done            */ "breq     bcd3_%=                         \n\t" // 1/2

  /* MSB + 1         */ "opr ldi  %A[ptr],      %D[bcd]           \n"   // 1
                      "bcd2_%=:                                   \n\t"
  /* start with MSB  */ "ld       %[tmp],       -%a[ptr]          \n\t" // 2
  /* add 0x03        */ "subi     %[tmp],       -0x03             \n\t" // 1
  /* if bit 3 set    */ "sbrc     %[tmp],       3                 \n\t" // 1/2
  /*  store back     */ "st       %a[ptr],      %[tmp]            \n\t" // 2
  /* get again       */ "ld       %[tmp],       %a[ptr]           \n\t" // 2
  /* add 0x30        */ "subi     %[tmp],       -0x30             \n\t" // 1
  /* if bit 7 set    */ "sbrc     %[tmp],       7                 \n\t" // 1/2
  /*  store back     */ "st       %a[ptr],      %[tmp]            \n\t" // 2
  /* done?           */ "opr cpi  %A[ptr],      %A[bcd]           \n\t" // 1
  /* no, next byte   */ "brne     bcd2_%=                         \n\t" // 1/2
  /* next outer loop */ "rjmp     bcd1_%=                         \n"   // 2

  /* done          */ "bcd3_%=:                                   \n\t"

                      : /* output operands */
  /* BCD output      */ [bcd] "+&r" (bcd),
  /* binary input    */ [bin] "+&r" (bin),
  /* brings loop cnt */ [tmp] "+&d" (tmp),
  /* inner loop ptr  */ [ptr] "=&e" (ptr)

                      : /* input operands  */

                      : /* clobbers        */

                       );

  return bcd;

}



int __attribute__ ((__OS_main__)) main(void) {

  TCNT1 = bin2bcd16(TCNT1);
  while(1) {}

}
$ avr-gcc -Wall -save-temps -g -Os -ffunction-sections -fdata-sections -Wl,--gc-sections -mmcu=atmega328p netizen3.c
$ cat netizen3.s
.
.
.
/* #NOAPP */
	.section	.text.bin2bcd16,"ax",@progbits
.global	bin2bcd16
	.type	bin2bcd16, @function
bin2bcd16:
.LFB0:
	.file 1 "netizen3.c"
	.loc 1 4 0
	.cfi_startproc
.LVL0:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	.loc 1 10 0
	ldi r18,0
	ldi r19,0
	ldi r20,0
	ldi r21,lo8(16)
/* #APP */
 ;  10 "netizen3.c" 1
	mov      __tmp_reg__,  r21            
	clr      r31                         
bcd1_11:                                   
	lsl      r24                         
	rol      r25                         
	rol      r18                         
	rol      r19                         
	rol      r20                         
	dec      __tmp_reg__                     
	breq     bcd3_11                         
	opr ldi  r30,      r21           
bcd2_11:                                   
	ld       r21,       -Z          
	subi     r21,       -0x03             
	sbrc     r21,       3                 
	st       Z,      r21            
	ld       r21,       Z           
	subi     r21,       -0x30             
	sbrc     r21,       7                 
	st       Z,      r21            
	opr cpi  r30,      r18           
	brne     bcd2_11                         
	rjmp     bcd1_11                         
bcd3_11:                                   
	
 ;  0 "" 2
.LVL1:
	.loc 1 53 0
/* #NOAPP */
	mov r24,r20
	movw r22,r18
	ret
	.cfi_endproc
$ avr-objdump -S netizen3.elf
.
.
.
00000080 <main>:



int __attribute__ ((__OS_main__)) main(void) {

  TCNT1 = bin2bcd16(TCNT1);
  80:	20 91 84 00 	lds	r18, 0x0084
  84:	30 91 85 00 	lds	r19, 0x0085

  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t tmp = 16;

  __asm__ __volatile__ (
  88:	80 e0       	ldi	r24, 0x00	; 0
  8a:	90 e0       	ldi	r25, 0x00	; 0
  8c:	a0 e0       	ldi	r26, 0x00	; 0
  8e:	40 e1       	ldi	r20, 0x10	; 16
  90:	04 2e       	mov	r0, r20
  92:	ff 27       	eor	r31, r31

00000094 <bcd1_36>:
  94:	22 0f       	add	r18, r18
  96:	33 1f       	adc	r19, r19
  98:	88 1f       	adc	r24, r24
  9a:	99 1f       	adc	r25, r25
  9c:	aa 1f       	adc	r26, r26
  9e:	0a 94       	dec	r0
  a0:	61 f0       	breq	.+24     	; 0xba <bcd3_36>
  a2:	eb e1       	ldi	r30, 0x1B	; 27

000000a4 <bcd2_36>:
  a4:	42 91       	ld	r20, -Z
  a6:	4d 5f       	subi	r20, 0xFD	; 253
  a8:	43 fd       	sbrc	r20, 3
  aa:	40 83       	st	Z, r20
  ac:	40 81       	ld	r20, Z
  ae:	40 5d       	subi	r20, 0xD0	; 208
  b0:	47 fd       	sbrc	r20, 7
  b2:	40 83       	st	Z, r20
  b4:	e8 31       	cpi	r30, 0x18	; 24
  b6:	b1 f7       	brne	.-20     	; 0xa4 <bcd2_36>
  b8:	ed cf       	rjmp	.-38     	; 0x94 <bcd1_36>

000000ba <bcd3_36>:



int __attribute__ ((__OS_main__)) main(void) {

  TCNT1 = bin2bcd16(TCNT1);
  ba:	90 93 85 00 	sts	0x0085, r25
  be:	80 93 84 00 	sts	0x0084, r24
  c2:	ff cf       	rjmp	.-2      	; 0xc2 <bcd3_36+0x8>

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Beautiful, skeeve. A macro implementation of the C & operator (for right-hand operands). Exactly what I was looking for from the start: ^^

netizen wrote:
So my question is: is there a way to "dynamically" refer to these addresses from within the inline assembly (i.e. some equivalent of &bcd) ?

Now, could it reside in some .S header file instead of being inline asm (in a .h file)?

 

joeymorin wrote:
Clobbering should always be avoided when possible. This is the same argument I made before. Let the compiler make as many register choices for you as possible. Clobbering is a last resort and can tie the hands of the compiler.

Of course. That is the whole point of this thread: finding an & operator, because otherwise I have to clobber the bcd registers by making them hard-linked.

I assume that in such a case they cannot be moved around and, should the compiler absolutely need them, it would use the stack (sort of a on-demand individualized prologue/epilogue). Sure it might impair its ability to produce uber-optimal code… in theory. Most likely (as in this project) it will have no effect at all in practice.

On the other side, we have an external ASM routine, abiding by the ABI. Which means it's clobbering all of the 12 call-used registers (4x more). And on the optimization front, it raises an even bigger barrier.

Which one is better for our particular use-case?

 

Note that since we can dynamically refer to automatically allocated registers, there is no clobbering at all and maximal optimization potential, thus there is no competition (in both the code size and speed domains)… in theory.

 

joeymorin wrote:
netizen wrote:
Again: just wondering about the apparent double standards when it gets to clobbering and inline vs external asm…

I don't understand. Or you don't.

Sure thing. But that mystery shouldn't be too hard to unravel: I believe I've above given enough details about how I picture the whole thing, it should be easy to point out what I'm getting wrong. ^^

 

joeymorin wrote:
In the end, it is much ado about nothing. While this is all an interesting intellectual exercise, I generally don't get hung up about a few cycles here, or a few flash words there, >>unless<< there is a real-world need to trim things down.

Good, that's how I intended it to be like. Not something you do every day, but at times it's good to brush the rust a bit.^^

And for us learners, an excellent way to see what we should generally aim for, and how to better evaluate the cost/benefit ratio of getting all the way there.

Thanks again for sharing your experience and expertise. This thread wouldn't have gone very far without your help. :-)

 

joeymorin wrote:
netizen wrote:
Indeed, but we do know that, so we want to take care of that ourselves. Precisely my point. ^^

That approach is inconsistent with the use case for inline assembler. If you want to take care of everything, use Basic ASM, or use external .S. Extended ASM is intended to let the compiler do that kind of work for you.

I don't see it as an "approach", but a pragmatic assessment of the issue: this pointer does not need to be initialized precisely because it's used for output only. Preventing automatic initialization by the compiler and correctly declaring the = constraint of the operand are one and the same thing.

 

joeymorin wrote:
netizen wrote:
Yet I'm afraid this is one of them. :-)

It isn't. The output operands bin, tmp, and ptr are never used outside of the asm statement. The compiler is free to optimise away any code within the statement which operates on those operands.

They're never used only if this function is never called, in which case it is desired behavior to get rid of it. If "external" code uses it, then bin is used, thus tmp and ptr are used as well. Nothing but standard optimizing compiler stuff. I don't get your point.

 

skeeve wrote:
My .S file assumed a fixed ABI. At the time of its writing, allowing for an ABI change had not occurred to me. Inline assembly can deal with ABI changes.

It did not occur to me either!

Considering the other relative "flaws" this approach has in this particular use case, I don't think it would become a deal-breaker.

Anyway, thanks for mentioning it: it was worth pointing out.

 

 

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Fri. Mar 18, 2016 - 02:47 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

« This is not the post you're looking for… »

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Fri. Mar 18, 2016 - 02:33 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

Now, could it reside in some .S header file instead of being inline asm (in a .h file)?

Of course:

$ cat netizenassemblermacros.inc

.ifndef OPR_MACRO_DEFINED

.set OPR_MACRO_DEFINED, 0

.macro opr op, regr, regi
  .if \regi == r0
    \op \regr, 0
  .elseif \regi == r1
    \op \regr, 1
  .elseif \regi == r2
    \op \regr, 2
  .elseif \regi == r3
    \op \regr, 3
  .elseif \regi == r4
    \op \regr, 4
  .elseif \regi == r5
    \op \regr, 5
  .elseif \regi == r6
    \op \regr, 6
  .elseif \regi == r7
    \op \regr, 7
  .elseif \regi == r8
    \op \regr, 8
  .elseif \regi == r9
    \op \regr, 9
  .elseif \regi == r10
    \op \regr, 10
  .elseif \regi == r11
    \op \regr, 11
  .elseif \regi == r12
    \op \regr, 12
  .elseif \regi == r13
    \op \regr, 13
  .elseif \regi == r14
    \op \regr, 14
  .elseif \regi == r15
    \op \regr, 15
  .elseif \regi == r16
    \op \regr, 16
  .elseif \regi == r17
    \op \regr, 17
  .elseif \regi == r18
    \op \regr, 18
  .elseif \regi == r19
    \op \regr, 19
  .elseif \regi == r20
    \op \regr, 20
  .elseif \regi == r21
    \op \regr, 21
  .elseif \regi == r22
    \op \regr, 22
  .elseif \regi == r23
    \op \regr, 23
  .elseif \regi == r24
    \op \regr, 24
  .elseif \regi == r25
    \op \regr, 25
  .elseif \regi == r26
    \op \regr, 26
  .elseif \regi == r27
    \op \regr, 27
  .elseif \regi == r28
    \op \regr, 28
  .elseif \regi == r29
    \op \regr, 29
  .elseif \regi == r30
    \op \regr, 30
  .elseif \regi == r31
    \op \regr, 31
  .else
    .message "ERROR: unrecognised register name"
  .endif
.endm

.endif

However, I don't see the point.  The value of such a macro is to convert a compiler-generated operand of the form rN into an integer of the form N.  When writing pure assembler in a .S file, you are under no such restrictions.  More conventional approaches can be used, such as #define, or .set, or simply symbol = value.

 

netizen wrote:

Preventing automatic initialization by the compiler and correctly declaring the = constraint of the operand are one and the same thing.

They are not.  One is questionable behaviour on the part of the compiler, initialising a variable without being asked to do so.  The other is a coding error on the part of the programmer.  Once the error is expunged, the questionable behaviour no longer manifests, but that does not excuse the former.

 

In truth, it may not actually break the standard.  An uninitialised variable is according to the standard 'undefined', and zero is just a good an undefined value as any other.  My concern is two-fold:  1) the wasted code space, and 2) the discrepancy between the warning re: uninitialised variable and the actual initialisation.  In truth, again, inline assembler is a GCC extension so the standard doesn't apply in the same way, but for my money I'd be happier if the unasked for initialisation didn't occur.

 

netizen wrote:

They're never used only if this function is never called

Again, no.  The variables are declared in C, but never used in C.  This is true even if the function itself is called.  This makes them ripe for elimination by the compiler.  Any references to them by code in a non-volatile asm statement's assembler template string could therefore also be eliminated.  The only variable which is referenced in C code is bcd, so that is the only one which would be exempt from elimination, and likewise only instructions within the assembler template string which make direct reference to bcd's operand would be exempt.  Instructions like ld %[tmp], %a[ptr] could be culled.

 

Without volatile, you are relying on the optimiser to recognise dependencies between instruction sequences, registers, and variables, which it did not itself generate.  That is not something I'm willing to allow it to do.  I know from experience that it doesn't always get it right.  The whole point of using assembler in the first place is that I know (or think) that I can do better than the compiler for this particular piece of code.  To avoid the use of volatile is to say that I was wrong.  The only thing I want the compiler to do is select the appropriate registers according to the constraints I choose (the selection of which is informed by my understanding of the architecture and instruction set), and to do so in a way that gives it the greatest latitude w.r.t. its own optimisation strategies in the context of >>surrounding code<<.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
netizen wrote:
Now, could it reside in some .S header file instead of being inline asm (in a .h file)?

Of course.

However, I don't see the point.

The point is to avoid the unnecessary inline asm kludge (" and \n), which happens to make it look worse than it deserves to people like clawson. :-)

 

joeymorin wrote:
netizen wrote:
Preventing automatic initialization by the compiler and correctly declaring the = constraint of the operand are one and the same thing.

They are not. One is questionable behaviour on the part of the compiler, initialising a variable without being asked to do so. The other is a coding error on the part of the programmer. Once the error is expunged, the questionable behaviour no longer manifests, but that does not excuse the former.

I see your point, yet they are: you're just getting confused by your functional analysis. :-)

There is no valid use-case for declaring an un-initialized local variable as input (meaning you are definitely going to read it before you —perhaps— write to it). If your plan is to explicitly read first from an un-initialized local variable, that is a straight and square bug. Assuming you know what you're doing, the only reason you'd have to do such a thing is to expect to read a default value (which happens to be zero); Thus GCC behavior is perfectly reasonable.

 

joeymorin wrote:
The only variable which is referenced in C code is bcd, so that is the only one which would be exempt from elimination

You're again forgetting about bin, which is referenced in C as the function return value.

 

joeymorin wrote:
Without volatile, you are relying on the optimiser to recognise dependencies between instruction sequences, registers, and variables, which it did not itself generate. That is not something I'm willing to allow it to do. I know from experience that it doesn't always get it right. The whole point of using assembler in the first place is that I know (or think) that I can do better than the compiler for this particular piece of code. To avoid the use of volatile is to say that I was wrong.

Assuming the asm block has been properly defined (with correct input/output/clobbers/etc) why should I fear the dependency mechanisms to mess up any more than they would with the rest of my compiled C code?

As I see it, the point of inline asm goes further than what you restrict it to: it is to seamlessly integrate code that could as well have been C as far as the other stages of the compile process are concerned, except it couldn't be expressed as pure C (and thus needed a lower-level push). Hopefully the compiler can optimize it even further using its knowledge of the global picture in a particular use-case.

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Fri. Mar 18, 2016 - 05:14 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

The point is to avoid the unnecessary inline asm kludge

Right, but as I've said, if you're writing in pure assembler, you're not restricted by the operand mechanism provided by inline assembler.  You don't need to (can't, actually) rely on the compiler to select a register for you, the nomenclature of that register won't be force upon you as 'rN', and you won't need to transform it into an address in the form of 'N'.  Simply declare the register as a symbol with a plain numeric value.  It can be used unmodified both as a register name and as an integer constant representing an address in precisely the manner you seek.  Job done.  No need for the 'opr' macro at all.

 

netizen wrote:

Assuming you know what you're doing, the only reason you'd have to do such a thing is to expect to read a default value (which happens to be zero);  Thus GCC behavior is perfectly reasonable.

It is perfectly arbitrary.  It is also undocumented.  An undocumented default value is not much use, is it?  How do I know it will always be the same 'default' value?  I don't.  How can I 'know what I'm doing' if there is no supporting documentation to point to?  I can't.

 

I have already noted that making ptr and tmp both input/output operands was a bug, one which has been remedied.  The bug does not make the compiler's behaviour correct.  Nor incorrect.  The two issues (bug, and unasked-for-initialisation) are at best peripherally related (in that one exposed the other), despite your protestations that I am confused.

 

netizen wrote:

You're again forgetting about bin, which is referenced in C as the function return value.

You and I must be working from different code:

__uint24 bin2bcd16(uint16_t bin) {

  __uint24 bcd = 0;
  __uint24 *ptr;
  uint8_t tmp = 16;

  __asm__ __volatile__ ("stuff");

  return bcd;

}

bin is an argument to the function, not the return value.  As such it is semantically equivalent to a local variable.  The only difference is that it is initialised by the caller.  The compiler may not see that anything in bin2bcd() depends upon its value or even its existence.  As such, any references to it can be expunged, including any instructions within the asm statement (were it not declared volatile) be expunged.

 

Relying on the compiler to follow the code contained within an asm statement may, or may not, lead to unexpected results.  The same is true of a normal variable.  Relying on the compiler to refrain optimising away an apparently needless operation will often result in disappointment.  The solution provided by the language is to declare the variable volatile.  This disables a specific set of optimisations.  So it is when you declare an asm statement volatile.

 

netizen wrote:

why should I fear the dependency mechanisms to mess up any more than they would with the rest of my compiled C code?

The instructions you place within an asm statement are to some extent opaque to the compiler.  They are not an expression of any code generation model at its disposal.  Rather, they are an expression of your knowledge as an asm programmer.  The optimiser will still analyse it, but it may deduce that a code sequence is superfluous when in fact it is not.  It is not merely a matter of it knowing which variables are associated with which operands.

 

netizen wrote:

except it couldn't be expressed as pure C

Everything which can be expressed in assembler can also be expressed in pure C.  Whether or not it can be expressed efficiently is another matter.  For example, there is no notion of or support for the carry bit in C, but an algorithm which can be expressed in asm and which uses the carry bit can also be expressed in C with standard bit manipulation techniques.  After all, C isn't very much more than a collection of portable assembler macros.  This is (one of the reasons) why C and asm integrate so readily, and so well.

 

I've said it already, if you've chosen to use inline assembler in the first place it's because you've concluded that you will do a better job than the compiler.  Giving it the last word by allowing it to optimise code you didn't trust it to generate in the first place is nonsensical.  Yes, give it all the information it needs to nestle your hand-crafted assembler neatly and efficiently within the surrounding code, but no more.

 

You may chose to refrain from using volatile with asm, but you do so in the opposition to explicit warning from the compiler developers not to.  The relevant documentation has already been reference in this thread.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Fri. Mar 18, 2016 - 06:05 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
Right, but as I've said, if you're writing in pure assembler, you're not restricted by the operand mechanism provided by inline assembler.

Indeed. My point wasn't to use this macro in pure assembler, but from within inline asm. I wrongly assumed it was obvious in our context (the fact that you missed it seems to suggest the idea is so outlandish it couldn't be seen as a reasonable interpretation). Anyway, let me try again: is it possible to use in an inline asm block an assembler macro that as been defined in a pure assembler file?

 

joeymorin wrote:
It is perfectly arbitrary. It is also undocumented. An undocumented default value is not much use, is it?

I admit I haven't tried to find documentation for it. But you know what they say: this is an open source project, the ultimate documentation is the code itself. :-)

I'm afraid you're assuming an ideal symmetry between code and documentation that can too rarely be relied upon. This is not a standard-related issue, it is a corner case of purely GCC-centric features (that are not very well documented to begin with): what official and documented expectation/contract would either behavior go against?

What would you rather have seen? A simple warning seems not up to the task as, if you do not consider automatic initialization proper behavior, it then always is a bug worth of a plain error. Unlike in C where, in the absence of explicit input/output constraints, it could be legitimate implementation that go beyond what the compiler can know about (thus a warning seems adequate). Would you have expected an error then?

It is clear to me the only expectation that cannot be right is precisely the one we were making in that piece of code: silent passing of un-initialized local variable explicitly constrained as input. I personally don't have any dog in that fight and it does not bother me as much as it seems to bother you, yet I'd be curious to know more if you dig this any deeper: perhaps we should just ask on a GCC list what the rationale behind this behavior is?

 

It seems I'm getting out of topic here, but please stick with me: I believe the input/output naming is ill-advised and unnecessarily confusing, not unlike the UART naming of RX/TX lines btw. We're defining an interface between two pieces of software, so what is input for one is output for the other (and reciprocally). This leads to such gems right in the Inline Assembler Cookbook:

asm volatile("in %0,%1" "\n\t"
"out %1, %2" "\n\t"
: "=&r" (input)
: "I" (_SFR_IO_ADDR(port)), "r" (output)
);

… where "input" is the output variable and "output" the input one. Although to be fair, input/output were probably named wrt to the IO port in this case.

Anyway, I think it would make things clearer if constraints were expressed as read, write or read/write:

  • Read only variable : the compiler can assume the allocated register value will not change. If the variable is un-initialized it must be an error (because automatically initializing it would mean changing the register value).
  • Write only variable : the compiler can assume the allocated register is tainted (it must consider any previously known value as obsolete). Un-initialized variables are fine.
  • Read/write variable : the allocated register is tainted and un-initialized variables raise an error.

Note that it is tempting at first glance to look at "read/write" the same way we look at any local C variable, but it would be a mistake: one (C) is read/write agnostic, the other (inline asm) is explicitly read/write. Thus the latter allows checks and optimizations opportunities that are not available to the former.

 

Back to our point: with such qualifiers/constraints there would be no need for separate parameter spaces for InputOperands and OutputOperands in the extended asm syntax, a single Operands would do:

asm [volatile] ( AssemblerTemplate
               [ : Operands
               [ : Clobbers ] ])

And the "constraint domain/logic" we're talking about wouldn't be split between syntaxic parameter spaces and constraint strings (= and +), as it is today.
More importantly for us here, I believe it would make it more obvious why the proper behavior in case of read-only un-initialized variable should be to raise an error. And that there actually is only one single issue here, not two peripherally related ones. But perhaps that's just me…

 

joeymorin wrote:
netizen wrote:
You're again forgetting about bin, which is referenced in C as the function return value.

You and I must be working from different code:

You're absolutely right: I got confused from working on other binary/BCD functions on the side. I meant the return value all along, and that is indeed bcd in this case. :-)

So (correct me if I'm wrong) your point is that the compiler might miss the relationship between bin and bcd because it has not established this relationship itself: it was created by us within the assembler template.

Here is how I see it: either the compiler considers our template as a black box, in which case it must rely on our constraints only to decide if there is or not a relationship, and thus assume there is when one variable is input and another output. Or the compiler can attempt to crack the black box open and optimize through it, in which case it must be able to determine that the carry is transferred from one to the other (attempting to optimize disregarding our explicit constraints makes no sense if it cannot make at least equivalent deductions on its own).

What you're saying is more or less that there can be a mix of both resulting in improper behavior. Is that right? If it is, can you provide any source or evidence that it would indeed behave that way?

 

joeymorin wrote:
Everything which can be expressed in assembler can also be expressed in pure C. Whether or not it can be expressed efficiently is another matter.

I of course meant "expressed as efficiently". :-)

It's amazing how mis-communication can still find cracks to exploit even between two people honestly trying hard to be as explicit as they can. Thanks for not letting it go. ^^

 

joeymorin wrote:
I've said it already, if you've chosen to use inline assembler in the first place it's because you've concluded that you will do a better job than the compiler. Giving it the last word by allowing it to optimise code you didn't trust it to generate in the first place is nonsensical.

Well, if what you were saying was true, what would be the point of having an optional volatile modifier? All inline asm would have to be volatile.

Don't get me wrong: I understand why in a production environment one might not want to take such a risk, nor dedicate the resources to prove that volatile indeed isn't mandatory in a particular case.

But in an intellectual exercise (as you put it yourself) such as this, I believe this question cannot be set aside on such basis: if the compiler could mess it up, we either need to explain precisely why on theoretical grounds (or, as a lazy practical alternative, let it run without volatile in the hope it crashes us out of doubt ^^). I don't think such theoretical point has been established or, if it has, I have missed its significance when I read it.

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Fri. Mar 18, 2016 - 02:50 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:

is it possible to use in an inline asm block an assembler macro that as been defined in a pure assembler file?

Ah!  Sokath, his eyes uncovered!

 

That's a good question.  Without actually doing any research or testing, I believe 'no'.  A single compilation unit (i.e. source file) results in a single .s file.  That is, when you have a .c file, the compiler driver (avr-gcc, in our case), interprets its contents as only one language.  It does this either automatically by virtue of the file's extension, or at your explicit behest by means of the -x command line option.  Any files which are #include'd into that compilation unit will also be interpreted using that language.  Since inline asm is a GCC extension to C, it is only valid with a C compilation unit, and it is only that type of source file in which you would place an inline asm statement.  Were you to try to #include an assembler source file to such a C source file, I can only imagine that the compiler would barf.

 

Let me try:

$ head netizen3.c
#include <avr/io.h>
#include "netizenassemblermacros.inc"

$ avr-gcc -Wall -save-temps -g -Os -ffunction-sections -fdata-sections -Wl,--gc-sections -mmcu=atmega328p netizen3.c -o netizen3.elf
In file included from netizen3.c:2:0:
netizenassemblermacros.inc:1:1: error: expected identifier or '(' before '.' token
 .ifndef OPR_MACRO_DEFINED
 ^

Yup.

 

There may be another way, but I expect it won't be elegant.

 

netizen wrote:

What would you rather have seen? A simple warning seems not up to the task

I would prefer only the warning, with the default behaviour being 'no code emitted'.  Alternatively, default initialisation code could be emitted as is the case now, but the warnings issued should inform the user that this has occurred.  Instead of just the usual warning: 'foo' is used uninitialized in this function [-Wuninitialized] resulting from the uninitialised variable, a second diagnostic issued by the part of the compiler responsible for handling operands could say warning: Input operand [foo] associated with uninitialised variable 'foo'.  Initialised to default value (0).  That is what I would >>like<< to see.  If wishes were horses.  As it stands the warnings and errors emitted w.r.t. to inline assembler are already quite spare, and I don't expect this to materialise any time soon.  Bigger fish to fry. 

 

In any case, any serious developer (anyone, I'd say) should treat all warnings as errors.  Most toolchains have a means of enforcing this.  For GCC you should always use -Werror.  Renders this whole argument moot, since the code defect would be immediately identified and expunged.  No more warning, no more default initialisation, correct or no.

 

netizen wrote:

I personally don't have any dog in that fight and it does not bother me as much as it seems to bother you

You've misunderstood me.  I am not bothered by it.  It exists.  I have observed it.  I have an opinion.  It can differ from yours.  I've already moved on.  If you'd like to continue discussing it, I'm game, but since there is no standard to support a (any) toolchain extension, it boils down to opinion.  I think it should be one way, you think it should be another, and the compiler writers haven't provided any documentation, apart from the compiler source code, to support either position.  While source code spelunking for the code responsible sounds like a treat, I'll leave that to someone else.

 

netizen wrote:

So (correct me if I'm wrong) your point is that the compiler might miss the relationship between bin and bcd because it has not established this relationship itself: it was created by us within the assembler template.

Precisely.

 

netizen wrote:

What you're saying is more or less that there can be a mix of both resulting in improper behavior. Is that right?

Use of volatile forces the compiler to treat it as a black box.  Refraining from using volatile may allow the compiler to apply some optimisation strategies to the code contained therein (irrespective of optimisation strategies which may be applied to set-up and tear-down code, and register allocations, governed by the operand constraints).  The trouble is, some (not all) optimisation strategies are predicated on assumptions conferred by the code generation models in concert with which those strategies were developed.  However, the compiler didn't generate the code within the asm statement, you did.  While it >>might<< correctly analyse that code, and >>might<< make correct optimal changes to it i.e. changes which >>do<< improve its performance or reduce its size, and >>don't<< break it (or simply correctly [or incorrectly] determine that it is already optimal and leave it alone), I'm not willing to let it try.  The reason is that, although it might prove itself to me in a given circumstance, with a given asm statement, there is no guarantee that it won't fail tomorrow with a different piece of asm, or the same piece of asm in a different context, or the same piece of asm in the same context compiled with a future version of the toolchain.

 

If I want the compiler's opinion about how to optimally generate code for a given algorithm, I'll code up that algorithm in C, build it, and examine the generated code.  I might learn something.  I often do.  I might notice an opportunity for further improvements in size and/or performance which the compiler (for whatever reason) did not recognise or otherwise chose not express.  I often do.  Regardless, I can be highly (if not actually 100%) confident that the generated code is correct, as the code generation and optimisation worked together (by design) to create the final output.  I take all of that, and hand-craft my own asm solution, pop it into a volatile asm statement, and I know it won't be changed.  If I were satisfied with the compiler's code generation and optimisation, there would have been very little reason to resort to asm (inline or external) in the first place.  Having taken the reigns in this way, there is very little sense in asking the compiler to third-guess me after I've second-guessed it.

 

Ultimately it is a matter of correctly communicating with the compiler your intent as an asm programmer.  The only mechanisms provided for doing so are the operand constraints and register clobbers.  If you have absolute faith that these are 100% sufficient for expressing the intent of the code, then have at it.  I contend however that any such mechanism can never be sufficient for expressing the effects of an arbitrary sequence of instructions.  It can only be sufficient for expressing the non-orthogonal relationships between each opcode and their permissible operands.  It is these relationships which you know as an asm programmer must be maintained, and which you communicate to the compiler via the operand constraints.  The latitude provided by the architecture and instruction set is what permits optimisation in the first place i.e. 'give me any pointer register pair, I don't care which' allows the compiler to manage >>all<< of it's pointer register pairs in a manner benign to your asm statement, but beneficial to surrounding code.

 

netizen wrote:

can you provide any source or evidence that it would indeed behave that way?

Not at the moment.  I know that I have seen it more than once.  If my memory clears I'll post something, but it was very long ago.  In a way this only supports my view.  While a previous version of the compiler may have failed to correctly optimise an asm statement, today's version might perform just fine.  Tomorrow's might break again.

 

netizen wrote:

Thanks for not letting it go

That's me, dog with a bone ;-) ... although I'll likely be somewhat less active here for the next couple of weeks due to other obligations... don't interpret it as a slight.

 

netizen wrote:

Well, if what you were saying was true, what would be the point of having an optional volatile modifier? All inline asm would have to be volatile.

Indeed, volatile is implicit for Basic Asm statements (no operands, no clobbers, no colons), and for Extended Asm statements with no output operands.  It is only Extended Asm statements with output operands which may be declared without volatile and actually be so.  The quoted documentation has examples where this is a Good Thing.  All of the examples are for x86 or PowerPC.  That raises another matter, that of GCC's pedigree.  Although its origin goes back to Vaxen and Sun hardware, for decades it has been developed primarily for PC class hardware, and more recently for ARM.  Those are very different beasts than the typical 8-bit embedded mcu.  Indeed, there is some anecdotal evidence that more recent versions of the compiler generate worse (i.e. larger and/or slower) AVR code than earlier versions, and that this is due to core changes made to benefit those other architectures, changes which only coincidentally but negatively affect small-fry like the AVR backend.  This is likely to continue, as the development effort on the AVR side pales in comparison to the core effort.

 

netizen wrote:

nor dedicate the resources to prove that volatile indeed isn't mandatory in a particular case.

It would be very difficult to prove.  Virtually any change to the code inside the asm statement >>or anywhere else<< in the functional unit could change the behaviour.

 

I just don't see the point, apart from 'I wonder'.  You either write entirely in C, or you write entirely in ASM, or you write in both and glue the two together.  Whether you glue a self-contained ASM piece in the form of a .S file onto a C build (where the glue is the ABI), or glue it as inline assembler inside a C source file (where the glue is the operand constraint mechanism), you are still glueing two different animals together.  Apart from the intellectual exercise of it, what real value is there in permitting the compiler to play Doctor Morreau with the carefully hand-crafted ASM cheetah half of your C-ASM hybrid sloth-cheetah?

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Fri. Mar 18, 2016 - 06:12 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
There may be another way, but I expect it won't be elegant.

Indeed, I wasn't really thinking about including a .S file as if it were C. :-)

More to something like:

__asm__ (".include \"netizenassemblermacros.inc\"  \n\t");

… in lieu of your suggested:

#include "netizenassemblermacros.h"

Actually, both kinds of global includes fail with "unknown opcode « opr »" on my machine, even though they are indeed present in the temp .s file (!). Not sure why…

 

But both techniques do work though when inclusion happens within the bin2bcd16() boundaries, as in:

u24 bin2bcd16(u16 bin) {
//#include "netizenassemblermacros.h"
	u24 bcd = 0;
	u08 tmp = 16;
	u08 *ptr;
	 __asm__ (
          /* opr macro       */ ".include \"netizenassemblermacros.inc\"  \n\t" // 0
	  /* loop counter    */ "mov      __tmp_reg__,  %[tmp]            \n\t" // 1
	  /* clear ptr MSB   */ "clr      %B[ptr]                         \n"   // 1

	                      "bcd1_%=:                                   \n\t"
	  /* shift input     */ "lsl      %A[bin]                         \n\t" // 1
	  /*  value through  */ "rol      %B[bin]                         \n\t" // 1
	  ...

So the generic answer to my question is: yes, we can save ourselves all the inline asm kludge (", \n\t, etc) and still use this beautiful macro from within inline asm.

Now to the question as to why global inclusion does not have the expected result…?

 

joeymorin wrote:
I think it should be one way, you think it should be another, and the compiler writers haven't provided any documentation, apart from the compiler source code, to support either position.

As I said earlier, I do not care about this issue on practical grounds: it can only happen following a bug in my own operands declaration, in which case I don't really care if it falls left or right. On theoretical grounds however, this discussion got me curious. What do you think would be the proper channel to ask for more details? Would avr-gcc-list do?

 

joeymorin wrote:
If I were satisfied with the compiler's code generation and optimisation, there would have been very little reason to resort to asm (inline or external) in the first place. Having taken the reigns in this way, there is very little sense in asking the compiler to third-guess me after I've second-guessed it.

I don't see it that way: we're using ASM in this particular case just because we don't know how (or there's no way to) express it in C, such that GCC compiles it into something as efficient. So we're going around this problem by expressing it directly in ASM. We're not second-guessing anything: we're taking a short-cut.

And we're not asking the compiler to third-guess us either, but to apply other classes of optimization that have nothing to do with the first issue. For example, optimizations that can only be made wrt the global context, something we purposefully ignore when coding this local routine.

There's absolutely no contradiction in wanting both at the same time.

 

joeymorin wrote:
Ultimately it is a matter of correctly communicating with the compiler your intent as an asm programmer. The only mechanisms provided for doing so are the operand constraints and register clobbers. If you have absolute faith that these are 100% sufficient for expressing the intent of the code, then have at it. I contend however that any such mechanism can never be sufficient for expressing the effects of an arbitrary sequence of instructions.

For "an arbitrary sequence of instructions" it sure isn't enough. For one particular sequence however, it can be. That's why volatile exists and why it is optional (as opposed to mandatory/implicit): so you can inform the compiler which it is.

 

Look at the documentation you've quoted and its example of when volatile is required:

The following example demonstrates a case where you need to use the volatile qualifier. It uses the x86 rdtsc instruction, which reads the computer's time-stamp counter. Without the volatile qualifier, the optimizers might assume that the asm block will always return the same value and therefore optimize away the second call.

If the compiler does not know the counter might change for external reasons, it just cannot do the right thing. But this is nothing specific to inline asm: we would witness the very same thing with a plain C variable altered by an ISR (that's why we'd qualify it as volatile too).

What you're afraid of, and using as a basis for making all asm blocks volatile (i.e. messed up optimization), isn't mentioned at all. This does not mean it could not happen: GCC's obviously not perfect and it could. What it means however is that you can't rationally ground you incredulity on this documentation.

 

joeymorin wrote:
That's me, dog with a bone ;-) ... although I'll likely be somewhat less active here for the next couple of weeks due to other obligations... don't interpret it as a slight.

Darmok on the ocean…

You'll be missed. I'm afraid no one is gonna come take your place to mentor me out of these questions.  :-)
Anyway, best of luck in that endeavor!

 

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Sun. Mar 20, 2016 - 09:04 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Note that OP's original question regarded a C function that served as a wrapper for inline assembly.

If separately compiled, all the call-used registers would be "clobbered" by calling the function.

One might as well use a .S file.

Making it static inline might make the inline assembly useful.

The compiler might even use a different set of registers at each invocation.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You're right. I presented it like that because I thought a binary-to-BCD routine made more sense in a library context. However it is static for now in my code (and gets inlined during optimization). And Joey was careful to make his version __always_inline__, which I believe fixes the issue you've raised.

Definitely worth mentioning anyway.

 

The whole thing was a badly disguised exercise to learn more about inline asm (and close topics). And it worked: I pretty much didn't know anything about it a few days ago, I now am deflowered. Hope others can benefit from it as well. :-)

 

ɴᴇᴛɪᴢᴇᴎ

Last Edited: Sun. Mar 20, 2016 - 09:08 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
If separately compiled, all the call-used registers would be "clobbered" by calling the function.

By the way, this also means the initial inline asm clobbering isn't an issue in such a case either (as long as one sticks to call-used registers). So the inline asm solution is pretty much as straightforward as the .S file. It might even make the build process simpler if one isn't already using ASM files.

ɴᴇᴛɪᴢᴇᴎ

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

netizen wrote:
skeeve wrote:
If separately compiled, all the call-used registers would be "clobbered" by calling the function.

By the way, this also means the initial inline asm clobbering isn't an issue in such a case either (as long as one sticks to call-used registers). So the inline asm solution is pretty much as straightforward as the .S file. It might even make the build process simpler if one isn't already using ASM files.

Correct.

Not much simpler though.

With an IDE, your simplicity gain is just that the function can be placed in an existing C file.

That might or might not be a logical thing to do.

If using a hand-made make file,

the difference between adding another .c file and adding the first .S file might be more substantial.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles