Inline ASM giving me grief - need a hand.


Hi all,

 

I'm trying to replace a  PORT |= BITMASK / PORT &= ~BITMASK  type of line with pure assembler sbi() and cbi() calls.

 

My problem is that I can't seem to get it to work using variables, but it does work with constants. For example, THIS works:

 

__asm__ __volatile__ (
    "sbi %0,%1"
    :
    :
    "I" (0x04),
    "I" (7)
);

 

But THIS fails with an error message: error: impossible constraint in 'asm'

 

volatile uint8_t *DDR;
x = digitalPinToPort (13);
DDR = portModeRegister(x);
__asm__ __volatile__ (
    "sbi %0,%1"
    :
    :
    "I" (DDR),
    "I" (7)
);

 

Likewise, if I use a variable in place of the "7" (i.e. bit 7), it fails with the same error (impossible constraint).

 

Anyone have any idea what I'm doing wrong?  I'd appreciate some help. Thanks!

 

Gentlemen may prefer Blondes, but Real Men prefer Redheads!


What are you trying to gain here?

 

You'd be better off learning how to massage the C compiler to spit out optimized code, and how to analyze the code it spits out.

 

Krupski wrote:
I'm trying to replace a PORT |= BITMASK / PORT &= ~BITMASK
These are often compiled into single sbi or cbi instructions by GCC, so there is no need for hand assembly for this. Experiment a bit with the optimisation settings of the compiler and make it generate .LSS files (assembly listings) which you can analyze.

 

A long time ago I came to the conclusion that the C compiler is so good that it is hardly ever worth doing anything in assembly that the C compiler can do for you.

 

There are only a few areas left where assembly is still useful:

- Task switching.

- Naked interrupts.

- Very crafty optimized algorithms ( Look in util/crc.h).

- Writing libraries with very crafty optimized algorithms: (libm)

-  ...

Doing magic with a USD 7 Logic Analyser: https://www.avrfreaks.net/comment/2421756#comment-2421756

Bunch of old projects with AVR's: http://www.hoevendesign.com

Last Edited: Sun. Mar 4, 2018 - 09:15 PM

You can't do this, anyway. sbi takes only constant expression arguments (stuff that can be calculated at compile time). This means constexpr variables (always valid) and const variables (only valid if they are constant expressions).

 

You are using, for example, digitalPinToPort, which is basically an array in flash, so it has to be read at run time, and therefore any value that results from its use can't go into sbi.

definition in Arduino:

const uint8_t PROGMEM digital_pin_to_port_PGM[] = {
        PD, /* 0 */
        PD,
        PD,
        PD,
        PD,
        PD,
        PD,
        PD,
        PB, /* 8 */
        PB,
        PB,
        PB,
        PB,
        PB,
        PC, /* 14 */
        PC,
        PC,
        PC,
        PC,
        PC,
};

To be usable with sbi, it would have to be something like:

constexpr uint8_t digital_pin_to_port_PGM[] = {
        PD, /* 0 */
        PD,
        PD,
        PD,
        PD,
        PD,
        PD,
        PD,
        PB, /* 8 */
        PB,
        PB,
        PB,
        PB,
        PB,
        PC, /* 14 */
        PC,
        PC,
        PC,
        PC,
        PC,
};

 


I can't seem to get it to work using variables

  :

THIS fails with an error message: error: impossible constraint in 'asm'

The compiler is exactly correct.  The sbi/cbi instructions only work with constants as the port value (and bit value)...  in/out (another possible optimization) also require constants.
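To make that concrete, here is a minimal sketch, assuming avr-gcc with avr-libc and a part where DDRB sits in the low I/O range that sbi/cbi can reach (I/O addresses 0 to 31). LED_DDR, LED_BIT and the two function names are hypothetical, and fake_ddrb is only a host stand-in so the file also builds off-target. Note that the asm operand goes through _SFR_IO_ADDR(): sbi wants the I/O address as a compile-time constant, not a pointer to the register, which is why a pointer variable can never satisfy the "I" constraint.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#define LED_DDR DDRB                 /* hypothetical choice: a pin on port B */
#else
static volatile uint8_t fake_ddrb;   /* host stand-in, not a real register */
#define LED_DDR fake_ddrb
#endif

#define LED_BIT 5                    /* hypothetical bit, known at compile time */

/* Plain C. With the optimizer on (-Os, -O1 or higher) avr-gcc normally
   emits a single sbi here, because register and bit are both constants. */
static inline void led_ddr_set(void)
{
    LED_DDR |= (uint8_t)(1u << LED_BIT);
}

/* The same thing by hand. Both operands are constant expressions,
   so the "I" constraint can be satisfied. */
static inline void led_ddr_set_asm(void)
{
#ifdef __AVR__
    __asm__ __volatile__ (
        "sbi %0,%1"
        :
        : "I" (_SFR_IO_ADDR(LED_DDR)),
          "I" (LED_BIT)
    );
#else
    led_ddr_set();                   /* host fallback: same effect on the fake register */
#endif
}
```

If the pin is only known at run time, neither form can become an sbi; the compiler has to fall back to ld/or/st through a pointer.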

 

 

I came to the conclusion that the C compiler is so good that it is hardly ever worth doing anything in assembly that the C compiler can do for you.

Normally, I'd agree.  But it always pays to check...   In this case, the *PORTP |= BITMASK / *PORTP &= ~BITMASK  sequence compiles to:

    volatile uint8_t * port = get_port();
   0:	0e 94 00 00 	call	get_port
   4:	fc 01       	movw	r30, r24
    *port |= 8;
   6:	80 81       	ld	r24, Z
   8:	88 60       	ori	r24, 0x08	; 8
   a:	80 83       	st	Z, r24
    *port &= ~8;
   c:	80 81       	ld	r24, Z
   e:	87 7f       	andi	r24, 0xF7	; 247
  10:	80 83       	st	Z, r24

 

There is a TINY bit of room for improvement.   If you're in the depths of your high-speed IO code, and have already handled your interruptibility issues (if any), then the port register isn't really fully "volatile" any more, and you COULD leave out the second load from the port, since you have confidence that it hasn't changed since the last time you read the port.  That saves one instruction and two cycles.
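A rough sketch of that idea in C, not a recommendation: pulse_cached is a hypothetical name, and the shortcut is only valid under the assumption stated above, namely that nothing else (an ISR, or the hardware itself) changes the register between the single read and the second write.

```c
#include <stdint.h>

/* Drive the bits in `mask` high and then low again, reading the port
   register only once. The value is kept in a local, so the compiler
   does not need a second load from the volatile register. */
static inline void pulse_cached(volatile uint8_t *port, uint8_t mask)
{
    uint8_t v = *port;        /* the only read of the register */

    v |= mask;
    *port = v;                /* first write: bits set */

    v &= (uint8_t)~mask;
    *port = v;                /* second write: bits cleared, no reload */
}
```

The end state of the register is the same as with the two read-modify-write statements; only the second load is gone.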

 

The Arduino digitalWrite() function gets a lot of complaints just for being ~80x slower (and bigger) than sbi/cbi.  But a lot of that is because sbi/cbi is essentially a degenerate case, and digitalWrite() has a lot more stuff that really NEEDS TO BE DONE.   Now, when you have a sequence of writes to the same pin, you certainly do not need to do a lot of that every time, and the speedup you can get by doing your own port lookup ONCE is quite significant.   Going from "*port |= 8" to asm code - NOT significant.
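Doing the lookup once could look something like the sketch below. lookup_port(), lookup_mask(), pulse_pin() and fake_portb are hypothetical stand-ins so the example builds without the Arduino headers; in a real sketch the two lookups would be the core's own portOutputRegister(digitalPinToPort(pin)) and digitalPinToBitMask(pin). The stand-in mask only models Uno pins 8 to 13 (PB0 to PB5).

```c
#include <stdint.h>

static volatile uint8_t fake_portb;   /* stand-in for the real PORTB */

/* Hypothetical replacement for portOutputRegister(digitalPinToPort(pin)). */
static volatile uint8_t *lookup_port(uint8_t pin)
{
    (void)pin;                        /* every modelled pin lives on port B */
    return &fake_portb;
}

/* Hypothetical replacement for digitalPinToBitMask(pin), pins 8..13 only. */
static uint8_t lookup_mask(uint8_t pin)
{
    return (uint8_t)(1u << (pin - 8u));
}

/* Pulse `pin` high and low `count` times. The slow table lookups are
   done once, before the loop, instead of on every write. */
static void pulse_pin(uint8_t pin, uint16_t count)
{
    volatile uint8_t *port = lookup_port(pin);
    uint8_t mask = lookup_mask(pin);

    while (count--) {
        *port |= mask;                /* ld / or / st through the pointer */
        *port &= (uint8_t)~mask;      /* ld / and / st through the pointer */
    }
}
```

Inside the loop each write is just the three-instruction read-modify-write shown in the listing above, which is where nearly all of the gain over digitalWrite() comes from.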

 

Finally, it might be worth checking that you DO get "pretty good" code like I've shown.  My initial experiments had "port" as a global variable, which apparently caused it to be reloaded each time it was used (treating both the pointer AND the thing it was pointing to as "volatile", or something.)   If the code is inexplicably "bad", it's worth finding the explanation.

 

Last Edited: Sun. Mar 4, 2018 - 10:44 PM

Paulvdh wrote:

What are you trying to gain here?

 

You'd be better off learning how to massage the C compiler to spit out optimized code, and how to analyze the code it spits out.

 

 

What I want to gain is about a 3X improvement in speed.  A test loop with the conventional PORT |= BIT stuff takes around 6000 ms to run whereas the same loop with sbi() sbi() takes around 2000 ms.

Gentlemen may prefer Blondes, but Real Men prefer Redheads!


westfw wrote:

I can't seem to get it to work using variables

  :

THIS fails with an error message: error: impossible constraint in 'asm'

the compiler is exactly correct.  The sbi/cbi instructions only work with constants as the port value (and bit value)...  in/out (another possible optimization) also require constants.

 

Well that explains it all, doesn't it?

 

Darn.

Gentlemen may prefer Blondes, but Real Men prefer Redheads!


Krupski wrote:
A test loop with the conventional PORT |= BIT stuff takes around 6000 ms to run whereas the same loop with sbi() sbi() takes around 2000 ms.

??? Where BIT resolves to a constant expression?  Then show the generated code.

 

Also, tell a bit more about the important loop and what it does/what it needs to do.  Can the PIN-write toggle feature, for example, be used effectively?  If it is a single pin, perhaps a timer in CTC mode?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Krupski wrote:
What I want to gain is about a 3X improvement in speed.  A test loop with the conventional PORT |= BIT stuff takes around 6000 ms to run whereas the same loop with sbi() sbi() takes around 2000 ms.
How did you do that test?

When possible, PORT |= BIT normally becomes an sbi.

I think it even happens with -O1.

I recommend against -O0.

 

LD, OR, ST takes five cycles.

SBI takes two.

That is not a 3:1 ratio,

especially after the common loop stuff is added in.
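The arithmetic can be written out as a small worked example. This is only a sketch under stated assumptions: cycle counts are those of a classic megaAVR core (ATmega16 class), and the loop is assumed to do one bit set and one bit clear per pass with a 16-bit down-counter (sbiw plus a taken brne). The enum and function names are made up for illustration.

```c
#include <stdint.h>

/* Per-instruction cycle counts on a classic megaAVR core. */
enum {
    CYC_SBI = 2, CYC_CBI = 2,
    CYC_LD = 2, CYC_ORI = 1, CYC_ANDI = 1, CYC_ST = 2,
    CYC_SBIW = 2, CYC_BRNE_TAKEN = 2
};

/* Loop overhead shared by both variants: decrement and branch back. */
static unsigned loop_overhead(void)
{
    return CYC_SBIW + CYC_BRNE_TAKEN;
}

/* One pass using sbi + cbi. */
static unsigned loop_cycles_sbi(void)
{
    return CYC_SBI + CYC_CBI + loop_overhead();
}

/* One pass using ld/ori/st followed by ld/andi/st. */
static unsigned loop_cycles_rmw(void)
{
    return (CYC_LD + CYC_ORI + CYC_ST)
         + (CYC_LD + CYC_ANDI + CYC_ST)
         + loop_overhead();
}
```

Under those assumptions one pass costs 8 cycles with sbi/cbi and 14 cycles with the read-modify-write form, a ratio of 1.75 rather than 3.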

 

The PORT seems to be a DDRx register.

Why OP needs to change a DDRx register quickly, I do not know.

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods


Krupski wrote:
What I want to gain is about a 3X improvement in speed. A test loop with the conventional PORT |= BIT stuff takes around 6000 ms to run whereas the same loop with sbi() sbi() takes around 2000 ms.
But your test was clearly flawed! If I write:

C:\SysGCC\avr\bin>type avr.c
#include <avr/io.h>

int main(void) {
        uint16_t n;

        for (n=10000; n; n--) {
                PORTB |= (1 << 3);
                PORTB &= ~(1 << 5);
        }
        while(1) {
        }
}

C:\SysGCC\avr\bin>avr-gcc -mmcu=atmega16 -Os -g avr.c -o avr.elf

C:\SysGCC\avr\bin>avr-objdump -S avr.elf | tail -n 28
0000006c <main>:
#include <avr/io.h>

int main(void) {
  6c:   80 e1           ldi     r24, 0x10       ; 16
  6e:   97 e2           ldi     r25, 0x27       ; 39
        uint16_t n;

        for (n=10000; n; n--) {
                PORTB |= (1 << 3);
  70:   c3 9a           sbi     0x18, 3 ; 24
                PORTB &= ~(1 << 5);
  72:   c5 98           cbi     0x18, 5 ; 24
  74:   01 97           sbiw    r24, 0x01       ; 1
#include <avr/io.h>

int main(void) {
        uint16_t n;

        for (n=10000; n; n--) {
  76:   e1 f7           brne    .-8             ; 0x70 <main+0x4>
  78:   ff cf           rjmp    .-2             ; 0x78 <main+0xc>

0000007a <_exit>:
  7a:   f8 94           cli

0000007c <__stop_program>:
  7c:   ff cf           rjmp    .-2             ; 0x7c <__stop_program>

The setting of a bit and the clearing of a bit generate the single opcodes SBI and CBI as expected. Everything else is involved in the counting of the for() loop.

 

About the only way you could get the compiler to not generate the most efficient SBI/CBI here would be if you turned the optimiser off:

C:\SysGCC\avr\bin>avr-gcc -mmcu=atmega16 -O0 -g avr.c -o avr.elf

C:\SysGCC\avr\bin>avr-objdump -S avr.elf | tail -n 65

0000006c <main>:
#include <avr/io.h>

int main(void) {
  6c:   cf 93           push    r28
  6e:   df 93           push    r29
  70:   00 d0           rcall   .+0             ; 0x72 <main+0x6>
  72:   cd b7           in      r28, 0x3d       ; 61
  74:   de b7           in      r29, 0x3e       ; 62
        uint16_t n;

        for (n=10000; n; n--) {
  76:   80 e1           ldi     r24, 0x10       ; 16
  78:   97 e2           ldi     r25, 0x27       ; 39
  7a:   9a 83           std     Y+2, r25        ; 0x02
  7c:   89 83           std     Y+1, r24        ; 0x01
  7e:   17 c0           rjmp    .+46            ; 0xae <main+0x42>
                PORTB |= (1 << 3);
  80:   88 e3           ldi     r24, 0x38       ; 56
  82:   90 e0           ldi     r25, 0x00       ; 0
  84:   28 e3           ldi     r18, 0x38       ; 56
  86:   30 e0           ldi     r19, 0x00       ; 0
  88:   f9 01           movw    r30, r18
  8a:   20 81           ld      r18, Z
  8c:   28 60           ori     r18, 0x08       ; 8
  8e:   fc 01           movw    r30, r24
  90:   20 83           st      Z, r18
                PORTB &= ~(1 << 5);
  92:   88 e3           ldi     r24, 0x38       ; 56
  94:   90 e0           ldi     r25, 0x00       ; 0
  96:   28 e3           ldi     r18, 0x38       ; 56
  98:   30 e0           ldi     r19, 0x00       ; 0
  9a:   f9 01           movw    r30, r18
  9c:   20 81           ld      r18, Z
  9e:   2f 7d           andi    r18, 0xDF       ; 223
  a0:   fc 01           movw    r30, r24
  a2:   20 83           st      Z, r18
#include <avr/io.h>

int main(void) {
        uint16_t n;

        for (n=10000; n; n--) {
  a4:   89 81           ldd     r24, Y+1        ; 0x01
  a6:   9a 81           ldd     r25, Y+2        ; 0x02
  a8:   01 97           sbiw    r24, 0x01       ; 1
  aa:   9a 83           std     Y+2, r25        ; 0x02
  ac:   89 83           std     Y+1, r24        ; 0x01
  ae:   89 81           ldd     r24, Y+1        ; 0x01
  b0:   9a 81           ldd     r25, Y+2        ; 0x02
  b2:   89 2b           or      r24, r25
  b4:   29 f7           brne    .-54            ; 0x80 <main+0x14>
                PORTB |= (1 << 3);
                PORTB &= ~(1 << 5);
        }
        while(1) {
        }
  b6:   ff cf           rjmp    .-2             ; 0xb6 <main+0x4a>

000000b8 <_exit>:
  b8:   f8 94           cli

000000ba <__stop_program>:
  ba:   ff cf           rjmp    .-2             ; 0xba <__stop_program>

But no one in their right mind would ever enable this compiler test mode!!