Should this be optimized?

Go To Last Post
7 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've been optimizing some code for size and speed, and found that the code below generated a surprisingly large amount of ASM.

xor_block(byte *dst, byte *src1, byte *src2)
{
    for (uint8_t i = 0; i < 8 ; ++i) {
        dst[i] = src1[i] ^ src2[i];
    }
}

The resulting ASM actually adds a 16-bit 'i' to each stored 16-bit address each time through the loop (compiled with -Os) before reading and writing the data.

Changing the code to be more literal results in the small amount of ASM I'd expect, with just X,Y,Z incrementing in a tiny loop.

xor_block(byte *dst, byte *src1, byte *src2)
{
    for (uint8_t i = 0; i < 8; ++i) {
        *dst++ = *src1++ ^ *src2++;
    }
}

Even that can be made a tiny bit smaller with

xor_block(byte *dst, byte *src1, byte *src2)
{
    for (uint8_t i = 8; i > 0; --i) {
        *dst++ = *src1++ ^ *src2++;
    }
}

Given that 'i' is local (and not declared volatile), shouldn't GCC be able to optimize this? I'm not an optimization expert, and I know it's a complex subject, but I was pretty surprised.

Edit: Fixed typos. Also, see ASM output in the followup post.

-Brad

Last Edited: Tue. Apr 22, 2008 - 05:14 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

FYI, this is the resulting code for the first C block in the OP.

push r16
push r17
push r28
push r29
movw r16,r24
movw r28,r22
movw r22,r20
ldi r20,lo8(0)
ldi r21,hi8(0)
.L2:
movw r18,r16
add r18,r20
adc r19,r21
movw r30,r22
add r30,r20
adc r31,r21
movw r26,r28
add r26,r20
adc r27,r21
ld r24,Z
ld r25,X
eor r24,r25
movw r30,r18
st Z,r24
subi r20,lo8(-(1))
sbci r21,hi8(-(1))
cpi r20,8
cpc r21,__zero_reg__
brne .L2
pop r29
pop r28
pop r17
pop r16
ret

This is the resulting code for the last C block.

push r28
push r29
movw r28,r24
movw r26,r22
movw r30,r20
ldi r18,lo8(8)
.L2:
ld r24,Z+
ld r25,X+
eor r24,r25
st Y+,r24
subi r18,lo8(-(-1))
brne .L2
pop r29
pop r28
ret

-Brad

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

After reading another post complaining about avr-gcc, I feel compelled to add that I think its a great piece of work and is much appreciated.

Personally, I'm posting these comments for discussion and because the topic interests me. They are not meant to be complaints. I hope I'll have enough time at some point to contribute to avr-gcc myself. Maybe when the kids start school in a few years ;)

-Brad

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can decrease the code size and the number of used registers if you pack your arrays into a struct.

typedef struct
{
    uint8_t dst[8];
    uint8_t src1[8];
    uint8_t src2[8];
}data_t;

void xor_block (data_t *data)
{
    for (uint8_t count = 0; count < 8; count++) 
    {
        data->dst[count] = data->src1[count] ^ data->src2[count];
    }
}

This enables the compiler to access all arrays with a single pointer.

movw    r30, r16
ldd     r24, Z+16
ldd     r25, Z+8
eor     r24, r25
st      Z+, r24
cp      r30, r18
cpc     r31, r19
brne    .-14

Regards
Sebastian

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

S-Sohn wrote:
You can decrease the code size and the number of used registers if you pack your arrays into a struct.

typedef struct
{
    uint8_t dst[8];
    uint8_t src1[8];
    uint8_t src2[8];
}data_t;

void xor_block (data_t *data)
{
    for (uint8_t count = 0; count < 8; count++) 
    {
        data->dst[count] = data->src1[count] ^ data->src2[count];
    }
}

That changes the semantics considerably.
OTOH, considerations like that are a reason to put one's globals into a struct.
OYAH, I had thought that reductions like those suggested by the OP were SOP for optimizers.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Unfortunately, the data locations of those three parameter are different depending on the caller. So a struct like that wouldn't work without memcpy'ing in the callers. Good idea generally though.

-Brad

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

schickb wrote:
Unfortunately, the data locations of those three parameter are different depending on the caller. So a struct like that wouldn't work without memcpy'ing in the callers. Good idea generally though.
You might try static inline.
It could be a win if differences among the
arguments could be determined at compile time.

Iluvatar is the better part of Valar.