Help working with algorithm,

Go To Last Post
16 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Not following this procedure here at all.

 

(Position A to Position C):
Perform the Shift Right and Xor operations as shown in the flow diagram 64 times.
Each time use the next sequential ChallengeBit for the (A1 Xor NextChallengeBit) operation. Example:
Fist time through the loop: B24 = A1 Xor CB1

Second time through the loop: B24 = A1 Xor CB2

Third time through the loop: B24 = A1 Xor CB3

Forth time through the loop: B24 = A1 Xor CB4
-
-
64th time through the loop: B24 = A1 Xor CB64

 

-------------------------------------------------------------

CB is random data 64 bits in length

B24 is C541A9

A1 is the first bit of B24

-----------------------------------------------------------

 

So if I do that I only change the last bit of B24 and that seem very pointless. This procedure is direction for decryption. Attached is a a chart depicting the above.

 

This is the way I read it

for (i=0;i<64;i++)
    {
        output = (init & 1)  ^  (CHALLENGE  & 1);//take the first bit from CHALLENGE  and OR to first bit in init
        CHALLENGE  >>= 1;//bit is done go to next
    }

 

Though that can not be what is being asked for here. I'll end up with a 1 or a 0 at the MSB in init,  I'd assume it would change more then 1 bit? Anyone see what I'm missing here?

 

Maybe something like this is implied?

   for (i=0;i<64;i++)
    {
        
        for (j=0;j<24;j++)
        {
            output |= (init & 1)  ^  (CHALLANGE & 1);
            init >>= 1;
            output <<= 1;
        }
        CHALLANGE >>= 1;
        init = output;
    }

Attachment(s): 

Last Edited: Thu. Apr 21, 2016 - 07:07 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It seems like you're ignoring most of the arrows and xor operations to me...this is the diagram as I read it:

 

Load init value 0xC541A9 into buffer

 

Repeat 64 times:

1) Xor low bit of buffer with low bit of challenge, save result as tempbit.

2) Right shift buffer.

3) Place tempbit in high bit of buffer. (covers Xor statement in blue)

4) If tempbit is one, xor buffer with 0x109028 and store result to buffer, if tempbit is zero do nothing (covers the 5 Xor statements above C24:1 boxes)

5) Right shift challenge. (moves CB2, CB3... to low bit)

 

And then rearrange buffer nibbles to form response bytes.

 

Obviously some of that shouldn't be done with shifts in an 8 bit MCU, but I'm going for the shortest explanation without trying to optimize anything.

Last Edited: Thu. Apr 21, 2016 - 08:47 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I got bored.  I realize you didn't specify 8 bit mcu now that I look at the forum it's in, but what the hell, here's an 8 bit implementation:

#include <avr/io.h>

/**********************
Right shift an array one place.
arr is the array.
size is the length of the array
**********************/
void rShiftArr(uint8_t * arr, uint8_t size)
{
	size--;
	for(; size > 0; size--)
	{
		arr[size] >>= 1;
		if (arr[size-1] & 0x01)
			arr[size] |= 0x80;
	}
	arr[0] >>= 1;
}

/***********************
challenge must be provided.  It's 8 bytes in length, msB at index 0, lsB at index 7.
response must only be allocated.  3 bytes in length, value is ignored

The result is left in response.  msB at index 0, lsB at index 2.
Additionally, challenge will be zeroed out.
***********************/
void createResponse(uint8_t * challenge, uint8_t * response)
{
	uint8_t temp;
	
	response[0] = 0xC5;
	response[1] = 0x41;
	response[2] = 0xA9;
	
	for(uint8_t byteIndex = 7; byteIndex < 8 ; byteIndex--)  //8 bits per byte
	{
		for(uint8_t bitCounter = 0; bitCounter < 8; bitCounter++) //from lsb to msb of byte
		{
			temp = 0x01 & (challenge[byteIndex] ^ response[2]);  //store lsb XOR
			rShiftArr(response, 3);  //right shift response
			if(temp)  //if buffer lsb XOR challenge lsb is 1, set msb of response and XOR response with 0x109028
			{
				response[0] |= 0x80;
				response[0] ^= 0x10;
				response[1] ^= 0x90;
				response[2] ^= 0x28;
			}
			challenge[byteIndex] >>= 1;  //next lsb of challenge
		}
	}
	
	//reusing temp to juggle nibbles
	temp = (response[2] & 0xF0) >> 4;
	temp |= response[1] << 4; // temp C12:9 C8:5

	response[1] &= 0xF0;
	response[1] |= response[0] >> 4; // response[1] C16:13 C24:21

	response[2] <<= 4;
	response[2] |= 0x0F & response[0]; //response[2] C4:1 C20:C17

	response[0] = temp; 	//response[0] C12:9 C8:5
}


int main(void)
{
    while(1)
    {
		uint8_t challenge[] = {0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99, 0x88};  //0xFFEEDDCCBBAA9988
		uint8_t response[3];
		createResponse(challenge, response);	
    }
}

The result with that challenge is 0xEE61CE

Last Edited: Fri. Apr 22, 2016 - 01:42 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Rezer wrote:

I got bored.  I realize you didn't specify 8 bit mcu now that I look at the forum it's in, but what the hell, here's an 8 bit implementation:

#include <avr/io.h>

/**********************
Right shift an array one place.
arr is the array.
size is the length of the array
**********************/
void rShiftArr(uint8_t * arr, uint8_t size)
{
	size--;
	for(; size > 0; size--)
	{
		arr[size] >>= 1;
		if (arr[size-1] & 0x01)
			arr[size] |= 0x80;
	}
	arr[0] >>= 1;
}

/***********************
challenge must be provided.  It's 8 bytes in length, msB at index 0, lsB at index 7.
response must only be allocated.  3 bytes in length, value is ignored

The result is left in response.  msB at index 0, lsB at index 2.
Additionally, challenge will be zeroed out.
***********************/
void createResponse(uint8_t * challenge, uint8_t * response)
{
	uint8_t temp;

	response[0] = 0xC5;
	response[1] = 0x41;
	response[2] = 0xA9;

	for(uint8_t byteIndex = 7; byteIndex < 8 ; byteIndex--)  //8 bits per byte
	{
		for(uint8_t bitCounter = 0; bitCounter < 8; bitCounter++) //from lsb to msb of byte
		{
			temp = 0x01 & (challenge[byteIndex] ^ response[2]);  //store lsb XOR
			rShiftArr(response, 3);  //right shift response
			if(temp)  //if buffer lsb XOR challenge lsb is 1, set msb of response and XOR response with 0x109028
			{
				response[0] |= 0x80;
				response[0] ^= 0x10;
				response[1] ^= 0x90;
				response[2] ^= 0x28;
			}
			challenge[byteIndex] >>= 1;  //next lsb of challenge
		}
	}

	//reusing temp to juggle nibbles
	temp = (response[2] & 0xF0) >> 4;
	temp |= response[1] << 4; // temp C12:9 C8:5

	response[1] &= 0xF0;
	response[1] |= response[0] >> 4; // response[1] C16:13 C24:21

	response[2] <<= 4;
	response[2] |= 0x0F & response[0]; //response[2] C4:1 C20:C17

	response[0] = temp; 	//response[0] C12:9 C8:5
}

int main(void)
{
    while(1)
    {
		uint8_t challenge[] = {0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99, 0x88};  //0xFFEEDDCCBBAA9988
		uint8_t response[3];
		createResponse(challenge, response);
    }
}

The result with that challenge is 0xEE61CE

 

 

That explains the troubles I'm having then so now I need to prepare the 8 bits properly (see new image). To make the 64 bit storage I was trying this but that is not going to work with an 8 bit MCU.

 

 

        uint8_t response[3];
        uint8_t challenge[] = {0, 0, 0, 0, 0, 0, 0, 0};
        uint8_t FIXED[] = {0x41,0xaa,0x42,0xbb,0x43};//f1,f2,f3,f4,f5
        uint8_t KEY[] = {0x1a,0xf9,0x64};//s1,s2,s3
        
        //I was using long long for challenge , now I have to stuff them in the right place for the covered code.
        challenge |= (FIXED[4]>>7)<<63;//load MSB-f5 to 64
        challenge |= (FIXED[4] & 1)<<56;//load LSB-f5 to 57
        challenge |= (FIXED[3]>>7)<<55;//load MSB-f5 to 56
        challenge |= (KEY[1]  & 1)<<8;//load LSB-s2 to 9
        challenge |= (KEY[0] >>7)<<7;//load MSB-s1 to 8
        challenge |= (KEY[0]  & 1);//load LSB-s1 to 1

 

so I think

    challenge[0] |= (FIXED[4] & 0x80);//load MSB-f5 to 64
    challenge[0] |= (FIXED[4] & 1);//load LSB-f5 to 57
    challenge[1] |= (FIXED[3] & 0x80);//load MSB-f4 to 56
    challenge[6] |= (KEY[1]  & 1);//load LSB-s2 to 9
    challenge[7] |= (KEY[0] & 0x80);//load MSB-s1 to 8
    challenge[7] |= (KEY[0]  & 1);//load LSB-s1 to 1

then

 createResponse(challenge, response);

 

but I should get 0xb6 f4 cf

 

 

 

 

 

 

Attachment(s): 

Last Edited: Fri. Apr 22, 2016 - 06:36 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The compiler is avr-gcc. It supports uint64_t (badly - but what do you expect on an 8bit micro where just one 64 bit variable occupies 8 of the 32 registers!).

 

The point here (kind of) is that if you use:

#include <stdint.h>

and the standard types like uint64_t then this code will behave identically on whatever architecture you build it on. It won't be a case of "PC version" and "AVR version" as it is the SAME version. If, however, you write in terms of long long, long, int, short, char then you will hit issues when you try to use the same source on different architectures. This is the very reason <stdint.h> exists. It means you can develop and test numeric processing algorithms in the comfort of a PC .exe then just rebuild for the AVR later. The downside, as I say, is that while a uint64_t is one register on modern Pentiums and AMD64's it is 8 registers on an AVR. So while the code will have the same result on the AVR the route to the solution may be more tortuous. In which case you may then want to spend a little time on optimising the code (which can likely mostly be done in PC comfort) to get the same result but perhaps not using such wide types except when absolutely necessary (don't be tempted to just throw uint64_t's at everything!).

 

Oh and you don't need a specific include of it if you use <avr/io.h> as that already includes it early on anyway.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Your logic looks fine to me for the modified version...the reason your original method wouldnt work I *think* is due to the way type promotion works with bit shifts.  You're shifting 8 bit values, and by default they'll only be promoted to int values which are 16 bits wide.  If you wanted it to work with a uint64_t you'd have to promote each variable to uint64_t before doing the shift, otherwise you shift the data right off the end of your variable.  Also, unless GCC got better at optimizing bitshifts since last I checked, shifting 64 bit variables large numbers of times is very expensive.

 

And clawson, I have no doubt that you're correct about using uint64 types, but I still think his original method is flawed for the above reason.  Disclaimer:  I haven't tested it cheeky

Last Edited: Fri. Apr 22, 2016 - 09:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok, I just ran a few tests:

 

With the following declarations:

		uint8_t in = 0x01;
		uint64_t out;

This produces 0x0000000000000000 in the out variable:

		out = in << 56;

While this produces the expected 0x0100000000000000:

		out = (uint64_t)in << 56;

However, it takes 133 cycles at all optimization levels even though in essence it's only copying a single byte.  Why are optimizations so terrible on bitshifts?  Anyway...

 

This results in 0x0000000000000000:

		out = in << 16;

And this 0xFFFFFFFFFFFF8000:

		out = in << 15;

 

So there you go, it promotes to a signed int16 and no further without an explicit cast.

Last Edited: Fri. Apr 22, 2016 - 10:20 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Rezer, thx.

 

I guess then we both are misunderstanding the requirements. I'm not arriving at these answers

 

test operations.

 

Fixed bytes (f1-f5) 14,aa,42,bb,23

(k1-k3) 1a,f9,64 ---> should yield  b6,f4,cf

(k1-k3) 4e,96,8a---> should yield  81,4b,a6

(k1-k3) 4e,96,8b ---> should yield  c0,25,eb

 

Fixed bytes (f1-f5) 1a,2b,3c,4d,5e

(k1-k3) 1a,f9,64 ---> should yield  a1,1c,a8

(k1-k3) 4e,96,8a---> should yield  96,a3,c1

(k1-k3) 4e,96,8b---> should yield  d7,cd,8c

 

 

 

Last Edited: Mon. Apr 25, 2016 - 12:28 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry, I didn't look hard enough at that second table you posted vs your implementation.  I thought it was mixing bits around with existing challenge bytes, but it looks like F1-F5 and S1-S3 ARE the challenge bytes and that table is dashed because whoever made it hates being easily understood  The dashes I was reading as "don't modify" are really standins for bit 7 of F5, bit 6 of F5, bit 5 of F5, etc.  You just need to do this:

uint8_t challenge[] = {F5, F4, F3, F2, F1, S3, S2, S1};

and then it should work fine.  I say "should" because I ran it on your first 3 test samples and it failed miserably, while the second 3 worked exactly as expected.  I think you have a typo in the fixed bytes in the first set of sample data.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

so should it not be , mix bits then challenge?

 

unsigned char FIXED[5] = {0x43,0xbb,0x42,0xaa,0x41};//tried changing this from f1-f5 to f5-f1
unsigned char KEY[3]={0x1a,0xf9,0x64};//tried both directions here.

 

//mix up the bits.
    challenge[0] |= (FIXED[4] & 0x80);//load MSB-f5 to 64
    challenge[0] |= (FIXED[4] & 1);//load LSB-f5 to 57
    challenge[1] |= (FIXED[3] & 0x80);//load MSB-f4 to 56
    challenge[6] |= (KEY[1]  & 1);//load LSB-s2 to 9
    challenge[7] |= (KEY[0] & 0x80);//load MSB-s1 to 8
    challenge[7] |= (KEY[0]  & 1);//load LSB-s1 to 1
   

createResponse(challenge, response);//now challange

 

is that not how you read it?

 

 

Last Edited: Mon. Apr 25, 2016 - 06:38 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Nope, like I said that image you posted, "c.png", isn't very clear.  There's no mixing of bits at all, simply loading F1-F5 and S1-S3 as the challenge bytes.  Loading the challenge array like I posted previously works fine for all 6 challenge response samples, including the first 3 now that I realize your fixed bytes were meant to be F1-F5 = 0x41, 0xaa, 0x42, 0xbb, 0x43.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wow, that's just friggen nutz. The doc is for people on your team guys LOL. ok Rezer thx so much for all your help. I would never have gotten thru that doc on my own at my level.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Rezer wrote:
However, it takes 133 cycles at all optimization levels even though in essence it's only copying a single byte. Why are optimizations so terrible on bitshifts?

Such questions should never remain unanswered:

$ cat avr.c
#include <avr/io.h>

uint64_t out;

int main(void) {
	out = (uint64_t)PINB << 56;
}

in this I use PINB as a useful input as it just happens to be volatile so the optimiser has no choice to precalculate anything here. When built I get this for main():

	out = (uint64_t)PINB << 56;
  7e:	26 b3       	in	r18, 0x16	; 22
  80:	30 e0       	ldi	r19, 0x00	; 0
  82:	40 e0       	ldi	r20, 0x00	; 0
  84:	50 e0       	ldi	r21, 0x00	; 0
  86:	60 e0       	ldi	r22, 0x00	; 0
  88:	70 e0       	ldi	r23, 0x00	; 0
  8a:	80 e0       	ldi	r24, 0x00	; 0
  8c:	90 e0       	ldi	r25, 0x00	; 0
  8e:	08 e3       	ldi	r16, 0x38	; 56
  90:	0e 94 5c 00 	call	0xb8	; 0xb8 <__ashldi3>
  94:	20 93 60 00 	sts	0x0060, r18
  98:	30 93 61 00 	sts	0x0061, r19
  9c:	40 93 62 00 	sts	0x0062, r20
  a0:	50 93 63 00 	sts	0x0063, r21
  a4:	60 93 64 00 	sts	0x0064, r22
  a8:	70 93 65 00 	sts	0x0065, r23
  ac:	80 93 66 00 	sts	0x0066, r24
  b0:	90 93 67 00 	sts	0x0067, r25

To make the CALL there this could not really be any simpler. It loads PINB into the lowest register of a group of 8 (R25:R18) then loads 56 into R16 and makes the CALL. On the way back it just stores the 8 registers to "out" in RAM (as it must because I made it a global). So there's nothing there that can be optimised away. The actual 56 bit shift of the uint64_t is done by the generic routine:

000000b8 <__ashldi3>:
  b8:	0f 93       	push	r16
  ba:	df 93       	push	r29
  bc:	cf 93       	push	r28
  be:	cd b7       	in	r28, 0x3d	; 61
  c0:	de b7       	in	r29, 0x3e	; 62
  c2:	60 97       	sbiw	r28, 0x10	; 16
  c4:	0f b6       	in	r0, 0x3f	; 63
  c6:	f8 94       	cli
  c8:	de bf       	out	0x3e, r29	; 62
  ca:	0f be       	out	0x3f, r0	; 63
  cc:	cd bf       	out	0x3d, r28	; 61
  ce:	00 23       	and	r16, r16
  d0:	09 f4       	brne	.+2      	; 0xd4 <__ashldi3+0x1c>
  d2:	59 c0       	rjmp	.+178    	; 0x186 <__ashldi3+0xce>
  d4:	29 83       	std	Y+1, r18	; 0x01
  d6:	3a 83       	std	Y+2, r19	; 0x02
  d8:	4b 83       	std	Y+3, r20	; 0x03
  da:	5c 83       	std	Y+4, r21	; 0x04
  dc:	6d 83       	std	Y+5, r22	; 0x05
  de:	7e 83       	std	Y+6, r23	; 0x06
  e0:	8f 83       	std	Y+7, r24	; 0x07
  e2:	98 87       	std	Y+8, r25	; 0x08
  e4:	e0 e2       	ldi	r30, 0x20	; 32
  e6:	e0 1b       	sub	r30, r16
  e8:	29 81       	ldd	r18, Y+1	; 0x01
  ea:	3a 81       	ldd	r19, Y+2	; 0x02
  ec:	4b 81       	ldd	r20, Y+3	; 0x03
  ee:	5c 81       	ldd	r21, Y+4	; 0x04
  f0:	1e 16       	cp	r1, r30
  f2:	bc f0       	brlt	.+46     	; 0x122 <__ashldi3+0x6a>
  f4:	19 86       	std	Y+9, r1	; 0x09
  f6:	1a 86       	std	Y+10, r1	; 0x0a
  f8:	1b 86       	std	Y+11, r1	; 0x0b
  fa:	1c 86       	std	Y+12, r1	; 0x0c
  fc:	8e 2f       	mov	r24, r30
  fe:	99 27       	eor	r25, r25
 100:	87 fd       	sbrc	r24, 7
 102:	90 95       	com	r25
 104:	90 95       	com	r25
 106:	81 95       	neg	r24
 108:	9f 4f       	sbci	r25, 0xFF	; 255
 10a:	04 c0       	rjmp	.+8      	; 0x114 <__ashldi3+0x5c>
 10c:	22 0f       	add	r18, r18
 10e:	33 1f       	adc	r19, r19
 110:	44 1f       	adc	r20, r20
 112:	55 1f       	adc	r21, r21
 114:	8a 95       	dec	r24
 116:	d2 f7       	brpl	.-12     	; 0x10c <__ashldi3+0x54>
 118:	2d 87       	std	Y+13, r18	; 0x0d
 11a:	3e 87       	std	Y+14, r19	; 0x0e
 11c:	4f 87       	std	Y+15, r20	; 0x0f
 11e:	58 8b       	std	Y+16, r21	; 0x10
 120:	2a c0       	rjmp	.+84     	; 0x176 <__ashldi3+0xbe>
 122:	ca 01       	movw	r24, r20
 124:	b9 01       	movw	r22, r18
 126:	00 2e       	mov	r0, r16
 128:	04 c0       	rjmp	.+8      	; 0x132 <__ashldi3+0x7a>
 12a:	66 0f       	add	r22, r22
 12c:	77 1f       	adc	r23, r23
 12e:	88 1f       	adc	r24, r24
 130:	99 1f       	adc	r25, r25
 132:	0a 94       	dec	r0
 134:	d2 f7       	brpl	.-12     	; 0x12a <__ashldi3+0x72>
 136:	69 87       	std	Y+9, r22	; 0x09
 138:	7a 87       	std	Y+10, r23	; 0x0a
 13a:	8b 87       	std	Y+11, r24	; 0x0b
 13c:	9c 87       	std	Y+12, r25	; 0x0c
 13e:	da 01       	movw	r26, r20
 140:	c9 01       	movw	r24, r18
 142:	04 c0       	rjmp	.+8      	; 0x14c <__ashldi3+0x94>
 144:	b6 95       	lsr	r27
 146:	a7 95       	ror	r26
 148:	97 95       	ror	r25
 14a:	87 95       	ror	r24
 14c:	ea 95       	dec	r30
 14e:	d2 f7       	brpl	.-12     	; 0x144 <__ashldi3+0x8c>
 150:	2d 81       	ldd	r18, Y+5	; 0x05
 152:	3e 81       	ldd	r19, Y+6	; 0x06
 154:	4f 81       	ldd	r20, Y+7	; 0x07
 156:	58 85       	ldd	r21, Y+8	; 0x08
 158:	04 c0       	rjmp	.+8      	; 0x162 <__ashldi3+0xaa>
 15a:	22 0f       	add	r18, r18
 15c:	33 1f       	adc	r19, r19
 15e:	44 1f       	adc	r20, r20
 160:	55 1f       	adc	r21, r21
 162:	0a 95       	dec	r16
 164:	d2 f7       	brpl	.-12     	; 0x15a <__ashldi3+0xa2>
 166:	82 2b       	or	r24, r18
 168:	93 2b       	or	r25, r19
 16a:	a4 2b       	or	r26, r20
 16c:	b5 2b       	or	r27, r21
 16e:	8d 87       	std	Y+13, r24	; 0x0d
 170:	9e 87       	std	Y+14, r25	; 0x0e
 172:	af 87       	std	Y+15, r26	; 0x0f
 174:	b8 8b       	std	Y+16, r27	; 0x10
 176:	29 85       	ldd	r18, Y+9	; 0x09
 178:	3a 85       	ldd	r19, Y+10	; 0x0a
 17a:	4b 85       	ldd	r20, Y+11	; 0x0b
 17c:	5c 85       	ldd	r21, Y+12	; 0x0c
 17e:	6d 85       	ldd	r22, Y+13	; 0x0d
 180:	7e 85       	ldd	r23, Y+14	; 0x0e
 182:	8f 85       	ldd	r24, Y+15	; 0x0f
 184:	98 89       	ldd	r25, Y+16	; 0x10
 186:	60 96       	adiw	r28, 0x10	; 16
 188:	0f b6       	in	r0, 0x3f	; 63
 18a:	f8 94       	cli
 18c:	de bf       	out	0x3e, r29	; 62
 18e:	0f be       	out	0x3f, r0	; 63
 190:	cd bf       	out	0x3d, r28	; 61
 192:	cf 91       	pop	r28
 194:	df 91       	pop	r29
 196:	0f 91       	pop	r16
 198:	08 95       	ret

I'm not going to try and analyze that but a Google for the symbol name and the word "source" will lead you to many copies of the identical source:

/*
 * Shift a (signed) quad value left (arithmetic shift left).
 * This is the same as logical shift left!
 */
quad_t
__ashldi3(a, shift)
	quad_t a;
	qshift_t shift;
{
	union uu aa;

	aa.q = a;
	if (shift >= LONG_BITS) {
		aa.ul[H] = shift >= QUAD_BITS ? 0 :
		    aa.ul[L] << (shift - LONG_BITS);
		aa.ul[L] = 0;
	} else if (shift > 0) {
		aa.ul[H] = (aa.ul[H] << shift) |
		    (aa.ul[L] >> (LONG_BITS - shift));
		aa.ul[L] <<= shift;
	}
	return (aa.q);
}

HOWEVER I just explored the recent GCC source tree (the above was built using a very old 4.5.3) and I now find:

 

https://gcc.gnu.org/viewcvs/gcc/...

 

So I'm going to suggest this may have a LOT to do with which version of the compiler you are running. In fact look at the revision history on that file:

 

https://gcc.gnu.org/viewcvs/gcc/...

 

and notice what happened at 196431. A check in by Georg-Johan Lay (who else!) in March 2013 appears to have implemented what were previously C shifts in hand crafted AVR asm. So make sure your C compiler dates from a release made after March 2014. looking at:

 

https://gcc.gnu.org/releases.html

 

that would seem to suggest you would need to have a 4.7.3 or later. If using Atmel builds the one in AS6.2 was 4.8.1 I believe and the one in AS7 is 4.9.2. I'll just try that code with 4.9.2 now...

000000b8 <__ashldi3>:
  b8:	0f 93       	push	r16
  ba:	08 30       	cpi	r16, 0x08	; 8
  bc:	90 f0       	brcs	.+36     	; 0xe2 <__ashldi3+0x2a>
  be:	98 2f       	mov	r25, r24
  c0:	87 2f       	mov	r24, r23
  c2:	76 2f       	mov	r23, r22
  c4:	65 2f       	mov	r22, r21
  c6:	54 2f       	mov	r21, r20
  c8:	43 2f       	mov	r20, r19
  ca:	32 2f       	mov	r19, r18
  cc:	22 27       	eor	r18, r18
  ce:	08 50       	subi	r16, 0x08	; 8
  d0:	f4 cf       	rjmp	.-24     	; 0xba <__ashldi3+0x2>
  d2:	22 0f       	add	r18, r18
  d4:	33 1f       	adc	r19, r19
  d6:	44 1f       	adc	r20, r20
  d8:	55 1f       	adc	r21, r21
  da:	66 1f       	adc	r22, r22
  dc:	77 1f       	adc	r23, r23
  de:	88 1f       	adc	r24, r24
  e0:	99 1f       	adc	r25, r25
  e2:	0a 95       	dec	r16
  e4:	b2 f7       	brpl	.-20     	; 0xd2 <__ashldi3+0x1a>
  e6:	0f 91       	pop	r16
  e8:	08 95       	ret

Ah ha! Much more satisfying :-)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Wow...that is just about the most thorough reply to an offhand remark I have ever seen!  I was using atmel studio 6.2, so I downloaded atmel studio 7 to compare and the disassembly looks nearly identical while the performance is, unsurprisingly, roughly the same.  So while I can appreciate that routine is highly optimized, in that it's managing to left shift 56 times a variable eight times wider than the cpu registers in under 140 clock cycles, I had expected/hoped/wished for slightly different behavior.  Namely, that the optimizer would recognize that a bitshift which is a multiple of a whole byte on an 8 bit variable was just moving a single byte in memory, without bothering going through the motions to juggle all the other memory locations that are guaranteed to be zero both before and after the shift.  Maybe it's an unrealistic expectation, but in my likely under-informed opinion it doesn't seem like it should be that difficult for the compiler to figure out.  Of course it might just be something nobody bothered with since it only makes an appreciable impact on the larger types, which don't see as much use in 8 bit applications.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Rezer wrote:
Maybe it's an unrealistic expectation,

It is.

 

But the compiler is open source so if you can develop a patch that would implement such an optimisation you may well get it accepted. Not only does the compiler get better but you get your name on the tin too!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'd be far more willing to believe that my underlying assumptions are false than that I might have anything useful to contribute to compiler development :p