## Help working with algorithm,

16 posts / 0 new
Author
Message

Not following this procedure here at all.

(Position A to Position C):
Perform the Shift Right and Xor operations as shown in the flow diagram 64 times.
Each time use the next sequential ChallengeBit for the (A1 Xor NextChallengeBit) operation. Example:
Fist time through the loop: B24 = A1 Xor CB1

Second time through the loop: B24 = A1 Xor CB2

Third time through the loop: B24 = A1 Xor CB3

Forth time through the loop: B24 = A1 Xor CB4
-
-
64th time through the loop: B24 = A1 Xor CB64

-------------------------------------------------------------

CB is random data 64 bits in length

B24 is C541A9

A1 is the first bit of B24

-----------------------------------------------------------

So if I do that I only change the last bit of B24 and that seem very pointless. This procedure is direction for decryption. Attached is a a chart depicting the above.

This is the way I read it

for (i=0;i<64;i++)
{
output = (init & 1)  ^  (CHALLENGE  & 1);//take the first bit from CHALLENGE  and OR to first bit in init
CHALLENGE  >>= 1;//bit is done go to next
}

Though that can not be what is being asked for here. I'll end up with a 1 or a 0 at the MSB in init,  I'd assume it would change more then 1 bit? Anyone see what I'm missing here?

Maybe something like this is implied?

for (i=0;i<64;i++)
{

for (j=0;j<24;j++)
{
output |= (init & 1)  ^  (CHALLANGE & 1);
init >>= 1;
output <<= 1;
}
CHALLANGE >>= 1;
init = output;
}

## Attachment(s):

Last Edited: Thu. Apr 21, 2016 - 07:07 PM

It seems like you're ignoring most of the arrows and xor operations to me...this is the diagram as I read it:

Load init value 0xC541A9 into buffer

Repeat 64 times:

1) Xor low bit of buffer with low bit of challenge, save result as tempbit.

2) Right shift buffer.

3) Place tempbit in high bit of buffer. (covers Xor statement in blue)

4) If tempbit is one, xor buffer with 0x109028 and store result to buffer, if tempbit is zero do nothing (covers the 5 Xor statements above C24:1 boxes)

5) Right shift challenge. (moves CB2, CB3... to low bit)

And then rearrange buffer nibbles to form response bytes.

Obviously some of that shouldn't be done with shifts in an 8 bit MCU, but I'm going for the shortest explanation without trying to optimize anything.

Last Edited: Thu. Apr 21, 2016 - 08:47 PM

I got bored.  I realize you didn't specify 8 bit mcu now that I look at the forum it's in, but what the hell, here's an 8 bit implementation:

```#include <avr/io.h>

/**********************
Right shift an array one place.
arr is the array.
size is the length of the array
**********************/
void rShiftArr(uint8_t * arr, uint8_t size)
{
size--;
for(; size > 0; size--)
{
arr[size] >>= 1;
if (arr[size-1] & 0x01)
arr[size] |= 0x80;
}
arr[0] >>= 1;
}

/***********************
challenge must be provided.  It's 8 bytes in length, msB at index 0, lsB at index 7.
response must only be allocated.  3 bytes in length, value is ignored

The result is left in response.  msB at index 0, lsB at index 2.
Additionally, challenge will be zeroed out.
***********************/
void createResponse(uint8_t * challenge, uint8_t * response)
{
uint8_t temp;

response[0] = 0xC5;
response[1] = 0x41;
response[2] = 0xA9;

for(uint8_t byteIndex = 7; byteIndex < 8 ; byteIndex--)  //8 bits per byte
{
for(uint8_t bitCounter = 0; bitCounter < 8; bitCounter++) //from lsb to msb of byte
{
temp = 0x01 & (challenge[byteIndex] ^ response[2]);  //store lsb XOR
rShiftArr(response, 3);  //right shift response
if(temp)  //if buffer lsb XOR challenge lsb is 1, set msb of response and XOR response with 0x109028
{
response[0] |= 0x80;
response[0] ^= 0x10;
response[1] ^= 0x90;
response[2] ^= 0x28;
}
challenge[byteIndex] >>= 1;  //next lsb of challenge
}
}

//reusing temp to juggle nibbles
temp = (response[2] & 0xF0) >> 4;
temp |= response[1] << 4; // temp C12:9 C8:5

response[1] &= 0xF0;
response[1] |= response[0] >> 4; // response[1] C16:13 C24:21

response[2] <<= 4;
response[2] |= 0x0F & response[0]; //response[2] C4:1 C20:C17

response[0] = temp; 	//response[0] C12:9 C8:5
}

int main(void)
{
while(1)
{
uint8_t challenge[] = {0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99, 0x88};  //0xFFEEDDCCBBAA9988
uint8_t response[3];
createResponse(challenge, response);
}
}

```

The result with that challenge is 0xEE61CE

Last Edited: Fri. Apr 22, 2016 - 01:42 AM

Rezer wrote:

I got bored.  I realize you didn't specify 8 bit mcu now that I look at the forum it's in, but what the hell, here's an 8 bit implementation:

```#include <avr/io.h>

/**********************
Right shift an array one place.
arr is the array.
size is the length of the array
**********************/
void rShiftArr(uint8_t * arr, uint8_t size)
{
size--;
for(; size > 0; size--)
{
arr[size] >>= 1;
if (arr[size-1] & 0x01)
arr[size] |= 0x80;
}
arr[0] >>= 1;
}

/***********************
challenge must be provided.  It's 8 bytes in length, msB at index 0, lsB at index 7.
response must only be allocated.  3 bytes in length, value is ignored

The result is left in response.  msB at index 0, lsB at index 2.
Additionally, challenge will be zeroed out.
***********************/
void createResponse(uint8_t * challenge, uint8_t * response)
{
uint8_t temp;

response[0] = 0xC5;
response[1] = 0x41;
response[2] = 0xA9;

for(uint8_t byteIndex = 7; byteIndex < 8 ; byteIndex--)  //8 bits per byte
{
for(uint8_t bitCounter = 0; bitCounter < 8; bitCounter++) //from lsb to msb of byte
{
temp = 0x01 & (challenge[byteIndex] ^ response[2]);  //store lsb XOR
rShiftArr(response, 3);  //right shift response
if(temp)  //if buffer lsb XOR challenge lsb is 1, set msb of response and XOR response with 0x109028
{
response[0] |= 0x80;
response[0] ^= 0x10;
response[1] ^= 0x90;
response[2] ^= 0x28;
}
challenge[byteIndex] >>= 1;  //next lsb of challenge
}
}

//reusing temp to juggle nibbles
temp = (response[2] & 0xF0) >> 4;
temp |= response[1] << 4; // temp C12:9 C8:5

response[1] &= 0xF0;
response[1] |= response[0] >> 4; // response[1] C16:13 C24:21

response[2] <<= 4;
response[2] |= 0x0F & response[0]; //response[2] C4:1 C20:C17

response[0] = temp; 	//response[0] C12:9 C8:5
}

int main(void)
{
while(1)
{
uint8_t challenge[] = {0xFF, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x99, 0x88};  //0xFFEEDDCCBBAA9988
uint8_t response[3];
createResponse(challenge, response);
}
}
```

The result with that challenge is 0xEE61CE

That explains the troubles I'm having then so now I need to prepare the 8 bits properly (see new image). To make the 64 bit storage I was trying this but that is not going to work with an 8 bit MCU.

uint8_t response[3];
uint8_t challenge[] = {0, 0, 0, 0, 0, 0, 0, 0};
uint8_t FIXED[] = {0x41,0xaa,0x42,0xbb,0x43};//f1,f2,f3,f4,f5
uint8_t KEY[] = {0x1a,0xf9,0x64};//s1,s2,s3

//I was using long long for challenge , now I have to stuff them in the right place for the covered code.
challenge |= (FIXED[4]>>7)<<63;//load MSB-f5 to 64
challenge |= (FIXED[4] & 1)<<56;//load LSB-f5 to 57
challenge |= (FIXED[3]>>7)<<55;//load MSB-f5 to 56
challenge |= (KEY[1]  & 1)<<8;//load LSB-s2 to 9
challenge |= (KEY[0] >>7)<<7;//load MSB-s1 to 8
challenge |= (KEY[0]  & 1);//load LSB-s1 to 1

so I think

challenge[0] |= (FIXED[4] & 0x80);//load MSB-f5 to 64
challenge[0] |= (FIXED[4] & 1);//load LSB-f5 to 57
challenge[1] |= (FIXED[3] & 0x80);//load MSB-f4 to 56
challenge[6] |= (KEY[1]  & 1);//load LSB-s2 to 9
challenge[7] |= (KEY[0] & 0x80);//load MSB-s1 to 8
challenge[7] |= (KEY[0]  & 1);//load LSB-s1 to 1

then

createResponse(challenge, response);

but I should get 0xb6 f4 cf

## Attachment(s):

Last Edited: Fri. Apr 22, 2016 - 06:36 PM

The compiler is avr-gcc. It supports uint64_t (badly - but what do you expect on an 8bit micro where just one 64 bit variable occupies 8 of the 32 registers!).

The point here (kind of) is that if you use:

`#include <stdint.h>`

and the standard types like uint64_t then this code will behave identically on whatever architecture you build it on. It won't be a case of "PC version" and "AVR version" as it is the SAME version. If, however, you write in terms of long long, long, int, short, char then you will hit issues when you try to use the same source on different architectures. This is the very reason <stdint.h> exists. It means you can develop and test numeric processing algorithms in the comfort of a PC .exe then just rebuild for the AVR later. The downside, as I say, is that while a uint64_t is one register on modern Pentiums and AMD64's it is 8 registers on an AVR. So while the code will have the same result on the AVR the route to the solution may be more tortuous. In which case you may then want to spend a little time on optimising the code (which can likely mostly be done in PC comfort) to get the same result but perhaps not using such wide types except when absolutely necessary (don't be tempted to just throw uint64_t's at everything!).

Oh and you don't need a specific include of it if you use <avr/io.h> as that already includes it early on anyway.

Your logic looks fine to me for the modified version...the reason your original method wouldnt work I *think* is due to the way type promotion works with bit shifts.  You're shifting 8 bit values, and by default they'll only be promoted to int values which are 16 bits wide.  If you wanted it to work with a uint64_t you'd have to promote each variable to uint64_t before doing the shift, otherwise you shift the data right off the end of your variable.  Also, unless GCC got better at optimizing bitshifts since last I checked, shifting 64 bit variables large numbers of times is very expensive.

And clawson, I have no doubt that you're correct about using uint64 types, but I still think his original method is flawed for the above reason.  Disclaimer:  I haven't tested it

Last Edited: Fri. Apr 22, 2016 - 09:52 PM

Ok, I just ran a few tests:

With the following declarations:

```		uint8_t in = 0x01;
uint64_t out;```

This produces 0x0000000000000000 in the out variable:

`		out = in << 56;`

While this produces the expected 0x0100000000000000:

`		out = (uint64_t)in << 56;`

However, it takes 133 cycles at all optimization levels even though in essence it's only copying a single byte.  Why are optimizations so terrible on bitshifts?  Anyway...

This results in 0x0000000000000000:

`		out = in << 16;`

And this 0xFFFFFFFFFFFF8000:

`		out = in << 15;`

So there you go, it promotes to a signed int16 and no further without an explicit cast.

Last Edited: Fri. Apr 22, 2016 - 10:20 PM

Rezer, thx.

I guess then we both are misunderstanding the requirements. I'm not arriving at these answers

test operations.

Fixed bytes (f1-f5) 14,aa,42,bb,23

(k1-k3) 1a,f9,64 ---> should yield  b6,f4,cf

(k1-k3) 4e,96,8a---> should yield  81,4b,a6

(k1-k3) 4e,96,8b ---> should yield  c0,25,eb

Fixed bytes (f1-f5) 1a,2b,3c,4d,5e

(k1-k3) 1a,f9,64 ---> should yield  a1,1c,a8

(k1-k3) 4e,96,8a---> should yield  96,a3,c1

(k1-k3) 4e,96,8b---> should yield  d7,cd,8c

Last Edited: Mon. Apr 25, 2016 - 12:28 PM

Sorry, I didn't look hard enough at that second table you posted vs your implementation.  I thought it was mixing bits around with existing challenge bytes, but it looks like F1-F5 and S1-S3 ARE the challenge bytes and that table is dashed because whoever made it hates being easily understood  The dashes I was reading as "don't modify" are really standins for bit 7 of F5, bit 6 of F5, bit 5 of F5, etc.  You just need to do this:

`uint8_t challenge[] = {F5, F4, F3, F2, F1, S3, S2, S1};`

and then it should work fine.  I say "should" because I ran it on your first 3 test samples and it failed miserably, while the second 3 worked exactly as expected.  I think you have a typo in the fixed bytes in the first set of sample data.

so should it not be , mix bits then challenge?

unsigned char FIXED[5] = {0x43,0xbb,0x42,0xaa,0x41};//tried changing this from f1-f5 to f5-f1
unsigned char KEY[3]={0x1a,0xf9,0x64};//tried both directions here.

//mix up the bits.
challenge[0] |= (FIXED[4] & 0x80);//load MSB-f5 to 64
challenge[0] |= (FIXED[4] & 1);//load LSB-f5 to 57
challenge[1] |= (FIXED[3] & 0x80);//load MSB-f4 to 56
challenge[6] |= (KEY[1]  & 1);//load LSB-s2 to 9
challenge[7] |= (KEY[0] & 0x80);//load MSB-s1 to 8
challenge[7] |= (KEY[0]  & 1);//load LSB-s1 to 1

createResponse(challenge, response);//now challange

is that not how you read it?

Last Edited: Mon. Apr 25, 2016 - 06:38 PM

Nope, like I said that image you posted, "c.png", isn't very clear.  There's no mixing of bits at all, simply loading F1-F5 and S1-S3 as the challenge bytes.  Loading the challenge array like I posted previously works fine for all 6 challenge response samples, including the first 3 now that I realize your fixed bytes were meant to be F1-F5 = 0x41, 0xaa, 0x42, 0xbb, 0x43.

wow, that's just friggen nutz. The doc is for people on your team guys LOL. ok Rezer thx so much for all your help. I would never have gotten thru that doc on my own at my level.

Rezer wrote:
However, it takes 133 cycles at all optimization levels even though in essence it's only copying a single byte. Why are optimizations so terrible on bitshifts?

Such questions should never remain unanswered:

```\$ cat avr.c
#include <avr/io.h>

uint64_t out;

int main(void) {
out = (uint64_t)PINB << 56;
}
```

in this I use PINB as a useful input as it just happens to be volatile so the optimiser has no choice to precalculate anything here. When built I get this for main():

```	out = (uint64_t)PINB << 56;
7e:	26 b3       	in	r18, 0x16	; 22
80:	30 e0       	ldi	r19, 0x00	; 0
82:	40 e0       	ldi	r20, 0x00	; 0
84:	50 e0       	ldi	r21, 0x00	; 0
86:	60 e0       	ldi	r22, 0x00	; 0
88:	70 e0       	ldi	r23, 0x00	; 0
8a:	80 e0       	ldi	r24, 0x00	; 0
8c:	90 e0       	ldi	r25, 0x00	; 0
8e:	08 e3       	ldi	r16, 0x38	; 56
90:	0e 94 5c 00 	call	0xb8	; 0xb8 <__ashldi3>
94:	20 93 60 00 	sts	0x0060, r18
98:	30 93 61 00 	sts	0x0061, r19
9c:	40 93 62 00 	sts	0x0062, r20
a0:	50 93 63 00 	sts	0x0063, r21
a4:	60 93 64 00 	sts	0x0064, r22
a8:	70 93 65 00 	sts	0x0065, r23
ac:	80 93 66 00 	sts	0x0066, r24
b0:	90 93 67 00 	sts	0x0067, r25
```

To make the CALL there this could not really be any simpler. It loads PINB into the lowest register of a group of 8 (R25:R18) then loads 56 into R16 and makes the CALL. On the way back it just stores the 8 registers to "out" in RAM (as it must because I made it a global). So there's nothing there that can be optimised away. The actual 56 bit shift of the uint64_t is done by the generic routine:

```000000b8 <__ashldi3>:
b8:	0f 93       	push	r16
ba:	df 93       	push	r29
bc:	cf 93       	push	r28
be:	cd b7       	in	r28, 0x3d	; 61
c0:	de b7       	in	r29, 0x3e	; 62
c2:	60 97       	sbiw	r28, 0x10	; 16
c4:	0f b6       	in	r0, 0x3f	; 63
c6:	f8 94       	cli
c8:	de bf       	out	0x3e, r29	; 62
ca:	0f be       	out	0x3f, r0	; 63
cc:	cd bf       	out	0x3d, r28	; 61
ce:	00 23       	and	r16, r16
d0:	09 f4       	brne	.+2      	; 0xd4 <__ashldi3+0x1c>
d2:	59 c0       	rjmp	.+178    	; 0x186 <__ashldi3+0xce>
d4:	29 83       	std	Y+1, r18	; 0x01
d6:	3a 83       	std	Y+2, r19	; 0x02
d8:	4b 83       	std	Y+3, r20	; 0x03
da:	5c 83       	std	Y+4, r21	; 0x04
dc:	6d 83       	std	Y+5, r22	; 0x05
de:	7e 83       	std	Y+6, r23	; 0x06
e0:	8f 83       	std	Y+7, r24	; 0x07
e2:	98 87       	std	Y+8, r25	; 0x08
e4:	e0 e2       	ldi	r30, 0x20	; 32
e6:	e0 1b       	sub	r30, r16
e8:	29 81       	ldd	r18, Y+1	; 0x01
ea:	3a 81       	ldd	r19, Y+2	; 0x02
ec:	4b 81       	ldd	r20, Y+3	; 0x03
ee:	5c 81       	ldd	r21, Y+4	; 0x04
f0:	1e 16       	cp	r1, r30
f2:	bc f0       	brlt	.+46     	; 0x122 <__ashldi3+0x6a>
f4:	19 86       	std	Y+9, r1	; 0x09
f6:	1a 86       	std	Y+10, r1	; 0x0a
f8:	1b 86       	std	Y+11, r1	; 0x0b
fa:	1c 86       	std	Y+12, r1	; 0x0c
fc:	8e 2f       	mov	r24, r30
fe:	99 27       	eor	r25, r25
100:	87 fd       	sbrc	r24, 7
102:	90 95       	com	r25
104:	90 95       	com	r25
106:	81 95       	neg	r24
108:	9f 4f       	sbci	r25, 0xFF	; 255
10a:	04 c0       	rjmp	.+8      	; 0x114 <__ashldi3+0x5c>
10c:	22 0f       	add	r18, r18
10e:	33 1f       	adc	r19, r19
110:	44 1f       	adc	r20, r20
112:	55 1f       	adc	r21, r21
114:	8a 95       	dec	r24
116:	d2 f7       	brpl	.-12     	; 0x10c <__ashldi3+0x54>
118:	2d 87       	std	Y+13, r18	; 0x0d
11a:	3e 87       	std	Y+14, r19	; 0x0e
11c:	4f 87       	std	Y+15, r20	; 0x0f
11e:	58 8b       	std	Y+16, r21	; 0x10
120:	2a c0       	rjmp	.+84     	; 0x176 <__ashldi3+0xbe>
122:	ca 01       	movw	r24, r20
124:	b9 01       	movw	r22, r18
126:	00 2e       	mov	r0, r16
128:	04 c0       	rjmp	.+8      	; 0x132 <__ashldi3+0x7a>
12a:	66 0f       	add	r22, r22
12c:	77 1f       	adc	r23, r23
12e:	88 1f       	adc	r24, r24
130:	99 1f       	adc	r25, r25
132:	0a 94       	dec	r0
134:	d2 f7       	brpl	.-12     	; 0x12a <__ashldi3+0x72>
136:	69 87       	std	Y+9, r22	; 0x09
138:	7a 87       	std	Y+10, r23	; 0x0a
13a:	8b 87       	std	Y+11, r24	; 0x0b
13c:	9c 87       	std	Y+12, r25	; 0x0c
13e:	da 01       	movw	r26, r20
140:	c9 01       	movw	r24, r18
142:	04 c0       	rjmp	.+8      	; 0x14c <__ashldi3+0x94>
144:	b6 95       	lsr	r27
146:	a7 95       	ror	r26
148:	97 95       	ror	r25
14a:	87 95       	ror	r24
14c:	ea 95       	dec	r30
14e:	d2 f7       	brpl	.-12     	; 0x144 <__ashldi3+0x8c>
150:	2d 81       	ldd	r18, Y+5	; 0x05
152:	3e 81       	ldd	r19, Y+6	; 0x06
154:	4f 81       	ldd	r20, Y+7	; 0x07
156:	58 85       	ldd	r21, Y+8	; 0x08
158:	04 c0       	rjmp	.+8      	; 0x162 <__ashldi3+0xaa>
15a:	22 0f       	add	r18, r18
15c:	33 1f       	adc	r19, r19
15e:	44 1f       	adc	r20, r20
160:	55 1f       	adc	r21, r21
162:	0a 95       	dec	r16
164:	d2 f7       	brpl	.-12     	; 0x15a <__ashldi3+0xa2>
166:	82 2b       	or	r24, r18
168:	93 2b       	or	r25, r19
16a:	a4 2b       	or	r26, r20
16c:	b5 2b       	or	r27, r21
16e:	8d 87       	std	Y+13, r24	; 0x0d
170:	9e 87       	std	Y+14, r25	; 0x0e
172:	af 87       	std	Y+15, r26	; 0x0f
174:	b8 8b       	std	Y+16, r27	; 0x10
176:	29 85       	ldd	r18, Y+9	; 0x09
178:	3a 85       	ldd	r19, Y+10	; 0x0a
17a:	4b 85       	ldd	r20, Y+11	; 0x0b
17c:	5c 85       	ldd	r21, Y+12	; 0x0c
17e:	6d 85       	ldd	r22, Y+13	; 0x0d
180:	7e 85       	ldd	r23, Y+14	; 0x0e
182:	8f 85       	ldd	r24, Y+15	; 0x0f
184:	98 89       	ldd	r25, Y+16	; 0x10
186:	60 96       	adiw	r28, 0x10	; 16
188:	0f b6       	in	r0, 0x3f	; 63
18a:	f8 94       	cli
18c:	de bf       	out	0x3e, r29	; 62
18e:	0f be       	out	0x3f, r0	; 63
190:	cd bf       	out	0x3d, r28	; 61
192:	cf 91       	pop	r28
194:	df 91       	pop	r29
196:	0f 91       	pop	r16
198:	08 95       	ret
```

I'm not going to try and analyze that but a Google for the symbol name and the word "source" will lead you to many copies of the identical source:

```/*
* Shift a (signed) quad value left (arithmetic shift left).
* This is the same as logical shift left!
*/
__ashldi3(a, shift)
qshift_t shift;
{
union uu aa;

aa.q = a;
if (shift >= LONG_BITS) {
aa.ul[H] = shift >= QUAD_BITS ? 0 :
aa.ul[L] << (shift - LONG_BITS);
aa.ul[L] = 0;
} else if (shift > 0) {
aa.ul[H] = (aa.ul[H] << shift) |
(aa.ul[L] >> (LONG_BITS - shift));
aa.ul[L] <<= shift;
}
return (aa.q);
}
```

HOWEVER I just explored the recent GCC source tree (the above was built using a very old 4.5.3) and I now find:

https://gcc.gnu.org/viewcvs/gcc/...

So I'm going to suggest this may have a LOT to do with which version of the compiler you are running. In fact look at the revision history on that file:

https://gcc.gnu.org/viewcvs/gcc/...

and notice what happened at 196431. A check in by Georg-Johan Lay (who else!) in March 2013 appears to have implemented what were previously C shifts in hand crafted AVR asm. So make sure your C compiler dates from a release made after March 2014. looking at:

https://gcc.gnu.org/releases.html

that would seem to suggest you would need to have a 4.7.3 or later. If using Atmel builds the one in AS6.2 was 4.8.1 I believe and the one in AS7 is 4.9.2. I'll just try that code with 4.9.2 now...

```000000b8 <__ashldi3>:
b8:	0f 93       	push	r16
ba:	08 30       	cpi	r16, 0x08	; 8
bc:	90 f0       	brcs	.+36     	; 0xe2 <__ashldi3+0x2a>
be:	98 2f       	mov	r25, r24
c0:	87 2f       	mov	r24, r23
c2:	76 2f       	mov	r23, r22
c4:	65 2f       	mov	r22, r21
c6:	54 2f       	mov	r21, r20
c8:	43 2f       	mov	r20, r19
ca:	32 2f       	mov	r19, r18
cc:	22 27       	eor	r18, r18
ce:	08 50       	subi	r16, 0x08	; 8
d0:	f4 cf       	rjmp	.-24     	; 0xba <__ashldi3+0x2>
d2:	22 0f       	add	r18, r18
d4:	33 1f       	adc	r19, r19
d6:	44 1f       	adc	r20, r20
d8:	55 1f       	adc	r21, r21
da:	66 1f       	adc	r22, r22
dc:	77 1f       	adc	r23, r23
de:	88 1f       	adc	r24, r24
e0:	99 1f       	adc	r25, r25
e2:	0a 95       	dec	r16
e4:	b2 f7       	brpl	.-20     	; 0xd2 <__ashldi3+0x1a>
e6:	0f 91       	pop	r16
e8:	08 95       	ret
```

Ah ha! Much more satisfying :-)

Wow...that is just about the most thorough reply to an offhand remark I have ever seen!  I was using atmel studio 6.2, so I downloaded atmel studio 7 to compare and the disassembly looks nearly identical while the performance is, unsurprisingly, roughly the same.  So while I can appreciate that routine is highly optimized, in that it's managing to left shift 56 times a variable eight times wider than the cpu registers in under 140 clock cycles, I had expected/hoped/wished for slightly different behavior.  Namely, that the optimizer would recognize that a bitshift which is a multiple of a whole byte on an 8 bit variable was just moving a single byte in memory, without bothering going through the motions to juggle all the other memory locations that are guaranteed to be zero both before and after the shift.  Maybe it's an unrealistic expectation, but in my likely under-informed opinion it doesn't seem like it should be that difficult for the compiler to figure out.  Of course it might just be something nobody bothered with since it only makes an appreciable impact on the larger types, which don't see as much use in 8 bit applications.