Less intensive checksum..

Go To Last Post
52 posts / 0 new

Pages

Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0


Using an atmega1284p @ 18.432 Mhz to be a go between a PC parallel port and a microSD card.  My goal is to be able to read/write files to the microsd via PC parallel port.  So far with the right kind of parallel port, performance is pretty decent (>500 KB/s) considering what it is.  The below is a packet of 1024 bytes.  I am using the "pc send error" line temporarily to show the time consumed by running a crc16 on this 1K on the AVR and it is roughly 1.8 milliseconds which ends up halving the speed because the AVR is busy during that time.  I am using the standard _crc16_update(); function for this.  Obviously I do want a checksum, but is there something less computationally demanding that is still decent?  I could just add up all the bytes, but that is probably pretty crummy.  Is there something between that and crc16?

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The crc hash function is not especially cheap, although a lot of modern hashes that are cheaper on desktop hardware might well be significantly worse on AVR, where anything over 8-bit operations is fairly expensive.

 

What are you checksumming, and why do you need it checksummed? One thing to consider is whether you can interleave the checksumming with the actual transmission -- if there's any busy-loops or waits or anything happening during the transmission, that's cycles you could be stealing to do computations.

 

Note in particular that crc16_update looks to me like it's not using precomputed tables, but it's possible to precompute a table that gives, in a single step, the CRC effect of any given byte, and there's only 256 possible bytes, and you might well be able to afford 512 bytes of RAM for that and get a very large speedup.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There is Fletcher's checksum and Adler-32 checksum (a modification to the Fletcher checksum), but they are not as effective as a CRC. Both require division so they probably don't have good performance without hardware division.

 

Are you using a bitwise CRC implementation or a table driven one? A table driven implementation shouldn't be very computationally demanding.

github.com/apcountryman/build-avr-gcc: a script for building avr-gcc

github.com/apcountryman/toolchain-avr-gcc: a CMake toolchain for cross compiling for the Atmel AVR family of microcontrollers

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you have 512 bytes of unused flash (and possibly also 512 bytes of RAM) you can use a table-driven method to eliminate looped CRC calculations on bits.
Another option is to use an 8-bit CRC.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Each error detection scheme has a .... failure rate (?) - the rate at which it will miss possible errors. None is perfect. It generally ends up being a tradeoff between computation overhead and rate of misses. CRC16 will miss fewer than CRC8. CRC8 will miss fewer than a classic checksum. Somewhere, in there, is XOR. Basically, you do the lowest error rate for the time that you an afford. However, that may not always be the easiest to determine. In the end, I think the common choice is to simply use the best that will fit the time that you have. 

 

Jim

 

Until Black Lives Matter, we do not have "All Lives Matter"!

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is there no way to do it on the fly?

 

you could start with a uint16 = 0

then receive two bytes and XOR them with the uint16.

all you need is a counter to 1 (0= first byte received, 1 is second received then do crc and set to 0 again)

You might have enough time to do that. IIRC you can have 1 byte stil in the receive buffer while already receiving the second byte( check datasheet as it might be that I messing up processors)

Another thing is a small X position uint16 ring buffer that you fill and outside the receive ISR in the main loop start working your way through that.

 

As said another thing might be going to a crc 8 then it should for sure be possible to do it on the fly.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Lookup table and on-the-fly has to be the way to go.

 

My goto crc16 function looks like this...

 

unsigned int crc16(unsigned int crcReg, unsigned char crcData) {
    return (crcReg << 8) ^ crc16LUT[((unsigned char)(crcReg >> 8)) ^ crcData];
}

 

...which compiles down to 21 assembly instructions, so less than 2us/byte.

 

 

[EDIT]

 

But of course, for 1024 bytes, that's still approaching the OPs time of 1.8ms. Interesting problem, more coffee needed.

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Not sure this is relevant but I was looking for a "fast hash" a while back (to uniquely identify image frames in a recording) and came across "FNV = Fowler/Noll/Vo". Some research work someone had done had shown this to be a very fast hash.

 

See the top answer here:

 

https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

 

In particular, in the table of timings see how FNV/FNV-1A compares to plain old CRC32

 

Having said all that I note you are using CRC16 not CRC32 so it could be faster than all?

Last Edited: Fri. Sep 11, 2020 - 08:27 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You have to step back and see what you actually want to check.

 

CRC is very good compared to the transmitted overhead, but heavy in calculations.

 

At the other end is parity bit (in RS232) big data overhead.

 

If you have to send 10 different sums of your data (real sum , 3 sums of every 3. data ......xor .....) would not be a problem, and it's still simple calculations that can be done on by the AVR.

 

 

   

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for al the replies guys - lots to think about/try here.  I'm using the build in function _crc16_update() which almost certainly doesn't use a LUT.  I'll google that and see if I can find it to try.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

a go between a PC parallel port and a microSD card

I hate it when everyone else understands what is going on and I'm clueless...

 

Simple question:

 

Are you validating the PC to micro data transmission, or the uSD card storage, (write and read back), which actually validates the full signal path?

 

How attached to the M1284 are you?

 

Recall that the Xmega will run at 32 MHz, in spec, which gives yo a lot more horsepower to work with, and many of them include a hardware CRC module which is also much faster than running the calc in software.

If you needed the M1284 for some reason one might even envision a system that used an Xmega as a smart peripheral to run the CRC for you, (with chip to chip overhead instead of calculations overhead).

 

JC 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0


The project is to bring a microsd fat/fat32/xfat filesystem to DOS (which doesn't understand fat32/xfat) via a network redirector interface TSR driver using a parallel port.  The AVR will handle all the SPI to the microSD using FATFS at clk/2 (9.216 MHz).  I have it supporting various parallel port modes depending on the parallel port capability.  This is the raw performance so far in different modes on a 700 MHz Pentium III system, but I want it to work all the way down to an 8088/8086.

 

The crc16 is purely a validation on the parallel port communication.  I plan to make a plug on module, but it could be used with a cable too.  I want to verify that the parallel part of the interface doesn't get corrupted.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The reason for the 1284 is 5V compatibility and also the 16K SRAM that it has.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well, how about the new AVR DA and DB series micros instead?

 

They both work at 5V and they both incorporate CRC-16-CCITT and CRC-32 (IEEE 802.3), in hardware.

 

Work smarter, not harder!

(Edit:  I mean that nicely, BTW!)

 

(The DB chips allow one to use an external Xtal.)

 

JC

 

Last Edited: Fri. Sep 11, 2020 - 01:11 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Good thoughts JC - I'm wanting to keep it simple with code I already have for a bootloader, etc., so that was why I chose the 1284p.  I can't argue with your direction though and may try it out if i can't find something suitable!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just curious, what kinds of error rates have you been detecting with your current approach?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

No errors detected so far, but I just implemented the detection and retry a couple nights ago.  I suppose I should throw a 6 feet DB25M/DB25F cable in the mix and see if any show up.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I would absolutely not recommend FNV as a replacement for CRC, it's going to be dramatically slower than a reasonable CRC implementation on AVR, and I wouldn't consider it a very good hash.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0


DocJC wrote:

Well, how about the new AVR DA and DB series micros instead?

They both work at 5V and they both incorporate CRC-16-CCITT and CRC-32 (IEEE 802.3), in hardware.

 

Isn't the CRC unit tied to only scanning the on-chip memory though?

 

 

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

alank2 wrote:

No errors detected so far, but I just implemented the detection and retry a couple nights ago.  I suppose I should throw a 6 feet DB25M/DB25F cable in the mix and see if any show up.

Yeah, I think trying to get a worst-case error rate figure would be helpful in making your decision.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

On chip memory would be fine, I'm currently doing it as a post operation to receiving the data.

 

I may try integrating it into the receiving loop, but then I'll have to either check to see if the byte being received or sent is past the first 4 bytes or not which contain the size of the packet (2b) and the crc16 (2b).  Either that or unwind that into two loops instead of just one which I'd rather not do.

 

I'm going to start with testing the performance of table based crc16 and if that is still slower than I want, maybe look at crc8.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Straight sums are cheap.

Do two sets, one group of five, each a sum of a fifth of the bytes,

the other a group of seven, each a sum of a seventh of the bytes.

Two bytes will be sufficient for each sum.

Do your unrolling well and its four cycles per byte.

Note that 25 registers are needed for that.

Actually getting it down to four cycles might require assembly.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

How does one make a lookup table for this:

 

uint16_t _crc16_update(uint16_t crc, uint8_t a)
{
  uint8_t i;

  crc ^= a;
  for (i = 0; i < 8; ++i)
    {
      if (crc & 1)
        crc = (uint16_t)((crc >> 1) ^ 0xA001);
      else crc = (uint16_t)(crc >> 1);
    }
  return crc;
}

EDIT- never mind I'll see if I can figure it out.

Last Edited: Fri. Sep 11, 2020 - 08:57 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

alank2 wrote:
The project is to bring a microsd fat/fat32/xfat filesystem to DOS (which doesn't understand fat32/xfat) via a network redirector interface TSR driver using a parallel port.

Emulated DOS environments like DOSEMU (linux) and DOSBOX (Windows) were my way around this problem. Both are really good emulators BTW:

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

N.Winterbottom wrote:

Emulated DOS environments like DOSEMU (linux) and DOSBOX (Windows) were my way around this problem. Both are really good emulators BTW:

 

I agree - I like collecting old equipment like the original PC AT/5170 or the Compaq Deskpro/M 486.

 

I figured out the make table function so I've got that to try and see what type of performance difference it makes!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

alank2 wrote:

How does one make a lookup table for this:

 

uint16_t _crc16_update(uint16_t crc, uint8_t a)
{
  uint8_t i;

  crc ^= a;
  for (i = 0; i < 8; ++i)
    {
      if (crc & 1)
        crc = (uint16_t)((crc >> 1) ^ 0xA001);
      else crc = (uint16_t)(crc >> 1);
    }
  return crc;
}

EDIT- never mind I'll see if I can figure it out.

I've used the table method, but never analyzed it.  I wonder if you could just generate the table by by performing this CRC calculation on every 8-bit data value from 0 to 0xFF, and in each case doing an XOR between the starting and ending values (for each data value), thus generating a table of XOR values for every 8-bit data value.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Search for "A painless guide to CRC detection algorithms".
I found it a heavy (but informative) read.
It has C code to generate lookup tables.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Tested the table version, improvement, but still slower than I want to see:

 

1.335ms        sram lookup table
1.446ms        flash lookup table

1.780ms        no table
 

Maybe I should try crc8 and see how it does.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

  Just want to mention, the IP checksum. However, I think it is weaker than CRC8.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I would say there are something wrong if lookup only about 25% than normal code.

Try to find out where the time is spend.

(perhaps inline the function)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 are you saying the lookup should have made much more difference?  I thought so too.

 

gcc is lready inlining crc16t_update() - I tried to inline it with no difference.

 

const uint16_t /*__flash*/ crc16t_lookup[256] =
{
0x0000, 0xc0c1, 0xc181, 0x0140, 0xc301, 0x03c0, 0x0280, 0xc241, 0xc601, 0x06c0, 0x0780, 0xc741, 0x0500, 0xc5c1, 0xc481, 0x0440, 
0xcc01, 0x0cc0, 0x0d80, 0xcd41, 0x0f00, 0xcfc1, 0xce81, 0x0e40, 0x0a00, 0xcac1, 0xcb81, 0x0b40, 0xc901, 0x09c0, 0x0880, 0xc841, 
0xd801, 0x18c0, 0x1980, 0xd941, 0x1b00, 0xdbc1, 0xda81, 0x1a40, 0x1e00, 0xdec1, 0xdf81, 0x1f40, 0xdd01, 0x1dc0, 0x1c80, 0xdc41, 
0x1400, 0xd4c1, 0xd581, 0x1540, 0xd701, 0x17c0, 0x1680, 0xd641, 0xd201, 0x12c0, 0x1380, 0xd341, 0x1100, 0xd1c1, 0xd081, 0x1040, 
0xf001, 0x30c0, 0x3180, 0xf141, 0x3300, 0xf3c1, 0xf281, 0x3240, 0x3600, 0xf6c1, 0xf781, 0x3740, 0xf501, 0x35c0, 0x3480, 0xf441, 
0x3c00, 0xfcc1, 0xfd81, 0x3d40, 0xff01, 0x3fc0, 0x3e80, 0xfe41, 0xfa01, 0x3ac0, 0x3b80, 0xfb41, 0x3900, 0xf9c1, 0xf881, 0x3840, 
0x2800, 0xe8c1, 0xe981, 0x2940, 0xeb01, 0x2bc0, 0x2a80, 0xea41, 0xee01, 0x2ec0, 0x2f80, 0xef41, 0x2d00, 0xedc1, 0xec81, 0x2c40, 
0xe401, 0x24c0, 0x2580, 0xe541, 0x2700, 0xe7c1, 0xe681, 0x2640, 0x2200, 0xe2c1, 0xe381, 0x2340, 0xe101, 0x21c0, 0x2080, 0xe041, 
0xa001, 0x60c0, 0x6180, 0xa141, 0x6300, 0xa3c1, 0xa281, 0x6240, 0x6600, 0xa6c1, 0xa781, 0x6740, 0xa501, 0x65c0, 0x6480, 0xa441, 
0x6c00, 0xacc1, 0xad81, 0x6d40, 0xaf01, 0x6fc0, 0x6e80, 0xae41, 0xaa01, 0x6ac0, 0x6b80, 0xab41, 0x6900, 0xa9c1, 0xa881, 0x6840, 
0x7800, 0xb8c1, 0xb981, 0x7940, 0xbb01, 0x7bc0, 0x7a80, 0xba41, 0xbe01, 0x7ec0, 0x7f80, 0xbf41, 0x7d00, 0xbdc1, 0xbc81, 0x7c40, 
0xb401, 0x74c0, 0x7580, 0xb541, 0x7700, 0xb7c1, 0xb681, 0x7640, 0x7200, 0xb2c1, 0xb381, 0x7340, 0xb101, 0x71c0, 0x7080, 0xb041, 
0x5000, 0x90c1, 0x9181, 0x5140, 0x9301, 0x53c0, 0x5280, 0x9241, 0x9601, 0x56c0, 0x5780, 0x9741, 0x5500, 0x95c1, 0x9481, 0x5440, 
0x9c01, 0x5cc0, 0x5d80, 0x9d41, 0x5f00, 0x9fc1, 0x9e81, 0x5e40, 0x5a00, 0x9ac1, 0x9b81, 0x5b40, 0x9901, 0x59c0, 0x5880, 0x9841, 
0x8801, 0x48c0, 0x4980, 0x8941, 0x4b00, 0x8bc1, 0x8a81, 0x4a40, 0x4e00, 0x8ec1, 0x8f81, 0x4f40, 0x8d01, 0x4dc0, 0x4c80, 0x8c41, 
0x4400, 0x84c1, 0x8581, 0x4540, 0x8701, 0x47c0, 0x4680, 0x8641, 0x8201, 0x42c0, 0x4380, 0x8341, 0x4100, 0x81c1, 0x8081, 0x4040, 
};

/*void crc16t_make()
{
  uint16_t i1, result;
  uint8_t c1;

  for (i1=0; i1<256;i1++)
    {
      result=i1;
      for (c1=0; c1<8; c1++)
        if (result & 1)
          result = (uint16_t)((result >> 1) ^ 0xA001);
        else result = (uint16_t)(result >> 1);
      crc16t_lookup[i1]=result;
    }
}*/

uint16_t crc16t_update(uint16_t crc, uint8_t a)
{
  return (uint16_t)((crc>>8)^crc16t_lookup[(uint8_t)(crc^a)]);
}

uint16_t crc_memory(uint8_t *AData, uint16_t ASize)
  {
    uint16_t ui1, crc;

    crc=0xffff;

    for (ui1=0; ui1<ASize; ui1++)
      //crc=_crc16_update(crc,AData[ui1]);
      crc=crc16t_update(crc,AData[ui1]);

    return ~crc;
  }

...

    //does the crc match
    //debug
    PORTA&=~_BV(PA_S3_PIN);
    if (crc_memory(buffer+4, ui1)!=*(uint16_t*)&buffer[2])
      {
        if (debuglevel>=1)
          uart_puts("crc16 does not match\r\n");
        pc_send_error=1;
        receive_available=0;
        goto avr_ready;
      }
    //debug
    PORTA|=_BV(PA_S3_PIN);

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

>The crc16 is purely a validation on the parallel port communication

 

Maybe simply send a fixed 'start of packet' header which validates no data lines are open or stuck low. Once established the data lines work, use whatever cheap hash/checksum you want if it can be done while waiting on spi, or use none at all. I doubt all the parallel port printers in use for years that spit out tons of pages ever did any data validation, and the only problems I seem to recall relating to moving the data was a flaky cable or a loose connection (and apparently we didn't have 'start of packet' data line validation).

 

You could also provide the option to verify on the software side, like the dos VERIFY used to signal to the device driver to write with verify. Then you have made it an optional thing, and the verify is all the way to the sd card and back.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

I would say there are something wrong if lookup only about 25% than normal code.

...

 

I too was surprised how long it was taking but consider the original post...1.8ms to check 1024 bytes.

 

1.8ms sounds like a lot but it's only 1.75us per byte which, at 16Mhz, is only 28 machine cycles.

#1 Hardware Problem? https://www.avrfreaks.net/forum/...

#2 Hardware Problem? Read AVR042.

#3 All grounds are not created equal

#4 Have you proved your chip is running at xxMHz?

#5 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Here is the LSS

 

    crc=0xffff;
 22e:	8f ef       	ldi	r24, 0xFF	; 255
 230:	9f ef       	ldi	r25, 0xFF	; 255

    for (ui1=0; ui1<ASize; ui1++)
 232:	a6 17       	cp	r26, r22
 234:	b7 07       	cpc	r27, r23
 236:	89 f0       	breq	.+34     	; 0x25a <crc_memory+0x32>
      //crc=_crc16_update(crc,AData[ui1]);
      crc=crc16t_update(crc,AData[ui1]);
 238:	ed 91       	ld	r30, X+
    }
}*/

uint16_t crc16t_update(uint16_t crc, uint8_t a)
{
  return (uint16_t)((crc>>8)^crc16t_lookup[(uint8_t)(crc^a)]);
 23a:	49 2f       	mov	r20, r25
 23c:	55 27       	eor	r21, r21
 23e:	2e 2f       	mov	r18, r30
 240:	28 27       	eor	r18, r24
 242:	e2 2f       	mov	r30, r18
 244:	f0 e0       	ldi	r31, 0x00	; 0
 246:	ee 0f       	add	r30, r30
 248:	ff 1f       	adc	r31, r31
 24a:	e8 59       	subi	r30, 0x98	; 152
 24c:	fe 4f       	sbci	r31, 0xFE	; 254
 24e:	20 81       	ld	r18, Z
 250:	31 81       	ldd	r19, Z+1	; 0x01
 252:	ca 01       	movw	r24, r20
 254:	82 27       	eor	r24, r18
 256:	93 27       	eor	r25, r19
 258:	ec cf       	rjmp	.-40     	; 0x232 <crc_memory+0xa>
    for (ui1=0; ui1<ASize; ui1++)
      //crc=_crc16_update(crc,AData[ui1]);
      crc=crc16t_update(crc,AData[ui1]);

    return ~crc;
  }
 25a:	80 95       	com	r24
 25c:	90 95       	com	r25
 25e:	08 95       	ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Huh. What are the 0x98/0xFE coming from? I suppose that could be the address lookup for the table, maybe? Because that would actually sort of make sens; it looks a bit like 242-24c is computing 2 times something and adding/subtracting (the same thing really) to get a value somewhere. So some of the cost here is just that, because there's no 16-bit registers, every 16-bit operation is requiring two operations. It still seems like this should be more than a little faster than doing it separately for each bit.

 

Unless the optimizer had already figured that out and squashed it, which I guess it could potentially have done.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I was puzzled by the low speed improvement of the table method versus the bits-in-a-byte technique for an AVR8.
On a 32-bit AVR32 MCU (where an 'int' is 32-bits) the table method was about 3.5 times faster.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'll post the LSS for the traditional non table version in a little bit if that helps to see why.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'd run it in the simulator in "Goto Assembly" view so that each step is one opcode (not one C line) and watch the cycle counter in the Processor view to get an idea of where the cycles are being eaten.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

That is a good idea clawson - I'll give that a try.

 

Here is the LSS for the built in non-table function:

 

uint16_t crc_memory(uint8_t *AData, uint16_t ASize)
  {
 228:	68 0f       	add	r22, r24
 22a:	79 1f       	adc	r23, r25
    uint16_t ui1, crc;

    crc=0xffff;
 22c:	2f ef       	ldi	r18, 0xFF	; 255
 22e:	3f ef       	ldi	r19, 0xFF	; 255

    for (ui1=0; ui1<ASize; ui1++)
 230:	86 17       	cp	r24, r22
 232:	97 07       	cpc	r25, r23
 234:	d9 f0       	breq	.+54     	; 0x26c <crc_memory+0x44>
      crc=_crc16_update(crc,AData[ui1]);
 236:	fc 01       	movw	r30, r24
 238:	41 91       	ld	r20, Z+
 23a:	cf 01       	movw	r24, r30
_crc16_update(uint16_t __crc, uint8_t __data)
{
	uint8_t __tmp;
	uint16_t __ret;

	__asm__ __volatile__ (
 23c:	24 27       	eor	r18, r20
 23e:	42 2f       	mov	r20, r18
 240:	42 95       	swap	r20
 242:	42 27       	eor	r20, r18
 244:	04 2e       	mov	r0, r20
 246:	46 95       	lsr	r20
 248:	46 95       	lsr	r20
 24a:	40 25       	eor	r20, r0
 24c:	04 2e       	mov	r0, r20
 24e:	46 95       	lsr	r20
 250:	40 25       	eor	r20, r0
 252:	47 70       	andi	r20, 0x07	; 7
 254:	02 2e       	mov	r0, r18
 256:	23 2f       	mov	r18, r19
 258:	46 95       	lsr	r20
 25a:	07 94       	ror	r0
 25c:	47 95       	ror	r20
 25e:	30 2d       	mov	r19, r0
 260:	24 27       	eor	r18, r20
 262:	06 94       	lsr	r0
 264:	47 95       	ror	r20
 266:	30 25       	eor	r19, r0
 268:	24 27       	eor	r18, r20
 26a:	e2 cf       	rjmp	.-60     	; 0x230 <crc_memory+0x8>
      //crc=crc16t_update(crc,AData[ui1]);

    return ~crc;
  }
 26c:	c9 01       	movw	r24, r18
 26e:	80 95       	com	r24
 270:	90 95       	com	r25
 272:	08 95       	ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am really confused by that, I can't see how it does anything of the sort, but presumably it does.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Could

_crc16_update(uint16_t __crc, uint8_t __data)

already be optimized to not have a loop of 8 - I don't even see any branching in it...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If done in hand-crafted assembly and some loop-unrolling,

I think the lookup-table method can be done for just over 13 cycles per byte.

Six of those cycles will be LPMs. Another four will be address computation.

 

Edit: 11 cycles if the table is in SRAM.

Iluvatar is the better part of Valar.

Last Edited: Mon. Sep 14, 2020 - 03:18 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just curious....

is a crc not just a: start with 0 and XOR each next value to the end?

 

Also I wonder how busy is your CPU normally?

 

If I take the first thing as being a yes, then

inside the reception interrupt you need to know when you have your 16 bits together.

So every 2 bytes you need to do an XOR. Why not add the 16 bits to both a re-transmit buffer ( you can only forward the 1024 bytes when you know they are good I guess) and a small CRC buffer.

In main you then, outside the reception interrupt,  can start working on processing the CRC while reception is still ongoing.

I think you can perhaps even do it inside the ISR already.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

the_real_seebs wrote:

I am really confused by that, I can't see how it does anything of the sort, but presumably it does.

The user manual:

 

https://www.nongnu.org/avr-libc/...

 

    1 uint16_t
    2 crc16_update(uint16_t crc, uint8_t a)
    3 {
    4 int i;
    5 
    6 crc ^= a;
    7 for (i = 0; i < 8; ++i)
    8 {
    9     if (crc & 1)
   10    crc = (crc >> 1) ^ 0xA001;
   11     else
   12    crc = (crc >> 1);
   13 }
   14 
   15 return crc;
   16 }

 

gives psuedo C code to document what the code is doing. But in reality it is coded in Asm:

 

112	static __inline__ uint16_t
113	_crc16_update(uint16_t __crc, uint8_t __data)
114	{
115	        uint8_t __tmp;
116	        uint16_t __ret;
117	
118	        __asm__ __volatile__ (
119	                "eor %A0,%2" "\n\t"
120	                "mov %1,%A0" "\n\t"
121	                "swap %1" "\n\t"
122	                "eor %1,%A0" "\n\t"
123	                "mov __tmp_reg__,%1" "\n\t"
124	                "lsr %1" "\n\t"
125	                "lsr %1" "\n\t"
126	                "eor %1,__tmp_reg__" "\n\t"
127	                "mov __tmp_reg__,%1" "\n\t"
128	                "lsr %1" "\n\t"
129	                "eor %1,__tmp_reg__" "\n\t"
130	                "andi %1,0x07" "\n\t"
131	                "mov __tmp_reg__,%A0" "\n\t"
132	                "mov %A0,%B0" "\n\t"
133	                "lsr %1" "\n\t"
134	                "ror __tmp_reg__" "\n\t"
135	                "ror %1" "\n\t"
136	                "mov %B0,__tmp_reg__" "\n\t"
137	                "eor %A0,%1" "\n\t"
138	                "lsr __tmp_reg__" "\n\t"
139	                "ror %1" "\n\t"
140	                "eor %B0,__tmp_reg__" "\n\t"
141	                "eor %A0,%1"
142	                : "=r" (__ret), "=d" (__tmp)
143	                : "r" (__data), "0" (__crc)
144	                : "r0"
145	        );
146	        return __ret;
147	}

which is why you don't see source level annotation in the LSS.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There are no interrupts with this project - part of the parallel is supporting EPP which is has assisted handshaking done in hardware - the AVR has to respond quickly without the risk of being in an interrupt somewhere.  Essentially the AVR listens for a packet, validates it, processes it, produces a response (also with crc16), and then the PC reads the response back.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes, I understand about the inline asm, what I don't understand is how that ASM correlates at all to the pseudocode. Basically, I'm not seeing anything that corresponds to the ^0xA001 part, and I just don't understand what the inlined assembly actually does.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

alank2 wrote:
between a PC parallel port and a microSD card

How old it that PC, I've not seen a parallel port on a pc since 386 days?

 

sorry to derail the thread devil

 

 

Jim

 

(Possum Lodge oath) Quando omni flunkus, moritati.

"I thought growing old would take longer"

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Jim - pretty old - I'd say most PC's had a parallel port past the Pentium III days.  Its end was USB eventually.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I had some good ideas about this project last night and got it about half way implemented.  Testing with the crc16 being calculated in realtime with the transfer gave a speed up of about 14% over doing the two as separate processes.

 

Instead of just a read or write mode, I've implemented 5 modes which allow me to do all the crc calculations during the transfer now.  For the PC sending to the AVR, first it will use the WRITE_BUFFER command followed by an EXECUTE that supplies the crc16.  Both the sending and receiving side can calculate the crc in realtime during the transfer.  On the PC receiving side, a READ_SIZE gives the size of what can be received, then READ_BUFFER, and finally READ_CRC.  Now each mode is distinct and can have the fastest loop time without having to test when it should or shouldn't calculate a crc and so on.  I'm going to work on the PC side code tonight and then see what type of performance it has.  From there I can revisit if I need to change the crc16 to a different type of checksum to speed it up further.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It is doing much better now with the new packet process I am using:

 

SPP parallel, pcsend 316 KB/s, pcrecv 135 KB/s

PS2 (bidirectional) parallel, pcsend 322 KB/s, pcrecv 252 KB/s

EPP (hardware handshaking) parallel, pcsend 434 KB/s, pcrecv 311 KB/s

 

There is probably still room for optimizing both on the AVR side and the PC side code.

 

First thing I'll try is disabling the crc16 table code and see how much the numbers increase.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The first block is 1K being sent from the PC.

 

 

Pages