AES encrypt/decrypt/key expansion released

Go To Last Post
27 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I want to contribute my AES assembler code to the project forum. First I thought I should release it here for more extensive testing. It does not have a slick configuration interface (you have to set equates). However, its documented and someone may find it useful. Its free and I would appreciate any feedback/testing before moving it into the project forum.

It is just the basic block routines (ECB style). It is not a full boot loader, so treat it like an AES library. It is developed on the AT90CAN128 and will require porting for other chips (the ATmega128 may not be any problem at all). There are comments about this in the code.

So far it has passed the Advanced Encryption Standard FIPS PUB 197 document published tests for 128, 192 and 256 bit key schedule expansion and the 128 bit encryption test. It has some rough edges, but if I waited for it to be perfect it would never be released :wink:.

I found a problem already. I accidentally placed exe_spm in the non-FLASH key schedule expander code. I fixed the directive.

BTW, it appears you have to login to see the attached program zip file.

Last Edited: Sat. Apr 29, 2006 - 08:05 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Michael,
I am busy making an AVR cryptographic liberty for the GNU CC compiler.
I have made de DES and RSA part in assembly and is small and fast,
I have some code for a SHA-1 but that will need a lot of work before I will release that part.
But can I used you AES ?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The comment in the AES.ASM file explains its free for use, including commercial use. The only thing I want is to retain credit for the initial version in this source file. Anyone adding to it or modifying it should add their name as another author. You can put your version under GPL if you like. The original version will stay as is and have less requirements than even GPL.

There is a little more information in the comments about porting to different chips. My requirement of having the lower two memory address bits be zero (MOD 4 addressing) for some tables could be interesting if a C compiler or linker assigns the memory address. I suppose you could always go 3 bytes larger table size than needed and bump the address up to the first MOD 4 = 0 address.

There is also a comment about the memory size in the SubBytes and Inverse SubBytes tables (it could still stand some improvement here).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

After thinking about it, I decided to rewrite the routines that required the MOD 4 addressing boundaries (as a bonus they are slightly smaller and faster now). Except for the FLASH page boundary, now it should be happy to work with whatever SRAM address its given. I also forgot to start the version number with an x for the pre-release versions. Here is the improved program.

I found a redundant andi instruction in the new code. Its not a bug, it just takes an extra CPU cycle for no good reason. I will hold off fixing it until the next release. Hopefully after some testing feedback.

Last Edited: Sat. Mar 18, 2006 - 12:41 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks, Mike.

Andreas

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Keeping in mind that this is a work in process on its way to release, I found a bug. When the ENCRYPT is not used and KEY_GEN conditional assembly directive is used, the aes_sub_xy table required for key expansion was not included in the assembly.

I added another embellishment. The new aes_mix_col and aes_invmix_col routines now have a faster large FLASH memory footprint version (the LARGE_MEM directive selects which ones are used). I also got rid of the separate SRAM k_box[Nk] table. The initial key is now stored at the beginning of the SRAM key schedule expansion table. The other stuff is included in the comments.

So far I am not aware of any bugs except the one mentioned above. There have been more embellishments than I expected, but I guess thats what this process is all about. For example, I never gave any real thought to using it with a C complier until the question was raised. After posting this version, I will try and leave it alone until I get some feedback.

Last Edited: Sat. Apr 1, 2006 - 02:34 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The current AES routines have already satisfied my boot loader needs, however it bugs me that the FLASH memory usage is still relatively high. The best solution I have found is trading off SRAM usage for FLASH usage. The SRAM solution generates the aes_log_xy table and aes_alog_xy table with a software algorithm and saves them in SRAM (the FLASH versions of these tables are not loaded or used). This takes 512 bytes of SRAM. Then the software overwrites the SRAM aes_log_xy table with the inverse finite field multiplication values (using both tables to generate the inverse function). Then the second table SRAM storage space is not needed and is released, so only 256 bytes of SRAM are needed for the finished inverse finite field multiplication table.

This allows the aes_sub_xy table and aes_invsub_xy table to be removed from FLASH for a much more reasonable small memory footprint. The aes_sub_bytes, aes_invsub_bytes and aes_key_expand code can all use the new SRAM table.

I have not been able to find or figure out a reasonable way to calculate a single inverse finite field multiplication on the fly, so trading off temporary use of the SRAM instead of a table in FLASH seems to be the best available solution to reduce the FLASH memory usage and still have the program run reasonably quickly.

I have been testing this version and it has been working. It does require running the code to generate the inverse finite field multiplication table first before running any key generation, encryption or decryption code. As long as the table memory in SRAM is not disturbed it can continue to be used. The extra time needed to generate the table is a small one time penalty. Using the table is a simple lookup, so it runs very efficiently. The only other thing that makes it run slower than the current aes_sub_bytes, aes_invsub_bytes and aes_key_expand code is the extra code to apply the affine to the inverse finite field multiplication result (really not all that much code, but more than my original routines).

It is possible to continue the process and convert the SRAM inverse finite field multiplication table into an aes_sub_xy table and/or an aes_invsub_xy table. Aside from the one time penalty to generate the table or tables, the program would run slightly faster then the original large memory footprint version. However, I can not think of a good reason to do this.

This new SRAM based implementation is most useful for trying to squeeze AES into a smaller AVR chip. Keep in mind that AES can also be used for pseudo random number generators, stream cyphers, hash functions and other uses beyond the boot loader. So, a squeezed down version may also be useful on larger AVR chips.

If there is any interest in this even smaller FLASH memory footprint version, I will include it in the next release?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

emilevanderlaan wrote:
Michael,
I am busy making an AVR cryptographic liberty for the GNU CC compiler.
I have made de DES and RSA part in assembly and is small and fast,
I have some code for a SHA-1 but that will need a lot of work before I will release that part.
But can I used you AES ?

RSA on an AVR, that sounds interresting!

But how long does a 1024bit private key operation take on a 16MHz AVR (10-20sec?)?
And how much SRAM does it need?

Bertolt

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just a progress report. I wrote a quick test program for the small FLASH footprint SRAM based version. Without any optimization, using AES with a 256 bit key for decryption only, it was 830 bytes (415 words) of FLASH program code (this includes the 32 byte key and the setup/test code). It reported using 733 bytes of SRAM. However, 256 bytes of SRAM were used one time only for table generation, so the sustained SRAM usage is only 517 bytes (16 byte s_box, 256 byte inverse FFmult table, 240 byte expanded key and 5 bytes of variables). The key schedule expansion code had to be included since the key is expanded in SRAM to save on FLASH space.

It will not take long to clean it up and also release this small FLASH footprint version, however I do not see any point if no one shows any interest. Please let me know if anyone wants it.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

For the RSA you will need about 5 * Key size for encryption and 4 * the Key size.
Encrypt is about 1500 bytes of flash, if you will use only decrypt that is about 1050 bytes.
Timing is 63 sec encrypt for 512 bits and 169 ms for a decrypt ( exponent 3 ).
What I expect is that a 1024 bits encrypt will take +/- 250 second.
That is @8MC
I have a small bug at the moment that the 1024 bits part is not working. But I will fix this week.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I had to do some rework to the 16 bits index counters in the RSA, that is needed for a 1024 bits RSA,
The good thing is that it has become a bit faster and smaller.
At the moment I am at 69ms for 512^3 @ 16 MC and 25.65 sec for 512^512 @16MC
And I also have the 1024^3 working @ 264 ms@16MC I did not test the 1024^1024 but I expect that will take +/- 100 sec @16 MC.
I still have to clean you the code before I will release it and do some additional testing.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

emilevanderlaan:

Are you using CRT (Chinese remainder theorem) in our implementation?
If not, it might help you to speed up your implementation.

http://en.wikipedia.org/wiki/Chi...

Bertolt

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

So far I have had zero feedback on the actual pre-release code. My AES needs changed and I developed an implementation that uses SRAM for all four tables. This also resulted in an even smaller memory footprint implementation. Even though no one asked for it, I will include it since I have already done the work. Without any feedback I will assume the code is working and bug free. Unless there is a bug report, this one will eventually become the release “project” version.

The new code is broken into three different directories. On with primarily FLASH usage, one with primarily SRAM usage and one with a minimum memory usage footprint. The small version is a little slower, but has the most potential. I hope someone finds something useful in them. There is a new “readme.txt” file in the zip parent directory.

Last Edited: Mon. Apr 3, 2006 - 11:40 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I have put the AVRCrypolib on my site
www.emsign.nl

And I will start adding the AES

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry, I've just had my AVR from hell day and I'm running way behind. As usual, Roland Kruse from ATMEL set me straight on the AVR assembler problems I was having (thanks Roland). In response I cleaned up my pre-release code and went to test it. Well, nothing worked. My encryption and decryption failed and I started debugging against the AES test suite. I was getting really confusing debug results. So, I went back to the x1.3 release and that did not work either. Now its panic time, it appears I released bad code even though I remembered testing it first. Well I finally found out my old AT90CAN128 test chip has worn out its FLASH. Some FLASH bits deep in the expanded key schedule went bad. I replaced the chip and am now laughing about it. I'm also trying to ignore those guys in white suits with that really long sleeved jacket with the straps and buckles they want me to try on :).

Here is a cleaned up x1.4 version. The x1.3 version is still workable unless you set the KEY_FLASH on the small memory footprint implementation. I'm also including a simplified release notes text file.

Last Edited: Fri. Apr 7, 2006 - 07:04 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I did a very quick timing test on the AES encrypt and decrypt with a 256 bit key and a 16 byte state array. Since it ran under JTAGICE control using a stop watch to time it, I expect the absolute values of the recorded times to be useless, so only relative times are reported. The tests ran 5,000 iterations (the equivalent of 80,000 bytes) of first encrypt and then decrypt (I did not time the SRAM table generation periods). The data was only loaded one time into the 16 byte state buffer. BTW, the final data was correct after being run through 10,000 total cycles, as it should be.

What I found out was the FLASH version is only very, very slightly slower than the SRAM version at 5,000 iterations. The small memory footprint version is slightly over 2.1 times slower than the other two versions. These results were the same for both encrypt and decrypt. Still, for what the small version gives up in speed it makes up for by being under 400 words of FLASH (including a 240 byte expanded key schedule from a 256 bit key stored in FLASH) in a decrypt only version.

The practical difference between the FLASH and SRAM versions is just which type of memory you have to commit to the program.

The small version is just slower and much smaller than the other two versions.

In the real world the rate at which data is fed into the program can have as large or larger impact on the program run time.

Decryption is about 1.6 times slower than encryption for all versions.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Do you have any cycle counts for encryption/decryption with different key sizes? And perhaps some details of which parts of the AES iterations use more cycles?

Rein Anders Apeland

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I quickly used the AVRstudio simulator run to cursor to get CPU cycle counts for the various AES functions. The small memory footprint implementation and SRAM intensive implementation were checked. Both the Small and SRAM tests used an AES block size of 16 bytes and an expanded key size of 240 bytes stored in FLASH.

Small     SRAM
-----    -----
52477    24598  - Encrypt Total
  460      206  - SubBytes
   77       77  - ShiftRows
 3226     1355  - MixColumns
  196	   196  - AddRoundKey

87260    40041  - Decrypt Total
  508	   206  - InvSubBytes
   77       77  - InvShiftRows
 5850     2543  - InvMixColumns
  196      196  - AddRoundKey

You can see the most amount of time is spent in MixColumns and InvMixColumns. Both of these use Finite field Multiplication (FFMult). In fact the InvMixColumns uses two more FFMult operations per s_box[] byte than MixColumns does.

The key size is not that big a factor because every encryption or decryption round always uses the same number of blocks as the state buffer (s_box[]). The difference is the number of blocks (the size of the s_box[]) is combined with the block size of the key to lookup the number of rounds for each complete encryption or decryption cycle. Each round uses the exact same s_box[] size key bytes from the expanded key schedule no matter what the original key size is. This expanded key schedule size per round will only change when you increase the s_box[] size from 16 bytes to 24 bytes or 32 bytes (AES only allows 16 bytes). A 128 bit AES key will use 11 rounds total. A 192 bit AES key will use 13 rounds total. A 256 bit AES key will use 15 rounds total. BTW, the first and last round are not the same. You can add up the CPU cycles for each of the 4 individual routines (plus 5 cycles for loop control), multiply that number by the difference in rounds which is for 15 rounds in the tables above and subtract that from the total number of CPU cycles to get the difference for a different key size. For example: to calculate the SRAM version decryption difference between a 128 bit key and the 256 bit key in the table (206 + 77 + 2543 + 196 = 3,022) then add 5 CPU cycles for the program round loop control (3,022 + 5 = 3,027) next (15 rounds – 11 rounds = 4) which is (4 x 3,027 = 12,108) cycles difference for (40,041 - 12,108 = 27,933) total Decrypt cycles with a 128 bit key. You can replace the table 40,041 with 27,933 for a 128 bit AES key.

The total CPU cycles for encryption or decryption does not exactly match the totaled number of cycles of the individual routines because there is other “house keeping” program code not included in the individual counts.

I found an unimportant bug (it does not cause any problems other than wasting a CPU cycle) in all three implementations SubBytes code. There is a “clr state_cnt” that was not removed when I changed the loop from a count up to a count down. It will add one cycle if its not removed. It will be corrected in the released version.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

DUH! I had my head stuck too far up the Rijndael/AES documentation and missed something really important :oops:. The aes_ffmult routine is only called by MixColumns and InvMixColumns. The values being multiplied are always {02}, {03}, {0E}, {0B}, {0D} or {09}. The small memory footprint aes_ffmult routine was shifting through all 8 bits when it only needed to shift through 4 bits maximum. So, the aes_mix_col and aes_invmix_col had the aes_ffmult val_a and val_b assignments swapped and the aes_ffmult loop counter was reduced from 8 to 4. This simple change took the small memory footprint implementation down to 37501 CPU cycles for encryption and 57308 cycles for decryption. Now the SMALL to SRAM comparison is:

37501 to 24598 – encrypt

56308 to 40041 – decrypt

Next I'm going to analyze the overall shift pattern per s_box[] byte and see if there is a more efficient/direct way to do the matrix multiplications in MixColumns and InvMixColumns.

Now the 2.1 times slower estimate is thrown out the window. When the new routines are finished I will get a new speed measurement.

Sorry, it looks like there will be a x1.5 pre-release after all.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The cycle count for the MixColumns subroutine seems unneccessarily high. The following C-code snippet performs the operation in 355 cycles, using full optimization with the IAR compiler. Perhaps you can draw some tips from it. It's part of a soon-to-be-released AVR Appnote.

-Rein

//! Perform an AES column mix operation on the 4 bytes in 'column' buffer.
static void mixColumn( byte * column )
{
  byte result0, result1, result2, result3;
  byte column0, column1, column2, column3;

  // This generates more effective code, at least
  // with the IAR C compiler.
  column0 = column[0];
  column1 = column[1];
  column2 = column[2];
  column3 = column[3];

  // Partial sums (modular addition using XOR).
  result0 = column1 ^ column2 ^ column3;
  result1 = column0 ^ column2 ^ column3;
  result2 = column0 ^ column1 ^ column3;
  result3 = column0 ^ column1 ^ column2;

  // Multiply column bytes by 2 modulo BPOLY.
  if( column0 & 0x80 ) {
    column0 = (column0 << 1) ^ BPOLY;
  } else {
    column0 = (column0 << 1);
  }
  if( column1 & 0x80 ) {
    column1 = (column1 << 1) ^ BPOLY;
  } else {
    column1 = (column1 << 1);
  }
  if( column2 & 0x80 ) {
    column2 = (column2 << 1) ^ BPOLY;
  } else {
    column2 = (column2 << 1);
  }
  if( column3 & 0x80 ) {
    column3 = (column3 << 1) ^ BPOLY;
  } else {
    column3 = (column3 << 1);
  }

  // Final sums stored into original column bytes.
  column[0] = result0 ^ column0 ^ column1;
  column[1] = result1 ^ column1 ^ column2;
  column[2] = result2 ^ column2 ^ column3;
  column[3] = result3 ^ column0 ^ column3;
}



//! Perform AES column mixing for the whole AES block/state.
static void mixColumns( byte * state )
{
  mixColumn( state + 0*4 );
  mixColumn( state + 1*4 );
  mixColumn( state + 2*4 );
  mixColumn( state + 3*4 );
}

Rein Anders Apeland

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

raapeland,

You are absolutely correct about MixColumns and InvMixColumns being waaaay to cumbersome. While my original program was technically correct (as in it worked), I did not realize how badly it was in need of optimization until your first message got me to look at the timing. Thank you for your question. Too bad there is not an emoticon for a swift kick in my rear end :).

As mentioned in my last post, I broke down the multiplication through repeated shifts method and applied it to the specific values {02}, {03}, {0E}, {0B}, {0D} and {09}. This yielded a matrix of the sequence and minimum number of operations required for each finite field operation. My matrix results match the sequence in your C code. I just finished testing this in the SRAM version and got these results:

SRAM implementation CPU cycle results:

10467 encrypt
13222 decrypt

268 - MixColumns
480 - InvMixColumns

The aes_ffmult code has been deleted from the source as it is no longer needed. BTW, it turns out the entire MixColumns operation is used as the first half of InvMixColumns. This was not obvious to me until I saw the matrix breakdown results.

Getting rid of aes_ffmult and using this new code could have a major impact on the need for 3 different implementations. I will be making the changes and evaluating the effects today. I should post a new x1.5 version tonight. Thank you for you help.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A major optimization has rendered all previous versions of this code obsolete. Version x1.5a is the newest and hopefully last release candidate.

The MixColumns routine now takes only 260 CPU cycles (down from a high of 3226).

The InvMixColumns routine now takes only 472 CPU cycles (down from a high of 5850).

The CPU cycles are now:

FLASH:
10587 - encrypt
13342 - decrypt

SRAM:
10371 - encrypt
13126 - decrypt
15375 - One time table generation

Small:
13927 - encrypt
17354 - decrypt
8452 - One time table generation

You can see there is not much difference in execution time between the different implementations any longer. There is also not that much FLASH memory usage difference now for decrypt only versions. The small footprint still has an advantage if you want both encrypt and decrypt functions. The SRAM and then the FLASH implementations would still be the choice for real time oriented response applications.

Last Edited: Sat. Apr 29, 2006 - 08:04 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Since there has been very little feedback and no bug reports for the current version, I will submit it as a version 1.0 release AVR project tonight.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The Rijndael / Advanced Encryption Standard (AES) Toolkit is now an AVR Project.

https://www.avrfreaks.net/index.p...

Just search for project - "aes", type - General Lib.functions and complier - AVR Assembler.

A counter in the key generation code was reduced from 16 bits to 8 bits, some comments were cleaned up and some spell checking was done since the last x1.5a pre-release version.

Now that the released 1.0 source code is in the projects area, I am removing all the pre-release source code from this thread.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi,

I read this fascinating forum dedicated to AES Toolkit development.

Just curious...where's the 'C' code for this AES toolkit?

Thanks in advance,

Robert

Robert J. Rademacher
866.394-5409 Fax
607-844-8134 Voice

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Mike B wrote:
Just search for project - "aes", type - General Lib.functions and complier - AVR Assembler.
The reason I referred to AVR Assembler as a complier is I had no choice because that is the fixed name of the search field on the projects page. The projects tab with this search page can be found from the main AVRfreaks page. My assembler source code versions of AES for AVRs can be found there.

One source for C source code is the ATMEL application note AVR231: AES Bootloader. This is a complete AVR bootloader program that has AES source code in C (I have no involvement with this program or application note).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Dear Mike,

Thanks for Atmel's AES details. I would like to use AES for encrypting/decrypting the small data packet (32-64 bytes) to be sent through Ethernet. It would be much easier to use C for portability sake.

You know, someone has developed the "low byte" AES in 'C'. See this web site:

http://fp.gladman.plus.com/AES/i...

The AES 'C' source code developed by Gladman seems not to be very effective.

In fact, I have written RC4 encryption 'C' routine for AVR, which works great.

If you like, I can post the RC4 C source code here. This code runs pretty fast and uses small memory.

Robert

Robert J. Rademacher
866.394-5409 Fax
607-844-8134 Voice