Ever since I started working with microcontrollers, I've been on a continuous quest to push the limits of what you can do with a small, 8-bit CPU running at 20 MHz: a Mandelbrot renderer, a full web and file server, a 3D raycasting engine, an N-body simulation. But I have the feeling that I won't be topping this one anytime soon. I've basically succeeded in creating a bit of C code to train and run large artificial neural networks....that can run on my ATmega1284P.
In this post, I'm going to talk about the hardware required, give a (hopefully) brief explanation of how the C code works, and lastly train a neural net from scratch on my ATmega1284P to recognize images of handwritten digits, to prove that this does indeed work. (I feel like I should explain what artificial neural networks even are, but I'd be here all day if I tried. Instead, I'll just say: if you have any questions after reading this post, feel free to ask them and I'll do my best to answer them.)
Anyways, let's get into it.
Now, it was obvious that the 16 KB of my ATmega1284P's internal SRAM wasn't going to be nearly enough for running neural nets. In fact, getting enough memory was such a large problem that it prevented me from even starting this project for a long time. 32 KB SRAM ICs just won't cut it, and chaining 40 of them together obviously wasn't a good idea either. Luckily, a few weeks ago, while buying components for another project, I stumbled across the single IC that made this project possible: the Lyontek LY68L6400SLIT. A single, 8-pin IC that provides 8 megabytes of fast SRAM (up to about 144 Mbit/s read/write) through a simple SPI port. And best of all, it only costs 80 cents apiece. That solved my memory problem. The only obstacle left was that the IC comes in an SMD package and runs at 3.3 volts, but a combination of an SMD-to-THT adapter and a simple level converter solved that pretty easily too. One tip though: don't be like me and assume that resistor voltage dividers are good enough as level converters. They work fine as long as only the one IC is connected to the SPI bus, but once I added another device to the bus, things started getting weird. Let's just say that my hardware setup still isn't 100% stable to this day.
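For the curious, here's roughly what talking to such an SPI RAM looks like. This is a sketch, not the actual project code: I'm assuming the 23LC-style command set the LY68L6400 speaks in plain SPI mode (0x03 for read, 0x02 for write, followed by a 24-bit big-endian address), and the bus layer below just records the bytes it would send so the sketch runs anywhere; on the real AVR, `spi_transfer()` would poke the SPI hardware registers and `ram_select()` would toggle a chip-select pin.

```c
#include <stdint.h>
#include <stddef.h>

#define CMD_READ  0x03   /* 23LC-style command set, which the LY68L6400 */
#define CMD_WRITE 0x02   /* also understands in plain SPI mode          */

/* Stand-in bus layer: records every byte "sent" so the command header
   can be inspected. On the real AVR these would use the SPI registers
   and a chip-select pin instead. */
static uint8_t trace[64];
static size_t  trace_len;

static uint8_t spi_transfer(uint8_t b)
{
    if (trace_len < sizeof trace) trace[trace_len++] = b;
    return 0xFF;                          /* dummy data from the "chip" */
}
static void ram_select(void)   { trace_len = 0; }   /* CS low  */
static void ram_deselect(void) { }                  /* CS high */

/* Read `len` bytes starting at a 23-bit external address: one command
   byte, a 24-bit big-endian address, then clock the data out. */
void ram_read(uint32_t addr, uint8_t *buf, uint16_t len)
{
    ram_select();
    spi_transfer(CMD_READ);
    spi_transfer((uint8_t)(addr >> 16));
    spi_transfer((uint8_t)(addr >> 8));
    spi_transfer((uint8_t)addr);
    while (len--) *buf++ = spi_transfer(0xFF);
    ram_deselect();
}
```

Writing is symmetric: send CMD_WRITE, the same three address bytes, then the payload.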
Next, I needed a place to store my datasets; after all, I was going to be training a neural net. An SD card was a pretty obvious choice. Nothing much to say here.
This is where things get interesting. Now, this wasn't my first time writing code to train a neural net, and I know the maths behind it like the back of my hand, but the fact that I needed to store literally everything in external memory made things a lot more difficult. I also ended up restricting myself to implementing only simple "dense" layers. Theoretically, one could implement convolutional layers as well, but those already give me a headache under normal circumstances, and they weren't required for what I wanted to accomplish for now, so I skipped them. That left me with just the dense layers. Luckily, they boil down to a matrix-vector multiplication, a vector-vector addition, and the application of a simple mathematical function to the result, operations for which it is also conveniently easy to implement memory caching. And since the ATmega1284P has 16 KB of internal SRAM, I was able to cache entire vectors and rows of matrices, which provides an insane speedup at the cost of about 70% of the MCU's SRAM. That leaves just enough for....well, everything else.
I also had to implement backpropagation, which basically means I needed a bit more code to compute the derivatives of the operations involved, but by that point I was familiar enough with the SRAM IC to finish this step quite easily.
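The backward pass for a dense layer is the mirror image of the forward pass. Here's a sketch under the same assumptions as before (sigmoid activation, toy sizes, plain in-memory arrays instead of the external SRAM; all names are mine, not the project's):

```c
#define N_IN  2
#define N_OUT 2

/* Given dL/d(activation) in `dout`, compute the weight and bias gradients
   and the error to pass to the previous layer. Uses the handy identity
   sigmoid'(x) = s * (1 - s), where s is the activation already computed
   during the forward pass. */
void dense_backward(const float *W,     /* N_OUT x N_IN, row-major    */
                    const float *in,    /* layer input, N_IN          */
                    const float *out,   /* layer activation, N_OUT    */
                    const float *dout,  /* dL/d(out), N_OUT           */
                    float *dW, float *db, float *din)
{
    for (int j = 0; j < N_IN; j++) din[j] = 0.0f;
    for (int i = 0; i < N_OUT; i++) {
        float delta = dout[i] * out[i] * (1.0f - out[i]); /* chain sigmoid' */
        db[i] = delta;                                    /* dL/db          */
        for (int j = 0; j < N_IN; j++) {
            dW[i * N_IN + j] = delta * in[j];  /* dL/dW: outer product */
            din[j] += W[i * N_IN + j] * delta; /* dL/din = W^T * delta */
        }
    }
}
```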
Then I added a few data structures to keep track of network parameters in the external SRAM and to actually assemble entire neural nets out of the dense layers I had just implemented (at this point, I'd just like to mention how ridiculously weird it is to have to use 32-bit pointers for memory addressing on an 8-bit CPU).
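As an illustration of what such bookkeeping could look like, here's a sketch where "pointers" into the 8 MB external SRAM are plain 32-bit offsets (the AVR's native pointers are only 16 bits wide, so they can't address external memory directly). Both the names and the back-to-back layout scheme are my assumptions, not the actual code:

```c
#include <stdint.h>

typedef uint32_t ext_ptr;   /* offset into external SRAM, not a real pointer */

typedef struct {
    uint16_t n_in, n_out;
    ext_ptr  weights;       /* n_out * n_in floats */
    ext_ptr  biases;        /* n_out floats        */
} dense_layer;

typedef struct {
    uint8_t     n_layers;
    dense_layer layers[8];
} network;

/* Lay out the parameters of consecutive layers back to back in external
   SRAM, starting at byte offset `base`; returns the first free offset. */
ext_ptr network_alloc(network *net, ext_ptr base)
{
    ext_ptr p = base;
    for (uint8_t i = 0; i < net->n_layers; i++) {
        dense_layer *l = &net->layers[i];
        l->weights = p;  p += (ext_ptr)l->n_out * l->n_in * sizeof(float);
        l->biases  = p;  p += (ext_ptr)l->n_out * sizeof(float);
    }
    return p;
}
```

Even the small MNIST net described later (784 inputs, 25 hidden units, 10 outputs) needs about 78 KB of parameters, comfortably past what a 16-bit pointer can reach.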
Lastly, I needed some code to actually train a neural network using gradient descent, given a dataset and a cost function. This is where I had the most trouble. Let's just say that debugging something this complex on a microcontroller is almost impossible. I still don't know whether my code works 100% as intended, but it works well enough, so I guess I'll be fine with that.
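The gradient-descent update itself is the easy part; the difficulty is everything around it. Stripped of all the external-memory plumbing, one step is just this (a sketch, not the project code: in the real thing both arrays would live in external SRAM and be streamed through a small internal buffer):

```c
/* One plain gradient-descent step: nudge every parameter against its
   gradient, scaled by the learning rate `lr`. */
void sgd_step(float *params, const float *grads, int n, float lr)
{
    for (int i = 0; i < n; i++)
        params[i] -= lr * grads[i];
}
```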
The final result was a long bit of code that is easy to use on the surface but an absolute MESS underneath. But I was finally ready to test it.
The Final Test
In order to test what I had just created, I simply implemented the "Hello, World!" of neural networks: a net to solve the MNIST dataset. The MNIST dataset is a collection of 10,000 grayscale images of handwritten digits, each 28 by 28 pixels. To solve it, you have to train an artificial neural network to take any of these images as its input and output the digit shown in the image. The net is given these 10,000 images plus the correct answers and has to use them to learn how to solve the problem. Afterwards, the net is tested on images it hasn't seen before, and its accuracy is measured as the number of correctly identified images divided by the total number of images.
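That evaluation, spelled out as a quick sketch (names are illustrative, not from the project): the network's ten outputs are treated as scores, the highest score is the predicted digit, and accuracy is the fraction of predictions that match the labels.

```c
#include <stddef.h>

/* Index of the largest of n scores: the predicted class. */
int argmax(const float *v, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (v[i] > v[best]) best = i;
    return best;
}

/* Fraction of images whose predicted class matches the label.
   `outputs` holds n_images consecutive rows of n_classes scores. */
float accuracy(const float *outputs, const unsigned char *labels,
               int n_images, int n_classes)
{
    int correct = 0;
    for (int i = 0; i < n_images; i++)
        if (argmax(outputs + (size_t)i * n_classes, n_classes) == labels[i])
            correct++;
    return (float)correct / (float)n_images;
}
```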
My setup was a relatively small net using only two dense layers with 25 hidden units in between. On my PC, this trains to 80%+ accuracy in about half a second. In comparison, and to no one's surprise, the ATmega1284P is abysmally slow. I only had enough time to let it run for about 20-30 minutes, and in that time it completed about 1% of the work my PC did in half a second. But when I tested the now partially-trained neural net from the microcontroller, it showed an accuracy of 20%, which may not seem like much, but the highest accuracy I managed to achieve using a random number generator to classify the images was 5%. So....it worked! I had just successfully trained an artificial neural network on my ATmega1284P.
Now, at this point you may be wondering how all of this could be useful. Well, even though training neural networks on an ATmega is quite slow, running them isn't. The net from my test was able to classify the 100 images in the test set in just a few seconds. This means it would be completely possible to import a pre-trained net and use it for some practical purpose. For example, it should be perfectly possible to connect some kind of camera to an SMD version of the microcontroller, load a neural net trained for image recognition, and create the world's smallest and cheapest live image-recognition system. I do believe there's a lot of potential here, and I will be trying to actually create something useful with this in the near future.
The code is available in this zip file. It's still really messy, and I wouldn't trust anything inside trainer.c at ALL (I'm 100% sure it'll break with anything that isn't my test network). I've also included the dataset used for the final test (data.dat).