[CODE][C] Parsing strings flexibly/efficiently with Ragel

Last post
21 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

[EDIT] Follow on tutorial after this: http://www.avrfreaks.net/forum/codec-parsing-json-based-config-file-using-micro-memory-ragel

 

This is my first tutorial here so please be nice. I'm pretty new to AVRs, not so new to programming. I've fiddled with AVRs a few times over the last couple of years but only recently got UART <-> RS232 communications up and running. As part of my first project doing that I had a need to interpret strings sent over the UART. Naturally, I did this the complicated way. I guess the first question to answer is why you might need to do this. The obvious one is the example given above - receiving some form of commands in a text-based format via the UART. It's possibly overkill for simple command parsing but I really wanted to test and prove the feasibility of this technique. There seems to be a common conception that small microcontrollers just aren't fast enough or have enough memory to handle a `proper' parser. A quick search seems to indicate that people either use basic string comparisons or write small hand parsers. While it's true that most parser generators assume the availability of many KB of memory and as a result don't suit micros, hand-made parsers can also be extremely difficult to change and debug and often turn out to be less efficient than machine-generated ones. Techniques used frequently on PCs such as using the stack heavily (e.g. recursive descent parsers), pushing and popping tokens on and off possibly large stacks (e.g. Lex/YACC) and using large tables in memory to control the parser don't fit well at all with microcontroller resource limits. A while ago I came across a state machine compiler and parser generator called Ragel ( http://www.complang.org/ragel/ ), which has much lower requirements to run. It's stable, well-documented and most importantly for us a couple of its output modes are great for low-memory systems. The disadvantage is that it can be very complicated at first and almost lost me several times. I've since learned to love it though, which is why I decided to write this to take people through its capabilities, quirks and potential stumbling points. To illustrate the capabilities of the system I will go through from start to finish writing an interpreter for a tiny scripting language designed to run on an atmega8 chip. All examples are using avr-gcc but should be fairly easy to port - ragel adds no dependencies to the final code. I assume readers have good experience with C programming and understand subjects such as pointers. I also assume readers will be able to install ragel and can enter commands on the command line. Windows users can obtain ragel via the Cygwin project installer and use it via the BASH shell that is installed with it. The details of input and output to the parser will not be covered - this sort of parser could be plugged in to any communication medium. Note on versions of ragel - this article was written for ragel 6.0. Apparently some older versions may not support all of the features used here. In particular 5.3 has been identified as not supporting some options used.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Firstly, let's go over our basic layout of the module for our parser (microscript.rl):

#include 
#include 

%%{
  machine microscript;
  
  main := "";
}%%

%% write data;

static uint8_t cs; /* The current parser state */

void init_microscript( void ) {
  %% write init;
}

void parse_microscript(const char* p, uint16_t len, uint8_t is_eof) {
  const char* pe = p + len; /* pe points to 1 byte beyond the end of this block of data */
  char* eof = is_eof ? pe : ((char*) 0); /* Indicates the end of all data, 0 if not in this block */
  
  %% write exec;
}

This is a lot to take in at once - I'll just cover the basic points. When we write Ragel parsers we embed special instructions inside the host language, in this case C. A Ragel parser requires a machine (parser) definition, a block of any constants and tables needed to operate the machine, an initialisation section and an execution section. The only variable Ragel itself needs to retain between calls to know where it is the "cs" variable - the current state of the parser.

The function init_microscript will need to be called once before using the interpreter. It can then be called again later to reset it. The parse_microscript function can be called repeatedly with chunks of text to be parsed, for example loaded in chunks from EEPROM or an external source. It's perfectly fine to split half way through a line or any other unit of text - ragel will keep its place in the cs variable and pick up where it left off. It can even be run with single characters at a time, though this suffers quite a large overhead.

At the moment our machine contains nothing but we can already build it.

ragel -G2 -o microscript.c microscript.rl

This will take the file above, microscript.rl, and produce a C source file in microscript.c . The C source file will contain all our instructions from the previous file, with generated code filling in the various blocks where we told it. This file will already look pretty unreadable. I highly recommend taking a look but don't worry if it all looks like line noise - it certainly does to me. There are various ways of generating the parser. For now the -G2 (fully inlined GOTO-driven parser) mode will be the best for our purpose.

So, let's actually allow this machine to parse something now. The easiest thing to start with is numbers:

%%{
  machine microscript;
  
  number = digit+;
  whitespace = space+;
  
  main := number (whitespace number)*;
}%%

Ragel machine definitions are written as a series of available expressions made up by combining other expressions and raw inputs using operators very similar to regular expressions. digit+ means "one or more of digit", where digit is a built-in definition covering the characters 0-9. space is a built in definition covering all forms of spacing characters - spaces, tabs, new lines, etc. Finally we declare a machine - a fully set up parser declared with ":=". If we build this now we will have a fully functional machine which matches a series of numbers.

A good way of visualising what's going on is to use Ragel's support for creating GraphViz-compatible files. To generate a graph of the parser created you can run (requires GraphViz):

ragel -Vp -o microscript.dot microscript.rl
dot -Tps -o microscript.ps microscript.dot

Which should produce something like:

Ragel expresses parsers as a finite state machine, with each character coming in driving a transition from one numbered state to another. The states of the machine are illustrated as circles and the transitions between them as arrows. The circle on the left with the IN arrow leading to it is where the machine starts. If in this state it sees a digit it would change to state 3. The transitions are all labelled with the characters that will cause it to take that path. If a character is found which doesn't match a path out of the current state it would signal an error.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

We don't yet have any means of storing or using those numbers - the parser forgets about the digits as soon as it passes them. We need to add a method of storing them. Firstly let's create a global variable to store each number we come across.

static uint16_t currentNumber;

Now we need to actually record the numbers as we find them. Ragel lets us embed actions in the transitions in and out of parts of our machine. Let's take a look at how we would write those actions:

%%{
  machine microscript;
  
  action ClearNumber {
    currentNumber = 0;
  }
  
  action RecordDigit {
    uint8_t digit = (*p) - '0';
    currentNumber = (currentNumber * 10) + digit;
  }

  number = ((digit @RecordDigit)+) >ClearNumber;
  whitespace = space+;
  
  main := number (whitespace number)*;
}%%

This is entirely new so will need some explanation. The action blocks define blocks of C code to run when the action is triggered. We then embed indicators in the expressions of when these actions should be run. To make things as clear as possible I've bracketed off the bits of number where this is attached. The ">" character indicates an entering action. This means that when we encounter the first digit that starts a number this action is performed. We set the current number to 0 so we can start adding in the digits. The '@' character indicates a finishing action. That is it will be run when we have a complete "digit" expression - as we encounter each digit. This action shifts the current number up a digit and adds the current digit in the last place. We use the pointer p which ragel steps through the character block to access the current character.

Ragel usually gets the actions in the order that would be expected. As always, the best way to check is with GraphViz:

This is the same as before but with actions written on the transitions where they will be performed.

EXERCISE: For those keen to experiment, try copying and adjusting this technique so that we also count the number of spaces between numbers in a separate variable. Make sure to set it back to 0 at the start of each block of white space. Hint: instead of reading a digit you'll just be incrementing.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Now let's start fleshing out the language a bit. We can read numbers and they can be separated by spaces but that isn't really much use to us. Let's add some variables to the system and actually store our numbers. To keep things simple I'm going to give the language 26 variables called the fixed names $a through $z. Let's have a look at the definition of variable:

static uint8_t currentVariable;
static uint16_t variableValues[26];
  action RecordVariable {
    currentVariable = (*p) - 'a';
  }
  
  var = ('$' [a-z] @RecordVariable);

We set up a place to record the variable as we see it and a table to store the values. Now that we have the definition, we can change the syntax to make our program a series of assignments to variables. It's been a while and quite a lot has changed so here's the complete machine + globals:

static uint16_t currentNumber;
static uint8_t currentVariable, assignmentVariable;
static uint16_t variableValues[26];

%%{
  machine microscript;
  
  action ClearNumber { currentNumber = 0; }
  action RecordVariable { currentVariable = (*p) - 'a'; }
  action ReadVariable { currentNumber = variableValues[currentVariable]; }
  action SetAssignmentVariable { assignmentVariable = currentVariable; }
  action AssignValue { variableValues[assignmentVariable] = currentVariable; }
  
  action RecordDigit {
    uint8_t digit = (*p) - '0';
    currentNumber = currentNumber * 10 + digit;
  }
  
  var = ('$' [a-z] @RecordVariable);
  number = ((digit @RecordDigit)+) >ClearNumber;
  
  # A value is a variable or a number. If a variable is given as a value, read the number it represents.
  value = (var @ReadVariable) | number;
  
  # An assignment looks like "$a = 3" or "$b=$c"
  assignment = (var @SetAssignmentVariable space* '=' space* value) @AssignValue;
  
  # We're going to use a semicolon on the end of commands to make the boundaries clear
  command = space* assignment space* ';';
  
  # A program can consist of any number of commands
  main := command* space*;
}%%

I won't explain all the C code here - I'll leave it to you to see how that side of it works and will stick to explaining the Ragelisms. The only unusual thing is that since the currentVariable value is overwritten each time we encounter one and we can use variables as values we need to save a copy of the variable to save to in its own location.

One thing to observe is the use of "|" to match one of two machines. This allows us to say either a variable or a number goes here. Based on the text it will try to match both of these and will execute actions for whichever it appears to be. There are a few things to be careful of when combining expressions like this as until it's sure which way to go it can run actions for both possibilities. This doesn't affect our syntax as it's clear whether we have a number or variable from the very first character, so we're safe here.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The graphs of the machine are getting too big for the page now - take a look at http://www.goddard.net.nz/~tim/files/tut01/figure_assign.gif. Notice the whitespace-eating loops and the loop back to the start as it waits for another command. Also notice an intriguing quirk of this design - it doesn't wait for the number to complete before assigning and will keep updating the variable on each new digit. This is one of the major quirks and traps of Ragel - an expression such as a number is considered complete if what is collected so far makes a valid number and completion actions may be run multiple times. In this case it would do no damage but let's get rid of it anyway:

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've encountered a small forum issue with percent signs. Please substitute `/, for a percent sign in the code samples if you want to use them.

  assignment = (var @SetAssignmentVariable space* '=' space* value) `/,AssignValue;

We change the @ - the 'finished' action to a percent sign - the 'leaving' action. This will be run on the first character that follows the expression. In this case it ensures that assignment happens after the number is fully entered, not after each digit. The result is http://www.goddard.net.nz/~tim/files/tut01/figure_assign_once.gif.

So that's pretty boring - let's add IO.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
static uint8_t currentPort, outPort;
  action RecordPort { currentPort = (*p) - 'A'; }
  action RecordOutPort { outPort = currentPort; }
  action SetPortValue {
    switch(outPort) {
      case 1:
        PORTB = (uint8_t) (currentNumber & 255);
        break;
      case 2:
        PORTC = (uint8_t) (currentNumber & 255);
        break;
      case 3:
        PORTD = (uint8_t) (currentNumber & 255);
        break;
    }
  }
  action ReadPortValue {
    switch(currentPort) {
      case 1:
        currentNumber = PINB;
        break;
      case 2:
        currentNumber = PINC;
        break;
      case 3:
        currentNumber = PIND;
        break;
    }
  }
  action SetPortDirection {
    switch(outPort) {
      case 1:
        DDRB = (uint8_t) (currentNumber & 255);
        break;
      case 2:
        DDRC = (uint8_t) (currentNumber & 255);
        break;
      case 3:
        DDRD = (uint8_t) (currentNumber & 255);
        break;
    }
  }
  
  port_in = "PIN" [B-D] @RecordPort;
  port_out = "PORT" [B-D] @RecordPort;
  port_direction = "DDR" [B-D] @RecordPort;
  
  value = (var @ReadVariable) | number | port_in @ReadPortValue;
  
  assignment = (var @SetAssignmentVariable space* '=' space* value) `/,AssignValue;
  set_out = (port_out @RecordOutPort space* '=' space* value) `/,SetPortValue;
  set_ddr = (port_direction @RecordOutPort space* '=' space* value) `/,SetPortDirection;
  
  command = space* (assignment | set_out | set_ddr) space* ';';

OK... that just grew a lot. We've added special variables for ports, both input and output. These are treated as separate possible commands and execute their own actions to write and read data from the AVR ports. I copied the naming used in the C code itself so this should look pretty familiar. We could now write things like this in our micro-language:

  $a = PINB;
  PORTC = $a;
  PORTD = PINC;
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The state machine is now much larger. An image can be seen at http://www.goddard.net.nz/~tim/files/tut01/figure_io.gif. Hmm, this is actually starting to look like it could do something if not useful at least visible. To make it really useful we need some way to alter information. I know - let's try adding operators.

#define OP_ADD 0
#define OP_SUB 1
#define OP_MUL 2
#define OP_AND 3
#define OP_OR 4
#define OP_XOR 5
#define OP_SHIFTL 6
#define OP_SHIFTR 7

static uint8_t currentOp;
static uint16_t leftValue;
  action SaveLeftValue { leftValue = currentNumber; }
  
  action ApplyOperator {
    switch(currentOp) {
      case OP_ADD:
        currentNumber = leftValue + currentNumber;
        break;
      case OP_SUB:
        currentNumber = leftValue - currentNumber;
        break;
      case OP_MUL:
        currentNumber = leftValue * currentNumber;
        break;
      case OP_AND:
        currentNumber = leftValue & currentNumber;
        break;
      case OP_OR:
        currentNumber = leftValue | currentNumber;
        break;
      case OP_XOR:
        currentNumber = leftValue ^ currentNumber;
        break;
      case OP_SHIFTL:
        currentNumber = leftValue << currentNumber;
        break;
      case OP_SHIFTR:
        currentNumber = leftValue >> currentNumber;
        break;
    }
  }
  
  infix_op = (
    '+' @{currentOp = OP_ADD;}
    | '-' @{currentOp = OP_SUB;}
    | '*' @{currentOp = OP_MUL;}
    | '&' @{currentOp = OP_AND;}
    | '|' @{currentOp = OP_OR;}
    | '^' @{currentOp = OP_XOR;}
    | '<<' @{currentOp = OP_SHIFTL;}
    | '>>' @{currentOp = OP_SHIFTR;}
  );
  
  valueOrOp = value ((infix_op @SaveLeftValue value) `/,ApplyOperator)?;
  
  # An assignment looks like "$a = 3" or "$b=$c". Port output looks similar.
  assignment = (var @SetAssignmentVariable space* '=' space* valueOrOp) `/,AssignValue;
  set_out = (port_out @RecordOutPort space* '=' space* valueOrOp) @SetPortValue;
  set_ddr = (port_direction @RecordOutPort space* '=' space* valueOrOp) @SetPortDirection;

This is now a very minimal but working scripting language. It supports basic arithmetic and I/O. It doesn't support any form of conditional statements, functions or branching yet but if the source of the script supports seeking those could be added fairly easily. Incredibly the Ragel parser generated uses only 1 byte of memory in globals. Including the space for variables the interpreter uses 62 bytes of memory.

The parser compiled as above takes up just short of 4.5KB of flash. That amount of flash isn't really necessary though. Most of that is from repeating those actions throughout the code wherever they need to be run. We can change the -G2 option on the ragel command line to -G1 and it will consolidate them all in one place. This will reduce the flash requirements to under 3KB, at the cost of running very slightly slower. Sometimes this mode will also set up a small table in memory (it decided not to this time for me) - check the size of your compiled executables and decide which option to pick based on that. You could also let it generate the table then move that in to program memory or EEPROM if both space and memory are both at a premium. The other modes are largely unsuitable for machines with little memory.

A complete copy of the code developed through this tutorial has been attached. If you made it this far, congratulations. Please let me know if you notice any mistakes in this tutorial or ask for clarification on any points. Ragel can be very difficult to pick up but as you can see, the results can also be quite spectacular and it can make difficult things very easy.

Attachment(s): 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I should say - the graph of the final state machine (large file - ~200K ) is http://goddard.net.nz/~tim/files/tut01/figure_ops.gif.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi pruby,

Thanks to your tutorial I have discovered Ragel and reimplemented my hand-coded FSM code on my ATMEGA 32. Much easier!

Cheers
Tony

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Pruby, this is a great "first" tutorial, it helped me to easily understand ragel and it's potential.

Thanks.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Prior to reading this tutorial I had not heard of Ragel before.

For our hardware I needed to implement a telnet and web interface for configuration and status read-out. Both of course require some form of parser. I had started writing my own but soon came to the point where things started to get real ugly.

I spent half a day playing with Ragel and reading its manual and I must say, it's fantastic! It made implementing both the web server and the telnet interface a snap without using much memory!

Highly recommended.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

*BIG THANKS*

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks to the recommendation, I have discovered Ragel and reimplemented my hand-coded FSM code on my ATMEGA 32. Much easier! Cheers cell phone spyware Clarkson

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Very helpful, thank you!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Very interesting. The weekend just got another activity on its tentative schedule! Although I have just skimmed the firt two part it seems to be well written!

Quote:
I should say - the graph of the final state machine (large file - ~200K ) is http://goddard.net.nz/~tim/files/tut01/figure_ops.gif


I get a 404 for both of those...

EDIT: Doh! I didn't even notice that the tutorials are several years old. I thought I browsed the Tutorials often enough to know what is here, but here I failed miserably.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington]

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I get a 404 for both of those...

That's what happens when people host tutorials off-site. I suppose it was 3 years ago and I suppose no one would have noticed if Galina had not just resurrected the thread.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

https://github.com/pruby
http://www.goddard.net.nz/

This is why ALL tutorials should be hosted in their entirety on AVR FREAKS.

He does not mention Ragel on his website.

Rick

I emailed him.

His reply.

Quote:
Thanks for pointing out. Files missing from new server - will look in backups when get home.

Cheers,

Tim

Sent from my phone

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for email Rick - images have been restored. What's the best way to store images on avrfreaks for use in tutorials like this? If can be stored and served as images like that, would be happy to edit posts to load from there.

Cheers,

Tim

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

What's the best way to store images on avrfreaks for use in tutorials like this?

Here's a bit of a trick.

Go to the private messages section (for example click the "you have no new messages" link above). Send a PM to yourself, when in the message editor use "Attachement" to add up to 3 images to the post. Once sent open the PM you just got and highlight each picture in turn and "Copy Image Location". Over where you want to embed the pictures (Freaks in a second tab) past each URL in turn, highlight it and click the [IMG] button. Eh voila!

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Clawson, that sounds like a very cool trick. I would never had come up with it myself. Thanks for sharing