[CODE][C] Parsing a JSON-based config file using micro-memory with Ragel

Go To Last Post
3 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

A long time ago I wrote a tutorial on using Ragel for writing parsers compatible with the resource constraints of AVRs. This tutorial is available here:

 

http://www.avrfreaks.net/forum/codec-parsing-strings-flexiblyefficiently-ragel?page=0&file=viewtopic&t=80042

 

This tutorial follows on from the lessons in that, to the use of Ragel for parsing some more complicated structures. We will be developing a parser for a JSON-based config file, supporting strings and recursive data structures.

 

To start out, let’s start a machine:

 

%%{

 machine json;

 access parser->;

}%%

 

One thing we’ve done differently here to the last tutorial is that “access” line. What it means is that Ragel will access its state via a pointer called “parser” - we can pass this in to the functions:

 

 

typedef struct {

 uint8_t cs;

} jsonish_config_parser;

%% write data;

void init_jsonish_config_parser(jsonish_config_parser* parser)

{

 %% write init;

}

void parse_jsonish_config(jsonish_config_parser* parser, const char* p, uint16_t len, uint8_t is_eof )

{

 const char* pe = p + len; /* pe points to 1 byte beyond the end of this block of data */
 const char* eof = is_eof ? pe : ((char*) 0); /* Indicates the end of all data, 0 if not in this block */

 %% write exec;

}

 

This lets us store our parser state wherever we like, including on the stack. By allocating a parser on the stack, we use its memory only when actively in use, not full time. An example of using this would be:

 

 

void test_parser() {

 jsonish_config_parser parser;

 char buffer[READ_BLOCK];

 uint16_t length;

 init_jsonish_config_parser(&parser);

 while (length = read_something(buffer, READ_BLOCK)) {

   parse_jsonish_config(buffer, length, 0);

 }

 parse_jsonish_config(buffer, 0, 1);

}

 

This would call the read_something method to fill the buffer (returns the length read), and while it succeeds feed it to the parser. If read_something returns zero, it signals an end-of-file. The parser’s memory is only in use during the test_parser call.

 

 

Now on to the actual format.

 

The basic JSON types are an object, an array, a number, a string, and a few special values. The format of these is defined on the JSON home page pretty thoroughly. We’re going to make a couple of major changes for today:

 

  • We’ll only support integers, not floating point or e notation.

  • We’re ignoring the true, false and null special values. They’re not hard, but add complication to this tutorial.

 

Let’s try numbers and strings to start:

 

%%{

 machine json;

 access parser->;

 

 escape = ‘\\’;

 number = (‘-’? (‘0’ | ([1-9] digit*))) >start_number;

 escaped_character = escape ([“/bfnrt] | escape | (‘u’ [0-9a-fA-F]{4}));

 string = ‘“‘ ((any - (‘“‘ | escape)) | escaped_character)* ‘“‘;

 value = number | string;

 

 main := value;

}%%

 

Firstly the numbers. These have an optional negative sign, and at least one digit with no excess 0s on the front. Proper JSON allows, but we don’t, an optional decimal portion, and an optional exponent. We’re not actually recording the numbers yet, but the pattern will match them.

 

 

We wrap the strings in quote characters (“) so these are explicitly excluded from the characters in the string. We use a back-slash as an escape character, so \” is a literal quote in the string, and \\ is a literal backslash. There are a few allowed other escape codes, which we’ll replace with spaces for now (to keep this simpler).

 

The strings in JSON are very clearly specified. For the quote or back-slash characters to appear in a string, they have to be part of an escape sequence. In ragel, we can match anything but these characters with the expression:

 

any - (‘“‘ | escape)

 

So this compiles, and the state machine looks like:

 

json1.png

 

Now we get on to something harder. Let’s try storing those numbers and strings. For the numbers, I only really want integer support, so I’m going to ignore the fractional portion, and drop parsing the exponent entirely (will error if provided with one). Let’s set up actions for that:

 

Our data structure gets some number fields:

typedef struct {

 uint8_t cs;

 uint16_t current_number;

 int8_t current_sign;

} jsonish_config_parser;

 

 

And our machine gets some actions:

 

 

 action start_number { parser->current_number = 0; parser->current_sign = 1; }

 action see_negative { parser->current_sign = -1; }

 action see_digit { parser->current_number = parser->current_number * 10 + (*p - ‘0’) ; }

 number = ((‘-’ @see_negative)? (‘0’ | ([1-9] @see_digit (digit @see_digit)*)) (‘.’ digit+)?) >start_number;

 

We then have a number stored as (current_number * current_sign).

 

 

Strings are a little harder - we’re going to need to store a buffer:

 

#define MAX_STRING 32

typedef struct {

 uint8_t cs;

 uint8_t state;

 uint16_t current_number;

 int8_t current_sign;

 char current_string[MAX_STRING];

 uint8_t current_string_length;

} jsonish_config_parser;

 

Our machine gets some string actions:

 

 

 action start_string { parser->current_string_length = 0; }

 

 action see_character {

   if (parser->current_string_length < MAX_STRING) {

     parser->current_string[parser->current_string_length++] = *p;

   };

 }

 

 action see_character_replace_space {

   if (parser->current_string_length < MAX_STRING) {

     parser->current_string[parser->current_string_length++] = ' ';

   };

 }

 

 …

 escaped_character = escape (‘“‘ @see_character | [/bfnrt] @see_character_replace_space | escape @see_character | (‘u’ @see_character_replace_space [0-9a-fA-F]{4}));

 string = ‘“‘ %start_string ((any - (‘“‘ | escape)) @see_character | escaped_character)* ‘“‘;

 

This keeps the current string in a buffer for us. We stop recording if we hit the end of our buffer - only capture the first MAX_STRING bytes of the string. We turn most escape codes and encoded characters in to a space.

 

 

Our machine now looks like:

 

json2.png

Now let’s move on to arrays. They might seem like a simple definition:

 

array = '[' space* value space* (',' space* value space*)* ’]';

value = number | string | array;

 

However, if we try to compile, we’ll get errors like this:

 

 

json.rl:29:22: graph lookup of "value" failed

json.rl:29:47: graph lookup of "value" failed

 

The reason is that we can’t have recursive definitions. array refers to value refers to array - and it’s not compatible with state machine parsing.

 

Ragel provides us with a way around this, by allowing us to explicitly jump around the state machine chart and keep track of where we’ve been. We need to create another couple of fields on the parser structure:

 

#define MAX_JSON_DEPTH 8

typedef struct {

 uint8_t cs;

 uint8_t stack[MAX_JSON_DEPTH];

 uint8_t top;

 ...

} jsonish_config_parser;

The stack and top variables let us call in to sub-parsers like functions. Before we continue, let’s add a security guard against this overflowing:

 

enum {

 PARSE_OK,

 PARSE_FAILED

};

void init_jsonish_config_parser(jsonish_config_parser* parser)

{

 %% write init;

 parser->state = PARSE_OK;

}

 

And in the Ragel machine:

 

 prepush {

   if (parser->top >= MAX_JSON_DEPTH) {

     parser->state = PARSE_FAILED;

     fbreak;

   }

 }

 

This checks that we have space on our call stack. We use a new field, “state” on the parser to signal when we’ve failed. The init function should set this to PARSE_OK, and any failure condition sets this and executes the special Ragel command “fbreak” - immediately breaks out of the parser loop.

 

 

We split the inner body of the array in to its own machine by using := instead of = to assign it. All extra machine definitions with := should appear at the end, just before the "main :=" assignment.

 

 array_inner := space* value space* (',' space* value space*)* ’]';

 

And jump in to it when we see an opening bracket using Ragel’s “fcall” function:

 

 action start_array { fcall array_inner; }

 array = ‘[‘ @start_array

 

When we finish the array inner, we return with Ragel’s “fret” function:

 action return { fret; }

 array_inner := space* (value space* (',' space* value space*)*)? ’]' @return;

 

These let us parse arrays in arrays - it uses a stack of state numbers to remember where we came from and how deep we are.

 

The machine now looks like:

 

json3.png

 

We have two copies of most of the inner definitions - one in a top level context and one inside an array, when we’re looking for a closing bracket from an array. On seeing an opening array bracket in either of these, we jump to the start of the array_inner machine, remembering where to return to using the stack.

 

The object notation is similar:

 

 action start_object { fcall object_inner; }

 key_value = string space* ':' space* value;

 object = ‘{‘ @start_object;

 object_inner := space* (key_value space* (',' space* key_value space*)* space*)? '}' @return;

We can now parse our way through numbers and strings, arrays and objects, but still aren’t actually using them. We want an interface that limits the required resources, so let’s make a few decisions:

 

  • The string keys for objects will either be in a known list, or we use a marker for an unknown string.

  • We track our position as an array of breadcrumbs, containing both array offsets (0, 1, 2) and codes for known strings.

  • We assign high breadcrumb numbers to known strings, and assume we won’t get large arrays where we expect objects (if so, so be it).

  • We call back in to a function with our current breadcrumbs and the value whenever we see a final number or string value. These call back functions could interpret the breadcrumbs to set configuration values.

 

We add our breadcrumbs to the parser state:

 

typedef struct {

 uint8_t cs;

 uint8_t stack[MAX_JSON_DEPTH];

 int8_t breadcrumbs[MAX_JSON_DEPTH];

 ...

} jsonish_config_parser;

 

This will line up perfectly with the stack push/pop setting parser->top, so we don’t need a separate counter for this list. We already have “top”, which is always one past the end of the stack.

 

 

For arrays, we start at zero:

 

 action start_array { if (parser->top < MAX_JSON_DEPTH) parser->breadcrumbs[parser->top] = 0; fcall array_inner; }

 

and we add one to the top breadcrumb after each item:

 

 

 action array_next_item { parser->breadcrumbs[parser->top-1] += 1; }

 array_inner := space* (value (space* ',' @array_next_item space* value)*)? ']' @return;

 

For objects, we store an identifier for each known string, so we don't have to keep those buffers hanging around in valuable memory:

 

 

#define UNKNOWN_STRING 255

#define CATS 200

#define DOGS 201

#define KITTENS 202

#define PUPPIES 203

…

 action start_object { if (parser->top < MAX_JSON_DEPTH) parser->breadcrumbs[parser->top] = UNKNOWN_STRING; fcall object_inner; }

 known_string =

    "cats"i %{ parser->breadcrumbs[parser->top-1] = CATS; }
    | "dogs"i %{ parser->breadcrumbs[parser->top-1] = DOGS; }
    | "kittens"i %{ parser->breadcrumbs[parser->top-1] = KITTENS; }
    | "puppies"i %{ parser->breadcrumbs[parser->top-1] = PUPPIES; }
    ;

 object_key = string | ('"' known_string '"');

 key_value = object_key space* ':' space* value space*;

 …

 object_inner := space* (key_value (',' space* key_value space*)* space*)? '}' @return;

 

When we see a known string, the top breadcrumb will be set. This way, the breadcrumbs always point to where we are in the structure, and can be matched in code.

 

We can add some callback functions to the parser for numbers and strings respectively:

 

void seen_number(const jsonish_config_parser* parser, uint8_t *breadcrumbs, uint8_t depth, int32_t number) {

 ...

}

void seen_string(const jsonish_config_parser* parser, uint8_t *breadcrumbs, uint8_t depth, const char* string, uint8_t string_length) {

 ...

}

 

And call these from our parser:

 

 

 action seen_number {

    int32_t actual_number = parser->current_number;

    actual_number *= parser->current_sign;

    seen_number(parser, parser->breadcrumbs, parser->top, parser->current_number * parser->current_sign);

 }

 action seen_string { seen_string(parser, parser->breadcrumbs, parser->top, parser->current_string, parser->current_string_length); }



 value = number %seen_number | string @seen_string | array | object;

 

With a debug print statement in those callbacks, a file like:

 

 

{

 "cats": [{"name": "bob", "kittens": []}, {"name": "sheila", "kittens": [12, 16, 17]}],

 "dogs": [{"name": "kane", "puppies": [12, 16, 17]}, {"name": "dog"}]

}

 

Now produces callbacks:

 

Seen string 200 0 255 : bob
Seen string 200 1 255 : sheila
Seen number 200 1 202 0 : 12
Seen number 200 1 202 1 : 16
Seen number 200 1 202 2 : 17
Seen string 201 0 255 : kane
Seen number 201 0 203 0 : 12
Seen number 201 0 203 1 : 16
Seen number 201 0 203 2 : 17
Seen string 201 1 255 : dog

 

The breadcrumbs 201 0 203 2 can be interpreted as DOGS/0/PUPPIES/2 - mixed strings and array indexes. Note that “name” isn’t a known string, so shows up with the UNKNOWN_STRING code we defined as 255, and would usually be ignored. Known strings become the high numbers, array offsets are the low numbers.

 

There are a few deficiencies of this approach - we can’t tell whether we had nothing, or an empty array or object, but this would usually be enough for some configuration.. We would enter all our configuration keys as known strings, and just match the breadcrumb numbers associated with them.

 

Long arrays could potentially overlap with hashes at the same level, so this imposes a maximum array length in sensitive contexts. I’ve chosen not to trigger an error under these conditions, as there are plenty of circumstances where we know whether to expect an array or hash.

 

All up this parser takes 56 bytes of memory including its 32-byte string buffer. This memory can also be allocated on the stack only when needed. It’s capable of parsing our subset of JSON to a depth of 8 places, with each string up to 32 characters.

 

Example code, which can be compiled on a host machine (uses printf statements for debugging) is attached. This is easily dropped in to an AVR context and compiled there.

Attachment(s): 

Last Edited: Mon. Oct 6, 2014 - 10:51 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Note: example code must be compiled with Ragel first:

 

mv jsonish.txt jsonish.rl
ragel -G2 jsonish.rl
gcc -o jsonish jsonish.c
jsonish test.json

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I havent taken all the information in yet, but many thanks for writing this tutorial. I'm sure i can make good use of it. I had looked at writing a simple json parser by hand but now you've shown me some automated tools to make the job easier and less error prone. Kudos.