Building a C library for regular expressions

Thursday, 3 March 2016

Finishing up on the lexer...

In the last post I said I would verify the lexer by writing a test "main.c" which prints out a string representing each token found in the given input string. The code at the link above has moved on quite a bit since the lexer was finished, but you can roll back to commit 36781762cf14a520aa72677be824a2f029769ff8 to see the code for testing the lexer. Basically, it's just the implementation of the lex() function, which is then invoked like so:

int main (void)                                                                 
{                                                                               
    char *regex = "(.[+*A-Z])+\\+*\\.+";                                        
                                                                                
    printf("regex: %s\n\n", regex);                                             
    printf("tokens:\n");                                                        
    while ((tok = lex(&regex)) != END) {                                        
        printf("%s ", token_string(tok));                                       
    }                                                                           
                                                                                
    printf("\n");                                                               
                                                                                
    return 0;                                                                   
}

To build & execute:

NOTE: Lines preceded by $> are shell commands

$> make
$> ./regexc

And the output is:

regex: (.[+*A-Z])+\+*\.+

tokens:
LPAREN ANY CHARC_OPEN LITERAL LITERAL CHAR_RANGE CHARC_CLOSE RPAREN ONE LITERAL ZERO LITERAL ONE

Pretty straightforward. If you're still interested, you can look at the source in lex.c. Moving swiftly on ...

Parsing

The parser runs in lock-step with the lexer, calling the lex() function repeatedly and processing the input expression one token at a time. The parser also generates an intermediate representation (it's more like a 99% finished representation, really, but I'll get to that later on) of the final list of instructions for the VM.

I thought it would be fun to just try and figure this part out, instead of copying the design of someone else's parser. My first attempt at this was a recursive design, really just a big messy switch statement with no real structure. It didn't work, and I abandoned it when I realised it would never be able to properly handle nested parenthesis groups. I started over again, and got the idea into my head that I should try converting the input expressions to postfix notation, and go from there. I eventually abandoned this idea too (an explicit conversion to postfix notation, requiring a 2nd pass over the postfix version of the input expression, is really unnecessary in this case), but while researching that I came across the Shunting-yard algorithm. The Shunting-yard algorithm is a method for converting infix mathematical expressions (3 + 4) to postfix (3 4 +). Go read about it in detail if you're interested. Although I didn't end up using it, it did get me thinking about things in the right way; save all the operands (literal characters in the input expression) in a buffer until an operator (* ? + | ) is seen, and then let the operator operate on the operands saved in the buffer. Operators do stuff, they make stuff happen, without operators it's just a string of literal characters. So if it's not an operator, save it in the buffer. If it is an operator, use the items in the buffer to do the operation, and write the result to the output queue. Repeat.

Parser design

To aid explanation, I'll start with some simple expressions, and introduce a method for compiling them. Imagine for a moment that we live in a world without parenthesis groups, and the only possible expressions consist of literal characters and three operators, "? * +". Handling nested parenthesis groups, and the "|" alternation operator, will be addressed later. Take the expression "abc*de+", which needs to be converted to the following list of instructions:

0 char a
1 char b
2 branch 5 3
3 char c
4 branch 5 3
5 char d
6 char e
7 branch 6 8
8 match

In this world without parenthesis groups, we could generate these instructions from the input expression quite easily using two stacks. One stack is used as a temporary buffer, and the other will hold the final list of instructions. Here's an illustration of how these two stacks could be used to generate the instructions, as the input expression is read one character at a time:

The formula for converting this parenthesis-less expression into VM instructions, as illustrated in the diagram above, goes something like this;

read the input expression, one character at a time:

current input character = C

if C is a literal:
add a "char C" instruction to the temp. stack

if C is an operator:

move temp. stack items into output stack, until the last item N

if C is "*":
add a branch instruction, then N, the another branch, to the output stack

if C is "+":
add N, then a branch instruction to the output stack

if C is "?":
add a branch instruction, then N to the output stack

In C, an instruction is represented using the following structure:

typedef struct inst {                                                                   
    char *ccs;                                                                  
    short x;                                                                    
    short y;                                                                    
    char c;                                                                     
    uint8_t op;                                                                 
} inst_t;

A stack for holding these instructions is implemented as a doubly linked list. A singly linked list would work too, but using a doubly linked list means less traversal, which I prefer.

After compilation, the instructions are moved out of the stack structure and into an array. This will make matching faster, since any instruction in the array can be accessed by a using a fixed number of operations, regardless of the position within the array. To access an instruction in the middle of the linked list, all of the items from the list head have to be traversed until the middle instruction is reached. The x and y members inside the inst_t struct are the x and y of a "branch" instruction (only x is used in the case of a "jmp" instruction), and each one will hold an index to a particular instruction in the final array.

Notable limitation: using a signed short for x and y limits the instruction list size to 32,767. I only chose this because it can considerably reduce the total size of heap allocations on *some* machines (read: the machine I was using at the time). However this limit is unlikely to be a problem for most cases. Given that a literal character generates 1 instruction, an operator generates at most 2 instructions, and a valid operator can only occur in the input expression at most every 2nd character, then in the worst case scenario, an input expression containing as little as 24,575 characters (75% of 32,767) could hit the limit of 32,767 instructions. 24,575 characters is pretty big, but if it's not big enough, changing both x and y to an int is no big deal.

The stack structure is implemented like this:

                                                    
typedef struct stackitem {                                                              
    inst_t *inst;                                                               
    stackitem_t *next;                                                          
    stackitem_t *previous;                                                      
} stackitem_t;                                                                                                                                                                                                                                
typedef struct stack {                                                                  
    stackitem_t *head;                                                          
    stackitem_t *tail;                                                 
    unsigned short size;                                                       
} stack_t;

the following functions are then necessary for stack manipulation:

stack_t *create_stack (void);                                                   
stackitem_t *stack_add_head (stack_t *stack, inst_t *inst);
void stack_cat (stack_t *stack1, stack_t *stack2);                              
void stack_free (stack_t *stack);

create_stack() allocates a new stack_t structure and returns a pointer to it.

stack_add_head() allocates space for a new stack item, containing the instruction inst, and adds it to the head of stack.

stack_cat() points the head item of stack1 to the tail item of stack2, with the effect of appending the contents of stack2 on to the end of stack1. This is one of the reasons a doubly linked list is used; it makes the concatenation operation O(1). If a singly linked list was used, then stack_cat() would be O(n), since each item in stack2 would have to be traversed to find the tail.

stack_free() frees all allocated memory associated with stack; all instruction contents, all inst_t structs, all stackitem_t structs and finally the stack_t struct.

The implementation of these functions can be seen at the github link I posted.

Now that these basic building blocks are in place, it's time to look at handling the alternation "|" operator. Take the input expression "abc|def" as an example. The VM instructions for this expression should be as follows:

0 branch 1 5
1 char a
2 char b
3 char c
4 jmp 8
5 char d
6 char e
7 char f
8 match

The method for generating these instructions, again using two stacks, is illustrated below:

As shown in the diagram above, in order to generate instructions for an alternation operator, the position of the "jmp" expression must be saved. Then, when the end of the following expression is reached (which, in this parenthesis-less world, will be either the end of the input string or another alternation operator), the saved "jmp" instruction is set to the end of the new expression.Using this approach, where an alternation operator is not "aware" of any other possible alternation operators and simply jumps to the end of the expression following it, will work even in the case of a chain of alternations i.e. "a|b|c|d", however it will generate inefficient code for such a situation:

0 branch 1 9

1 branch 2 7

2 branch 3 5

3 char a

4 jmp 6

5 char b

6 jmp 8

7 char c

8 jmp 10

9 char d

10 match

If the input string is "a", then after matching this character the VM will have to jump to 6, then jump to 8, and finally to 10 in order to reach the match instruction, where this could have been achieved with only one jump. The longer the chain is, the more unnecessary jumps there are for expressions nearer to the beginning of the chain. This will be fixed after parenthesis groups have been introduced.

Next time: parenthesis groups!

Sunday, 27 December 2015

high-level design, instruction set, grammar, lexer implementation

What

About a month ago, I was googling around for an overview of popular methods for matching regular expressions, and stumbled across this gem. The first two posts in that series, and particularly the second one ("Regular Expression Matching: the Virtual Machine Approach"), changed the way I was thinking about regular expression matching from "this almost seems like magic, and I don't think I could ever implement it" to "this looks like fun". So I decided to try it, and write about it as I progress through it (credit to Russ Cox for all the great information about regular expression matching at https://swtch.com/~rsc/regexp-- there's a lot more to be found in other sections of the site but I haven't gone beyond regular expressions yet).

I won't waste time going over anything that Russ already laid out in his overview of the virtual machine approach to matching regular expressions- give it a read, if you haven't already. If you're a programmer of some sort with an interest in the inner workings of compilers and regular expression engines, but little or no experience (like me), then I'm confident you'll find it fascinating!

Why

I had a brief "Compiler Construction" module while studying computer science, but I have been interested ever since and have continued reading. Compilers, in their many forms, really are a major contributor to all the technological advancement that has occurred in the last 60 years. Back in 1952, when Grace Hopper (credited with writing the first working compiler, though credit must be given to those that came before her) was told that "computers could only do arithmetic", this was indicative of how a lot of people thought about computers during that time. In the '40s and '50s, the UNIVAC was the cutting edge of technology, and one of the machines that Grace Hopper did some of her compiler work on. This work led to early compiler-based languages like MATH-MATIC, FLOW-MATIC and COBOL. To get some sense of how these things were programmed before compiler-based languages, you can peruse her keynote address at the beginning of "History of Programming Languges", edited by Richard L. Wexelblat. And now, in 2015, any kid with a $9 computer can write 2 or 3 lines of Python that would blow their minds in 1952. Of course this advancement was also due to so many other things, but compilers were certainly a big part of it. The ease of creation, accessibility and portability they have increasingly provided for the last 60 years means that more and more people are involved in creating software in some fashion, and more people than ever have the tools freely available to them to pursue programming of some kind as a hobby. Which I think is a good thing, although providing tools that make it easy to write code without thinking too hard about the underlying machine does have some downsides.

So, that's why I think compilers are cool, but why regular expressions? It seems that the common techniques for compiling regular expressions are conceptually the same as those used for compiling a more complicated language like C. The virtual machine approach lets us create a simpler, self-contained version of the whole process of lexing, parsing, code generation and finally execution, by creating a simple instruction set for a simple machine which knows nothing but regular expressions. It's an opportunity to learn a little about all of these things, albeit in a much simpler & scaled down version.

DISCLAIMER: I am not a compiler expert. This is just for fun.

Code

https://github.com/eriknyquist/regexvm

The plan

I want to build a C library that functions similarly to the POSIX regexp library, using Russ Cox's 'Virtual Machine Approach' overview as a loose guide, so that I could do something like this:

#include <stdio.h>
#include "vmregexc.h"

int ret;
const char *input_string = "aabbccdd 0.01 33 1023.3302 qwerty"
const char *regex = "([0-9]+(\.[0-9]+)?)+";
vmregex_t compiled;
char result[10];


void main (void)
{
    /* Compile the expression, and print an (informative)? error
     * message if it fails. */
    if ((ret = vmregex_compile(regex, &compiled)) != 0) {
        fprintf(stderr, vmregex_strerror(ret));
    }

    /* Match the compiled expression against an input string */
    if(vmregex_match(&compiled, input_string)) {
        fprintf(stdout, "Match!");
    } else {
        fprintf(stdout, "No match.");
    }
}

Where vmregex_compile compiles the expression and returns a list of instructions, and vmregex_match is the VM, which takes an input string and executes the instructions.

Initially, this implementation will have fewer features than the POSIX library, most notably:

support for only the following regular expression constructs & metacharacters:
"[ ]" character classes
"( )" parentheses, for grouping
"*" zero or more
"+" one or more
"?" zero or one
"." any character
"\" backslash, for escaping metacharacters
No support for providing information about which parts of the string matched the regex, or submatches.

I expect to eventually implement some of the extras once I have a solid compiler running. Part of the the draw to using the VM approach is that implementing new features is relatively easy, once the system for compiliation and executing code on the VM is in place.

I have devised an instruction set that should allow me to handle all of the symbols & constructs laid out above, which is really just the same set from Russ Cox's link with a 'class' instruction added for character classes. The VM will be implemented in the vmregex_match() function, which accepts a list of these instructions, and returns a 0 or 1 for a match or no match, depending on whether any threads reach the 'match' instruction.

just like in Russ Cox's tutorial, SP (string pointer) represents a pointer to the current position in the input string, and IP (instruction pointer) is a pointer to the next instruction to be executed:

char x : if character at SP matches x, this thread succeeded

any : matches any character-- this thread succeeded

class x y z ... : if character at SP matches any of the
listed characters, this thread succeeded

branch x y : queue a new thread with IP=x, and
another with IP=y

jmp x : queue a new thread with IP=x

match : the input string matches

Some examples of compiled regular expressions

To make things clearer, I'll show a few regular expressions and what their compiled instructions should look like, using the instruction set given above:

Regex:

(aa)+b

Instructions:

 0    char 'a'                                                                    
 1    char 'a'                                                                   
 2    branch 0 3                                                                
 3    char 'b'                                                                   
 4    match

Regex:

a*bc

Instructions:

 0    branch 1 3                                                                
 1    char 'a'                                                                   
 2    branch 1 3                                                                
 3    char 'b'                                                                   
 4    char 'c'                                                                   
 5    match

Regex:

(ab+c)*d?

Instructions:

0    branch 1 6
1    char 'a'
2    char 'b'
3    branch 2 4
4    char 'c'
5    branch 1 6
6    branch 7 8
7    char 'd'
8    match

Regex:

([0-9]+(\.[0-9]+)?)+

Instructions:

0    class 0123456789
1    branch 0 2
2    branch 3 6
3    char '.'
4    class 0123456789
5    branch 4 6
6    branch 0 7
7    match

So how the hell do I get started?

It's pretty clear that the first thing that needs to get done is the vmregex_compile function. That's a big job, since this function needs to do a lot-- lexical analysis, parsing, and finally code generation, to return the final list of VM instructions. I'm not really sure about anything parser-related right now, not sure what kind of parser I'll need to use, how complicated it will be, whether I'll need to d convert the input to postfix (RPN) first. So, for now, I'm going to start with implementing a lexer (scanner, tokenizer), and once that is in place I'll investigate the parser.

Before writing any code, I'd like to have a grammar defined for valid expressions. I don't technically need the grammar to write the lexer, but I'd like to have it in front of me anyway so I don't lose track of things. After a couple of (wrong) attempts at doing one myself from scratch, I found this example of a BNF grammar for regular expressions. I removed the parts I don't need to get this:

<RE> := <union>, <simple>
<union> := <RE> | <simple>
<simple> := <concat>, <basic>
<concat> := <simple><basic>
<basic> := <zero>, <one>, <onezero>, <elem>
<zero> := <elem>*
<one> := <elem>+
<onezero> := <elem>?
<elem> := <group>, <any>, <literal>, <class>
<group> := (<RE>)
<class> := [<items>]
<items> := <item>, <item><items>
<item> := <range>, <literal>

I could feed this grammar into lex & yacc, or some similar set of tools, to generate the lexer and parser code for me. But that's no fun. So what follows is sort of a plan for the first bit, the lexer.

You might notice that the above grammar has some nonterminal symbols not defined- that is, they don't appear on the left hand side of any productions- namely, <any>, <literal> and <range>. That's because these are tokens that will be returned by the lexer. <any> and <literal> are very simple; <any> is just a single "." character, and <literal> is any printable ASCII character (or escaped metacharacter) that is not inside a character class. <range> is a character range (i.e. "A-Z"). Since the lexer will take care of recognising these simple constructs, I'll take them for granted in the grammar. I'll just start populating a header file for the lexer while I'm at it, and you can see the full list of tokens in the file below;

#ifndef LEX_H_
#define LEX_H_

/* lexer return tokens */
enum {CHARC_OPEN, CHARC_CLOSE, CHAR_RANGE, LPAREN, RPAREN, ONE,
      ZERO, ONEZERO, ANY, LITERAL, INVALIDSYM, END, STOP};

/* internal lexer states */
enum {STATE_START, STATE_LITERAL, STATE_RANGE, STATE_DEREF,
      STATE_END};

int lex (char **input);
char *get_token_text (void);

#endif

The first enum is the list of tokens that will be returned by the lexer. The second enum is all the states that (I'm pretty sure-- may have to add a couple when it comes to actually code it) the lexer's state machine will use internally. I've also added two function declarations. These functions will act like the yylex() function and yytext pointer provided by lex/flex. Each time the lex() function is called, it eats up characters on the input string until a valid token is found, and returns that token. It will take a pointer to a pointer to the input string so that it may simply increment the same pointer passed by the caller, until the input string is finished, and avoid allocating extra memory inside lex() to store a copy of the input string.

The get_token_text() function serves the same purpose as lex's yytext, except that it's a function instead of a char pointer, and it does allocate memory on the heap to store a copy of the portion of the input string that makes up the current token. I originally did it this way because there will be some cases where I don't need the characters in the input stream, because the token returned by the lexer is enough (any of the metacharacters, for example. If the lexer says ZERO, the I know the current character in the input string is a '*'-- I don't need to look). Therefore I can avoid allocating extra memory by only calling the function when I need to. But when I looked at how this is implemented in lex, I realised of course I didn't need to allocate memory at all, I can just provide a pointer to the original input string, like lex does.
BUT! there is one thing about yytext that I don't like, and which I'd like to do differently. As any lex users would know, yytext is a pointer to the portion of the input string that corresponds to the current token returned by yylex(). But of course you don't want to print the whole remainder of the input string, only the part that constitutes the current token, so to achieve this lex sticks a null terminator right in the string to achieve the desired length, having recorded the character it just clobbered so it can repair the string when it needs to. Yuck. I think I'd rather just keep track of the length, so I know how many characters to read, however I won't truly know the best way to do this until I'm doing the parser. For now, I'll implement get_token_text() the crappy way (allocating new memory), and then I can fix it once I know exactly what the parser will need to do with these snippets.

To illustrate how the lexer should move through particular states to return a corresponding token, I've drawn a simplified state diagram (note that token names are in red, and input characters are in green):

7Vxbc9q4F/8s+5CZ3Z1JxrZ845GktN2ZbNohmflv/y8dBwR4YyzWOLd++h7Z52DLF0iC7RIuD2AdZEnW79x14IRdzJ8+Rd5i9rcY8+DE0MZPJ+zDiWHozNLgQ1KeU4rdYylhGvlj7JQRrv0fHIl43/TeH/Ol0jEWIoj9hUociTDko1iheVEkHtVuExGosy68Kc6IE0rC9cgLaB1nVkb/nz+OZyndNeyM/pn70xnNrdu99Jtbb3Q3jcR9iDOeGGySvNKv5x6NhTM/aWkTb3/GJjORsPBCZek/hJgrhIgvV/uHY058XBa2b0U05lFKIlrgh3f5XWIDADQSAu6UV/OnCx5IUAkvvG0ZP9MmnbDzWTwPoKHDZfL1x5qbcanrb4ZHjXiorLxuPENHHHIj8jFAiE0RxTMxFaEXDDLqeYIKl0PAJuen509+/A+S5fU3eQ0sAK0wjp6Tr7Jm8i00/uVx/Iy8693HAkjZvJdCLJSHk8sr8GD2tLQ/4j4aYS9Dc5HxvWjKsZ+hJVyBm/KJizmHBSVMEHix/6DO4CG+01U/vLUfRZ68izoshB/Gy9zIXyVBMieKL4KP0sx6CjpwkQ5IrdzKMlKCYA2aJmqHZtGka4mXvO4Er22hKe09c9TNtzVs0xApf+BdCiqvx6EdqUpF590jYbi9s57uuJpr6PJTN1VgWHvA9Jy2BSQF6X0CY5JxIiTMFpFArXxEohIJRzvTci+4uyPV1UN36YhLFS6WsRaX9jQX08quX8OO2js2KCQPiIOFflUrOCDkRxzKOOgUrpI8UBi6AYfXutHFeUw9NSarKGdDfwtdwab8bqa17nfvZhSVem5b8JEiievMEk3/4AX3FPkadgALOZ+IBOts7+3/7mX0nXxxmsb0feiga4un7Eu4msrP65v+zeA7vA9vaEBYTDpm2qMMbRBA/kRC+DjzY3698JJdeoQUTlU0/sCjmBMn1mx2eQtJtZELRiEjsfJjlkzRSf3NcokUtyB6b9r13p7EMY0E+2VnoHfmMluzLcthjqPZOAAi5RBSLThp6QO1Jg0fBsPBx12UBrInpMep3Ykw2PsRO1K0rShyux2HwHZUmXBbcghsp5BCQMXVWGKth+O/c0W4tb2uUYSFvOZLYX4LEsS+Lam+y79uBsP+5S4qv5W2Iya30Dp3ov0oNVS/7yMRiPRsJNtaZtujkVxTRtqMj2FW4UOz3a4IFqx6BVNGfiF0AISk5zBKDoC826SDlFjUOtDbgsk+AEWKJB4SyRu8wJ/KQ6WAT+RQEl0fzr76SI6luJ4vgSf8cHqTyO6p2QwvpPo6C4BMPGbLsQIBlucEom3HCfUSCM8aViIMh2T8lFaSYKyf2etQfrEsG1WjeKHcq0UE+CV4GloowlN4VG808yJvFMMhnpHnHpVTVuT0eVTy2H8oknblsUH84GBUfUrND5dwBixBlzIaeEuwo295biArj75P4mTZqEpJnhgOochThUAVE+RvEijKGlYIVAWzrdO2k4lMSpYZ4/Jrfzi42oz76wBexpG44xe4EJAwaQcnfhAUSITrCNCQp+clZOf+eJx4TFVGVEDvSSAegTKDfhxuaCSUwJxRKaeeg5y4Io94Map7G+KYL3qlMdW01ozp7wdrSnVbTeKbJNNd2FKGdnt3GOGPg2UEQ/+VjFCf0WnIBgyPNqAs+lQshrqdIqUubIBZ70Y3hPjF5/7w4vuXr0fUFdQZ1R6uUMfEXg51OkBsHvWdU/hSER+qymeOqgBMhk5hFyrfRLZrTwH0r74dJT9v4S3c8tX5dDlt0p6+33iC2bXknx2uq0fVZytXD12/LuTewuXukAlg+dO+w2KFlcpfYwIoQdM8KyDbtWcCvlwN/j8YfjmaAcXqF1I/FdJf5QBS4ns7yNG+7I70/3m4su/+QjNAZ6btyf5R8It4GwX/Tyv7f1WCr+Nt2+G9e2bfONzIz2LFE58OzT5lnFs1+0fJz8OtqXDbRlnyGaKSh5sOaraDu5zpiyPuSVBDuc2Abe68FqaSPyqFhgej7DksanmYTf7wpooWsp3bwVLviTV17LqhmOggczBqxs2hg48uUjBUNXNEvEvEbbNYGF0urqiC3EC0tqtgb+WXGZ0XbjqJYmy+cLNYwq4ApUOZUdWv3Vqo6kyfr7WqzmH/6lO9Q7RDNZ0unUF3UtNJVdO5XYc/Alnez2W52ILzO5CB1DnZex2lqQXkLoV8+cgQXcY8DPSfAtvBUD4TlJsPlN9Shvfm8tHD26X82LNCVvoDEnIBHVTYeePQVhzmdJeLY2vLPlex9+nhVjRTlLSmur21gNwpH8WiBMrGvsugSQUQ5KBRqqoTGezuWPQogxuTYgVGoBqlLmTQra+MaaoCbq0XeJBpMYqxMo//ZRURjcRmbr35Pf6Q5PhDknepQQs/eO1Vnig280MSaGZ/M5gG3dmfSLLBTw==

Given the state diagram above, if the input string "(.[+*A-Z])+\+*\.+" were supplied, then the lexer should produce the following stream of tokens:

LPAREN, ANY, CHARC_OPEN, LITERAL, LITERAL, RANGE, CHARC_CLOSE, RPAREN, ONE, LITERAL, ZERO, LITERAL, ONE

So you can see that this lexer will do a little bit more than just splitting up the input string. It has to keep track of when it's inside a character class, so that a metacharacter (e.g. "+") outside a character class is recognised as such (e.g. ONE token), but inside a character class it is a LITERAL. Additionally, it will filter out any "\" characters (so they will not appear in strings returned by get_token_text()), while recognising it as a command to treat the following character as a LITERAL.

I'm going to finish implementing the lexer, and then I'll add a test main() to call the lex() function with an input string, and print out the stream of tokens. In the next post, I'll use this to verify that the lexer works like I expect it to, and start into the parser.