Sunday 27 December 2015

high-level design, instruction set, grammar, lexer implementation

What

About a month ago, I was googling around for an overview of popular methods for matching regular expressions, and stumbled across this gem. The first two posts in that series, and particularly the second one ("Regular Expression Matching: the Virtual Machine Approach"), changed the way I think about regular expression matching from "this almost seems like magic, and I don't think I could ever implement it" to "this looks like fun". So I decided to try it, and to write about it as I progress (credit to Russ Cox for all the great information about regular expression matching at https://swtch.com/~rsc/regexp -- there's a lot more to be found in other sections of the site, but I haven't gone beyond regular expressions yet).

I won't waste time going over anything that Russ already laid out in his overview of the virtual machine approach to matching regular expressions -- give it a read, if you haven't already. If you're a programmer of some sort with an interest in the inner workings of compilers and regular expression engines, but little or no experience with them (like me), then I'm confident you'll find it fascinating!


Why

I had a brief "Compiler Construction" module while studying computer science, and I've been interested ever since, so I've kept reading about the subject. Compilers, in their many forms, really are a major contributor to all the technological advancement that has occurred in the last 60 years. Back in 1952, when Grace Hopper (credited with writing the first working compiler, though credit must be given to those who came before her) was told that "computers could only do arithmetic", it was indicative of how a lot of people thought about computers at the time. In the '40s and '50s, the UNIVAC was the cutting edge of technology, and one of the machines that Grace Hopper did some of her compiler work on. This work led to early compiler-based languages like MATH-MATIC, FLOW-MATIC and COBOL. To get some sense of how these machines were programmed before compiler-based languages, you can peruse her keynote address at the beginning of "History of Programming Languages", edited by Richard L. Wexelblat. And now, in 2015, any kid with a $9 computer can write 2 or 3 lines of Python that would have blown minds in 1952. Of course this advancement was also due to many other things, but compilers were certainly a big part of it. The ease of creation, accessibility and portability they have increasingly provided over the last 60 years means that more and more people are involved in creating software in some fashion, and more people than ever have the tools freely available to pursue programming of some kind as a hobby. Which I think is a good thing, although providing tools that make it easy to write code without thinking too hard about the underlying machine does have some downsides.

So, that's why I think compilers are cool, but why regular expressions? The common techniques for compiling regular expressions are conceptually the same as those used for compiling a more complicated language like C. The virtual machine approach lets us create a simpler, self-contained version of the whole process of lexing, parsing, code generation and finally execution, by defining a simple instruction set for a simple machine which knows nothing but regular expressions. It's an opportunity to learn a little about all of these things, albeit in a much simpler, scaled-down form.

DISCLAIMER: I am not a compiler expert. This is just for fun.


Code


https://github.com/eriknyquist/regexvm




The plan

I want to build a C library that functions similarly to the POSIX regex library, using Russ Cox's 'Virtual Machine Approach' overview as a loose guide, so that I can do something like this:


#include <stdio.h>
#include "vmregex.h"

int ret;
const char *input_string = "aabbccdd 0.01 33 1023.3302 qwerty";
const char *regex = "([0-9]+(\\.[0-9]+)?)+";
vmregex_t compiled;


int main (void)
{
    /* Compile the expression, and print an (informative)? error
     * message if it fails. */
    if ((ret = vmregex_compile(regex, &compiled)) != 0) {
        fprintf(stderr, "%s\n", vmregex_strerror(ret));
        return 1;
    }

    /* Match the compiled expression against an input string */
    if (vmregex_match(&compiled, input_string)) {
        fprintf(stdout, "Match!\n");
    } else {
        fprintf(stdout, "No match.\n");
    }

    return 0;
}

Where vmregex_compile compiles the expression and returns a list of instructions, and vmregex_match is the VM, which takes an input string and executes the instructions.

Initially, this implementation will have fewer features than the POSIX library, most notably:

  • Support for only the following regular expression constructs & metacharacters:
        "[ ]" character classes
        "( )" parentheses, for grouping
        "*" zero or more
        "+" one or more
        "?" zero or one
        "." any character
        "\" backslash, for escaping metacharacters
  • No support for providing information about which parts of the string matched the regex, or submatches.

I expect to eventually implement some of the extras once I have a solid compiler running. Part of the draw of the VM approach is that implementing new features is relatively easy once the system for compiling and executing code on the VM is in place.

I have devised an instruction set that should allow me to handle all of the symbols & constructs laid out above; it's really just the set from Russ Cox's article, with a 'class' instruction added for character classes. The VM will be implemented in the vmregex_match() function, which accepts a list of these instructions and returns 1 for a match or 0 for no match, depending on whether any thread reaches the 'match' instruction.

Just like in Russ Cox's tutorial, SP (string pointer) represents a pointer to the current position in the input string, and IP (instruction pointer) is a pointer to the next instruction to be executed:

char x                : if the character at SP matches x, this thread
                        continues (SP and IP both advance); otherwise
                        the thread dies

any                   : matches any character -- SP and IP advance,
                        and this thread continues

class x y z ...       : if the character at SP matches any of the
                        listed characters, this thread continues;
                        otherwise the thread dies

branch x y            : queue a new thread with IP=x, and
                        another with IP=y

jmp x                 : queue a new thread with IP=x

match                 : the input string matches
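
Though I haven't settled on the real data structures yet, here's a rough sketch of how these instructions might be represented in C. All of the names here (opcode_t, inst_t and friends) are hypothetical, just to make the idea concrete:

/* One possible encoding for VM instructions (hypothetical sketch).
 * Each instruction carries an opcode plus whatever operands it
 * needs; unused fields are simply ignored. */
typedef enum {
    OP_CHAR,      /* char x          */
    OP_ANY,       /* any             */
    OP_CLASS,     /* class x y z ... */
    OP_BRANCH,    /* branch x y      */
    OP_JMP,       /* jmp x           */
    OP_MATCH      /* match           */
} opcode_t;

typedef struct {
    opcode_t op;
    char c;        /* operand for OP_CHAR */
    char *ccs;     /* characters for OP_CLASS (NUL-terminated) */
    int x;         /* first target for OP_BRANCH, target for OP_JMP */
    int y;         /* second target for OP_BRANCH */
} inst_t;

A compiled program is then just an array of inst_t, with the branch/jmp operands being indices into that array.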



Some examples of compiled regular expressions

To make things clearer, I'll show a few regular expressions and what their compiled instructions should look like, using the instruction set given above:

Regex:

(aa)+b

Instructions:
 0    char 'a'
 1    char 'a'
 2    branch 0 3
 3    char 'b'
 4    match
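
To see how the VM would run this, take the input "aab": one thread executes char 'a' at 0 and char 'a' at 1, then branch 0 3 queues a thread back at instruction 0 (to try another "aa") and a thread at instruction 3. The first of those dies when it sees 'b' where it expects 'a', while the second consumes the 'b' and reaches match.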

Regex:

a*bc

Instructions:
 0    branch 1 3
 1    char 'a'
 2    branch 1 3
 3    char 'b'
 4    char 'c'
 5    match






Regex:

(ab+c)*d?

Instructions:
0    branch 1 6
1    char 'a'
2    char 'b'
3    branch 2 4
4    char 'c'
5    branch 1 6
6    branch 7 8
7    char 'd'
8    match


Regex:

([0-9]+(\.[0-9]+)?)+

Instructions:

0    class 0123456789
1    branch 0 2
2    branch 3 6
3    char '.'
4    class 0123456789
5    branch 4 6
6    branch 0 7
7    match
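
To make the execution model concrete, here's a very rough sketch of the thread-queue loop I imagine living inside vmregex_match(), reusing the hypothetical inst_t type sketched earlier. It uses a crude fixed-size queue and has no duplicate-thread detection, so it's only meant to illustrate the idea, not to be the final implementation:

#include <string.h>

/* A thread is just an instruction pointer plus a string pointer. */
typedef struct { int ip; int sp; } thread_t;

/* Run 'prog' against 'input'; return 1 if any thread reaches the
 * 'match' instruction, 0 otherwise. */
static int vm_run (inst_t *prog, const char *input)
{
    thread_t queue[1000];   /* crude fixed-size thread queue */
    int head = 0, tail = 0;

    queue[tail++] = (thread_t) {0, 0};

    while (head < tail) {
        thread_t t = queue[head++];
        inst_t *i = &prog[t.ip];

        switch (i->op) {
        case OP_CHAR:
            /* character matches: advance SP & IP; otherwise die */
            if (input[t.sp] == i->c)
                queue[tail++] = (thread_t) {t.ip + 1, t.sp + 1};
            break;
        case OP_ANY:
            if (input[t.sp] != '\0')
                queue[tail++] = (thread_t) {t.ip + 1, t.sp + 1};
            break;
        case OP_CLASS:
            if (input[t.sp] != '\0' && strchr(i->ccs, input[t.sp]))
                queue[tail++] = (thread_t) {t.ip + 1, t.sp + 1};
            break;
        case OP_BRANCH:
            queue[tail++] = (thread_t) {i->x, t.sp};
            queue[tail++] = (thread_t) {i->y, t.sp};
            break;
        case OP_JMP:
            queue[tail++] = (thread_t) {i->x, t.sp};
            break;
        case OP_MATCH:
            return 1;
        }
    }
    return 0;
}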


So how the hell do I get started?

It's pretty clear that the first thing that needs to get done is the vmregex_compile function. That's a big job, since this function has to do a lot -- lexical analysis, parsing, and finally code generation, to produce the final list of VM instructions. I'm not really sure about anything parser-related right now: what kind of parser I'll need, how complicated it will be, or whether I'll need to convert the input to postfix (RPN) first. So, for now, I'm going to start by implementing a lexer (also known as a scanner or tokenizer), and once that is in place I'll investigate the parser.

Before writing any code, I'd like to have a grammar defined for valid expressions. I don't technically need the grammar to write the lexer, but I'd like to have it in front of me anyway so I don't lose track of things. After a couple of (wrong) attempts at writing one myself from scratch, I found this example of a BNF grammar for regular expressions. I removed the parts I don't need to get this (note that since "|" could be mistaken for the union metacharacter, I'm using commas to separate alternatives in the productions):

<RE>       := <union>, <simple>
<union>    := <RE> | <simple>
<simple>   := <concat>, <basic>
<concat>   := <simple><basic>
<basic>    := <zero>, <one>, <onezero>, <elem>
<zero>     := <elem>*
<one>      := <elem>+
<onezero>  := <elem>?
<elem>     := <group>, <any>, <literal>, <class>
<group>    := (<RE>)
<class>    := [<items>]
<items>    := <item>, <item><items>
<item>     := <range>, <literal>
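
As a sanity check, here's one possible derivation of "a*bc" under this grammar, picking one alternative at each step:

<RE>  => <simple>
      => <concat>
      => <simple><basic>
      => <concat><basic>
      => <simple><basic><basic>
      => <basic><basic><basic>
      => <zero><basic><basic>
      => <elem>* <elem> <elem>
      => <literal>* <literal> <literal>     (i.e. "a*bc")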

I could feed this grammar into lex & yacc, or some similar set of tools, to generate the lexer and parser code for me. But that's no fun. So what follows is sort of a plan for the first bit, the lexer.

You might notice that the above grammar has some symbols that are never defined -- that is, they don't appear on the left-hand side of any production -- namely <any>, <literal> and <range>. That's because these are the terminal symbols: tokens that will be returned by the lexer. <any> and <literal> are very simple; <any> is just a single "." character, and <literal> is any printable ASCII character (or escaped metacharacter) that isn't acting as a metacharacter. <range> is a character range (e.g. "A-Z"). Since the lexer will take care of recognising these simple constructs, I'll take them for granted in the grammar. I'll start populating a header file for the lexer while I'm at it; you can see the full list of tokens in the file below:

#ifndef LEX_H_
#define LEX_H_

/* lexer return tokens */
enum {CHARC_OPEN, CHARC_CLOSE, CHAR_RANGE, LPAREN, RPAREN, ONE,
      ZERO, ONEZERO, ANY, LITERAL, INVALIDSYM, END, STOP};

/* internal lexer states */
enum {STATE_START, STATE_LITERAL, STATE_RANGE, STATE_DEREF,
      STATE_END};

/* consume characters from *input until a valid token is
 * recognised, and return that token */
int lex (char **input);

/* return the text from the input string that makes up the most
 * recently returned token */
char *get_token_text (void);

#endif

The first enum is the list of tokens that will be returned by the lexer. The second enum is all the states that the lexer's state machine will use internally (I'm pretty sure -- I may have to add a couple when it comes to actually coding it). I've also added two function declarations, which will act like the yylex() function and yytext pointer provided by lex/flex. Each time the lex() function is called, it eats up characters from the input string until a valid token is found, and returns that token. It takes a pointer to a pointer to the input string so that it can simply increment the same pointer passed by the caller until the input string is finished, avoiding allocating extra memory inside lex() to store a copy of the input string.
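
Just to show how I expect the lexer to be driven, here's a sketch of a test driver -- it assumes END is the token returned when the input is exhausted, and uses the lex() and get_token_text() declarations above:

#include <stdio.h>
#include <stdlib.h>
#include "lex.h"

int main (void)
{
    char *input = "(.[+*A-Z])+";
    int token;

    /* lex() advances 'input' itself, so just keep calling it
     * until the string is used up */
    while ((token = lex(&input)) != END) {
        char *text = get_token_text();
        printf("token %d: \"%s\"\n", token, text);
        free(text);   /* get_token_text() allocates, for now */
    }
    return 0;
}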

The get_token_text() function serves the same purpose as lex's yytext, except that it's a function instead of a char pointer, and it allocates memory on the heap to store a copy of the portion of the input string that makes up the current token. I originally did it this way because there will be some cases where I don't need the characters from the input stream, because the token returned by the lexer is enough (any of the metacharacters, for example: if the lexer says ZERO, then I know the current character in the input string is a '*' -- I don't need to look). So I can avoid allocating extra memory by only calling the function when I need to. But when I looked at how this is implemented in lex, I realised that of course I don't need to allocate memory at all; I can just provide a pointer into the original input string, like lex does.
BUT! There is one thing about yytext that I don't like, and which I'd like to do differently. As any lex user knows, yytext is a pointer to the portion of the input string that corresponds to the current token returned by yylex(). But of course you don't want to print the whole remainder of the input string, only the part that constitutes the current token, so lex sticks a null terminator right in the string at the desired length, having recorded the character it just clobbered so it can repair the string later. Yuck. I think I'd rather just keep track of the length, so I know how many characters to read, but I won't really know the best way to do this until I'm working on the parser. For now, I'll implement get_token_text() the crappy way (allocating new memory), and then I can fix it once I know exactly what the parser will need to do with these snippets.
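
For reference, the length-tracking version I have in mind might look something like this (entirely hypothetical -- the real interface will depend on what the parser turns out to need):

#include <stdio.h>
#include <stddef.h>

/* Instead of copying the token text (or clobbering the input string
 * like lex does), hand the parser a pointer into the original string
 * plus a length. */
typedef struct {
    const char *start;   /* first character of the token text */
    size_t length;       /* number of characters in the token */
} token_text_t;

/* the caller reads exactly 'length' characters from 'start', e.g. */
void print_token_text (token_text_t tok)
{
    printf("%.*s\n", (int) tok.length, tok.start);
}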

To illustrate how the lexer should move through particular states to return a corresponding token, I've drawn a simplified state diagram (note that token names are in red, and input characters are in green):
[state diagram image]


Given the state diagram above, if the input string "(.[+*A-Z])+\+*\.+" were supplied, then the lexer should produce the following stream of tokens:

LPAREN, ANY, CHARC_OPEN, LITERAL, LITERAL, CHAR_RANGE, CHARC_CLOSE, RPAREN, ONE, LITERAL, ZERO, LITERAL, ONE

So you can see that this lexer will do a little more than just splitting up the input string: it has to keep track of when it's inside a character class, so that a metacharacter (e.g. "+") outside a character class is recognised as such (the ONE token), but inside a character class it is a LITERAL. Additionally, it will filter out any "\" characters (so they will not appear in strings returned by get_token_text()), while recognising each one as a command to treat the following character as a LITERAL.
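
Here's a rough sketch of the kind of logic I have in mind for lex(). It collapses the state machine into a single "inside a character class" flag, leaves out range handling ("A-Z") and the get_token_text() bookkeeping, and assumes END marks the end of the input -- a guess at the shape of the thing, not the real implementation:

#include "lex.h"

int lex (char **input)
{
    /* are we between CHARC_OPEN and CHARC_CLOSE? */
    static int inside_class = 0;
    char c = **input;

    if (c == '\0')
        return END;

    (*input)++;

    if (c == '\\') {               /* escape: next char is a literal */
        if (**input == '\0')
            return INVALIDSYM;
        (*input)++;
        return LITERAL;
    }

    if (inside_class) {
        if (c == ']') {
            inside_class = 0;
            return CHARC_CLOSE;
        }
        return LITERAL;            /* metacharacters lose their meaning */
    }

    switch (c) {
        case '[': inside_class = 1; return CHARC_OPEN;
        case '(': return LPAREN;
        case ')': return RPAREN;
        case '*': return ZERO;
        case '+': return ONE;
        case '?': return ONEZERO;
        case '.': return ANY;
        case ']': return INVALIDSYM;
        default:  return LITERAL;
    }
}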

I'm going to finish implementing the lexer, and then add a test main() that calls the lex() function with an input string and prints the resulting stream of tokens. In the next post, I'll use this to verify that the lexer works the way I expect, and make a start on the parser.



