Building a C library for regular expressions: March 2016

Code

Finishing up on the lexer...

In the last post I said I would verify the lexer by writing a test "main.c" which prints out a string representing each token found in the given input string. The code at the link above has moved on quite a bit since the lexer was finished, but you can roll back to commit 36781762cf14a520aa72677be824a2f029769ff8 to see the code for testing the lexer. Basically, it's just the implementation of the lex() function, which is then invoked like so:

int main (void)                                                                 
{                                                                               
    char *regex = "(.[+*A-Z])+\\+*\\.+";                                        
                                                                                
    printf("regex: %s\n\n", regex);                                             
    printf("tokens:\n");                                                        
    while ((tok = lex(&regex)) != END) {                                        
        printf("%s ", token_string(tok));                                       
    }                                                                           
                                                                                
    printf("\n");                                                               
                                                                                
    return 0;                                                                   
}

To build & execute:

NOTE: Lines preceded by $> are shell commands

$> make
$> ./regexc

And the output is:

regex: (.[+*A-Z])+\+*\.+

tokens:
LPAREN ANY CHARC_OPEN LITERAL LITERAL CHAR_RANGE CHARC_CLOSE RPAREN ONE LITERAL ZERO LITERAL ONE

Pretty straightforward. If you're still interested, you can look at the source in lex.c. Moving swiftly on ...

Parsing

The parser runs in lock-step with the lexer, calling the lex() function repeatedly and processing the input expression one token at a time. The parser also generates an intermediate representation (it's more like a 99% finished representation, really, but I'll get to that later on) of the final list of instructions for the VM.

I thought it would be fun to just try and figure this part out, instead of copying the design of someone else's parser. My first attempt at this was a recursive design, really just a big messy switch statement with no real structure. It didn't work, and I abandoned it when I realised it would never be able to properly handle nested parenthesis groups. I started over again, and got the idea into my head that I should try converting the input expressions to postfix notation, and go from there. I eventually abandoned this idea too (an explicit conversion to postfix notation, requiring a 2nd pass over the postfix version of the input expression, is really unnecessary in this case), but while researching that I came across the Shunting-yard algorithm. The Shunting-yard algorithm is a method for converting infix mathematical expressions (3 + 4) to postfix (3 4 +). Go read about it in detail if you're interested. Although I didn't end up using it, it did get me thinking about things in the right way; save all the operands (literal characters in the input expression) in a buffer until an operator (* ? + | ) is seen, and then let the operator operate on the operands saved in the buffer. Operators do stuff, they make stuff happen, without operators it's just a string of literal characters. So if it's not an operator, save it in the buffer. If it is an operator, use the items in the buffer to do the operation, and write the result to the output queue. Repeat.

Parser design

To aid explanation, I'll start with some simple expressions, and introduce a method for compiling them. Imagine for a moment that we live in a world without parenthesis groups, and the only possible expressions consist of literal characters and three operators, "? * +". Handling nested parenthesis groups, and the "|" alternation operator, will be addressed later. Take the expression "abc*de+", which needs to be converted to the following list of instructions:

0 char a
1 char b
2 branch 5 3
3 char c
4 branch 5 3
5 char d
6 char e
7 branch 6 8
8 match

In this world without parenthesis groups, we could generate these instructions from the input expression quite easily using two stacks. One stack is used as a temporary buffer, and the other will hold the final list of instructions. Here's an illustration of how these two stacks could be used to generate the instructions, as the input expression is read one character at a time:

The formula for converting this parenthesis-less expression into VM instructions, as illustrated in the diagram above, goes something like this;

read the input expression, one character at a time:

current input character = C

if C is a literal:
add a "char C" instruction to the temp. stack

if C is an operator:

move temp. stack items into output stack, until the last item N

if C is "*":
add a branch instruction, then N, the another branch, to the output stack

if C is "+":
add N, then a branch instruction to the output stack

if C is "?":
add a branch instruction, then N to the output stack

In C, an instruction is represented using the following structure:

typedef struct inst {                                                                   
    char *ccs;                                                                  
    short x;                                                                    
    short y;                                                                    
    char c;                                                                     
    uint8_t op;                                                                 
} inst_t;

A stack for holding these instructions is implemented as a doubly linked list. A singly linked list would work too, but using a doubly linked list means less traversal, which I prefer.

After compilation, the instructions are moved out of the stack structure and into an array. This will make matching faster, since any instruction in the array can be accessed by a using a fixed number of operations, regardless of the position within the array. To access an instruction in the middle of the linked list, all of the items from the list head have to be traversed until the middle instruction is reached. The x and y members inside the inst_t struct are the x and y of a "branch" instruction (only x is used in the case of a "jmp" instruction), and each one will hold an index to a particular instruction in the final array.

Notable limitation: using a signed short for x and y limits the instruction list size to 32,767. I only chose this because it can considerably reduce the total size of heap allocations on *some* machines (read: the machine I was using at the time). However this limit is unlikely to be a problem for most cases. Given that a literal character generates 1 instruction, an operator generates at most 2 instructions, and a valid operator can only occur in the input expression at most every 2nd character, then in the worst case scenario, an input expression containing as little as 24,575 characters (75% of 32,767) could hit the limit of 32,767 instructions. 24,575 characters is pretty big, but if it's not big enough, changing both x and y to an int is no big deal.

The stack structure is implemented like this:

                                                    
typedef struct stackitem {                                                              
    inst_t *inst;                                                               
    stackitem_t *next;                                                          
    stackitem_t *previous;                                                      
} stackitem_t;                                                                                                                                                                                                                                
typedef struct stack {                                                                  
    stackitem_t *head;                                                          
    stackitem_t *tail;                                                 
    unsigned short size;                                                       
} stack_t;

the following functions are then necessary for stack manipulation:

stack_t *create_stack (void);                                                   
stackitem_t *stack_add_head (stack_t *stack, inst_t *inst);
void stack_cat (stack_t *stack1, stack_t *stack2);                              
void stack_free (stack_t *stack);

create_stack() allocates a new stack_t structure and returns a pointer to it.

stack_add_head() allocates space for a new stack item, containing the instruction inst, and adds it to the head of stack.

stack_cat() points the head item of stack1 to the tail item of stack2, with the effect of appending the contents of stack2 on to the end of stack1. This is one of the reasons a doubly linked list is used; it makes the concatenation operation O(1). If a singly linked list was used, then stack_cat() would be O(n), since each item in stack2 would have to be traversed to find the tail.

stack_free() frees all allocated memory associated with stack; all instruction contents, all inst_t structs, all stackitem_t structs and finally the stack_t struct.

The implementation of these functions can be seen at the github link I posted.

Now that these basic building blocks are in place, it's time to look at handling the alternation "|" operator. Take the input expression "abc|def" as an example. The VM instructions for this expression should be as follows:

0 branch 1 5
1 char a
2 char b
3 char c
4 jmp 8
5 char d
6 char e
7 char f
8 match

The method for generating these instructions, again using two stacks, is illustrated below:

As shown in the diagram above, in order to generate instructions for an alternation operator, the position of the "jmp" expression must be saved. Then, when the end of the following expression is reached (which, in this parenthesis-less world, will be either the end of the input string or another alternation operator), the saved "jmp" instruction is set to the end of the new expression.Using this approach, where an alternation operator is not "aware" of any other possible alternation operators and simply jumps to the end of the expression following it, will work even in the case of a chain of alternations i.e. "a|b|c|d", however it will generate inefficient code for such a situation:

0 branch 1 9

1 branch 2 7

2 branch 3 5

3 char a

4 jmp 6

5 char b

6 jmp 8

7 char c

8 jmp 10

9 char d

10 match

If the input string is "a", then after matching this character the VM will have to jump to 6, then jump to 8, and finally to 10 in order to reach the match instruction, where this could have been achieved with only one jump. The longer the chain is, the more unnecessary jumps there are for expressions nearer to the beginning of the chain. This will be fixed after parenthesis groups have been introduced.

Next time: parenthesis groups!

Thursday 3 March 2016

Parser

Code

Finishing up on the lexer...

Parsing