Flex and Bison are aging Unix tools that when used together can create simple yet powerful parsers. Parsers are a widely utilized concept in computer science and are used to construct compilers, interpreters, and other text processing utilities. In this tutorial, we will use Flex and Bison to create a simple calculator to demonstrate how to effectively use the tools, as well as to provide a general overview as to how they work. To do this, we will use Flex to break our input down into tokens, which then get passed to Bison, where the bulk of the parsing happens. Flex is a lexical analyzer and can be best Thought of as the backend of the pair, as it does much of the rote searching and pattern matching, using regular expressions. Bison is a syntactical analyzer, and is also the actual parser generator. Bison parses data by using two stacks to impose structure to the tokens passed to it by Flex. One stack, the parse stack, contains the tokens which have been passed to it, while the other stack, the value stack, contains the values of the tokens in the parse stack. So, now that we have covered the basic purpose of both Flex and Bison, we can begin to write our calculator utility.The first step in creating our calculator is to tokenize our inputs by using Flex. Tokens are used to aid in parsing efficiency since they are essentially numerical representations of strings. Now, for an application as simple as our calculator, breaking our input down into easy to parse tokens might not be entirely necessary. Tokens are a critical component other parsing applications, so they will be used here to demonstrate the concept. In order for Flex to start recognizing patterns and converting input to tokens, we first must define a set of rules for Flex to abide by. We do this by writing a regular expression, accompanied by an action or set of actions for Flex to take once it finds a string that matches the regex. The rules section for our calculator would look like this: %% [0-9]+ { yylval = atoi(yytext); return INTEGER; } "+" { return PLUS; } "-" { return MINUS; } "*" { return MULTIPLY; } "/" { return DIVIDE; } "(" { return LPAREN; } ")" { return RPAREN; } [ \t] { ; } [\n] { return NEWLINE; } [.] { yyerror("unknown character"); } %% Our first rule looks for integers by searching for any string comprised of the characters 0 through 9, of any length. This means the strings "54" and"89270" will be treated as matches but "983.22" will not. Inside the brackets, you can see the actions we have defined for when Flex matches this regular expression. In this case, once we find a match, Flex will store the value of the integer in the yylval variable, and then tell Bison what we found by returning an INTEGER token. The next set of rules looks for our calculator's operators. Our calculator will support addition, subtraction, multiplication, division, and parenthesis. We have given each operator their own token definition. For the action associated with these rules, we simply return the token to Bison. We don't have to store a value for these rules, since all we care about is that we found the operator. Notice the lack of action for the next rule that was defined, "[ /t]". This is because we don't want to do anything when we come across white space, but we still want to match it in case our calculator's users want to separate numbers or operators with spaces or tabs. By leaving our action side blank, we will still match whitespace, but effectively skip over it by not doing anything when we find some. Our final rule is to match any characters that we haven't matched already, and treat them as errors. If it's not an integer or an operator, we don't care about it. You may have noticed the "%%" that the rules section starts and ends with. These are here so Flex can tell whats what inside its source code. A Flex source actually contains 3 parts, a definition section, a rules section (see above), and a subroutines section. The definition and subroutine section is generally written in C or C++, but can also be written using java. The definition section is overwhelmingly the most important part of a Flex source, but we do need the other two for our calculator utility. Our full Flex source would look like this: /* definition section */ %{ #include "y.tab.h" #include void yyerror(char *); %} %% /* rules section */ [0-9]+ { yylval = atoi(yytext); return INTEGER; } "+" { return PLUS; } "-" { return MINUS; } "*" { return MULTIPLY; } "/" { return DIVIDE; } "(" { return LPAREN; } ")" { return RPAREN; } [ \t] { ; } [\n] { return NEWLINE; } [.] { yyerror("unknown character"); } %% /* subroutines section */ int yywrap(void) { return 1; } Now that we've tokenized our input appropriately, we can begin working with Bison to put the finishing touches on our calculator. Bison will take the tokens and other things returned to it by Flex and do the actual parsing work. Bison uses a variant of Backus Naur form and implements shift-reduce parsing. Bison works by looking at the order of the tokens it gets passed, and tries to reduce a set of tokens into a simpler expression. Bison's source code contains a rules section as well,and in it is defined how it should do the actual reducing. The rules section in our Bison source would look like this: %% program: program expr NEWLINE { printf("%d\n", $2); } | ; expr: INTEGER { $$ = $1; } | expr PLUS expr { $$ = $1 + $3; } | expr MINUS expr { $$ = $1 - $3; } | expr MULTIPLY expr { $$ = $1 * $3; } | expr DIVIDE expr { $$ = $1 / $3; } | LPAREN expr RPAREN { $$ = $2; } ; %% Bison uses a variant of Backus Naur form to parse. Each production contains a nonterminal followed by a colon, followed by an expression or set of expressions and actions associated with the production. Bison will receive tokens from Flex and push them onto it's stack until it can match a pattern outlined by the right hand side of a production. At this point, Bison will pop off of the parse stack all the values that match the right hand side of the production and push the nonterminal on the left hand side onto the parse stack. Bison will then execute the actions associated with the production to perform a series of pushesand pops on the value stack to keep it in sync with the parse stack. The goal is to simplify tokens and other items present on the stack into simpler and simpler units using the rules we define in this section. Ultimately we want to end up with a single nonterminal on the stack, in this case "program". Look at the first production; "expr : INTEGER {$$ = $1}". Every time Bison gets passed an INTEGER token by Flex, it will pop it off the parse stack and replace it with "expr", which is used to create other productions. On the value stack, it pops off the first element and then pushes it back on. This effectively does nothing, and is actually the default behavior when no action is outlined, so you could leave this part offif you wanted to. The productions are listed in order of increasing precedence, so or slightly more complex productions involving operators are listed after the INTEGER production. Look at production involving the PLUS token. Recall that when Flex encountered an addition sign, it would simply return a PLUS token to Bison as a signal. So, when the parse stack contains an "expr" followed by a PLUS token followed by another "expr", this production tells Bison to pop off of the parse stack those three values and replace them with a single "expr". The associated action mimics this behavior on the value stack by popping off the first, second, and third elements, then pushing back onto the stack the sum of the first and third elements (the second element being the addition sign). The last production listed involves parenthesis and has the highest precedence. Like Flex, Bison source code also contains three parts, delimited by "%%". The main method for our calculator utility will be contained in the subroutines part of the Bison source. Our full Bison source is as follows: /* definitions section */ %{ #include void yyerror(char *); int yylex(void); %} %token INTEGER %token PLUS %token MINUS %token MULTIPLY %token DIVIDE %token LPAREN %token RPAREN %token NEWLINE %left PLUS MINUS %left MULTIPLY DIVIDE %% /* rules section */ program: program expr NEWLINE { printf("%d\n", $2); } | ; expr: INTEGER { $$ = $1; } | expr PLUS expr { $$ = $1 + $3; } | expr MINUS expr { $$ = $1 - $3; } | expr MULTIPLY expr { $$ = $1 * $3; } | expr DIVIDE expr { $$ = $1 / $3; } | LPAREN expr RPAREN { $$ = $2; } ; %% /* subroutines section */ void yyerror(char *s) { fprintf(stderr, "%s\n", s); } int main(void) { yyparse(); } And our calculator utility is now complete! We have written our source code required for the utility, so now all that is left is to compile them. You can accomplish this with a make file for larger projects, but here a couple of command lines will do. Make sure you have Flex and Bison installed on your machine before entering the following lines in your command prompt or terminal. First, we must generate the lexical analyzer by calling flex with the name of our lex file: daniel@ubuntu:~$ flex calculator.lex This will create a file with the name "lex.yy.c" in your current directory. This is our actual lexical analyzer, written in C. Flex is merely a generator, What was written in the .lex file was expanded into the almost 2000 lines of code that you now see in your current directory. We will compile this file soon, but we must now generate our parser by using a similar command line calling bison: daniel@ubuntu:~$ bison -dy calculator.yacc This will create 2 more files in your current directory, one of which, "y.tab.c", is the source code for our parser, and the other is a header file "y.tab.h" which flex needed so it could work with the generated parser. This is why we had an #include "y.tab.h" line in our lex source. we can now compile and run our lexical analyzer and parser to achieve a working calculator: daniel@ubuntu:~$ gcc lex.yy.c y.tab.c -o calculator daniel@ubuntu:~$ ./calculator And thats it! We now have a working calculator. Try inputing some lines to test our new utility. Enter in what you want calculated, and the utility will echo back the answer on the next line. It observes the correct order of operations: 1 + 2 3 3 + 4 * 5 23 The parenthesis behave as expected: (3 + 4) * 5 35 Thanks to the [ \t] rule in our lexical analyzer, we can enter our input separated by spaces or tabs, or even having no separation at all: 3 * 8 + 5 29 3 * 8 + 5 29 3*8+5 29 Even though the calculator we created was trivially simple, it serves as an effective demonstration of the usefullness of the Bison and Flex tools. With well under a hundred lines of code we were able to create a calculator using Bison and Flex. With a few changes, we could easily add the ability to our calculator recognize logic and perform symbolic manipulation. The power of these tools is a result of their sheer effeciency and simplicity.