Flex and Bison are aging unix tools that when used together can create simple yet powerful parsers. Parsers are a widely utilized concept in computer science and are used to construct compilers, interepreters, and other text processing utilities. In this tutorial, we will use Flex and Bison to create a simple calculator to demonstrate how to effectively use the tools, as well as to provide a general overview as to how they work. To do this, we will use Flex to break our input down into tokens, which then get passed to Bison, where the bulk of the parsing happens. Flex is a lexical analyzer and can be best thought of as the backend of the pair, as it does much of the rote searching and pattern matching, using regular expressions. Bison is a syntatical analyzer, and is also the actual parser generator. Bison parses data by using two stacks to impose structure to the tokenspassed to it by Flex. One stack, the parse stack, contains the tokens which have been passed to it, while the other stack, the value stack, contains the values of the tokens in the parse stack. So, now that we have covered the basic purpose of both Flex and Bison, we can begin to write our calculator utility. The first step in creating our calculator is to tokenize our inputs by using Flex. Tokens are used to aid in parsing effeciency since they are essentially numerical representations of strings. Now, for an applicaton as simple as our calculator, breaking our input down into easy to parse tokens might not be entirely necessary. Tokens are a critical component other parsing applications, so they will be used here to demonstrate the concept. In order for Flex to start recognizing patterns and converting input to tokens, we first must define a set of rules for Flex to abide by. We do this by writing a regular expression, accoompinied by an action or set of actions for Flex to take once it finds a string that matches the regex. The rules section for our calculator would look like this: %% [0-9]+ { yylval = atoi(yytext); return INTEGER; } [-+()/*\n] { return *yytext; } [ \t] ; . { yyerror("character not recognized"); } %% Our first rule looks for integers by searching for any string comprised of the characters 0 through 9, of any length. This means the strings "54" and "89270" will be treated as matches but "983.22" will not. Inside the brackets, you can see the actions we have defined for when Flex matches this regular expression. In this case, once we find a match, Flex will store the value of the integer in the yylval variable, and then tell Bison what we found by returning an INTEGER token. Our next rule looks for our calculator's operators. Our calculator will support addition, subtraction, multiplication, division, and paranthesis. For our action associated with this rule, we simply return the operator we found to Bison. Notice the lack of action for the next rule that was defined, "[ /t]". This is because we don't want to do anything when we come across white space, but we still want to match it in case our calculator's users want to seperate numbers or operators with spaces. By leaving our action side blank, we will still match whitespace, but effectively skip over it by not doing anything when we find some. Our final rule is to match any characters that we haven't matched already, and treat them as errors. If it's not an integer or an operator, we don't care about it. You may have noticed the "%%" that the rules section starts and ends with. These are here so Flex can tell whats what inside its source code. A Flex source actually contains 3 parts, a definition section, a rules section (see above), and a subroutines section. The definition and subroutine section is generally written in C or C++, but can also be written using java. The definition section is overwhelmingly the most important part of a Flex source, but we do need the other two for our calculator utility. Our full Flex source would look like this: /* Definition Section */ %{ include void yyerror(char *); #include "y.tab.h" %} %% /* Rules Section */ [0-9]+ { yylval = atoi(yytext); return INTEGER; } [-+()=/*\n] { return *yytext; } [ \t] ; . yyerror("invalid character"); %% /* Subroutines section */ int yywrap(void) { return 1; } Now that we've tokenized our input appropiately, we can begin working with Bison to put the finishing touches on our calculator. Bison will take the tokens and other things returned to it by Flex and do the actual parsing work. Bison uses a variant of Backus Naur form and implements shift-reduce parsing. Bison works by looking at the order of the tokens it gets passed, and tries to reduce a set of tokens into a simpler expression. Bison's source code contains a rules section as well, and in it is defined how it should do the actual reducing. The rules section in our Bison source would look like this: %% program: program expr '\n' { printf("%d\n", $2); } | ; expr: INTEGER { $$ = $1; } | expr '+' expr { $$ = $1 + $3; } | expr '-' expr { $$ = $1 - $3; } | expr '*' expr { $$ = $1 * $3; } | expr '/' expr { $$ = $1 / $3; } | '(' expr ')' { $$ = $2; } ; %% Bison uses a variant of Backus Naur form to parse. Each production contains a nonterminal followed by a colon, followed by an expression or set of expressions and actions associated with the production. Bison will recieve tokens from Flex and push them onto it's stack until it can match a pattern outlined by the right hand side of a production. At this point, Bison will pop off of the parse stack all the values that match the right hand side of the production and push the nonterminal on the left hand side onto the parse stack. Bison will then execute the actions associated with the production to perforrm a series of pushes and pops on the value stack to keep it in sync with the parse stack. The goal is to simplify tokens and other items present on the stack into simpler and simpler units using the rules we define in this section. Ultimately we want to end up with a single nonterminal on the stack, in this case "program". Look at the first production; "expr : INTEGER {$$ = $1}". Every time Bison gets passed an INTEGER token by Flex, it will pop it off the parse stack and replace it with "expr", which is used to create other productions. On the value stack, it pops off the first element and then pushes it back on. This effectively does nothing, and is actually the default behaviour when no action is outlined, so you could leave this part off if you wanted to. The productions are listed in order of increasing precedence, so or slightly more complex productions involving operators are listed after the INTEGER production. Look at production involving addition. Recall that when Flex encountered a plus sign, it would simply return it to Bison. So, when the parse stack contains an "expr" followed by the plus sign followed by another "expr", this production tells Bison to pop off of the parse stack those three values and replace them with a single "expr". The associated action mimics this behaviour on the value stack by popping off the first, second, and third elements, then pushing back onto the stack the sum of the first and third elements (the second element being the addition sign). The last production listed involves our paranthesis and has the highest precedence. Like Flex, Bison source code also contains three parts, delimited by "%%". The main method for our calculator utility will be contained in the subroutines part of the Bison source. Our full Bison source is as follows: %token INTEGER %left '+' '-' %left '*' '/' %{ void yyerror(char *); int yylex(void); %} %% program: program expr '\n' { printf("%d\n", $2); } | ; expr: INTEGER { $$ = $1; } | expr '+' expr { $$ = $1 + $3; } | expr '-' expr { $$ = $1 - $3; } | expr '*' expr { $$ = $1 * $3; } | expr '/' expr { $$ = $1 / $3; } | '(' expr ')' { $$ = $2; } ; %% void yyerror(char *s) { fprintf(stderr, "%s\n", s); return 0; } int main(void) { yyparse(); return 0; } And our calulator utility is now complete! With well under a hundered lines of code we were able to create a simple calculator using Bison and Flex.