05/21/2023

Introduction To Compilers Chapter 5: Writing a Parser

I'm currently reading the book Introduction to Compilers and Language Design by Douglas Thain. Below are my notes from chapter 5, which walks us through the basics of writing a parser using Bison.

Note: Initially I was having some trouble getting the parser to work after following the examples directly from the book. I was able to get everything to compile, but my parser identified everything as being an invalid syntax. Eventually I was able to talk through it with ChatGPT and put together a working parser. As a result, the code described below varies slightly from the examples in this part of the book.

What is a Parser?

A parser is a program that reads a stream of tokens and determines whether or not the tokens follow the grammar of the language. A parser is also known as a syntax analyzer.

What is Bison?

Bison is a tool that generates parsers. It is a program that reads a parser specification and produces a parser program. Bison is a part of the GNU Compiler Collection (GCC). Bison is also known as yacc.

What is a Parser Specification?

A parser specification is a file that describes the grammar of the language. It is written in a language called yacc. The yacc language is a subset of C. The yacc language is also known as bison.

Writing our Parser

To write our parser, we will need to create the following:

  • A parser specification (parser.bison)
  • A scanner specification (scanner.flex)
  • A main.c program that runs our parser on some input and tells us if the syntax is valid.

Once we have these 3 things, we can compile everything into a parser executable. Then we can run that against another file to determine if the contents of the file are a valid syntax for the defined grammar rules of our language.

parser.bison

Let's start with the parser.bison file. In this file we will declare our tokens as well as our grammar rules for our language.

Here is the basic structure of a parser.bison file:

%{
  (C Preamble Code)
%}
  (Token Declarations)
%%
  (Grammar Rules)
%%
  (Additional Code)

The C Preamble Code is where we can include any C header files that we need. We can also define any global variables that we need. The Token Declarations is where we define our tokens. The Grammar Rules is where we define the grammar rules for our language. The Additional Code is where we can define any additional C code that we need.

In the previous post on writing a Flex scanner, we defined a list of tokens in the file tokens.h. This time, we just have to define our tokens in the parser.bison file. Then when we run the Bison command to generate a parser.c file, we can actually use Bison to automatically create a tokens.h file for us based on what we have defined in our Bison file.

Let's start with the C Preamble code. Here is what I ended up including in mine by the time I got everything working:

%{
#include <stdio.h>
void yyerror(const char* msg);
int yylex();
%}

Here is a quick breakdown of each piece:

  • #include <stdio.h>: This line includes the standard input/output library in C. This library is needed for functions like printf and fprintf, which our program will use for printing error messages.
  • void yyerror(const char* msg);: This line declares a function yyerror that takes a const char* as an argument and returns nothing (void). Bison will call this function when it encounters a syntax error. We will define this function later in the "Additional Code" section.
  • int yylex();: This line declares a function yylex that takes no arguments and returns an integer. The yylex function is the lexer function that Bison calls to get the next token. It is typically generated by a tool like Flex.

Next, we will define our tokens. Here is what I ended up with, which mostly follows along with the example from the book:

%union {
  int ival;
}

%token <ival> TOKEN_INT
%token TOKEN_PLUS
%token TOKEN_MINUS
%token TOKEN_MUL
%token TOKEN_DIV
%token TOKEN_LPAREN
%token TOKEN_RPAREN
%token TOKEN_SEMI
%token TOKEN_ERROR

The %union declaration is used to specify the types of semantic values that tokens and grammar symbols can have. This was reccommended as a solution by ChatGPT, and is used in combination with the <ival> tag in our token declarations to specify that TOKEN_INT tokens have an integer semantic value. Outside of this, I think the rest of our tokens are fairly self explanatory.

Next, we will define our grammar rules. For this part, what I have should match exactly with the example from the book:

%%
program : program expr TOKEN_SEMI
         | expr TOKEN_SEMI
         ;

expr : expr TOKEN_PLUS term
     | expr TOKEN_MINUS term
     | term
     ;

term : term TOKEN_MUL factor
     | term TOKEN_DIV factor
     | factor
     ;

factor : TOKEN_MINUS factor
       | TOKEN_LPAREN expr TOKEN_RPAREN
       | TOKEN_INT
       ;
%%

This grammar defines a set of rules that describe how to understand and break down lines of simple math. It allows for numbers, plus, minus, multiply, divide, and brackets, organized in statements separated by semicolons.

Finally, we will define our additional code in the last section. For this part, we just need to define yyerror which was declared in the C Preamble section. Here is what I ended up with:

void yyerror(const char* msg) {
  fprintf(stderr, "%s\n", msg);
}

This function takes a const char* as an argument and returns nothing (void). It uses the fprintf function to print the error message to the standard error stream.

So now, putting it all together, here is what my parser.bison file looks like:

%{
#include <stdio.h>
void yyerror(const char* msg);
int yylex();
%}

%union {
  int ival;
}

%token <ival> TOKEN_INT
%token TOKEN_PLUS
%token TOKEN_MINUS
%token TOKEN_MUL
%token TOKEN_DIV
%token TOKEN_LPAREN
%token TOKEN_RPAREN
%token TOKEN_SEMI
%token TOKEN_ERROR

%%
program : program expr TOKEN_SEMI
         | expr TOKEN_SEMI
         ;

expr : expr TOKEN_PLUS term
     | expr TOKEN_MINUS term
     | term
     ;

term : term TOKEN_MUL factor
     | term TOKEN_DIV factor
     | factor
     ;

factor : TOKEN_MINUS factor
       | TOKEN_LPAREN expr TOKEN_RPAREN
       | TOKEN_INT
       ;
%%

void yyerror(const char* msg) {
  fprintf(stderr, "%s\n", msg);
}

Now that we have a complete parser.bison file, we can run the bison command to generate a parser.c file. As mentioned above, we can also generate our tokens.h file at the same time. Here is the command I used:

bison --defines=tokens.h --output=parser.c parser.bison

After running that, you should see a parser.c file and a tokens.h file in your directory

scanner.flex

Next, we need to create a scanner.flex file. We covered scanners and the basic structure of what a scanner.flex file needs to include in the previous post on Chapter 3. The main differences here will be in the tokens defined, as well as some funkiness related to ChatGPT's reccomendation to use ival for integer tokens, which we will cover below. Here is what my scanner.flex file looks like:

%{
#include <stdio.h>
#include <stdlib.h>
#include "tokens.h"
%}

%%

[0-9]+    { yylval.ival = atoi(yytext); return TOKEN_INT; }
[ \t\n]+  /* skip whitespace */
"+"       { return TOKEN_PLUS; }
"-"       { return TOKEN_MINUS; }
"*"       { return TOKEN_MUL; }
"/"       { return TOKEN_DIV; }
"("       { return TOKEN_LPAREN; }
")"       { return TOKEN_RPAREN; }
";"       { return TOKEN_SEMI; }
.         { return TOKEN_ERROR; }

%%

int yywrap() { return 1; }

Let's break down each part of this file:

  • %{ and %}: These enclose the C code section. This is where you put raw C code that Flex will copy directly into the generated C source file.
  • #include <stdio.h> and #include <stdlib.h>: These lines include the standard input/output and standard library in C, which provide functionalities such as input/output processing and memory allocation.
  • #include "tokens.h": This line includes the token definitions that were generated by Bison. This is needed so that your Flex lexer and Bison parser use the same token codes.
  • %%: This symbol marks the beginning and the end of the rules section. This is where you define patterns and the actions that Flex should take when it matches those patterns in the input.
  • [0-9]+: This is a regular expression that matches one or more digits. If it finds a match, it converts the matched text into an integer and returns a TOKEN_INT token.
  • { yylval.ival = atoi(yytext); return TOKEN_INT; }: This is the action that is executed when the above regular expression matches the input. yytext is a string that contains the text that was just matched by the regular expression ([0-9]+). atoi(yytext) converts this string to an integer.
    • Essentially this is saying, "whenever we see one or more digits, convert those digits to an integer and tell Bison that we've found an integer token".
    • Adding this was a recomendation from ChatGPT after my original parser seemed to be having trouble identifying integer tokens.
  • [ \t\n]+: This pattern matches any amount of whitespace (spaces, tabs, or newlines). If this pattern is matched, no action is taken, effectively skipping over the whitespace.
  • "+", "-", "*", "/", "(", ")", and ";": Each of these patterns matches a specific character or string. When matched, the lexer returns the corresponding token.
  • .: This pattern matches any single character that hasn't been matched by the other rules. If this pattern is matched, the lexer returns a TOKEN_ERROR token.
  • int yywrap() { return 1; }: This function is a part of the Flex scanner. When Flex reaches the end of the input file, it calls yywrap. If yywrap returns 1, Flex stops scanning and wraps up. If yywrap would return 0, Flex would continue scanning the same file from the beginning or some other file. The typical behavior is to stop scanning when the end of the file is reached, so this function returns 1.

Now that we have our complete scanner.flex file, we can generate a scanner.c file. Here is the command I used:

flex --outfile=scanner.c scanner.flex

After running this, you should see a scanner.c file in your directory.

main.c

Now that we have our parser.c and scanner.c files, we can create a main.c file that will run our parser on some input and tell us if the syntax is valid. Here is what my main.c file looks like:

#include <stdio.h>

extern int yyparse();
extern int yylex();
extern char *yytext;
extern void yyerror(const char* msg);

int main() {
  if (yyparse() == 0) {
    printf("Parse Successful!\n");
  } else {
    printf("Parse Failed!\n");
  }
}

Let's break down what everything is doing here:

  • #include <stdio.h>: This line tells the compiler to include the standard input/output library. This library gives us the ability to perform tasks such as printing to the console and reading from the keyboard.
  • extern int yyparse();, extern int yylex();, extern char *yytext;, extern void yyerror(const char* msg);: These lines declare functions and variables that are defined in the other files of our program (scanner.c and parser.c). 'extern' keyword here tells the compiler that these functions or variables exist, but they are defined somewhere else (not in this file). This allows you to use these functions and variables in main.c.
  • int main(): This is the starting point of our program. When we run our program, this is the function that gets called first.
  • if (yyparse() == 0): This line calls the yyparse function, which was generated by Bison. This function reads and parses the input (according to the rules defined in our Bison file). If the input is correctly formatted, yyparse returns 0.
  • printf("Parse Successful!\n");: If yyparse returned 0, this line prints the message "Parse Successful!" to the console.
  • else: If yyparse did not return 0, it means there was some error in our input.
  • printf("Parse Failed!\n");: If there was an error, this line prints the message "Parse Failed!" to the console.

So in simple terms, this program reads some input, checks if it follows the correct format (according to the rules in the Bison file), and then prints out whether the check was successful or not.

Compiling our Parser

Now that we have all of our files, we can compile them into a parser executable. Here is the command I used:

gcc -o parser main.c parser.c scanner.c

After running this, you should see a parser executable in your directory. At this point, the files in your current directory should look something like this:

/my-project
|-- main.c
|-- parser.bison
|-- scanner.flex
|-- tokens.h
|-- parser.c
|-- scanner.c
|-- parser (executable)

Testing Out Our Parser

Now that we have our parser executable, we can run it against some input to see if it follows the correct format. Let's create a file test.txt with the following contents:


10 + 20 - 30;
10 * 20 / 30;

Now we can run our parser against this file. Here is the command I used:

./parser < test.txt

After running this, you should see the message "Parse Successful!" printed to the console:

./parser < test.txt                                    
Parse Successful!

Now let's try running our parser against some invalid input. Let's create a file test2.txt with the following contents:


10 + 20 - fish;
10 * cat / 

Now we can run our parser against this file. Here is the command I used:

./parser < test2.txt

After running this, you should see the message "Parse Failed!" printed to the console:

./parser < test2.txt
syntax error
Parse Failed!

Note that in this example, we see "syntax error" and "Parse Failed!" in the output, even though the message we are outputting in main.c for the failure case only contains "Parse Failed!". In this case, "syntax error" is actually coming from the definition of yyerror we included in our parser.bison file.

We can play around with our yyerror method definition to make this more clear:

void yyerror(const char* msg) {
  fprintf(stderr, "AAHHH!");
  fprintf(stderr, "%s\n", msg);
}

Remember, because we changed our parser.bison file, we need to regenerate our parser.c file, and therefore our parser executable:

bison --defines=tokens.h --output=parser.c parser.bison
gcc -o parser main.c parser.c scanner.c

Now let's try re-running our parser on the bad input file:

./parser < test2.txt
AAHHH!syntax error
Parse Failed!

Conclussion

As mentioned in the previous post on chapter 3, I'm currently not very familiar with compilers or writing programs in C; any feedback on things I can improve in the notes above is appreciated!