01/29/2023

Introduction To Compilers Chapter 3: Writing a Scanner

I'm currently reading the book Introduction to Compilers and Language Design by Douglas Thain. Below are my notes from chapter 3, which walks us through the basics of writing a scanner using Flex.

What is a Scanner?

A scanner is a program that reads a stream of characters and breaks it into meaningful tokens. A token is a sequence of characters that has some meaning to the language. For example, the token int is a keyword in C, and the token + is an operator. A scanner is also known as a lexical analyzer.

What is Flex?

Flex is a tool that generates scanners. It is a program that reads a scanner specification and produces a scanner program. Flex is a part of the GNU Compiler Collection (GCC). Flex is also known as lex.

What is a Scanner Specification?

A scanner specification is a file that describes the tokens that a scanner should recognize. It is written in a language called lex. The lex language is a subset of C. The lex language is also known as flex.

Writing our Scanner

To write our scanner, we will need to create the following:

  • A file to define our tokens (tokens.h)
  • A scanner specification (scanner.flex)
  • A main program that calls the scanner and prints the tokens that it finds. (main.c)

tokens.h

To begin, let's write a token.h file that defines the tokens that we want to recognize. We'll start by using the list of tokens from the books example:

/* tokens.h */

typedef enum {
  TOKEN_EOF=0
  TOKEN_WHILE,
  TOKEN_ADD,
  TOKEN_IDENT,
  TOKEN_NUMBER,
  TOKEN_ERROR
} token_t;

In this example, we've defined a new type called token_t. This type is an enumeration, which is a list of named constants. Here we have defined a list of constants that we can use to represent the tokens that we want to recognize. For example, TOKEN_WHILE is a constant that represents the token while. We'll use these constants to represent the tokens that we find in our scanner.

scanner.flex

Next, we'll write our scanner.flex file. Here is how the book defines the basic structure for a scanner.flex file:

%{
   (C Preamble Code)
%}
   (Character Classes)
%%
   (Regular Expression Rules)
%%
    (Additional Code)

For the first section, we'll start by including the tokens.h file that we just wrote:

/* scanner.flex */

%{
#include "tokens.h"
%}

In the second section, we'll define some character classes. According to the book, these are "a symbolic short-hand for commonly used regular expressions". Below we define character classes for digits and letters:

/* scanner.flex */

%{
#include "tokens.h"
%}
DIGIT [0-9]
LETTER [a-zA-Z]

The third section is the most important part. Here we will define a set of regular expressions that we'll use to recognize tokens. Below, we define regular expressions for whitespace as well as for the token types that we defined in tokens.h:

/* scanner.flex */

%{
#include "tokens.h"
%}
DIGIT [0-9]
LETTER [a-zA-Z]
%%
(" "|\t|\n) { /* skip whitespace */ }
\+          { return TOKEN_ADD; }
while       { return TOKEN_WHILE; }
{DIGIT}+    { return TOKEN_NUMBER; }
{LETTER}+   { return TOKEN_IDENT; }
.           { return TOKEN_ERROR; }
%%

The fourth and final section contains arbitrary code that will go at the end of the scanner. From the book: "A peculiar requirement of flex is that we must define a function called yywrap which returns 1 to indicate that the input is complete at the end of the file."

Below we define this function, completing our scanner.flex file:

/* scanner.flex */

%{
#include "tokens.h"
%}
DIGIT [0-9]
LETTER [a-zA-Z]
%%
(" "|\t|\n) { /* skip whitespace */ }
\+          { return TOKEN_ADD; }
while       { return TOKEN_WHILE; }
{DIGIT}+    { return TOKEN_NUMBER; }
{LETTER}+   { return TOKEN_IDENT; }
.           { return TOKEN_ERROR; }
%%
int yywrap() { return 1; }

main.c

Finally, we'll write our main.c file. This file will call our scanner and print the tokens that it finds.

We'll start by including the tokens.h file that we wrote earlier, as well as stdio.h, which contains the printf function that we'll use to print the tokens that we find.

/* main.c */

#include "tokens.h"
#include <stdio.h>

Next, "the main program must declare extern the symbols it expects to use in the generated scanner code". The yyin variable is a pointer to the input file that we'll use to read our program. The yylex function is the function that we'll call to get the next token from our scanner. The yytext variable is a pointer to the text of the token that we just found. Here is what that looks like:

/* main.c */

#include "tokens.h"
#include <stdio.h>

extern FILE *yyin;
extern int yylex();
extern char *yytext;

And finally, we can write the code for our main function. First, we will open the input file that we want to scan using fopen. If the file is not found, we will print an error message and exit. If the file is found, we will call yylex in a loop until we reach the end of the file. Each time we call yylex, it will return the next token that it finds. We'll print the token that we find, as well as the text of the token. Here is what our main function looks like:


int main() {
  yyin = fopen("program.c", "r");
  if(!yyin){
    printf("Error: could not open file/n");
    return 1;
  }

  while(1){
    token_t t = yylex();
    if (t == TOKEN_EOF) break;
    printf("token: %d, text: %s/n", t, yytext);
  }
}

Below is the complete code for our main.c file:

/* main.c */
#include "tokens.h"
#include <stdio.h>

extern FILE *yyin;
extern int yylex();
extern char *yytext;

int main() {
  yyin = fopen("program.c", "r");
  if(!yyin){
    printf("Error: could not open file/n");
    return 1;
  }

  while(1){
    token_t t = yylex();
    if (t == TOKEN_EOF) break;
    printf("token: %d, text: %s/n", t, yytext);
  }
}

Note that in the above example, our main function expects a file program.c to exists, which we have not actually created yet. Once we have our scanner compiled and ready to run, we will create a program.c that we can test it on.

Compiling and Running the Scanner

Now that we've written our scanner, we can compile it. We'll use the flex command to generate the scanner code from our scanner.flex file. We'll then compile the generated code using gcc. Finally, we'll run the scanner on our program.c file.

First, we'll use the flex command to generate the scanner code from our scanner.flex file. We'll call the generated code scanner.c:

flex -o scanner.c scanner.flex

Next, we'll compile the generated code using gcc. We'll call the executable scanner:

gcc -o scanner scanner.c main.c

Finally, we'll run the scanner on our program.c file. First, let's see what happens when we run the scanner on a file that doesn't exist:

./scanner
Error: could not open file

As expected, we get an error message. Now, let's create a program.c file that contains the following code:

/* program.c */

#include <stdio.h>

main() {
  printf("Hello World");

  int x = 0;
  while (x < 10) {
    x = x + 1;
    printf("x is %d/n", x);
  }

  return 0;
}

Now, let's run the scanner on our program.c file:


./scanner
token: 5 text: #
token: 3 text: include
token: 5 text: <
token: 3 text: stdio
token: 5 text: .
token: 3 text: h
token: 5 text: >
token: 3 text: main
token: 5 text: (
token: 5 text: )
token: 5 text: {
token: 3 text: printf
token: 5 text: (
token: 5 text: "
token: 3 text: Hello
token: 3 text: World
token: 5 text: "
token: 5 text: )
token: 5 text: ;
token: 3 text: int
token: 3 text: x
token: 5 text: =
token: 4 text: 0
token: 5 text: ;
token: 1 text: while
token: 5 text: (
token: 3 text: x
token: 5 text: <
token: 4 text: 10
token: 5 text: )
token: 5 text: {
token: 3 text: x
token: 5 text: =
token: 3 text: x
token: 2 text: +
token: 4 text: 1
token: 5 text: ;
token: 3 text: printf
token: 5 text: (
token: 5 text: "
token: 3 text: x
token: 3 text: is
token: 5 text: %
token: 3 text: d
token: 5 text: /
token: 3 text: n
token: 5 text: "
token: 5 text: ,
token: 3 text: x
token: 5 text: )
token: 5 text: ;
token: 5 text: }
token: 3 text: return
token: 4 text: 0
token: 5 text: ;
token: 5 text: }

As expected, we get a list of tokens. But how do we know that the scanner is working correctly? Let's take a look at a portion of the output:

token: 1 text: while

we can see our scanner has correctly identified the while keyword as token 1, which maps to TOKEN_WHILE in our tokens.h file (where TOKEN_EOF=0, TOKEN_WHILE=1, TOKEN_ADD=2, etc). Let's take a look at another token:

token: 3 text: Hello

We can see our scanner has correctly identified the Hello identifier as token 3, which maps to TOKEN_IDENT in our tokens.h file. Let's take a look at another token:

token: 4 text: 10

We can see our scanner has correctly identified the 10 integer literal as token 4, which maps to TOKEN_NUMBER in our tokens.h file.

Additionally, we can see a number of results in the output that we have no defined tokens for yet that are being identiifed as token 5, which is our TOKEN_ERROR. As a next step, we could define tokens for the remaining keywords and operators in our language, and then update our scanner to correctly identify them.

Conclussion

At the time of writing this, I'm currently not very familiar with compilers or writing programs in C; any feedback on things I can improve in the notes above is appreciated!