Project 5: Language interpreter

The goal of this project is to practice using trees.

1. Languages, interpreters, and trees

One important use of the tree data structure is in the intermediate representation of programming languages inside a compiler or interpreter. A typical compiler does not merely scan the source file as one giant String and come up with a compiled version directly; modern programming languages and the translation process are much too complex for that. Instead, the compiler will transform the program several times through intermediate representations, that is, ways to represent the program that are "intermediate" in the sense that they are not the final product of the compiler. One of the first representations that the compiler uses is a parse tree.

For example, suppose we have the following simple language:

EXPR ::= NUM | ( EXPR OP EXPR )

This uses standard notation (from the field of linguistics) for specifying a grammar. It means that an expression (EXPR) is either a literal number (NUM) or (|) two expressions with an operator (OP) between them, all enclosed in parentheses.

Now say we have an expression in this language:

((14 - 3) * ((2 + 51) / ((3 + 1) * 4)))

We can interpret (sub)expressions that fit the first production (NUM) to be leaves in a tree, and other expressions to be non-leaf nodes. We can represent this expression, then, with the following tree.

                 *
                / \
               /   \
              /     \
             -       '/'
            / \     /   \
          14   3   /     \
                  /       \
                 +         *
                / \       / \
               2   51    /   \
                        +     4
                       / \
                      3   1

Notice that the parentheses don't appear; they were merely used for discerning the tree structure.
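
As a rough illustration of how such a tree might be held in memory, here is a minimal sketch. The class name Node, its fields, and the helper ExprTreeDemo are assumptions made for this example, not part of the given files; it builds only the subtree for ((3 + 1) * 4) from the expression above.

```java
// Hypothetical node type for an expression tree (names are illustrative only).
class Node {
    final String token;   // a number such as "14" or an operator such as "*"
    final Node left;      // null for a leaf
    final Node right;     // null for a leaf

    // Leaf: a literal number.
    Node(String token) {
        this(token, null, null);
    }

    // Interior node: an operator with two operand subtrees.
    Node(String token, Node left, Node right) {
        this.token = token;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }
}

class ExprTreeDemo {
    // Builds the subtree for ((3 + 1) * 4) by hand, bottom up.
    static Node build() {
        Node sum = new Node("+", new Node("3"), new Node("1"));
        return new Node("*", sum, new Node("4"));
    }
}
```

The parentheses are not stored anywhere; the nesting of the Node objects carries the same information.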

In some simple interpreters, trees of some sort can be used as the final representation. The program can be interpreted during a traversal of the trees.

In this program you will write an interpreter for a simple language. This language will have one-letter (lowercase) variables, and statements and expressions will be made using the following grammar:

STMT ::= VAR = EXPR
EXPR ::= NUM  |  VAR  | (EXPR OP EXPR)
OP ::= + | - | * | /

Each line of the program will be a statement, except for the last line, which will be a single variable. The output of the program will be the value of the variable in the last line. For example, we could have a program

t = (32 + 21)
a = (614 / (t * ((12 + t) - 45)))
b = (a - t)
b

The output of the program would be -53. (Since all values are integers, t is 53, the division 614 / 1060 yields 0, and so b is 0 - 53.)

2. Setup

After making the directory for this project, check out the given files for this project.

svn checkout file:///cslab/class/cs245/projects/proj5

The file ProcessFile.java is a class I wrote that will "tokenize" a program in the source language; specifically it will chop up the lines into Strings containing symbols, numbers, or variable names. This is for your convenience; you may decide to modify this file (or, I suppose, not use it at all). The file Interpreter.java contains a skeleton of the interpreter program, which you will complete.
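
To give a feel for what tokenizing a line involves, here is a minimal sketch. It is not the code in ProcessFile.java, and the class name TokenizerSketch and the method tokenize are assumptions made for this example; it treats runs of digits as one number token and every other non-blank character as a one-character token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tokenizer; NOT the implementation found in ProcessFile.java.
class TokenizerSketch {
    // Splits one source line into number, variable, and symbol tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                  // skip blanks
            } else if (Character.isDigit(c)) {
                int start = i;                        // gather a whole number
                while (i < line.length() && Character.isDigit(line.charAt(i))) {
                    i++;
                }
                tokens.add(line.substring(start, i));
            } else {
                tokens.add(String.valueOf(c));        // one-letter variable or symbol
                i++;
            }
        }
        return tokens;
    }
}
```

Under these assumptions, the line t = (32 + 21) would be split into the seven tokens t, =, (, 32, +, 21, and ).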

3. Interpreter details

The interpreter has five phases: tokenization (technically called "lexical analysis"), tree building (parsing or syntactical analysis), tree printing, tree optimization (extra credit), and tree interpretation.

The tokenization is already done for you (though, as mentioned above, you may decide to modify this).

In the next step, you will need to turn each line into a tree. More accurately, each line is a statement, and a statement has two things: a target variable and an expression tree (or the root node of an expression tree). You will need to design classes to hold this, something like

public class Statement {
    String targetVariable;
    Tree expr;
}

public class Tree {
    Node root;
}

The entire program, then, may be something like an ArrayList of these Statements. This phase is about building trees. You need to think about how to do this.
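
One possible approach, offered only as a sketch, is to let the code mirror the grammar: an EXPR is either a single token (a leaf) or an opening parenthesis followed by an EXPR, an OP, an EXPR, and a closing parenthesis. The names Node, ParserSketch, and parseExpr are assumptions for this example, and so is the idea that the tokens for one expression arrive as an Iterator of Strings; the given tokenizer may hand them to you differently.

```java
import java.util.Iterator;

// Hypothetical node type (illustrative names only).
class Node {
    final String token;
    final Node left;
    final Node right;

    Node(String token) {
        this(token, null, null);
    }

    Node(String token, Node left, Node right) {
        this.token = token;
        this.left = left;
        this.right = right;
    }
}

class ParserSketch {
    // Parses one EXPR, consuming exactly the tokens that belong to it.
    // Assumes the input is a correct expression, as the assignment allows.
    static Node parseExpr(Iterator<String> tokens) {
        String t = tokens.next();
        if (!t.equals("(")) {
            return new Node(t);              // NUM or VAR: a leaf
        }
        Node left = parseExpr(tokens);       // first operand
        String op = tokens.next();           // one of + - * /
        Node right = parseExpr(tokens);      // second operand
        tokens.next();                       // discard the closing ")"
        return new Node(op, left, right);
    }
}
```

For a whole statement, the first two tokens (the target variable and the = sign) would be read off before calling parseExpr on the rest.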

Next (and this is mainly for debugging purposes), you should reconstruct the program and print it out to the screen, based on a depth first traversal of your trees. The best way to do this is to write a print() method for each of your classes (Statement, Tree, and Node, in the example above) and then iterate through the ArrayList holding all your statements (or whatever structure you choose) and call the print() methods. All this will help you make sure you have constructed the trees correctly.
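
A depth-first traversal that visits the left subtree, then the operator, then the right subtree reproduces the original text if parentheses are put back around every non-leaf node. The sketch below builds a String rather than printing directly, which makes it easier to check; the names Node, TreePrinter, and show are assumptions for this example.

```java
// Hypothetical node type (illustrative names only).
class Node {
    final String token;
    final Node left;
    final Node right;

    Node(String token) {
        this(token, null, null);
    }

    Node(String token, Node left, Node right) {
        this.token = token;
        this.left = left;
        this.right = right;
    }
}

class TreePrinter {
    // Rebuilds the source text of an expression from its tree.
    static String show(Node n) {
        if (n.left == null) {
            return n.token;                  // leaf: number or variable
        }
        // Interior node: parenthesize, as the grammar requires.
        return "(" + show(n.left) + " " + n.token + " " + show(n.right) + ")";
    }

    // Rebuilds a whole statement: target variable, "=", expression.
    static String show(String targetVariable, Node expr) {
        return targetVariable + " = " + show(expr);
    }
}
```

A print() method could then be as short as System.out.println(TreePrinter.show(targetVariable, expr)).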

For extra credit, you may next implement the optimization phase, in which you will look for inefficiencies in the trees. This phase is about comparing and modifying trees. Specifically, you should change them according to the following arithmetic identities, where tr stands for some subexpression/subtree:

tr + tr translates to tr * 2
tr + 0 and 0 + tr translate to tr
tr * 1 and 1 * tr translate to tr
tr - 0 and tr / 1 translate to tr

If you do this portion, please print out your new trees (by calling your print() methods, or the equivalent, again) to demonstrate that the transformations succeeded.
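
The identities above could be applied bottom up, simplifying the children of a node before looking at the node itself, so that a simplification deep in the tree can expose another one higher up. This is a sketch of one way to do it, not a required design; the names Node, Optimizer, same, isNum, and simplify are assumptions for this example, and it returns a new tree instead of modifying the old one.

```java
// Hypothetical node type (illustrative names only).
class Node {
    final String token;
    final Node left;
    final Node right;

    Node(String token) {
        this(token, null, null);
    }

    Node(String token, Node left, Node right) {
        this.token = token;
        this.left = left;
        this.right = right;
    }
}

class Optimizer {
    // True if the two subtrees have the same shape and the same tokens.
    static boolean same(Node a, Node b) {
        if (a == null || b == null) {
            return a == b;
        }
        return a.token.equals(b.token) && same(a.left, b.left) && same(a.right, b.right);
    }

    // True if n is a leaf holding exactly the given literal, e.g. "0" or "1".
    static boolean isNum(Node n, String value) {
        return n.left == null && n.token.equals(value);
    }

    // Returns a simplified copy of the tree rooted at n.
    static Node simplify(Node n) {
        if (n.left == null) {
            return n;                                   // leaves are already simple
        }
        Node l = simplify(n.left);
        Node r = simplify(n.right);
        switch (n.token) {
            case "+":
                if (isNum(r, "0")) return l;            // tr + 0
                if (isNum(l, "0")) return r;            // 0 + tr
                if (same(l, r)) return new Node("*", l, new Node("2"));  // tr + tr
                break;
            case "*":
                if (isNum(r, "1")) return l;            // tr * 1
                if (isNum(l, "1")) return r;            // 1 * tr
                break;
            case "-":
                if (isNum(r, "0")) return l;            // tr - 0
                break;
            case "/":
                if (isNum(r, "1")) return l;            // tr / 1
                break;
            default:
                break;
        }
        return new Node(n.token, l, r);                 // no identity applied
    }
}
```

Under this sketch, ((x + 0) + (x * 1)) first becomes (x + x) through the children and then (x * 2) at the root.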

The final phase is about traversing trees. Here you will need to interpret the program by doing a depth-first post-order traversal of the expression trees and storing the results in appropriate variables (use a HashMap for this?).
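
In a post-order traversal, both operand subtrees are evaluated before the operator at their parent is applied. The following is a minimal sketch under some assumptions: the names Node, Evaluator, and eval are made up for this example, a leaf whose token starts with a lowercase letter is taken to be a variable, and any other leaf is taken to be an integer literal.

```java
import java.util.Map;

// Hypothetical node type (illustrative names only).
class Node {
    final String token;
    final Node left;
    final Node right;

    Node(String token) {
        this(token, null, null);
    }

    Node(String token, Node left, Node right) {
        this.token = token;
        this.left = left;
        this.right = right;
    }
}

class Evaluator {
    // Evaluates the expression rooted at n, looking variables up in vars.
    static int eval(Node n, Map<String, Integer> vars) {
        if (n.left == null) {
            if (Character.isLowerCase(n.token.charAt(0))) {
                return vars.get(n.token);            // variable (assumed initialized)
            }
            return Integer.parseInt(n.token);        // integer literal
        }
        int l = eval(n.left, vars);                  // post-order: operands first,
        int r = eval(n.right, vars);
        switch (n.token) {                           // then the operator
            case "+": return l + r;
            case "-": return l - r;
            case "*": return l * r;
            case "/": return l / r;                  // integer division
            default:
                throw new IllegalArgumentException("unknown operator " + n.token);
        }
    }
}
```

Interpreting a statement then amounts to something like vars.put(targetVariable, Evaluator.eval(expr, vars)), and the last line of the program is a lookup of one variable in the same map.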

In all this, you may assume the input programs are correct (including that no variable is used before it is initialized). All values in the language are integers.

4. To turn in

Copy all the files you made or modified to a turn-in directory I've made for you.

cp filename /cslab/class/cs245/turnin/proj5/{ben,elizabeth,hudson,neile,stephen,tim}

DUE: Friday, Apr 4, 5:00 pm.

Thomas VanDrunen

Last modified: Fri Apr 4 07:43:06 CDT 2008