The goal of this project is to understand the various kinds of language models we have talked about in class and to explore their relative merits. Your task will be to implement several language models.
You can find starter code for this project at
/homes/tvandrun/Public/cs384/proj2
In that folder you will find a Python package langmod containing the following:

- langmod.py, containing the following classes:
  - LanguageModelData, a class containing the data that a language model is built from---mainly frequency distributions for unigrams, bigrams, and trigrams. More on this below.
  - ConstantLanguageModel, a language model that simply gives equal probability to every type.
  - UnigramLanguageModel etc., classes for unsmoothed and Laplace-smoothed models for unigrams, bigrams, and trigrams; a sketch of the kind of computation these perform appears after this list.
  - InterpolatedLanguageModel, a stub of a class for implementing a language model interpolated among other language models.
- probplex.py, a program that computes the perplexity of language models on a test set.
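For reference, the Laplace-smoothed models compute probabilities of the form P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size. Here is a minimal sketch of a bigram version; the class name and the attribute names on lm_data (bigram_counts, unigram_counts, vocab) are assumptions for illustration, not necessarily what the starter code uses.

    class LaplaceBigramSketch:
        # Hypothetical add-one-smoothed bigram model; adapt the attribute
        # names below to whatever LanguageModelData actually provides.
        def __init__(self, lm_data):
            self.bigrams = lm_data.bigram_counts    # dict: (w1, w2) -> count
            self.unigrams = lm_data.unigram_counts  # dict: w -> count
            self.v = len(lm_data.vocab)             # vocabulary size

        def prob(self, w2, w1):
            # P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
            return (self.bigrams.get((w1, w2), 0) + 1.0) / \
                   (self.unigrams.get(w1, 0) + self.v)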
In addition to the starter code, you will find a variety of text files of different sizes. They are drawn from what is intended to be a balanced corpus, specifically compiled from
The LanguageModelData class is set up to be used as a singleton. The given code instantiates this class once by feeding it text from the file training.txt.

Specifically, we assume a vocabulary made up of the words in the file /usr/share/dict/words, all turned to lowercase, plus three special strings: NUM for numbers, PNCT for punctuation, and OOV for out-of-vocabulary words, that is, words not in the vocabulary. The code that reads and tokenizes the training text produces a sequence of strings from this set.
Apostrophes are considered part of the alphabet, and contractions and possessives are kept intact. For example, the sentence "I can't see the Jabberwocky, can you?" would be tokenized as

    ['i', "can't", 'see', 'the', 'OOV', 'PNCT', 'can', 'you', 'PNCT']

Punctuation, numbers, and out-of-vocabulary words are considered types in themselves.
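To make these conventions concrete, here is a rough sketch of how such a tokenizer could behave. The function name and the regular expression are assumptions for illustration; the actual code in langmod.py may differ in its details.

    import re

    def tokenize_sketch(text, vocab):
        # Hypothetical tokenizer illustrating the conventions above.
        # Words (possibly with apostrophes), runs of digits, and single
        # punctuation characters are pulled out in one pass.
        tokens = []
        for tok in re.findall(r"[a-zA-Z']+|\d+|[^\w\s]", text):
            if tok[0].isdigit():
                tokens.append('NUM')
            elif tok.lower() in vocab:
                tokens.append(tok.lower())
            elif tok[0].isalpha() or "'" in tok:
                tokens.append('OOV')
            else:
                tokens.append('PNCT')
        return tokens

Given a vocabulary containing "i", "can't", "see", "the", "can", and "you" but not "jabberwocky", this reproduces the example above.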
The variable lm_data refers to the object of this class for the training data. In probplex.py it is passed to the constructors of the language model classes. You can run probplex.py from the command line as

python langmod/probplex.py
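In other words, the overall flow is plausibly along these lines; the import path and constructor signatures shown here are assumptions based on the description above, not a transcript of the starter code.

    # Hypothetical sketch of how probplex.py wires things together.
    from langmod import LanguageModelData, ConstantLanguageModel, UnigramLanguageModel

    lm_data = LanguageModelData(open('training.txt').read())  # built once (singleton)
    models = [ConstantLanguageModel(lm_data), UnigramLanguageModel(lm_data)]
    # ... then compute the perplexity of each model on the test set ...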
You are to implement three language models. Two of them you need to write from scratch; specifically write classes to implement two of the following three options:
Additionally,
finish InterpolatedLanguageModel
.
This means implementing the algorithm to find optimal
weights---either in the constructor itself
or in a method called by it.
Notice that a held-out set has been provided, and the
code for tokenizing it already appears in
langmod.py
.
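One plausible way to find the weights, sketched below, is a simple grid search that minimizes perplexity on the held-out data. The interface assumed here (a prob(token, history) method on each component model and a list of held-out tokens) is an assumption for illustration; the stub in the starter code may call for a different approach, such as EM.

    import itertools
    import math

    def best_weights_sketch(models, heldout_tokens, step=0.1):
        # Grid search over weight vectors that sum to 1, keeping the one
        # giving the lowest perplexity on the held-out tokens.
        # Assumes each model has a method prob(token, history).
        best, best_pp = None, float('inf')
        n = int(round(1 / step))
        for combo in itertools.product(range(n + 1), repeat=len(models) - 1):
            if sum(combo) > n:
                continue
            weights = [c * step for c in combo]
            weights.append(1.0 - sum(weights))
            log_sum = 0.0
            for i, tok in enumerate(heldout_tokens):
                history = heldout_tokens[max(0, i - 2):i]
                p = sum(w * m.prob(tok, history)
                        for w, m in zip(weights, models))
                if p <= 0.0:  # no component gave this token any mass
                    log_sum = float('-inf')
                    break
                log_sum += math.log(p)
            pp = math.exp(-log_sum / len(heldout_tokens))
            if pp < best_pp:
                best, best_pp = weights, pp
        return best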
Modify probplex.py
to test the perplexity of the
language models you have implemented.
Experiment with different combinations of language models to
interpolate.
You may change the held-out set and test set,
especially to experiment with different sizes.
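Recall that perplexity on a test set of N tokens is exp(-(1/N) * sum_i log P(w_i | history)). As a point of comparison for your changes, a minimal sketch of that computation, again assuming a prob(token, history) interface, looks like this:

    import math

    def perplexity_sketch(model, test_tokens):
        # Perplexity = exp( -(1/N) * sum of log P(w_i | history) ).
        # Assumes model.prob(token, history) returns a probability > 0.
        log_sum = 0.0
        for i, tok in enumerate(test_tokens):
            history = test_tokens[max(0, i - 2):i]
            log_sum += math.log(model.prob(tok, history))
        return math.exp(-log_sum / len(test_tokens))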
I am including a document that summarizes the formulas and algorithms we have seen in class that are useful for this project. It also includes a description of regression analysis.
Turn in your code and any changes to any corpus.
Also turn in a short write-up (a couple of paragraphs or so)
explaining which options you chose, any difficulties you
ran into, any changes you made to the rest of the code and/or
corpus, and what you learned from the experiments
(i.e., the results from probplex.py).
Turn it in to:
/cslab.all/ubuntu/cs384/(your login id)/proj2
DUE: Midnight between Monday, Oct 26 and Tuesday, Oct 27.