The goal of this project is to understand the various kinds of language models we have talked about in class and to explore their relative merits. Your task will be to implement several language models.
You can find starter code for this project at
/homes/tvandrun/Public/cs384/proj2
In that folder you will find a Python package langmod containing the following:

- langmod.py, containing the following classes:
  - LanguageModelData, a class containing the data that a language model is built from---mainly frequency distributions for unigrams, bigrams, and trigrams. More on this below.
  - ConstantLanguageModel, a language model that simply gives equal probability to every type.
  - UnigramLanguageModel etc., classes for unsmoothed and Laplace-smoothed models for unigrams, bigrams, and trigrams; a sketch of the kind of computation these perform appears after this list.
  - InterpolatedLanguageModel, a stub of a class for implementing a language model interpolated among other language models.
- probplex.py, a program that computes the perplexity of language models on a test set.
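For reference, the Laplace-smoothed models compute probabilities of the form P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size. Here is a minimal sketch of a bigram version; the class name and the attribute names on lm_data (bigram_counts, unigram_counts, vocab) are assumptions for illustration, not necessarily what the starter code uses.

    class LaplaceBigramSketch:
        # Hypothetical add-one-smoothed bigram model; adapt the attribute
        # names below to whatever LanguageModelData actually provides.
        def __init__(self, lm_data):
            self.bigrams = lm_data.bigram_counts    # dict: (w1, w2) -> count
            self.unigrams = lm_data.unigram_counts  # dict: w -> count
            self.v = len(lm_data.vocab)             # vocabulary size

        def prob(self, w2, w1):
            # P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
            return (self.bigrams.get((w1, w2), 0) + 1.0) / \
                   (self.unigrams.get(w1, 0) + self.v)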
In addition to the starter code, you will find a variety of text files of different sizes. They are drawn from what is intended to be a balanced corpus, specifically compiled from
The LanguageModelData class is set up to be used as a singleton. The given code instantiates this class once by feeding it text from the file training.txt.

Specifically, we assume a vocabulary made up of the words in the file /usr/share/dict/words, all turned to lowercase, plus three special strings: NUM for numbers, PNCT for punctuation, and OOV for out-of-vocabulary words, that is, words not in the vocabulary. The code that reads and tokenizes the training text produces a sequence of strings from this set.
Apostrophes are considered part of the alphabet, and contractions and possessives are kept intact. For example, the sentence "I can't see the Jabberwocky, can you?" would be tokenized as

    ['i', "can't", 'see', 'the', 'OOV', 'PNCT', 'can', 'you', 'PNCT']

Punctuation, numbers, and out-of-vocabulary words are considered types in themselves.
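To make these conventions concrete, here is a rough sketch of how such a tokenizer could behave. The function name and the regular expression are assumptions for illustration; the actual code in langmod.py may differ in its details.

    import re

    def tokenize_sketch(text, vocab):
        # Hypothetical tokenizer illustrating the conventions above.
        # Words (possibly with apostrophes), runs of digits, and single
        # punctuation characters are pulled out in one pass.
        tokens = []
        for tok in re.findall(r"[a-zA-Z']+|\d+|[^\w\s]", text):
            if tok[0].isdigit():
                tokens.append('NUM')
            elif tok.lower() in vocab:
                tokens.append(tok.lower())
            elif tok[0].isalpha() or "'" in tok:
                tokens.append('OOV')
            else:
                tokens.append('PNCT')
        return tokens

Given a vocabulary containing "i", "can't", "see", "the", "can", and "you" but not "jabberwocky", this reproduces the example above.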
The variable lm_data refers to the object of this class for the training data. In probplex.py it is passed to the constructors of the language model classes. You can run probplex.py from the command line as

python langmod/probplex.py
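In other words, the overall flow is plausibly along these lines; the import path and constructor signatures shown here are assumptions based on the description above, not a transcript of the starter code.

    # Hypothetical sketch of how probplex.py wires things together.
    from langmod import LanguageModelData, ConstantLanguageModel, UnigramLanguageModel

    lm_data = LanguageModelData(open('training.txt').read())  # built once (singleton)
    models = [ConstantLanguageModel(lm_data), UnigramLanguageModel(lm_data)]
    # ... then compute the perplexity of each model on the test set ...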
You are to implement three language models. Two of them you need to write from scratch; specifically write classes to implement two of the following three options:
Additionally,
finish InterpolatedLanguageModel
.
This means implementing the algorithm to find optimal
weights---either in the constructor itself
or in a method called by it.
Notice that a held-out set has been provided, and the
code for tokenizing it already appears in
langmod.py
.
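One plausible way to find the weights, sketched below, is a simple grid search that minimizes perplexity on the held-out data. The interface assumed here (a prob(token, history) method on each component model and a list of held-out tokens) is an assumption for illustration; the stub in the starter code may call for a different approach, such as EM.

    import itertools
    import math

    def best_weights_sketch(models, heldout_tokens, step=0.1):
        # Grid search over weight vectors that sum to 1, keeping the one
        # giving the lowest perplexity on the held-out tokens.
        # Assumes each model has a method prob(token, history).
        best, best_pp = None, float('inf')
        n = int(round(1 / step))
        for combo in itertools.product(range(n + 1), repeat=len(models) - 1):
            if sum(combo) > n:
                continue
            weights = [c * step for c in combo]
            weights.append(1.0 - sum(weights))
            log_sum = 0.0
            for i, tok in enumerate(heldout_tokens):
                history = heldout_tokens[max(0, i - 2):i]
                p = sum(w * m.prob(tok, history)
                        for w, m in zip(weights, models))
                if p <= 0.0:  # no component gave this token any mass
                    log_sum = float('-inf')
                    break
                log_sum += math.log(p)
            pp = math.exp(-log_sum / len(heldout_tokens))
            if pp < best_pp:
                best, best_pp = weights, pp
        return best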
Modify probplex.py
to test the perplexity of the
language models you have implemented.
Experiment with different combinations of language models to
interpolate.
You may change the held-out set and test set,
especially to experiment with different sizes.
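Recall that perplexity on a test set of N tokens is exp(-(1/N) * sum_i log P(w_i | history)). As a point of comparison for your changes, a minimal sketch of that computation, again assuming a prob(token, history) interface, looks like this:

    import math

    def perplexity_sketch(model, test_tokens):
        # Perplexity = exp( -(1/N) * sum of log P(w_i | history) ).
        # Assumes model.prob(token, history) returns a probability > 0.
        log_sum = 0.0
        for i, tok in enumerate(test_tokens):
            history = test_tokens[max(0, i - 2):i]
            log_sum += math.log(model.prob(tok, history))
        return math.exp(-log_sum / len(test_tokens))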
I am including a document that summarizes the formulas and algorithms we have seen in class that are useful for this project. It also includes a description of regression analysis.
Turn in your code and any changes to any corpus.
Also turn in a short write-up (a couple of paragraphs or so)
explaining which options you chose, any difficulties you
ran into, any changes you made to the rest of the code and/or
corpus, and what you learned from the experiments
(i.e., the results from probplex.py).
Turn it in to:
/cslab.all/ubuntu/cs384/(your login id)/proj2
DUE: Midnight between Monday, Oct 26 and Tuesday, Oct 27.