The goals of this project are
You will write a program that reads in a text and produces a similar text in which some words have been replaced, ideally misspelled words replaced by their correctly spelled versions. The success of the program is measured by how well it does this correction: you want to catch as many misspelled words as possible, replace them with the right word as often as possible, and change correctly spelled words as rarely as possible (i.e., your program will almost certainly have some false positives, but you want to minimize them).
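One way to put numbers on those three criteria is sketched below. It assumes you have built a test case yourself as three token lists of equal length: the correct text, the same text with misspellings introduced, and your program's output. The function name score_corrections and the one-to-one token alignment are assumptions made for illustration; they are not part of the provided files.

```python
def score_corrections(original, corrupted, corrected):
    """Compare three aligned token lists and summarize the outcomes."""
    assert len(original) == len(corrupted) == len(corrected)
    errors = caught = fixed = clean = false_pos = 0
    for truth, seen, out in zip(original, corrupted, corrected):
        if seen != truth:              # the input token was misspelled
            errors += 1
            caught += out != seen      # the program changed it
            fixed += out == truth      # ...and changed it to the right word
        else:                          # the input token was already correct
            clean += 1
            false_pos += out != seen   # the program changed it anyway
    return {
        "detection_rate": caught / errors if errors else 0.0,
        "correction_rate": fixed / errors if errors else 0.0,
        "false_positive_rate": false_pos / clean if clean else 0.0,
    }


# Hypothetical test case: two misspellings, one of them fixed correctly,
# and one correctly spelled word ("lion") changed by mistake.
original = "the wizard gave the lion courage".split()
corrupted = "the wizrd gave teh lion courage".split()
corrected = "the wizard gave ten loon courage".split()
result = score_corrections(original, corrupted, corrected)
print(result)
```

If your tokenizer can split or merge tokens, the three lists will not line up this simply and you would need an alignment step first.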
You can find the following files in ~tvandrun/Public/cs394/proj2:
spellcheck.py: A dummy program that gives the structure of what your final program should do: it reads in a file, tokenizes it by words (using NLTK's tokenizer), subjects each token to a function correct_spell() that returns a replacement word (currently correct_spell() always returns the word it gets), and reassembles the words into a new text.
experiment.py: The program I showed you at the end of class; it uses the vocabulary from the Baum training corpus and generates a list of words from that vocabulary that are within an edit distance of 5 from lave; it also gives the frequency of edit distances less than 30 (actually, it gives the frequency of edit distances that make it past the cut-off of 25 for intermediate steps). This also contains an implementation of minimum edit distance for reference.
baum-train.txt, included because experiment.py uses it.
baum-test.txt, included because it is a small file that you can use to figure out how correct_spell() currently works. Since it doesn't contain any spelling errors (that I noticed), it's not a useful test case for you.
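For orientation, here is a minimal sketch of minimum edit distance with an early cut-off, plus a helper that lists vocabulary words near a given word. It is not the code in experiment.py: it assumes unit costs for insertion, deletion, and substitution, and the names min_edit_distance, candidates, and the toy vocabulary are made up for this example. The costs and cut-off used in experiment.py may well differ.

```python
def min_edit_distance(source, target, cutoff=None):
    """Levenshtein distance with unit costs; None if it exceeds cutoff."""
    previous = list(range(len(target) + 1))   # row for the empty source prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute or match
            ))
        # Row minima never decrease, so once every cell is over the
        # cut-off the final distance must be over it too.
        if cutoff is not None and min(current) > cutoff:
            return None
        previous = current
    distance = previous[-1]
    if cutoff is not None and distance > cutoff:
        return None
    return distance


def candidates(word, vocabulary, max_distance):
    """Return (distance, word) pairs within max_distance, closest first."""
    found = []
    for entry in vocabulary:
        d = min_edit_distance(word, entry, cutoff=max_distance)
        if d is not None:
            found.append((d, entry))
    return sorted(found)


# Toy vocabulary standing in for the one built from baum-train.txt.
toy_vocabulary = ["love", "leave", "wave", "wizard", "emerald"]
print(candidates("lave", toy_vocabulary, max_distance=2))
```

The cut-off only saves work; it does not change which words are returned for a given max_distance.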
Your task, then, is to modify spellcheck.py so that the decisions it makes are based on probabilities suggested by a language model and the edit distance between words.
bigramLM.py
from Sept 19, to accomplish this).
experiment.py
.
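As one possible starting point, the sketch below combines a bigram probability with an edit-distance penalty to choose a replacement. Everything in it is an assumption made for illustration: ToyBigramModel is a stand-in and not the interface of bigramLM.py, the extra parameters on correct_spell() are hypothetical, and the values max_distance=2 and edit_penalty=4.0 are arbitrary knobs you would need to tune.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Plain Levenshtein distance with unit costs."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1,
                               row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


class ToyBigramModel:
    """Add-one smoothed bigram model over a list of tokens."""

    def __init__(self, tokens):
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.vocab = set(tokens)

    def log_prob(self, prev, word):
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + len(self.vocab)
        return math.log(numerator / denominator)


def correct_spell(word, prev, model, max_distance=2, edit_penalty=4.0):
    """Pick the vocabulary word with the best combined score."""
    if word in model.vocab:
        return word          # simplification: known words are left alone
    best, best_score = word, -math.inf
    for cand in sorted(model.vocab):      # sorted for reproducible ties
        d = edit_distance(word, cand)
        if d > max_distance:
            continue
        # Language-model evidence minus a cost for each edit.
        score = model.log_prob(prev, cand) - edit_penalty * d
        if score > best_score:
            best, best_score = cand, score
    return best


corpus = ("the cowardly lion met the tin woodman and "
          "the tin woodman met the cowardly lion").split()
model = ToyBigramModel(corpus)
print(correct_spell("lin", "cowardly", model))
print(correct_spell("lin", "the", model))
```

Here "lin" is one edit away from both "lion" and "tin", so the preceding word decides between them. Leaving every known word untouched keeps false positives down but cannot catch a misspelling that happens to be another real word; whether to relax that is one of the choices to discuss in your report.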
Turn in your code and test cases and a short (about one page) report on the choices you made for varying edit distance, the language model, etc. Discuss your program's performance, especially what, if anything, you observed to improve it. Explain the test cases you used. Copy your files to
/cslab.all/ubuntu/cs394/proj2/your_name, where your_name is one of [brandon|chris|davenport|elliot|gill|josh|kate|leanne|nathan|taylor|tmac].
DUE: 12:00 midnight between Wednesday, Nov 6, and Thursday, Nov 7, 2013.