Project 2: A spelling-checker

The goals of this project are

To make and use a practical implementation of the minimum edit distance algorithm.
To develop a language model.
To use minimum edit distance and a language model in implementing a spelling corrector.

You will write a program that reads in a text and produces a similar text in which some words have been replaced, ideally misspelled words replaced by their correct spellings. The success of the program is measured by how well it does this correction: you want to catch as many misspelled words as possible, replace them with the right word as often as possible, and change correctly spelled words as rarely as possible (i.e., your program will almost certainly produce some false positives, but you want to minimize them).
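One way to turn those three criteria into numbers is sketched below. It is only an illustration, not a required scoring method. It assumes you have three token lists of equal length that line up one-to-one: the correct text, the same text with errors introduced, and your program's output. The function name evaluate() and the three rate names are made up for this sketch.

```python
def evaluate(original, corrupted, corrected):
    """Score a corrector from three aligned token lists: the correct text,
    the text with errors introduced, and the program's output."""
    assert len(original) == len(corrupted) == len(corrected)
    errors = caught = fixed = clean = false_pos = 0
    for gold, noisy, out in zip(original, corrupted, corrected):
        if noisy != gold:          # this token was misspelled
            errors += 1
            if out != noisy:       # the program changed it
                caught += 1
            if out == gold:        # and changed it to the right word
                fixed += 1
        else:                      # this token was spelled correctly
            clean += 1
            if out != noisy:       # the program changed a word that was fine
                false_pos += 1
    return {
        "detection_rate": caught / errors if errors else 0.0,
        "correction_rate": fixed / errors if errors else 0.0,
        "false_positive_rate": false_pos / clean if clean else 0.0,
    }
```

A program that changes nothing scores 0 on the first two rates and 0 on the third, so you need to watch all three numbers together rather than any one alone.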

1. Given resources

You can find the following files in ~tvandrun/Public/cs394/proj2:

spellcheck.py: A dummy program that gives the structure of what your final program should do. It reads in a file, tokenizes it by words (using NLTK's tokenizer), passes each token to a function correct_spell() that returns a replacement word (currently correct_spell() always returns the word it gets), and reassembles the words into a new text.
experiment.py: The program I showed you at the end of class. It uses the vocabulary from the Baum training corpus and generates a list of words from that vocabulary that are within an edit distance of 5 from "lave"; it also gives the frequency of edit distances less than 30 (actually, it gives the frequency of edit distances that make it past the cut-off of 25 for intermediate steps). It also contains an implementation of minimum edit distance for reference.
baum-train.txt, included because experiment.py uses it.
baum-test.txt, included because it is a small file that you can use to figure out how correct_spell() currently works. Since it doesn't contain any spelling errors (that I noticed), it's not a useful test case for you.

Your task, then, is to modify spellcheck.py so that the decisions it makes are based on probabilities suggested by a language model and the edit distance between words.

2. The minimum way to complete this project

Find an appropriate corpus, and use it to make a word list (vocabulary) and to train a unigram language model (you can use code from class, for example bigramLM.py from Sept 19, to accomplish this).
Use the implementation of minimum edit distance from the given code.
Write the program so that each word in the text being corrected is judged to be misspelled if it does not occur in the word list (i.e., look only for non-word errors).
Make a list of candidate words by ripping code from experiment.py.
Decide on the best candidate word using some combination of edit distance and unigram probabilities.
Test your program using a test case with made-up spelling errors.
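The steps above might fit together roughly as in the following sketch. It is not the given code: the regular-expression tokenizer is a stand-in for NLTK's tokenizer, the edit distance uses unit costs for every edit (the version in experiment.py may weight edits differently), and the helper train_unigram(), the extra arguments to correct_spell(), and the max_distance cut-off of 2 are all assumptions made for illustration. Candidates are ranked by edit distance first and unigram probability second, which is only one of many possible combinations.

```python
import re
from collections import Counter

def train_unigram(text):
    """Return word counts and the total token count for raw training text."""
    words = re.findall(r"[a-z']+", text.lower())  # stand-in for NLTK's tokenizer
    counts = Counter(words)
    return counts, sum(counts.values())

def min_edit_distance(source, target):
    """Minimum edit distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            curr.append(min(prev[j] + 1,               # delete s
                            curr[j - 1] + 1,           # insert t
                            prev[j - 1] + (s != t)))   # substitute, or match for free
        prev = curr
    return prev[-1]

def correct_spell(word, counts, total, max_distance=2):
    """Return word unchanged if it is in the vocabulary (or is not alphabetic);
    otherwise return the nearest vocabulary word within max_distance,
    breaking ties by unigram probability."""
    lower = word.lower()
    if lower in counts or not lower.isalpha():
        return word
    best, best_key = word, None
    for candidate, count in counts.items():
        distance = min_edit_distance(lower, candidate)
        if distance > max_distance:
            continue
        key = (distance, -count / total)  # smaller distance, then higher probability
        if best_key is None or key < best_key:
            best, best_key = candidate, key
    return best
```

Note that this sketch returns candidates in lower case and compares the word against the whole vocabulary, which is slow for a large corpus; both are things you would want to handle in your own version.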

3. How to ramp it up (which I expect most of you to do, each according to his or her ability)

Implement a higher n-gram language model, and smooth it. This will allow you to make contextual decisions.
Correct not only non-word errors but real-word errors. This will require you to take into consideration that some words are very unlikely in certain contexts.
Think about ways to tweak edit distance, such as giving finer-grained costs for individual edits.
Choose the best candidate word using not only edit distance and unigram probabilities but also contextual considerations.
Test your program using real-life texts that contain real spelling errors.
Track how your program's performance improves as you make these changes.
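To give a feel for what a contextual decision could look like, here is a small sketch of a bigram model with add-one (Laplace) smoothing, used to choose among candidates given the preceding word. Add-one is just the simplest smoothing to write down, not necessarily the best choice. The function names, the "<s>" start marker, and the distance_penalty weight that trades edit distance against log probability are all assumptions for this illustration; the candidate dictionary maps each candidate word to its edit distance from the word actually seen, with the seen word itself at distance 0.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and bigrams over a token list, with a start marker."""
    padded = ["<s>"] + tokens
    unigrams = Counter(padded)
    bigrams = Counter(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def pick_in_context(prev, candidates, unigrams, bigrams, distance_penalty=1.0):
    """Pick the candidate with the best smoothed bigram log probability,
    minus a penalty for each unit of edit distance from the observed word."""
    def score(item):
        word, distance = item
        return bigram_logprob(prev, word, unigrams, bigrams) - distance_penalty * distance
    return max(candidates.items(), key=score)[0]

# Toy training data; "peace" and "piece" are both real words.
tokens = ("i ate a piece of cake . we want peace on earth . "
          "a piece of pie . a piece of string").split()
unigrams, bigrams = train_bigram(tokens)
after_a = pick_in_context("a", {"peace": 0, "piece": 1}, unigrams, bigrams)
after_want = pick_in_context("want", {"peace": 0, "piece": 1}, unigrams, bigrams)
```

With this toy corpus the model prefers "piece" after "a" even though "peace" is a real word, and leaves "peace" alone after "want". How heavily to penalize edit distance is exactly the kind of choice worth experimenting with and discussing in your report.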

4. To turn in

Turn in your code and test cases and a short (about one page) report on the choices you made for varying edit distance, the language model, etc. Discuss your program's performance, especially what, if anything, you observed to improve it. Explain the test cases you used. Copy your files to

/cslab.all/ubuntu/cs394/proj2/your_name

DUE: 12:00 midnight between Wednesday, Nov 6, and Thursday, Nov 7, 2013.

Thomas VanDrunen

Last modified: Tue Nov 5 15:10:29 CST 2013