Project 3: Edit distance and spelling correction

This project has three goals, which roughly correspond to the statistical, computational, and linguistic aspects of this course, respectively: using a probability-based language model to choose among candidate corrections, implementing the edit distance algorithm, and handling the kinds of spelling mistakes people actually make.

1. Set up

You can find starter code for this project at

/homes/tvandrun/Public/cs384/proj3

In that folder you will find a Python package akeela containing, among other things, the following:

   editdist.py     -- an unfinished implementation of edit distance
   scorefile       -- the costs of the edit transformations
   testeditdist.py -- a test driver for your edit distance implementation
   spellcheck.py   -- the spelling corrector itself

Additionally there are several test files for training, tuning, and testing. Most are identical to those of project 2, but there are two additional files: test-small-common-mistake.txt, a variation on test-small.txt with deliberate spelling mistakes, and test-small-random-mistake.txt, which was produced using interfere.py.
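
If you're curious what "random interference" means in practice, the sketch below is only a guess at the kind of thing a script like interfere.py might do; the function name, the mutation rate, and the substitution-only behavior are all assumptions, not the actual script.

# Hypothetical sketch only; the real interfere.py may mutate text
# differently.  This version randomly replaces a small fraction of the
# letters in a text.
import random
import string

def interfere(text, rate=0.02):
    out = []
    for ch in text:
        if ch.isalpha() and random.random() < rate:
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)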

2. Edit distance

The first part of the project is narrowly defined: finish the implementation of edit distance in the file editdist.py. Recall that the costs of the transformations are parameters to the algorithm; they are stored in the file scorefile.
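
For reference, here is a minimal sketch of what a cost-parametrized edit distance computation looks like. The cost keys ("insert", "delete", "substitute") and the function signature below are illustrative only; editdist.py and scorefile may organize things differently, so adapt rather than copy.

# A sketch of cost-parametrized edit distance (Wagner-Fischer style).
# The cost keys and this signature are illustrative; the starter code
# may be organized differently.
def edit_distance(source, target, costs):
    n, m = len(source), len(target)
    # table[i][j] = cheapest way to turn source[:i] into target[:j]
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = table[i - 1][0] + costs["delete"]
    for j in range(1, m + 1):
        table[0][j] = table[0][j - 1] + costs["insert"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else costs["substitute"]
            table[i][j] = min(table[i - 1][j] + costs["delete"],
                              table[i][j - 1] + costs["insert"],
                              table[i - 1][j - 1] + sub)
    return table[n][m]

# With unit costs, "kitten" -> "sitting" comes out to 3.
print(edit_distance("kitten", "sitting",
                    {"insert": 1, "delete": 1, "substitute": 1}))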

As mentioned above, you can test this using testeditdist.py.

3. Spelling correction

The second part of the project is adjusting the spelling correction strategy found in spellcheck.py to improve its accuracy on realistic mistakes that a person would make (including real-word errors) and mistakes introduced by random interference.

The program spellcheck.py reads in a file and, for each word, produces a substitute (possibly the same word) based on some combination of edit distance and probability. In the code as it is given to you, the function correct_spell considers each word in the vocabulary and, for each word x that has a distance of less than 5 from the given word w, computes a score for it using the formula

p(x | history) / (1 + 10 * edit_distance(w, x))

That is, a candidate's score is its probability (conditioned on the recent history) divided by a scaled version of its edit distance, so candidates at a smaller distance get higher scores. The "1 +" part of the formula ensures that the given word itself, which has a distance of 0, does not cause division by zero.
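
In code, that strategy looks roughly like the sketch below. The names vocabulary, lm.probability, and edit_distance are stand-ins for whatever the starter code actually provides; only the scoring formula itself is taken from the description above.

# Sketch of the provided scoring strategy.  The helper names here are
# placeholders, not the actual interfaces in spellcheck.py.
def correct_spell(w, history, vocabulary, lm, edit_distance):
    best, best_score = w, 0.0
    for x in vocabulary:
        d = edit_distance(w, x)
        if d >= 5:                      # only candidates within distance 5
            continue
        score = lm.probability(x, history) / (1 + 10 * d)
        if score > best_score:
            best, best_score = x, score
    return best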

If you run spellcheck.py out of the box, it will use a brain-dead language model (specifically, an "interpolation" of a constant language model and a unigram language model, except the lambdas aren't tuned---they're just .5) and the stub of edit_distance, which always returns 0. You'll get this result on one of the test files:

$ cat test-small.txt 
THERE was no hope for him this time: it was the third stroke. Night
after night I had passed the house (it was vacation time) and studied
the lighted square of window:
$ python akeela/spellcheck.py test-small.txt 
THE the the the the the the the : the the the the the . The 
the the THE the the the the ( the the the the ) the the 
the the the the the : 

Basically, it thinks every word is a misspelling of the. Improve on this by plugging in a better language model (with tuned lambdas), using your finished implementation of edit distance, and adjusting the correction strategy or formula itself.

The strategy or formula doesn't have to be radically different---I've provided this one as a starting point for you to fiddle with. However, I have found that it doesn't perform fantastically well even with a good language model and correct implementation of edit distance, so you might come up with a very different strategy. The only hard requirement is that it must depend on both contextual probability and edit distance. Also, if you use anything like the provided strategy, it will be SLOW.
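
One way to fight that slowness (a suggestion, not a requirement, and not something the starter code does) is to filter out hopeless candidates cheaply before paying for a full edit-distance computation, for example by comparing word lengths:

# Possible speed-up (an assumption about one reasonable strategy, not
# part of the assignment): with unit-ish costs, the edit distance is at
# least the difference in word lengths, so words whose lengths differ
# too much from w's can be skipped without filling in the whole table.
def plausible_candidates(w, vocabulary, max_dist=4):
    for x in vocabulary:
        if abs(len(x) - len(w)) <= max_dist:
            yield x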

You may also develop your own corpus and test cases. You may even choose to modify the goals slightly to focus on realistic typos people might make vs. random mutations---although the best submission would do well on both.

4. Other comments

I encourage you not to use any resources for this besides what's available in our textbook. There are other implementations of edit distance and, more generally, of spelling correction out there, including one that I probably will point you to later this semester (say, to prepare for the final exam), but my intention is that you try this on your own before looking at any of them.

5. To turn in

Turn in any Python file you made changes to.

Please also turn in a short write-up (a couple of paragraphs or so) explaining what you tried (different language models, different strategies, even different training texts) and why, and with what results. You may provide test cases, etc., that you feel demonstrate what your spelling corrector is good at, together with an explanation in the write-up.

Turn these things in to

/cslab.all/ubuntu/cs384/(your login id)/proj3

DUE: Open. The absolute deadline is midnight between Friday, Dec. 11 and Saturday, Dec. 12 (i.e., the last day of classes). I recommend you turn it in earlier than that, and I will (try to) grade these on a continual basis as they come in.


Thomas VanDrunen
Last modified: Thu Oct 22 13:13:15 CDT 2015