This project has three goals (which roughly correspond to the statistical, computational, and linguistic aspects of this course, respectively):
You can find starter code for this project at
/homes/tvandrun/Public/cs384/proj3
In that folder you will find a Python package akeela
containing the following:
langmod.py, identical to the file of the same name given in Project 2. You will likely replace it with your langmod.py from your completed Project 2, possibly with further modification for this project.
editdist.py, which contains a stub for a function computing the edit distance between two strings; you will need to finish this.
testeditdist.py, a simple program that tests your implementation of edit distance by reading two strings from the command line and computing their edit distance.
spellcheck.py, a program that takes in a file and produces a modified version by substituting "corrections" (we hope) for words determined to be incorrectly spelled. This is the "main" program, which you will have to modify.
interfere.py, a program that takes in a file and introduces random mutations:

$ python akeela/interfere.py test-small.txt
THERE was no hope for him this time : it wat tle third stroke. Nighe after night I hfd passed rhe wouse ( ht was vacation time ) bnd studied the lighted square of window :

You can use this to generate samples to test your spell-checker on.
Additionally, there are several test files for training, tuning, and testing. Most are identical to those of Project 2, although there is an additional file test-small-common-mistake.txt, a variation on test-small.txt but with deliberate spelling mistakes, and a file test-small-random-mistake.txt, which was produced using interfere.py.
The first part of the project is narrowly defined: finish the implementation of edit distance in the file editdist.py. Recall that the costs of the transformations are parameters to the algorithm and are stored in the file scorefile. As mentioned above, you can test this using testeditdist.py.
The second part of the project is adjusting the spelling correction strategy
found in spellcheck.py
to improve its accuracy on
realistic mistakes that a person would make (including real-word errors)
and mistakes introduced by random interference.
The program spellcheck.py reads in a file and, for each word, produces a substitute (possibly the same word) based on some combination of edit distance and probability.
In the code as it is given to you, the function correct_spell
considers each word in the vocabulary and, for each word x
that has a distance
less than 5 from the given word w, computes a score for it using the formula
p(x | history) / (1 + 10 * edit_distance(w, x))
That is, a candidate's score is its probability (conditioned on the recent history) divided by a scaled version of its edit distance, so candidates with a smaller distance get a higher score. The "1 +" in the denominator ensures that the given word itself, which has a distance of 0, does not cause division by zero.
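As a rough sketch, the scoring loop might look like the following. The identifiers here (lm.prob, vocabulary, and the edit_distance parameter) are placeholders for whatever spellcheck.py actually uses, not its real names.

# Placeholder version of the provided strategy: score every nearby
# vocabulary word and return the highest-scoring candidate.
def correct_spell_sketch(w, history, lm, vocabulary, edit_distance):
    best_word, best_score = w, 0.0
    for x in vocabulary:
        d = edit_distance(w, x)
        if d >= 5:
            continue  # only consider candidates within distance 5
        score = lm.prob(x, history) / (1 + 10 * d)
        if score > best_score:
            best_word, best_score = x, score
    return best_word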
If you run spellcheck.py out of the box, it will use a brain-dead language model (specifically, an "interpolation" of a constant language model and a unigram language model, except the lambdas aren't tuned; they're just .5) and the stub of edit_distance, which always returns 0.
You'll get this result on one of the test files:
$ cat test-small.txt
THERE was no hope for him this time: it was the third stroke. Night after night I had passed the house (it was vacation time) and studied the lighted square of window:
$ python akeela/spellcheck.py test-small.txt
THE the the the the the the the : the the the the the . The the the THE the the the the ( the the the the ) the the the the the the the :
Basically, it thinks every word is a misspelling of "the". Improve on this by plugging in a better language model (such as the one from your Project 2), finishing the implementation of edit distance, and adjusting the correction strategy.
The strategy or formula doesn't have to be radically different---I've provided this one as a starting point for you to fiddle with. However, I have found that it doesn't perform fantastically well even with a good language model and correct implementation of edit distance, so you might come up with a very different strategy. The only hard requirement is that it must depend on both contextual probability and edit distance. Also, if you use anything like the provided strategy, it will be SLOW.
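If speed becomes a problem, one simple and entirely optional trick is to prune candidates before computing any edit distances: assuming unit insertion and deletion costs, the length difference between two words is a lower bound on their edit distance, so a word whose length differs from w by 5 or more can never pass the distance-less-than-5 check. A sketch:

# Hypothetical pruning helper: with unit insertion/deletion costs,
# abs(len(x) - len(w)) is a lower bound on edit_distance(w, x),
# so words that are far too long or too short can be skipped outright.
def candidates(w, vocabulary, max_dist=5):
    for x in vocabulary:
        if abs(len(x) - len(w)) < max_dist:
            yield x

Whether this (or something like caching distances for repeated words) helps enough depends on your vocabulary size and your strategy.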
You may also develop your own corpus and test cases. You may even choose to modify the goals slightly to focus on realistic typos people might make vs. random mutations, although the best submission would do well on both.
I encourage you not to use any resource for this besides what's available in our textbook. There are other implementations of edit distance and, more generally, of spelling correction out there, including one that I will probably point you to later this semester (say, to prepare for the final exam), but my intention is that you try this on your own before looking at any of them.
Turn in any Python file you made changes to.
Please also turn in a short write-up (a couple of paragraphs or so) explaining what you tried (different language models, different strategies, even different training texts) and why, and with what results. You may provide test cases, etc., that you feel demonstrate what your spelling corrector is good at, together with an explanation in the write-up.
Turn these things in to
/cslab.all/ubuntu/cs384/(your login id)/proj3
DUE: Open. The absolute deadline is midnight between Friday, Dec 11 and Saturday, Dec 12 (i.e., the last day of classes). I recommend you turn it in earlier than that, and I will (try to) grade these on a continual basis as they come in.