The goal of this project is to understand the part-of-speech tagging problem, the hidden Markov model (HMM) approach to the problem, and the Viterbi algorithm for HMMs. You will write a program that will train an HMM POS-tagger on a (tagged) corpus and then tag a testing sample.
The big-picture view of this project is:
The hard parts will be 4 and 5. I don't anticipate part 4 being that hard conceptually, but it will take a fair amount of programming, and there will be some corner cases to worry about. Part 5 will be hard conceptually, but you do have the algorithm given to you in the book (Figure 5.17 on page 147).
I am giving you a file that contains the structure of the project, which you can finish, though you can modify the given code to take it in a slightly different direction from my suggestions. Find the code at ~tvandrun/Public/cs384/proj4.
As you read the project description, you can follow along in the given code.
Don't spend a long time deciding on a corpus. Just find something that you can work with. To make this project more manageable, I recommend you get text with a smallish vocabulary and simple sentences: children's literature would be great. Also make sure you can get testing data that is similar to the training corpus, such as something by the same author, perhaps from a different book in the same series.
You may adjust your corpus as you go along, experimentally. If you find that building the probability matrices is too slow, then make your corpus smaller. If it goes very quickly, then make your corpus larger.
Recall that you can load a corpus from a text file using NLTK's PlaintextCorpusReader.
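For reference, a self-contained sketch of loading a plain-text corpus with PlaintextCorpusReader; the directory and filename here are placeholders created on the fly just for illustration (you will point the reader at your own corpus files):

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Create a throwaway directory with one tiny text file, standing in
# for your real corpus directory.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, 'sample.txt'), 'w') as f:
    f.write("The cat sat on the mat.")

# The second argument is a regex matching the corpus files to load.
reader = PlaintextCorpusReader(corpus_dir, r'.*\.txt')
tokens = list(reader.words('sample.txt'))   # tokenized words from the file
```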
Make a vocabulary from the most frequent words in the corpus. I've provided code for this, with a suggested vocabulary size of 500, which you can adjust. All tokens from types not in the vocabulary will be treated as instances of the special "***" type.
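The given code handles this for you; purely as an illustration of the idea, a vocabulary built from the most frequent types might look like the following sketch (the names `build_vocab` and `normalize_token` are mine, not from the given code, and whether to lowercase is a design decision):

```python
from collections import Counter

UNKNOWN = "***"      # special type for all out-of-vocabulary tokens
VOCAB_SIZE = 500     # suggested size; adjust experimentally

def build_vocab(tokens, size=VOCAB_SIZE):
    # Count each type's frequency and keep the `size` most common ones.
    counts = Counter(t.lower() for t in tokens)
    return {w for w, _ in counts.most_common(size)}

def normalize_token(token, vocab):
    # Map any token outside the vocabulary to the special "***" type.
    t = token.lower()
    return t if t in vocab else UNKNOWN
```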
You can find the whole tagset both on page 131 of the book and in the front endpapers. This tagset is too large for our purposes, so you will have to use a smaller one. Moreover, since NLTK uses the Penn Treebank tagset, we'll need a way to map from official/full tagset tags to tags in your reduced tagset.
In my version of this project I reduced it to four main tags (noun-like things, verb-like things, adjective-like things, and adverb-like things) plus extra tags for interjections, (non-grouping) symbols, grouping symbols, and end-of-sentence markers. I have included the code for my version: reduced_tags is a list containing all the tags; penntb_to_reduced is a map from the full tagset to the reduced one.
You should consider whether you want to make some changes to what I have done: either use more (or, I suppose, fewer) tags, or modify how we map full-tagset tags to reduced-tagset tags (for example, should participles be considered verb-like things, as in mine, or should they instead be adjective-like things?). If you change this, explain your reasoning (particularly the linguistic reasons) in the write-up.
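To make the shape of the mapping concrete, here is an illustrative fragment of the kind of structure the given code provides; the entries and reduced tag names below are only examples (the actual reduced_tags and penntb_to_reduced in the given code are authoritative):

```python
# Hypothetical reduced tagset: the given code's actual list may differ.
reduced_tags = ['N', 'V', 'J', 'R', 'UH', 'SYM', 'GRP', 'E']

# A few illustrative entries mapping Penn Treebank tags to reduced tags.
penntb_to_reduced = {
    'NN': 'N', 'NNS': 'N', 'NNP': 'N',   # noun-like things
    'VB': 'V', 'VBD': 'V', 'VBG': 'V',   # verb-like things (incl. participles?)
    'JJ': 'J', 'JJR': 'J',               # adjective-like things
    'RB': 'R',                           # adverb-like things
    '.':  'E',                           # end-of-sentence markers
}

def reduce_tag(penn_tag):
    # The fallback for unmapped tags is itself a design decision.
    return penntb_to_reduced.get(penn_tag, 'N')
```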
That's actually really easy and is already done for you in the given code.
Now it gets tricky. You want to make two "matrices" (I call them matrices because that's how they are presented in typical expositions of the algorithm; in Python, however, you'll probably want to make them maps of maps). One matrix will tell, for each tag (i.e., state in the HMM), the probability of transitioning to each other tag/state. The other matrix will tell, for each tag, the probability of emitting each word in the vocabulary.
To do this using relative frequency estimates, first you need to tally up how many times each tag transitions to each other tag, and how many times each tag emits each vocabulary word in the training data. I suggest (not only here but also in the given code) doing this with maps word_tag_tally and tag_transition_tally. (You could initialize all counts to 1 instead of 0 if you wanted to do Laplace (add-one) smoothing.)
I give a skeleton of a loop over each (word, tag) pair in the tagged corpus. Watch out: sometimes the NLTK tagger will return -NONE-. Experiment to find out when that happens and determine what to do about it. Also, what should happen if the word is not in the vocabulary?
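One possible shape for this loop, as a sketch only (it assumes the tags have already been mapped to the reduced tagset, and its handling of -NONE- and out-of-vocabulary words is just one of the policies you might choose):

```python
def tally(tagged_words, vocab, tags, start_tag, smooth=1):
    vocab = set(vocab) | {'***'}
    # smooth=1 initializes every count to 1, i.e. Laplace (add-one)
    # smoothing; smooth=0 gives raw counts.
    word_tag_tally = {t: {w: smooth for w in vocab} for t in tags}
    tag_transition_tally = {t: {t2: smooth for t2 in tags} for t in tags}
    prev = start_tag
    for word, tag in tagged_words:
        if tag == '-NONE-':
            continue                  # one possible policy: skip these
        if word not in vocab:
            word = '***'              # out-of-vocabulary token
        word_tag_tally[tag][word] += 1
        tag_transition_tally[prev][tag] += 1
        prev = tag
    return word_tag_tally, tag_transition_tally
```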
Then you need to use that information to construct the actual transition and emission probabilities (the "matrices"). I recommend making trans_probs a map from tags to maps from tags to probabilities. For example, trans_probs['N'] would return a map indicating, for each other tag, the probability that it would follow a noun-like thing; trans_probs['N']['V'] would return the probability that the next state/tag would be "verb-like thing" if the current state/tag is "noun-like thing".
Similarly, I recommend making emit_probs a map from tags to maps from words to probabilities. For example, emit_probs['N']['pickle'] would return the probability of the word being "pickle" if the current state is "noun-like thing".
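Converting the tallies into these maps of maps is just row-wise normalization; a minimal sketch (the helper name `to_probs` is mine, not from the given code):

```python
def to_probs(tally):
    """Turn nested tallies {tag: {key: count}} into relative frequencies."""
    probs = {}
    for tag, counts in tally.items():
        total = sum(counts.values())
        # Guard against an all-zero row, which would divide by zero.
        probs[tag] = {k: (c / total if total else 0.0)
                      for k, c in counts.items()}
    return probs

# e.g.  trans_probs = to_probs(tag_transition_tally)
#       emit_probs  = to_probs(word_tag_tally)
```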
This is the other hard part. I won't give you much guidance here because the algorithm is/was covered (1) in class; (2) on page 147 in the textbook; (3) in the paper "A Revealing Introduction to Hidden Markov Models" by Mark Stamp; and (4) in the paper "A tutorial on Hidden Markov Models" by Lawrence Rabiner.
What I will say (to put it into the present context) is that I recommend you encapsulate the algorithm in the function pos_tagging(), whose stub I provide. This function would take the untagged sequence as a parameter, and also the transition and emission probabilities, the vocabulary, the set of tags, and the start state. It would return a tagged sequence (so its input is a list of strings, the tokens in the text; its output is a list of pairs of two strings, the first being the token, the second being the tag).
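To make the interface concrete, here is a bare-bones sketch of that data flow; Figure 5.17 in the book is the authoritative reference, and this version cuts corners (no log-probabilities, so it will underflow on long sequences, and zero probabilities are handled crudely), so treat it as an illustration rather than a finished solution:

```python
def pos_tagging(tokens, trans_probs, emit_probs, vocab, tags, start):
    # Map out-of-vocabulary tokens to the special "***" type for lookups,
    # but keep the original tokens for the output.
    obs = [t if t in vocab else '***' for t in tokens]
    # viterbi[i][q]: probability of the best path ending in state q at i.
    viterbi = [{q: trans_probs[start].get(q, 0.0) * emit_probs[q].get(obs[0], 0.0)
                for q in tags}]
    back = [{q: None for q in tags}]
    for i in range(1, len(obs)):
        col, ptr = {}, {}
        for q in tags:
            # Best predecessor state for q at position i.
            prev = max(tags, key=lambda p: viterbi[i - 1][p] * trans_probs[p].get(q, 0.0))
            col[q] = (viterbi[i - 1][prev] * trans_probs[prev].get(q, 0.0)
                      * emit_probs[q].get(obs[i], 0.0))
            ptr[q] = prev
        viterbi.append(col)
        back.append(ptr)
    # Recover the best path by following back-pointers from the best final state.
    state = max(tags, key=lambda q: viterbi[-1][q])
    path = [state]
    for i in range(len(obs) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))
```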
Load the test sample and tag it. See how well it compares to how you would tag it by hand and how well it compares to how NLTK would tag it.
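For the comparison, per-token accuracy against a reference tagging (your hand tagging, or NLTK's output mapped into the reduced tagset) is a reasonable measure; a small helper, with an illustrative name:

```python
def accuracy(predicted, reference):
    # Both arguments: equal-length lists of (token, tag) pairs.
    matches = sum(1 for (_, p), (_, r) in zip(predicted, reference) if p == r)
    return matches / len(reference)
```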
Turn in your code and a short write-up that describes the decisions you made in selecting a corpus, the set of tags, what to do in corner cases, etc., and an analysis of how well your tagger did. Copy your files to
/cslab/class/cs384/(your login id)/proj4
DUE: midnight between Friday, Nov 17, and Saturday, Nov 18. (Note that this is a week later than originally expected for this project. Also, this might overlap with Project 5.)