The goal of this project is to understand the part-of-speech tagging problem, the hidden Markov model (HMM) approach to the problem, and the Viterbi algorithm for HMMs. You will write a program that will train an HMM POS-tagger on a (tagged) corpus and then tag a testing sample.
The big-picture view of this project is:
The hard parts will be 4 and 5. I don't anticipate part 4 being that hard conceptually, but it will take a fair amount of programming, and there will be some corner cases to worry about. Part 5 will be hard conceptually, but you do have the algorithm given to you in the book (Figure 5.17 on page 147).
I am giving you a file that contains the structure of the project, which you can finish, though you can modify the given code to take it in a slightly different direction from my suggestions. Find the code at ~tvandrun/Public/cs384/proj4.
As you read the project description, you can follow along in the given code.
Don't spend a long time deciding on a corpus. Just find something that you can work with. To make this project more manageable, I recommend you get text with a smallish vocabulary and simple sentences: children's literature would be great. Also make sure you can get testing data that is similar to the training corpus, such as something by the same author, perhaps from a different book in the same series.
You may adjust your corpus as you go along, experimentally. If you find that building the probability matrices is too slow, then make your corpus smaller. If it goes very quickly, then make your corpus larger.
Recall that you can load a corpus from a text file using NLTK's PlaintextCorpusReader.
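For reference, a self-contained sketch of loading a plain-text corpus with PlaintextCorpusReader; the directory and filename here are placeholders created on the fly just for illustration (you will point the reader at your own corpus files):

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Create a throwaway directory with one tiny text file, standing in
# for your real corpus directory.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, 'sample.txt'), 'w') as f:
    f.write("The cat sat on the mat.")

# The second argument is a regex matching the corpus files to load.
reader = PlaintextCorpusReader(corpus_dir, r'.*\.txt')
tokens = list(reader.words('sample.txt'))   # tokenized words from the file
```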
Make a vocabulary from the most frequent words in the corpus. I've provided code for this, with a suggested vocabulary size of 500, which you can adjust. All tokens from types not in the vocabulary will be treated as instances of the special "***" type.
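The given code handles this for you; purely as an illustration of the idea, a vocabulary built from the most frequent types might look like the following sketch (the names `build_vocab` and `normalize_token` are mine, not from the given code, and whether to lowercase is a design decision):

```python
from collections import Counter

UNKNOWN = "***"      # special type for all out-of-vocabulary tokens
VOCAB_SIZE = 500     # suggested size; adjust experimentally

def build_vocab(tokens, size=VOCAB_SIZE):
    # Count each type's frequency and keep the `size` most common ones.
    counts = Counter(t.lower() for t in tokens)
    return {w for w, _ in counts.most_common(size)}

def normalize_token(token, vocab):
    # Map any token outside the vocabulary to the special "***" type.
    t = token.lower()
    return t if t in vocab else UNKNOWN
```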
You can find the whole tagset both on page 131 of the book and in the front endpapers. This tagset is too large for our purposes, so you will have to use a smaller one. Moreover, since NLTK uses the Penn Treebank tagset, we'll need a way to map from official/full tagset tags to tags in your reduced tagset.
In my version of this project I reduced it to four main tags (noun-like things, verb-like things, adjective-like things, and adverb-like things) plus extra tags for interjections, (non-grouping) symbols, grouping symbols, and end-of-sentence markers. I have included the code for my version: reduced_tags is a list containing all the tags; penntb_to_reduced is a map from the full tagset to the reduced one.
You should consider whether you want to make some changes to what I have done: either use more (or, I suppose, fewer) tags, or modify how we map full-tagset tags to reduced-tagset tags (for example, should participles be considered verb-like things, as in mine, or should they instead be adjective-like things?). If you change this, explain your reasoning (particularly the linguistic reasons) in the write-up.
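To make the shape of the mapping concrete, here is an illustrative fragment of the kind of structure the given code provides; the entries and reduced tag names below are only examples (the actual reduced_tags and penntb_to_reduced in the given code are authoritative):

```python
# Hypothetical reduced tagset: the given code's actual list may differ.
reduced_tags = ['N', 'V', 'J', 'R', 'UH', 'SYM', 'GRP', 'E']

# A few illustrative entries mapping Penn Treebank tags to reduced tags.
penntb_to_reduced = {
    'NN': 'N', 'NNS': 'N', 'NNP': 'N',   # noun-like things
    'VB': 'V', 'VBD': 'V', 'VBG': 'V',   # verb-like things (incl. participles?)
    'JJ': 'J', 'JJR': 'J',               # adjective-like things
    'RB': 'R',                           # adverb-like things
    '.':  'E',                           # end-of-sentence markers
}

def reduce_tag(penn_tag):
    # The fallback for unmapped tags is itself a design decision.
    return penntb_to_reduced.get(penn_tag, 'N')
```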
That's actually really easy and is already done for you in the given code.
Now it gets tricky. You want to make two "matrices" (I call them matrices because that's how they are presented in typical expositions of the algorithm; in Python, however, you'll probably want to make them maps of maps). One matrix will tell, for each tag (i.e., state in the HMM), the probability of transitioning to each other tag/state. The other matrix will tell, for each tag, the probability of emitting each word in the vocabulary.
To do this using relative frequency estimates, first you need to tally up how many times each tag transitions to each other tag, and how many times each tag emits each vocabulary word in the training data. I suggest (not only here but also in the given code) doing this with maps word_tag_tally and tag_transition_tally. (You could initialize all counts to 1 instead of 0 if you wanted to do Laplace (add-one) smoothing.)
I give a skeleton of a loop over each (word, tag) pair in the tagged corpus. Watch out: sometimes the NLTK tagger will return -NONE-. Experiment to find out when that happens and determine what to do about it. Also, what should happen if the word is not in the vocabulary?
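One possible shape for this loop, as a sketch only (it assumes the tags have already been mapped to the reduced tagset, and its handling of -NONE- and out-of-vocabulary words is just one of the policies you might choose):

```python
def tally(tagged_words, vocab, tags, start_tag, smooth=1):
    vocab = set(vocab) | {'***'}
    # smooth=1 initializes every count to 1, i.e. Laplace (add-one)
    # smoothing; smooth=0 gives raw counts.
    word_tag_tally = {t: {w: smooth for w in vocab} for t in tags}
    tag_transition_tally = {t: {t2: smooth for t2 in tags} for t in tags}
    prev = start_tag
    for word, tag in tagged_words:
        if tag == '-NONE-':
            continue                  # one possible policy: skip these
        if word not in vocab:
            word = '***'              # out-of-vocabulary token
        word_tag_tally[tag][word] += 1
        tag_transition_tally[prev][tag] += 1
        prev = tag
    return word_tag_tally, tag_transition_tally
```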
Then you need to use that information to construct the actual transition and emission probabilities (the "matrices"). I recommend making trans_probs a map from tags to maps from tags to probabilities. For example, trans_probs['N'] would return a map indicating, for each other tag, the probability that it would follow a noun-like thing; trans_probs['N']['V'] would return the probability that the next state/tag would be "verb-like thing" if the current state/tag is "noun-like thing".
Similarly, I recommend making emit_probs a map from tags to maps from words to probabilities. For example, emit_probs['N']['pickle'] would return the probability of the word being "pickle" if the current state is "noun-like thing".
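Converting the tallies into these maps of maps is just row-wise normalization; a minimal sketch (the helper name `to_probs` is mine, not from the given code):

```python
def to_probs(tally):
    """Turn nested tallies {tag: {key: count}} into relative frequencies."""
    probs = {}
    for tag, counts in tally.items():
        total = sum(counts.values())
        # Guard against an all-zero row, which would divide by zero.
        probs[tag] = {k: (c / total if total else 0.0)
                      for k, c in counts.items()}
    return probs

# e.g.  trans_probs = to_probs(tag_transition_tally)
#       emit_probs  = to_probs(word_tag_tally)
```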
This is the other hard part. I won't give you much guidance here because the algorithm is/was covered (1) in class; (2) on page 147 in the textbook; (3) in the paper "A Revealing Introduction to Hidden Markov Models" by Mark Stamp; and (4) in the paper "A tutorial on Hidden Markov Models" by Lawrence Rabiner.
What I will say (to put it into the present context) is that I recommend you encapsulate the algorithm in the function pos_tagging(), whose stub I provide. This function would take the untagged sequence as a parameter, and also the transition and emission probabilities, the vocabulary, the set of tags, and the start state. It would return a tagged sequence (so its input is a list of strings, the tokens in the text; its output is a list of pairs of two strings, the first being the token, the second being the tag).
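To make the interface concrete, here is a bare-bones sketch of that data flow; Figure 5.17 in the book is the authoritative reference, and this version cuts corners (no log-probabilities, so it will underflow on long sequences, and zero probabilities are handled crudely), so treat it as an illustration rather than a finished solution:

```python
def pos_tagging(tokens, trans_probs, emit_probs, vocab, tags, start):
    # Map out-of-vocabulary tokens to the special "***" type for lookups,
    # but keep the original tokens for the output.
    obs = [t if t in vocab else '***' for t in tokens]
    # viterbi[i][q]: probability of the best path ending in state q at i.
    viterbi = [{q: trans_probs[start].get(q, 0.0) * emit_probs[q].get(obs[0], 0.0)
                for q in tags}]
    back = [{q: None for q in tags}]
    for i in range(1, len(obs)):
        col, ptr = {}, {}
        for q in tags:
            # Best predecessor state for q at position i.
            prev = max(tags, key=lambda p: viterbi[i - 1][p] * trans_probs[p].get(q, 0.0))
            col[q] = (viterbi[i - 1][prev] * trans_probs[prev].get(q, 0.0)
                      * emit_probs[q].get(obs[i], 0.0))
            ptr[q] = prev
        viterbi.append(col)
        back.append(ptr)
    # Recover the best path by following back-pointers from the best final state.
    state = max(tags, key=lambda q: viterbi[-1][q])
    path = [state]
    for i in range(len(obs) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))
```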
Load the test sample and tag it. See how well it compares to how you would tag it by hand and how well it compares to how NLTK would tag it.
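For the comparison, per-token accuracy against a reference tagging (your hand tagging, or NLTK's output mapped into the reduced tagset) is a reasonable measure; a small helper, with an illustrative name:

```python
def accuracy(predicted, reference):
    # Both arguments: equal-length lists of (token, tag) pairs.
    matches = sum(1 for (_, p), (_, r) in zip(predicted, reference) if p == r)
    return matches / len(reference)
```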
Turn in your code and a short write-up that describes the decisions you made in selecting a corpus, the set of tags, what to do in corner cases, etc., and an analysis of how well your tagger did. Copy your files to
/cslab/class/cs384/(your login id)/proj4
DUE: midnight between Friday, Nov 17, and Saturday, Nov 18. (Note that this is a week later than originally expected for this project. Also, this might overlap with Project 5.)