Lab activity: Lexical semantics

This activity will guide you through some experiments using WordNet and NLTK. You may diverge from this to explore relevant tangents.

You may want to look at the NLTK book, especially the section about WordNet. Note, however, that some things have changed since that book was written. Specifically, several class members that once were data attributes (instance variables) are now methods. For example, wn.synset('car.n.01').lemma_names should now be written wn.synset('car.n.01').lemma_names().

You can get starter code from ~tvandrun/Public/cs384/lexsem/. This includes some corpora you may use.

1. Inferring the meaning of a word from context

Linguist J. R. Firth said, "You shall know a word by the company it keeps." In other words, the context in which a word is used can help one infer its meaning. In this experiment you will find some "unknown" words in a corpus (unknown in the sense that they don't appear in a dictionary, or, for our purposes, have no synonym set in WordNet) and then find the words used near each one to reveal something about its sense.

Finish the program lexsplore.py. (The name is a portmanteau of lexical semantics and explore with a hint of let's explore.) I have included code for opening a corpus (give it as a command-line argument). You fill in the code for making a list of the words that occur within some range of each interesting word (I suggest you filter out stop words and other uninformative tokens). The last line of the loop (nearbys[w] = Counter(nearby)) turns that list into a bag of words. The end of the program prints out the results.
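
The windowing step could look something like the sketch below. It is not the starter code: the function name nearby_words, the parameters window and stop_words, and the tiny token list are assumptions made for illustration. Only the final Counter conversion mirrors the line quoted above.

```python
from collections import Counter

def nearby_words(tokens, interesting, window=3, stop_words=frozenset()):
    """Map each interesting word to a Counter of the words found within
    `window` positions of any of its occurrences (hypothetical helper)."""
    collected = {w: [] for w in interesting}
    for i, tok in enumerate(tokens):
        if tok not in collected:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            neighbor = tokens[j]
            # Skip the occurrence itself, stop words, and non-alphabetic tokens.
            if j != i and neighbor.isalpha() and neighbor not in stop_words:
                collected[tok].append(neighbor)
    # Same idea as nearbys[w] = Counter(nearby): a list becomes a bag of words.
    return {w: Counter(words) for w, words in collected.items()}

# Made-up data; "snark" stands in for a word with no synsets in WordNet.
tokens = "the hunter chased the snark through the dark forest".split()
nearbys = nearby_words(tokens, {"snark"}, window=3,
                       stop_words={"the", "through"})
print(nearbys["snark"].most_common())
```

A larger window gathers more evidence per occurrence but also more noise, so it is worth trying a few sizes on the same corpus.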

How good is this at indicating what these words mean? You can compare this with what you can infer on your own by using the "concordance" function of NLTK to find the contexts for these words.
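
NLTK's Text class provides a concordance method that prints each occurrence of a word with its surrounding text. To make clear what such a listing contains, here is a hand-rolled stand-in written in plain Python. It is only a sketch: keyword_in_context and its width parameter are hypothetical names, and the output format is not NLTK's.

```python
def keyword_in_context(tokens, word, width=4):
    """Return one string per occurrence of `word`, showing up to `width`
    tokens of context on each side (hand-rolled, not NLTK's code)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Made-up data for illustration.
tokens = "the hunter chased the snark through the dark forest".split()
for line in keyword_in_context(tokens, "snark", width=2):
    print(line)
```

Unlike the bag of words, a listing like this keeps word order, which is often what lets a human reader guess the sense.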

2. Replacing words with synonyms

In this second part, you will transform a text by replacing some of its words with synonyms.

Finish the program synamon.py. (The name is a portmanteau of synonym and cinnamon... don't ask why.) The program opens a file (named on the command line) and processes it line by line. For each token in each line, we possibly replace it with a synonym. mod_word stands for "modified word", the word that will replace word. I suggest making a replacement when (a) the word is not a stop word, (b) the word has exactly one sense in synsets, and (c) that sense has more than one appropriate lemma. In other words, the original word isn't ambiguous and it has at least one synonym. (Then again, you may get some interesting results if you replace words that are ambiguous and have more than one synonym...)
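
One way to express the three suggested conditions is sketched below. Everything here is an assumption rather than the starter code: pick_synonym is a hypothetical helper, the lookup is passed in as a function so the logic can be tried without WordNet, and "appropriate lemma" is read as a lemma that differs from the word itself and is not a multiword entry (WordNet joins those with underscores). The sense data is made up and is not real WordNet output.

```python
def pick_synonym(word, lemma_names_by_sense, stop_words=frozenset()):
    """Return a synonym for `word`, or None if it should be left alone.

    `lemma_names_by_sense` is a hypothetical callable: given a word, it
    returns one list of lemma names per sense. With NLTK it could be
    lambda w: [s.lemma_names() for s in wn.synsets(w)]   (an assumption).
    """
    if word.lower() in stop_words:
        return None                      # (a) stop words stay as they are
    senses = lemma_names_by_sense(word)
    if len(senses) != 1:
        return None                      # (b) unknown or ambiguous word
    candidates = [name for name in senses[0]
                  if name.lower() != word.lower() and "_" not in name]
    # (c) the single sense must offer a lemma other than the word itself
    return candidates[0] if candidates else None

# Made-up sense inventory standing in for WordNet.
FAKE_SENSES = {
    "couch": [["couch", "sofa"]],
    "bank": [["bank", "riverbank"], ["bank", "depository"]],
    "zebra": [["zebra"]],
}

def lookup(w):
    return FAKE_SENSES.get(w.lower(), [])

for w in ["couch", "bank", "zebra"]:
    print(w, "->", pick_synonym(w, lookup))
```

Relaxing condition (b), for instance by taking the first sense of an ambiguous word, is a one-line change and is where the more surprising replacements tend to come from.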

So that your changes stand out, I suggest that, for example, instead of replacing couch with sofa, you replace it with sofa[couch]. That is, make it stand out as a word on which you did a replacement.
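
The marking convention might be applied as in the sketch below. The helpers mark and transform_line are hypothetical, the line is split on whitespace only for simplicity (the starter code may tokenize differently), and a plain dictionary stands in for whatever synonym lookup you write.

```python
def mark(original, synonym):
    """Format a replacement so it stands out, e.g. sofa[couch]."""
    return f"{synonym}[{original}]"

def transform_line(line, synonym_for):
    """Rebuild a line, replacing each word for which `synonym_for`
    returns a synonym and leaving every other word untouched."""
    out = []
    for word in line.split():
        synonym = synonym_for(word)
        mod_word = mark(word, synonym) if synonym else word
        out.append(mod_word)
    return " ".join(out)

# Made-up lookup: dict.get returns None for words without an entry.
synonyms = {"couch": "sofa"}
print(transform_line("she sat on the couch", synonyms.get))
```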

Find your own text to try this on.

At the end, turn in a record of what you did (and learned) to your turn-in folder, /cslab/class/cs384/(your id)/ica-lexsem

Thomas VanDrunen

Last modified: Mon Nov 9 13:20:24 CST 2015