This activity will guide you through some experiments using WordNet and NLTK. You may diverge from this to explore relevant tangents.
You may want to look at the NLTK book, especially the section about WordNet.
Note, however, that some things have changed since that book was written. Specifically, several class members that once were data attributes (instance variables) are now methods. For example, wn.synset('car.n.01').lemma_names should now be written wn.synset('car.n.01').lemma_names() with the call parentheses.
You can get starter code from ~tvandrun/Public/cs384/lexsem/, which also includes some corpora you may use.
Linguist J. R. Firth said, "You shall know a word by the company it keeps." In other words, the context in which that word is used can help one infer its meaning. In this experiment you will find some "unknown" words in a corpus (unknown in the sense that they don't appear in a dictionary, or, for our purposes, have no synonym set in WordNet) and find other words used nearby to reveal something about that word's sense.
Finish the program lexsplore.py. (The name is a portmanteau of lexical semantics and explore, with a hint of let's explore.)
I have included code for opening a corpus (give it as a command-line argument). You fill in the code for making a list of words that occur within some range of each interesting word (I suggest you filter out stop words and other uninformative tokens). The last line of the loop, nearbys[w] = Counter(nearby), turns that list into a bag of words. The end of the program prints out the results.
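The windowing step described above can be sketched roughly as follows. This is only an illustration, not the starter code: the function name nearby_words, its parameters, the default window of 3, and the toy sentence with the made-up word "grobnik" are all assumptions. It also assumes the tokens are already lowercased and that filtering "others" means dropping non-alphabetic tokens.

```python
from collections import Counter

def nearby_words(tokens, targets, window=3, stop_words=frozenset()):
    """Map each target word to a bag (Counter) of the words found within
    `window` positions of any of its occurrences.

    All names here are hypothetical; adapt them to the starter code.
    """
    nearbys = {}
    for w in targets:
        nearby = []
        for i, tok in enumerate(tokens):
            if tok != w:
                continue
            # Clamp the window to the bounds of the token list.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                neighbor = tokens[j]
                # Skip the occurrence itself, punctuation-like tokens,
                # and stop words.
                if j != i and neighbor.isalpha() and neighbor not in stop_words:
                    nearby.append(neighbor)
        # Same final step as in the handout: list -> bag of words.
        nearbys[w] = Counter(nearby)
    return nearbys

# Toy data: "grobnik" stands in for a word with no WordNet synset.
tokens = "the old grobnik sat on the mat while the grobnik ate the fish".split()
result = nearby_words(tokens, {"grobnik"}, window=2,
                      stop_words={"the", "on", "while"})
print(result["grobnik"].most_common())
```

A wider window collects more context but also more noise, so it may be worth trying a few sizes and comparing the resulting bags.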
How good is this at indicating what these words mean? You can compare this with what you can infer on your own by using the "concordance" function of NLTK to find the contexts for these words.
In this second part, you will transform a text by replacing some of its words with synonyms.
Finish the program synamon.py. (The name is a portmanteau of synonym and cinnamon... don't ask why.)
The program opens a file (named on the command line) and processes it line by line. For each token in each line, we possibly replace it with a synonym. The variable mod_word stands for "modified word": it holds whatever will be output in place of word.
I suggest making a replacement when (a) the word is not a stopword, (b) the word has exactly one sense from synsets, and (c) that sense has more than one appropriate lemma. In other words, the original word isn't ambiguous and it has at least one synonym. (Then again, you may get some interesting results if you replace words that are ambiguous and have more than one synonym...)
So that your changes stand out, I suggest that, for example, instead of replacing couch with sofa, you replace it with sofa[couch]. That is, make it stand out as a word on which you did a replacement.
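Here is one way the replacement rule and the sofa[couch] marking might look. It is a sketch under stated assumptions, not the intended solution: choose_replacement, TOY_SENSES, and STOP are hypothetical names, the toy sense lists are invented rather than taken from WordNet, and treating an "appropriate" lemma as one that differs from the original word and contains no underscore (WordNet's separator in multiword lemmas) is just one possible reading.

```python
def choose_replacement(word, senses, stop_words=frozenset()):
    """Return a marked replacement such as 'sofa[couch]', or `word` unchanged.

    `senses` has one entry per sense of `word`; each entry is the list of
    lemma names for that sense. (Hypothetical helper, for illustration.)
    """
    # (a) Leave stop words alone.
    if word.lower() in stop_words:
        return word
    # (b) Only replace words with exactly one sense.
    if len(senses) != 1:
        return word
    # (c) The sense needs a lemma other than the word itself.
    # Assumption: multiword lemmas (with underscores) are not "appropriate".
    synonyms = [name for name in senses[0]
                if name.lower() != word.lower() and "_" not in name]
    if not synonyms:
        return word
    # Mark the change so it stands out: synonym[original].
    return f"{synonyms[0]}[{word}]"

# Invented toy data standing in for WordNet lookups.
TOY_SENSES = {
    "couch": [["sofa", "couch", "lounge"]],      # one sense, has synonyms
    "bank": [["bank"], ["bank", "depository"]],  # two senses: ambiguous
}
STOP = {"the", "on", "a"}

line = "the cat slept on the couch near a bank"
modified = " ".join(choose_replacement(t, TOY_SENSES.get(t, []), STOP)
                    for t in line.split())
print(modified)
```

With NLTK, the senses argument could presumably be built as [s.lemma_names() for s in wn.synsets(word)] in place of the toy dictionary. Picking the first remaining lemma is arbitrary; choosing one at random is another option.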
Find your own text to try this on.
At the end, turn in a record of what you did (and learned) to the turn-in folder /cslab/class/cs384/(your id)/ica-lexsem