The goal of this in-class activity is to experiment with the NLTK library.
Below you will find a series of suggested exercises. You are free to pick and choose and modify these, and, to a reasonable extent, explore other questions. Keep a record of what you attempt and what you discover (such as a Python file annotated with comments) and turn it in; it counts toward your participation grade.
/homes/tvandrun/Public/cs384/code/tryout.py
-- what we did in class Monday
/homes/tvandrun/Public/cs384/code/simple_util.py
-- a few helpful functions
/homes/tvandrun/Public/cs384/texts/children-lit
-- (folder containing) the texts we used in class Monday
/homes/tvandrun/Public/cs384/texts/proverbs
-- (folder containing) the book of Proverbs
from nltk.book import *
1. In class we considered calculating a measure of a text's lexical diversity: the average number of times each type is used, or the ratio of tokens to types. (A lower score indicates a richer vocabulary.) Can you detect any patterns in lexical diversity? Is it consistent for an author? Do different genres tend to have noticeably different lexical diversities? There is a problem with using this measure by itself (someone may have noticed this in class yesterday): the size of the text skews the result, since shorter texts tend to have lower scores than longer ones. Can you think of a variation on this measure that would improve its accuracy?
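As a starting point only, here is a minimal sketch of the tokens-to-types ratio together with one possible length-independent variation: averaging the ratio over consecutive chunks of equal size. The function names and the default chunk size of 1000 are assumptions made for illustration, not something from the class demo, and other variations may work better.

```python
def lexical_diversity(tokens):
    """Average number of times each type is used: tokens / types."""
    return len(tokens) / len(set(tokens))


def mean_chunk_diversity(tokens, chunk_size=1000):
    """Average the tokens-to-types ratio over consecutive fixed-size chunks.

    Leftover tokens that do not fill a whole chunk are ignored, so every
    chunk has the same length regardless of the size of the text.
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(lexical_diversity(chunk) for chunk in chunks) / len(chunks)
```

Both functions expect a plain list of tokens, for example list(text1) after the import above.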
2. NLTK has some features for plotting graphical representations of data about texts. These include dispersion plots and plots of frequency distributions. A dispersion plot shows the density of use of a type or set of types over the length of the text. To see the density of the types Alice, Queen, and curious in the Wonderland books, we can use
text.dispersion_plot(["Alice", "Queen", "curious"])
To see the relative frequencies of the top fifty most common types, use
freq_dist.plot(50, cumulative=False)
By setting cumulative to True, you can see a graph of the cumulative frequencies (i.e., the 10th most common type is plotted against the combined number of occurrences of the top ten types).
Try these on a few texts. What can we learn? For example, can the dispersion plot be used to identify sections of a text?
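To make the cumulative plot concrete, the following sketch computes the numbers such a plot would show: each of the most common types paired with the running total of occurrences up to and including it. It uses collections.Counter as a stand-in so that it runs without any corpus; NLTK's FreqDist is a subclass of Counter, so most_common is available there as well. The function name cumulative_counts is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(tokens, n=50):
    """Return (type, running total) pairs for the n most common types."""
    top = Counter(tokens).most_common(n)
    totals = accumulate(count for _, count in top)
    return [(word, total) for (word, _), total in zip(top, totals)]


print(cumulative_counts("a a a b b c".split(), n=3))
# [('a', 3), ('b', 5), ('c', 6)]
```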
3. Compare common collocations (bigrams) of different authors or texts. Are there any pairs that reveal something interesting about the content?
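One rough way to start is to count adjacent word pairs directly. The sketch below is an assumption-laden simplification: it ranks pairs by raw frequency and drops punctuation and very short words as a crude stand-in for a stopword filter. NLTK's own text.collocations() applies its own filtering and scoring, so its results will differ. The function name and the min_length parameter are hypothetical.

```python
from collections import Counter


def top_bigrams(tokens, n=10, min_length=3):
    """Return the n most frequent adjacent word pairs, lowercased.

    Pairs are skipped if either token is non-alphabetic or shorter than
    min_length characters.
    """
    pairs = zip(tokens, tokens[1:])
    kept = [(a.lower(), b.lower()) for a, b in pairs
            if a.isalpha() and b.isalpha()
            and len(a) >= min_length and len(b) >= min_length]
    return Counter(kept).most_common(n)
```

Pass a plain list of tokens, such as list(text1), and compare the output for two authors side by side.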
4. Using the POS tagger, write a function that calculates the frequency of personal pronouns. One would expect personal pronouns to be more common in narrative fiction. What do you find?
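A possible shape for such a function is sketched below. It works on a list of (word, tag) pairs, which is the form nltk.pos_tag returns, so that the counting logic can be tried without running the tagger. The sample is hand-tagged for illustration and is not actual tagger output; the names tag_density and PRONOUN_TAGS are hypothetical.

```python
def tag_density(tagged_tokens, tags):
    """Fraction of tokens whose POS tag is in the given set of tags."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, tag in tagged_tokens if tag in tags)
    return hits / len(tagged_tokens)


# Penn Treebank tag for personal pronouns; add "PRP$" to include possessives.
PRONOUN_TAGS = {"PRP"}

# Hand-tagged example standing in for the output of nltk.pos_tag(tokens).
sample = [("She", "PRP"), ("saw", "VBD"), ("him", "PRP"), (".", ".")]
print(tag_density(sample, PRONOUN_TAGS))  # 0.5
```

Because the set of tags is a parameter, the same function can be reused with other tag sets, for example {"RB", "RBR", "RBS"} for adverbs.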
5. Similarly to the previous, write a function that calculates the density of adverbs. Is adverb use consistent for a given author? Does it vary greatly between authors or between genres?
6. In the class demo, we experimented with the guess that frequency of "and" or the ratio of "and" to "but" might be characteristic of an author. Can you think of other candidate tests for an author's distinctive style? See if anything appears to be consistent within an author's works, but different between authors. (One idea based on problem 5: does an author favor a certain mix of parts of speech?)
7. The sci-fi/adventure series Tom Swift was sometimes mocked for relying heavily on the pattern of "proper noun said adverb". This spawned a kind of pun known as a Tom Swifty, which follows the same pattern but where the adverb puns on what the character says. Some examples:
"Oops, I pierced your cheek by mistake," he said mysteriously.
"I'm making pancakes," he said flippantly.
"I use the Bourne Again Shell," he said bashfully.
Find how commonly this pattern (not necessarily as puns) occurs in various texts.
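One way to look for the pattern is to scan POS-tagged tokens three at a time. The sketch below assumes input in the (word, tag) form that nltk.pos_tag returns. Since the examples above use "he" rather than a name, it also accepts personal pronouns as the speaker; that is an assumption you may want to change. The function and constant names are hypothetical.

```python
SPEAKER_TAGS = {"NNP", "PRP"}        # proper noun or personal pronoun
ADVERB_TAGS = {"RB", "RBR", "RBS"}   # adverb, comparative, superlative


def said_adverb_matches(tagged_tokens):
    """Return (speaker, adverb) pairs for each 'speaker said adverb' trigram."""
    matches = []
    triples = zip(tagged_tokens, tagged_tokens[1:], tagged_tokens[2:])
    for (w1, t1), (w2, _), (w3, t3) in triples:
        if t1 in SPEAKER_TAGS and w2.lower() == "said" and t3 in ADVERB_TAGS:
            matches.append((w1, w3))
    return matches
```

Dividing the number of matches by the number of occurrences of "said" (or by the number of sentences) gives a rate that can be compared across texts of different lengths.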
8. This problem comes from the NLTK book, problem 7 of Section 2.8:
According to Strunk and White's Elements of Style, the word however, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. (http://www.bartleby.com/141/strunk3.html) Use the concordance tool to study actual usage of this word in the various texts we have been considering. See also the LanguageLog posting "Fossilized prejudices about 'however'".
I would suggest using not only the concordance tool but also the tagger, to see whether it identifies however as an adverb RB (I, however, like ice cream) or as a (subordinating) conjunction IN (I will help however I can). Actually, in my quick experiments, I can't find an example where NLTK's POS tagger identifies any use of however as a conjunction.
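To tabulate what the tagger does with however, something like the following sketch could be used. It assumes a list of tagged sentences, each a list of (word, tag) pairs such as nltk.pos_tag produces for one sentence, and counts each occurrence by its tag and by whether it is the first word of its sentence. The function name and the "initial"/"medial" labels are hypothetical.

```python
from collections import Counter


def however_usage(tagged_sentences):
    """Count (position, tag) pairs for every occurrence of 'however'."""
    counts = Counter()
    for sentence in tagged_sentences:
        # Index of the first token containing a letter or digit, so that an
        # opening quotation mark does not hide a sentence-initial "however".
        first_word = next(
            (i for i, (word, _) in enumerate(sentence)
             if any(ch.isalnum() for ch in word)),
            None,
        )
        for i, (word, tag) in enumerate(sentence):
            if word.lower() == "however":
                position = "initial" if i == first_word else "medial"
                counts[(position, tag)] += 1
    return counts
```

A result with no IN entries at all would match the observation above that the tagger never seems to treat however as a conjunction.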
Spend about an hour to an hour and a half on this outside of class. At the end, turn in a record of what you did (and learned) to the turn-in folder /cslab/class/cs384/(your id)/ica1