Project 1: NLTK and regular expressions

The goals of this project are

This project is open-ended, but the general structure of your task is

Cleaning up the text and computing linguistic information about the text will be accomplished using regular expressions and NLTK.

1. Finding some text

Logically/ideally one would first formulate a question to pursue and then search for a corpus appropriate for investigating that question. Realistically you're more likely to search for interesting texts first and then consider what questions could be asked about those texts; the process might be a bit dynamic, and you should think through all the parts of the project before starting. At any rate, it's easier to talk about this step first.

Possible sources include

Depending on the kind of source, you can acquire this text either by saving it to a file on disk through your web browser ("save page as"), by copying-and-pasting into a text file, or by using the urlopen approach shown in the NLTK book (see section 3.1).
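All three acquisition routes end with raw text in a Python string. A minimal sketch (the function name `fetch_text` is my own, not from NLTK) that handles both a saved file and a URL, along the lines of NLTK book section 3.1:

```python
from urllib.request import urlopen

def fetch_text(source):
    """Return the raw text of a corpus as a string.

    A source starting with 'http' is fetched over the web;
    anything else is treated as a path to a file on disk.
    """
    if source.startswith("http"):
        return urlopen(source).read().decode("utf-8")
    with open(source, encoding="utf-8") as f:
        return f.read()
```

For a Project Gutenberg text you would pass the plain-text URL of the book; for a saved page, the filename you chose in "save page as".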

As for the size of your corpus, it depends on what is practical. The Lewis Carroll text I used in class a couple of weeks ago (which contained both Alice's Adventures in Wonderland and Through the Looking Glass) was about 220 KB, which is on the small side for a literature corpus; if you're grabbing news articles or blog posts, you'll likely end up with something even smaller.

If you are proficient in a foreign or ancient language, you may choose to work on texts in that language instead of English, which may greatly change the kinds of questions you would ask.

2. Cleaning up the text

Now you need to clean out portions of the information that you do not want to include in your analysis. This would include

Write this in Python, making use of regular expressions and/or NLTK.

Obviously the difficulty of this task will depend greatly on the data you have. Stripping out HTML tags by hand would be a fairly hard task, but you may use the NLTK clean_html function to do it for you (see "Dealing with HTML" in Section 3.1, referenced above).
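If clean_html is unavailable in your NLTK version (it was removed in later releases), a rough regex-based substitute is easy to write. This is a sketch of my own, not the NLTK implementation, and a real HTML parser handles messy pages more robustly:

```python
import re

def strip_html(html):
    """Crudely reduce an HTML page to its visible text.

    Removes script and style blocks entirely, replaces any
    remaining tags with spaces, then collapses runs of whitespace.
    """
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```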

The point of this step is to practice using regular expressions to make text more usable for the questions you will investigate in part 3. See Section 3.4 in the NLTK book and the Python regular expressions documentation for help. I recommend that you write code that will read in the files you have, transform the text, and write the result back to disk.
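The read-transform-write loop described above might be organized as follows; the name `clean_file` and the idea of passing the cleanup steps as a list of functions are my own suggestions, not requirements:

```python
def clean_file(in_path, out_path, transforms):
    """Read a raw text file, apply each cleanup function in order,
    and write the cleaned result back to disk."""
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    for transform in transforms:
        text = transform(text)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
```

Each transform is just a function from string to string (for example, a regex substitution wrapped in a lambda), so you can add and reorder cleanup steps without touching the file-handling code.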

3. Asking questions about the text

The heart of this project is to compute information about the text. You'll need to come up with a question---or series of related questions---to investigate. Here are some general suggestions to get you started:

Here are some specific questions that students investigated two years ago when I gave a similar assignment in the seminar version of this course (some were more ambitious than others, and some more successful than others):

If you want help narrowing down your question, determining whether your question is at an appropriate level of difficulty, or figuring out the best way to compute information about it, feel free to come talk to me sometime.

Write a program in Python to analyze the corpus, generating data to answer the question. Again, regular expressions and the things we've seen in NLTK are recommended. (You may want to peruse the NLTK book for tools that haven't been covered in class yet.)
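As a starting point, many questions boil down to counting something over the corpus. Here is a bare-bones frequency count using only a regular expression tokenizer and the standard library; NLTK's word_tokenize and FreqDist do the same job more carefully, and the tokenizing pattern here is a deliberate simplification:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Tokenize crudely with a regular expression and count
    how often each (lowercased) word form appears."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return Counter(words)
```

From a Counter you can pull the most common words with .most_common(n), or compare counts of specific forms across different texts.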

4. Writing up what you did

Write up what you did and learned, including

This can be a no-frills right-to-the-point report, but it should help me understand what you did, why you did it, and with what success. I'm imagining the reports being about a page and a half, but they can be shorter or longer if appropriate.

Turn in your report (PDF is preferred), your corpus, and your code to a turn-in folder:

/cslab.all/ubuntu/cs384/(your login id)/proj1

Also, make sure that, between the code documentation and your report, your code is sufficiently explained.

DUE: Midnight between Monday, Sept 28, and Tuesday, Sept 29.


Thomas VanDrunen
Last modified: Thu Sep 3 13:54:46 CDT 2015