Project 1: NLTK and regular expressions

The goals of this project are

This project is open-ended, but the general structure of your task is

Cleaning up the text and computing linguistic information about the text will be accomplished using regular expressions and NLTK.

1. Finding some text

Logically/ideally one would first formulate a question to pursue and then search for a corpus appropriate for investigating that question. Realistically you're more likely to search for interesting texts first and then consider what questions could be asked about those texts; the process might be a bit dynamic, and you should think through all the parts of the project before starting. At any rate, it's easier to talk about this step first.

Possible sources include

Depending on the kind of source, you can acquire this text either by saving it to a file on disk through your web browser ("save page as"), by copying-and-pasting into a text file, or by using the urlopen approach shown in the NLTK book (see section 3.1).
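All three acquisition routes end with raw text in a Python string. A minimal sketch (the function name `fetch_text` is my own, not from NLTK) that handles both a saved file and a URL, along the lines of NLTK book section 3.1:

```python
from urllib.request import urlopen

def fetch_text(source):
    """Return the raw text of a corpus as a string.

    A source starting with 'http' is fetched over the web;
    anything else is treated as a path to a file on disk.
    """
    if source.startswith("http"):
        return urlopen(source).read().decode("utf-8")
    with open(source, encoding="utf-8") as f:
        return f.read()
```

For a Project Gutenberg text you would pass the plain-text URL of the book; for a saved page, the filename you chose in "save page as".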

As for the size of your corpus, it depends on what is practical. The Lewis Carroll text I used in class a couple of weeks ago (which contained both Alice's Adventures in Wonderland and Through the Looking Glass) was about 220 KB, which is on the small side for a literature corpus; if you're grabbing news articles or blog posts, you'll likely end up with something even smaller.

If you are proficient in a foreign or ancient language, you may choose to work on texts in that language instead of English, which may greatly change the kinds of questions you would ask.

2. Cleaning up the text

Now you need to clean out portions of the information that you do not want to include in your analysis. This would include

Write this in Python, making use of regular expressions and/or NLTK.

Obviously the difficulty of this task will depend greatly on the data you have. Stripping out HTML tags by hand would be a fairly hard task, but you may use the NLTK clean_html function to do it for you (see "Dealing with HTML" in Section 3.1, referenced above).
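If clean_html is unavailable in your NLTK version (it was removed in later releases), a rough regex-based substitute is easy to write. This is a sketch of my own, not the NLTK implementation, and a real HTML parser handles messy pages more robustly:

```python
import re

def strip_html(html):
    """Crudely reduce an HTML page to its visible text.

    Removes script and style blocks entirely, replaces any
    remaining tags with spaces, then collapses runs of whitespace.
    """
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```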

The point of this step is to practice using regular expressions to make text more usable for the questions you will investigate in part 3. See Section 3.4 in the NLTK book and the Python regular expressions documentation for help. I recommend that you write code that will read in the files you have, transform the text, and write the result back to disk.
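The read-transform-write loop described above might be organized as follows; the name `clean_file` and the idea of passing the cleanup steps as a list of functions are my own suggestions, not requirements:

```python
def clean_file(in_path, out_path, transforms):
    """Read a raw text file, apply each cleanup function in order,
    and write the cleaned result back to disk."""
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    for transform in transforms:
        text = transform(text)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
```

Each transform is just a function from string to string (for example, a regex substitution wrapped in a lambda), so you can add and reorder cleanup steps without touching the file-handling code.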

3. Asking questions about the text

The heart of this project is to compute information about the text. You'll need to come up with a question---or series of related questions---to investigate. Here are some general suggestions to get you started:

Here are some specific questions that students investigated two years ago when I gave a similar assignment in the seminar version of this course (some were more ambitious than others, and some more successful than others):

If you want help narrowing down your question, determining whether your question is at an appropriate level of difficulty, or figuring out the best way to compute information about it, feel free to come talk to me sometime.

Write a program in Python to analyze the corpus, generating data to answer the question. Again, regular expressions and the things we've seen in NLTK are recommended. (You may want to peruse the NLTK book for tools that haven't been covered in class yet.)
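As a starting point, many questions boil down to counting something over the corpus. Here is a bare-bones frequency count using only a regular expression tokenizer and the standard library; NLTK's word_tokenize and FreqDist do the same job more carefully, and the tokenizing pattern here is a deliberate simplification:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Tokenize crudely with a regular expression and count
    how often each (lowercased) word form appears."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return Counter(words)
```

From a Counter you can pull the most common words with .most_common(n), or compare counts of specific forms across different texts.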

4. Writing up what you did

Write up what you did and learned, including

This can be a no-frills right-to-the-point report, but it should help me understand what you did, why you did it, and with what success. I'm imagining the reports being about a page and a half, but they can be shorter or longer if appropriate.

Turn in your report (PDF is preferred), your corpus, and your code to a turn-in folder:

/cslab.all/ubuntu/cs384/(your login id)/proj1

Also, make sure that, between the code documentation and your report, your code is sufficiently explained.

DUE: Midnight between Monday, Sept 28, and Tuesday, Sept 29.


Thomas VanDrunen
Last modified: Thu Sep 3 13:54:46 CDT 2015