Project 1: Regular Expressions

The goals of this project are

This project is somewhat open-ended, but the general structure of your task is

The second and third tasks will be accomplished using regular expressions (and other tools such as those in the NLTK). I strongly recommend using Python and the NLTK for this project (both because of the tools available and because this is a good opportunity to learn Python), but if you happen to know how to use equivalent tools in other programming languages and/or toolkits, I won't forbid you to use them in the project.

1. Going out and finding some text

The first step is to make a mini-corpus of text to experiment on. Accumulate a body of text (presumably found on the Internet) and store it in files to process. Possible sources include

But don't start on this until you have thought about parts 2 and 3. Your choice of text to work with will depend on (a) how big a challenge you want part 2 to be, and (b) what kind of questions you will ask in part 3, since those questions may be specific to the domain or genre of the text.

Depending on the kind of source, you can acquire this text by saving it to a file on disk through your web browser ("save page as"), by copying and pasting it into a text file, or by using the urlopen function shown in Section 3.1 of the NLTK book.
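For the urlopen route, a minimal sketch might look like the following. It assumes Python 3, where urlopen lives in urllib.request (in Python 2 it was imported from urllib directly). The function name fetch_to_file and the UTF-8 fallback are illustrative choices, not part of the assignment.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_to_file(url, path):
    """Download url, save its decoded text to path, and return the text."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset the server declares; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    text = raw.decode(charset, errors="replace")
    Path(path).write_text(text, encoding="utf-8")
    return text
```

A call such as fetch_to_file("http://example.com/", "page.html") (a placeholder URL) would leave the raw page on disk, HTML markup included, ready for the clean-up in part 2.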

As for the size of the text, it depends on what is practical. The Lewis Carroll text I used in class a couple of weeks ago (which contained both Alice's Adventures in Wonderland and Through the Looking Glass) was about 220 KB, which is very small for a literature corpus. However, if you're grabbing news articles or blog posts, you'll likely end up with something much smaller.

2. Cleaning up the text

Now you need to clean out portions of the information that you do not want to include in your analysis. This would include

Obviously the difficulty of this task will depend greatly on the data you have. Stripping out HTML tags would be a fairly hard task, but you may use the NLTK clean_html function to do that (see "Dealing with HTML" in Section 3.1, referenced above).
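If clean_html is not available to you (recent NLTK releases no longer implement it and point to BeautifulSoup instead), one possible fallback is Python's standard html.parser module. The sketch below assumes Python 3; the names TextExtractor and strip_html are hypothetical, and it simply keeps text content while skipping script and style bodies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Note that collapsing all whitespace also discards paragraph breaks; if your questions in part 3 depend on paragraph structure, you would want a gentler normalization.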

The point of this step is to practice using regular expressions to make text more usable for the questions you will investigate in part 3. See Section 3.4 in the NLTK book and the Python regular expressions documentation for help (there are other places online that can also give you direction in this; simply Google for "python regular expressions"). I recommend that you write code that will read in the files you have, transform the text, and write the result back to disk (and then load the cleaned text in part 3). If you are confident in what you're doing, though, you could always incorporate this task into your code for part 3 and clean up the text on the fly.
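The read, transform, write-back workflow could be sketched roughly as follows. The patterns are illustrative assumptions only (bracketed markers such as "[Illustration]", runs of spaces, stacked blank lines); what actually counts as noise depends on your corpus. The function names clean_text and clean_file are hypothetical.

```python
import re
from pathlib import Path

# Illustrative patterns; adjust them to whatever your own text contains.
BRACKETED = re.compile(r"\[[^\]]*\]")   # e.g. "[Illustration]" or "[1]"
SPACES = re.compile(r"[ \t]+")          # runs of spaces or tabs
BLANK_LINES = re.compile(r"\n{3,}")     # three or more newlines in a row


def clean_text(text):
    """Apply each substitution in turn and return the cleaned string."""
    text = BRACKETED.sub("", text)
    text = SPACES.sub(" ", text)
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    text = BLANK_LINES.sub("\n\n", text)
    return text.strip()


def clean_file(src, dst):
    """Read src, clean it, and write the result to dst."""
    raw = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(clean_text(raw), encoding="utf-8")
```

Keeping the cleaning in its own function makes it easy to call either from a standalone script that writes files or directly from the part 3 code.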

3. Computing information about the text

Finally, once you have a corpus in a cleaned-up form, decide on some questions to ask about this text that can be answered using regular expressions (and other tools from NLTK). Here are some suggestions to get you started:

Use your imagination and curiosity.
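As one hypothetical example of such a question (not one prescribed by the assignment): which words in the text end in "-ing", and how often does each occur? A minimal sketch using only the standard re and collections modules is below; the tokenizing pattern WORD and the function names are assumptions, and NLTK's own tokenizers and frequency tools could be used instead.

```python
import re
from collections import Counter

# A simple word pattern: letters, optionally followed by an apostrophe part.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def word_counts(text):
    """Count every word in text, ignoring case."""
    return Counter(w.lower() for w in WORD.findall(text))


def count_matching(text, pattern):
    """Count the lowercased words that match pattern in full."""
    regex = re.compile(pattern)
    counts = Counter()
    for word, n in word_counts(text).items():
        if regex.fullmatch(word):
            counts[word] = n
    return counts
```

For instance, count_matching(text, r"[a-z]+ing").most_common(10) would list the ten most frequent "-ing" words, and swapping in a different pattern answers a different question with the same code.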

4. Practical matters

For many of you, this project will be mainly about getting to know enough Python to get by, learning how to use NLTK, and, especially, learning to use regular expressions. So choose texts and questions (both the number of questions and the difficulty of the questions) that are proportional to your ability. On the other hand, if you're already very comfortable with Python and regular expressions, I expect you to give yourself a bigger challenge.

While I was learning Python (and that mainly happened this past spring semester, to be honest), I simply Googled for how to do whatever in Python as I needed to.

If you want to work with another person, that's fine. The two of you may make a single submission. Choose a partner (or don't) so as to maximize learning.

If you have expertise in a foreign or ancient language, then feel free to make a corpus in a language other than English (and, obviously, adjust the questions appropriately).

5. To turn in

Turn in your code and a short (about one page) report on the texts you chose, what you needed to do to clean them up, the questions you chose, and what you discovered about those questions. Copy your files to

/cslab.all/ubuntu/cs394/proj1/your_name

where your_name is [brandon|chris|davenport|elliot|gill|josh|kate|leanne|nathan|taylor|tmac].

DUE: 12 midnight between Thursday, Oct 3, 2013 and Friday, Oct 4, 2013


Thomas VanDrunen
Last modified: Thu Oct 3 15:12:29 CDT 2013