Project 1: Regular Expressions

The goals of this project are

to begin thinking about language-processing problems
to get used to Python and the NLTK
to practice using regular expressions

This project is somewhat open-ended, but the general structure of your task is

to go out and find some collection of text
to "clean up" that text
to compute some linguistic information about that text

The last two tasks will be accomplished using regular expressions (and other tools such as those in the NLTK). I strongly recommend using Python and the NLTK for this project (both because of the tools available and because this is a good opportunity to learn Python), but if you happen to know how to use equivalent tools in other programming languages and/or toolkits, I won't forbid you to use them in the project.

1. Going out and finding some text

The first step is to make a mini-corpus of text to experiment on. Accumulate a body of text (presumably found on the Internet) and store it in files to process. Possible sources include

Works of literature, such as those that can be downloaded from sites like Project Gutenberg.
Archives of listserv or chatroom discussions.
News articles.
Archives of blog posts or social media posts.
Wikipedia articles (but in that case you had better have some good questions to investigate in part 3... studying Wikipedia articles isn't very creative, though the Wikipedia talk pages might be interesting).

But don't start on this until you have thought about parts 2 and 3. Your choice of text to work with will depend on (a) how big a challenge you want part 2 to be, and (b) what kind of questions you will ask in part 3, since those questions may be specific to the domain or genre of the text.

Depending on the kind of source, you can acquire this text by saving it to a file on disk through your web browser ("save page as"), by copying and pasting into a text file, or by using Python's urlopen function (see section 3.1 in the NLTK book).
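For the urlopen route, a minimal sketch might look like the following. It assumes Python 3, where urlopen lives in urllib.request (the NLTK book's examples may import it from a different module, depending on the edition), and it assumes the page is encoded as UTF-8. The function names fetch_text and save_text are made up for this illustration.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download the resource at `url` and return its contents as a string."""
    with urlopen(url) as response:
        raw_bytes = response.read()
    # Assumption: the page is (mostly) in `encoding`; bytes that are not
    # valid in that encoding are replaced rather than raising an error.
    return raw_bytes.decode(encoding, errors="replace")


def save_text(text, path, encoding="utf-8"):
    """Write `text` to `path` so the later steps can work from disk."""
    with open(path, "w", encoding=encoding) as out_file:
        out_file.write(text)
```

You would call it as save_text(fetch_text(some_url), "raw_page.html"), where some_url is whatever page you have chosen. What comes back is the raw page, HTML tags and all, so it still needs the clean-up described in part 2.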

As for the size of the text, it depends on what is practical. The Lewis Carroll text I used in class a couple of weeks ago (which contained both Alice's Adventures in Wonderland and Through the Looking Glass) was about 220 KB, which is very small for a literature corpus. However, if you're grabbing news articles or blog posts, you'll likely end up with something much smaller.

2. Cleaning up the text

Now you need to clean out portions of the information that you do not want to include in your analysis. This would include

HTML tags, if what you downloaded is HTML and not raw text.
Metadata, such as date, subject, and author information in forum posts.
Bylines and similar material in articles.
Chapter headings in literature.
Any other irregularities in the text.

Obviously the difficulty of this task will depend greatly on the data you have. Stripping out HTML tags would be a fairly hard task, but you may use the NLTK clean_html function to do that (see "Dealing with HTML" in Section 3.1, referenced above).
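To see why tag stripping is harder than it looks, here is a deliberately crude regular-expression sketch (Python 3, standard library only). It is not how clean_html works and is not a substitute for it; the name strip_tags and the three patterns are my own illustration. It drops script and style blocks, removes anything that looks like a tag, and collapses the leftover spaces.

```python
import re

# Crude patterns: a real HTML parser handles many cases these miss
# (comments, character entities such as &amp;, a ">" inside an attribute).
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                             re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
SPACE_RUN = re.compile(r"[ \t]+")


def strip_tags(html):
    """Return `html` with script/style blocks and tags removed."""
    text = SCRIPT_OR_STYLE.sub(" ", html)  # drop code and CSS entirely
    text = TAG.sub(" ", text)              # replace each remaining tag with a space
    text = SPACE_RUN.sub(" ", text)        # collapse runs of spaces and tabs
    return text.strip()
```

For example, strip_tags("<p>Hello <b>world</b></p>") gives "Hello world". Real pages will turn up cases this misses, which is part of the exercise.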

The point of this step is to practice using regular expressions to make text more usable for the questions you will investigate in part 3. See Section 3.4 in the NLTK book and the Python regular expressions documentation for help (there are other places online that can also give you direction in this; simply Google for "python regular expressions"). I recommend that you write code that will read in the files you have, transform the text, and write the result back to disk (so that you then load the cleaned text in part 3), but if you are confident in what you're doing, you could always incorporate this task into your code for part 3 and clean up the text on the fly.
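The read, transform, and write-back workflow could be sketched roughly as below (Python 3, standard library only). Everything specific here is an assumption for illustration: the function names, the idea that raw files are .txt files in one directory, and especially the patterns, which guess at what forum headers and chapter headings look like. Your own data will need its own patterns.

```python
import re
from pathlib import Path

# Hypothetical patterns; adjust them to whatever your own files contain.
# Header lines such as "Date: ..." or "Subject: ..." in forum posts.
METADATA_LINE = re.compile(r"^(?:Date|Subject|From|Author):.*\n?", re.MULTILINE)
# Headings such as "CHAPTER IV" or "Chapter 12." on a line of their own.
CHAPTER_HEADING = re.compile(r"^[ \t]*chapter[ \t]+(?:[ivxlc]+|\d+)\.?[ \t]*$\n?",
                             re.MULTILINE | re.IGNORECASE)
# Three or more newlines in a row, left behind by the removals above.
BLANK_RUN = re.compile(r"\n{3,}")


def clean_text(text):
    """Apply each clean-up pattern in turn and tidy the leftover blank lines."""
    text = METADATA_LINE.sub("", text)
    text = CHAPTER_HEADING.sub("", text)
    text = BLANK_RUN.sub("\n\n", text)
    return text.strip() + "\n"


def clean_directory(raw_dir, clean_dir):
    """Clean every .txt file in `raw_dir` and write the result to `clean_dir`."""
    out_dir = Path(clean_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for raw_path in sorted(Path(raw_dir).glob("*.txt")):
        cleaned = clean_text(raw_path.read_text(encoding="utf-8"))
        (out_dir / raw_path.name).write_text(cleaned, encoding="utf-8")
```

Writing the cleaned files to a separate directory keeps the raw downloads intact, so you can rerun the clean-up whenever you refine a pattern.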

3. Computing information about the text

Finally, once you have a corpus in a cleaned-up form, come up with some questions about this text that can be answered using regular expressions (and other tools from NLTK). Here are some suggestions to get you started:

What "wh"-words (like "what", "which", "where", "who", "when", "why"... and maybe "how") are used, how frequent are they relative to each other, and how frequent are they in the text as a group? How frequently do they begin sentences?
What proper nouns are used, which are most common, and what percentage of the text do they make up?
What hyphenated words or word-groups are there, and how common are they?
What contractions are most common? What about other uses of apostrophes?
If your text is financial news, can you detect stock symbols or prices, or find information about their frequency?
What dates and other references to time can you find, and how common are they in the text and relative to each other?
How common are phrases like "less xxxxx than", and what xxxxx are used in such phrases?

Use your imagination and curiosity.
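As a starting point, two of the suggestions above (the "wh"-words and the "less xxxxx than" phrases) might be approached along these lines. This is only a sketch under simplifying assumptions: Python 3 with the standard library, English text, a rough regular expression standing in for a real tokenizer, and sentence starts guessed from punctuation. NLTK's tokenizers would likely do the word and sentence splitting better. All names are made up for this illustration.

```python
import re
from collections import Counter

WH_WORDS = ("what", "which", "where", "who", "when", "why", "how")
WH_ALTERNATION = "|".join(WH_WORDS)

# \b keeps "who" from matching inside "whole".
WH_PATTERN = re.compile(r"\b(" + WH_ALTERNATION + r")\b", re.IGNORECASE)
# Rough word pattern: letters, optionally joined by apostrophes or hyphens.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
# Rough sentence start: start of the text, or ., ! or ? followed by whitespace,
# then an optional opening quotation mark.
SENTENCE_START_WH = re.compile(
    r"(?:^|[.!?]\s+)[\"']?(" + WH_ALTERNATION + r")\b", re.IGNORECASE)
# One word between "less" and "than"; the word is captured.
LESS_THAN = re.compile(r"\bless\s+(\w+)\s+than\b", re.IGNORECASE)


def wh_report(text):
    """Count wh-words overall, at sentence starts, and as a share of all words."""
    total_words = len(WORD_PATTERN.findall(text))
    counts = Counter(w.lower() for w in WH_PATTERN.findall(text))
    initial = Counter(w.lower() for w in SENTENCE_START_WH.findall(text))
    wh_total = sum(counts.values())
    return {
        "counts": counts,
        "sentence_initial": initial,
        "share_of_all_words": wh_total / total_words if total_words else 0.0,
    }


def less_than_fillers(text):
    """Count the words that fill the slot in 'less ___ than'."""
    return Counter(w.lower() for w in LESS_THAN.findall(text))
```

Calling wh_report(text)["counts"].most_common() lists the wh-words from most to least frequent. Note that the sentence-start pattern will be fooled by abbreviations like "Mr." and that LESS_THAN allows only a single word in the slot; deciding how much such cases matter for your corpus is part of the project.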

4. Practical matters

For many of you, this project will be mainly about getting to know enough Python to get by, learning how to use NLTK, and, especially, learning to use regular expressions. So choose texts and questions (both the number of questions and the difficulty of the questions) that are proportional to your ability. On the other hand, if you're already very comfortable with Python and regular expressions, I expect you to give yourself a bigger challenge.

While I was learning Python (and that mainly happened this past spring semester, to be honest), I simply Googled for how to do whatever in Python as I needed to.

If you want to work with another person, that's fine. The two of you may make a single submission. Choose a partner (or don't) so as to maximize learning.

If you have expertise in a foreign or ancient language, then feel free to make a corpus in a language other than English (and, obviously, adjust the questions appropriately).

5. To turn in

Turn in your code and a short (about one page) report on the texts you chose, what you needed to do to clean them up, the questions you chose, and what you discovered about those questions. Copy your files to

/cslab.all/ubuntu/cs394/proj1/your_name

DUE: 12 midnight between Thursday, Oct 3, 2013 and Friday, Oct 4, 2013

Thomas VanDrunen

Last modified: Thu Oct 3 15:12:29 CDT 2013