Lab 13: Java Collections

The goal of this lab is to practice using classes from the Java Collections framework (ArrayList, HashMap, HashSet, and Iterator) to solve problems.

1. Introduction

In this lab you will write a program that takes a piece of text and counts the frequency of the words in it. For example, in this sentence, the word times occurs 5 times, occurs also occurs 5 times, and occurs 3 times, and for, example, in, this, sentence, the, word, each and also each occurs 2 times.

First think about how you would do this by hand if you were given a page of text. You would go through the text while keeping a list of all the words that occur. If the next word you look at is not yet on your list (that is, it's the first time you see the word in the text), you add it to the list and put a mark by it. If it is already on your list, find the word and add another mark by it. That way, you tally all the times each word occurs. We will be using a similar strategy for our program.

times    |||||
occurs   |||||
and      |||
for      ||
...

Your program will read the text to be counted from a file. The component that reads from the file is provided for you. Your program should be used like this:

 java FileProcessor filename

This would read in a piece of text stored in the file filename and print the results to the screen.

2. Set up

As usual, clone the repository from the class directory:

hg clone /cslab/class/csci235/lab13
cd lab13

You should have two .java files to work on, plus several files that you can use as input for testing.

3. Read in and count

This part will consist of a sequence of smaller steps. Read them carefully, think about what they are doing, and program them carefully.

Step 1: First, become familiar with the file FileProcessor.java. It has a main() method, which you will complete. It also has a method processFile(), which is written for you. processFile() takes a String, which it interprets as the name of a file. It reads that file, ignores punctuation, spaces, etc., and produces an ArrayList of the words.

Right now (before you make any changes or uncomment anything), the main() method simply prints the ArrayList. We saw in class that ArrayLists have a useful toString() method, but we didn't see it in action. Compile and run FileProcessor as it is, and look at the result.
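To get a feel for what printing an ArrayList looks like, here is a minimal sketch that is separate from the lab files. The class name ListPrintDemo and the sample words are made up for illustration.

```java
import java.util.ArrayList;

public class ListPrintDemo {
    // Builds a small list of words; the contents are arbitrary sample data.
    static ArrayList<String> sampleWords() {
        ArrayList<String> words = new ArrayList<String>();
        words.add("the");
        words.add("quick");
        words.add("fox");
        return words;
    }

    public static void main(String[] args) {
        // println() calls toString() on the list behind the scenes.
        System.out.println(sampleWords()); // prints [the, quick, fox]
    }
}
```

The output of FileProcessor should have the same bracketed, comma-separated shape, just with the words from your input file.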

Step 2: Delete the line that prints out the ArrayList; we don't need it anymore. Now we get to the heart of the tallying. To implement the word-tally strategy described earlier, you will use a class called HashMap.

Like ArrayList and HashSet, HashMap is a container class, that is, something used to store a collection of data. HashMap acts like a table, where we can look up a value based on a key. Every HashMap needs two types of data: one type for the keys and another type for the values. (Mathematically, HashMap is somewhat like a function, where the keys are the domain and the values associated with the keys are the codomain.)

The important methods of HashMap are put(), which takes a key and a value and associates that key with the value (this may overwrite a previous association with that key); get() which takes a key and returns the value associated with that key; containsKey(), which determines whether or not a potential key is currently associated with a value; and keySet(), which returns a Set (the supertype of HashSet) which contains all the keys.
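The four methods just described can be tried out in isolation. The following is a small sketch, not part of the lab files; the class name MapMethodsDemo and the name/age data are hypothetical and chosen only so the example does not overlap with the word tally.

```java
import java.util.HashMap;
import java.util.Set;

public class MapMethodsDemo {
    // Builds a map from names (keys) to ages (values).
    static HashMap<String, Integer> buildAges() {
        HashMap<String, Integer> ages = new HashMap<String, Integer>();
        ages.put("alice", 30);
        ages.put("bob", 25);
        ages.put("alice", 31); // same key again: overwrites the old value 30
        return ages;
    }

    public static void main(String[] args) {
        HashMap<String, Integer> ages = buildAges();
        System.out.println(ages.get("alice"));         // prints 31
        System.out.println(ages.containsKey("bob"));   // prints true
        System.out.println(ages.containsKey("carol")); // prints false

        // keySet() gives the keys only; a HashMap does not promise any order.
        Set<String> names = ages.keySet();
        System.out.println(names.size());              // prints 2
    }
}
```

Note that calling get() with a key that has never been put in the map returns null, which is one reason containsKey() is useful.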

Think about the table you would use to do a word tally by hand. How can you use a HashMap to do this? As you can see in the program, we make a HashMap that associates Strings with Integers. We want to associate every word in the file with a number: the number of times we've seen it (so far).

Step 2 involves a loop which iterates through the words in the ArrayList. Fill in the loop for this step so that it adjusts the HashMap to account for the current word. (Think about: what if this word has occurred already? What if this word is occurring for the first time?) Make the word lowercase; we do not want to consider capitalization in our word counting.

Remember that each call to it.next() does two things: it returns the next word and moves the iterator ahead one position in the ArrayList. If you call it.next() two times in a row, you will get two different words, so call it once per pass through the loop and store the result in a variable.
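One possible shape for such a loop is sketched below. This is only an illustration under assumptions: the class name TallySketch, the method countWords(), and the sample words are invented here, and the loop in FileProcessor.java may be laid out differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;

public class TallySketch {
    // Counts how often each word occurs, ignoring capitalization.
    static HashMap<String, Integer> countWords(ArrayList<String> words) {
        HashMap<String, Integer> tally = new HashMap<String, Integer>();
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            // Call next() exactly once per pass and keep the result.
            String word = it.next().toLowerCase();
            if (tally.containsKey(word)) {
                // Seen before: add one more mark.
                tally.put(word, tally.get(word) + 1);
            } else {
                // First occurrence: start the count at 1.
                tally.put(word, 1);
            }
        }
        return tally;
    }

    public static void main(String[] args) {
        ArrayList<String> words = new ArrayList<String>(
                Arrays.asList("The", "cat", "saw", "the", "dog"));
        HashMap<String, Integer> tally = countWords(words);
        System.out.println(tally.get("the")); // prints 2
        System.out.println(tally.get("cat")); // prints 1
    }
}
```

The two branches correspond to the two cases in the by-hand strategy: adding a new word to the list, or adding another mark next to a word already on it.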

HashMap happens to have a nice toString() method, so we can see what it looks like. The program is set up so that it will print the HashMap after you have populated it. Compile and test.

Step 3: Delete the line that prints the HashMap. We would like to print the results ourselves, one word per line. Uncomment the section for this step. Notice that it contains a loop which will iterate through the HashMap. Pay careful attention to how this is done. In particular, notice the initialization:

          Iterator it = tally.keySet().iterator(); 

What does this mean? First, we call keySet() on the HashMap, and this will return the Set of Strings that are associated with some value. Then we make an iterator that iterates through the Strings in the set of keys. Notice that this iterator is not iterating through the HashMap itself, but rather the set of keys for the HashMap.
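The pattern of iterating over the key set and looking each key up in the map can be sketched as follows. The class name KeySetSketch, the method describe(), and the sample counts are assumptions made for this illustration; here the lines are collected into a list so that they can be inspected, whereas the lab asks you to print them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;

public class KeySetSketch {
    // Produces one "word count" line per key in the map.
    static ArrayList<String> describe(HashMap<String, Integer> tally) {
        ArrayList<String> lines = new ArrayList<String>();
        // The iterator walks the set of keys, not the map itself.
        Iterator<String> it = tally.keySet().iterator();
        while (it.hasNext()) {
            String key = it.next();
            // Use the key to look up its value in the map.
            lines.add(key + " " + tally.get(key));
        }
        return lines;
    }

    public static void main(String[] args) {
        HashMap<String, Integer> tally = new HashMap<String, Integer>();
        tally.put("the", 2);
        tally.put("cat", 1);
        // The order of the lines is not guaranteed for a HashMap.
        for (String line : describe(tally)) {
            System.out.println(line);
        }
    }
}
```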

Fill in this loop so that it prints out a word followed by its frequency, one word per line. Then compile and test.

Step 4: This information would be much more useful if it were sorted. Moreover, since we're most likely to be interested in the most frequent words, we would like it sorted in descending order of frequency. For Ex19, the output would begin something like

the 74
to 32
and 31
of 18
lord 17
you 15
people 15
moses 13
mountain 12
them 9
...

Our strategy is that before writing all the word/frequency pairs, we will store them in an ArrayList of pairs and sort them. In order to have an "ArrayList of pairs", we need some class to represent those pairs.

Open the file SIPair.java. This is for a class of the same name which represents a pair containing a String and an int. (CSCI 243 students: notice that if we were writing in ML instead of Java, we wouldn't have to do this!) You will be completing this class in the next few steps, but you don't need to do anything with it now.

Instead, back in FileProcessor.java, uncomment the line that makes an ArrayList of these pairs, and change the loop you wrote in Step 3 so that, rather than printing each String and int, it makes a new SIPair object out of them and stores that object in the ArrayList.

There's not much to test in this step, but make sure your code compiles.

Step 5: Java provides a ready-made sort() method in the class Collections, which we would like to use here. This method will sort an ArrayList as long as the type of element stored in the ArrayList implements an interface from the Java API called Comparable. This just means that the element type must implement a method compareTo(), which compares two items of the same type so that the sort() method knows what order to put them in.
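As a warm-up, here is a short sketch of Collections.sort() on a type that already implements Comparable, namely String. The class name SortDemo, the helper sorted(), and the sample words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;

public class SortDemo {
    // Returns a sorted copy, leaving the original list untouched.
    static ArrayList<String> sorted(ArrayList<String> input) {
        ArrayList<String> copy = new ArrayList<String>(input);
        // sort() uses String's own compareTo() to decide the order.
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        ArrayList<String> words = new ArrayList<String>(
                Arrays.asList("moses", "lord", "people"));
        System.out.println(sorted(words)); // prints [lord, moses, people]

        // compareTo() returns a negative number when the receiver comes first.
        System.out.println("apple".compareTo("banana") < 0); // prints true
    }
}
```

The contract of compareTo() is: negative if this item should come before the other, positive if it should come after, and zero if the two are equal for ordering purposes.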

So, go to SIPair.java and fill in the method public int compareTo(SIPair other). Refer to the Java API description of the method to see how it should work, in particular what it should return based on the things it is comparing. Remember, we want to sort so that the highest frequency comes first. If two words have the same frequency, we break the tie by putting them in alphabetical order based on the words themselves. You can do this by calling the compareTo() method of the class String.
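The ordering described here (highest frequency first, ties broken alphabetically) can be sketched with a stand-in class. WordCount, its fields, and the helper sortedWords() are hypothetical names used only for this illustration; they are not the contents of SIPair.java, and this is one possible way to write the comparison, not the required one.

```java
import java.util.ArrayList;
import java.util.Collections;

public class WordCount implements Comparable<WordCount> {
    private final String word;
    private final int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() {
        return word;
    }

    public int compareTo(WordCount other) {
        if (count != other.count) {
            // Arguments are swapped so that the larger count sorts first.
            return Integer.compare(other.count, count);
        }
        // Equal counts: fall back to alphabetical order of the words.
        return word.compareTo(other.word);
    }

    // Sorts some sample pairs and returns just the words, in sorted order.
    static ArrayList<String> sortedWords() {
        ArrayList<WordCount> pairs = new ArrayList<WordCount>();
        pairs.add(new WordCount("you", 15));
        pairs.add(new WordCount("the", 74));
        pairs.add(new WordCount("people", 15));
        Collections.sort(pairs);
        ArrayList<String> words = new ArrayList<String>();
        for (WordCount p : pairs) {
            words.add(p.getWord());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(sortedWords()); // prints [the, people, you]
    }
}
```

Swapping the arguments to Integer.compare() is what reverses the usual ascending order; the tie-break then keeps the normal alphabetical direction.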

Step 6: Now, back in FileProcessor.java, uncomment the line that calls Collections.sort(). That's all; now the list is sorted.

Make sure everything compiles before you move on.

Step 7: Soon we will print our results to the screen. But first, we need a way to print a pair (whether to the screen or to a file). Fill in the toString() method of SIPair so that it returns a representation of the pair that looks like, for example, the 74.

Step 8: Finally, write a loop that will write each element in the ArrayList to the screen. Make use of the toString() method you just wrote.
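Steps 7 and 8 together might look something like the sketch below. The nested class Entry stands in for a pair class and, like PrintPairsSketch and sample(), is a hypothetical name for this illustration only; the data is taken from the sample output shown earlier.

```java
import java.util.ArrayList;

public class PrintPairsSketch {
    // A stand-in pair of a word and its count.
    static class Entry {
        final String word;
        final int count;

        Entry(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            // The word, a space, then the count.
            return word + " " + count;
        }
    }

    // Builds a short list of pairs, assumed to be already sorted.
    static ArrayList<Entry> sample() {
        ArrayList<Entry> pairs = new ArrayList<Entry>();
        pairs.add(new Entry("the", 74));
        pairs.add(new Entry("to", 32));
        return pairs;
    }

    public static void main(String[] args) {
        // println() calls toString() on each element, one per line.
        for (Entry e : sample()) {
            System.out.println(e);
        }
        // prints:
        // the 74
        // to 32
    }
}
```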

Compile and test everything.

4. Turn in

Make a typescript showing your program running on one of the sample files given; print that along with your source code.

Turn in your source files and typescript with a command like

/cslab/class/csci235/bin/handin lab13 *.java typescript

Thomas VanDrunen, Cary Gray
Last modified: Tue Nov 23 09:08:09 CST 2010