Lab 11: Java Collections

The goal of this lab is to practice using classes from the Java Collections framework (Vector, HashMap, HashSet, and Iterator) to solve problems.

1. Introduction

In this lab you will write a program that takes a piece of text and counts the frequency of the words in it. For example, in this sentence, the word "times" occurs 5 times, "occurs" also occurs 5 times, "and" occurs 3 times, and "for", "example", "in", "this", "sentence", "the", "word", "each" and "also" each occurs 2 times.

First think about how you would do this by hand if you were given a page of text. You would go through the text while keeping a list of all the words that occur. If the next word you look at is not yet on your list (that is, it's the first time you see the word in the text), you add it to the list and put a mark by it. If it is already on your list, you find the word and add another mark by it. That way, you tally all the times each word occurs. We will be using a similar strategy for our program.

times
occurs
and
for
...

Your program will read the text whose words it counts from a file. I will provide the component that reads the file. Your program should be run like this:

> java WordCounter filename

This would read in a piece of text stored in the file filename and print the results to the screen.

2. Set up

As usual, make a new directory for this project and cd into it.

Copy the following files from the course directory. The first is the starter for your lab. The other four are examples to test it on.

> cp /homeemp/tvandrun/pub/235/FileProcessor.java .
> cp /homeemp/tvandrun/pub/235/Ex19 .
> cp /homeemp/tvandrun/pub/235/Ps132 .
> cp /homeemp/tvandrun/pub/235/Rom12 .
> cp /homeemp/tvandrun/pub/235/Rev22 .

3. Read in and count

This part will consist of a sequence of smaller steps. Read them carefully, think about what they are doing, and program them carefully.

Step 1: The class FileProcessor contains a static method toVector which takes the name of a file and returns a Vector containing the words in the file (so, a very long Vector). Open a new file (something like WordCounter.java), in which you will write your program. Write the opening documentation, the import statement, and the skeleton of the main method. Then fill in the main method so that it reads the name of the text file from the command line, uses FileProcessor.toVector() to convert it, and iterates (using an iterator) over the Vector, printing each word to the screen. Compile and test.
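
The iteration part of step 1 might look roughly like the sketch below. It is only an illustration: the class name IteratorSketch and the helper listWords are made up for this example, the element type Vector<String> is an assumption about what FileProcessor.toVector() returns, and the Vector is filled by hand so the sketch compiles without FileProcessor.

```java
import java.util.Iterator;
import java.util.Vector;

// Hypothetical sketch class; your own program would be WordCounter.
class IteratorSketch {
    // Visits every word with an explicit Iterator and collects them one per line.
    static String listWords(Vector<String> words) {
        StringBuilder out = new StringBuilder();
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            out.append(it.next()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in data. In the lab, the Vector would instead come from
        // FileProcessor.toVector(args[0]) (assumed here to hold Strings).
        Vector<String> words = new Vector<>();
        words.add("In");
        words.add("the");
        words.add("beginning");
        System.out.print(listWords(words));
    }
}
```

In your own main method you would print each word directly inside the loop; collecting the words into a String here just makes the sketch easy to check.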

Step 2: To implement the word-tally strategy mentioned earlier, you will use a HashMap. Think about the table you would use to do a word tally by hand. How can you use a HashMap to do this? What types of values would it associate? Create an appropriate HashMap and store it in a variable. Given a word, how do we update our tally? Change your loop so that instead of printing the word to the screen, it adjusts the HashMap to account for this word. (Think about: what if this word has occurred already? What if this word is occurring for the first time?) Make the word lowercase; we do not want to consider capitalization in our word counting. Compile and test.
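
One possible shape for the tally update is sketched below, assuming the map associates each word (a String) with its count so far (an Integer). The names TallySketch and tally are hypothetical, and the words are supplied by hand rather than read from a file.

```java
import java.util.HashMap;
import java.util.Vector;

// Hypothetical sketch class illustrating the tally update only.
class TallySketch {
    static HashMap<String, Integer> tally(Vector<String> words) {
        HashMap<String, Integer> counts = new HashMap<>();
        for (String raw : words) {
            // Ignore capitalization, as the step asks.
            String word = raw.toLowerCase();
            if (counts.containsKey(word)) {
                // Seen before: add another mark.
                counts.put(word, counts.get(word) + 1);
            } else {
                // First occurrence: put it on the list with one mark.
                counts.put(word, 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Vector<String> words = new Vector<>();
        words.add("The");
        words.add("mountain");
        words.add("the");
        HashMap<String, Integer> counts = tally(words);
        System.out.println("the: " + counts.get("the"));
        System.out.println("mountain: " + counts.get("mountain"));
    }
}
```

The for-each loop uses the Vector's iterator behind the scenes; an explicit Iterator, as in step 1, works the same way.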

Step 3: Once our tally is done, we want to print our tally table to the screen. Write a second loop that iterates over the keys of the HashMap and prints out a word followed by its frequency, one word per line. Compile and test.
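
As a rough illustration of iterating over the keys, the sketch below walks the key set of a small hand-built map. The names PrintTallySketch and format are made up for this example. Note that a HashMap makes no promise about the order in which its keys come back, which is what motivates the sorting in the later steps.

```java
import java.util.HashMap;
import java.util.Iterator;

// Hypothetical sketch class illustrating iteration over a HashMap's keys.
class PrintTallySketch {
    // Builds one "word frequency" line per key in the map.
    static String format(HashMap<String, Integer> counts) {
        StringBuilder out = new StringBuilder();
        Iterator<String> keys = counts.keySet().iterator();
        while (keys.hasNext()) {
            String word = keys.next();
            out.append(word).append(' ').append(counts.get(word)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        HashMap<String, Integer> counts = new HashMap<>();
        counts.put("the", 74);
        counts.put("to", 32);
        // The two lines may appear in either order.
        System.out.print(format(counts));
    }
}
```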

Step 4: This information would be much more useful if it were sorted. Moreover, since we are most likely to be interested in the most frequent words, we would like it sorted in descending order of frequency. For Ex19, that would look something like

the 74
to 32
and 31
of 18
lord 17
you 15
people 15
moses 13
mountain 12
them 9
...

Our strategy is that before writing all the word/frequency pairs, we will store them in a Vector of pairs and sort them. In order to have a "Vector of pairs", we need some class to represent those pairs.

Write a class (in a separate file) to represent/contain a word/frequency pair. What instance variables would you need? Make sure you write a usable constructor. Do not worry yet about other methods.
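
A minimal sketch of such a class follows. The name WordFreq and the two small accessor methods are assumptions made for illustration; the step itself only asks for the instance variables and a constructor, and you are free to pick your own names.

```java
// Hypothetical pair class; it would live in its own file, WordFreq.java.
class WordFreq {
    private final String word;   // the word being counted
    private final int frequency; // how many times it occurred

    public WordFreq(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }

    // Accessors are not required by this step; they are here so the
    // sketch can show the stored values.
    String getWord() {
        return word;
    }

    int getFrequency() {
        return frequency;
    }

    public static void main(String[] args) {
        WordFreq pair = new WordFreq("the", 74);
        System.out.println(pair.getWord() + " occurs " + pair.getFrequency() + " times");
    }
}
```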

Step 5: Make a new variable to store a Vector of the class you made in step 4, and initialize the variable by assigning to it a new instance of Vector. Then modify the loop from step 3 so that instead of printing the current word and frequency to the screen, it puts a new instance of the class you wrote in step 4 into the Vector. (Do not delete the old line that prints; just comment it out, so you can refer to it later.)

Step 6: Java comes with a ready-made sort method, Collections.sort(), in the class Collections, which we would like to use for this. To do so, you need to make your class from step 4 a Comparable. Comparable is an interface that comes with Java, and it requires that a class implement a method called compareTo, which takes another instance of the same class and returns an integer to signal which instance comes first. This way, a sort method knows in what order to put them.

So, update the class so that it explicitly implements Comparable<[name of your class]> and write a method public int compareTo([name of your class] other). Refer to the Java API description of the method to see how it is supposed to work. Remember, we want to sort so as to put the highest frequency first.

Step 7: Now, after the for loop, simply call Collections.sort(), passing it your Vector. That's all; now it's sorted.
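
Steps 6 and 7 together could be sketched as follows. This is one way to do it, not the required one: the names SortSketch and Entry are hypothetical (the pair class is nested here only so the sketch fits in a single file), and the comparison uses Integer.compare with its arguments swapped to get highest-frequency-first order.

```java
import java.util.Collections;
import java.util.Vector;

// Hypothetical sketch class showing a Comparable pair and Collections.sort().
class SortSketch {
    static class Entry implements Comparable<Entry> {
        final String word;
        final int frequency;

        Entry(String word, int frequency) {
            this.word = word;
            this.frequency = frequency;
        }

        // A negative result means "this comes before other". Comparing
        // other's frequency against ours puts the larger frequency first.
        public int compareTo(Entry other) {
            return Integer.compare(other.frequency, this.frequency);
        }
    }

    public static void main(String[] args) {
        Vector<Entry> pairs = new Vector<>();
        pairs.add(new Entry("to", 32));
        pairs.add(new Entry("the", 74));
        pairs.add(new Entry("and", 31));

        Collections.sort(pairs); // uses Entry.compareTo

        // Prints "the 74", then "to 32", then "and 31".
        for (Entry e : pairs) {
            System.out.println(e.word + " " + e.frequency);
        }
    }
}
```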

Step 8: Soon we will print our results to the screen. But first, we need to have a way to print a pair (whether to the screen or to a file). Add a toString() method to your class which will return a representation of your data that looks like, for example, the 74.

Step 9: Finally, write a loop that will write each element in the Vector to the screen. Make use of the toString() method you just wrote.
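
For steps 8 and 9, a sketch along these lines may help. The names ReportSketch and Entry are again hypothetical, and the Vector is filled by hand in an order that is already sorted. The point is that System.out.println calls toString() on whatever object it is given.

```java
import java.util.Vector;

// Hypothetical sketch class showing toString() and the final printing loop.
class ReportSketch {
    static class Entry {
        final String word;
        final int frequency;

        Entry(String word, int frequency) {
            this.word = word;
            this.frequency = frequency;
        }

        // Produces, for example, "the 74".
        @Override
        public String toString() {
            return word + " " + frequency;
        }
    }

    public static void main(String[] args) {
        Vector<Entry> pairs = new Vector<>();
        pairs.add(new Entry("the", 74));
        pairs.add(new Entry("to", 32));

        // println calls toString() on each element.
        for (Entry e : pairs) {
            System.out.println(e);
        }
    }
}
```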

4. Turn in

Copy the files you modified to

/homestu/jmichalk/labTA/turnInProjects/yourFirstName,yourLastInitial

Thomas VanDrunen

Last modified: Thu Dec 21 14:56:01 CST 2006