This goal of this project is to practice using classes from the Java Collections framework (ArrayList, HashMap, HashSet, and Iterator) to solve problems. (See the class page for a link to the handout from class.)
As usual, clone the repository from the class directory:
You should see two subdirectories, examples and counter. Go into the examples directory, where you will find the two example programs from yesterday’s handout. Read these programs and make sure you understand how they work; compile and test them.
Now change into the counter directory; the command to do that is probably
You should have two .java files to work on, one .class file, and several files that you can use as input for testing.
In this lab you will write a program that takes a piece of text and counts the frequency of the words in it. For example, in this sentence, the word times occurs 5 times, occurs also occurs 5 times, and occurs 3 times, and for, example, in, this, sentence, the, word, each and also each occurs 2 times.
First think about how you would do this by hand if you were given a page of text. You would go through the text while keeping a list of all the words that occur. If the next word you look at is not yet on your list (that is, it’s the first time you see the word in the text), you add it to the list and put a mark by it. If it is already on your list, find the word and add another mark by it. That way, you tally all the times each word occurs. We will be using a similar strategy for our program.
times | |
occurs | |
and | |
for | |
... |
Your program will read the text on which to do the word counting from a file. You’ll be provided the component that reads in from a file. Your program should be used like this:
This would read in a piece of text stored in the file filename and print the results to the screen.
Hint: It will probably be a good idea to commit your work to Mercurial when you finish each of the steps below.
This part will consist of a sequence of smaller steps. Read them carefully, think about what they are doing, and program them carefully.
Step 1: First, become familiar with the file FileProcessor.java. It has a main() method, which you will complete. It calls FileAux.filesToWords() from the provided class file, which takes the name of a file, opens and reads that file, and returns an ArrayList of the words it contains (ignoring punctuation, spaces, etc.). That is provided for you.
In the main() method right now (before you make any changes to it or uncomment anything), we simply print the ArrayList. We saw in class that ArrayLists have a useful toString() method, but we didn’t see it in action. Compile FileProcessor as it is, and look at the result.
Step 2: Now, delete the line that prints out the ArrayList; we don’t need it anymore. Now we get to do the heart of the tallying. To implement the word-tally strategy mentioned earlier, you will use a class called HashMap.
Like ArrayList and HashSet, HashMap is a container class, that is, something used to store a collection of data. HashMap acts like a table, where we can look up a value based on a key. Every HashMap needs two types of data: one type for the keys and another type for the values. (Mathematically, HashMap is somewhat like a function, where the keys are the domain and the values associated with the keys are the codomain.)
The important methods of HashMap are put(), which takes a key and a value and associates that key with the value (this may overwrite a previous association with that key); get() which takes a key and returns the value associated with that key; containsKey(), which determines whether or not a potential key is currently associated with a value; and keySet(), which returns a Set (the supertype of HashSet) which contains all the keys.
Think about the table you would use to do a word tally by hand. How can you use a HashMap to do this? As you can see in the program, we make a HashSet that associates Strings with ints. For every word in the file, we want to associate it with a number, the number of times we’ve seen it (so far).
Step 2 involves a loop which iterates through the words in the ArrayList. Fill-in the loop for this step so that it adjusts the HashMap to account for the current word. (Think about: what if this word has occurred already? what if this word is occurring for the first time?) Make the word lowercase; we do not want to consider capitalization in our word counting.
Don’t forget that it.next() will return each word once; really, it does two things: it returns the next word and moves the iterator ahead one position in the ArrayList. If you call it.next() two times in a row, you will get different results.
HashMap happens to have a nice toString() method, so we can see what it looks like. The program is set up so that it will print the HashMap after you have populated it. Compile and test.
Step 3: Delete the line that prints the HashMap. We would like to print the results ourselves, one word per line. Uncomment the section for this step. Notice that this contains a loop which will iterate through the HashMap. Pay careful attention to how this is done. In particular, notice the initialization
What does this mean? First, we call keySet() on the HashMap, and this will return the Set of Strings that are associated with some value. Then we make an iterator that iterates through the Strings in the set of keys. Notice that this iterator is not iterating through the HashMap itself, but rather the set of keys for the HashMap.
Fill in this loop so that it prints out a word followed by its frequency, one word per line. Then compile and test.
Step 4: This information would be much more useful if it were in an order—that is, if it were sorted. Moreover, since we’re more likely to be interested in knowing what the most frequent words are, we would like it backwards sorted.
Again, for Ex19, we would get something like
Our strategy is that before writing all the word/frequency pairs, we will store them in an ArrayList of pairs and sort them. In order to have a ”ArrayList of pairs”, we need some class to represent those pairs.
Open the file SIPair.java. This is for a class of the same name which represents a pair containing a String and an int. (CSCI 243 students: notice that if we were writing in ML instead of Java, we wouldn’t have to do this!) The one thing that we need to do right now is fix the toString() method so that it returns the string followed by a colon and space, then the number, as in ”the: 74”.
Back in FileProcessor.java, uncomment the line that makes an ArrayList of these pairs. Also uncomment the line farther down that prints pairs. Now change the body of the loop you wrote in step 3 time so that instead of printing each String and int, you instead make a new SIPair object out of them and store that object in the ArrayList.
You should be able to test your program, taking advantage of the fact that both ArrayList and SIPair have a toString() method.
Step 5: Java comes with a canned Collections.sort() method in the class Collections, which we can use to order our list. This method will operate on an ArrayList as long as the type of element stored in the ArrayList implements an interface from the Java API called Comparable—this just means that it must implement a method compareTo() which allows us to compare two items of the same type, so that the sort() method knows what order to put them in.
So, go to SIPair.java and and fill-in the method public int compareTo(SIPair other).
This method is defined as returning zero if the two objects are equal, a negative value if the receiver (this) is “less than” the parameter other, and a positive value if the receiver is “greater than” other. Because we are sorting, “less than” means “belonging before.” We want higher frequency first; so we should get a negative value if the frequency of this is larger. If two words have the same frequency, we break the tie by putting them in alphabetical order based on the words themselves—which we can get by calling the compareTo() method on the class String.
You may find it helpful to write a simple main() method in SIPair to make a few pairs and see the results of comparing them.
Step 6: Now, back in FileProcessor.java, uncomment the line that calls Collections.sort() That’s all—now it’s sorted.
Make sure everything compiles before you move on. If you run the program now, you should see the list in the correct order.
Step 7: Finally, replace the statement that prints pairs with a loop that will write each element in the ArrayList on a line by itself. Make use of the toString() method you wrote in step 4.
Compile and test everything.
Make a typescript showing your program running on one of the sample files given.
Turn in your source files and typescript with a command like