This goal of this project is to practice
using classes from the Java Collections framework
(ArrayList
, HashMap
, HashSet
,
and Iterator
) to
solve problems. (See the class page for a link to the handout from class.)
In this lab you will write a program that takes a piece of text and counts the frequency of the words in it. For example, in this sentence, the word times occurs 5 times, occurs also occurs 5 times, and occurs 3 times, and for, example, in, this, sentence, the, word, each and also each occurs 2 times.
First think about how you would do this by hand if you were given a page of text. You would go through the text while keeping a list of all the words that occur. If the next word you look at is not yet on your list (that is, it's the first time you see the word in the text), you add it to the list and put a mark by it. If it is already on your list, find the word and add another mark by it. That way, you tally all the times each word occurs. We will be using a similar strategy for our program.
times | |
occurs | |
and | |
for | |
... |
Your program will read the text on which to do the word counting from a file. You'll be provided the component that reads in from a file. Your program should be used like this:
java FileProcessor filename
This would read in a piece of text stored in the file filename
and print the results to the screen.
As usual, clone the repository from the class directory:
hg clone /cslab/class/csci235/labs/lab12 cd lab12
You should have two .java
files to work on, one
.class
file, and several
files that you can use as input for testing.
Hint: It will probably be a good idea to commit your work to Mercurial when you finish each of the steps below.
This part will consist of a sequence of smaller steps. Read them carefully, think about what they are doing, and program them carefully.
Step 1:
First, become familiar with the file FileProcessor.java
.
It has a main()
method, which you will complete.
It calls FileAux.filesToWords()
from the
provided class file, which takes the name of a file, opens and reads
that file, and returns an ArrayList
of the words it
contains (ignoring punctuation, spaces, etc.).
That is provided for you.
In the main()
method right now (before
you make any changes to it or uncomment anything),
we simply print the ArrayList
.
We saw in class that ArrayList
s
have a useful toString()
method,
but we didn't see it in action.
Compile FileProcessor
as it is, and
look at the result.
Step 2:
Now, delete the line that prints out the ArrayList
;
we don't need it anymore.
Now we get to do the heart of the tallying.
To implement the word-tally strategy mentioned earlier, you will use
a class called HashMap
.
Like ArrayList
and HashSet
,
HashMap
is a container class, that is, something
used to store a collection of data.
HashMap
acts like a table, where we can
look up a value based on a key.
Every HashMap
needs two types of data:
one type for the keys and another type for the values.
(Mathematically, HashMap
is somewhat like
a function, where the keys are the domain
and the values associated with the keys are the codomain.)
The important methods of HashMap
are put()
, which takes a key and a value and
associates that key with the value (this may overwrite
a previous association with that key);
get()
which takes a key and returns the
value associated with that key;
containsKey()
, which determines whether or not
a potential key is currently associated with a value;
and keySet()
, which returns a Set
(the supertype of HashSet
) which
contains all the keys.
Think about the table you would use to do a word tally by hand.
How can you use a HashMap
to do this?
As you can see in the program, we make a
HashSet
that associates String
s
with int
s.
For every word in the file,
we want to associate it with a number, the number of
times we've seen it (so far).
Step 2 involves a loop which iterates through the
words in the ArrayList
.
Fill-in the loop for this step so that it
adjusts the HashMap
to account for the current word.
(Think about: what if this word has occurred already? what
if this word is occurring for the first time?)
Make the word lowercase; we do not want to consider
capitalization in our word counting.
Don't forget that it.next()
will return
each word once;
really, it does two things: it returns the next
word and moves the iterator ahead one position
in the ArrayList
.
If you call it.next()
two times
in a row, you will get different results.
HashMap
happens to have a nice
toString()
method, so we can see what it looks like.
The program is set up so that it will print
the HashMap
after you have populated it.
Compile and test.
Step 3:
Delete the line that prints the HashMap
.
We would like to print the results ourselves,
one word per line.
Uncomment the section for this step.
Notice that this contains a loop which will
iterate through the HashMap
.
Pay careful attention to how this is done.
In particular, notice the initialization
Iteratorit = tally.keySet().iterator();
What does this mean?
First, we call keySet()
on the HashMap
,
and this will return the Set
of Strings
that are associated with some value.
Then we make an iterator that iterates through
the String
s in the set of keys.
Notice that this iterator is not iterating through
the HashMap
itself, but rather
the set of keys for the HashMap
.
Fill in this loop so that it prints out a word followed by its frequency, one word per line. Then compile and test.
Step 4:
This information would be much more useful if it were in an order—that is,
if it were sorted.
Moreover, since we're more likely to be interested in knowing what the most
frequent words are, we would like it backwards sorted.
Again, for Ex19
, something like
the 74 to 32 and 31 of 18 lord 17 you 15 people 15 moses 13 mountain 12 them 9 ...
Our strategy is that before writing all the word/frequency pairs, we will store them in an ArrayList of pairs and sort them. In order to have a "ArrayList of pairs", we need some class to represent those pairs.
Open the file SIPair.java
.
This is for a class of the same name which represents
a pair containing a String
and an
int
.
(CSCI 243 students: notice that if we were writing in ML
instead of Java, we wouldn't have to do this!)
The one thing that we need to do right now is fix the
toString()
method so that it returns the string followed by
a colon and space, then the number, as in "the: 74
".
Back in FileProcessor.java
, uncomment
the line that makes an ArrayList
of these
pairs. Also uncomment the line farther down that prints pairs
.
Now change the body of the loop you wrote in step 3
time so that instead of printing
each String
and int
, you instead
make a new SIPair
object out of them and
store that object in the ArrayList
.
You should be able to test your program, taking advantage of the
fact that both ArrayList
and SIPair
have a
toString()
method.
Step 5:
Java comes with a canned Collections.sort()
method
in the class Collections,
which we can use to order our list.
This method will operate on an ArrayList
as long as the type of element stored in the
ArrayList
implements an
interface from the Java API called
Comparable
—this
just means that it must implement a method
compareTo()
which allows us to compare two
items of the same type, so that the sort()
method knows what order to put them in.
So, go to SIPair.java
and
and fill-in the method
public int compareTo(SIPair other)
.
Refer to
the Java API description of the method
to see how it should work—what it should return,
based on the the things it is comparing.
Remember, we want to sort so as to put the highest frequency first.
If two words have the same frequency, we break the tie
by putting them in alphabetical order based on the words
themselves.
You can do this by calling the compareTo()
method of the class String
.
You may find it helpful to write a simple main()
method in SIPair
to make a few pairs and see the results
of comparing them.
Step 6:
Now, back in FileProcessor.java
, uncomment
the line that calls Collections.sort()
That's all—now it's sorted.
Make sure everything compiles before you move on. If you run the program now, you should see the list in the correct order.
Step 7:
Finally, replace the statement that prints pairs
with a
loop that will write each element in the ArrayList
on a
line by itself.
Make use of the toString()
method you wrote in step 4.
Compile and test everything.
Make a typescript showing your program running on one of the sample files given.
Turn in your source files and typescript with a command like
/cslab/class/csci235/bin/handin lab12 *.java typescript