This goal of this project is to practice
using classes from the Java Collections framework
(ArrayList
, HashMap
, HashSet
,
and Iterator
) to
solve problems.
In this lab you will write a program that takes a piece of text and counts the frequency of the words in it. For example, in this sentence, the word times occurs 5 times, occurs also occurs 5 times, and occurs 3 times, and for, example, in, this, sentence, the, word, each and also each occurs 2 times.
First think about how you would do this by hand if you were given a page of text. You would go through the text while keeping a list of all the words that occur. If the next word you look at is not yet on your list (that is, it's the first time you see the word in the text), you add it to the list and put a mark by it. If is already on your list, find the word and add another mark by it. That way, you tally all the times each word occurs. We will be using a similar strategy for our program.
times | |
occurs | |
and | |
for | |
... |
Your program will read the text on which to do the word counting from a file. I will provide the component that reads in from a file. Your program should be used like this:
java WordCounter filename
This would read in a piece of text stored in the file filename
and print the results to the screen.
As usual, make a new directory for this project and cd into it.
Copy the following file from the course directory. The first is the starter for your lab. The other four are examples to test it on.
cp /homes/tvandrun/Public/cs235/lab14/* .
This part will consist of a sequence of smaller steps. Read the carefully, think about what they are doing, and program them carefully.
Step 1:
First, become familiar with the file FileProcessor.java
.
It has a main()
method, which you will complete.
It also has a method processFile()
which
is written for you.
processFile()
takes a String
,
which it interprets as the name of a file.
It read that file and ignores punctuation, spaces, etc,
and produces an ArrayList
of the words.
In the main()
method right now (before
you make any changes to it or uncomment anything),
we simply print the ArrayList
.
We mentioned in class yesterday that ArrayList
s
have a useful toString()
method,
but we didn't see it in action.
Compile FileProcessor
as it is, and
look at the result.
Step 2:
Now, delete the line that prints out the ArrayList
;
we don't need it anymore.
Now we wen to do the heart of the tallying.
To implement the word-tally strategy mentioned earlier, you will use
a class called HashMap
.
Like ArrayList
and HashSet
,
HashMap
is a container class, that is, something
used to store a collection of data.
HashMap
acts like a table, where we can
look up a value based on a key.
Every HashMap
needs two types of data:
one type for the keys and another type for the values.
(Mathematically, HashMap
is somewhat like
a function, where the keys are the domain
and the values associated with the keys are the codomain.)
The important methods of HashMap
are put()
, which takes a key and a value and
associates that key with the value (this may overwrite
a previous association with that key);
get()
which takes a key and returns the
value associated with that key;
containsKey()
, which determines whether or not
a potential key is currently associated with a value;
and keySet()
, which returns a Set
(the supertype of HashSet
) which
contains all the keys.
Think about the table you would use to do a word tally by hand.
How can you use a HashMap
to do this?
As you can see in the program, we make a
HashSet
that associates String
s
with int
s.
For every word in the file,
we want to associate it with a number, the number of
times we've seen it (so far).
Step 2 involves a loop which iterates through the
words in the ArrayList
.
Fill-in the loop for this step so that it
adjusts the HashMap
to account for the current word.
(Think about: what if this word has occurred already? what
if this word is occurring for the first time?)
Make the word lowercase; we do not want to consider
capitalization in our word counting.
Don't forget that it.next()
will return
each word once;
really, it does two things: it returns the next
word and moves the iterator ahead one position
in the ArrayList
.
If you call it.next()
two times
in a row, you will get different results.
HashMap
happens to have a nice
toString()
method, so we can see what it looks like.
The program is set up so that it will print
the HashMap
after you have populated it.
Compile and test.
Step 3:
Delete the line that prints the HashMap
.
We would like to print the results ourselves,
one word per line.
Uncomment the section for this step.
Notice that this contains a loop which will
iterate through the HashMap
.
Pay careful attention to how this is done.
In particular, notice the initialization
Iteratorit = tally.keySet().iterator();
What does this mean?
First, we call keySet()
on the HashMap
,
and this will return the Set
of Strings
that are associated with some value.
Then we make an iterator that iterates through
the String
s in the set of keys.
Notice that this iterator is not iterating through
the HashMap
itself, but rather
the set of keys for the HashMap
.
Fill in this loop so that it prints out a word followed by its frequency, one word per line. Then compile and test.
Step 4:
This information would be much more useful if it were in an order---that is,
if it were sorted.
Moreover, since we're more likely to be interested in knowing what the most
frequent words are, we would like it backwards sorted.
Again, for Ex19
, something like
the 74 to 32 and 31 of 18 lord 17 you 15 people 15 moses 13 mountain 12 them 9 ...
Our strategy is that before writing all the word/frequency pairs, we will store them in a ArrayList of pairs and sort them. In order to have a "ArrayList of pairs", we need some class to represent those pairs.
Open the file SIPair.java
.
This is for a class of the same name which represents
a pair containing a String
and an
int
.
(CSCI 243 students: notice that if we were writing in ML
instead of Java, we wouldn't have to do this!)
You will be completing this class in the next few steps,
but you don't need to do anything with it now.
Instead, back in FileProcessor.java
, uncomment
the line that makes an ArrayList
of these
pairs, and change the loop you wrote last
time so that instead of printing
each String
and int
, you instead
make a new SIPair
object out of them and
store that object in the ArrayList
.
There's not much to test in the step, but make sure your code compiles.
Step 5:
Java comes with a canned Collections.sort()
method
in the class Collections,
which we would like to use for this.
This method will operate on an ArrayList
as long as the type of element stored in the
ArrayList
implements an
interface from the Java API called
Comparable
--- this
just means that it must implement a method
compareTo()
which allows us to compare two
items of the same type, so that the sort()
method knows what order to put them in.
SSo, go to SIPair.java
and
and fill-in the method
public int compareTo(SIPair other)
.
Refer to
the Java API description of the method
to see how it should work--- what it should return,
based on the the things it is comparing.
Remember, we want to sort so as to put the highest frequency first.
If two words have the same frequency, we break the tie
by putting them in alphabetical order based on the words
themselves.
You can do this by calling the compareTo()
method of the class String
.
Step 6:
Now, back in FileProcessor.java
, uncomment
the line that calls Collections.sort()
That's all---now it's sorted.
Make sure everything compiles before you move on.
Step 7:
Soon we will print our results to the screen.
But first, we need to have a way to print a pair (whether to the screen
or to a file).
Fill-in the toString()
method to your class which will
return a representation of your data that looks like, for example,
the 74
.
Step 8:
Finally, write a loop that will write each element in the
ArrayList
to the
screen.
Make use of the toString()
method you just wrote.
Compile and test everything.
Make a typescript showing your program and its running on one of the sample files given.