The goal of this project is to practice using classes from the Java Collections framework (Vector, HashMap, HashSet, and Iterator) to solve problems.
In this lab you will write a program that takes a piece of text and counts the frequency of the words in it. For example, in this sentence, the word times occurs 5 times, occurs also occurs 5 times, and occurs 3 times, and for, example, in, this, sentence, the, word, each and also each occurs 2 times.
First think about how you would do this by hand if you were given a page of text. You would go through the text while keeping a list of all the words that occur. If the next word you look at is not yet on your list (that is, it's the first time you see the word in the text), you add it to the list and put a mark by it. If it is already on your list, you find the word and add another mark by it. That way, you tally all the times each word occurs. We will be using a similar strategy for our program.
times | |
occurs | |
and | |
for | |
... |
Your program will read the text on which to do the word counting from a file. I will provide the component that reads in from a file. Your program should be used like this:
> java WordCounter filename
This would read in a piece of text stored in the file filename and print the results to the screen.
As usual, make a new directory for this project and cd into it.
Copy the following files from the course directory. The first is the starter for your lab. The other four are examples to test it on.
> cp /homeemp/tvandrun/pub/235/FileProcessor.java .
> cp /homeemp/tvandrun/pub/235/Ex19 .
> cp /homeemp/tvandrun/pub/235/Ps132 .
> cp /homeemp/tvandrun/pub/235/Rom12 .
> cp /homeemp/tvandrun/pub/235/Rev22 .
This part will consist of a sequence of smaller steps. Read them carefully, think about what they are doing, and program them carefully.
Step 1:
The class FileProcessor contains a static method toVector which takes the name of a file and returns a Vector containing the words in the file (so, a very long Vector).
Open a new file (something like WordCounter.java), in which you will write your program.
Write the opening documentation, the import statement, and the skeleton of the main method. Then fill in the main method so that it reads the name of the text file from the command line, uses FileProcessor.toVector() to convert the file into a Vector of words, and iterates (using an iterator) over the Vector, printing each word to the screen.
Compile and test.
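As a rough sketch of the iteration in this step, here is one way to walk a Vector with an explicit Iterator. It assumes the Vector holds String elements and uses a small hand-built Vector as a stand-in for whatever FileProcessor.toVector() returns; the names IterateDemo and wordsOnePerLine are made up for illustration and are not part of the starter code.

```java
import java.util.Iterator;
import java.util.Vector;

class IterateDemo {
    // Walks the Vector with an explicit Iterator and collects each word on its own line.
    static String wordsOnePerLine(Vector<String> words) {
        StringBuilder out = new StringBuilder();
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            out.append(it.next()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in data (assumption): in the lab, the Vector would come from
        // FileProcessor.toVector() applied to the file named in args[0].
        Vector<String> words = new Vector<String>();
        words.add("In");
        words.add("the");
        words.add("beginning");
        System.out.print(wordsOnePerLine(words));
    }
}
```

In your own program you could print each word directly inside the loop instead of building a string; the helper returns a string here only so the behavior is easy to check.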
Step 2:
To implement the word-tally strategy mentioned earlier, you will use a HashMap. Think about the table you would use to do a word tally by hand. How can you use a HashMap to do this? What types of values would it associate? Create an appropriate HashMap and store it in a variable.
Given a word, how do we update our tally? Change your loop so that instead of printing the word to the screen, it adjusts the HashMap to account for this word. (Think about: what if this word has occurred already? What if this word is occurring for the first time?) Make the word lowercase before looking it up; we do not want to consider capitalization in our word counting.
Compile and test.
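One possible shape for the tally update, offered as a sketch rather than the required solution: it assumes the map associates each lowercase word (a String) with its count so far (an Integer). The class name TallyDemo and the method tally are hypothetical, and the for-each loop is used for brevity (it relies on the Vector's iterator behind the scenes).

```java
import java.util.HashMap;
import java.util.Vector;

class TallyDemo {
    // Builds a word -> count table, ignoring capitalization.
    static HashMap<String, Integer> tally(Vector<String> words) {
        HashMap<String, Integer> counts = new HashMap<String, Integer>();
        for (String raw : words) {
            String word = raw.toLowerCase();
            if (counts.containsKey(word)) {
                // Seen before: add another mark to the existing tally.
                counts.put(word, counts.get(word) + 1);
            } else {
                // First occurrence: start the tally at one.
                counts.put(word, 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in data (assumption), not taken from any of the lab's example files.
        Vector<String> words = new Vector<String>();
        words.add("The");
        words.add("the");
        words.add("Lord");
        System.out.println(tally(words).get("the"));
    }
}
```

The two branches correspond to the two cases in the hand tally: a word already on the list gets another mark, and a new word is added with its first mark.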
Step 3:
Once our tally is done, we want to print our tally table to the screen. Write a second loop that iterates over the keys of the HashMap and prints out a word followed by its frequency, one word per line.
Compile and test.
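A minimal sketch of looping over the keys of a HashMap, assuming the map goes from String words to Integer counts. The names PrintTallyDemo and formatTally are invented for this illustration.

```java
import java.util.HashMap;
import java.util.Iterator;

class PrintTallyDemo {
    // Produces one "word frequency" line per key in the map.
    static String formatTally(HashMap<String, Integer> counts) {
        StringBuilder out = new StringBuilder();
        Iterator<String> keys = counts.keySet().iterator();
        while (keys.hasNext()) {
            String word = keys.next();
            out.append(word).append(' ').append(counts.get(word)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in counts (assumption) so the sketch runs on its own.
        HashMap<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("the", 74);
        counts.put("to", 32);
        System.out.print(formatTally(counts));
    }
}
```

Note that a HashMap makes no promise about the order in which its keys come back, so the lines may appear in any order.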
Step 4:
This information would be much more useful if it were in some order, that is, if it were sorted. Moreover, since we are more likely to be interested in the most frequent words, we would like it sorted in reverse, from the highest frequency to the lowest. For Ex19, the output would begin something like
the 74
to 32
and 31
of 18
lord 17
you 15
people 15
moses 13
mountain 12
them 9
...
Our strategy is that before writing all the word/frequency pairs, we will store them in a Vector of pairs and sort them. In order to have a "Vector of pairs", we need some class to represent those pairs.
Write a class (in a separate file) to represent/contain a word/frequency pair. What instance variables would you need? Make sure you write a usable constructor. Do not worry yet about other methods.
Step 5:
Make a new variable to store a Vector of the class you made in step 4, and initialize the variable by assigning to it a new instance of Vector. Then modify the for loop from step 3 so that instead of printing the current word and frequency to the screen, it puts a new instance of the class you wrote in step 4 into the Vector. (Do not delete the old line that prints to the screen; just comment it out, so you can refer to it later.)
Step 6:
Java comes with a canned sort method, Collections.sort(), in the class Collections, which we would like to use for this. To do this, you need to make your class from step 4 a Comparable. Comparable is an interface that comes with Java, and it requires that a class implement a method called compareTo, which takes another instance of the same class and returns an integer to signal which instance comes first. This way, a sort method knows in what order to put them.
So, update the class so that it explicitly implements Comparable<Pair> (with the name of your class in place of Pair) and write a method
public int compareTo([name of your class] other)
Refer to the Java API description of the method to see how it is supposed to work.
Remember, we want to sort so as to put the highest frequency first.
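To illustrate how compareTo can put the highest frequency first, here is a sketch using a hypothetical pair class named WordFreq (your class may well have a different name and different fields). A negative result means "this comes before other", so comparing the other frequency against this one reverses the usual ascending order.

```java
class WordFreq implements Comparable<WordFreq> {
    private final String word;
    private final int frequency;

    WordFreq(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }

    // Returns a negative number when this pair has the HIGHER frequency,
    // so that sorting places the most frequent words first.
    public int compareTo(WordFreq other) {
        return Integer.compare(other.frequency, this.frequency);
    }

    public static void main(String[] args) {
        WordFreq the = new WordFreq("the", 74);
        WordFreq to = new WordFreq("to", 32);
        // Negative: "the" should come before "to" in the sorted output.
        System.out.println(the.compareTo(to) < 0);
    }
}
```

Integer.compare is used instead of subtracting the two frequencies because subtraction can overflow for extreme values; for word counts either would work in practice.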
Step 7:
Now, after the for loop, simply call Collections.sort(), passing it your Vector. That's all; now it's sorted.
Step 8:
Soon we will print our results to the screen. But first, we need to have a way to print a pair (whether to the screen or to a file). Add a toString() method to your class which will return a representation of your data that looks like, for example, the 74 (the word, a space, then its frequency).
Step 9:
Finally, write a loop that will write each element in the Vector to the screen. Make use of the toString() method you just wrote.
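Putting steps 7 through 9 together, the end of the program might look roughly like the sketch below. The pair class here is a stand-in named CountedWord (an invented name), repeated in full so the example runs on its own; the three sample counts are borrowed from the Ex19 output shown in step 4.

```java
import java.util.Collections;
import java.util.Vector;

class SortAndPrintDemo {
    // Sorts the pairs in place (highest frequency first) and
    // returns one line per pair, built from each pair's toString().
    static String sortedReport(Vector<CountedWord> pairs) {
        Collections.sort(pairs);
        StringBuilder out = new StringBuilder();
        for (CountedWord pair : pairs) {
            out.append(pair).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Vector<CountedWord> pairs = new Vector<CountedWord>();
        pairs.add(new CountedWord("to", 32));
        pairs.add(new CountedWord("the", 74));
        pairs.add(new CountedWord("and", 31));
        System.out.print(sortedReport(pairs));
    }
}

// Hypothetical word/frequency pair; your own class from step 4 plays this role.
class CountedWord implements Comparable<CountedWord> {
    private final String word;
    private final int frequency;

    CountedWord(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }

    // Higher frequency sorts first.
    public int compareTo(CountedWord other) {
        return Integer.compare(other.frequency, this.frequency);
    }

    @Override
    public String toString() {
        return word + " " + frequency;
    }
}
```

Collections.sort() accepts a Vector because Vector implements the List interface, and it rearranges the Vector itself rather than returning a new one.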
Copy the files you modified to /homestu/jmichalk/labTA/turnInProjects/yourFirstName,yourLastInitial