The goal of this exercise is for you to practice using regular expressions.
The Java library supports regular expressions in the package java.util.regex,
with two classes in particular, Pattern and Matcher.
The Pattern class is for objects that represent regular
expressions themselves, and the opening documentation for
that class gives the details of how regular expressions are to be
formatted, what special characters and shorthands are supported,
etc.
It does not have a public constructor;
instead, make a new Pattern instance by using
the static factory method Pattern.compile(),
which takes a regular expression as a string.
The Matcher class stands for the result
of searching a string for occurrences of patterns that match
a regular expression.
The matcher() method in Pattern returns a Matcher object which can
be manipulated to obtain the results of the search.
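The relationship between the two classes can be sketched in a few lines. This is only an illustration, not part of the starter code: the class name PatternBasics and the digit pattern are made up for the example. It also shows the difference between find(), which looks for a match anywhere in the input, and matches(), which requires the whole input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not one of the starter classes.
public class PatternBasics {

    // True if the input contains at least one run of digits anywhere.
    public static boolean containsNumber(String input) {
        Pattern digits = Pattern.compile("\\d+");   // no public constructor; use the factory
        Matcher m = digits.matcher(input);          // a Matcher is tied to one input string
        return m.find();
    }

    // True only if the entire input is a run of digits.
    public static boolean isNumber(String input) {
        return Pattern.compile("\\d+").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsNumber("Oz has 14 books")); // true
        System.out.println(isNumber("Oz has 14 books"));       // false
        System.out.println(isNumber("14"));                    // true
    }
}
```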
Grab starter code from ~tvandrun/Public/cs345/regex.
This will have four classes in a package called reexamples,
which we will inspect together and then you'll have a chance to play with.
It also has two text files: baum.txt, which contains
the text of seven of L. Frank Baum's Oz books;
and shakespeare.txt, which contains the text of a number
of Shakespeare's plays (I don't remember how many).
reexamples.SpamPro takes a URL as a command-line
argument, grabs the HTML source of that website, and searches for
things that look like email addresses.
This shows the use of Pattern and Matcher objects,
especially the find() and group() methods of Matcher,
which can be used analogously to how one would use an iterator.
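The iterator-like use of find() and group() looks roughly like the sketch below. It is not the SpamPro source: the class name EmailFinder is hypothetical, the pattern is a deliberately loose guess at what "looks like an email address" means, and the text is passed in as a string rather than fetched from a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not the SpamPro starter code.
public class EmailFinder {

    // Loose, illustrative pattern: something@label.label(.label...)
    // Real email addresses are more varied than this.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    public static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        // find() advances to the next match, much like hasNext()/next() combined;
        // group() returns the text of the match just found.
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "Write to alice@example.com or bob.smith@mail.example.org today.";
        for (String address : findEmails(sample)) {
            System.out.println(address);
        }
    }
}
```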
reexamples.JGrep is a homemade, simplified
version of the grep command that uses
Java's regular expression syntax.
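The core of a grep-like tool is small enough to sketch here. This is not the JGrep source; the class name MiniGrep and its method are invented for illustration, and it assumes the file can be read as UTF-8 text (the default for Files.readAllLines).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical example class; not the JGrep starter code.
public class MiniGrep {

    // Returns the lines in which the regular expression matches somewhere.
    public static List<String> matchingLines(String regex, List<String> lines) {
        Pattern p = Pattern.compile(regex);  // compile once, reuse for every line
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java MiniGrep REGEX FILE");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[1]));
        for (String line : matchingLines(args[0], lines)) {
            System.out.println(line);
        }
    }
}
```

Remember to quote the regular expression on the command line so that the shell does not interpret characters such as * or \ before Java sees them.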
The webpage
http://online.wsj.com/mdc/public/page/2_3024-NYSE.html?mod=stocksdaily
contains a table of data about every stock on the New York Stock Exchange.
One filter that some investors use for screening stocks
is the price-earnings (PE) ratio, that is, the ratio of the stock's
price to the company's most recently reported earnings per share.
A stock with a low PE ratio is considered to be "cheap", i.e.,
you can purchase a larger amount of that company's earnings for
a smaller stock price.
Notice that if a company has no earnings, then its PE ratio is
infinity.
However, conventional wisdom is that you don't want a stock's PE ratio
to be too low either, since it suggests the market knows something
bad about the company that's making its share price drop.
The "sweet spot" for PE ratios is debated.
(And, of course, investment decisions need to be made
from other considerations and not just one isolated statistic.)
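As a quick arithmetic check on the definition, here is a small sketch. The class PeRatio and the sample numbers are invented for the example; treating zero earnings as positive infinity follows the remark above, and what to do with negative earnings is left open.

```java
// Hypothetical example class; the numbers in main are made up.
public class PeRatio {

    // Price divided by earnings per share; zero earnings gives infinity.
    // Negative earnings produce a negative ratio, which screeners
    // typically treat as "no meaningful PE" (not handled here).
    public static double peRatio(double price, double earningsPerShare) {
        if (earningsPerShare == 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return price / earningsPerShare;
    }

    public static void main(String[] args) {
        System.out.println(peRatio(50.0, 2.5)); // 20.0
        System.out.println(peRatio(50.0, 0.0)); // Infinity
    }
}
```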
The class reexamples.PERatioScreener takes two numbers
as command-line arguments and interprets them as lower and
upper bounds for acceptable PE ratios.
It then lists all the NYSE companies with PE ratios within the
given range, as listed on the website above.
This program shows how to use Matcher.group()
a little more effectively to extract part of a substring
that matches a regular expression.
Parenthesized parts of a regular expression are called groups,
and a call like group(1) extracts the substring
that matches the first parenthesized portion of the regular expression.
This program also exemplifies how to process formatted information such as HTML.
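Here is a minimal sketch of pulling fields out of HTML with numbered groups and then filtering by a range. It is not the PERatioScreener source: the class name GroupDemo is hypothetical, and the two-column row layout is an assumption made for the example, not the actual markup of the page mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not the PERatioScreener starter code.
public class GroupDemo {

    // Assumed row layout: <tr><td>NAME</td><td>PE</td></tr>
    // Group 1 captures the name, group 2 the number.
    // (?: ... ) is a non-capturing group, so it does not shift the numbering.
    private static final Pattern ROW = Pattern.compile(
        "<tr><td>([^<]+)</td><td>(\\d+(?:\\.\\d+)?)</td></tr>");

    public static List<String> namesInRange(String html, double low, double high) {
        List<String> names = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            String name = m.group(1);                    // first parenthesized part
            double pe = Double.parseDouble(m.group(2));  // second parenthesized part
            if (pe >= low && pe <= high) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<tr><td>Acme</td><td>12.5</td></tr>"
                    + "<tr><td>Globex</td><td>48</td></tr>"
                    + "<tr><td>Initech</td><td>7.25</td></tr>";
        System.out.println(namesInRange(html, 10, 20)); // [Acme]
    }
}
```

Note that group(0), or group() with no argument, is the whole match; the numbered groups count opening parentheses from left to right.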
The class reexamples.FormatChapter
takes a book of the Bible and a chapter number and
grabs the chapter from biblegateway.com.
(As given it's hardwired to use the ESV, but that's changed easily enough: just modify line 15 in the file.)
This program strips out the formatting (mainly HTML tags and other
metadata) and prints the chapter in a somewhat readable form
to the terminal.
In this example, we don't want merely to find matches for
regular expressions.
In some cases we want to replace them with something else.
We can do that with replaceAll() and similar methods
in class Matcher.
See in the code how certain HTML tags are replaced with different
kinds of whitespace, how special characters are detected and fixed,
and how formatting of verse numbers is fixed.
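The general shape of this kind of cleanup can be sketched as follows. This is not the FormatChapter source: the class name TagStripper is hypothetical, and the tags it handles (verse numbers in sup tags, p and br for breaks, one HTML entity) are assumptions chosen for illustration rather than the actual markup of the site.

```java
import java.util.regex.Pattern;

// Hypothetical example class; not the FormatChapter starter code.
public class TagStripper {

    // Assumed markup: verse numbers appear as <sup>N</sup>.
    private static final Pattern VERSE_NUMBER = Pattern.compile("<sup>(\\d+)</sup>");
    // Closing paragraph tags and line breaks become newlines.
    private static final Pattern LINE_BREAK = Pattern.compile("(?i)</p>|<br\\s*/?>");
    // Anything else that looks like a tag is dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    public static String clean(String html) {
        // $1 in the replacement refers to group 1 of the match.
        String text = VERSE_NUMBER.matcher(html).replaceAll("[$1] ");
        text = LINE_BREAK.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // A literal (non-regex) replacement is enough for a fixed entity.
        return text.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.print(clean("<p><sup>1</sup>In the beginning<br/>was &amp; is</p>"));
    }
}
```

The order of the replacements matters: the verse numbers and line breaks have to be rewritten before the catch-all tag pattern removes whatever is left.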
As time permits, do some of the following:
Modify FormatChapter to clean up some other
webpage and print its contents nicely to the terminal.
Modify PERatioScreener or SpamPro
to extract information from some other website.
Use JGrep (or regular old grep)
to search for some interesting information about word usage or
other patterns in the text files given or some webpages.