Practice: Regular expressions

The goal of this exercise is for you to practice using regular expressions.

1. Introduction

The Java library supports regular expressions in the package java.util.regex, with two classes in particular, Pattern and Matcher. A Pattern object represents a regular expression itself, and the opening documentation for that class gives the details of how regular expressions are to be formatted, what special characters and shorthands are supported, etc. Pattern does not have a public constructor; instead, make a new Pattern instance by using the static factory method Pattern.compile(), which takes a regular expression as a string.

The Matcher class represents the result of searching a string for matches of a regular expression. The matcher() method in Pattern returns a Matcher object, which can be manipulated to obtain the results of the search.
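
As a warm-up before looking at the starter code, here is a minimal sketch of the compile-then-match workflow. The class name RegexBasics and its two helper methods are made up for illustration and are not part of the starter code. Note the difference between matches(), which requires the whole string to fit the pattern, and find(), which looks for a match anywhere in the string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of the reexamples package.
public class RegexBasics {
    // True if the entire input is one or more digits.
    public static boolean isAllDigits(String input) {
        // In a Java string literal the backslash must itself be escaped.
        Pattern digits = Pattern.compile("\\d+");
        Matcher m = digits.matcher(input);
        return m.matches();
    }

    // True if the input contains a run of digits anywhere.
    public static boolean containsDigits(String input) {
        return Pattern.compile("\\d+").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));      // true
        System.out.println(isAllDigits("abc 123"));    // false
        System.out.println(containsDigits("abc 123")); // true
    }
}
```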

Grab starter code from ~tvandrun/Public/cs345/regex. This will have four classes in a package called reexamples, which we will inspect together and then you'll have a chance to play with. It also has two text files: baum.txt, which contains the text of seven of L. Frank Baum's Oz books; and shakespeare.txt, which contains the text of a bunch of Shakespeare's plays (don't remember how many).

2. SpamPro

reexamples.SpamPro takes a URL as a commandline argument, grabs the HTML source of that website, and searches for things that look like email addresses. This shows the use of Pattern and Matcher objects, especially the find() and group() methods of Matcher, which can be used analogously to how one would use an iterator.
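
The find()/group() loop can be sketched on its own, without the networking part. The following is not SpamPro's actual code: the class name, the method name, and especially the email pattern are assumptions chosen for illustration (the pattern is a deliberate simplification and will miss or over-match some real addresses). Each call to find() advances to the next match, much like next() on an iterator, and group() returns the text of the current match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not the SpamPro from the starter code.
public class EmailFinderSketch {
    // Simplified, assumed pattern: local part, '@', domain, dot, top-level domain.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        // find() returns false once there are no more matches.
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "Write to alice@example.com or "
                    + "<a href=\"mailto:bob@example.org\">Bob</a>.";
        for (String address : findEmails(html)) {
            System.out.println(address);
        }
    }
}
```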

3. JGrep

reexamples.JGrep is a homemade, simplified version of the grep command that uses Java's regular expression syntax.
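
The core idea of a grep-like tool fits in a few lines. This sketch is not the JGrep in the starter code; the class and method names are hypothetical, and it works on a list of lines already in memory rather than reading files. A line is kept if the pattern matches anywhere in it, which is what find() checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical example class; not the JGrep from the starter code.
public class GrepSketch {
    // Returns the lines that contain at least one match for the regex.
    public static List<String> matchingLines(String regex, List<String> lines) {
        // Compile once and reuse the Pattern for every line.
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The wizard", "Dorothy", "a witch");
        for (String hit : matchingLines("wi[zt]", lines)) {
            System.out.println(hit);
        }
    }
}
```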

4. PERatioScreener

The webpage http://online.wsj.com/mdc/public/page/2_3024-NYSE.html?mod=stocksdaily contains a table of data about every stock on the New York Stock Exchange. One filter that some investors use for screening stocks is the price-earnings (PE) ratio, that is, the ratio of the stock's price to the company's earnings per share, taken from its most recent earnings statement. A stock with a low PE ratio is considered to be "cheap", i.e., you can purchase a larger amount of that company's earnings for a smaller stock price. Notice that if a company has no earnings, then its PE ratio is infinity. However, conventional wisdom is that you don't want a stock's PE ratio to be too low either, since it suggests the market knows something bad about the company that's making its share price drop. The "sweet spot" for PE ratios is debated. (And, of course, investment decisions need to be made from other considerations and not just one isolated statistic.)

The class reexamples.PERatioScreener takes two numbers as commandline arguments and interprets them as lower and upper bounds for acceptable PE ratios. It then lists all the NYSE companies with PE ratios within the given range, as listed on the website above.

This program shows how to use Matcher.group() a little more effectively to extract part of a substring that matches a regular expression. Parenthesized parts of a regular expression are called groups, and a call like group(1) extracts the substring that matches the first parenthesized portion of the regular expression.

This program also exemplifies how to process formatted information such as in HTML.
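
To see groups in isolation, here is a small sketch that pulls a name and a number out of table rows and filters by range. It is not the PERatioScreener from the starter code: the class name, the method, and the HTML layout (a name cell immediately followed by a PE cell) are invented for illustration and do not reflect the actual markup of the WSJ page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the row format below is an assumption.
public class GroupSketch {
    // Group 1: company name. Group 2: PE ratio as a decimal number.
    private static final Pattern ROW = Pattern.compile(
        "<td>([^<]+)</td>\\s*<td>([0-9]+(?:\\.[0-9]+)?)</td>");

    // Returns names whose PE ratio lies in [low, high].
    public static List<String> screen(String html, double low, double high) {
        List<String> names = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            String name = m.group(1);                    // first parenthesized part
            double pe = Double.parseDouble(m.group(2));  // second parenthesized part
            if (pe >= low && pe <= high) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<tr><td>ACME</td><td>12.5</td></tr>"
                    + "<tr><td>Globex</td><td>48</td></tr>"
                    + "<tr><td>Initech</td><td>9.75</td></tr>";
        System.out.println(screen(html, 9, 15)); // [ACME, Initech]
    }
}
```

The (?: ... ) in the pattern is a non-capturing group: it groups the optional fractional part without taking up a group number, so the PE ratio stays at group(2).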

5. FormatChapter

The class reexamples.FormatChapter takes a book of the Bible and a chapter number and grabs the chapter from biblegateway.com. (As given it's hardwired to use the ESV, but that's changed easily enough---just modify line 15 in the file.)

This program strips out the formatting (mainly HTML tags and other metadata) and prints the chapter in a somewhat readable form to the terminal. In this example, we don't want merely to find matches for regular expressions. In some cases we want to replace them with something else. We can do that with replaceAll() and similar methods in class Matcher. See in the code how certain HTML tags are replaced with different kinds of whitespace, how special characters are detected and fixed, and how formatting of verse numbers is fixed.
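
The replacement idea can be tried on a small string first. This sketch is not FormatChapter's code; the class name, the patterns, and the sample markup (verse numbers in sup tags) are assumptions made up for illustration. It shows Matcher.replaceAll() with plain replacement text and with $1, which stands for whatever group 1 matched.

```java
import java.util.regex.Pattern;

// Hypothetical example class; the markup conventions below are assumed.
public class ReplaceSketch {
    private static final Pattern VERSE = Pattern.compile("<sup>(\\d+)</sup>");
    private static final Pattern BREAK =
        Pattern.compile("<br\\s*/?>|</p>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern AMP = Pattern.compile("&amp;");

    public static String clean(String html) {
        // Rewrite verse numbers first, while their tags are still present.
        String text = VERSE.matcher(html).replaceAll("[$1] ");
        // Turn line-ending tags into newlines.
        text = BREAK.matcher(text).replaceAll("\n");
        // Drop every remaining tag.
        text = TAG.matcher(text).replaceAll("");
        // Decode one common character entity.
        text = AMP.matcher(text).replaceAll("&");
        return text;
    }

    public static void main(String[] args) {
        System.out.print(clean("<p><sup>3</sup>Salt &amp; light<br/>next line</p>"));
    }
}
```

The order of the steps matters: if the generic tag-stripping step ran first, the verse-number pattern would have nothing left to match.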

6. Your turn

As time permits, do some of the following:

Adapt FormatChapter to clean up some other webpage and print its contents nicely to the terminal.
Adapt PERatioScreener or SpamPro to extract information from some other website.
Use JGrep (or regular old grep) to search for some interesting information about word usage or other patterns in the text files given or some webpages.

Thomas VanDrunen

Last modified: Tue Apr 30 16:14:04 CDT 2019