Project: Tries

1. Introduction

The goal of this project is to understand how to use tries to implement a set (modifying this to implement a map or bag would be pretty simple).

My intention was that this project be lighter-weight, an easier one to finish up the semester with. It easily could have been a nasty one, however. But I'm going to put the hard parts into a series of "bonus" exercises, which you can try for extra credit. [Do not attempt to do the bonus exercises as a replacement for earlier exercises. That is a very bad strategy and violates the intention of "extra credit." Extra credit is for the occupation of students who have already done the required stuff, not to compensate for missing the required stuff.]

This corresponds to material in Section 7.2, but that section doesn't (yet) have a project description in it.

2. Set up

Find the project code for this and the next project at /homes/tvandrun/Public/cs345/trie. As usual, the code is organized into the packages adt, impl and test, but adt and impl each have only one file.

Familiarize yourself with the given code in impl/TrieSet. A trie is a tree whose nodes each have a letter and as many children (potentially) as there are letters. If a string is in the set that this trie represents, then there is a path in the tree from the root to a node following the letters in that string. But the converse isn't true: just because a path in the tree exists following the letters of a certain string doesn't mean that string is in the set; it might merely be a prefix of a string that is in the set. Thus each node needs a tag indicating whether it is terminal, that is, whether it is the end of a path indicating a string in the set. Note that not all terminal nodes are leaves, though all leaves ought to be terminal nodes; otherwise they're wasting space being in the tree.
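To make the difference between "a path exists" and "the string is in the set" concrete, here is a minimal sketch. It is not the given TrieSet code: the class SketchTrie, its Node class, the use of a HashMap for children, and the helper hasPath() are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; the project's own TrieSet/TrieNode may differ.
class SketchTrie {
    // One node per letter on a path; the letter is the key in the parent's map.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true only if a string in the set ends at this node
    }

    final Node root = new Node();

    // Walk the letters of item, creating missing nodes, then tag the last node.
    void add(String item) {
        Node current = root;
        for (char c : item.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.terminal = true;
    }

    // True if a path spelling s exists from the root, terminal or not.
    boolean hasPath(String s) {
        return walk(s) != null;
    }

    // True only if the path exists AND ends on a terminal node.
    boolean contains(String s) {
        Node end = walk(s);
        return end != null && end.terminal;
    }

    // Follows the letters of s from the root; returns null on a missing child.
    private Node walk(String s) {
        Node current = root;
        for (char c : s.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

After adding only "oatmeal", hasPath("oat") is true but contains("oat") is false, because the node at the end of "oat" is not tagged as terminal.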

3. Implementing basic functionality

The five operations you need to implement are add, contains, isEmpty, size, and remove. The first two will be done iteratively in the TrieSet class itself; the others will be done recursively in the TrieNode class.

Even though contains() can't be used until the add() method is implemented, you may want to do this one first simply because it is the easiest. This method will have a loop that simultaneously navigates a branch of the trie and steps through the given string (item). For each character in the string, we move to the appropriate child node. If we ever hit a null child, or if we come to the end of the string and are at a non-terminal node, then the given item is not in the set. If we end on a terminal node, on the other hand, then the item is in the set.
The add() method, like contains(), will also navigate the tree as it steps through the characters in the given item. However, it will make new nodes as it goes whenever it would enter a branch that doesn't (yet) exist. The last node it makes (or visits, if the node is already there) needs to be set as terminal.
The isEmpty() method can be done using size(), but there is a simpler way that involves looking only at the root node.
The size() method in TrieNode is a simple depth-first traversal through the tree. We simply count the number of terminal nodes. But make sure you understand what the size() method means for each node: n.size() means, count the number of terminal nodes in the subtrie rooted at n, but each path in the subtrie represents not a full string in the whole trie (unless n is the root), but a suffix of a string in the whole trie.
The remove() method is the hardest. It's important to understand what the recursive remove() message to an individual node means. If we want to remove "oatmeal" from the (sub)trie rooted at n, then that means we want to remove "atmeal" from the subtrie rooted at n's o child. So as we navigate the tree, we remove a character from the beginning of the item string for each recursive call.

The base case is when we're left with an empty string for the item to be removed (or hit a null node, in which case the item we're trying to remove isn't in the trie to begin with). In that base case, it won't do simply to remove the node: If both "oat" and "oatmeal" are in the set and we remove "oat" from the set, then removing the node that "oat" ends at from the trie would also remove "oatmeal", which is not what we want. Thus instead we should mark that node as not being terminal anymore.

But if that node is a leaf, then we indeed do want to remove it, to minimize the memory used. By the same token, we should remove every node in the path that we took there if those nodes no longer have any terminal descendant. Thus on the way back up, remove() should remove nodes that represent "dead" branches.

The remove() method returns the node that should take the place of the one that it is called on. Basically that means it should return null if that branch is now dead, or the node itself (this) if it is still live.
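The recursive size() and remove() described above can be sketched roughly as follows. This is only one possible shape, under stated assumptions: the class RemoveSketchNode, its HashMap of children, and the helpers add() and nodeCount() are hypothetical and are not taken from the given TrieNode. It also assumes the trie has no dead branches before remove() is called.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical node class; the given TrieNode may store children differently.
class RemoveSketchNode {
    final Map<Character, RemoveSketchNode> children = new HashMap<>();
    boolean terminal;

    // Helper used only to build examples: adds suffix below this node.
    void add(String suffix) {
        if (suffix.isEmpty()) {
            terminal = true;
            return;
        }
        children.computeIfAbsent(suffix.charAt(0), k -> new RemoveSketchNode())
                .add(suffix.substring(1));
    }

    // Number of terminal nodes in the subtrie rooted at this node.
    int size() {
        int count = terminal ? 1 : 0;
        for (RemoveSketchNode child : children.values()) {
            count += child.size();
        }
        return count;
    }

    // Number of nodes of any kind in this subtrie (shows that pruning happened).
    int nodeCount() {
        int count = 1;
        for (RemoveSketchNode child : children.values()) {
            count += child.nodeCount();
        }
        return count;
    }

    // Removes suffix from this subtrie. Returns the node that should take this
    // node's place: this if the branch is still live, null if it is now dead.
    RemoveSketchNode remove(String suffix) {
        if (suffix.isEmpty()) {
            terminal = false; // base case: untag rather than delete
        } else {
            char first = suffix.charAt(0);
            RemoveSketchNode child = children.get(first);
            if (child != null) { // a null child means the item was never here
                RemoveSketchNode replacement = child.remove(suffix.substring(1));
                if (replacement == null) {
                    children.remove(first); // prune the dead branch
                }
            }
        }
        return (terminal || !children.isEmpty()) ? this : null;
    }
}
```

Note that in this sketch the root itself returns null once the set becomes empty, so whatever calls remove() on the root has to decide how to handle that case (for example, by keeping the root node regardless).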

Test using test.TrieTest.

4. Bonus problems

There are a bunch of harder problems that one could solve. Too bad the semester has to end. If you have some extra time after the required part of this project (and all other projects), here are a few nifty problems to trie, I mean, try. I'll give some extra credit for each of these you do (but don't do these in place of completing an earlier project; do them for the fun/challenge).

A. The iterator

Complete the iterator operation for TrieSet. This is set up so that the iterator operation itself is delegated to the TrieNode class. That class's iterator() method returns an iterator over the strings that terminate in the subtrie rooted at that node. This means that a node's iterator will make use of the iterators returned by its children: it will be an iterator over iterators. Also, since a node's subtrie contains only the suffixes of the strings that terminate in it, we need to pass the prefix to the recursive calls to iterator() in order for the iterator to return the entire string.
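The role of the prefix parameter can be seen in a simplified sketch. Be aware that this swaps in a different technique: instead of a lazy iterator over the children's iterators, it eagerly collects the strings into a list and returns the list's iterator. The class IterSketchNode, its TreeMap of children (chosen so the order is predictable), and the helpers add() and collect() are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical node class, not the given TrieNode.
class IterSketchNode {
    final Map<Character, IterSketchNode> children = new TreeMap<>();
    boolean terminal;

    // Helper used only to build examples: adds suffix below this node.
    void add(String suffix) {
        if (suffix.isEmpty()) {
            terminal = true;
            return;
        }
        children.computeIfAbsent(suffix.charAt(0), k -> new IterSketchNode())
                .add(suffix.substring(1));
    }

    // Iterator over the full strings ending in this subtrie. The prefix is
    // the sequence of letters on the path from the root down to this node.
    Iterator<String> iterator(String prefix) {
        List<String> found = new ArrayList<>();
        collect(prefix, found);
        return found.iterator();
    }

    // Eager depth-first collection: each child receives the prefix extended
    // by its own letter, so what gets recorded is the whole string.
    private void collect(String prefix, List<String> found) {
        if (terminal) {
            found.add(prefix);
        }
        for (Map.Entry<Character, IterSketchNode> e : children.entrySet()) {
            e.getValue().collect(prefix + e.getKey(), found);
        }
    }
}
```

Calling iterator("") on the root yields whole strings; calling it on a deeper node with an empty prefix would yield only suffixes, which is why the prefix has to be passed down.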

Test using TrieTestIterator.

B. Finding keys with a prefix

Don't do this one until you have the iterator working, because this will make use of the iterator. You want to return an iterable that will give an iterator over just the keys with a specific prefix. This requires getting to the root node of the subtrie containing all strings that have the given prefix and returning that node's iterator... that is, if that node exists. If there is no such subtrie (which should be the same thing as there being no strings with that prefix), then this method should make a vacuous iterator whose hasNext() returns false the first time.

Oh, and since this method returns an Iterable, not an Iterator, make sure you wrap the iterator appropriately. If that throws you off, here's how to wrap an iterator in an iterable:

return new Iterable<String>() {
    public Iterator<String> iterator() {
        // put the stuff you would put for an iterator method here
    }
};
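Putting the pieces of this problem together, a rough sketch might look like the following. It assumes a hypothetical PrefixSketchTrie class with a TreeMap-based Node, and, as a simplification, it collects the keys eagerly rather than reusing a lazy node iterator. Collections.emptyIterator() from java.util serves as the vacuous iterator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, not the given TrieSet.
class PrefixSketchTrie {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    final Node root = new Node();

    void add(String item) {
        Node current = root;
        for (char c : item.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.terminal = true;
    }

    Iterable<String> keysWithPrefix(String prefix) {
        // Walk down to the root of the subtrie for this prefix, if it exists.
        Node current = root;
        for (int i = 0; i < prefix.length() && current != null; i++) {
            current = current.children.get(prefix.charAt(i));
        }
        final Node start = current; // null if no key has this prefix
        return new Iterable<String>() {
            public Iterator<String> iterator() {
                if (start == null) {
                    return Collections.emptyIterator(); // hasNext() is false at once
                }
                List<String> found = new ArrayList<>();
                collect(start, prefix, found);
                return found.iterator();
            }
        };
    }

    // Depth-first collection of every full key in the subtrie rooted at node.
    private static void collect(Node node, String prefix, List<String> found) {
        if (node.terminal) {
            found.add(prefix);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            collect(e.getValue(), prefix + e.getKey(), found);
        }
    }
}
```

Because the result is an Iterable, it can be used directly in a for-each loop, for example: for (String key : trie.keysWithPrefix("oa")) { ... }.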

Test using TrieTestKeysWithPrefix.

C. Finding the longest prefix

Given a string (which might not be a key in the set), find the longest string in the set, if any, that is a prefix of the given string.
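One way to approach this is to walk down the trie along the given string and remember the last terminal node passed. The sketch below assumes a hypothetical LongestPrefixSketch class and assumes that "no such string" is reported as null; the actual method in TrieSet may have a different name and convention.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, not the given TrieSet.
class LongestPrefixSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    final Node root = new Node();

    void add(String item) {
        Node current = root;
        for (char c : item.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.terminal = true;
    }

    // Returns the longest key that is a prefix of query, or null if none is.
    String longestPrefix(String query) {
        Node current = root;
        // Length of the longest key seen so far; -1 means none yet.
        int best = current.terminal ? 0 : -1;
        for (int i = 0; i < query.length(); i++) {
            current = current.children.get(query.charAt(i));
            if (current == null) {
                break; // the trie has no path for the rest of query
            }
            if (current.terminal) {
                best = i + 1; // the first i + 1 letters of query form a key
            }
        }
        return best < 0 ? null : query.substring(0, best);
    }
}
```

With "oat" and "oatmeal" in the set, the query "oatmeals" gives "oatmeal", the query "oats" gives "oat", and the query "oa" gives null.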

Test using TrieTestLongestPrefix.

D. Finding keys that match (very hard)

This problem is similar to the iterator problem (you will, indeed, make a new iterator), except that instead of traversing the entire trie, the iterator will descend only those branches whose strings match the given pattern of characters and wild cards. For example, ELLEN and ELLIE both match the pattern ELL...
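The idea of descending only the matching branches can be sketched as below. Several assumptions are made here: that the wild card is the character '.' and stands for exactly one letter, that a match must have the same length as the pattern, and that the class MatchSketchTrie and its methods are hypothetical. The sketch also collects matches eagerly into a list instead of building the iterator the problem asks for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, not the given TrieSet.
class MatchSketchTrie {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    final Node root = new Node();

    void add(String item) {
        Node current = root;
        for (char c : item.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.terminal = true;
    }

    // Assumption: '.' in the pattern matches any single letter.
    List<String> keysThatMatch(String pattern) {
        List<String> found = new ArrayList<>();
        match(root, "", pattern, found);
        return found;
    }

    // prefix holds the letters chosen so far; its length is the current depth.
    private static void match(Node node, String prefix, String pattern,
                              List<String> found) {
        int depth = prefix.length();
        if (depth == pattern.length()) {
            if (node.terminal) {
                found.add(prefix); // same length as the pattern and in the set
            }
            return;
        }
        char wanted = pattern.charAt(depth);
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            // Descend only into children compatible with the pattern here.
            if (wanted == '.' || wanted == e.getKey()) {
                match(e.getValue(), prefix + e.getKey(), pattern, found);
            }
        }
    }
}
```

A literal character in the pattern selects at most one child, while a wild card fans out to all children, so the traversal visits only the part of the trie that can still produce a match.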

Test using TrieTestKeysThatMatch.

5. Turn in

Copy the file you modified (TrieSet.java) to your turn-in folder /cslab/class/cs345/(your id)/trie .

To keep up with the course, this should be finished by May 3.

Thomas VanDrunen

Last modified: Thu Dec 12 14:55:58 CST 2019