lexicalunit ([info]lexicalunit) wrote,
@ 2008-02-21 22:23:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
Tries For Fun
So for a while now I've been working on implementing a really nice trie in C++ in the style of a STL container. The reason for this was that I wanted to use a trie to implement a search in my photo website. The problem with that is I had to write it in PHP, which sucks. Since I hardly know PHP I decided to model a trie in C++ first. Along the way I found out there's lots to C++ that I can still learn! (or at least get much more practice with...)

I've finally got the code to where I feel like I won't be too ashamed of myself if eyes other than my own view it, so I'm putting it out for any and all to see.


First let's talk about what you need if you want to build the source. You need boost*. I only use boost for lexical_cast because it allows me to provide default policies that just work for most basic types. For example, with lexical_cast the code for implementing a trie based on ints is this simple:
typedef trie<int> int_trie;
And from my point of view the code to create a prefix of ints of a given length, something very important to the operation of prefix-tries, looks something like this:
return lexical_cast<int> (lexical_cast<std::string> (value).substr (0, length));
Compare that to a std::stringstream implementation:
std::stringstream ss;
ss << value;
std::string prefix = ss.str ().substr (0, length);
ss.str (string ());
ss << prefix;
int r;
ss >> r;
return r;
*:Well you don't need it if you just replace my code with std::stringstream equivalent code, which isn't hard. But you do need it if you want your life to be complete ;)

Next you'll need a C++ compiler, but then, if you're reading this post, I assume you have that already. Onward and to the code.

The trie class has a few policies that let users tell it how to function. So we have the advantage of making the class extensible but without having to rely on the overhead of dealing with inheritance. This approach to C++ class design is called Modern C++ and Andrei Alexandrescu wrote the book on it, literally. Here's what it looks like:
template<typename T, class Compare, class CharT, class Length, class Prefix>
class trie { /* ... */ };
Defaults are provided for Compare, CharT, Length, and Prefix so in most cases the user simply has to provide the T type that the trie stores, ie: strings or integers or what have you.

My trie uses the policies to determine how it should handle various situations. Let's say you want to store strings in a trie, but in descending order instead of the default ascending order. All you have to do is override the default comparator policy like so:
typedef trie<std::string, std::greater<std::string> > descending_string_trie;
Neat eh? Length tells the trie how to calculate the lexicographical length of a given T value and Prefix tells the trie how to come up with a prefix of a given T value (as seen above in my lexical_cast code sample). If you download my code and take a look at my main.cpp you'll see near the top an implementation of a trie that stores string/data pairs so that you can insert some value into a trie and associate it with some sort of data.

In my first iteration of trie I had assumed that the user would always want some data associated with node values. This was because my intended use was creating a dictionary of search terms where each term had a list of associated webpages as its data. That way, someone could search my dictionary of words and get a list of hits for my photo webpage very quickly.

With my Modern-C++ style re-write, I can still support this paradigm and maintain a higher level of flexibility. Not to mention the code is much simpler!

Next I wanted my trie to work with all the STL algorithms so I needed to provide trie iterators. I had to ask myself, what kind of iterators does a trie require? Quickly I decided that reverse iterators don't make much sense for the things users would want to do with a trie. In fact, I decided that forward iterators were pretty much sufficient for a quality trie implementation.

The hardest part about writing an iterator is getting operator++() correct, all the other methods are trivial. Let me show the code first, then go step by step through it.
    template<typename T, class Compare, class Length, class Prefix>
    typename trie<T, Compare, CharT, Length, Prefix>::iterator&
    trie<T, Compare, CharT, Length, Prefix>::iterator::operator++()
    {
        // After we've fallen off the trie, we're always at the end.
        if (this->node == NULL)
        {
            return *this;
        }
    
        // Step through children in pre-order.
        if (!this->node->leaf ())
        {
            this->parents.push_back (this->node);
            // Get pointer to child trie object.
            this->node = *this->node->children.begin ();
            return *this;
        }
    
        // Peel off parents and step through their children in pre-order.
        typedef typename subtrie::iterator child;
        while (!this->parents.empty ())
        {
            child end = this->parents.back ()->children.end ();
            // Recursively step through parents' children.
            child next = ++(this->parents.back ()->children.find (this->node));
            if (next != end)
            {
                this->node = *next;
                return *this;
            }
            else
            {
                this->node = &(*this->parents.back ());
                this->parents.pop_back ();
            }
        }
    
        // End of trie has been reached.
        this->node = NULL;
        return *this;
    }
The first three lines are wordy, but simple. All they say is that this is a method template for iterator::operator++() that returns a trie::iterator&.

The first thing we check for is if the iterator points to NULL. This is my sentry for the end of iteration. Once an iterator is incremented past the end of the trie, it always points to NULL. Checking this first and returning immediately if the check fails ensures that once an iterator hits the end of the trie, it's always stuck there.

When I first implemented incrementing I did it by pre-order. It may be that case that in the future other kinds of ordering may be desired, for example in-order iteration. The next if statement determines if we are at the end of a trie branch, or in other words if we're at a leaf node. If not, we store a parent pointer in our iterator and continue on down the trie by going to our current node's first child. We continue the progress until we finally reach a leaf node. At this point things get interesting.

Using those stored parent pointers is how we continue our incrementation. We use the iterator's parents container like a stack by first accessing it's most recent addition. So we try and find the current node in the children of the most recent parent. Basically, we're asking our direct parent for an iterator over its children which points to ourself. Then we simply increment this iterator to get our parent's next child, one of our siblings.

If this child exists, then it becomes the next node. Otherwise, the next node becomes our parent and we begin the loop again. Immediately the next node then becomes one of our siblings and we return.

Eventually though this process we will exhaust all our parent pointers after having traveled along all the tire's branches. Finally we return an iterator to NULL, which is our sentry for end of trie. So there's iterator::operator++(). The const version of iterator operates exactly the same way.

The next hurdle in implementation was making trie exception safe. For a while I had this bug that caused my test program to crash after any trie created via the copy-assignment operator was destructed. Let's see why.

Here's my initial implementation, see if you can spot the bug that causes an error to occur in reduce_childs(). Irrelevant code has been removed to make it easier to see the problem.
class trie
{
        typedef trie* trie_node;

        struct trie_node_less { /* ... */ };
        typedef std::set<trie_node, trie_node_less> subtrie;

        struct trie_node_delete
        {
            void operator()(trie_node const &node) throw()
            {
                delete node;
            }
        };

    subtrie children;
};

trie::~trie() throw()
{
    for_each (this->children.begin (), this->children.end (), trie_node_delete ());
    reduce_childs (this);
}

trie::reduce_childs(trie_node node) throw()
{
	if (node == NULL)
	{
		return;
	}

	if (node->childs > 0)
	{
		--node->childs; // ERROR HERE
	}

	reduce_childs (node->parent);
}

trie& trie::operator=(const trie &copy)
{
	if (this == &copy)
	{
		return *this;
	}

	trie temp = copy;
	swap (temp);
	return *this;
}

void trie::swap(trie &from) throw()
{
	using std::swap;

	swap (from.children, this->children);
	swap (from.childs, this->childs);
	swap (from.value, this->value);
	swap (from.parent, this->parent);
}
PS: Don't cheat by downloading my source and reading the comments I put in there! I'll put some space between here and the answer.




















Ok here's the answer. The problem is in operator=() and ~trie(). First notice that in the copy-assignment operator we create a temp and then insert values into it. The reason we do this is to provide exception safety. All the real work is done to the side on a temporary, and then that work is committed to our node with no-throw operations like swap().

Unfortunately this means that the direct children of temp have parent pointers that point to a temporary object. So now when reduce_childs() is called from the destructor, we traverse up the trie to the top where we attempt to dereference memory that's been destructed, which causes the program to crash.

So how can we solve this? Well, clearly we have to re-parent these wayward nodes! But can we do this and maintain no throw exception safety? The answer is yes. Since the STL provides basic exception safety for iterators over containers, and they don't throw any exceptions themselves, and children is a STL container, all we have to make sure of is that we don't do anything that could throw an exception while we iterate over the children and re-parent them. Fortunately assignment to a pointer won't throw an exception, so we're fine. Download my source to see the final working code.

There's other interesting points in trie.h, but rather than blab on about random stuff now, I'll just end this post here and respond to any questions instead.

Download: trie.tar.gz (12kb)



Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…