This is a nice write-up of the available sensible trie implementations - thanks Bhavin!
One thing I'd add is that if you aren't exploiting the ordered nature of the keys within a trie - i.e. if your keys have no underlying substructure or you aren't doing linear scans through the key space - then you don't want a trie at all and you'd be better off with a hash table.
A hash table will be faster and use less memory if your keys are unstructured (imagine for a minute choosing random integers for your keys - then mod N is a good hash function for those keys, so it is obvious a hash table will work well for them, whereas a trie will waste space trying to find common prefixes between random keys.)
If you are accessing completely randomly, then the cache utilisation will be similar between a good trie and a hash table. If you are accessing some things more than others, then you should use a move-to-front hash table to be competitive.
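Here's roughly what I mean by move-to-front, as a quick sketch (my own names; no resizing or deletion): on every successful lookup the found entry is moved to the head of its bucket's chain, so frequently-accessed keys stay one probe away.

    # Chained hash table with move-to-front on lookup. Sketch only:
    # no resizing, no deletion.
    class MTFHashTable:
        def __init__(self, nbuckets=1024):
            self.buckets = [[] for _ in range(nbuckets)]

        def put(self, key, value):
            chain = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

        def get(self, key):
            chain = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, v) in enumerate(chain):
                if k == key:
                    if i:  # move-to-front: hot keys migrate to the chain head
                        chain.insert(0, chain.pop(i))
                    return v
            raise KeyError(key)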
> so it is obvious a hash table will work well for them, whereas a trie will waste space trying to find common prefixes between random keys.)
It is obvious, but it is wrong. I observed Judy beating out many hash tables in specifically this situation, and comparing favorably to Google's sparsehash. Given its much smaller code size and fewer dependencies, it has proven a huge win at my organization.
Benchmarking beats random ideas, no doubt about that! Thanks for pointing out my mistake with integers. However, I would say Judy doesn't look so hot for string keys - slower at most things and similar memory requirements.
I'd be interested to see 10 million 64-bit integers in Judy and DenseHash though. Since ten million is 0.2% of 2^32, in some sense lots of the potential key space is actually in use, so I'd expect a lot of sharing to be possible and a sparse trie to perform well.
Is there a 64-bit integer key Judy array you could test on?
Further, the article is only an announcement that some benchmarking would be happening later...
But this article, linked directly from the Wikipedia entry, gives benchmarking and a good argument that a "Judy array" is a rather over-engineered solution: http://www.nothings.org/computer/judy/
Thanks Robin. I have noticed, however, that in general a hash table has performance issues in comparison to a trie when storing a large dataset of strings. I guess this is because a trie guarantees constant-time lookup, given that in normal circumstances the max length of the string is fixed; a hash table, on the other hand, does not guarantee constant-time lookup due to collisions. Another advantage of an optimized trie is that the key does not have to be duplicated: if there are two keys, "bhavin" and "bhatia", the first 3 characters do not need to be stored twice, making the structure relatively more compact. (Of course this only applies to one of the space-optimized trie structures; in a regular trie, since each node would store 26 buckets, the total space consumed may actually be considerably higher.) Lastly, I would assume that since hash tables are sparse by nature, they take up extra space that tries and trees otherwise do not.
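To make the prefix-sharing point concrete, here's a throwaway nested-dict sketch (mine, not from the article) - inserting "bhavin" and "bhatia" stores twelve characters in only nine nodes, because "bha" is shared:

    # Toy trie as nested dicts; '$' marks end-of-key.
    def insert(root, key):
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node['$'] = True

    def count_nodes(node):
        return 1 + sum(count_nodes(c) for k, c in node.items() if k != '$')

    root = {}
    insert(root, "bhavin")
    insert(root, "bhatia")
    print(count_nodes(root) - 1)  # 9 character nodes for 12 characters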
True, but some types of tries sidestep this problem completely -- I'm a big fan of the radix/Patricia tree (mentioned in the article). It uses less space than a hash table, and has a similar asymptotic runtime when you consider the cost of hashing the string itself.
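Lookup in a radix tree can be sketched like this (hand-built node layout of my own): edges are labelled with substrings rather than single characters, so chains of single-child nodes collapse away.

    # Radix-tree lookup over dicts keyed by edge labels; '$' marks
    # a stored key. Sketch only (no insert/split logic shown).
    def search(node, key):
        if not key:
            return node.get('$', False)
        for label, child in node.items():
            if label != '$' and key.startswith(label):
                return search(child, key[len(label):])
        return False

    # Hand-built tree for {"bhavin", "bhatia"}: the shared "bha"
    # prefix is a single edge instead of three single-char nodes.
    root = {"bha": {"vin": {'$': True}, "tia": {'$': True}}}
    print(search(root, "bhavin"))  # True
    print(search(root, "banana"))  # False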
For me the power of a trie over a hash table has always been the ease of implementation. Hash tables have a lot of corner cases that tries don't have. I can write an entire trie in the time it would take me to write just the part of a hash table that deals with collisions, never mind resizing or rehashing!
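To illustrate, this is more or less the entire thing (a quick dict-based sketch of my own):

    # A complete trie: no collisions, no resizing, no rehashing.
    class Trie:
        def __init__(self):
            self.root = {}

        def add(self, key):
            node = self.root
            for ch in key:
                node = node.setdefault(ch, {})
            node['$'] = True  # end-of-key marker

        def __contains__(self, key):
            node = self.root
            for ch in key:
                if ch not in node:
                    return False
                node = node[ch]
            return '$' in node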
This is why Judy arrays seemed a bit odd to me. I feel that if I put as much effort into making a super-fast hash table as the author did into making a super-fast trie, it would perform as well or better. That's just a gut instinct though, and I don't really have any data to back it up.
I suppose an exception could be when the median key length is very short: tries tend to be better there, since a secure hash function could take longer to compute than a very small number of memory lookups (especially since some of them will be cached).
Two useful implementations that this article misses are array-compacted tries (aka dual-array tries) and array-mapped tries. ACTs are nice due to the small memory footprint and good cache locality, but have the downside of being difficult and time-consuming to construct. AMTs have a bit-array for determining existence of children and a sorted array of child pointers in each node, and are surprisingly fast at both insertion and search. Both have been described elsewhere, but I like the explanations by Phil Bagwell in his paper, "Fast And Space Efficient Trie Searches" (2000).
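The AMT child-lookup trick can be sketched like this (Python just for illustration, names mine): a bitmap records which symbols have children, and a popcount of the bits below a symbol's position gives the index into a dense child array.

    # Array-mapped trie node: bitmap + dense child array.
    class AMTNode:
        def __init__(self):
            self.bitmap = 0
            self.children = []  # dense, sorted by symbol

        def get(self, symbol):  # symbol in 0..63, say
            bit = 1 << symbol
            if not (self.bitmap & bit):
                return None
            idx = bin(self.bitmap & (bit - 1)).count('1')  # popcount below
            return self.children[idx]

        def put(self, symbol, child):
            bit = 1 << symbol
            idx = bin(self.bitmap & (bit - 1)).count('1')
            if self.bitmap & bit:
                self.children[idx] = child
            else:
                self.children.insert(idx, child)
                self.bitmap |= bit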
Perhaps a dumb question, but has anyone considered using a trie to manage search query auto-suggest, such as Google AutoSuggest?
It seems like it could be more efficient as you could send to the browser the node ID of the current buffer and then when the next key was pressed, use the current ID to efficiently look forward just one level.
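Roughly what I'm imagining on the server side (a sketch, all names mine): nodes live in a flat table, the client holds its current node's integer ID, and each keystroke advances one level.

    # Trie nodes in a flat table so the client can hold an integer
    # node ID and advance one level per keystroke.
    nodes = [{'children': {}, 'suggestions': []}]  # nodes[0] is the root

    def add(word):
        nid = 0
        for ch in word:
            child = nodes[nid]['children'].get(ch)
            if child is None:
                nodes.append({'children': {}, 'suggestions': []})
                child = len(nodes) - 1
                nodes[nid]['children'][ch] = child
            nid = child
            nodes[nid]['suggestions'].append(word)  # cap this in practice

    def step(nid, ch):
        # One keystroke: old node ID + character -> new ID + suggestions.
        child = nodes[nid]['children'].get(ch)
        if child is None:
            return None, []
        return child, nodes[child]['suggestions']

    add("hacker news")
    nid, sugg = step(0, 'h')    # client typed 'h' at the root
    nid, sugg = step(nid, 'a')  # then 'a'
    print(sugg)                 # ['hacker news']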
I'm working on a project to do this right now, in fact, since it seems all the open-source autocomplete implementations I can find just come down to doing a "SELECT bla WHERE bla LIKE bla", which is pretty slow. The one major issue is that tries are prefix-based, so the trie has to match starting from the beginning of the query. My sort-of solution around that is to break up the terms you're matching against into groups of phrases, and allow prefix matching on any of those phrases.
As a question to HN: I was planning on using radix trees, but now that I've read this article it seems like there might be some other good options. Are any of those structures noticeably better for storing lots of strings?
I thought that, since users type one character at a time most of the time, that would be efficient. Even pasting at the end of the string is easy to manage, as it might only be O(q), where q is the length of the query string (depending on the structure of the underlying trie).
The worst cases are editing mid-string and, worse, pasting to the start of the query string. Still, those are O(q) rather than O(N), where N is the total number of strings in the database. In my case, N will be in the millions of records while q will rarely exceed 250 characters.
The problem is not efficiency, it's relevance. Say you have a result "hacker news". If it's a pure prefix trie, this won't match the query "news", even though you probably want to return it as a suggestion. My hack around this would be to add both "hacker news" and "news" to the trie. It's still not perfect, since it wouldn't match "acker" or "ws", etc.
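Concretely, the hack is just to index the phrase once per word boundary, each entry pointing back at the full suggestion (a sketch over a plain dict trie, names mine):

    # Index "hacker news" under both "hacker news" and "news", so a
    # prefix trie can match queries starting at any word boundary.
    def word_suffixes(phrase):
        words = phrase.split()
        return [' '.join(words[i:]) for i in range(len(words))]

    def index(trie, phrase):
        for suffix in word_suffixes(phrase):
            node = trie
            for ch in suffix:
                node = node.setdefault(ch, {})
            node.setdefault('$', []).append(phrase)  # full suggestion

    trie = {}
    index(trie, "hacker news")
    # Now both "hac..." and "new..." reach "hacker news";
    # "acker" or "ws" still won't match, as noted above.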
Do you have a plan for handling non-ascii characters in your trie? The descriptions I've seen usually involve having a character index into an array, and it seems like a waste if you used 2-byte characters. I guess you would use an encoding (like UTF-8), with a particular canonicalization.
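One common way to handle this (a sketch, not a claim about the parent's plan): key the trie on UTF-8 bytes rather than characters, so each node needs at most 256 slots and non-ASCII falls out for free, at the cost of a few extra levels per multi-byte character. Canonicalizing first (e.g. NFC) makes equal-looking strings encode identically.

    import unicodedata

    # Byte-wise trie: canonicalize, encode to UTF-8, walk one byte
    # per level. Nodes never need more than 256 children.
    def add(root, key):
        node = root
        for b in unicodedata.normalize('NFC', key).encode('utf-8'):
            node = node.setdefault(b, {})
        node['end'] = True

    def contains(root, key):
        node = root
        for b in unicodedata.normalize('NFC', key).encode('utf-8'):
            if b not in node:
                return False
            node = node[b]
        return 'end' in node

    root = {}
    add(root, "café")
    print(contains(root, "cafe\u0301"))  # True: NFC composes the accent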
That's a nice review - I'm toying with the idea of trying to write a simple hierarchical OLAP database (like Essbase or Hyperion HFM), and tries seem like a potentially interesting starting point.