Radix-tree-1.0: trees that radix

The package made it to Hackage unaltered, happy bashing.

3 Likes

The overview of the modules in your initial post on the linked PR is great! It tells me what I’m looking at, without my having to know what patricia trees or zebra trees are (or to dig through the module reference documentation and infer things from the API). Please add that overview to the package description or README on Hackage!

EDIT: it’s especially useful because even after scanning the API of the patricia trees, I was still confused about how this differs from IntMap. That’s an important question that potential users will have. :slight_smile:

1 Like

To be frank, looking at the list of modules at radix-tree: Radix trees., I feel quite overwhelmed. Which one am I supposed to use by default? What’s the difference? How does it compare to Map / HashMap? What is Zebra about? It would be great for the README to answer such questions.

3 Likes

Looks like a great library, great work.

I’m happy to rename radixtree, which is a library for using compact prefix trees to do parsing, to something more appropriate - maybe parsers-radixtree? It’s probably going to be confusing for people to have both radix-tree and radixtree. But I have no idea if this is even possible, as packages cannot be deleted, only deprecated.

OTOH if radix-tree got some compact radix trees in, that would make radixtree almost redundant. That is, trees where branchless edges are merged together, so ["apple","application"] becomes the tree [("appl", [("e",accept "apple"), ("ication", accept "application")])], if that makes sense.
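For concreteness, a minimal sketch of such a path-compressed tree (hypothetical names, not the API of either package):

import Data.List (stripPrefix)

-- Minimal sketch of a path-compressed prefix tree; names are illustrative
-- and not part of radix-tree or radixtree.
data Compressed a = Node (Maybe a) [(String, Compressed a)]

accept :: a -> Compressed a
accept x = Node (Just x) []

-- ["apple", "application"] with the shared, branchless prefix "appl" merged
-- into a single edge:
example :: Compressed String
example =
  Node Nothing
    [ ("appl", Node Nothing
        [ ("e",       accept "apple")
        , ("ication", accept "application")
        ])
    ]

-- Lookup walks one edge at a time, consuming a whole edge label at once.
lookupC :: String -> Compressed a -> Maybe a
lookupC key (Node here edges)
  | null key  = here
  | otherwise =
      case [ (rest, sub) | (label, sub) <- edges
                         , Just rest <- [stripPrefix label key] ] of
        (rest, sub) : _ -> lookupC rest sub
        []              -> Nothing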

Indeed, but unfortunately there’s an art to this, so I’ll have to spend a few hours on it.


My goal with this library is parsing as well, but my goals are far wider:

  1. Since the default strategy in parsing is evaluating the radix tree at runtime, spine-lazy trees can be used to skip evaluation of the spine on unused optional keys;

  2. If all the keys are known at compilation time, you can precompile a spine-strict tree (which is why I added the .TH modules for each data structure). If only some are known, you can precompile a spine-lazy tree and only add the unknowns at runtime.

This library is proof that both of these goals are achievable in Haskell. However, to utilize this power properly we’ll also need better parsing libraries (see my quirky json prototype) and lazier underlying parsers (see the incomplete PR in binary).

The issue here is that Haskell data structures are persistent, so any “compacting” you do will directly result in more copying on operations that are now more complex, and therefore higher space usage if you’re sharing a lot of nodes. If memory is a massive concern for your use-case then you should look into an [ephemeral] adaptive radix tree implementation in C/C++/Rust, and then deal with all the restrictions that come from using that.

A possible middle ground here would be making a specialized compressed radix tree that only supports lookups and using that for the precompiled scenario (though I imagine the actual performance gains from that would be meager).

1 Like

@BurningWitness thanks for updating the README, much clearer now.

I whipped up a quick benchmark against the previous version of the package, but the results are somewhat off. Am I missing some important pragmas / patterns? The source code is at GitHub - Bodigrim/radix-tree at benchmarks. We take a listing of GHC sources (~5000 files), create a radix tree and then look up every key in it. On my machine the numbers are:

Old:
  603  μs ±  28 μs
New:
  2.31 ms ± 221 μs

It feels expensive to look up via Feed, which presumably emits bytes one at a time, but I cannot find any other way to do it.

See the “Inlining” section at the top of the module.

Hence you’re looking for

foldl'
  (\acc k -> acc + New.find 0 (New.feedShortByteString k) newRadixTree)
  0
  keys

Thanks, that’s better, but still slower than the previous implementation:

  Old: OK
    602  μs ±  54 μs
  New: OK
    1.27 ms ± 108 μs

Okay, I forgot to add strictness annotations (three bang patterns in each function), so the changes on my machine look like:

  Old/fromList: OK
    5.54 ms ± 226 μs   ->   5.58 ms ± 228 μs
  New/fromList: OK
    4.93 ms ± 224 μs   ->   3.27 ms ± 290 μs
  Old/lookups:  OK
    869  μs ±  61 μs   ->   868  μs ±  49 μs
  New/lookups:  OK
    2.31 ms ± 208 μs   ->   693  μs ±  43 μs

I’ll apply this to all relevant functions and ship later today.
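The kind of change in question is the standard worker-loop strictness pattern; a toy sketch (not the actual library code):

{-# LANGUAGE BangPatterns #-}

-- Toy sketch of the fix described above (not the library's source): the bangs
-- keep the loop arguments evaluated, letting GHC unbox the Ints instead of
-- building thunks on every iteration.
sumTo :: Int -> Int
sumTo limit = go 0 0
  where
    go !n !acc
      | n > limit = acc
      | otherwise = go (n + 1) (acc + n)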

See the tighter-loops branch for the updated version.

On my machine the tighter-loops branch runs just a bit slower than the old implementation:

  Old: OK
    600  μs ± 7.2 μs
  New: OK
    651  μs ± 9.5 μs

Now if I make common prefixes 10x longer, the difference gets more pronounced:

  Old: OK
    1.99 ms ±  29 μs
  New: OK
    3.12 ms ±  23 μs

I think it is very challenging to beat memcmp with a bytewise loop. Would it be possible to provide lookupBA :: ByteArray -> RadixTree a -> Maybe a? Or, if it fits better with the rest of the API, lookupBuild :: Build -> RadixTree a -> Maybe a?

If you feel the need to squeeze out every last drop of performance, consider importing the relevant .Unsafe module and building a function yourself. The PVP applies to .Unsafe modules, since I never claimed otherwise.

I’m against adding any functions that duplicate existing functionality, so “30% faster in 30% of the cases” is not a compelling reason in my view.

Also consider the fact that you can gain performance in other places: you can precompile trees, perform chunked lookups and even use lazy trees. Unless you have an actual industrial use-case, I doubt this will be a bottleneck.

1 Like

Thanks, indeed the exposed bits are enough to implement

{-# LANGUAGE BangPatterns, LambdaCase, ViewPatterns #-}

import Data.Maybe (fromMaybe)
import Data.Primitive.ByteArray
  (ByteArray, compareByteArrays, indexByteArray, sizeofByteArray)
-- 'New' is the qualified import of the relevant .Unsafe module mentioned
-- above, which exposes the RadixTree/Bin/Tip/Nil constructors.

findBA :: a -> ByteArray -> New.RadixTree a -> a
findBA d key = \(New.RadixTree maybeX tree) -> case keyLen of
  0 -> fromMaybe d maybeX
  _ -> go 0 tree
  where
    keyLen = sizeofByteArray key
    go !n = \case
      New.Bin p l r -> go n (if indexByteArray key n < p then l else r)
      New.Tip arr@(sizeofByteArray -> arrLen) mx dx
        | arrLen + n <= keyLen
        , EQ <- compareByteArrays key n arr 0 arrLen
        -> if arrLen + n == keyLen then fromMaybe d mx else go (n + arrLen) dx
        | otherwise
        -> d
      New.Nil -> d

The benchmarks for long (tenfold) common prefixes are quite impressive:

  Old:    OK
    1.99 ms ± 109 μs
  New:    OK
    3.03 ms ± 216 μs
  findBA: OK
    980  μs ±  54 μs

However, with shorter common prefixes findBA is measurably slower than other implementations:

  Old:    OK
    600  μs ±  14 μs
  New:    OK
    656  μs ±  15 μs
  findBA: OK
    693  μs ±  26 μs

Upd.: It is possible, though, to use compareByteArrays only if the prefix length is, say, 8 or more, in which case we would reap the best of both worlds.
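Concretely, the Tip case of findBA above could call a helper like this (a sketch; the threshold of 8 is the one suggested, and the caller must still guarantee that arrLen + n fits in the key, as findBA’s guard already does):

import Data.Primitive.ByteArray (ByteArray, compareByteArrays, indexByteArray)
import Data.Word (Word8)

-- Hybrid prefix check: a bytewise loop for short prefixes, memcmp-backed
-- compareByteArrays for longer ones. Assumes arrLen + n <= sizeofByteArray key.
matchesPrefix :: ByteArray -> Int -> ByteArray -> Int -> Bool
matchesPrefix key n arr arrLen
  | arrLen >= 8 = compareByteArrays key n arr 0 arrLen == EQ
  | otherwise   = go 0
  where
    go i
      | i >= arrLen = True
      | (indexByteArray key (n + i) :: Word8) == indexByteArray arr i = go (i + 1)
      | otherwise = False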

4 Likes

There definitely exists some optimization route where Feed works over chunks and not individual bytes, perhaps comparing Words instead of Word8s if both buffers allow it, but it would be quite a bit more complex than the current solution.

My general point is that this is already asymptotically optimal (and will get slightly better in the future as it gets tweaked), so I wouldn’t bother with a custom solution unless I know every other part of the system around it is about as well-optimized. The two common examples are unnecessary key type conversions and serializing the tree as a list of key-value pairs: either of these most probably introduces way more overhead than any insertion/lookup routine you can handroll, and solving them is way easier.

1 Like

If I may, I’d suggest mentioning this directly in the documentation for lookup. Otherwise plenty of users will expect that converting their keys to Feed upfront is a faster way of doing things. You probably want to emphasize that clients should never hold onto anything of type Feed, but rather stick to ByteString / ByteArray / Text / String or whatever they have and convert it to Feed just in time.
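Something like the following sketch of the intended pattern could go in the docs (the module name and exact signatures here are assumptions based on the snippets earlier in the thread, not quotes from the documentation):

import qualified Data.ByteString.Short as SBS

-- Assumed import; adjust to the module you actually use. The point is only
-- that the key stays a ShortByteString and the Feed is built at the call
-- site, so the feed function can inline and fuse with the lookup loop.
import Data.RadixTree.Word8.Strict (RadixTree, find, feedShortByteString)

findSBS :: a -> SBS.ShortByteString -> RadixTree a -> a
findSBS def key tree = find def (feedShortByteString key) tree
{-# INLINE findSBS #-}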

This is not a lookup quirk; it applies equally to all functions that consume Feed. Furthermore, it’s not critical that users get this right: their programs will merely run slower. As such this is currently addressed with the “Inlining” documentation section on top of every relevant module. The section could probably be worded better, I admit that.

It’s similar to spine-laziness in that the mechanics are rather simple and obvious to anyone in the loop, while being quite confusing to anyone outside. I would argue addressing this isn’t even the library’s responsibility: I should just be able to say “this is spine-lazy” and “this is short-cut fused”, and point to educational pages that explain these concepts in as much depth as needed.

1 Like

True, but lookup is the most basic one. Users can extrapolate further.

What I can tell you in all honesty is that I read that section and yet made a mistake. I figured out that I should benchmark saturated applications to get things inlining, but did not connect the dots that feedByteString is the function that is supposed to fuse.

The thing is that radix-tree is already quite a departure from containers / unordered-containers. If I give it a try, don’t get the inlining right, and the results are slow, I won’t look twice; I’ll just stick to unordered-containers. After all, Hackage is full of libraries which are asymptotically optimal but slow as hell.

That said, I feel that I have exhausted my bandwidth for unsolicited advice today :slight_smile: Thanks for your enormous amount of work here; I’m eagerly awaiting a new release with bangs.

1 Like

I’m afraid that this allows for an ambiguous reading. To be clear: I did not mean that radix-tree is one such library.

3 Likes

The use-case of radixtree was parsing against thousands-ish dictionaries of English-language terms, where the dictionary wasn’t known at compile time, being 100% user-supplied, but would also only rarely change. So it was fine to have a very bad insert (in fact, much worse than it needs to be); it just had to traverse quickly. It could still be made somewhat better fairly easily, e.g. by binary search on “large” nodes. Combined with a packrat parser it worked well for parsing English sentences quickly enough for autosuggestions to feel snappy as you typed.

But yeah, you could have both fast traversal / minimal memory and fast insertion, with a mutable API and different node types for different cases; https://db.in.tum.de/~leis/papers/ART.pdf seems to be exactly that. I don’t see an intrinsic reason a Haskell implementation couldn’t be fast and elegant. Might be a nice project to throw linear types at too :slight_smile:

Anyway back to the fancy new rewrite. What are the use-cases you have in mind for taking advantage of laziness? The *.Lazy modules would probably be a great backbone for memoisation a la memoize or MemoTrie for some algorithms.

For parsing, an extreme example would be runtime-evaluated command-line options, where there may be 200 accessible options and users never touch more than a dozen at a time. The cost of accessing said dozen and pushing the rest around should be meager relative to constructing the entire tree. The same reasoning applies generally to parsing any format with dictionaries in it. It’s obviously less performant than directly precompiling a strict tree, but it’s still a nice default option for a language to have.

Regarding memoization… I don’t know, probably? I’ve never properly used it, so my understanding of it is blurry at best.