The package made it to Hackage unaltered, happy bashing.
The overview of the modules in your initial post on the linked PR is great! It tells me what I’m looking at, without my having to know what patricia trees or zebra trees are, or to look at the module reference documentation and infer things from the API. Please add that overview to the package description or README on Hackage!
EDIT: it’s especially useful because even after scanning the API of the patricia trees, I was still confused about how this was different from IntMap. That’s an important question that potential users will have.
To be frank, looking at the list of modules at radix-tree: Radix trees., I feel quite overwhelmed. Which one am I supposed to use by default? What’s the difference? How does it compare to Map / HashMap? What is Zebra about? It would be great for the README to answer such questions.
Looks like a great library, great work.
I’m happy to rename radixtree, which is a library for using compact prefix trees to do parsing, to something more appropriate, maybe parsers-radixtree? It’s probably going to be confusing for people to have both radix-tree and radixtree. But I have no idea if this is possible at all, as packages cannot be deleted, only deprecated.
OTOH if radix-tree got some compact radix trees in, that would make radixtree almost redundant. That is, trees where branchless edges are merged together, so ["apple","application"] is the tree [("appl", [("e", accept "apple"), ("ication", accept "application")])], if that makes sense.
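To make the compaction concrete, here is a minimal self-contained sketch of a path-compressed prefix tree over String (toy types and names, not radixtree’s actual API):

```haskell
import Data.List (stripPrefix)

-- A path-compressed prefix tree: branchless runs of edges are merged
-- into a single string-labelled edge.
data Compact a = Compact (Maybe a) [(String, Compact a)]

-- Look up a key by peeling off whole edge labels at a time.
lookupC :: String -> Compact a -> Maybe a
lookupC []  (Compact mx _)   = mx
lookupC key (Compact _ kids) =
  case [ (rest, t) | (label, t) <- kids
                   , Just rest <- [stripPrefix label key] ] of
    [(rest, t)] -> lookupC rest t
    _           -> Nothing

-- ["apple","application"] compresses into a single "appl" edge that
-- then splits into "e" and "ication".
example :: Compact String
example = Compact Nothing
  [ ("appl", Compact Nothing
      [ ("e",       Compact (Just "apple")       [])
      , ("ication", Compact (Just "application") [])
      ])
  ]
```

Here lookupC "apple" example consumes the shared "appl" edge in one step instead of four single-character hops, which is the whole point of the compaction.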
Indeed, but unfortunately there’s an art to this, so I’ll have to spend a few hours on it.
My goal with this library is parsing as well, but my goals are far wider:
- Since the default strategy in parsing is evaluating the radix tree at runtime, spine-lazy trees can be used to skip evaluation of the spine on unused optional keys;
- If all the keys are known at compilation time, you can precompile a spine-strict tree (hence why I added the .TH modules to each data structure). If only some are known, you can precompile a spine-lazy tree and only add the unknowns at runtime.
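As a toy illustration of the spine-lazy point (made-up types, not this package’s): a lookup only forces the spine along its own path, so branches for keys that are never touched remain unevaluated thunks:

```haskell
-- A lazy binary trie; the constructors are lazy in their subtrees.
data Lazy a = Nil | Leaf a | Bin (Lazy a) (Lazy a)

-- Walk a path of bits, forcing only the nodes along that path.
lookupBits :: [Bool] -> Lazy a -> Maybe a
lookupBits []       (Leaf x)  = Just x
lookupBits (b : bs) (Bin l r) = lookupBits bs (if b then r else l)
lookupBits _        _         = Nothing

demo :: Maybe Int
demo = lookupBits [False, True] t
  where
    -- The right subtree is bottom, standing in for the unused part of
    -- the spine; the lookup above never forces it.
    t = Bin (Bin Nil (Leaf 42)) (error "unused spine, never forced")
```

demo evaluates to Just 42 even though half the tree is bottom; a spine-strict construction would have hit the error immediately.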
This library is proof that both of these goals are achievable in Haskell. However, to utilize this power properly, we’ll also need better parsing libraries (see my quirky prototype of json) and lazier underlying parsers (see the incomplete PR in binary).
The issue here is that Haskell data structures are persistent, so any “compacting” you do will directly result in more copying on operations that are now more complex, and therefore higher space usage if you’re sharing a lot of nodes. If memory is a massive concern for your use-case then you should look into an [ephemeral] adaptive radix tree implementation in C/C++/Rust, and then deal with all the restrictions that come from using that.
A possible middle ground here would be making a specialized compressed radix tree that only supports lookups and using that for the precompiled scenario (though I imagine the actual performance gains from that would be meager).
@BurningWitness thanks for updating the README, much clearer now.
I whipped up a quick benchmark against the previous version of the package, but the results are somewhat off. Am I missing some important pragmas / patterns? The source code is at GitHub - Bodigrim/radix-tree at benchmarks. We take a listing of GHC sources (~5000 files), create a radix tree, and then look up every key in it. On my machine the numbers are:

Old: 603 μs ± 28 μs
New: 2.31 ms ± 221 μs
It feels expensive to do lookups via Feed, which presumably emits byte by byte, but I cannot find any other way to do it.
See the “Inlining” section at the top of the module.
Hence you’re looking for

    foldl'
      (\acc k -> acc + New.find 0 (New.feedShortByteString k) newRadixTree)
      0
      keys
Thanks, that’s better, but still slower than the previous implementation:
Old: OK
602 μs ± 54 μs
New: OK
1.27 ms ± 108 μs
Okay, I forgot to put strictness annotations (three bang patterns in each function), so the changes on my machine look like
Old/fromList: OK
5.54 ms ± 226 μs -> 5.58 ms ± 228 μs
New/fromList: OK
4.93 ms ± 224 μs -> 3.27 ms ± 290 μs
Old/lookups: OK
869 μs ± 61 μs -> 868 μs ± 49 μs
New/lookups: OK
2.31 ms ± 208 μs -> 693 μs ± 43 μs
I’ll apply this to all relevant functions and ship later today.
On my machine the tighter-loops branch runs just a bit slower than the old implementation:
Old: OK
600 μs ± 7.2 μs
New: OK
651 μs ± 9.5 μs
Now if I make common prefixes 10x longer, the difference gets more pronounced:
Old: OK
1.99 ms ± 29 μs
New: OK
3.12 ms ± 23 μs
I think it is very challenging to beat memcmp with a bytewise loop. Would it be possible to provide lookupBA :: ByteArray -> RadixTree a -> Maybe a? Or, if it fits better with the rest of the API, lookupBuild :: Build -> RadixTree a -> Maybe a?
If you feel the need to squeeze out every last drop of performance, consider importing the relevant .Unsafe module and building a function yourself. The PVP applies to .Unsafe modules, since I never claimed otherwise.
I’m against adding any functions that duplicate existing functionality, so “30% faster in 30% of the cases” is not a compelling reason in my view.
Also consider the fact that you can gain performance in other places: you can precompile trees, perform chunked lookups and even use lazy trees. Unless you have an actual industrial use-case, I doubt this will be a bottleneck.
Thanks, indeed the exposed bits are enough to implement

    findBA :: a -> ByteArray -> New.RadixTree a -> a
    findBA d key = \(New.RadixTree maybeX tree) -> case keyLen of
        0 -> fromMaybe d maybeX
        _ -> go 0 tree
      where
        keyLen = sizeofByteArray key

        go !n = \case
          New.Bin p l r -> go n (if indexByteArray key n < p then l else r)
          New.Tip arr@(sizeofByteArray -> arrLen) mx dx
            | arrLen + n <= keyLen
            , EQ <- compareByteArrays key n arr 0 arrLen
              -> if arrLen + n == keyLen then fromMaybe d mx else go (n + arrLen) dx
            | otherwise
              -> d
          New.Nil -> d
The benchmarks for long (tenfold) common prefixes are quite impressive:
Old: OK
1.99 ms ± 109 μs
New: OK
3.03 ms ± 216 μs
findBA: OK
980 μs ± 54 μs
However, with shorter common prefixes findBA is measurably slower than the other implementations:
Old: OK
600 μs ± 14 μs
New: OK
656 μs ± 15 μs
findBA: OK
693 μs ± 26 μs
Upd.: It is possible though to use compareByteArrays only if the prefix length is, say, 8 or more, in which case we would reap the best of both worlds.
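The dispatch itself is cheap; for instance with bytestring (whose slice equality goes through memcmp), a hybrid prefix check might look like this sketch, with 8 as an assumed, untuned threshold:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU

-- Does label occur in key at offset off? Short labels take a bytewise
-- loop to dodge the memcmp call overhead; longer ones defer to the
-- memcmp-backed ByteString equality on an O(1) slice.
prefixMatches :: B.ByteString -> Int -> B.ByteString -> Bool
prefixMatches key off label
  | off + len > B.length key = False
  | len >= 8  = label == B.take len (B.drop off key)
  | otherwise = go 0
  where
    len = B.length label
    go i | i == len  = True
         | BU.unsafeIndex key (off + i) /= BU.unsafeIndex label i = False
         | otherwise = go (i + 1)
```

A real implementation would tune the threshold (and work over ByteArray rather than ByteString), but even a flat cutoff like this recovers the memcmp advantage on long shared prefixes while keeping the tight loop for short edges.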
There definitely exists some optimization route where Feed works over chunks and not individual bytes, perhaps comparing Words instead of Word8s if both buffers allow it, but it would be quite a bit more complex than the current solution.
My general point is that this is already asymptotically optimal (and will get slightly better in the future as it gets tweaked), so I wouldn’t bother with a custom solution unless I know every other part of the system around it is about as well-optimized. The two common examples are unnecessary key type conversions and serializing the tree as a list of key-value pairs: either of these most probably introduces way more overhead than any insertion/lookup routine you can handroll, and solving them is way easier.
If I may, I’d suggest mentioning this in the documentation for lookup directly. Otherwise plenty of users would expect that converting their keys to Feed upfront is a faster way of doing things. You probably want to emphasize that clients should never hold anything of type Feed, but rather stick to ByteString / ByteArray / Text / String or whatever they have and convert it to Feed just in time.
This is not a lookup quirk; it applies equally to all functions that consume Feed. Furthermore, it’s not critical that users get this right: their programs will merely run slower. As such, this is currently addressed with the “Inlining” documentation section at the top of every relevant module. The section could probably be worded better, I admit that.
It’s similar to spine-laziness in that the mechanics are rather simple and obvious to anyone in the loop, while being quite confusing to anyone outside. I would argue addressing this isn’t even the library’s responsibility: I should just be able to say “this is spine-lazy” and “this is short cut fused”, and point to educational pages that explain these concepts in as much depth as may be needed.
True, but lookup is the most basic one. Users can extrapolate further.
What I can tell you with certainty is that I read that section and yet made a mistake. I figured out that I should benchmark saturated applications to get inlining, but did not connect the dots that feedByteString is the one supposed to fuse.
The thing is that radix-tree is already quite a departure from containers / unordered-containers. If I give it a try, do not get the inlining right, and the results are slow, I would not look twice; I would just stick to unordered-containers. After all, Hackage is full of libraries which are asymptotically optimal, but slow as hell.
That said, I feel that I have exhausted my bandwidth of unsolicited advice for today. Thanks for your enormous amount of work here; I’m eagerly awaiting a new release with bangs.
I’m afraid that this allows for an ambiguous reading. To be clear: I did not mean that radix-tree is one such library.
The use-case of radixtree was parsing thousands-ish dictionaries of English-language terms, where the dictionary wasn’t known at compile time, being 100% user supplied, but would also only rarely change. So it was fine to have a very bad insert (in fact, much worse than it needs to be); it just had to traverse quickly. It could definitely still be improved fairly easily, for example by binary search on “large” nodes. Combined with a packrat parser it worked well for parsing English sentences quickly enough for autosuggestions to feel snappy as you typed.
But yeah, you could have both fast traversal / minimal memory and fast insertion, with a mutable API and different node types for different cases; https://db.in.tum.de/~leis/papers/ART.pdf seems to be that. I don’t see an intrinsic reason a Haskell implementation couldn’t be fast and elegant. Might be a nice project to throw linear types at, too.
Anyway, back to the fancy new rewrite. What are the use-cases you have in mind for taking advantage of laziness? The *.Lazy modules would probably be a great backbone for memoisation à la memoize or MemoTrie for some algorithms.
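For context, the trie-memoisation trick those libraries use fits in a few lines; a self-contained sketch (made-up names, not their actual APIs) memoising over the naturals:

```haskell
-- An infinite lazy trie holding f n for every n >= 0: the value for 0
-- at the root, odd indices down one spine, positive even indices down
-- the other. Each thunk is forced at most once and then shared.
data NatTrie a = NatTrie a (NatTrie a) (NatTrie a)

trie :: (Integer -> a) -> NatTrie a
trie f = NatTrie (f 0)
                 (trie (\n -> f (2 * n + 1)))
                 (trie (\n -> f (2 * n + 2)))

untrie :: NatTrie a -> Integer -> a
untrie (NatTrie x l r) n
  | n == 0    = x
  | odd n     = untrie l ((n - 1) `div` 2)
  | otherwise = untrie r ((n - 2) `div` 2)

memo :: (Integer -> a) -> Integer -> a
memo f = untrie (trie f)

-- Naive double recursion, tamed by routing the recursive calls
-- through the shared lazy trie.
fib :: Integer -> Integer
fib = memo fib'
  where
    fib' 0 = 0
    fib' 1 = 1
    fib' n = fib (n - 1) + fib (n - 2)
```

A spine-lazy radix tree over serialized keys could presumably play the same role for string-keyed recursions.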
For parsing an extreme example would be runtime-evaluated command-line options, where there may be 200 accessible options and users never touch more than a dozen at a time. The cost of accessing said dozen and pushing the rest around should be meager relative to constructing the entire tree. This same reasoning applies generally to parsing every format with dictionaries in it. It’s obviously less performant than directly precompiling a strict tree, but still it’s a nice default option that a language should have.
Regarding memoization… I don’t know, probably? I’ve never properly used it, so my understanding of it is blurry at best.