Seqn: A sequence library

meooow · June 21, 2024, 1:52pm

Why a new library? Primarily because I’ve wished a few times that we had a data structure with similar properties to Data.Sequence, but spine-strict. seqn's Seq fits that role. It is also value-strict, because I consider that a sane default.

Aside from regular sequences, if you use fingertree you may find MSeq to be a good (and faster) alternative. If you are not familiar with measured sequences, you can check out the small tutorial in the MSeq docs to see if it can be of use to you.

That’s all, and feedback is welcome!

jaror · June 21, 2024, 2:37pm

Looking at the benchmarks, it seems there is quite a bit of room for improvement in the fingertree and containers packages. They both claim to use the same implementation but containers is almost always faster, except for the append benchmark.

Edit: it seems like a matter of tradeoffs: fingertree is flexible in what measures it supports, but containers is faster because it fixes the measure to Int. I guess we could get the best of both worlds with backpack a la unpacked-containers: Unpacked containers via backpack. (The append thing seems like a fluke due to containers being stricter than all the other libraries in the comparison.)

I also think it would be good to have benchmarks that are centered on use cases rather than the API. A very simple example would be using the sequence as a queue. That sounds like a use case where containers would perform better than seqn, because it extensively accesses both ends. Furthermore, perhaps some nofib benchmarks could be converted to use sequences instead of lists. That would also answer the question: how much performance would I lose if I just used sequences everywhere?

BurningWitness · June 21, 2024, 3:05pm

Wouldn’t you run out of addressable space much earlier than that?

jaror · June 21, 2024, 3:17pm

$ cabal repl -b seqn
...
ghci> import qualified Data.Seqn.Seq as S
ghci> f x = print (length x) *> f (x <> x)
ghci> f (S.fromList [1,2,3])
3
6
12
24
48
96
192
384
768
...
54043195528445952
108086391056891904
216172782113783808
432345564227567616
864691128455135232
1729382256910270464
3458764513820540928
*** Exception: Tree.linkL: impossible

BurningWitness · June 21, 2024, 3:21pm

True, forgot you can share memory in these things.

meooow · June 21, 2024, 3:44pm

Looking at the benchmarks, it seems there is quite a bit of room for improvement in the fingertree and containers packages.

I suspect that fingertree has room for improvement, but containers is already well optimized. A straightforward thing I see in fingertree is that all the functions requiring a Measured constraint could benefit from being INLINABLE. I have not looked deeper into it though.

They both claim to use the same implementation but containers is almost always faster, except for the append benchmark.

As mentioned above the benchmarks, it is good to be cautious of the measurements for lazy structures (perhaps I should make it clearer). Looking at the source code, I can see that append is strict for containers but lazy for fingertree. So the time for fingertree cannot directly be compared, it has created thunks which will be forced later.

I also think it would be good to have benchmarks that are centered on use cases rather than the API

This is certainly a good idea. If someone would point me to some good use cases for sequences, I will add them.
As an example, a simple use case for PQueue is sorting a list, and I added it to the benchmarks when it occurred to me.

A very simple example would be using the sequence as a queue. That sounds like a use case where containers would perform better than seqn, because it extensively accesses both ends.

containers will certainly be better than seqn for this, because the complexity is better for the relevant operations. The cons/snoc/uncons/unsnoc many benchmarks make this clear. Though arguably, if one wants a queue, they should use a queue and not a sequence.

noinia · June 21, 2024, 5:13pm

Very nice! I’ve wanted measureable sequences several times (and when I needed them I usually ended up using some combination of Data.Set, its internals, and unsafeCoerce). So having something packaged up in a nice/safer (and hopefully also efficient/competitive) is awesome!

meooow · June 22, 2024, 5:38am

Out of curiosity, did fingertree not work for these use cases?

noinia · June 22, 2024, 7:20pm

I was/am interested in ordered sequences for which the ordering can change over time [1]. I probably could have used the fingertree package, but my two reasons for wanting to use Data.Set were (1) better constant factors, and (2) Data.Set actually already had all the right functions (e.g. split, lookupGE, etc) already (i.e. I didn’t really want to reimplement all of that.

[1] As an example application: we have some set of lines in the plane, and you want to answer point location queries. One way of doing that is using a sweep line: we sweep (=move) some vertical line from left to right over the plane, while maintaining the order in which the lines intersect the vertical sweep line. When the sweep line passes over an intersection point of two lines, the ordering only changes locally (so we can just swap the two lines and that’s it). To answer a point location query at (x,y): you find the version of the data structure at “time” x, and look up the line directly above x in the vertical order using lookupGE. This gives you O(log n) queries using O(n^2 log n) space (which is very simple, and near optimal).

As an aside: As Jaror mentioned; it would probably be nice if there was some backpacked version of fingertree to reduce constant factors. It’s a shame the developer UI for backpack is kind of horrible though (i.e. having to define a zillion of different sublibraries + some backpack related GHC bugs (for example: last time I checked it was not possible to define that some type in a signature implements a type family).