Data structures in the vein of `compressed` - what online implementations of compressed data structures are out there? {solved}

Hi!

I am currently looking into online compressed data structures. A library that I’ve found is Kmett’s compressed (@hackage/compressed), which e.g. offers an LZ78 implementation that can be compressed with methods of differing efficiency and decompressed into some Monoid. However, it doesn’t allow the typical operations on strings while maintaining the compression, i.e. operating “online”.

This is important when we start from a generator (of the string) which can build a string of exponential size, but which we know can in practice be compressed to something polynomial in the input size.

A scheme that makes this possible is described e.g. in “Maintaining dynamic sequences under equality tests in polylogarithmic time” (Algorithmica), but it’s not very approachable and a very old paper, so I was wondering whether there are more modern approaches to the topic and/or implementations, e.g. stemming from bioinformatics research, which historically also seemed to be a big topic in the Haskell ecosystem.

The operations that I’m particularly interested in being possible in time polynomial in the compressed input (and hence polylogarithmic in the uncompressed input) are

  • equality
  • concatenation
  • splitting at a particular index

Note that equality in sublinear time can only work at all if the input is compressed.

Thanks in advance for your input!

5 Likes

do “succinct” data structures count (heh) as compressed?

https://haskell-works.github.io/posts/2018-08-22-pdep-and-pext-bit-manipulation-functions.html

1 Like

yes, I was looking at succinct data structures, but I was unsure about the prior art in Haskell. unfortunately, discoverability is low, especially in the age of decreasing usefulness of search engines.

Thank you for sharing this post. After having a first glance at it, I’m not yet sure whether it’ll help me but I will definitely have a thorough look at it.

I am also not quite sure which of them can be thought of as single sequences as described in the paper in my original post.

There also seems to be a lack of succinct data structures that support sublinear equality (linear only in the size of the compressed structure), which is one of the most important things for me.

1 Like

It’s still a very general framing - could you give some examples of the kind of data you are generating?

Also, I’m not sure I follow your meaning of “online”. Do you mean that the data does not fit in memory and you can only stream it through?

1 Like

Yes, so basically:

  • I build a string
  • this string is known to have low enough entropy that it can be compressed by up to an exponential factor
  • I later want to do operations on such strings
    • equality
    • splitting
    • concatenation

These should all be at most polynomial in the size of the compressed version of the strings.

I know this is possible (see “Maintaining dynamic sequences under equality tests…”), but it’s not trivial, because we need to exploit the low entropy of the strings to get operations sublinear in their expanded size.

Most importantly, leaving the strings expanded does not work: e.g. equality is then at least linear.

1 Like

btw, I also wanted to note that there’s more vaguely related Kmett material, cf. this blog post:
https://www.schoolofhaskell.com/user/edwardk/moore/for-less#compressive-parsing
I sadly don’t understand this enough to make it less of a vague connection.

Former bioinformatician here. The features you are asking for are unusual operations from a biological viewpoint, so I doubt that any of the high-performance succinct data structures have you covered. The bioinformatician usually knows what the source of a compressed index was, so equality tests never come to mind. That said,

  1. in my days we used bgzip from the samtools library, which relies on the fact that the concatenation of two gzip files is again valid gzip.
  2. At least with compression schemes like Lempel-Ziv, I don’t see how you could test for equality without walking the entire LZ parse - but maybe I have poor imagination. LZ is deterministic, whence two equal strings will result in equal sequences of LZ blocks, so it would suffice to compare those. This breaks down, however, when concatenating two sequences of LZ blocks that don’t otherwise share their backreferences.
  3. The BLAST tool maintains a data structure that can highlight similarities in many sequences in parallel, and I’m pretty certain that they don’t rebuild the entire structure when adding a new reference sequence. Hence it might be worthwhile looking into that.
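Point 2 can be made concrete with a toy LZ78 coder (a sketch of the general technique - `lz78`/`unlz78` are names made up here, not the API of the `compressed` package): because the parse is deterministic, equal strings produce identical token sequences, while naively concatenating two token sequences is not the parse of the concatenated string.

```haskell
import qualified Data.Map.Strict as M

-- Toy LZ78: each token is (back-reference to an earlier phrase, Maybe next
-- symbol). Phrase 0 is the empty phrase; a trailing partial phrase is
-- flushed as (i, Nothing).
lz78 :: Ord a => [a] -> [(Int, Maybe a)]
lz78 = go M.empty 0 0
  where
    go _ cur _ []
      | cur == 0  = []
      | otherwise = [(cur, Nothing)]                  -- flush partial phrase
    go dict cur n (x:xs) =
      case M.lookup (cur, x) dict of
        Just nxt -> go dict nxt n xs                  -- keep extending phrase
        Nothing  -> (cur, Just x)
                  : go (M.insert (cur, x) (n + 1) dict) 0 (n + 1) xs

-- Decode by rebuilding each phrase from its back-reference; the lazy
-- self-reference works because every reference points to an earlier phrase.
unlz78 :: [(Int, Maybe a)] -> [a]
unlz78 toks = concat phrases
  where
    phrases = [ (table !! i) ++ maybe [] (: []) mx | (i, mx) <- toks ]
    table   = [] : phrases
```

Here `lz78 "abab"` gives `[(0, Just 'a'), (0, Just 'b'), (1, Just 'b')]`, whereas `lz78 "ab" ++ lz78 "ab"` gives `[(0, Just 'a'), (0, Just 'b'), (0, Just 'a'), (0, Just 'b')]` - equal parses do witness equal strings, but concatenation of parses is not the parse of the concatenation.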

An approach I know less about is grammar-based compression. Perhaps one can efficiently compare and concatenate grammars.

3 Likes

I’m wondering if you could use reduced binary decision diagrams to compress arbitrary data. Checking equivalence is linear in the compressed size, because the compression is a bijection. Splitting at a particular index should also be quite cheap. The problematic part would probably be concatenation, although it seems doable. Also, all the pointers mean that it will only really be effective if there is a lot of duplication in the input.

1 Like

Yes, the original paper I linked is part of the commonly quoted grammar based compression literature. :slight_smile:

Thank you for your ideas, I will definitely have a look!

well, all learning is a form of (lossy) compression so why not :slight_smile:

Although it would not be as cool as a bespoke data structure probably would be, a finger tree storing compressed blocks and measured by the sum of their uncompressed sizes might work well.
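For intuition, here’s a minimal self-contained sketch of that idea (the real `fingertree` package does this properly with far better asymptotics; `Rope`, `cat` and `splitRope` are names invented for this sketch): leaves stand for compressed blocks carrying a cached uncompressed size, concatenation is O(1), and splitting by uncompressed index never inspects block contents.

```haskell
-- A size-annotated rope over opaque "compressed" blocks. In a real version
-- the payload would be compressed bytes; a String stands in here.
data Rope b = Leaf Int b                 -- cached uncompressed size, block
            | Node Int (Rope b) (Rope b) -- cached total uncompressed size
  deriving Show

size :: Rope b -> Int
size (Leaf n _)   = n
size (Node n _ _) = n

-- O(1) concatenation: just cache the combined size.
cat :: Rope b -> Rope b -> Rope b
cat l r = Node (size l + size r) l r

blocks :: Rope b -> [b]
blocks (Leaf _ b)   = [b]
blocks (Node _ l r) = blocks l ++ blocks r

-- Split at uncompressed index i, at block granularity: the block that
-- contains index i goes to the right side wholesale. Splitting *inside*
-- a block would require decompressing (and recompressing) that block.
splitRope :: Int -> Rope b -> ([b], [b])
splitRope i t
  | i <= 0      = ([], blocks t)
  | i >= size t = (blocks t, [])
splitRope i (Node _ l r)
  | i < size l = let (a, b) = splitRope i l in (a, b ++ blocks r)
  | otherwise  = let (a, b) = splitRope (i - size l) r in (blocks l ++ a, b)
splitRope _ (Leaf _ b) = ([], [b])
```

For example, `splitRope 5 (cat (Leaf 3 "abc") (cat (Leaf 2 "de") (Leaf 4 "fghi")))` splits cleanly between blocks as `(["abc","de"], ["fghi"])`, while index 4 falls inside `"de"` and sends that whole block to the right.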

1 Like

I don’t think it’s that easy - can you perhaps explain why you think it might work?

Well, this is just an efficient sequence of compressed blocks, so it seems like it will give good splitting and concatenation properties. Equality might take a bit more work.

That said, in fact, splitting and concatenation can already be done in sublinear time with a fingertree style structure – and you can just annotate by hash to get typically sublinear equality as well.

1 Like

not really - it is not clear that the compression is consistent at block boundaries; you have to do some recompression there (cf. Artur Jeż’s papers). annotating by hash itself is only possible if you can make sure that the structure is canonical, which is indeed a way to go, but it’s not completely clear. I’m currently implementing a new paper which looks very promising, I will keep you posted :slight_smile:

1 Like

i don’t understand why we would need consistent compression at block boundaries if all we want is splitting and concatenation? (and note that my claim was in fact stronger – that splitting, concatenation and equality can all be performed sublinearly even without compression).

additionally i don’t think one needs to make sure the structure is canonical to annotate by hash. rather what one needs is that the hash is of the necessarily canonical underlying data, and that it is a hash which can be monoidally appended while preserving this property. i may be missing something, but i think such techniques are well known.

I also need equality; splitting and concatenation are not enough…

That being said, yes, there are solutions for this on uncompressed data - in fact, I’m implementing one right now :slight_smile:

Right, but as i suggested, you can accomplish sublinear equality without consistent compression at block boundaries (or need for recompression) as long as you have a monoidally normal hash of the underlying (uncompressed) data.

a quick google finds someone who did some literature search for this for basically the reason we’re discussing here @wabbit.one - Monoidal Hashing
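For a flavour of how such a monoidal hash works, here is a sketch in the Tillich–Zémor / SL2 style that hwsl2 builds on (the modulus and generators below are arbitrary illustration choices, not hwsl2’s actual parameters - hwsl2 works over a large binary field): each bit maps to one of two matrices, the hash of a string is their product, and associativity of matrix multiplication makes the hash a monoid homomorphism, so tree annotations combine without touching the underlying data.

```haskell
import Data.Bits (testBit)
import Data.Char (ord)

-- 2x2 matrices over Z_p, row-major; p is an arbitrary prime for this sketch.
type M2 = (Integer, Integer, Integer, Integer)

p :: Integer
p = 2147483647

mul :: M2 -> M2 -> M2
mul (a,b,c,d) (e,f,g,h) =
  ( (a*e + b*g) `mod` p, (a*f + b*h) `mod` p
  , (c*e + d*g) `mod` p, (c*f + d*h) `mod` p )

identityM :: M2
identityM = (1, 0, 0, 1)

-- One generator per bit value. These two matrices generate a free monoid
-- in SL2(Z), so short distinct inputs get distinct hashes.
genA, genB :: M2
genA = (1, 1, 0, 1)   -- bit 1
genB = (1, 0, 1, 1)   -- bit 0

-- Hash one character via its low 8 bits (fine for this ASCII illustration).
charM :: Char -> M2
charM c = foldl mul identityM
  [ if testBit (ord c) i then genA else genB | i <- [7,6..0] ]

-- The key property: hashM (xs ++ ys) == hashM xs `mul` hashM ys,
-- by associativity of mul. That is exactly what lets a tree annotated
-- with hashes at the nodes support equality checks on the roots.
hashM :: String -> M2
hashM = foldl mul identityM . map charM
```

Because matrix multiplication is non-commutative, `hashM "abc"` and `hashM "acb"` differ, unlike with a naive sum-of-character hashes.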

3 Likes

Ah - yeah, thank you, that’s really nice, I’ll see if that’s useful for me - currently I’m using a splay-tree-based data structure which maintains its hash on the nodes - but maybe this is even better, since it works well on immutable data structures. The nice thing about the splay trees is that they also give me another operation that seems quite useful… thanks :slight_smile:

hwsl2: Hashing with SL2 :eyes:

1 Like