New library: shamochu “Shuffle and merge overlapping chunks” lossless compression

I have developed a new library, shamochu, short for “Shuffle and merge overlapping chunks” lossless compression.

The idea for the compression is:

  1. Split the input array into chunks of a given size.
  2. Rearrange the order of the chunks to maximize the overlap between consecutive chunks.
  3. Create a data table with the reordered chunks and an index table that maps the original chunk index to its offset in the data table.

Then the data can be accessed in O(1) via a few bitwise operations and indexing two arrays. The same operation can then be applied to the index table and may lead to further compression.
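Concretely, the decoding side might look like the following sketch. This assumes a power-of-two chunk size; lookupOriginal, indexTable, dataTable and logChunkSize are illustrative names for this post, not shamochu’s actual API.

import Data.Bits (shiftL, shiftR, (.&.))
import Data.Vector.Unboxed (Vector, (!))
import Data.Word (Word8)

-- Hypothetical O(1) lookup: 'dataTable' holds the merged chunks and
-- 'indexTable' maps an original chunk index to its offset in 'dataTable'.
lookupOriginal :: Vector Int -> Vector Word8 -> Int -> Int -> Word8
lookupOriginal indexTable dataTable logChunkSize i =
  let chunk  = i `shiftR` logChunkSize                -- original chunk index
      inner  = i .&. ((1 `shiftL` logChunkSize) - 1)  -- position inside the chunk
      offset = indexTable ! chunk                     -- chunk start in the data table
  in  dataTable ! (offset + inner)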

Further details can be found on the corresponding Hackage page. This work took inspiration from
“Fast character case conversion… or how to really compress sparse arrays” by Alexander Pankratov.

This works quite well on e.g. the blobs in unicode-data: more than 90% space saving! See the corresponding PR.

But I wonder: does this algorithm have a name? Is there already an implementation in Haskell? I could not find much information apart from the page I linked. If someone knows, please share! The current implementation is okay for a POC, but slow; knowing the name would make it possible to find literature and thus improve it.

16 Likes

shamochu-0.1.0 released!

I cannot edit the original message; could a moderator fix the link?

You should be able to edit it by clicking the three dots and then the pencil symbol at the bottom of your post, but I’ve done it for you now.

1 Like

Thanks! That is what I reached for… but there is none. Permission issue? Is there a threshold on post count before being able to edit them?

[EDIT] I can edit my comment though.

I see, it seems you were at trust level 1, which means you can’t edit posts after 24 hours. The only thing you lacked to get to level 2 is replying to at least 3 different topics (that you didn’t create yourself). But I’ve manually upgraded you, so you should now be able to modify your post further if you wish.

1 Like

What algorithm do you use to optimize the chunk overlap?

  1. Chunk input data with the given size.
  2. Create an empty sequence of overlapped chunks. We are going to grow a sequence of overlapped chunks with the best move at each iteration, keeping track of the original chunk indices.
  3. Compute all possible moves:
    a. The pair of remaining chunks with the best overlap;
    b. The pair of a remaining chunk and an overlapped chunk with the best overlap;
    c. The pair of overlapped chunks with the best overlap.
  4. Apply the best move (highest overlap).
  5. If no move is possible or the best overlap is 0, go to 6; else go to 3.
  6. Append the remaining chunks, if any. Then merge the entire sequence.
  7. Create the index array and apply the same algorithm on it. Then compute the memory size of both solutions (one and two stages) and keep the best.
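For the curious, here is a minimal sketch of the greedy overlap-and-merge idea behind steps 2–6. It is my own simplification, not shamochu’s implementation: it ignores the index bookkeeping and the distinction between remaining and already overlapped chunks.

import Data.List (isPrefixOf, maximumBy)
import Data.Ord (comparing)

-- Length of the longest suffix of xs that is also a prefix of ys.
overlap :: Eq a => [a] -> [a] -> Int
overlap xs ys = head [ n | n <- [m, m - 1 .. 0]
                         , drop (length xs - n) xs `isPrefixOf` ys ]
  where m = min (length xs) (length ys)

-- Merge two chunks given their overlap length.
merge :: Int -> [a] -> [a] -> [a]
merge n xs ys = xs ++ drop n ys

-- Repeatedly merge the pair of chunks with the largest overlap until a
-- single merged sequence remains (naive and slow, for illustration only).
greedyMerge :: Eq a => [[a]] -> [a]
greedyMerge []  = []
greedyMerge [c] = c
greedyMerge cs  = greedyMerge (merge n x y : rest)
  where
    (n, i, j, x, y) =
      maximumBy (comparing (\(o, _, _, _, _) -> o))
        [ (overlap a b, ia, ib, a, b)
        | (ia, a) <- zip [0 :: Int ..] cs
        , (ib, b) <- zip [0 ..] cs
        , ia /= ib ]
    rest = [ c | (ic, c) <- zip [0 ..] cs, ic /= i, ic /= j ]

For instance, greedyMerge [[1,2,3],[2,3,4],[3,1,2]] produces the length-5 superstring [3,1,2,3,4].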

It is not possible to know the best chunk size in advance, so the real algorithm iterates over the various chunk sizes the user provides (for the input data and the index array) and keeps the best solution at each iteration.

The current implementation is slow but okay for my current use case: compressing bitmaps of Unicode character properties (length ≈ 2e5 for the relevant data, repetitive patterns, result used as static data in a Haskell source file) while still being able to access the original data directly in O(1) (a few bitwise operations and array indexing).

This algorithm does not guarantee the best result, but it is good enough. I guess we could encode this as input for real optimization software, but I had fun coding this in a short time. I have not found literature to help optimize the current algorithm, so I would gladly welcome any insight.

1 Like

I don’t know, but I love the name Shamochu! :smiley:

1 Like

Thank you! It does sound like a lot of work to find the optimal pairwise overlap. Since it only merges neighbouring chunks, it sounds a bit like a variation of byte pair encoding.

does this algorithm have a name?

Step 2 seems to be the shortest common superstring problem.

Since your algorithm finds common substrings (of the same orientation) in a string, and a suffix tree finds all of them, I imagine that you could derive your encoding from a suffix tree, or at least derive the optimal chunk size from the overall suffix tree structure.
You should also look up the theory behind de-novo assembly of genomes, since that is solving a very similar problem: piecing together a contiguous sequence from many overlapping fragments.

2 Likes

Thanks all for your feedback. This is why I love the Haskell community!

@ocramz Yes, the algorithm requires lots of work. I have not yet optimized it, just ensured it produces correct results. I saw no point in optimizing before checking the literature, but got stuck at that step, hence my post here. It is not exactly a variation of byte pair encoding, because we do not need a dictionary lookup: we access the data in O(1).

@meooow thanks for the pointer; it is indeed the shortest common superstring problem. “Chunk” was not a great keyword after all :sweat_smile:. It will be much easier to look for a better algorithm now.

@olf thanks for the pointers. I will need to check whether, using a suffix tree, I can easily maintain an important property: accessing the original data in O(1) with a few bitwise operations. The reference to bioinformatics is really interesting; I will add it to my list. Although searching for “shortest common superstring” seems likely to give me more actionable results.

1 Like

I think @olf’s idea was that you would first build a suffix tree and then use that to figure out the most efficient chunking strategy to do your shamochu. So you wouldn’t use the suffix tree directly as your compressed representation.

Indeed. In the example [[1,2,3],[2,3,4],[3,1,2]] given in the haddocks, the longest path in the suffix tree with at least two children has length two. So the best we can hope for is overlapping chunks of length three, which indeed seems to lead to the optimal result.
Why N+1? Because if two chunks x and y have a maximal overlap o of length N, then the best we can hope for is that (using Data.Sequence notation)

x = a <| o
y = o |> b

so that the common superstring is a <| o |> b. This kind of superstring relation is known as a contig in genome assembly. This makes me think:

  1. What if an overlap o exists in multiple chunks, but is surrounded by incompatible context? How does shamochu perform in this scenario?
  2. A Lempel-Ziv 78 parse is a special instance of shamochu, because LZ78 relies on special overlaps where one chunk is a prefix of another, and they appear in a certain order, namely, the offsets are ascending. This proves that shamochu can compress at least as well as LZ78.

Proof of claim 2:

{-# LANGUAGE ScopedTypeVariables #-}
import Data.Vector (Vector, (!))

-- An LZ78 block: an optional reference to an earlier block plus one new symbol.
data LZBlock a = LZBlock {prefix :: Maybe Int, suffix :: a}
type LZParse a = Vector (LZBlock a)
-- invariant: If block with index `j` points to another block
-- with index `i`, then `i < j`.

-- Decompress by expanding each block: follow the chain of references,
-- then append the block's own symbol.
uncompress :: forall a. LZParse a -> Vector a
uncompress blocks = blocks >>= u where
    u :: LZBlock a -> Vector a
    u (LZBlock Nothing  a) = pure a
    u (LZBlock (Just i) a) = u (blocks ! i) <> pure a
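As a small usage check (my own example, reusing the definitions above): the LZ78 parse of "abab" is three blocks, the last reusing the first as its prefix.

import qualified Data.Vector as V

-- Blocks "a", "b", "ab": block 2 points back to block 0.
exampleParse :: LZParse Char
exampleParse = V.fromList [LZBlock Nothing 'a', LZBlock Nothing 'b', LZBlock (Just 0) 'b']

-- uncompress exampleParse == V.fromList "abab"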

Now for each index i in the LZParse there is a unique largest index j such that block number j refers (transitively) to i. Let the shamochu be the concatenation of (the decompressions of) these maximal blocks. Then clearly every block can be mapped into this shamochu, and every position in the shamochu corresponds to a unique block of the LZParse.
If k is the number of blocks in the LZParse, then the vector of blocks occupies 2k words, assuming that a and Maybe Int both occupy one word each. The described shamochu occupies 3k words, which shows we have not exploited its special structure.

While constructing the LZParse one usually maintains a trie of the LZBlocks. The maximal blocks correspond to the maximal paths in this trie.
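Here is a rough sketch of how one might select those maximal blocks; this is my reading of the argument above, reusing the LZBlock/LZParse definitions from the previous snippet. A block is maximal when no other block uses it as a prefix, i.e. it is a leaf of the LZ78 trie.

import qualified Data.Vector as V
import Data.Maybe (mapMaybe)

-- Indices of blocks that no other block points to; the proposed shamochu is
-- the concatenation of the decompressions of these blocks.
maximalBlocks :: LZParse a -> [Int]
maximalBlocks blocks =
  [ j | j <- [0 .. V.length blocks - 1], j `notElem` referenced ]
  where referenced = mapMaybe prefix (V.toList blocks)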

1 Like

By the way, if you do the same with a shamochu, then it becomes a compressed full-text index, which is a very interesting data structure for bioinformatics. A commonly used structure of this kind is the FM-index but the LZ78 parse can also be used for full-text search.

I happened to have this tab open next to this one, and thought you might enjoy some of the similar ideas: Size Optimization Tricks

Aha, so it might be worthwhile to carve out an abstraction here. What shamochu’s CompressedArray and assembly can have in common is:

  • a linear sequence of tokens/instructions (the array),
  • a Foldable instance (at least morally) that defines a toList traversal, the actual control flow,
  • jump instructions that can cause toList to visit a token multiple times.

So to make CompressedArray more like assembly or Lempel-Ziv, we could embed the tokens in array and the (offset, size) pairs built from offsets and sizes in one common instruction stream. @Wismill: there ought to be an invariant that these arrays have the same length.

jumps :: CompressedArray a -> Vector (Int,Int)
jumps ca = Vector.zip (offsets ca) (sizes ca)

These have very similar semantics to the length-distance pairs in LZ77 compression. As I understand it, the difference is that the offsets in LZ77 refer to positions in the uncompressed stream, not the compressed array. LZ77 and LZSS interleave literals and jump instructions in one instruction stream. The streaming property of LZ77 says that a jump instruction can only refer to an earlier position in the output stream.
Interleaving jumps and array data might destroy the O(1) access property of the current shamochu, though.
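To make that last point concrete, here is a small sketch of what such an interleaved LZ77/LZSS-style stream could look like; Instruction, Literal, Copy and decode are illustrative names, not shamochu’s API. Decoding has to walk the whole stream, which is exactly what threatens O(1) random access.

-- Either emit a literal token, or copy 'len' tokens starting 'dist'
-- positions back in the output produced so far.
data Instruction a = Literal a | Copy Int Int

-- Assumes the streaming property: each Copy refers only to output
-- that has already been produced.
decode :: [Instruction a] -> [a]
decode = go []
  where
    go out []                     = out
    go out (Literal a     : rest) = go (out ++ [a]) rest
    go out (Copy dist len : rest) =
      -- 'cycle' handles overlapping copies where len > dist
      let src = drop (length out - dist) out
      in  go (out ++ take len (cycle src)) rest

-- decode [Literal 'a', Literal 'b', Copy 2 4] == "ababab"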