I have developed a new library, shamochu, short for “Shuffle and merge overlapping chunks”, a lossless compression scheme.
The idea for the compression is:
- Split the input array into chunks of a given size.
- Rearrange the order of the chunks so that consecutive chunks overlap as much as possible (e.g. [1,2,3,4] followed by [3,4,5,6] share [3,4] and can be stored as [1,2,3,4,5,6]).
- Create a data table with the reordered chunks and an index table that maps the original chunk index to its offset in the data table.
The data can then be accessed in O(1) with a few bitwise operations and two array lookups, as sketched below. The same transformation can be applied to the index table itself, which may yield further compression.
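To make the O(1) access concrete, here is a minimal sketch of such a lookup in Haskell, assuming the chunk size is a power of two (2^k) and that both tables are stored as unboxed vectors. The names (`lookupByte`, `offsets`, `dataTable`) are illustrative only, not the actual shamochu API:

```haskell
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Word (Word8, Word32)
import qualified Data.Vector.Unboxed as V

-- | O(1) access into the compressed representation.
-- Hypothetical sketch; assumes chunk size = 2^k.
lookupByte
  :: Int               -- ^ k, where the chunk size is 2^k
  -> V.Vector Word32   -- ^ index table: original chunk index -> offset in data table
  -> V.Vector Word8    -- ^ data table: reordered, overlapping chunks
  -> Int               -- ^ index into the original (uncompressed) array
  -> Word8
lookupByte k offsets dataTable i =
  let chunk = i `shiftR` k                -- which original chunk the index falls in
      off   = i .&. ((1 `shiftL` k) - 1)  -- position within that chunk
      base  = fromIntegral (offsets V.! chunk)
  in dataTable V.! (base + off)
```

With a power-of-two chunk size, the chunk index and in-chunk offset come from a shift and a mask, so each access costs exactly two array indexings plus a handful of bitwise operations.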
Further details are available on the corresponding Hackage page. This work took inspiration from
“Fast character case conversion… or how to really compress sparse arrays” by Alexander Pankratov.
This works quite well, e.g. on the blobs in unicode-data: more than 90% space savings! See the corresponding PR.
But I wonder: does this algorithm have a name? Is there already an implementation in Haskell? I could not find much information apart from the page I linked, so if someone knows, please share! The current implementation is fine for a proof of concept but slow; knowing the algorithm's name would make it possible to find relevant literature and improve it.