Datastructures in the vein of `compressed` - what online implementations of compressed data structures are out there? {solved}

this thing and Data.FingerTree (which fits well in the picture) are indeed unbelievably ingenious works of art <3
I’m amazed. I hope something good comes out of it

module HashRope
  ( HashRope (..)
  , HashRopeNode (..)
  , HashRopeMeasure (..)
  , HashRope.length
  , HashRope.fromList
  , HashRope.splitAt
  , HashRope.take
  )
where

import Data.ByteString (ByteString)
import Data.FingerTree (FingerTree, Measured (measure))
import Data.FingerTree qualified as FingerTree
import Data.Hash.SL2 (Hash, hash)
import Data.List qualified as List
import GHC.Generics (Generic)
import GHC.IsList (IsList)
import GHC.IsList qualified as IsList

data HashRopeMeasure = MkHashRopeMeasure
  { measureLength :: {-# UNPACK #-} !Int
  , measureHash :: {-# UNPACK #-} !Hash
  }
  deriving stock (Eq, Ord, Show, Generic)

newtype HashRopeNode = MkHashRopeNode {payload :: ByteString}
  deriving stock (Eq, Ord, Show)

instance Semigroup HashRopeMeasure where
  a <> b = MkHashRopeMeasure {measureLength = a.measureLength + b.measureLength, measureHash = a.measureHash <> b.measureHash}

instance Monoid HashRopeMeasure where
  mempty = MkHashRopeMeasure {measureLength = 0, measureHash = mempty}

instance Measured HashRopeMeasure HashRopeNode where
  measure node = MkHashRopeMeasure {measureLength = 1, measureHash = hash node.payload}

newtype HashRope = MkHashRope {unHashRope :: FingerTree HashRopeMeasure HashRopeNode}
  deriving stock (Show, Generic)
  deriving newtype (Measured HashRopeMeasure, Semigroup, Monoid)

-- | we also need to check the equality of lengths since @[""] == ([] :: 'HashRope')@
instance Eq HashRope where
  {-# INLINE (==) #-}
  r1 == r2 =
    let mr1 = measure r1; mr2 = measure r2
     in mr1.measureLength == mr2.measureLength
          && mr1.measureHash == mr2.measureHash

-- | this instance exists just for convenience of use with ordered containers
instance Ord HashRope where
  {-# INLINE compare #-}
  r1 `compare` r2 =
    let mr1 = measure r1; mr2 = measure r2
     in case mr1.measureLength `compare` mr2.measureLength of
          EQ -> mr1.measureHash `compare` mr2.measureHash
          uneq -> uneq


length :: HashRope -> Int
length r = (measure r).measureLength

fromList :: [ByteString] -> HashRope
fromList l = MkHashRope (foldMap (FingerTree.singleton . MkHashRopeNode) l)

toList :: HashRope -> [ByteString]
toList rope = foldMap (List.singleton . payload) rope.unHashRope

instance IsList HashRope where
  type Item HashRope = ByteString
  fromList = HashRope.fromList
  toList = HashRope.toList

-- >>> HashRope.splitAt 1 ["a", "b", "c"]
-- (MkHashRope {unHashRope = fromList [MkHashRopeNode {payload = "a"}]},MkHashRope {unHashRope = fromList [MkHashRopeNode {payload = "b"},MkHashRopeNode {payload = "c"}]})
splitAt :: Int -> HashRope -> (HashRope, HashRope)
splitAt i rope =
  let (!l, !r) = FingerTree.split (\m -> i < m.measureLength) $ unHashRope rope
   in (MkHashRope l, MkHashRope r)
{-# INLINE splitAt #-}

take :: Int -> HashRope -> HashRope
take i = fst . HashRope.splitAt i
{-# INLINE take #-}

This works extremely well - thank you @sclv monoidal hashes are I think really the way to go here - and FingerTree from the fingertree package is exactly designed for this - unfortunately it’s not compressed or concise, but I wonder how much sharing there actually is - maybe it’s better than expected :eyes:

1 Like

The interesting thing to observe here is that if we go from compressed to uncompressed, we lose the guarantee to be collision free - i.e. we do know that the compressed version would be completely collision free but since we “forget” that we actually came from a compressed representation, the hash based approach may now collide - if we could somehow use the reduced entropy while hashing, we could make sure that the domain would actually be collision free since we know that the preimage is small enough, if that makes sense - it’s pretty cool in general though.

If you have settled on this approach, consider using MSeq instead (I’m the author). You will see much better performance for concat and split compared to fingertree's FingerTree.

Also, it is a good idea to use functions provided by the library where possible (toList, fromList), rather than defining your own potentially less efficient version.

2 Likes

I know that I should use library functions where possible but if they did it right (which I hope), then foldMap should be faster than first using fromList and then mapping (assuming the latter doens’t fuse) :slight_smile:

Btw ~ the other structure I implemented were the so called “forests of enhanced splay trees” described in [2403.13162] A Textbook Solution for Dynamic Strings ~ both work really well, I have yet to benchmark which is faster but at the moment I think it’s the latter, although the fingertree + monoidal hashing version is a bit more elegant and pairs better with immutable datastructures.

Thank you for linking MSeq, I think I will try it out, it should be kinda trivial :slight_smile:

1 Like

@meooow is there a specific reason why your MSeq has the length built in and not as Measure? The cool thing about FingerTree (afaiu) is that all of the Monoids you can attach are kinda modular via the Measure and your operations just maintain a cached measure :slight_smile:

~~ that of course also means that you have to be careful because e.g. the Foldable instance is not as fast as it could be if you’d cache the Sum measure

the FeST version is actually by far faster… :stuck_out_tongue:

I prefer it being built-in, because

  1. Counting elements is such a fundamental property of a sequence.
  2. Indexing allows for more efficient operations in a simple way. Split being a fundamental operation may be elegant but it’s not very efficient when operations based on it discard some of the result.

And the choice of structure (weight balanced tree) gives us these for free, in a way.

Neat! Would be interesting to see your benchmarks, if you have time to put them up somewhere.

1 Like

don’t have benchmarks yet but it’s 5 - 10x on randomly generated inputs. and it can probably be even improved more. absolutely insane how it went from DNF for Vectors to 0.7 seconds on 1000 random samples from quickcheck.