Transparently implement data types differently and more compactly

I’m sure someone has had this idea before, but who knows:
Ever since I learned how data types are represented in Haskell (shameless plug: if you don’t know, maybe watch my latest video playing around with ghc-vis), I have wondered whether I wouldn’t rather want something more memory-compact than that in some cases.

And now I wonder how well it would work to have, say, a library that you feed a data type definition (via TH), and it produces a type definition, a set of pattern synonyms, and a COMPLETE pragma, so that:

  • In terms of programming, it’s indistinguishable from the plain data type.
  • In terms of semantics, it’s indistinguishable (maybe require all constructor fields to be strict for that).
  • But it’s backed by a very different implementation. For example, all values could be a ByteArray with a very compact encoding of the value, so that even a nested, complex value fits into a few bytes. Or, if the type isn’t recursive and the encoding thus has bounded size, the implementation could be just one or a few Word64#.
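To make the idea concrete, here is a hand-written sketch of what such a library might generate. The `Colour` type and its `Int` encoding are made up for illustration; a real TH library would derive all of this from the original data declaration.

```haskell
{-# LANGUAGE PatternSynonyms #-}

-- Morally:  data Colour = Red | Green | Blue
-- but backed by a plain Int instead of three heap-allocated constructors.
newtype Colour = Colour Int
  deriving (Eq, Show)

pattern Red :: Colour
pattern Red = Colour 0

pattern Green :: Colour
pattern Green = Colour 1

pattern Blue :: Colour
pattern Blue = Colour 2

-- Tell the exhaustiveness checker that these three patterns cover the type:
{-# COMPLETE Red, Green, Blue #-}

-- Client code is indistinguishable from matching on a plain data type:
next :: Colour -> Colour
next Red   = Green
next Green = Blue
next Blue  = Red
```

The COMPLETE pragma is what keeps pattern-match coverage warnings working as if this were an ordinary sum type.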

Depending on the precise implementation, I could imagine that pattern matching stays about as fast as it is now (read a byte at a specific offset). Or maybe projections can be done by slicing the compact binary representation, and thus barely allocate.

WDYT? Has this been tried before? What encoding might be the best?

The idea comes partly from seeing GHC developers use short-hand descriptions like 1!P(L,L) for values that would be rather large in memory, and I wonder if maybe GHC itself should use a similar encoding for more memory efficiency.

Oh, and actually, looking at related code I found a very similar idea: the Card type looks like an enumeration, and can be used like one thanks to pattern synonyms, but it’s actually just an Int.
Although there the underlying representation is meaningful as a bitvector, and some operations are implemented directly on it, so it’s a hand-crafted application of the idea rather than a generic tool. Smells like @sgraf to me … ah yes, git blame confirms that :slight_smile:
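For readers who haven’t seen that trick: here is a hand-rolled miniature in the same spirit (this is *not* GHC’s actual Card, just an illustration with a made-up `Usage` type whose bit encoding is chosen so that the lattice join is a plain bitwise or).

```haskell
{-# LANGUAGE PatternSynonyms #-}
import Data.Bits ((.|.))

-- An "enumeration" whose representation is a 2-bit vector.
newtype Usage = Usage Int
  deriving (Eq, Show)

pattern Absent, Once, Many :: Usage
pattern Absent = Usage 0  -- 00: not used
pattern Once   = Usage 1  -- 01: used at most once
pattern Many   = Usage 3  -- 11: used any number of times
{-# COMPLETE Absent, Once, Many #-}

-- Because the encoding is order-compatible, the least upper bound
-- is just bitwise or on the underlying Int:
lubUsage :: Usage -> Usage -> Usage
lubUsage (Usage a) (Usage b) = Usage (a .|. b)

-- Ordinary pattern matching still works:
isUsed :: Usage -> Bool
isUsed Absent = False
isUsed _      = True
```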

7 Likes

That said, it’s actually not so clear what would be a good compact binary format for generic data. Yes, one can use fewer bytes (or even bits) for the tag of a sum. But for the actual constructor fields, if projection to the second field should not have to decode the first field, one still needs a bit of metadata overhead.

1 Like

I explored this for an AST: you load it from disk in a compact representation and use pattern synonyms to deconstruct it. That way you keep it as one big flat ByteString. Your “interpreter” is both a byte-code interpreter and an AST interpreter; the lines are blurred. But it means no pointer chasing and good L1 cache locality. I considered writing some TH to do it for me, but I never got far enough to need such an efficient representation.
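A toy version of that blurred-lines interpreter might look like this. The `Expr` type, its tag bytes, and the prefix encoding are all invented for the sketch; the point is that evaluation walks the flat bytes directly, with no pointer-shaped heap AST in sight.

```haskell
import qualified Data.ByteString as BS
import Data.ByteString (ByteString)
import Data.Word (Word8)

-- Morally:  data Expr = Lit Word8 | Add Expr Expr
-- serialized prefix-style: tag 0 = Lit (followed by one literal byte),
-- tag 1 = Add (followed by the two sub-expressions).
newtype Expr = Expr ByteString

lit :: Word8 -> Expr
lit n = Expr (BS.pack [0, n])

add :: Expr -> Expr -> Expr
add (Expr a) (Expr b) = Expr (BS.concat [BS.singleton 1, a, b])

-- The "interpreter" is byte-code and AST interpreter at once:
-- it returns the value and the number of bytes consumed.
evalAt :: ByteString -> (Int, Int)
evalAt bs = case BS.head bs of
  0 -> (fromIntegral (BS.index bs 1), 2)
  _ -> let (x, n) = evalAt (BS.drop 1 bs)
           (y, m) = evalAt (BS.drop (1 + n) bs)
       in (x + y, 1 + n + m)

eval :: Expr -> Int
eval (Expr bs) = fst (evalAt bs)
```

Pattern synonyms (with view patterns) over `Expr` could then recover the familiar `Lit`/`Add` matching syntax on top of this.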

2 Likes

Did you already decide what kind of compact representation you’d use? In the end, it probably has pointers too, just relative ones, wouldn’t it?

I think Gibbon and Sixten are related projects.

2 Likes

I think the serialised representation has a number of great advantages. But it also has a couple of disadvantages:

  • Only applicable to “compact regions”, i.e., no outward pointers from the graph. Unless you want to confuse the Garbage Collector…
  • Update implies a copy of the whole ByteArray, unless you have a smarter representation in terms of ByteArray-Mapped Tries which might bring its own costs
  • No sharing of sub-graphs. E.g., if you need a sub-graph as its own box/ByteArray, you’ll have to copy. Unless there’s slicing involved…

But I actually thought about representing Demand (which is the 1!P(L,L) thing from the OP) and SubDemand as a ByteArray. The update cost shouldn’t be too high if updates are made in bulk. I just lack the time to do so.

2 Likes

As I mentioned, the Demand is actually my motivating example.

But still, how exactly would you serialize it? If you have a constructor like (,), how would you encode it so that you can implement snd without having to decode the full first element? (Or would you just do that?)

Interesting projects. It looks like they then heavily optimize the operations on the compact representation. It would be interesting to see whether such things could be made into an opt-in library in Haskell :slight_smile:

You could just store the size of any value whose size isn’t constant at compile time, just like when serializing a String or any other list. It’s a trade-off between space and time, though.
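A minimal sketch of that size-prefix idea, assuming both fields of the pair are already encoded as bytes and the first field encodes to at most 255 bytes (the one-byte length prefix is an arbitrary choice for the example):

```haskell
import qualified Data.ByteString as BS
import Data.ByteString (ByteString)

-- Layout of an encoded pair:
--   [length of fst : 1 byte][fst bytes][snd bytes]
encodePair :: ByteString -> ByteString -> ByteString
encodePair a b =
  BS.concat [BS.singleton (fromIntegral (BS.length a)), a, b]

-- snd is a slice: skip the stored length, never decode the first field.
sndSlice :: ByteString -> ByteString
sndSlice bs = BS.drop (1 + fromIntegral (BS.head bs)) bs

-- fst is a slice too, using the same prefix.
fstSlice :: ByteString -> ByteString
fstSlice bs = BS.take (fromIntegral (BS.head bs)) (BS.drop 1 bs)
```

The space cost is one prefix byte per variable-size field; the time benefit is that projection is O(1) slicing rather than a decode of everything before the field.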

How does that compare to adding {-# UNPACK #-} to all fields (recursively)?

Good question, and I am not sure I know all the answers. Unboxing sums requires a full word for the tag; one could imagine a compact format that uses just one byte, or even single bits. And I am not sure whether {-# UNPACK #-} even supports unboxing sum types; the documentation says it doesn’t. I expect this kind of unboxing can only work for fields that are bounded in size, not recursive ones.
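As an illustration of how far a compact tag can shrink a bounded-size sum: something morally like `Either Word8 Word8` (two machine words plus tags when boxed, two words even when unboxed) fits into a single Word16 if we spend just one bit on the constructor tag. The encoding here is made up for the example.

```haskell
import Data.Word (Word8, Word16)
import Data.Bits (shiftL, testBit, (.&.), (.|.))

-- Either Word8 Word8 packed into one Word16:
-- bit 15 is the constructor tag (0 = Left, 1 = Right),
-- the low 8 bits hold the payload.
packEither :: Either Word8 Word8 -> Word16
packEither (Left  x) = fromIntegral x
packEither (Right x) = (1 `shiftL` 15) .|. fromIntegral x

unpackEither :: Word16 -> Either Word8 Word8
unpackEither w
  | testBit w 15 = Right (fromIntegral (w .&. 0xFF))
  | otherwise    = Left  (fromIntegral (w .&. 0xFF))
```

Pattern synonyms over the packed representation would again hide the bit fiddling from client code.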

Don Stewart also played with something a bit related a long time ago: https://hackage.haskell.org/package/adaptive-containers