Implementing type-safe heterogeneous collections

Ah, I was misleading you. Sorry, my bad.

(.=.) :: Label l -> v -> Tagged l v 

newtype Tagged s b = Tagged { unTagged :: b } deriving ...

(Gleaned from various places on Hackage. Hoogle seems not to have it. OTOH I’m on the road, without proper access to usual checks.)

So the label is a phantom. The left-hand operand of .=. is used for its type only.
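To make the phantom point concrete, here is a minimal self-contained sketch. The definitions of Label and Tagged below are my own stand-ins written for illustration (restricted to Symbol-kinded labels), not copied from any package, and nameLabel / nameField are hypothetical names.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}

import GHC.TypeLits (Symbol)

-- Stand-in definition (an assumption): a label holds no data at all,
-- its only job is to pin down the type-level index l.
data Label (l :: Symbol) = Label

-- The tag l is a phantom: it appears in the type but not in the value.
newtype Tagged (l :: Symbol) v = Tagged { unTagged :: v }
  deriving (Show, Eq)

-- The left operand is ignored at the value level; only its type matters.
(.=.) :: Label l -> v -> Tagged l v
_ .=. v = Tagged v

-- Hypothetical label and field, for illustration only.
nameLabel :: Label "name"
nameLabel = Label

nameField :: Tagged "name" String
nameField = nameLabel .=. "Ada"
```

Since the first argument is never inspected, any value of type Label "name" would do; the type checker is the only consumer of the label.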

I’ve considered making a data frame library in the past but I’ve never got round to it. If I were doing it, I’d build it as an arbitrary collection of Vectors wrapped in abstract data types that enforce the invariant that all the Vectors have the same length. So, for example:

newtype Field a = HiddenMkField (Vector a)

newtype DataFrame a = HiddenMkDataFrame a
    deriving Functor

fromVector :: Vector a -> DataFrame (Field a)
fromVector = HiddenMkDataFrame . HiddenMkField

zip ::
  DataFrame (Field a) ->
  DataFrame (Field a) ->
  DataFrame (Field a, Field a)
zip = ... check same length and put in pair...

All functions that operate on Fields must preserve the invariant.
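As a rough sketch of how the elided zip could enforce the invariant: the version below is my own guess, not a finished design. It assumes the vector package, renames the function to zipFields (hypothetical) to avoid clashing with Prelude's zip, lets the two fields have different element types, and reports a length mismatch with Maybe, which the original signature does not do. mapField is likewise a hypothetical helper. In a real module the HiddenMk constructors would not be exported.

```haskell
{-# LANGUAGE DeriveFunctor #-}

import           Data.Vector (Vector)
import qualified Data.Vector as V

newtype Field a = HiddenMkField (Vector a)

newtype DataFrame a = HiddenMkDataFrame a
  deriving Functor

fromVector :: Vector a -> DataFrame (Field a)
fromVector = HiddenMkDataFrame . HiddenMkField

-- Hypothetical variant of zip: Nothing signals a length mismatch,
-- so a frame with columns of unequal length can never be built.
zipFields
  :: DataFrame (Field a)
  -> DataFrame (Field b)
  -> Maybe (DataFrame (Field a, Field b))
zipFields (HiddenMkDataFrame fa@(HiddenMkField va))
          (HiddenMkDataFrame fb@(HiddenMkField vb))
  | V.length va == V.length vb = Just (HiddenMkDataFrame (fa, fb))
  | otherwise                  = Nothing

-- Hypothetical helper: an element-wise map keeps the length unchanged,
-- so it is safe to expose to users.
mapField :: (a -> b) -> Field a -> Field b
mapField f (HiddenMkField v) = HiddenMkField (V.map f v)
```

Something like V.filter applied to a single Field would have to stay internal, since it changes the length of one column independently of the others.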

This approach has the nice property that you can be very specific about the fields that are in your data frame, for example

data MyRecord = MyRecord 
  { count :: Field Int,
    name :: Field String,
    weight :: Field Double
  }

and you can also be completely dynamic about the collection of Fields and the types of the Fields, for example, DataFrame (Field Dynamic, Field Dynamic), or DataFrame (Map String Dynamic).

In order to write polymorphic functions over DataFrames you can use ProductProfunctors in roughly the same way that Opaleye does.

Speaking for myself, I find the Frames library exceedingly intimidating to use.

There’s a tutorial, sure. But, for example, the tutorial doesn’t show how to melt the frame, so I would have to fall back on the documentation entry for melt. I can’t imagine someone picking up Frames as a beginner.

Basically, I would be lost when not on the happy path with Frames.

Sure, that looks much like a regular list definition, but it’s isomorphic to – and has exactly the same memory layout as – this definition:

type family HList2 (xs :: [Type]) where
  HList2 '[] = ()
  HList2 (x ': xs) = (x, HList2 xs)

foo :: HList2 [Int, Bool]
foo = (1, (True, ()))
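For anyone who wants to load this into GHCi, here is a sketch with the extensions and import it needs. The GADT HList below is my assumption about the list-like definition being compared against, and toPairs is a hypothetical name; it only shows one direction of the correspondence.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)

-- Assumed GADT formulation: the type index records every element type.
data HList (xs :: [Type]) where
  HNil  :: HList '[]
  HCons :: x -> HList xs -> HList (x ': xs)

-- The nested-pair formulation from the post.
type family HList2 (xs :: [Type]) where
  HList2 '[] = ()
  HList2 (x ': xs) = (x, HList2 xs)

-- One direction of the correspondence: each HCons becomes a pair,
-- HNil becomes unit.
toPairs :: HList xs -> HList2 xs
toPairs HNil         = ()
toPairs (HCons x xs) = (x, toPairs xs)

foo :: HList2 '[Int, Bool]
foo = toPairs (HCons 1 (HCons True HNil))
```

Going the other way (from nested pairs back to HList) needs the shape of xs to be known, typically via a type class over the index list, which I have left out here.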

You might object that the regular list datatype can also be represented as nested pairs and you’d be right, but the regular list has a single parameter type and that makes a world of difference in what you can do with it. It makes no sense to speak of an infinite HList for example, because its type would also be infinite and GHC wouldn’t allow it.


I believe a dataframe library which made judicious use of dynamic types for the columns would be an interesting part of the design space to explore. It would avoid the usual pitfalls when trying to describe the columns fully statically: inscrutable types, eternal compilation times… while enabling new use cases, like interactive definitions of frames and pipelines.

The trick would be to avoid being too dynamic. For complex chains of operations that might take a while to complete, some kind of value-level analysis could be performed in order to be able to “fail fast” before running the operations. It would still fail at runtime, but it would fail early, and that might be an acceptable compromise.
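A very rough sketch of what such a "fail fast" check might look like, using only base and containers. Everything here (Frame, Requirement, checkRequirements, exampleFrame) is a hypothetical name made up for illustration, and columns are plain lists stored as Dynamic to keep the example small.

```haskell
import           Data.Dynamic  (Dynamic, dynTypeRep, toDyn)
import           Data.Map      (Map)
import qualified Data.Map      as Map
import           Data.Proxy    (Proxy (..))
import           Data.Typeable (TypeRep, typeRep)

-- Hypothetical dynamic frame: column name -> whole column stored as Dynamic.
type Frame = Map String Dynamic

-- What a pipeline step expects: a column name and the type of that column.
type Requirement = (String, TypeRep)

-- Validate every requirement up front and collect all problems,
-- so a long pipeline is rejected before any real work starts.
checkRequirements :: Frame -> [Requirement] -> Either [String] ()
checkRequirements frame reqs =
  case concatMap check reqs of
    []   -> Right ()
    errs -> Left errs
  where
    check (name, expected) =
      case Map.lookup name frame of
        Nothing -> ["missing column: " ++ name]
        Just col
          | dynTypeRep col == expected -> []
          | otherwise ->
              [ "column " ++ name ++ ": expected " ++ show expected
                ++ ", found " ++ show (dynTypeRep col) ]

-- Hypothetical example data.
exampleFrame :: Frame
exampleFrame = Map.fromList
  [ ("count", toDyn [1, 2, 3 :: Int])
  , ("name",  toDyn ["a", "b", "c"])
  ]
```

A pipeline would declare its requirements as data, for example ("count", typeRep (Proxy :: Proxy [Int])), and run checkRequirements once before executing any step.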

And, likely, it would still be possible to integrate islands of “static types” in dynamic pipelines, if one so wished.
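One way such a static island could look, again as a sketch under my own assumptions: Frame, column, meanWeight and exampleFrame are hypothetical names, and columns are lists stored as Dynamic. The dynamic lookup happens once at the boundary, and everything after it is ordinary statically typed code.

```haskell
import           Data.Dynamic  (Dynamic, fromDynamic, toDyn)
import           Data.Map      (Map)
import qualified Data.Map      as Map
import           Data.Typeable (Typeable)

-- Hypothetical dynamic frame: column name -> whole column stored as Dynamic.
type Frame = Map String Dynamic

-- Boundary between the dynamic and the static world: the cast either
-- yields a properly typed column or fails with Nothing.
column :: Typeable a => String -> Frame -> Maybe [a]
column name frame = Map.lookup name frame >>= fromDynamic

-- A "static island": once the column is extracted, the rest is plain
-- typed Haskell with no further runtime type checks.
meanWeight :: Frame -> Maybe Double
meanWeight frame = do
  ws <- column "weight" frame
  if null ws
    then Nothing
    else Just (sum ws / fromIntegral (length ws))

-- Hypothetical example data.
exampleFrame :: Frame
exampleFrame = Map.fromList [("weight", toDyn [1.0, 3.0 :: Double])]
```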
