Combining generics with higher-kinded types

Hello,

I am working on a proof-of-concept of a type-safe implementation of dataframes. In this proof-of-concept, I want to essentially take a record type, and allow it to either:

  • contain scalars, e.g. Int and String;
  • contain columns of scalars.

My starting point is type family, Column:

-- | Type family which allows for higher-kinded record types
-- in two forms:
--
-- * Single record type using `Identity`;
-- * Record type whose elements are arrays, using `Vector`.
--
-- Types are created like regular record types, but each element
-- must have the type @`Column` f a@ instead of @a@.
type family Column (f :: Type -> Type) a where
    Column Identity x = x

    Column Vector x = Vector x

-- Example record type that could be turned into a dataframe
data User f
    = MkUser { userName :: Column f String
             , userAge  :: Column f Int
             }
    deriving Generic

In this framework, User Identity is a regular record type, while User Vector is a dataframe ā€“ every record of the type is a column of values.

Based on all of this work, letā€™s consider a simple class FromRows with the following method:

class FromRows t where
    fromRows :: Vector (t Identity) -> t Vector

which is equivalent to turning rows of structure into a structure of columns.
To make dataframes ergonomic, it would be great to create a default implementation of fromRows using generics

class FromRows t where
    fromRows :: Vector (t Identity) -> t Vector

    default fromRows :: ( ??? )
                     => Vector (t Identity) 
                     -> t Vector
    fromRows = ???

-- Ultimate goal -- users just derive the instance
instance FromRows User

I followed the tutorial on Generics by Mark Karpov, but I donā€™t understand how to do this in practice. I have

class GFromRows t where
    gfromRows :: Vector (t Identity) -> t Vector

instance                               GFromRows U1 where (...)
instance (GFromRows a, GFromRows b) => GFromRows (a :*: b) where (...)
instance GFromRows t =>                GFromRows (M1 i c t) where (...)

class FromRows t where
    fromRows :: Vector (Row t) -> Frame t
    
    default fromRows :: ( ??? )
                        => Vector (Row t) 
                        -> Frame t
    fromRows = (to?) . gfromRows . (from?)

In particular, Iā€™m confused on the constraints on gfromRows. Looks like there is extra ceremony required because t in FromRows t is a higher-kinded type

Edit: I previously used the type synonyms Row and Frame, but did not define them:

-- | Type synonym for a record type with scalar elements
type Row (dt :: (Type -> Type) -> Type) 
    = (dt Identity)

-- | Type synonym for a record type whose elements are arrays (columns)
type Frame (dt :: (Type -> Type) -> Type)   
    = (dt Vector)
1 Like

Youā€™ve given two different type signatures of fromRows, the second of which mentions the undefined Row and Frame types, so I assume the first is the correct one.

I believe for a higher-kinded type you want the Generic1 class. The implementation would look something like this:

class GFromRows t where
  gfromRows :: Vector (t Identity) -> t Vector

instance GFromRows U1 where
  gfromRows _ = U1

instance (GFromRows a, GFromRows b) => GFromRows (a :*: b) where
  gfromRows = fromTuple . bimap gfromRows gfromRows . V.unzip . fmap toTuple
    where
      toTuple (a :*: b) = (a, b)
      fromTuple (a, b) = a :*: b

instance (GFromRows t) => GFromRows (M1 i c t) where
  gfromRows = M1 . gfromRows . fmap unM1

class FromRows t where
  fromRows :: Vector (t Identity) -> t Vector
  default fromRows ::
    (Generic1 t, GFromRows (Rep1 t)) =>
    Vector (t Identity) ->
    t Vector
  fromRows = to1 . gfromRows . fmap from1

However, the compiler wonā€™t derive an instance of Generic1 for your User type (even if you replace Column f by f in the field types:

        Constructor ā€˜MkUserā€™ applies a type to an argument involving the last parameter
                              but the applied type is not of kind * -> *

Iā€™m not sure if there is a formulation of User that the compiler could derive a Generic1 instance for.
I suggest you look into the barbies package.

You can do this with Generic (not Generic1), but you have to forget that youā€™re working with a parameterized type. The Generic class expects a simple type. You have two types t Identity and t Vector. So you treat them as separate instances of Generic, which will both be indices of the GFromRows class.

class GFromRows rI rV where
  gfromRows :: Vector (rI x) -> rV x

The x parameter is unused and unusable, just ignore it. (Itā€™s there for working with Generic1, but youā€™re not using it here (itā€™s unusable for your situation too.))

The types rI and rV, since they come from t Identity and t Vector, are supposed to have pretty much the same structure.

They are both products at the same time.

instance (GFromRows tI1 tV1, GFromRows tI2 tV2) => GFromRows (tI1 :*: tI2) (tV1 :*: tV2) where
   ...

And in the field case, you know that when a field of t Identity has type a, the same field in t Vector will have type Vector a.

instance (v ~ Vector a) => GFromRows (K1 i a) (K1 i v) where
  gfromRows = K1 . fmap unK1

The default implementation will look like this:

default fromRows :: (Generic (t Identity), Generic (t Vector), GFromRow (Rep (t Identity)) (Rep (t Vector))) => Vector (t Identity) -> t Vector
fromRows = to . gfromRows . fmap from
4 Likes

Ahh I apologize, I didnā€™t define Row and Frame. Post edited

Amazing, this works! Thank you. Iā€™ll try to write a blog post soon about this

2 Likes

Thanks to @Lysxia I am putting together a little prototype here

1 Like

I wrote a little blog post about this discussion: Modeling dataframes in Haskell using higher-kinded types - Laurent P. RenƩ de Cotret

As noted at the end, some more work is required is order to embed dataframes in others, such that all fields are represented by a column. Iā€™ll have to think about that

3 Likes

Nice post. I like the idea of HKDs (though I do think itā€™s a shame deriving instances tends to become more complicated/less ergonomic). Anyway, I think your last example also still has a type error no; I would expect you would need something like ``storeAddress :: Column f (Address g)ā€™ , where g is also some functor. (Which I guess you may now also have to pass into Store itself). E.g. I would expect something like:

data Address f = ...
data Store g f = MkStore { storeName    :: Column f String
                         , storeAddress :: Column f (Address g) 
                         }
    deriving (Generic)
2 Likes

Indeed that is also a type error, thanks for pointing it out. Iā€™ll fix it right now.

This is my first post based on Literate Haskell, and I ā€œturn offā€ typechecking for some code blocks (when I want to show something that should not work)

Edit: I think it should be:

data Store f
    = MkStore { storeName    :: Column f String
              , storeAddress :: Address f
              }
    deriving (Generic)

but I suspect that nesting dataframes will require something more, e.g. some newtype Nested such that

data Store f
    = MkStore { storeName    :: Column f String
              , storeAddress :: Nested f Address
              }
1 Like

Great article! This has the makings of a really elegant solution. Could this approach be extended to support different containers for different types. For example both the dataframe implementation in working on (which Iā€™ve decided to call hawk) and Frames store primitives in unboxed vectors and everything else in a boxed vector. Not super important but saves some pointers indirection for numeric functions.

Also Iā€™m working on a small paper for his but I think an unexplored dataframe (one where youā€™re figuring out the types or pruning things) is a fundamentally different object from a complete dataframe. (One where things are well behaved and you would mostly like to do analytics on top of it). Iā€™ve been thinking about how to best implement that bridge. When you have a fully fleshed out prototype Iā€™d love to play with it and translate one kind of data frame to the other.

1 Like

That is a good feature to have ā€“ the performance of unboxed vectors is significantly better --, but Iā€™m not sure how to make that work in this particular design; the container type f must be uniform across fields at this time.

I think an unexplored dataframe (ā€¦) is a fundamentally different object from a complete dataframe

Absolutely. There is a lot of required ceremony around building and handling complete dataframes as in the post.

There are two more features I want to explore:

  • Making one or more of the columns of the dataframe into an index, like a database primary key. This would be required only for certain operations (e.g. lookup);
  • Nesting dataframes while preserving column storage.

Iā€™ll try to release an experimental package for this in the next few weeks.

2 Likes

The barbies library was mentioned already, but I would like to add that your fromRows is essentially Data.Functor.Barbie.

1 Like

I saw the name and forgot ā€“ thanks, Iā€™ll check it out!

After checking out barbies, youā€™re exactly right!

fromRows is bdistribute', and its dual toRows, must therefore be bsequence' (not shown in the blog post, but required for dataframe functionality).

I wish there was a way to do something like this:

type Frameable t = (DistributiveB t, TraversableB t)

data MyRecord t
    MkRecord { field1 :: Column f Int
             , field2 :: Column f String}
             }
    deriving (Generic)

instance Frameable MyRecord
-- equivalent to
-- instance DistributiveB MyRecord
-- instance TraversableB MyRecord

but alas, building upon barbies would require users to explicitly derive instances of DistributiveB and TraversableB.

Not a huge deal, but ideally users would not have to learn what these things are

2 Likes