Combining generics with higher-kinded types

Hello,

I am working on a proof-of-concept of a type-safe implementation of dataframes. In this proof-of-concept, I want to essentially take a record type, and allow it to either:

  • contain scalars, e.g. Int and String;
  • contain columns of scalars.

My starting point is type family, Column:

-- | Type family which allows for higher-kinded record types
-- in two forms:
--
-- * Single record type using `Identity`;
-- * Record type whose elements are arrays, using `Vector`.
--
-- Types are created like regular record types, but each element
-- must have the type @`Column` f a@ instead of @a@.
type family Column (f :: Type -> Type) a where
    Column Identity x = x

    Column Vector x = Vector x

-- Example record type that could be turned into a dataframe
data User f
    = MkUser { userName :: Column f String
             , userAge  :: Column f Int
             }
    deriving Generic

In this framework, User Identity is a regular record type, while User Vector is a dataframe – every record of the type is a column of values.

Based on all of this work, let’s consider a simple class FromRows with the following method:

class FromRows t where
    fromRows :: Vector (t Identity) -> t Vector

which is equivalent to turning rows of structure into a structure of columns.
To make dataframes ergonomic, it would be great to create a default implementation of fromRows using generics

class FromRows t where
    fromRows :: Vector (t Identity) -> t Vector

    default fromRows :: ( ??? )
                     => Vector (t Identity) 
                     -> t Vector
    fromRows = ???

-- Ultimate goal -- users just derive the instance
instance FromRows User

I followed the tutorial on Generics by Mark Karpov, but I don’t understand how to do this in practice. I have

class GFromRows t where
    gfromRows :: Vector (t Identity) -> t Vector

instance                               GFromRows U1 where (...)
instance (GFromRows a, GFromRows b) => GFromRows (a :*: b) where (...)
instance GFromRows t =>                GFromRows (M1 i c t) where (...)

class FromRows t where
    fromRows :: Vector (Row t) -> Frame t
    
    default fromRows :: ( ??? )
                        => Vector (Row t) 
                        -> Frame t
    fromRows = (to?) . gfromRows . (from?)

In particular, I’m confused on the constraints on gfromRows. Looks like there is extra ceremony required because t in FromRows t is a higher-kinded type

Edit: I previously used the type synonyms Row and Frame, but did not define them:

-- | Type synonym for a record type with scalar elements
type Row (dt :: (Type -> Type) -> Type) 
    = (dt Identity)

-- | Type synonym for a record type whose elements are arrays (columns)
type Frame (dt :: (Type -> Type) -> Type)   
    = (dt Vector)
1 Like

You’ve given two different type signatures of fromRows, the second of which mentions the undefined Row and Frame types, so I assume the first is the correct one.

I believe for a higher-kinded type you want the Generic1 class. The implementation would look something like this:

class GFromRows t where
  gfromRows :: Vector (t Identity) -> t Vector

instance GFromRows U1 where
  gfromRows _ = U1

instance (GFromRows a, GFromRows b) => GFromRows (a :*: b) where
  gfromRows = fromTuple . bimap gfromRows gfromRows . V.unzip . fmap toTuple
    where
      toTuple (a :*: b) = (a, b)
      fromTuple (a, b) = a :*: b

instance (GFromRows t) => GFromRows (M1 i c t) where
  gfromRows = M1 . gfromRows . fmap unM1

class FromRows t where
  fromRows :: Vector (t Identity) -> t Vector
  default fromRows ::
    (Generic1 t, GFromRows (Rep1 t)) =>
    Vector (t Identity) ->
    t Vector
  fromRows = to1 . gfromRows . fmap from1

However, the compiler won’t derive an instance of Generic1 for your User type (even if you replace Column f by f in the field types:

        Constructor ā€˜MkUser’ applies a type to an argument involving the last parameter
                              but the applied type is not of kind * -> *

I’m not sure if there is a formulation of User that the compiler could derive a Generic1 instance for.
I suggest you look into the barbies package.

You can do this with Generic (not Generic1), but you have to forget that you’re working with a parameterized type. The Generic class expects a simple type. You have two types t Identity and t Vector. So you treat them as separate instances of Generic, which will both be indices of the GFromRows class.

class GFromRows rI rV where
  gfromRows :: Vector (rI x) -> rV x

The x parameter is unused and unusable, just ignore it. (It’s there for working with Generic1, but you’re not using it here (it’s unusable for your situation too.))

The types rI and rV, since they come from t Identity and t Vector, are supposed to have pretty much the same structure.

They are both products at the same time.

instance (GFromRows tI1 tV1, GFromRows tI2 tV2) => GFromRows (tI1 :*: tI2) (tV1 :*: tV2) where
   ...

And in the field case, you know that when a field of t Identity has type a, the same field in t Vector will have type Vector a.

instance (v ~ Vector a) => GFromRows (K1 i a) (K1 i v) where
  gfromRows = K1 . fmap unK1

The default implementation will look like this:

default fromRows :: (Generic (t Identity), Generic (t Vector), GFromRow (Rep (t Identity)) (Rep (t Vector))) => Vector (t Identity) -> t Vector
fromRows = to . gfromRows . fmap from
4 Likes

Ahh I apologize, I didn’t define Row and Frame. Post edited

Amazing, this works! Thank you. I’ll try to write a blog post soon about this

2 Likes

Thanks to @Lysxia I am putting together a little prototype here

1 Like

I wrote a little blog post about this discussion: Modeling dataframes in Haskell using higher-kinded types - Laurent P. RenƩ de Cotret

As noted at the end, some more work is required is order to embed dataframes in others, such that all fields are represented by a column. I’ll have to think about that

3 Likes

Nice post. I like the idea of HKDs (though I do think it’s a shame deriving instances tends to become more complicated/less ergonomic). Anyway, I think your last example also still has a type error no; I would expect you would need something like ``storeAddress :: Column f (Address g)’ , where g is also some functor. (Which I guess you may now also have to pass into Store itself). E.g. I would expect something like:

data Address f = ...
data Store g f = MkStore { storeName    :: Column f String
                         , storeAddress :: Column f (Address g) 
                         }
    deriving (Generic)
2 Likes

Indeed that is also a type error, thanks for pointing it out. I’ll fix it right now.

This is my first post based on Literate Haskell, and I ā€œturn offā€ typechecking for some code blocks (when I want to show something that should not work)

Edit: I think it should be:

data Store f
    = MkStore { storeName    :: Column f String
              , storeAddress :: Address f
              }
    deriving (Generic)

but I suspect that nesting dataframes will require something more, e.g. some newtype Nested such that

data Store f
    = MkStore { storeName    :: Column f String
              , storeAddress :: Nested f Address
              }
1 Like

Great article! This has the makings of a really elegant solution. Could this approach be extended to support different containers for different types. For example both the dataframe implementation in working on (which I’ve decided to call hawk) and Frames store primitives in unboxed vectors and everything else in a boxed vector. Not super important but saves some pointers indirection for numeric functions.

Also I’m working on a small paper for his but I think an unexplored dataframe (one where you’re figuring out the types or pruning things) is a fundamentally different object from a complete dataframe. (One where things are well behaved and you would mostly like to do analytics on top of it). I’ve been thinking about how to best implement that bridge. When you have a fully fleshed out prototype I’d love to play with it and translate one kind of data frame to the other.

1 Like

That is a good feature to have – the performance of unboxed vectors is significantly better --, but I’m not sure how to make that work in this particular design; the container type f must be uniform across fields at this time.

I think an unexplored dataframe (…) is a fundamentally different object from a complete dataframe

Absolutely. There is a lot of required ceremony around building and handling complete dataframes as in the post.

There are two more features I want to explore:

  • Making one or more of the columns of the dataframe into an index, like a database primary key. This would be required only for certain operations (e.g. lookup);
  • Nesting dataframes while preserving column storage.

I’ll try to release an experimental package for this in the next few weeks.

2 Likes

The barbies library was mentioned already, but I would like to add that your fromRows is essentially Data.Functor.Barbie.

1 Like

I saw the name and forgot – thanks, I’ll check it out!

After checking out barbies, you’re exactly right!

fromRows is bdistribute', and its dual toRows, must therefore be bsequence' (not shown in the blog post, but required for dataframe functionality).

I wish there was a way to do something like this:

type Frameable t = (DistributiveB t, TraversableB t)

data MyRecord t
    MkRecord { field1 :: Column f Int
             , field2 :: Column f String}
             }
    deriving (Generic)

instance Frameable MyRecord
-- equivalent to
-- instance DistributiveB MyRecord
-- instance TraversableB MyRecord

but alas, building upon barbies would require users to explicitly derive instances of DistributiveB and TraversableB.

Not a huge deal, but ideally users would not have to learn what these things are

2 Likes