Implementing type-safe heterogeneous collections

I’m not sure how HKD solves this problem. Yes, you still need an existential because you dont know the types and orders at compile time, but once you unwrap the existential, HList provides exactly the types to be able to tell the compiler “I want to get the first column, which has type X”.

HMap doesnt help because DataFrames have ordered columns. HList is exactly whats needed here

Sorry, I should have been clearer. I was proposing not using any type called DataFrame at all and using tuples, if you want:

  • a product of lists
  • with type-safety (no run-time checks for types of things)
  • in vanilla Haskell (without advanced type features or HList)

You won’t get all three of those things any other way.

As proof, consider the type of your proposed apply function:

apply :: Transformable a => DataFrame -> String -> (a -> a) -> DataFrame

To have the type safety you want, if df1 is a DataFrame that contains an Int column, and df2 is a DataFrame that contains a String column, you want apply df1 "the-column" someIntFn to type-check but apply df2 "the-column" someIntFn to raise an error at compile time. If df1 and df2 have exactly the same type (DataFrame), there is no information the compiler can use to differentiate between those two expressions, so you can’t expect different compile-time behavior from them.

You can resign yourself to losing type safety, in which case the path you’re on, or the Dynamic approach sketched by danidiaz, will do the job.

If you are unhappy with that, then you must find a way to make the types of df1 and df2 different. My proposal, which works for simple cases but not for cases where you want to, say, map a function uniformly over all (heterogeneous) columns, is to abandon the DataFrame type and use tuples; the lens infrastructure will help you work ergonomically with tuples of size up to 19, and that’s sufficient for things like apply.

Your only remaining alternative, if you want type safety and you want a type named DataFrame, is to parameterize DataFrame somehow with a constant number of parameters that express the types of a variable number of columns. This isn’t something that’s possible in ‘simple’ Haskell. So for that, your options are:

  • Bite the bullet and learn HList
  • Find a less-popular package that does, at its core, the same thing as HList, and learn that
  • Learn some combination of data kinds, multi-parameter type classes, functional dependencies, type families, and/or some other GHC extensions that give you equivalent power, and implement the core of HList yourself (I include in this list item clever gotchas like cons-style nested tuples, because you need aforementioned extensions to do things like concatenations/uniform maps over nested tuples, and if you don’t need those things you should be happy with simple tuples instead of DataFrame)
1 Like

DataFrames’ columns and rows are both ordered and named, i.e. they’re accessible through either index or name. The column operations tend to use names, the row operations are more index-oriented.

As for “how”: if the set of all possible columns were known in advance, you could start with

data CountryRow f = CountryRow {
  name :: f Text,
  landArea :: f Double,
  waterArea :: f Double,
  population :: f Word,
  gdp :: f Centi,
  ...
}

Then a DataFrame Year CountryRow type can be defined in several different ways like

type DataFrame rowLabel rowColumns = [(rowLabel, rowColumns Maybe)] -- sparse
type DataFrame rowLabel rowColumns = [(rowLabel, rowColumns Identity)] -- total
type DataFrame rowLabel rowColumns = rowColumns (Compose [] ((,) rowLabel)) -- columnar

The major problem, again, is that the set of possible columns is not fixed.

1 Like

Couldn’t you simply define a product like this:

data p :*: q f = p f :*: q f
newtype Name f = Name (f Text)
newtype LandArea f = LandArea (f Double)
type CountryRow = Name :*: LandArea :*: etc

Ah, I guess this way pattern matching and constructing these rows is a bit of a pain.

No idea how helpful this will be, but maybe you can get some inspiration or we will arrive at a design that I, too, can use.

I’m currently writing an SQL library that allows writing arbitrary SQL - that is then type checked and serialized into this:

type (:=) :: Symbol -> Type -> Type
data label := ty = (KnownSymbol label) => Proxy label := ty

-- TODO small array
newtype Row (types :: [Type]) = Row [Any]

instance HasField l (Row ((l := t) : e)) t where
    getField (Row xs) = unsafeCoerce $ xs !! 0

instance HasField l (Row (_0 : (l := t) : e)) t where
    getField (Row xs) = unsafeCoerce $ xs !! 1

instance HasField l (Row (_0 : _1 : (l := t) : e)) t where
    getField (Row xs) = unsafeCoerce $ xs !! 2

It gives quite nice interface (as long as the library will be constructing it):

let row = Row [unsafeCoerce ("hello" :: Text), unsafeCoerce ("haskell" :: String)]
                :: Row '["abc" := Text, "xyz" := String]

    print row.abc -- "hello"
    print row.doesntExist -- compile error

There are two problems. It depends on experimental OverloadedRecordDot extension. It is read-only currently (enough for my use case). I didn’t check OverloadedRecordUpdate yet and I believe it is even more experimental.

Right, which I’d argue is the crux of the original problem. OP is making a DataFrame library, so in this context, it doesnt make any sense to talk about a technique that requires the columns to be known.

As an aside, I’m not sure how you can say “HList is mostly inferior to HKD”; they solve two different problems. To me, this is like saying “lists are mostly inferior to tuples”. Sure, if you know the shape of the data before hand and the shape doesnt change, then use tuples. But lists solve a different set of problems, so it doesnt make sense to compare the two.

I want to challenge this assumption. I work with dataframes a lot, and I always know the subset of columns I care about.

Consider JSON. Some applications need to accept arbitrary JSON, sure. But when I work with JSON, I know at least a subset of the schema, and create the appropriate datatypes to be decoded/encoded.

I would expect the same from a Haskell DataFrame library. My workflow would look like this:

  1. Peek at the data (e.g. first few rows from a CSV file) to determine what I’m working with;
  2. Define a datatype which represents the data I care about;
  3. Implement some kind of decode operation to get the data into a DataFrame;
  4. Proceed with data analysis.

I’m a big fan of the SQL interface beam, which uses higher-kinded types (see tutorial).

2 Likes

EDIT: Sorry, wrote the first response after reading the previous message too quickly :joy:

Sure, that’s a valid use case, but I think a lot of people also use DataFrames to explore the data. I’d imagine pandas would be a lot less popular if you have to read the source input beforehand and define the schema first.

But on the other hand, maybe you wouldnt be doing that in Haskell, a compiled language, in the first place. Maybe that only makes sense in a REPL context

1 Like

Maybe this is made clear someplace in the thread that I missed: in what ways does the [Frames]Frames: Data frames for working with tabular data files) library fall short of your needs?

The analogy of lists and tuples is misplaced, as they indeed serve very different purposes. But at the value level, while an HKD record is still a record (which in Haskell is a tuple), an HList is not actually a list but a bunch of nested pairs – i.e. a tuple, only with worse performance and ergonomics. That’s why I’m saying it’s mostly inferior – mostly as in for most purposes. The only exception is if you’re building the type of your HList from parts, but it’s difficult to maintain the proof that the result contains the types you put inside.

Can you elaborate? That doesn’t make sense to me. The canonical HList I’m familiar with is

data HList xs where
  Nil :: HList []
  (:&) :: x -> HList xs -> HList (x : xs)

foo :: HList [Int, Bool]
foo = 1 :& True :& Nil

bar :: HList xs -> HList (String : xs)
bar xs = "hello" :& xs

baz :: HList [String, Int, Bool]
baz = bar foo

I don’t see how this is a bunch of nested pairs. It’s almost verbatim the definition of a normal list, just a GADT version

[HList old hand speaking] The canonical HList I think of is as per the 2004 paper, with constructors HCons, HNil. Each node is then a pair of (label, value). There was no type-level : back then.

I appreciate it’s no longer 2004. So perhaps “canonical” is best avoided.

BTW I’m following this thread with interest, and wondering how this use case would go if Hugs/TRex was still a thing.

I see, thanks. I wasn’t aware of the older implementation of HList. I meant “canonical” as in, the implementation in the HList package that OP linked (and the implementation I’m more familiar with).

But even so, I’m reading the 2004 paper, and don’t see how an HList is (label, value) nodes. I see the definition

data HNil = HNil
data HCons e l = HCons e l

type (:*:) e l = HCons e l
e .*. l = HCons e l

foo :: Int :*: Bool :*: HNil
foo = 1 .*. 2 .*. HNil

But I don’t see the (label, value) pairs. Furthermore, I don’t see how HLists are “a bunch of nested pairs”; it’s only true insofar as a normal list is “a bunch of nested pairs”.

Section 5 ‘Extensible records’, top of page 5, operator .=.. (But now that you point it out, that operator isn’t defined – even in the Appendixes. Naughty!)

ah sure, using an HList for extensible records. But that’s not inherent to using an HList. That’s like saying “using an HList to simulate HKD is inferior to HKD”, which, sure.

Ah, I’m misleading you, sorry my bad.

(.=.) :: Label l -> v -> Tagged l v 

newtype Tagged s b = Tagged { unTagged :: b } deriving ...

(Gleaned from various places on Hackage. Hoogle seems not to have it. OTOH I’m on the road, without proper access to usual checks.)

So the label is a phantom. The l.h. operand to .=. is being used for its type only.

I’ve considered making a data frame library in the past but I’ve never got round to it. If I were doing it I’d do it as an arbitrary collection of Vectors wrapped in abstract data types that enforce the invariant that all the Vectors are of the same length. So, for example

newtype Field a = HiddenMkField (Vector a)

newtype DataFrame a = HiddenMkDataFrame a
    deriving Functor

fromVector :: Vector a -> DataFrame (Field a)
fromVector = HiddenMkDataFrame . HiddenMkField

zip ::
  DataFrame (Field a) ->
  DataFrame (Field a) ->
  DataFrame (Field a, Field a)
zip = ... check same length and put in pair...

All functions that operate on Fields must preserve the invariant.

This approach has the nice property that you can be very specific about the fields that are in your data frame, for example

data MyRecord = MyRecord 
  { count :: Field Int,
    name :: Field String,
    weight :: Field Double
  }

and you can also be completely dynamic about the collection of Fields and the types of the Fields, for example, DataFrame (Field Dynamic, Field Dynamic), or DataFrame (Map String Dynamic).

In order to write polymorphic functions over DataFrames you can use ProductProfunctors in roughly the same way that Opaleye does.

Speaking for myself, I find the Frames library exceedingly intimidating to use.

There’s a tutorial, sure. But, for example, the tutorial doesn’t show how to melt the frame, so I would refer to this documentation entry: melt. I can’t imagine someone picking up Frames as a beginner.

Basically, I would be lost when not on the happy path with Frames.

Sure, that looks much like a regular list definition, but it’s isomorphic to – and has exactly the same memory layout as – this definition:

type family HList2 (xs :: [Type]) where
  HList2 '[] = ()
  HList2 (x ': xs) = (x, HList2 xs)

foo :: HList2 [Int, Bool]
foo = (1, (True, ()))

You might object that the regular list datatype can also be represented as nested pairs and you’d be right, but the regular list has a single parameter type and that makes a world of difference in what you can do with it. It makes no sense to speak of an infinite HList for example, because its type would also be infinite and GHC wouldn’t allow it.

1 Like

I believe a dataframe library which made judicious use of dynamic types for the columns would be an interesting part of the design space to explore. It would avoid the usual pitfalls when trying to describe the columns fully statically: inscrutable types, eternal compilation times… while enabling new use cases, like interactive definitions of frames and pipelines.

The trick would be to avoid being too dynamic. For complex chains of operations that might take a while to complete, some kind of value-level analysis could be performed in order to be able to “fail fast” before running the operations. It would still fail at runtime, but it would fail early, and that might be an acceptable compromise.

And, likely, it would still be possible to integrate islands of “static types” in dynamic pipelines, if one so wished.

1 Like