Implementing type-safe heterogeneous collections

mchav · January 21, 2024, 8:08am

I’m brainstoming the implementation of a simple (easy-to-use, almost vanilla Haskell) dataframe type going roughly off of this paper where dataframes are a 2D data structure heterogeneous values with some type metadata. Of course, this definition isn’t really useful in a statically typed language. But my initial intuition was that since the data is not completely hereogenous (we want columns to have the same data type) we could model dataframes in Haskell as follows:

data FrameList = ListInt [Int]
               | ListString [String]
               | ListBool [Bool]

data DataFrame = DataFrame {
    columns :: M.Map String FrameList
}

And so to apply ad hoc transformations onto the framelist, I create a typeclass that allows me to use specialize function application:

class Transformable a where
    transform :: (a -> a) -> FrameList -> FrameList

instance Transformable Int where
    transform f (ListInt xs) = ListInt (map f xs)
    transform _ other        = error "Invalid Int transformation"

instance Transformable String where
    transform f (ListString xs) = ListString (map f xs)
    transform _ other           = error "Invalid String transformation"

So then an “apply” function could look roughly like this:

apply :: Transformable a => DataFrame -> String -> (a -> a) -> DataFrame
apply d column f = case column `M.lookup` (columns d) of
    Nothing     -> d
    Just series -> DataFrame { values = M.insert column (transform f series) (columns d) }

But this entire solution has a number of drawbacks:

I lose type safety and have to do runtime checks to see if the function is valid. Excuse that I used error. I know I could use Either or Maybe but that’s besides the issue.
Adding new types to the data frame means writing all this boilerplate.
The approach seems rigid. For example, supporting a function that goes from (a -> b) means writing more boilerplate to map between types.

Of course data frames only contains a few data types so it wouldn’t be much work but I was wondering if there’s a better way to do this.

What I still haven’t tried:

HList because it’s intimidating enough that it’s a second last resort.
Anything involving template Haskell (is that even the right tool for this?)

Any help would be appreciated.

rhendric · January 21, 2024, 10:12am

For that, it’s hard to beat ([Int], [Int], [Bool], [String]). Use lens and your apply d column f is over (column . mapped) f d (where column is _0, _1, _2, …). Want horizontal composition? Nest tuples and compose column accessors.

(I’m using lists because you are; I’d expect that a serious application would use Array or Vector or almost anything else instead.)

mchav · January 21, 2024, 4:23pm

Thank you. I’m not quite sure I understand. Defining the collection as a tuple would mean knowing the number of elements up front as in:

data DataFrame a b c .... n = DataFrame {
  columns :: ([a], [b], ..., [n])
}

Supporting a large number of columns seems unwieldy. Am I misunderstanding something?

blamario · January 21, 2024, 6:05pm

HList is mostly inferior to the higher-kinded data approach. Neither of them is a suitable foundation for Data Frames, because the names, order, and types of the columns are not fixed in advance.

Maybe HMap would work?

Regarding your interface, some creative instances of the HasField class could in principle support the basic features of the API while preserving the type safety.

danidiaz · January 21, 2024, 7:00pm

I think Transformable can be replaced by the standard Haskell machinery for dynamic types! That machinery lives in modules like Data.Typeable, Type.Reflection (which provides a more “powerful”, type-indexed version of the TypeReps from Data.Typeable), Data.Dynamic and Data.Type.Equality.

Consider these new definitions:

type FrameList a = [a]

data SomeFrameList where
  MakeSomeFrameList :: 
         (Typeable a) => {frameList :: FrameList a} -> SomeFrameList

type ColumnName = Text

data DataFrame = DataFrame
  { columns :: Map ColumnName SomeFrameList
  }

SomeFrameList uses existential quantification to strategically forget about the type of the elements in a FrameList. Two things to notice:

It still remembers at compile time that they live in a FrameList.
It requires the elements to be Typeable. Unlike your Transformable, most (all?) concrete types in Haskell have an automatically defined Typeable instance. You don’t even need to derive it. That instance allows performing type comparisons at runtime.

Now, because the application might fail at runtime, here’s a nice error datatype:

data ApplyError
  = ColumnDoesNotExist ColumnName
  | TypeMismatch SomeTypeRep SomeTypeRep

And the apply function could be something like:

apply :: 
    forall b. (Typeable b) => 
    DataFrame -> 
    ColumnName -> 
    (b -> b) -> 
    Either ApplyError DataFrame
apply DataFrame {columns} columnName f =
  DataFrame <$> Map.alterF alteration columnName columns
  where
    alteration :: Maybe SomeFrameList 
               -> Either ApplyError (Maybe SomeFrameList)
    alteration = \case
      Nothing -> Left $ ColumnDoesNotExist columnName
      -- The type annotation in the match binding 
      -- brings the hidden type of tye elements into scope
      Just (MakeSomeFrameList {frameList = oldFrameList :: [a]}) ->
        -- We get the *indexed* type 'TypeRep's of the types 
        -- we want to compare.
        -- It's not enough that we use the unindexed
        -- Type.Reflection.SomeTypeRep / Data.Typeable.TypeRep (same thing)
        -- because the 'testEquality' function needs the indexed versions.
        let repb :: Type.Reflection.TypeRep b = Type.Reflection.typeRep @b
            repa :: Type.Reflection.TypeRep a = Type.Reflection.typeRep @a
         in case repa `testEquality` repb of
              Nothing ->
                Left $
                  TypeMismatch (SomeTypeRep repb) (SomeTypeRep repa)
              Just Refl ->
                Right $
                  Just $
                    MakeSomeFrameList
                      { frameList = map f oldFrameList
                      }

It should work with concrete types, without any need to define new instances or code for each type.

The main magic lies in the use of testEquality function to compare indexed type representations.

I’ve put the code in this repo. Despite (or because ) all the dynamic typing, I haven’t run it, only compiled it!

brandonchinn178 · January 21, 2024, 7:14pm

I’m not sure how HKD solves this problem. Yes, you still need an existential because you dont know the types and orders at compile time, but once you unwrap the existential, HList provides exactly the types to be able to tell the compiler “I want to get the first column, which has type X”.

HMap doesnt help because DataFrames have ordered columns. HList is exactly whats needed here

rhendric · January 21, 2024, 7:29pm

mchav:

Thank you. I’m not quite sure I understand. Defining the collection as a tuple would mean knowing the number of elements up front as in:
data DataFrame a b c .... n = DataFrame {
  columns :: ([a], [b], ..., [n])
}
Supporting a large number of columns seems unwieldy. Am I misunderstanding something?

Sorry, I should have been clearer. I was proposing not using any type called DataFrame at all and using tuples, if you want:

a product of lists
with type-safety (no run-time checks for types of things)
in vanilla Haskell (without advanced type features or HList)

You won’t get all three of those things any other way.

As proof, consider the type of your proposed apply function:

apply :: Transformable a => DataFrame -> String -> (a -> a) -> DataFrame

To have the type safety you want, if df1 is a DataFrame that contains an Int column, and df2 is a DataFrame that contains a String column, you want apply df1 "the-column" someIntFn to type-check but apply df2 "the-column" someIntFn to raise an error at compile time. If df1 and df2 have exactly the same type (DataFrame), there is no information the compiler can use to differentiate between those two expressions, so you can’t expect different compile-time behavior from them.

You can resign yourself to losing type safety, in which case the path you’re on, or the Dynamic approach sketched by danidiaz, will do the job.

If you are unhappy with that, then you must find a way to make the types of df1 and df2 different. My proposal, which works for simple cases but not for cases where you want to, say, map a function uniformly over all (heterogeneous) columns, is to abandon the DataFrame type and use tuples; the lens infrastructure will help you work ergonomically with tuples of size up to 19, and that’s sufficient for things like apply.

Your only remaining alternative, if you want type safety and you want a type named DataFrame, is to parameterize DataFrame somehow with a constant number of parameters that express the types of a variable number of columns. This isn’t something that’s possible in ‘simple’ Haskell. So for that, your options are:

Bite the bullet and learn HList
Find a less-popular package that does, at its core, the same thing as HList, and learn that
Learn some combination of data kinds, multi-parameter type classes, functional dependencies, type families, and/or some other GHC extensions that give you equivalent power, and implement the core of HList yourself (I include in this list item clever gotchas like cons-style nested tuples, because you need aforementioned extensions to do things like concatenations/uniform maps over nested tuples, and if you don’t need those things you should be happy with simple tuples instead of DataFrame)

blamario · January 21, 2024, 7:46pm

DataFrames’ columns and rows are both ordered and named, i.e. they’re accessible through either index or name. The column operations tend to use names, the row operations are more index-oriented.

As for “how”: if the set of all possible columns were known in advance, you could start with

data CountryRow f = CountryRow {
  name :: f Text,
  landArea :: f Double,
  waterArea :: f Double,
  population :: f Word,
  gdp :: f Centi,
  ...
}

Then a DataFrame Year CountryRow type can be defined in several different ways like

type DataFrame rowLabel rowColumns = [(rowLabel, rowColumns Maybe)] -- sparse
type DataFrame rowLabel rowColumns = [(rowLabel, rowColumns Identity)] -- total
type DataFrame rowLabel rowColumns = rowColumns (Compose [] ((,) rowLabel)) -- columnar

The major problem, again, is that the set of possible columns is not fixed.

jaror · January 21, 2024, 8:00pm

Couldn’t you simply define a product like this:

data p :*: q f = p f :*: q f
newtype Name f = Name (f Text)
newtype LandArea f = LandArea (f Double)
type CountryRow = Name :*: LandArea :*: etc

Ah, I guess this way pattern matching and constructing these rows is a bit of a pain.

jeukshi · January 21, 2024, 8:11pm

No idea how helpful this will be, but maybe you can get some inspiration or we will arrive at a design that I, too, can use.

I’m currently writing an SQL library that allows writing arbitrary SQL - that is then type checked and serialized into this:

type (:=) :: Symbol -> Type -> Type
data label := ty = (KnownSymbol label) => Proxy label := ty

-- TODO small array
newtype Row (types :: [Type]) = Row [Any]

instance HasField l (Row ((l := t) : e)) t where
    getField (Row xs) = unsafeCoerce $ xs !! 0

instance HasField l (Row (_0 : (l := t) : e)) t where
    getField (Row xs) = unsafeCoerce $ xs !! 1

instance HasField l (Row (_0 : _1 : (l := t) : e)) t where
    getField (Row xs) = unsafeCoerce $ xs !! 2

It gives quite nice interface (as long as the library will be constructing it):

let row = Row [unsafeCoerce ("hello" :: Text), unsafeCoerce ("haskell" :: String)]
                :: Row '["abc" := Text, "xyz" := String]

    print row.abc -- "hello"
    print row.doesntExist -- compile error

There are two problems. It depends on experimental OverloadedRecordDot extension. It is read-only currently (enough for my use case). I didn’t check OverloadedRecordUpdate yet and I believe it is even more experimental.

brandonchinn178 · January 21, 2024, 9:48pm

Right, which I’d argue is the crux of the original problem. OP is making a DataFrame library, so in this context, it doesnt make any sense to talk about a technique that requires the columns to be known.

As an aside, I’m not sure how you can say “HList is mostly inferior to HKD”; they solve two different problems. To me, this is like saying “lists are mostly inferior to tuples”. Sure, if you know the shape of the data before hand and the shape doesnt change, then use tuples. But lists solve a different set of problems, so it doesnt make sense to compare the two.

LaurentRDC · January 21, 2024, 11:42pm

I want to challenge this assumption. I work with dataframes a lot, and I always know the subset of columns I care about.

Consider JSON. Some applications need to accept arbitrary JSON, sure. But when I work with JSON, I know at least a subset of the schema, and create the appropriate datatypes to be decoded/encoded.

I would expect the same from a Haskell DataFrame library. My workflow would look like this:

Peek at the data (e.g. first few rows from a CSV file) to determine what I’m working with;
Define a datatype which represents the data I care about;
Implement some kind of decode operation to get the data into a DataFrame;
Proceed with data analysis.

I’m a big fan of the SQL interface beam, which uses higher-kinded types (see tutorial).

brandonchinn178 · January 22, 2024, 12:35am

EDIT: Sorry, wrote the first response after reading the previous message too quickly

Sure, that’s a valid use case, but I think a lot of people also use DataFrames to explore the data. I’d imagine pandas would be a lot less popular if you have to read the source input beforehand and define the schema first.

But on the other hand, maybe you wouldnt be doing that in Haskell, a compiled language, in the first place. Maybe that only makes sense in a REPL context

adamConnerSax · January 22, 2024, 3:16am

Maybe this is made clear someplace in the thread that I missed: in what ways does the [Frames]Frames: Data frames for working with tabular data files) library fall short of your needs?

blamario · January 22, 2024, 3:22am

The analogy of lists and tuples is misplaced, as they indeed serve very different purposes. But at the value level, while an HKD record is still a record (which in Haskell is a tuple), an HList is not actually a list but a bunch of nested pairs – i.e. a tuple, only with worse performance and ergonomics. That’s why I’m saying it’s mostly inferior – mostly as in for most purposes. The only exception is if you’re building the type of your HList from parts, but it’s difficult to maintain the proof that the result contains the types you put inside.

brandonchinn178 · January 22, 2024, 3:50am

Can you elaborate? That doesn’t make sense to me. The canonical HList I’m familiar with is

data HList xs where
  Nil :: HList []
  (:&) :: x -> HList xs -> HList (x : xs)

foo :: HList [Int, Bool]
foo = 1 :& True :& Nil

bar :: HList xs -> HList (String : xs)
bar xs = "hello" :& xs

baz :: HList [String, Int, Bool]
baz = bar foo

I don’t see how this is a bunch of nested pairs. It’s almost verbatim the definition of a normal list, just a GADT version

AntC2 · January 22, 2024, 5:03am

[HList old hand speaking] The canonical HList I think of is as per the 2004 paper, with constructors HCons, HNil. Each node is then a pair of (label, value). There was no type-level : back then.

I appreciate it’s no longer 2004. So perhaps “canonical” is best avoided.

BTW I’m following this thread with interest, and wondering how this use case would go if Hugs/TRex was still a thing.

brandonchinn178 · January 22, 2024, 5:24am

I see, thanks. I wasn’t aware of the older implementation of HList. I meant “canonical” as in, the implementation in the HList package that OP linked (and the implementation I’m more familiar with).

But even so, I’m reading the 2004 paper, and don’t see how an HList is (label, value) nodes. I see the definition

data HNil = HNil
data HCons e l = HCons e l

type (:*:) e l = HCons e l
e .*. l = HCons e l

foo :: Int :*: Bool :*: HNil
foo = 1 .*. 2 .*. HNil

But I don’t see the (label, value) pairs. Furthermore, I don’t see how HLists are “a bunch of nested pairs”; it’s only true insofar as a normal list is “a bunch of nested pairs”.

AntC2 · January 22, 2024, 6:19am

Section 5 ‘Extensible records’, top of page 5, operator .=.. (But now that you point it out, that operator isn’t defined – even in the Appendixes. Naughty!)

brandonchinn178 · January 22, 2024, 6:57am

ah sure, using an HList for extensible records. But that’s not inherent to using an HList. That’s like saying “using an HList to simulate HKD is inferior to HKD”, which, sure.