I’m brainstorming the implementation of a simple (easy-to-use, almost vanilla Haskell) dataframe type, going roughly off of this paper, where dataframes are a 2D data structure of heterogeneous values with some type metadata. Of course, this definition isn’t really useful in a statically typed language. But my initial intuition was that since the data is not completely heterogeneous (we want all values in a column to have the same data type), we could model dataframes in Haskell as follows:
import qualified Data.Map as M  -- assuming M refers to Data.Map

data FrameList = ListInt [Int]
               | ListString [String]
               | ListBool [Bool]

data DataFrame = DataFrame {
    columns :: M.Map String FrameList
}
To apply ad hoc transformations to a FrameList, I created a typeclass that lets me specialize function application per element type:
class Transformable a where
    transform :: (a -> a) -> FrameList -> FrameList

instance Transformable Int where
    transform f (ListInt xs) = ListInt (map f xs)
    transform _ _ = error "Invalid Int transformation"

-- String is a synonym for [Char], so this needs FlexibleInstances under Haskell2010.
instance Transformable String where
    transform f (ListString xs) = ListString (map f xs)
    transform _ _ = error "Invalid String transformation"
So then an “apply” function could look roughly like this:
apply :: Transformable a => DataFrame -> String -> (a -> a) -> DataFrame
apply d column f = case column `M.lookup` columns d of
    Nothing -> d
    -- The record field is columns (the original draft wrote values here).
    Just series -> DataFrame { columns = M.insert column (transform f series) (columns d) }
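To make the pieces above concrete, here is a minimal self-contained sketch that assembles them and applies a function to one column. The sample data, the column names ("age", "name", "height"), the derived Show/Eq instances, and the names sample, older, and unchanged are made up for illustration. The FlexibleInstances pragma is only needed when not compiling with GHC2021.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import qualified Data.Map as M

data FrameList = ListInt [Int]
               | ListString [String]
               | ListBool [Bool]
               deriving (Show, Eq)  -- added here only so results can be printed and compared

data DataFrame = DataFrame
  { columns :: M.Map String FrameList
  } deriving (Show, Eq)

class Transformable a where
  transform :: (a -> a) -> FrameList -> FrameList

instance Transformable Int where
  transform f (ListInt xs) = ListInt (map f xs)
  transform _ _            = error "Invalid Int transformation"

instance Transformable String where
  transform f (ListString xs) = ListString (map f xs)
  transform _ _               = error "Invalid String transformation"

apply :: Transformable a => DataFrame -> String -> (a -> a) -> DataFrame
apply d column f = case M.lookup column (columns d) of
  Nothing     -> d
  Just series -> DataFrame { columns = M.insert column (transform f series) (columns d) }

-- Hypothetical sample frame, not from the original post.
sample :: DataFrame
sample = DataFrame (M.fromList
  [ ("age",  ListInt [30, 41])
  , ("name", ListString ["ann", "bob"])
  ])

-- The annotation on the function selects the Transformable Int instance;
-- a bare (+ 1) would be rejected as ambiguous.
older :: DataFrame
older = apply sample "age" ((+ 1) :: Int -> Int)

-- Looking up a column that does not exist returns the frame unchanged.
unchanged :: DataFrame
unchanged = apply sample "height" ((+ 1) :: Int -> Int)
```

Note that applying the same Int function to the "name" column also type-checks. Because the map is lazy, the error call only fires once that column is actually evaluated.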
But this entire solution has a number of drawbacks:
- I lose type safety and have to do runtime checks to see if the function is valid. Excuse the use of error; I know I could use Either or Maybe, but that’s beside the point.
- Adding new types to the data frame means writing all this boilerplate.
- The approach seems rigid. For example, supporting a function of type (a -> b) means writing more boilerplate to map between types.
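To illustrate the last point, here is a rough sketch of what supporting (a -> b) might look like under the current design. The class Column and the functions unwrap, wrap, and transformTo are hypothetical names introduced here, not part of the code above. It reports mismatches with Maybe rather than error, but the check still happens at runtime.

```haskell
{-# LANGUAGE FlexibleInstances #-}

data FrameList = ListInt [Int]
               | ListString [String]
               | ListBool [Bool]
               deriving (Show, Eq)  -- added only so results can be compared

-- Hypothetical helper class: one instance per supported element type.
class Column a where
  unwrap :: FrameList -> Maybe [a]  -- Nothing when the constructor does not match
  wrap   :: [a] -> FrameList

instance Column Int where
  unwrap (ListInt xs) = Just xs
  unwrap _            = Nothing
  wrap = ListInt

instance Column String where
  unwrap (ListString xs) = Just xs
  unwrap _               = Nothing
  wrap = ListString

instance Column Bool where
  unwrap (ListBool xs) = Just xs
  unwrap _             = Nothing
  wrap = ListBool

-- Type-changing map over a column: unwrap as [a], map, re-wrap as a column of b.
transformTo :: (Column a, Column b) => (a -> b) -> FrameList -> Maybe FrameList
transformTo f series = wrap . map f <$> unwrap series

-- Example: turn an Int column into a Bool column.
evens :: Maybe FrameList
evens = transformTo (even :: Int -> Bool) (ListInt [1, 2, 3])
```

Even in this form, each new column type costs one constructor plus one instance, and nothing stops a caller from passing a function whose input type does not match the column.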
Of course, data frames only contain a few data types, so it wouldn’t be much work, but I was wondering if there’s a better way to do this.
What I still haven’t tried:
- HList, because it’s intimidating enough that it’s a second-to-last resort.
- Anything involving Template Haskell (is that even the right tool for this?)
Any help would be appreciated.