[Design] Dataframes in Haskell

It’s surely possible. The question is whether it works well, is convenient, and doesn’t cause problems. The only way to find out is to try it.

As for vinyl-like approaches, I think we’re running into the limits of the language. The things we want to write are not clearly expressible. One way to look at a statically typed data frame is as an explicit Array of Structures → Structure of Arrays transformation. Implementing it requires talking about the fields of a product type, and that’s something we just can’t do in the language.

“One way to look at a statically typed data frame is as an explicit Array of Structures → Structure of Arrays transformation. Implementing it requires talking about the fields of a product type, and that’s something we just can’t do in the language.”

Could you explain this? I read it a couple of times and didn’t understand. Do you mean a dataframe can be thought of as a transformation from a series of rows (an array of structures) into a collection of columns?

And could you explain concretely what you mean when you say we can’t/don’t “speak about record fields” in Haskell?

FWIW, I have in this repo a toy example of how type promotions could work using dynamic types. It looks like:

$ cabal repl
ghci> frame <- createExampleFrame
ghci> frame & describe
"doubly" -> Double
"floaty" -> Float
"inty" -> Int
ghci> frame & "foo" .= "inty" +: "floaty"
ghci> frame & describe
"doubly" -> Double
"floaty" -> Float
"foo" -> Float
"inty" -> Int
ghci> frame & printCol "foo"
2.0
4.0
6.0
ghci>
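For readers without the repo at hand, here is a minimal sketch of how that kind of promotion could be modelled with runtime-tagged columns. It is not the repo's actual code: `Column`, `Frame`, `addCols` and `addDerived` are hypothetical names, the frame is a plain association list, and the sample values used with it are made up.

```haskell
import Data.List (sortOn)

-- A dynamically typed column: the element type is a runtime tag.
data Column
  = IntCol [Int]
  | FloatCol [Float]
  | DoubleCol [Double]
  deriving (Show, Eq)

-- Hypothetical frame representation: named columns in an association list.
type Frame = [(String, Column)]

-- Name of the element type, in the style of the `describe` output above.
typeName :: Column -> String
typeName (IntCol _)    = "Int"
typeName (FloatCol _)  = "Float"
typeName (DoubleCol _) = "Double"

-- Element-wise addition that promotes to the wider numeric type.
addCols :: Column -> Column -> Column
addCols (IntCol xs)    (IntCol ys)    = IntCol (zipWith (+) xs ys)
addCols (FloatCol xs)  (FloatCol ys)  = FloatCol (zipWith (+) xs ys)
addCols (DoubleCol xs) (DoubleCol ys) = DoubleCol (zipWith (+) xs ys)
addCols (IntCol xs)    (FloatCol ys)  = FloatCol (zipWith (+) (map fromIntegral xs) ys)
addCols (IntCol xs)    (DoubleCol ys) = DoubleCol (zipWith (+) (map fromIntegral xs) ys)
addCols (FloatCol xs)  (DoubleCol ys) = DoubleCol (zipWith (+) (map realToFrac xs) ys)
addCols a b = addCols b a -- the three remaining cases are the mirrored ones

-- Add (or overwrite) a derived column; Nothing if an input column is missing.
addDerived :: String -> String -> String -> Frame -> Maybe Frame
addDerived new a b frame = do
  ca <- lookup a frame
  cb <- lookup b frame
  let others = filter ((/= new) . fst) frame
  pure (sortOn fst ((new, addCols ca cb) : others))
```

With an `Int` column and a `Float` column as inputs, the derived column comes out as a `FloatCol`, which is the same shape of result as `"foo" -> Float` in the session above. All type errors (for example a missing column) only show up at runtime, which is the trade-off of the dynamic approach.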

Oh, it was poorly worded. Of course data frames are represented as columns, so they are SoA. On the other hand, in Haskell it’s easier to work with AoS, hence AoS → SoA.
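To make the AoS → SoA point concrete, here is a small hand-written sketch. The `Person` record and its columnar twin `PersonCols` are made-up examples, and plain lists stand in for real column storage.

```haskell
-- Array of structures: one record per row.
data Person = Person { name :: String, age :: Int }
  deriving (Show, Eq)

-- Structure of arrays: one list per field. This type has to be written
-- by hand; it is not derived from Person by the language.
data PersonCols = PersonCols { names :: [String], ages :: [Int] }
  deriving (Show, Eq)

-- AoS -> SoA: split the rows into one column per field.
toColumns :: [Person] -> PersonCols
toColumns rows = PersonCols (map name rows) (map age rows)

-- SoA -> AoS: zip the columns back into rows.
toRows :: PersonCols -> [Person]
toRows (PersonCols ns gs) = zipWith Person ns gs
```

The conversions are trivial, but `PersonCols`, `toColumns` and `toRows` all have to be repeated for every record type, because there is no direct way to say "the same fields as `Person`, each wrapped in a list".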

But this problem goes deeper. Say we want to make typed data frames. One may want to specify that the type of a row is a product data type A. Then we have the problem of figuring out what the columns are, how they should be represented, and how to address them. On the other hand, one will also want to create an ad-hoc data frame, say with columns x :: X and y :: Y. Now the question is: what is the type of a row? vinyl and other extensible-record libraries try to answer that question.
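One possible answer to "what is the type of a row?" for an ad-hoc frame is to index both the frame and the row by a type-level list of (name, type) pairs. The sketch below only illustrates the idea; it is not vinyl's API, and `Frame`, `Row`, `rows`, `rowHead`, `rowTail` and `example` are hypothetical names.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- A frame indexed by a type-level list of (column name, element type).
data Frame (cols :: [(Symbol, Type)]) where
  Nil  :: Frame '[]
  (:&) :: [a] -> Frame cols -> Frame ('(name, a) ': cols)
infixr 5 :&

-- The matching row type: one value per column instead of one list.
data Row (cols :: [(Symbol, Type)]) where
  RNil :: Row '[]
  (:*) :: a -> Row cols -> Row ('(name, a) ': cols)
infixr 5 :*

-- SoA -> AoS. A frame with no columns has no row count, so it yields
-- infinitely many empty rows; zipWith trims them to the column length.
rows :: Frame cols -> [Row cols]
rows Nil          = repeat RNil
rows (xs :& rest) = zipWith (:*) xs (rows rest)

-- Positional access to a row's first field and remaining fields.
rowHead :: Row ('(name, a) ': cols) -> a
rowHead (x :* _) = x

rowTail :: Row ('(name, a) ': cols) -> Row cols
rowTail (_ :* r) = r

-- An ad-hoc frame with columns x :: Int and y :: String.
example :: Frame '[ '("x", Int), '("y", String)]
example = [1, 2] :& ["a", "b"] :& Nil
```

This sketch only offers positional access. Addressing a column by name needs a type-level lookup over the list, usually through a type class, and that is where type signatures and error messages tend to get heavy.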


I agree it could work in principle. But the real question is whether it works for real-world tasks. It has to support more types and probably needs to be extensible. Operations must allow injecting user-provided functions. It should provide good error messages. Construction may become too unwieldy. And I think the only way to get answers is to try it out.

I think for row manipulations they end up looking either like Vinyl records or you create a DSL for row expressions - sort of like Polars. I’m not partial to either approach but it seems the DSL would make for a simpler UI depending on the target audience.
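As a rough illustration of the DSL option, here is a minimal typed expression language evaluated one row at a time. It is only in the spirit of expression APIs like the one in Polars, not a port of it. `Expr`, `eval` and `filterRows` are hypothetical names, and treating every column as `Double` is a simplifying assumption.

```haskell
{-# LANGUAGE GADTs #-}

-- Simplifying assumption: a row maps column names to Double values.
type Row = [(String, Double)]

-- Row expressions, indexed by the type of their result.
data Expr a where
  Col :: String -> Expr Double
  Lit :: Double -> Expr Double
  Add :: Expr Double -> Expr Double -> Expr Double
  Mul :: Expr Double -> Expr Double -> Expr Double
  Gt  :: Expr Double -> Expr Double -> Expr Bool

-- Evaluate an expression against one row; Nothing if a column is missing.
eval :: Row -> Expr a -> Maybe a
eval row (Col n)   = lookup n row
eval _   (Lit x)   = Just x
eval row (Add a b) = (+) <$> eval row a <*> eval row b
eval row (Mul a b) = (*) <$> eval row a <*> eval row b
eval row (Gt a b)  = (>) <$> eval row a <*> eval row b

-- Keep the rows where the predicate evaluates to True; rows where it
-- cannot be evaluated (missing column) are dropped.
filterRows :: Expr Bool -> [Row] -> [Row]
filterRows p = filter (\r -> eval r p == Just True)
```

Because expressions are plain data, a library can inspect or rewrite them before running them, and users never have to see type-level row machinery. The cost is that column names are only checked when the expression is evaluated.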

One of the ideas I’ve been toying with is, in the long term, doing program synthesis to generate data transformations, sort of like Cloud Dataprep/Trifacta. A typed row manipulation API does make that synthesis problem a little easier (I think). That’s pie in the sky though.