[Design] Dataframes in Haskell

I’ll submit this proposal as a GitHub issue for the Haskell Foundation as well, but I’m posting it here for visibility too.

Document is open for comments.

Happy New Year, everyone.

14 Likes

Can you elaborate on the motivation? From my ignorant perspective, I don’t see why anyone would use a compiled language for exploratory data analysis. The big advantage of Haskell is strong type guarantees, but I don’t see how useful types can be if you don’t know the schema beforehand.

One counterexample I can think of is Spark in Java/Scala, but IIUC people use Spark mostly for big data analysis, not data exploration.

Any modern language that works with data (no matter how big or small the data is) must have a dataframe library.

Is there a distinction between a data ecosystem and the exploratory data analysis use case? I would think that having a robust data ecosystem is separate from how good/easy the language is at exploring data?

2 Likes

C++ has a popular solution and Rust has Polars, although in Rust’s case the Python bindings are more popular than the native library. I’ve also been in contact with the author of Java’s DFLib about design and benchmarking.

I’ve been toying with the idea of having a typed implementation as well as a dynamically typed one. But I have a hard time convincing myself that the typed use case wouldn’t be better served by a relational model like beam.

2 Likes

Thanks for explaining! I think a DataFrame implementation in Haskell would be great, mirroring the C++ and Rust implementations. Some miscellaneous comments:

  • Big +1 to using Typeable instead of HLists
  • You mentioned “DataFrames are ordered”, which wasn’t immediately obvious to me; I thought the idea was to be able to order columns separately, which didn’t make sense to me. Maybe rephrase as something like “DataFrames can be sorted, so all values in a DataFrame must be orderable”?
    • Also, the reference for this points to page 2 in the PDF, but I couldn’t find anything about this…
  • What’s the benefit of making a Haskell Foundation proposal? Why not just make the library yourself?
2 Likes

Thanks.

The doc is interesting. How will getting the Foundation involved benefit your idea and the community?

RE foundation as opposed to library: 1) it’d be great if there were consensus and buy-in for the design, since the big claim is that this is “foundational” data ecosystem work; 2) I wouldn’t want a BDFL model for this. It would be great if it were a community project (not sure if that’s what Foundation projects are for, but that’s the impression I got from reading the docs. Could be misguided).

2 Likes

@brandonchinn178 you’re right. That was the wrong way to paraphrase it; I was referring to this section:

“In addition, dataframes are ordered—and dataframe systems often enforce a strict coupling between logical and physical layout…”

About integration with other tools: Python’s DuckDB library can query various DataFrame implementations. This type of integration would instantly give an alternative API (SQL), which is only a benefit.

There are also efforts like ADBC and Arrow Flight SQL that give some hope for future Haskell-native database bindings (or, at the very least, easier support for various databases). It would be great if DataFrames could work over the result of a query.

I know that native support for Apache Arrow is mentioned, but I guess my question is, could support for Apache Arrow in Haskell be more foundational than “just” a DataFrame library? More foundational, as in, unlocking more use cases, one of which would be writing a DataFrame library.

2 Likes

EDA requires good interactivity and good plotting libraries. We have neither:

  1. We have IHaskell for interactive work, but in my experience it’s just not great: it tends to lag behind GHC releases (or has the situation improved as of late?) and lacks polish in general. Most importantly, AFAIR it runs Haskell in interpreted mode; in other words, it’s slow. Python escapes that problem by doing most of the work in C kernels.
  2. As for plotting, we just don’t have a commonly accepted solution. The situation is plain bad here.

So I think data frames have to be useful outside of EDA.


Dynamically typed columns also bring the problem of type promotion. How is, say, adding a Double and an Int handled? Python has no problem with this; it will promote the int to a double. And if the library adds implicit promotions, how do they interact with user-defined data types?

The Haskell community urgently needs better support for Parquet and Apache Iceberg. Their absence creates significant friction in building Haskell-based business apps that expose their data. A first-class dataframe library would not only address this issue but also make integrating Parquet and Iceberg much more straightforward.

3 Likes

There are these seemingly unused Arrow bindings based on glib/gobjects. I don’t think they ever found a use case, and the author probably ran out of steam.

You’re right though. Something so important shouldn’t be a side-thought of a separate project, but it could form the basis of an implementation that can be spun out.

In the meantime I’ve attended the bi-weekly Arrow community meeting, and I’m working on a prototype with some guidance from the Arrow team.

4 Likes

I share that same sentiment about IHaskell. My current target environment is GHCi, as shown in this example. Not the greatest, but it allows me to make progress while I think of a good UX solution.

GHCi is still interpreted, but it’s been working well so far.

RE plotting: again, agreed. Hence resorting to terminal plotting for now, but I do really like the gnuplot bindings since they offload the problem to a different system altogether.
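
For example, a minimal sketch with Graphics.Gnuplot.Simple from the gnuplot package (assuming a gnuplot binary on the PATH):

import Graphics.Gnuplot.Simple (Attribute (Title), plotList)

-- Hands the data off to an external gnuplot process for rendering.
main :: IO ()
main = plotList [Title "sine"] [(x, sin x) | x <- [0, 0.1 .. 10 :: Double]]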

RE promotion and dynamic typing: the current implementation isn’t that dynamic. Types are checked for equality at runtime, so adding a Double and an Int would throw an exception rather than silently promote. I haven’t thought about what this means for my design philosophy, so I’ll have to sit on this thought for some time to see what would make the most sense.
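
To make that concrete, here’s a minimal sketch of the idea (an illustration of the runtime-checked approach, not the actual implementation):

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Map (Map)
import qualified Data.Map as Map
import Data.Text (Text)
import Data.Typeable (Typeable, cast)
import Data.Vector (Vector)

-- A column erases its element type but keeps it recoverable via Typeable.
data Column where
  Column :: Typeable a => Vector a -> Column

newtype DataFrame = DataFrame (Map Text Column)

-- The requested element type must match the stored one exactly; asking
-- for Double when the column holds Int fails instead of promoting.
getColumn :: forall a. Typeable a => Text -> DataFrame -> Maybe (Vector a)
getColumn name (DataFrame cols) = do
  Column vec <- Map.lookup name cols
  cast vec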

2 Likes

I think vinyl: Extensible Records should be read in companion with Frames.

The most obvious drawback of these two libraries is that they are row-oriented. … Frames is a promising attempt but has a syntax that looks more like an advanced Haskell tool than a data science tool.

Frames is abstract in the type of a row and I can imagine that your core type:

HeterogeneousCollection [Vector T1, Vector T2…]

… can be accommodated.

Frames and vinyl are old, for sure, but there’s a lot of design ideas in them that could be useful.

1 Like

Author of chart-svg (a charting library targeting SVGs) here.

This package is designed around a simple, lensy API with hopefully pleasant defaults, and it tries to be completely configurable, but others would be better placed to list any shortcomings. It hasn’t been stable over the years, but it is getting better.

I was wondering what people might be looking for in a plotting library?

2 Likes

Thanks @tonyday567, I’ve never worked with chart-svg so I’ll give it a shot. For me the problem with charting in Haskell has been:

  • Setting up the backends is a pain (GTK vs Cairo), especially across OSes, so terminal plotting seemed “safer.” Migrating to a newer version of a backend is also a pain - I recently did some PRs for Dunai, and porting a simple example from SDL1 to SDL2 wasn’t as straightforward as I’d imagined.
  • I think part of it is also a perception problem. Whenever you search “Haskell plotting library” you’re met with a lot of posts about how there aren’t any good plotting libraries. I do think the sort of federated approach that tidyverse takes in R, or what DataHaskell aimed to do, would have solved this part of the problem. But from speaking to @ocramz, even that is a hard problem to solve.
2 Likes

I’ll try a prototype with Vinyl. I think a good north star for the implementation would be:

  • The UX of partitioning the data (adding columns, splitting the DataFrame, etc.)
  • The UX of schema evolution

I was recently in conversation with Joe Hellerstein from the modin project and he had a pretty good summary of the EDA problem.

“exploratory data analysis/cleaning requires type flexibility – for two reasons. First, types are used as exploratory “lenses” on even uniform data – e.g. we might try interpreting a numeric column as a date (e.g 20150202) vs. a time interval vs. currency etc. to figure out what was intended. Second, data often starts jumbled from multiple sources. So ground truth is effectively a union type, and as we discover/decide upon it, we can partition by type to reorganize the data (e.g. split into separate columns or separate datasets). This partitioning often happens incrementally: during cleaning we might treat a column as having two types T and other, and iteratively refine other. Hence the use of union types (or null-hypothesis “bottom” types) needs to be available on inputs and outputs of transformations.”

My takeaway was that types are hypotheses about what the ground truth might be, rather than a specification of what it is. An EDA system should make these sorts of manipulations easy to do and to validate, statistically or otherwise.
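
As a toy illustration of that workflow (hypothetical names, not a proposed API): treat a raw text column as a union of Int and “other”, split it, and keep refining the remainder:

import Data.Maybe (isNothing)
import Data.Text (Text)
import qualified Data.Text as Text
import qualified Data.Text.Read as Read
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Partition a raw column by a type hypothesis: values that parse as 'a'
-- become a typed column; the rest stay as text to be refined further.
partitionByType :: (Text -> Maybe a) -> Vector Text -> (Vector a, Vector Text)
partitionByType parse col =
  (V.mapMaybe parse col, V.filter (isNothing . parse) col)

-- One hypothesis: the column holds integers.
asInt :: Text -> Maybe Int
asInt t = case Read.decimal t of
  Right (n, rest) | Text.null rest -> Just n
  _ -> Nothing

Then partitionByType asInt rawColumn yields a typed Int column plus an “other” remainder to iterate on with the next hypothesis (dates, currency, and so on).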

From looking at the vinyl tutorial, doing these sorts of loose manipulations isn’t very ergonomic.

1 Like

That’s interesting; I didn’t know about that library. I skimmed through the API and it seems there’s no support for log scale, or did I miss it? Is there a way to specify the axis range manually, on one or both ends? What about another common problem: axes using dates?

So the first problem with plotting libraries is that the required feature set is HUGE. The second problem is API design: all the gajillion knobs the library supports should be exposed to the user, and they should be discoverable. At the same time, it has to sprinkle enough magic to make simple plots simple. Looking at lineExample, it seems like an awful lot of typing for such a simple plot; one shouldn’t have to mess with a palette to get some default colors.

There’s no log scale support, but rolling your own should be pretty easy. There’s a lot of flexibility with axis ranges, tick spacing and the like, and a lot of detail in axes, tick marks, grid lines and so on. Ticks can be placed automatically or you can select them manually. There’s a pretty sharp delineation between what is data and what is a rendering or visual aid (a Hud). The best way to safely specify a larger data range is to create a blank chart with the data rectangle you want to extend to. If you want to contract the data range, you’d have to truncate the actual data, and I’m not sure whether that’s a feature or a wart. Dates are supported.

There’s definitely a tradeoff between simplicity and feature set: there is a gazillion knobs (types?), and the library exposes every one of them. But a chart is just a big non-recursive, showable AST, so the trickiness is discovering which knob is the one you want. It’s initially complex, but I think there’s a payoff down the line in practical tweaking.

palette is just a palette of colors used in the chart examples. You can write set #color (Colour 1 0 0 0.5) to get red at half opacity, but it doesn’t include names, so you can’t write set #color red.

Thanks for the feedback!

1 Like

Getting log scale right is not simple at all. The implementation must handle 0 gracefully, which comes up often: take for example a histogram with huge variation in bin counts, where some bins are zero. One might want to plot it on a log scale, and the plot shouldn’t break because some bins are zero. Tick placement is also a maze of special cases: placement for an axis spanning several orders of magnitude is different from one spanning less than one.
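
One zero-safe approach (a sketch, not specific to any Haskell library, and tick placement remains the hard part) is a “symlog”-style transform that is linear near zero and logarithmic beyond a threshold:

-- Linear within [-t, t] (mapped onto [-1, 1]) and logarithmic outside,
-- so zero-valued bins land on the axis instead of at negative infinity.
-- The two branches agree at |x| = t, so the transform is continuous.
symlog :: Double -> Double -> Double
symlog t x
  | abs x <= t = x / t
  | otherwise  = signum x * (1 + logBase 10 (abs x / t))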

Filtering data is not an acceptable solution. First, it’s too difficult: one could have many objects on a plot, and it would be difficult to filter them all. On top of that, it’s just wrong. Consider a plot of x² with x range [-1,1] and y range [0.5,1]. The line should run beyond the plot area, but filtering out all values < 0.5 would create a false line between the two branches. Again, this is a very basic feature; I would expect any plotting library to provide it.

WRT color: I expect that dropping in a few line plots without setting any properties would result in each line getting a different default color, so it’s possible to understand which is which at a glance. That’s especially important for interactive use: one wants a result now, and it need not be perfect, but it should be decent.

P.S. But I think we’re getting off topic

2 Likes

I think that is doable using the machinery in Data.Typeable and Type.Reflection. We could keep some kind of data structure of “this type is promotable to this other one, and here is the promotion function” entries.
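
A minimal sketch of what such a registry could look like (names made up here, not a worked-out design):

{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

import Data.Maybe (listToMaybe, mapMaybe)
import Data.Type.Equality ((:~~:) (HRefl))
import Type.Reflection (Typeable, eqTypeRep, typeRep)

-- One entry: evidence that one type is promotable to another,
-- together with the promotion function itself.
data Promotion where
  Promotion :: (Typeable a, Typeable b) => (a -> b) -> Promotion

registry :: [Promotion]
registry =
  [ Promotion (fromIntegral :: Int -> Double)
  , Promotion (realToFrac :: Float -> Double)
  ]

-- Recover a promotion function by comparing runtime type representations.
findPromotion :: forall a b. (Typeable a, Typeable b) => Maybe (a -> b)
findPromotion = listToMaybe (mapMaybe match registry)
  where
    match :: Promotion -> Maybe (a -> b)
    match (Promotion (f :: x -> y)) = do
      HRefl <- eqTypeRep (typeRep @x) (typeRep @a)
      HRefl <- eqTypeRep (typeRep @y) (typeRep @b)
      pure f

With this, findPromotion @Int @Double finds the widening function, while findPromotion @Double @Int returns Nothing, which matches the “no silent narrowing” behaviour mentioned upthread.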

As for using extensible record libraries like “vinyl” for extra type safety, I’m a bit wary about their usability. Perhaps, if we are going to work within ghci, an alternative would be to use Template Haskell. Like, if we have a frame with dynamically typed columns, we could introspect them and generate a strongly typed record and perhaps a conversion function.

AFAIK it’s possible to run [Dec] splices in ghci, although it’s a bit awkward: you need multi-line commands, and you have to prepend them with some dummy declaration or assignment, otherwise they get misinterpreted as Exp splices (is there a way to avoid this?):

ghci> :set -XTemplateHaskell
ghci> import Data.Aeson (toJSON)
ghci> import Data.Aeson.TH (defaultOptions, deriveJSON)
ghci> data Foo = Foo Int Int
ghci> foo = Foo 1 2
ghci> :{
ghci| _ = () -- how to avoid having to put this dummy decl?
ghci| $(deriveJSON defaultOptions ''Foo)
ghci| :}
ghci> toJSON foo
Array [Number 1.0,Number 2.0]
1 Like