[Initial feedback request] DataFrame library

I’ve been exploring the design space and wanted to try creating a dataframe library that’s meant for more exploratory data analysis. That is, where you don’t know the shape of the data beforehand and want to load it up quickly and answer pretty basic questions.
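Roughly the kind of session I have in mind, as a sketch (the function names here are illustrative placeholders, not a settled API):

import qualified DataFrame as D

main :: IO ()
main = do
  -- load a CSV with no schema declared up front
  df <- D.readCsv "data/taxi.csv"
  -- poke at the shape interactively
  print (D.dimensions df)
  print (D.columnNames df)
  -- answer a basic question about a column
  print (D.mean "fare" df)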

Please let me know what you think of this direction and maybe clue me in on some existing tools in case I’m duplicating work.

8 Likes

The example is very good at introducing what this library is all about; the Haddocks, less so.

I don’t think exploratory data analysis can be done without dynamic typing, so I welcome this!

2 Likes

Thanks. I’m going to ask non-Haskell people to try it out to see if it fulfills the goal of being easy to use.

1 Like

If it targets non-Haskell people, would you consider building a wrapper for a language like Python?

Thank you for sharing!

I do have some pointers for you on similar work:

  • frames : row-oriented, the first real Haskell dataframes library, with compile-time affordances for inferring column types.
  • colonnade : column-oriented, with adapters to some standard data formats and a nice way to talk about table headers.
  • heidi : row-oriented, using Generics (specifically, generics-sop) to extract the frame structure from regular Haskell algebraic types.
4 Likes

Like Polars, I’d like to eventually get parallelism gains from using Haskell. It might be worth wrapping Python for charting though, but I noticed haskell-chart already does this.

I’ve seen and worked with frames. I’m trying really hard to steer away from the template Haskell approach and see how ergonomic I can make the API.

Colonnade seems great; I’ll look into it more.

RE Heidi, aren’t row-based representations more awkward to query?
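To make the contrast concrete, here’s a minimal sketch (hypothetical types, not any particular library’s representation):

import qualified Data.Map.Strict as Map
import Data.Text (Text)
import Data.Vector (Vector)

data Cell = CInt Int | CText Text

-- row-oriented: one Map per record; selecting a column walks every row
type RowOriented = [Map.Map Text Cell]

-- column-oriented: one Vector per column; selecting a column is one lookup
type ColumnOriented = Map.Map Text (Vector Cell)

selectColumn :: Text -> ColumnOriented -> Maybe (Vector Cell)
selectColumn = Map.lookup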

Yeah, I started on Heidi knowing very little about DB internals :smiley: It’s mostly a proof of concept for the user-facing API.

The Dataframe project is great; I could have used it many times.

It’s also great that Dataframe puts a spotlight on the gulf between routine quick exploratory data analysis (EDA) and typical rigorous haskell engineering, where strong static typing is vitally important. This raises interesting questions about the product concept and roadmap.

Given that Dataframe types are dynamic and column names are strings, it’s philosophically much closer to python-pandas-polars and R-tidyverse than to traditional haskell. With that structural foundation, the dataframe package has two obvious paths going forward:

Path 1) Continue in the dynamic direction, expanding functionality to compete directly with python and R.

Path 2) Provide a bridge between a lightweight dynamic style and traditional haskell strong static typing.

Path 1 is a big project, simply because the depth and breadth of other EDA platforms define the baseline, and therefore define community expectations. For example, matplotlib and seaborn are flexible and define the cultural norm, and building analogous haskell tools will need substantial engineering resources.

Next, suppose that a full-featured Dataframe is wildly successful. That’s essentially a big EDSL which doesn’t share haskell’s strengths, value propositions, trajectory, and culture. That’s not a bad thing but it’s not haskell.

Path 2 is another thing entirely. It would be straightforward to write adaptors between the dataframe library and traditional haskell packages like aeson, but there are bigger questions around the UX and understanding just what problems the bridge is solving.
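For instance, a row-to-JSON adaptor could be as small as this sketch (the Cell type is a hypothetical stand-in for dataframe’s dynamic cells; the aeson side assumes aeson ≥ 2):

import Data.Aeson (Value (..), object, toJSON)
import qualified Data.Aeson.Key as Key
import Data.Text (Text)

-- hypothetical dynamic cell type, standing in for the library's own
data Cell = CInt Int | CDouble Double | CText Text | CNull

cellToJSON :: Cell -> Value
cellToJSON (CInt n)    = toJSON n
cellToJSON (CDouble d) = toJSON d
cellToJSON (CText t)   = toJSON t
cellToJSON CNull       = Null

-- one dataframe row, rendered as a JSON object keyed by column name
rowToJSON :: [(Text, Cell)] -> Value
rowToJSON row = object [ (Key.fromText k, cellToJSON v) | (k, v) <- row ]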

Not easy questions but fun to consider!

1 Like

I thought about this. There could be two abstractions.

  1. What we have now, which is, as you note, far from traditional Haskell: the goal is to load data into memory, do some analysis on top of it, and answer questions quickly. An API like this benefits from being somewhat dynamic, since a user wants to fail fast and try things.
  2. A lazy API that builds an execution graph on top of a typed representation of a dataframe, then runs it at the end (see the sketch below). This one would be closer to an ETL pipeline for data larger than memory. The fail-fast model doesn’t work here, because you don’t want a pipeline that fails with a runtime error 20 minutes into a run.

This is similar to the distinction that Polars has. I’ve noted this issue in “Future work”.
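Roughly what I imagine for (2), sketched as a GADT (all names hypothetical):

{-# LANGUAGE GADTs #-}
import Data.Text (Text)

-- a typed, deferred expression: build the graph now, run it at the end
data Expr a where
  ColInt  :: Text -> Expr Int          -- typed reference to a column
  ColText :: Text -> Expr Text
  LitInt  :: Int  -> Expr Int
  Mul     :: Expr Int -> Expr Int -> Expr Int

-- e.g. Mul (ColText "city") (LitInt 1000) is rejected at compile time,
-- long before a 20-minute pipeline run could fail on it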

What do you think?

a lazy API that builds an execution graph on top of a typed representation of a dataframe

Cool.

(I’m assuming that typed representation means dynamic type inference inside DataFrame, e.g., during read_csv, rather than traditional haskell compile-time typing. If that’s wrong, ignore everything below.)

Then we’d have two independent type systems and API flavors which users see as unrelated. In the extreme case, a data science user could be completely unaware of traditional haskell typing, because the EDSL is so all-encompassing that there’s no type system leakage from haskell. At the same time, traditional haskell users might cringe because there are now two independent type systems in play, and DataFrame can’t easily interoperate with existing apps and packages.

Imagine a typical python or R use case: Subset rows of a dataframe using a regex on a string column, and for that subset, add a new column based on a lambda function that calls scikit. For that to be successful, there needs to be a user-friendly bridge between these two worlds.
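Sketching what that bridge might look like on the haskell side (all of this is hypothetical glue, not an existing API):

import qualified Data.Map.Strict as Map
import Data.Text (Text)
import qualified Data.Vector as V

data Cell  = CDouble Double | CText Text
type Frame = Map.Map Text (V.Vector Cell)

-- checked cast from a dynamic column to a typed vector
asDoubles :: V.Vector Cell -> Either String (V.Vector Double)
asDoubles = traverse go
  where go (CDouble d) = Right d
        go _           = Left "column is not numeric"

-- apply an ordinary typed haskell function (the stand-in for the
-- scikit call) to a dynamic column, storing the result as a new column
deriveColumn :: Text -> Text
             -> (V.Vector Double -> V.Vector Double)
             -> Frame -> Either String Frame
deriveColumn src dst f frame = do
  col <- maybe (Left "no such column") Right (Map.lookup src frame)
  xs  <- asDoubles col
  pure (Map.insert dst (V.map CDouble (f xs)) frame)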

A new, independent type system is a big architectural stake in the ground, so opinions on DataFrame may differ based on this point. Presumably data scientists, with a job to be done quickly, will love DataFrame for all the reasons they love python and R. At the same time, haskell traditionalists might be less enthusiastic because the world they knew – one big type system at the center of everything – has become two big type systems that don’t interoperate.

2 Likes

Fair. That seems like a bad idea. I guess my thinking was that, on the spectrum of things one could do to represent heterogeneous collections, the weakly typed options are good for the EDA case, while the more heavy-handed approaches like HLists and the other approaches discussed here would make more sense as a time investment for a world in which you’re not writing one-off scripts.
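(For reference, the kind of heavy machinery I mean; a minimal HList:)

{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
import Data.Kind (Type)

data HList (ts :: [Type]) where
  HNil :: HList '[]
  (:&) :: t -> HList ts -> HList (t ': ts)
infixr 5 :&

-- a fully typed heterogeneous row: great for long-lived engineering,
-- heavy for a one-off script
row :: HList '[Int, String, Bool]
row = 1 :& "one" :& True :& HNil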

But after sitting on what you said, I agree that would be a different library altogether. It makes little sense to unify the two.

I wonder if there’s an interesting intersection between your two abstractions (quick dynamic EDA tools, a type-safe execution graph). If building and type-checking the graph were very fast and integrated into a platform like Jupyter, you’d essentially have a general-purpose type-safe data science platform.

Imagine the wonderful type-safe user experience:

customers = read_csv(...)
customers.bonus_points = customers.city * 1000
                         ^^^^^^^^^^^^^^^^^^^^^
Type error: DataFrame.String * DataFrame.Number

Yes, it would be great, but the reality is that all data sources should be assumed to be garbage; therefore, assuming a particular (and fixed!) type for a column that’s been read straight from a CSV is generally a bad idea.

What Frames does is sample a few data rows at compile time and construct a type from those. But what if someone put a phone number instead of an email at row 9872342? And so on.
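From memory, the Frames approach looks roughly like this (exact names, options, and pragmas may differ between versions):

{-# LANGUAGE DataKinds, FlexibleContexts, TemplateHaskell #-}
import Frames

-- the splice reads and samples users.csv at compile time and
-- generates a record type for its columns...
tableTypes "User" "data/users.csv"

-- ...so loading is typed, but row 9872342 had better match the sample
loadUsers :: IO (Frame User)
loadUsers = inCoreAoS (readTable "data/users.csv")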

There are three distinct phases in play:

  1. Heuristically infer the column datatype, like pandas and other CSV readers do. Typically they sample n rows and make a good guess; see the sketch after this list. (Obviously this issue disappears with typed data like SQL tables.)
  2. Read the full CSV, detecting errors when text can’t be parsed into the column’s datatype. Those errors could produce NaN or Nothing values, throw exceptions, ignore entire rows, call a lambda that produces default values, or do any other useful API behavior.
  3. Typecheck the execution graph using the inferred CSV datatypes, as well as anything else relevant, e.g., 1000: DataFrame.Number in the example above.
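Phase 1 is the only heuristic part, and it’s small. A sketch (hypothetical, not any particular reader’s implementation):

import qualified Data.Text as T
import qualified Data.Text.Read as TR

data ColType = TInt | TDouble | TText deriving (Eq, Show)

isInt :: T.Text -> Bool
isInt t = case TR.signed TR.decimal t :: Either String (Integer, T.Text) of
  Right (_, rest) -> T.null rest
  _               -> False

isDouble :: T.Text -> Bool
isDouble t = case TR.signed TR.double t of
  Right (_, rest) -> T.null rest
  _               -> False

-- phase 1: sample n cells and take the strongest type they all parse as
inferColumn :: [T.Text] -> ColType
inferColumn sample
  | all isInt sample    = TInt
  | all isDouble sample = TDouble
  | otherwise           = TText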

These phases are essentially independent, so the general-purpose type-safe data science platform really is doable. I get excited just thinking about it!

1 Like