I’ve been exploring the design space and wanted to try creating a dataframe library that’s meant for more exploratory data analysis. That is, where you don’t know the shape of the data beforehand and want to load it up quickly and answer pretty basic questions.
Please let me know what you think of this direction and maybe clue me in on some existing tools in case I’m duplicating work.
Like Polars, I’d like to eventually get parallelism gains from using Haskell. It might be worth wrapping Python for charting, though I noticed Haskell-chart already does this.
The Dataframe project is great, I could have used it many times.
It’s also great that Dataframe puts a spotlight on the gulf between routine quick exploratory data analysis (EDA) and typical rigorous haskell engineering, where strong static typing is vitally important. This raises interesting questions about the product concept and roadmap.
Given that Dataframe types are dynamic and column names are strings, it’s philosophically much closer to python-pandas-polars and R-tidyverse than traditional haskell. With that structural foundation, the dataframe package has two obvious paths going forward:
Path 1) Continue in the dynamic direction, expanding functionality to compete directly with python and R.
Path 2) Provide a bridge between a lightweight dynamic style and traditional haskell strong static typing.
Path 1 is a big project, simply because the depth and breadth of other EDA platforms define the baseline, and therefore define community expectations. For example, matplotlib and seaborn are flexible and define the cultural norm, and building analogous haskell tools will need substantial engineering resources.
Next, suppose that a full-featured Dataframe is wildly successful. That’s essentially a big EDSL which doesn’t share haskell’s strengths, value propositions, trajectory, and culture. That’s not a bad thing but it’s not haskell.
Path 2 is another thing entirely. It would be straightforward to write adaptors between the dataframe library and traditional haskell packages like aeson, but there are bigger questions around the UX and understanding just what problems the bridge is solving.
I thought about this. There could be two abstractions:
1) What we have now which, as you note, is far from typical Haskell: the goal is to load data into memory, do some analysis on top of it, and answer questions quickly. An API like this benefits from being somewhat dynamic, since a user wants to fail fast and try things.
2) A lazy API that builds an execution graph on top of a typed representation of a dataframe and then runs it at the end. This one would be closer to an ETL pipeline for files larger than memory. The fail-fast model doesn’t work here, because you don’t want a pipeline that fails 20 minutes into running with a runtime error. (There’s a rough sketch of both flavors below.)
This is similar to the distinction that Polars has. I’ve noted this issue in “Future work”.
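Purely as a hypothetical sketch, and not the current API (DynFrame, columnMean, Expr and so on are all made up here), the two flavors could look something like this:

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical sketch only: none of these names are the library's API.
module TwoFlavors where

import qualified Data.Map.Strict as M

-- Flavor 1: eager and dynamic. Columns are discovered at runtime, so a
-- missing column or a non-numeric cell fails fast with a runtime error.
data DynValue = DInt Int | DDouble Double | DText String
  deriving Show

newtype DynFrame = DynFrame (M.Map String [DynValue])

columnMean :: String -> DynFrame -> Either String Double
columnMean name (DynFrame cols) = do
  vals <- maybe (Left ("no such column: " <> name)) Right (M.lookup name cols)
  nums <- traverse asDouble vals
  pure (sum nums / fromIntegral (length nums))
  where
    asDouble (DInt i)    = Right (fromIntegral i)
    asDouble (DDouble d) = Right d
    asDouble (DText t)   = Left ("not numeric: " <> t)

-- Flavor 2: lazy and typed. The pipeline is only *described* as a typed
-- expression graph; an interpreter (or query planner) runs it at the end,
-- so type errors surface before a 20-minute job starts.
data Expr a where
  Col    :: String -> Expr [Double]                     -- typed column reference
  MapCol :: (Double -> Double) -> Expr [Double] -> Expr [Double]
  Mean   :: Expr [Double] -> Expr Double

plan :: Expr Double
plan = Mean (MapCol (* 2) (Col "price"))
```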
a lazy API that builds an execution graph on top of a typed representation of a dataframe
Cool.
(I’m assuming that typed representation means dynamic type inference inside DataFrame, e.g., during read_csv, rather than traditional haskell compile-time typing. If that’s wrong, ignore everything below.)
Then we’d have two independent type systems and API flavors which users see as unrelated. In the extreme case, a data science user could be completely unaware of traditional haskell typing, because the EDSL is so all-encompassing that there’s no type system leakage from haskell. At the same time, traditional haskell users might cringe because there are now two independent type systems in play, and DataFrame can’t easily interoperate with existing apps and packages.
Imagine a typical python or R use case: Subset rows of a dataframe using a regex on a string column, and for that subset, add a new column based on a lambda function that calls scikit. For that to be successful, there needs to be a user-friendly bridge between these two worlds.
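As a toy, made-up sketch (none of these names come from the dataframe package), the bridged version of that use case could look roughly like this in Haskell, with an ordinary typed function standing in for the scikit call:

```haskell
-- Toy sketch only; not the dataframe package's API.
module BridgeSketch where

import qualified Data.Map.Strict as M
import           Text.Regex.TDFA ((=~))

-- The dynamic side: a row is just a map from column name to raw text.
type Row = M.Map String String

-- The typed side: any ordinary Haskell function (standing in for the
-- scikit-style model call in the Python example).
score :: String -> Double
score = fromIntegral . length

-- Subset rows whose "email" column matches a regex, then add a "score"
-- column computed by the typed function for that subset.
enrich :: [Row] -> [Row]
enrich rows =
  [ M.insert "score" (show (score email)) row
  | row        <- rows
  , Just email <- [M.lookup "email" row]
  , email =~ ("@example\\.com$" :: String)
  ]
```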
A new, independent type system is a big architectural stake in the ground, so opinions on DataFrame may differ based on this point. Presumably data scientists, with a job to be done quickly, will love DataFrame for all the reasons they love python and R. At the same time, haskell traditionalists might be less enthusiastic because the world they knew – one big type system at the center of everything – has become two big type systems that don’t interoperate.
Wonder if there’s an interesting intersection between your two abstractions (quick dynamic EDA tools, execution call graph with type safety). If building and type-checking the graph were very fast and integrated into a platform like jupyter, you’d essentially have a general-purpose type-safe data science platform.
Yes, it would be great, but the reality is that all data sources should be assumed to be garbage; assuming a particular (and fixed!) type for a column that’s been read straight from a CSV is therefore generally a bad idea.
What ‘Frames’ does is sample a few data rows at compile time and construct a type for those. But what if someone put a phone number instead of an email at row 9872342? And so on.
There’s a reasonable way to handle that, in phases:
1) Heuristically infer the column datatype, like pandas and other CSV readers do. Typically they sample n rows and make a good guess. (Obviously this issue disappears with typed data like SQL tables.)
2) Read the full CSV, detecting errors when text can’t be parsed into the column’s datatype. Those errors could produce NaN or Nothing values, throw exceptions, ignore entire rows, call a lambda that produces default values, or do any other useful API behavior.
3) Typecheck the execution graph using the inferred CSV datatypes, as well as anything else relevant, e.g., 1000: DataFrame.Number in the example above.
These phases are essentially independent, so the general-purpose type-safe data science platform really is doable. I get excited just thinking about it!
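For instance, phases 1 and 2 might look roughly like this crude, hypothetical sketch (not the dataframe package’s code):

```haskell
-- Minimal sketch of phases 1 and 2, with deliberately crude inference rules.
module InferSketch where

import Data.Maybe (isJust)
import Text.Read  (readMaybe)

data ColType = CInt | CDouble | CText
  deriving (Eq, Show)

-- Phase 1: sample the first n raw values of a column and guess its type.
inferType :: Int -> [String] -> ColType
inferType n raw
  | all isInt    sample = CInt
  | all isDouble sample = CDouble
  | otherwise           = CText
  where
    sample     = take n raw
    isInt s    = isJust (readMaybe s :: Maybe Int)
    isDouble s = isJust (readMaybe s :: Maybe Double)

-- Phase 2: parse the whole column against the inferred type, mapping
-- unparseable cells (the phone number at row 9872342) to Nothing instead
-- of aborting the load. Other policies (defaults, dropping rows, throwing)
-- would slot in here.
data Cell = CellInt Int | CellDouble Double | CellText String
  deriving Show

parseColumn :: ColType -> [String] -> [Maybe Cell]
parseColumn CInt    = map (fmap CellInt    . readMaybe)
parseColumn CDouble = map (fmap CellDouble . readMaybe)
parseColumn CText   = map (Just . CellText)
```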
“There is no ‘standard library’ of table operations.”
[from the paper linked from the ReadMe]
That is just not true. The yardstick is ‘Relational Completeness’ (Codd, 1970). I’m very surprised that section 2.2 ‘Benchmark origins’ mentions SQL – which is, frankly, an embarrassment to the whole industry – but not Relational Algebra, which is as succinct and elegant as the Lambda calculus (a comparison worth making, since you’re posting on a Haskell forum).
As the paper goes on to say, it may be that some particular language doesn’t provide that exact set out of the box; it must then be able to compose what it does provide to arrive at those operations. (In the literature this is known as being ‘expressively complete’ – see Date & Darwen.)
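To make ‘compose what it does provide’ concrete, here’s a toy sketch (my own names, nothing from the paper): intersection and projection fall out of a handful of primitives.

```haskell
-- Toy illustration of expressive completeness: given a few primitive
-- operators, further operators are definable by composition.
module RelAlgToy where

import qualified Data.Map.Strict as M
import qualified Data.Set        as S

-- Rows are keyed by attribute name; deliberately no positional access.
type Row      = M.Map String String
type Relation = S.Set Row

-- Some of Codd's primitives.
restrict :: (Row -> Bool) -> Relation -> Relation
restrict = S.filter

union, difference :: Relation -> Relation -> Relation
union      = S.union
difference = S.difference

-- Intersection need not be primitive: R ∩ S = R − (R − S).
intersection :: Relation -> Relation -> Relation
intersection r s = r `difference` (r `difference` s)

-- Projection onto a set of attribute names, composed from Map operations.
project :: [String] -> Relation -> Relation
project names = S.map (M.filterWithKey (\k _ -> k `elem` names))
```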
And yes, the choice of user-facing operations is down to human factors – which I imagine are so difficult to compare as to make rating pointless.
[3.1 Basic definitions] A schema is an ordered sequence of column names and associated sorts.
No to the “ordered”: if the tool exposes the implementation detail of left-to-right ordering, I would ipso facto count it as not industrial strength. Worse still if it supports positional access as well as access via column name. (This is the ugly situation Haskell has got itself into, making for very brittle code. Contrast PureScript, for example, which doesn’t allow positional access to fields of datatypes declared with record syntax – quite right too.)
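A small made-up example of that brittleness, using an ordinary record rather than anything from the dataframe package:

```haskell
-- If two fields share a type, reordering the declaration silently changes
-- what positional code means; access by field name is unaffected.
module PositionalBrittleness where

data Person = Person
  { firstName :: String
  , lastName  :: String   -- same type as firstName: swapping them still compiles
  , age       :: Int
  }

-- Positional: tied to the declaration order of the fields.
lastPositional :: Person -> String
lastPositional (Person _ l _) = l

-- By name: keeps working if fields are reordered or new ones are added.
lastByName :: Person -> String
lastByName = lastName
```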
Since ‘Benchmark origins’ does mention SQL, note that positional access – a 1960s-era holdover – is widely regarded as a mis-feature. This century’s iterations of the standard are aiming to extirpate it.
Reading on, I’m going to disagree with rather a lot of section 3.1, to the extent that I’m not seeing the value in the exercise. But first, reflect on this point from section 3.2:
… we ignore input-output … does not stipulate how tables are entered into programs …
Exactly. The left-to-right sequence of how (say) CSV physical/textual columns match to logical named columns is an implementation detail that should be rated (if at all) under usability, not expressivity.