Pre-HFTP: Proposal DataFrame Library for Haskell

mchav · December 7, 2024, 6:50pm

I have a working prototype of a dataframe library. Some time ago @LaurentRDC suggested I write a proposal for it to send to the Haskell Foundation, but it seems the only time I get large blocks of time for personal projects is over the holidays.

Screencast of usage in GHCi

I’m seeking initial feedback on the approach and some possible future directions.

Where does this library fit into the design space?
I think it’s good to have a library that allows you to go from “I have a dataset” to “oh, this is what this data is about” very quickly. As such, this library prioritizes simplicity where possible. A few design decisions in particular:

An API that is reminiscent of Pandas, Polars, and SQL
Dynamic typing (which incidentally also gives more control over error messages; GHC’s errors can be a little intimidating)
Use in GHCi/notebooks/literate programming rather than standalone scripts
Terminal-based plotting so users don’t have to have all the right lib-gtk/sdl libraries installed
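
To make the dynamic-typing point concrete, here is a minimal sketch of the general idea. It is not taken from the library: `Column` and `columnAs` are hypothetical names, and the real representation may differ. The column hides its element type behind an existential, and a mismatch is reported as an ordinary runtime message that the library author controls, rather than as a GHC type error.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Typeable (Proxy (..), Typeable, cast, typeRep)

-- Hypothetical: a column whose element type is only known at runtime.
data Column = forall a. (Typeable a, Show a) => Column [a]

-- Try to view a column at a requested type. On mismatch, return a
-- readable message instead of failing at compile time.
columnAs :: forall b. Typeable b => Column -> Either String [b]
columnAs (Column (xs :: [a])) =
  case cast xs of
    Just ys -> Right ys
    Nothing ->
      Left $
        "Type mismatch: column holds "
          ++ show (typeRep (Proxy :: Proxy a))
          ++ " but "
          ++ show (typeRep (Proxy :: Proxy b))
          ++ " was requested"

main :: IO ()
main = do
  let ages = Column [31 :: Int, 45, 27]
  print (columnAs ages :: Either String [Int])    -- succeeds
  print (columnAs ages :: Either String [Double]) -- readable error
```

The trade-off is the usual one: mistakes surface when the expression runs rather than when it compiles, which suits interactive exploration more than long-lived scripts.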

I’ve included some future work in the README that highlights things I’d like to work on in the near to medium term.

Once the large questions are settled I’d also like to do more UX studies, e.g. surveying data scientists about the usability and ergonomics of the API, and about what feature completeness looks like.

But before all that, I’d welcome initial feedback, and maybe a look at the code, because I think there is a lot of unidiomatic Haskell in the codebase (lots of repetition and many partial functions).

After getting feedback from this thread I’ll work on a formal proposal doc to send over. Thanks. I’ll also cross-post for more feedback.

romes · December 8, 2024, 12:14am

Very good work with the demo @mchav, it looks great.

mchav · December 8, 2024, 1:09am

Thank you. I was inspired by jpterm to keep everything in the terminal.

For the example I was using the introductory example from a TensorFlow intro book. I’m still trying to find a good set of benchmarks to profile against, but that’s a design afterthought for now.

ocramz · December 8, 2024, 3:32am

Thank you @mchav, this looks very promising.

Regarding features, as a user I would need:

pivoting (wider/longer), as in tidyr’s pivot_wider (“Pivot data from long to wide”)
indexing/subsetting by rows and columns (pandas has a very unintuitive syntax for this)
relational joins, INNER JOIN at least
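
As a rough illustration of the last item, an inner join over keyed rows can be sketched with `Data.Map.Strict` from the `containers` package. This is not the library’s API: `innerJoin` and the sample tables are hypothetical, and the sketch assumes keys are unique within each table.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: inner join two keyed tables, keeping only the
-- keys present in both. Assumes keys are unique within each table;
-- with duplicates, Map.fromList keeps only the last row per key.
innerJoin :: Ord k => [(k, a)] -> [(k, b)] -> [(k, (a, b))]
innerJoin left right =
  Map.toList
    (Map.intersectionWith (,) (Map.fromList left) (Map.fromList right))

main :: IO ()
main = do
  let names = [(1 :: Int, "Ada"), (2, "Grace"), (3, "Edsger")]
      langs = [(2 :: Int, "COBOL"), (3, "ALGOL"), (4, "Haskell")]
  -- Only keys 2 and 3 appear in both tables.
  mapM_ print (innerJoin names langs)
```

A dataframe-level join would additionally have to deal with duplicate keys, multiple key columns, and column name clashes, which this sketch leaves out.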

mchav · December 8, 2024, 5:43am

Thanks for the input @ocramz. I’ll open issues for those operations. Btw, I meant to ask: what were the learnings from the DataHaskell effort? I remember you were active there. The website is still up and it still speaks of roadmaps and the like. Are there plans to resuscitate the effort?

ocramz · December 8, 2024, 6:27am

Thank you, glad to hear you’re interested.

Learnings from the DataHaskell experiment, good question!

OSS communities take a lot of energy and require deliberate and thoughtful communication. People lose interest immediately if these are not in place.
Goals, momentum and planning. Unifying numerical computing and data analysis is a very ambitious program with tons of tricky technical details and UX things to iron out by interacting with users. Overly ambitious roadmaps kill (unpaid) projects. This is kind of where we left off before losing steam: GitHub - DataHaskell/dh-core: Functional data science
Documentation over “updates”. At some point we had a buzzing chat on Gitter, we even had an in-person workshop at ICFP’17 and a Zoom call that spanned many timezones, but nothing beats a single online place where people can read stuff on their own time. I suppose this is the closest we got: Issues · DataHaskell/dh-core · GitHub

Honestly, I don’t really have the bandwidth to restart DH, but I would happily give the keys to an interested party. Truth be told it was a very inclusive setup and people were invited as soon as they signaled interest, so in the same vein I have invited you in as well.

mchav · December 8, 2024, 6:10pm

Greatly appreciate it. Might be worth setting up a Discord instead of Gitter.

What were some things that worked communication-wise? What counted as deliberate and thoughtful communication?
If you could redefine the goals with the benefit of hindsight, what would they be and how would you pace them?
By updates, do you mean posts or messages about new work, as opposed to putting it in a doc?

l-Shane-l · December 8, 2024, 8:09pm

I can definitely relate to this experience. You could request help for any of the tasks that don’t require much design work, like documentation, anything devops-related, or other tasks of this nature that take a lot of time.

I really like this library; it’s something I really need on a project I’m currently working on. I already have to jump between Haskell, C, JS, Bash, Perl and Dart on a daily basis (I am trying to phase out Dart and Perl). If I could remove the need to also jump into a Python context, with all the tooling, syntax and conceptual shift that comes with switching languages, that would be a big win for me.

It’s probably unlikely to completely remove the need for Python, but for me at least, I use it mainly for data exploration atm.

danidiaz · December 8, 2024, 8:31pm

In the example video, the use of & in combination with multi-line commands seems pretty convenient!

ocramz · December 9, 2024, 1:34am

You know, DataHaskell actually started with a Slack space, and then by migrating to Gitter (linked to repos, open archives) we instantly lost ~50% of the people?

ocramz · December 9, 2024, 1:37am

Very much agree. In addition, as a syntactic convenience it’s already familiar to a good chunk of the data science community thanks to the tidyverse: Introducing magrittr • magrittr

mchav · December 9, 2024, 3:00am

@l-Shane-l that would be great. I’d appreciate the help and would love to help in return. I remember asking this on Reddit some time ago.

@danidiaz thank you. I thought about exporting (|>) = (&) as an alias so that it looks like Elm/F#, but figured using & was better since it’s only one key.
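
For readers unfamiliar with the operator, here is a small sketch of what that alias would look like. `(&)` is the real reverse-application operator from `Data.Function`; the `(|>)` definition and the pipeline are illustrative only and are not exported by the library as far as this thread says.

```haskell
import Data.Function ((&))

-- Hypothetical alias: the same left-to-right application as (&),
-- spelled the way Elm and F# spell it. The fixity matches (&).
infixl 1 |>
(|>) :: a -> (a -> b) -> b
(|>) = (&)

main :: IO ()
main = do
  let xs = [1 .. 10] :: [Int]
  -- Both pipelines read top to bottom: keep evens, double, sum.
  print (xs & filter even & map (* 2) & sum)
  print (xs |> filter even |> map (* 2) |> sum)
```

Both lines print the same value, since the alias changes only the spelling and not the behaviour.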

@ocramz have all messaging forums for this been abandoned?
