Pre-HFTP: Proposal DataFrame Library for Haskell

I have a working prototype of a dataframe library. Some time ago @LaurentRDC suggested I write up a proposal for it to send to the Haskell Foundation, but it seems the only time I get large blocks of time for personal projects is over the holidays.

Screencast of usage in GHCi

I’m seeking initial feedback on the approach and some possible future directions.

Where does this library fit into the design space?
I think it’s good to have a library that allows you to go from “I have a dataset” to “oh, this is what this data is about” very quickly. As such, this library prioritizes simplicity where possible. A few design decisions in particular:

  • An API that is reminiscent of Pandas, Polars, and SQL
  • Dynamic typing (which also incidentally gives more control over the error messaging - GHC’s errors can be a little intimidating)
  • Use in GHCi/notebooks/literate programming rather than standalone scripts
  • Terminal-based plotting so users don’t have to have all the right lib-gtk/sdl libraries installed.

I’ve included some future work in the README that highlights things I’d like to work on in the near to medium term.

Once the large questions are settled I’d also like to do more UX studies, e.g., survey data scientists and ask what they think about the usability and ergonomics of the API, and what feature completeness looks like.

But before all that, I welcome initial feedback, and maybe a look at the code, because I think there is a lot of unidiomatic Haskell in the codebase (lots of repetition and many partial functions).

After getting feedback from this thread I’ll work on a formal proposal doc to send over. Thanks. I’ll also cross-post for more feedback.

Very good work with the demo @mchav, it looks great.

Thank you. I was inspired by jpterm to keep everything in the terminal.

For the example I used the introductory example from a TensorFlow intro book. I’m still trying to find a good set of benchmarks to profile against, but that’s a design afterthought for now.

Thank you @mchav , this looks very promising.

Regarding features, as a user I would need:

Thanks for the input @ocramz. I’ll open issues for those operations. By the way, I meant to ask: what were the learnings from the DataHaskell effort? I remember you were active there. The website is still up and it still speaks of roadmaps and the like. Are there plans to resuscitate the effort?

Thank you, glad to hear you’re interested.

Learnings from the DataHaskell experiment, good question!

  • OSS communities take a lot of energy and require deliberate and thoughtful communication. People lose interest immediately if these are not in place.
  • Goals, momentum and planning. Unifying numerical computing and data analysis is a very ambitious program with tons of tricky technical details and UX things to iron out by interacting with users. Overly ambitious roadmaps kill (unpaid) projects. This is kind of where we left off before losing steam: GitHub - DataHaskell/dh-core: Functional data science
  • Documentation over “updates”. At some point we had a buzzing chat on Gitter, we even had an in-person workshop at ICFP’17 and a Zoom call that spanned many timezones, but nothing beats a single online place where people can read stuff on their own time. I suppose this is the closest we got: Issues · DataHaskell/dh-core · GitHub

Honestly, I don’t really have the bandwidth to restart DH, but I would happily give the keys to an interested party. Truth be told it was a very inclusive setup and people were invited as soon as they signaled interest, so in the same vein I have invited you in as well.

Greatly appreciate it. It might be worth setting up a Discord instead of Gitter.

  1. What were some things that worked communication-wise? What counted as deliberate and thoughtful communication?
  2. If you could redefine the goals with the benefit of hindsight, what would they be and how would you pace them?
  3. By “updates”, do you mean posts or messages about new work, as opposed to putting it in a doc?

I can definitely relate to this experience. You could request help for any of the tasks that don’t require much design work, like documentation, anything devops-related, or other tasks of this nature that take a lot of time.

I really like this library; it’s something I really need on a project I’m currently working on. I already have to jump between Haskell, C, JS, Bash, Perl and Dart on a daily basis (I am trying to phase out Dart and Perl). If I could remove the need to also jump into a Python context, with all the tooling, syntax and conceptual shift that comes with a language switch, that would be a big win for me.

It’s probably unlikely to completely remove the need for Python, but for me at least, I use it mainly for data exploration at the moment.

In the example video, the use of & in combination with multi-line commands seems pretty convenient!

You know DataHaskell actually started with a Slack space, and then by migrating to Gitter (linked to repos, open archives) we instantly lost ~50% of the people?

Very much agree. In addition, as a syntactic convenience it’s already familiar to a good chunk of the data science community thanks to the tidyverse: Introducing magrittr • magrittr

@l-Shane-l that would be great. I’d appreciate the help and would love to help in return. I remember asking this on Reddit some time ago.

@danidiaz thank you. I thought about exporting (|>) = (&) as an alias so that it looks like Elm/FSharp, but figured using & was better since it’s only one key.
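For anyone following along, here is a rough sketch of what that alias could look like. It uses only `Data.Function` and `Data.List` from base on a plain list of tuples; the `Row` type, the sample data, and the names `adultsByAge` and `adultsByAge'` are made up for illustration and are not the library's API.

```haskell
import Data.Function ((&))
import Data.List (sortOn)

-- Elm/F#-style pipe, defined as an alias for (&).
-- The fixity matches (&) (infixl 1) so chains parse the same way.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
(|>) = (&)

-- Hypothetical toy data standing in for dataframe rows.
type Row = (String, Int)

rows :: [Row]
rows = [("ada", 36), ("bob", 17), ("cy", 52)]

-- A multi-line pipeline written with (&).
adultsByAge :: [Row]
adultsByAge =
  rows
    & filter ((>= 18) . snd) -- keep rows where the age is at least 18
    & sortOn snd             -- sort ascending by age
    & take 2                 -- keep the first two rows

-- The same pipeline written with the (|>) alias.
adultsByAge' :: [Row]
adultsByAge' =
  rows
    |> filter ((>= 18) . snd)
    |> sortOn snd
    |> take 2
```

Both definitions evaluate to the same list, so the choice between the two operators is purely about what reads best.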

@ocramz have all messaging forums for this been abandoned?