Haskell for Data Processing

I’ve been looking into Haskell’s data ecosystem. A lot of foundational work seems to be missing, and I’d like to help with it, either by joining existing efforts or by starting one with a group of Haskellers who have time. Namely:

  • A FlatBuffers library: the current one is abandoned and isn’t listed in the official FlatBuffers documentation, even though a seemingly niche language called Lobster is supported.
  • An Apache Arrow compatible data frame library (along with the rest of the Apache Arrow suite)
  • A well-supported plotting library

I think this was more or less the original vision of dataHaskell, but that effort seems to have fizzled out. Were there learnings published somewhere? What were the pitfalls? Is there still activity in the community?

Just recently, someone was working on Series and DataFrames à la Python’s pandas library, so I’m pretty sure there are still data science people out there interested in elevating Haskell’s repertoire in that field.

Haven’t heard much about dataHaskell, though for unrelated reasons I wasn’t paying much attention while it was still active.

I’d encourage you to go for it, since having libraries for things like ProtoBuf and FlatBuffers is almost always a plus. And take some time to get feedback on your repo before pushing to Hackage: if someone recognizes there’s already a library that does what you want to do, it might be better to join their effort. If your approach is different, though, go ahead with your own package.

Good luck, have fun 👍

I’ve been working on some data science primitives that I need for work. I created a feedback thread here. Ideally, I would like to migrate all of my Python-based workloads to Haskell. This includes a dataframe data structure as well.

I would definitely be on board to collaborate on creating an Apache Arrow compatible data structure in the javelin repository. I’ve discussed my intention in this post last week. The reason I haven’t started work on it yet… is because I don’t need it yet!

If you are looking for contribution ideas, you can check out the issues; I’ve listed some basic features that would be great to get off the ground.

There appears to be a nice plotting library under the name of Chart that you might want to check out.

I’ve never heard of Apache Arrow before… but looking at it, apparently the C library uses GObject. That means it should be possible to create bindings for it using haskell-gi.

@garetxe, any thoughts on that?

Just played around with Javelin. Great library. Will start diving into it. That said, it seems like all the features in the issues tab will require some back and forth to ensure the design makes sense, especially data frames. What sort of workflows have you migrated so far?

See also

vis a vis the haskell flat buffer library, it looks like the current one might suffice, modulo a few changes needed to keep up with api churn in some underlying dependencies? probably just needs a maintainer to adopt it and step in.

also note that while the datahaskell team seems to have stalled out, they created a nice inventory of existing libraries (at that time) for a variety of purposes: dataHaskell : Current Environment

Thanks. Reached out to the maintainer of the FlatBuffers library to see what the status of the library is. This is a large compendium of libraries. Will try to go through them to see which are maintained and which aren’t. Sometimes I wish there were obvious defaults in the Haskell ecosystem for these types of things. Choosing libraries ends up feeling like ordering from a diner sometimes.

That’s pretty much always the case for Haskell! There are always multiple viable points in the design space for a domain, so it’s cool that Haskellers end up exploring all of them over time.

It’s hard to know if a library is maintained or not! I haven’t made a release of matplotlib in 3 years because it just works. Sadly, upstream matplotlib has deprecated some APIs so there will be another release this year. This is true of a lot of Haskell libraries (unless the CLC decides to screw us with more breaking changes).

I would be interested in collaborating. Could we create a repo with the goals, and hopefully a plan?

I would also be happy to collaborate, maybe in a shared GitHub organization or an HF task force. I don’t have the time these days to lead the effort, though.

I can lead the effort. Will create a scratch doc with a plan (also for continuity) and a shared GitHub repo/org.

The FlatBuffers library author says they haven’t had time to work on it but would be willing to roadmap what work is left.

There is a lot of buzz recently around Polars, an Apache Arrow compatible data frame library developed in Rust. The underlying design philosophy (laziness, streaming interface, etc.) might resonate with Haskellers. I was wondering whether creating a wrapper around it and then building on top of that is worth considering.

Admittedly I have no idea how difficult or realistic it would be to interface with such a Rust library from Haskell. The Python wrapper works great.

Looks like others have also been thinking about Rust <> Haskell bindings: One step forward, an easier interoperability between Rust and Haskell | IOG Engineering
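
For a rough idea of what the Haskell side of such a binding looks like: a Rust library would typically expose functions with a C ABI (`#[no_mangle] pub extern "C" fn ...`), and Haskell imports those the same way it imports any C function. Below is a minimal sketch, not a Polars binding. It uses libm's `cos` purely as a stand-in for a Rust-exported symbol so that it compiles without any Rust toolchain; the names `c_cos` and `cosViaFFI` are made up for illustration.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Foreign.C.Types (CDouble (..))

-- Import a C-ABI symbol. Here it is libm's `cos`, used only as a stand-in:
-- a Rust function exported with a C ABI and linked into the program would
-- be imported with the same `foreign import ccall` syntax.
foreign import ccall unsafe "math.h cos"
  c_cos :: CDouble -> CDouble

-- Thin wrapper (hypothetical name) that hides the C types from callers.
cosViaFFI :: Double -> Double
cosViaFFI = realToFrac . c_cos . realToFrac

main :: IO ()
main = print (cosViaFFI 0)
```

The hard part with something like Polars is not this import syntax but everything around it: designing a C-compatible surface over a rich Rust API, managing ownership of Rust-allocated values from Haskell (for example with `ForeignPtr` finalizers), and wiring the Rust build into Cabal.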
