Resuscitating dataHaskell

TL;DR: anyone interested in resuscitating dataHaskell?

It’s been about two years since I started thinking about how the Haskell data science ecosystem can grow. I remember reading @ad-si’s blog post on how to create a bar chart from some CSV data. It seemed like pieces of the ecosystem existed, but you still had to do a lot of plumbing.

Well, enough plumbing to scare away most data scientists. After reading this post, I looked back into dataHaskell to see if there were other tutorials showing how data scientists can be productive with Haskell. The tutorials were a Haddock-style walk-through of some low-level libraries (vector, hmatrix, etc.).

Following the scheme presented here, I wanted to try to solve the data acquisition and exploratory data analysis parts of the problem. Crucially, I wanted to make an approachable API similar to the work that had started with analyze.

I recently revisited the blog and tried to see if I could re-implement the same CSV workflow in dataframe.

There are still some rough edges (there’s no LAG operator, so we have to manipulate vectors directly, and plotting could be better as well), but it’s a lot easier to work with tabular data.
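To illustrate the kind of workaround the missing LAG operator forces: the shift can be expressed directly over `Data.Vector`. This is a plain-Haskell sketch, not part of the dataframe library’s API; the name `lagBy` is illustrative.

```haskell
import qualified Data.Vector as V

-- Shift a column down by n rows, padding the top with a default value,
-- like SQL's LAG(col, n, pad) OVER (...).
lagBy :: Int -> a -> V.Vector a -> V.Vector a
lagBy n pad v = V.replicate n pad V.++ V.take (V.length v - n) v

main :: IO ()
main = print (lagBy 1 0 (V.fromList [10, 20, 30 :: Int]))
-- prints [0,10,20]
```

Having this as a first-class column operation, rather than a detour through raw vectors, is exactly the kind of rough edge mentioned above.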

It would be great to piggyback off the dataHaskell framework and provide tutorials/ports of DS workflows in various Haskell tools. This will help library maintainers spot pain points and will provide an entry point for data work in Haskell.

From my conversations with @ocramz, it seems most of the utility of dataHaskell was roadmapping, community, and documentation. I’ve been in touch with the maintainers of hasktorch, srtree, and chart-svg. There seems to be a desire to unify all our efforts, but no “organization” to help do so.

25 Likes

I encourage you to think about how the Haskell Foundation can assist!

Coordination between groups of volunteers is hard. This is where a taskforce or working group affiliated with the Foundation can help.

7 Likes

Speaking of dataHaskell, I’ve always found the linear algebra situation in Haskell wanting. We only have hmatrix, which feels quite dated.

So some time ago I wrote a library that defines type classes for linear spaces, plus incomplete wrappers for BLAS/LAPACK: vecvec. I never got it into a shape where I’d be happy to make a Hackage release, but it works, and I think it’s of interest to dataHaskell.

3 Likes

I’m also interested in having data science tools in Haskell. For my work, I research sampling methods like Hamiltonian Monte Carlo, and I implement stuff in Jax, a functional library in Python that supports autodiff and parallelism and compiles to GPU code. It was actually inspired by Haskell a little (see e.g. the type signature of jax.lax.scan in the JAX documentation). It’s pretty great, but I’d be much happier if I could use Haskell. The obstacles are:

  • no jax equivalent, in the sense of a GPU-compatible, autodiffable tensor library (hasktorch is cool, but since pytorch is extremely not functional, I have a bit of an aversion. Maybe I shouldn’t?)
  • notebooks are tricky. With some effort I got notebooks working and made some tutorials for monad-bayes (e.g. Introduction), but it was a bit tricky and I couldn’t run the notebooks in VS Code, which I really care about

Btw, there’s also linearmap-category (“Native, complete-ish, matrix-free linear algebra”), which is a heroic effort to do coordinate-free linear algebra. It’s lacking e.g. GPU backends, but I think it’s very cool and could be a nice focal point for linear algebra efforts.

4 Likes

I would certainly be interested in helping in the department of hasktorch and LLMs. My ultimate goal is to make Haskell a first-class citizen for LLM inference and training, both of which will require tooling support from the broader dataHaskell ecosystem (which I’m happy to contribute to, of course). Right now I’m working on defining an interface to load and perform inference on LLMs in a manner similar to HuggingFace’s transformers library.

4 Likes

I think some of my dear colleagues are. I’ll post this on the company Slack.

4 Likes

Perhaps the packages Henning Thielemann authored are less well-known? There is a lot of useful stuff, also for BLAS.

1 Like

Thanks, all, for chiming in. @LaurentRDC I thought structure and motivation would precede an appeal to the Foundation. That is, proving the need and usefulness of the effort is a prerequisite to approaching the Foundation. Does the Foundation have office hours or something similar?

@Shimuuar @olf I will look into both packages. Did you picture this being a general-purpose array format, sort of like NumPy is to pandas?

@reuben do you have a repo of some sample work in Python? I’d like to see what migration looks/feels like. I’ve been trying to get IHaskell to work with VS Code. No luck yet, but with some time and effort it should be possible. That’s the sort of foundational work I think should be done. Things should be easy to install, and, similar to the tidyverse, we need a canonical set of packages that we can write introductory materials from.

@collinarnett you’ve been really helpful with the Hasktorch stuff. I think collaborating further on the deep learning side would be fruitful, although my own interest and work is in tabular data.

3 Likes

While linear algebra, tabular data, and deep learning are certainly important aspects of data science, there is much more to data science than that. (Is data science even a well-defined term?) We should ask ourselves where within this vast field Haskell has the potential to stand out, and where other ecosystems are so far advanced that re-inventing the wheel in Haskell would tie up development resources that could be more fruitfully spent elsewhere. Since writing parsers in Haskell is a breeze and we have the FFI, we can easily leverage battle-tested implementations in other languages.
Consider: which kinds of data are best handled by mutable data structures, and which benefit from deep structuring and type safety? Where can functional programming make original algorithmic contributions? Let’s make a list.

An example of the diversity of data science: there is no de-facto standard library for intervals (on general linearly ordered types) on Hackage. Yet intervals are a central piece of bioinformatics (regions on a genome) and time series analysis.
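To make the gap concrete, here is a minimal sketch of the kind of building block such a library would provide: closed intervals over any linearly ordered type, with overlap and intersection. All names are illustrative; no existing package is implied.

```haskell
-- Closed intervals over any Ord instance, e.g. genome positions (Int)
-- or timestamps.
data Interval a = Interval { lower :: a, upper :: a } deriving (Eq, Show)

-- Do two closed intervals share at least one point?
overlaps :: Ord a => Interval a -> Interval a -> Bool
overlaps (Interval a b) (Interval c d) = a <= d && c <= b

-- The common region of two intervals, if any.
intersection :: Ord a => Interval a -> Interval a -> Maybe (Interval a)
intersection i j
  | overlaps i j = Just (Interval (max (lower i) (lower j))
                                  (min (upper i) (upper j)))
  | otherwise    = Nothing

main :: IO ()
main = do
  -- e.g. two regions on a chromosome
  print (overlaps (Interval 100 200) (Interval 150 300 :: Interval Int))
  print (intersection (Interval 100 200) (Interval 150 300 :: Interval Int))
```

The hard design questions (open vs. closed endpoints, unbounded intervals, efficient interval trees for thousands of genome regions) are exactly what a de-facto standard package would have to settle.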

Henning Thielemann wrote abstractions for, and bindings to, several linear algebra libraries, including GLPK, BLAS, and COIN-OR. In that respect it is general-purpose. He also wrote bindings to gnuplot.

2 Likes

We certainly need some standard API for working with N-dimensional arrays, and data types that can be shared between libraries. There’s vector for 1-D arrays; it’s not ideal, but it’s accepted and good enough.

Also there’s massiv library.

I know about it, but it’s undocumented and has a completely impenetrable API. So I pretend it doesn’t exist.

Of course not. It’s just a catch-all word for everything having to do with data analysis: bioinformatics, particle physics, astronomy, etc. Each field has its own set of problems and preferred solutions and algorithms.

Also, before one can shine, one has to learn to walk. There’s no way around this. Here we’re talking about basic building blocks that are needed for almost everything.

I used to think that the Ix classes (base or massiv) handle this.

Yes, the author has strong opinions on how module imports are to be used, which is reflected in the API. Yet pretending these bindings don’t exist is perhaps also too strong an opinion when it comes to mapping the landscape of existing packages.

Well said.
Maybe I’m overthinking this. I just fear that in Haskell it might be harder to reach a consensus on what even these basic building blocks are, since you cannot foresee how downstream programs will want to consume them. Languages like Python, Java, or C offer (afaik) either arrays in memory (strict) or iterators (lazy, streamable).
Should everything in Haskell’s data science be stream-based? If yes, which of the competing streaming libraries should be made the base of the building blocks? Or can we get around that decision by defining an abstract building block that can appear in both streams and mutable vectors? Should the concept of named columns be deemed essential?
You can certainly pack all of this into a single standard API, but you might end up with eight type parameters, which will be off-putting.

There’s two things the Foundation can provide, in general:

  • A space for coordination;
  • Resources, e.g. funds or developer resources.

You can engage the Foundation at any point if you need a space for coordination.

If you wanted resources, realistically, there would need to be some demonstration of sustained momentum and a plan for future resource utilization. But even then, it would be fine to reach out to understand what can and can’t be done!

There are no office hours, although that’s an interesting suggestion! The best thing to do is to reach out to the executive director, José, by e-mail (jmct@haskell.foundation) and see what can be done right now! (Note that the director might be slow to respond, as he’s traveling to Japan.)

1 Like

I think it’s good to have table stakes and then let the ecosystem pick a direction. But that gets to your other question: what are the table stakes? For me, three things:

  • I work as a lead for some data scientists. My initial goal was to have their workflows be reproducible in Haskell in a way that doesn’t sacrifice clarity.
  • I’ve been taking “data science” classes on Coursera and using them to expand the API of dataframes. My goal is to have a good didactic tool that can act as a gateway for people who studied data science (assuming these online courses are representative of what most schools think data science is).
  • I’ve been attending the Parquet and Arrow community meetings to get a sense of what other developers in the data science ecosystem are working on. These meetings also have decent representation from people in industry (Databricks and Snowflake), so I indirectly get a view of what their customers (who I assume are data scientists) are asking for.

So with that context, the questions I usually get are around I/O (can you read or write this format?), tabular operations (how do I do this SQL-like thing in Haskell?), and general ecosystem interop (how do I pass the data to this R/Python tool?).
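As an example of the “SQL-like thing” bucket: even without any dataframe library, a `GROUP BY ... SUM` is a one-liner over plain lists using containers. This is a hedged plain-Haskell sketch, not the API of dataframe or any other package.

```haskell
import qualified Data.Map.Strict as M

-- SELECT key, SUM(value) ... GROUP BY key, over a list of (key, value) rows.
groupBySum :: (Ord k, Num v) => [(k, v)] -> M.Map k v
groupBySum = M.fromListWith (+)

main :: IO ()
main = print (groupBySum [("a", 1), ("b", 2), ("a", 3 :: Int)])
-- prints fromList [("a",4),("b",2)]
```

The table-stakes question is whether a newcomer can find the canonical, documented spelling of operations like this without already knowing the containers API.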

These questions are the table stakes. Once they’re sorted, the ecosystem can grow in different directions, or in lockstep with Julia, R, Python, etc. In terms of interesting directions the DS ecosystem can grow with Haskell, I think program-synthesis-related approaches would be a good start, since synthesis research typically happens in functional languages.

2 Likes

Can you please expand on that? Are your team’s workflows currently in Haskell, or do you plan to port them from other languages, and the concern is that the ported workflows would be less clear due to a lack of proper tooling?

For Bayesian inference stuff, the blackjax repo is the sort of thing I mean, e.g. blackjax/blackjax/mcmc/mclmc.py at main · reubenharry/blackjax · GitHub. I don’t know if this falls under data science, so feel free to disregard. But basically, this is a library of Markov kernels used for MCMC. It’s all pretty simple code; the point is that because jax code (which has the same API as numpy and is therefore straightforward to write) compiles down to fast GPU code, it’s actually practical. For example, I can run samplers from blackjax on million-dimensional distributions. I have implemented similar sampling algorithms in Haskell (even with fancy FRP approaches, which is very fun), but it isn’t fast enough to be useful in practice, so I can’t actually use it for work, apart from making cute demos.
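A Markov kernel of the kind blackjax packages is, at heart, just a pure state-transition function, which is why it maps so naturally onto Haskell. Here is a hedged sketch of that shape (random-walk Metropolis on a 1-D log-density), with the randomness passed in as explicit arguments; all names are illustrative, and a real implementation would thread a PRNG and vectorize over many dimensions.

```haskell
-- One Metropolis step as a pure kernel: given the target log-density,
-- a proposal offset eps, a Uniform(0,1) draw u, and the current state x,
-- return the next state.
metropolisStep :: (Double -> Double) -> Double -> Double -> Double -> Double
metropolisStep logDensity eps u x =
  let x'     = x + eps                            -- proposed move
      accept = log u < logDensity x' - logDensity x
  in if accept then x' else x

main :: IO ()
main = do
  let logp y = -0.5 * y * y                       -- standard normal, up to a constant
  -- a proposal from 2 toward the mode is always accepted (acceptance ratio > 1)
  print (metropolisStep logp (-1) 0.5 2)
-- prints 1.0
```

The gap the post describes is not expressing this (Haskell does it cleanly) but making a chain of millions of such steps, over million-dimensional states, run at GPU speed.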

Of course, there are Haskell attempts at GPU-compatible array programming, but to my understanding nothing is easy or smooth yet, and for me this is a really big barrier. It might be pessimistic, but I think unless a few engineers were paid full time for a few years to solve this problem, there’s no getting past it.

Re IHaskell and vscode, 100% agree.

Fair. I was talking from the perspective of whether to use the library or not.

The only way to understand which approaches work and which don’t is to try to use them. And library design is iterative anyway, because it’s not possible to foresee all uses. We don’t know what the Right Way™ is, but that shouldn’t stop us.

The orthotope package also has an API for multidimensional arrays.

2 Likes

This is misguided:

  1. Haskell doesn’t have to be strictly best in class in a domain to be worth pursuing. I don’t wanna write stupid Python personally, so even a worse-than-Python Haskell data ecosystem would benefit me. I’d do data projects I wouldn’t otherwise with it! Cuz I’m not spending my conscious time on Python.
  2. You can’t figure this out up front! You have to just pioneer and trust that Haskell is great. The unique opportunities will present themselves, but only if you dig deep enough. Data Haskell needs activation energy.
3 Likes

Haskell doesn’t even have a consensus for records lol. It’s not in the culture, and that’s a feature, not a bug. Expecting quick consensus on something as amorphous as “data” is self-sabotaging the effort, imo. I think ppl should chill and go build, even if it’s somewhat redundant 🙂

We can refactor later (Haskell is good at that)

2 Likes

Created a Discord server for anyone who’s interested. No channels except general yet.

7 Likes