I wondered if anyone has blog posts or personal experiences they’d like to share about using Haskell for data analysis or data science?
I’m a Haskell noob, but data science is what I do professionally, and I feel drawn to applying one to the other. I don’t really mind coding fundamental algorithms from scratch, but one big stumbling block I see is the lack of data access libraries, e.g. for working with GIS data, Parquet, images, and so on.
I also don’t see much support for the operational research side of things: mostly algorithms one cannot realistically code alone, e.g. constraint solvers (Gecode), MIP/linear programming, and so on.
Do you use Haskell for this sort of work? If so, how?
I use Haskell as a quant, which is the silly name for data science in finance.
The ecosystem is definitely lacking in most of the exploratory niceties you would get using R or Python. I had to write a library to handle one-dimensional series because none of the existing ones were ready to use for my use case.
I also do linear programming in Haskell, and it’s really nice, especially in combination with type-safe dimensions. If you want to use GLPK, there’s hmatrix-glpk. We use an in-house library, which binds to a proprietary linear programming solver, so we haven’t bothered to release it (I’m also ashamed to say that it only works on Linux).
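For a flavour of what hmatrix-glpk looks like, here is a minimal sketch along the lines of its documented `Numeric.LinearProgramming` interface; the objective and constraints are illustrative numbers, not a real model:

```haskell
-- Minimal sketch with hmatrix-glpk's Numeric.LinearProgramming:
-- maximise 4*x1 - 3*x2 + 2*x3 subject to two sparse constraints.
-- Variables are 1-indexed and non-negative by default.
import Numeric.LinearProgramming

prob :: Optimization
prob = Maximize [4, -3, 2]

constraints :: Constraints
constraints = Sparse [ [2#1, 1#2] :<=: 10
                     , [1#2, 5#3] :<=: 20
                     ]

main :: IO ()
main = print (simplex prob constraints [])
-- prints something like: Optimal (value, [x1, x2, x3])
```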
It would be great to expand the ecosystem – maybe you want to take a crack at it? Best to identify a need you have and work towards satisfying it.
“hmatrix-glpk – last updated 3 years ago” – I’m seeing that quite a bit. I wanted an interval arithmetic library (Haskell has one) and a range-valued sets library (Haskell has two), but it’s the same story: dormant. This doesn’t bother me that much. I’ve spent about a year with Prolog and found it very useful for all sorts of use cases despite the lack of packages.
From the command line, one could create Haskell executables to do specific bits of analysis and orchestrate them with more specialised components:
Use gnuplot for plots – interface via file I/O (sketched after this list).
ETL in SQL + SQLite – interface via file I/O.
Discrete optimisation using MiniZinc – interface via file I/O.
SCIP for linear/MIP programming – interface via file I/O.
Makefile – for orchestration between the pieces.
Babel (Org mode) – for literate programming.
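As a minimal sketch of that file-I/O interface (the file name and data are invented for illustration), a Haskell executable can emit a plain data file for a downstream tool to pick up:

```haskell
-- Minimal sketch: write a whitespace-separated data file that a
-- downstream tool (gnuplot, SQLite's .import, ...) can consume.
main :: IO ()
main =
  writeFile "series.dat" . unlines $
    [ unwords [show x, show (sin x)] | x <- [0, 0.1 .. 6.3 :: Double] ]
```

gnuplot can then render it with `plot "series.dat" with lines`, and a Makefile rule can chain the two steps.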
I guess the theme with Haskell applied to data analysis/science will be “custom algorithms”.
Haskell has noteworthy automatic differentiation and linear algebra libraries. That makes optimisers for regression and classification algorithms fairly easy to hand-roll.
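To illustrate the hand-rolling, here is a minimal sketch of linear regression by gradient descent, assuming the `ad` package (its `grad` and `auto` functions); the data, learning rate, and iteration count are illustrative choices:

```haskell
-- Minimal sketch: linear regression by gradient descent, using the
-- `ad` package for reverse-mode automatic differentiation.
import Numeric.AD (auto, grad)

-- Mean squared error of a linear model; the parameters [w, b] are a
-- list so that `grad` can differentiate with respect to all of them.
mse :: Fractional a => [(a, a)] -> [a] -> a
mse pts [w, b] =
  sum [(w * x + b - y) ^ (2 :: Int) | (x, y) <- pts]
    / fromIntegral (length pts)
mse _ _ = error "expected exactly two parameters"

-- One gradient-descent step with learning rate lr; the fixed data
-- points are lifted into the AD mode with `auto`.
step :: Double -> [(Double, Double)] -> [Double] -> [Double]
step lr pts ps = zipWith (\p g -> p - lr * g) ps gs
  where
    gs = grad (\params -> mse [(auto x, auto y) | (x, y) <- pts] params) ps

main :: IO ()
main = print (iterate (step 0.05 [(1, 2), (2, 4), (3, 6)]) [0, 0] !! 2000)
-- Converges towards [2, 0], i.e. y = 2x.
```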
I have been using Haskell at work for data science-y purposes for ten years. Data science is a vast field, and as you observed, some niches are barely covered at all in the Haskell ecosystem. That said, occasionally someone uses the FFI to provide bindings to well-established tools. Recently Henning Thielemann added some new bindings to linear programming solvers because my company needed them.
That is likely either because such a library is mature or because the user base is too small for changes to become necessary. A better metric would be open issues and merged pull requests. I have released several libraries that haven’t been updated in a while because they are mature enough for me and nobody else has requested changes.
I haven’t got oodles of time at the moment, but soon enough I’ll try to redo this in Haskell. It’s a custom – simple – property pricing model, but it would require parsing, optimising, visualising, and so on.
I suspect the Haskell aesthetic can offer a unique perspective on the solution. I did a similar data-analysis exercise in Prolog here, and I was very happy with the results.
I’ve used Haskell for toy (but useful) versions of various Bayesian inference algorithms, MCMC and particle filters in particular. Stream-based programming and functional reactive approaches worked out nicely here.
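To illustrate why streams fit, here is a minimal sketch of a Metropolis sampler exposed as an infinite lazy list, assuming only the `random` package; the standard-normal target and uniform proposal are illustrative placeholders:

```haskell
-- Minimal sketch: a Metropolis sampler as an infinite lazy stream.
import System.Random (StdGen, mkStdGen, randomR)

-- Unnormalised target density: a standard normal, for illustration.
target :: Double -> Double
target x = exp (negate (x * x) / 2)

-- One Metropolis step: propose a uniform perturbation, accept/reject.
step :: StdGen -> Double -> (Double, StdGen)
step g x =
  let (eps, g1) = randomR (-0.5, 0.5) g
      (u,   g2) = randomR (0, 1) g1
      x'        = x + eps
      accept    = u < target x' / target x
  in (if accept then x' else x, g2)

-- An infinite lazy stream of samples: downstream code can take,
-- thin, or fold over it without the sampler knowing.
chain :: StdGen -> Double -> [Double]
chain g x = let (x', g') = step g x in x' : chain g' x'

main :: IO ()
main = print (take 5 (chain (mkStdGen 42) 0))
```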