Data analysis/science in Haskell

I wondered if anyone has blog posts or personal experiences they’d like to share about using Haskell for data analysis or data science?

I’m a Haskell noob but data science is what I do professionally, and I feel a draw to applying one to the other. I don’t really mind coding fundamental algorithms from scratch, but one big stumbling block I see is data access libraries. E.g. working with GIS data, Parquet, images, and so on.

I also don’t see much support for the operational research side of things, which is mostly algorithms one cannot code alone: constraint solvers (Gecode), MIP/linear programming, and so on.

Do you use Haskell for this sort of work? If so, how?


I use Haskell as a quant, which is the silly name for data science in finance.

The ecosystem is definitely lacking in most of the exploratory niceties you would get using R or Python. I had to write a library to handle one-dimensional series because none were ready-to-use for my use case.

I also do linear programming in Haskell, and it’s really nice, especially in combination with type-safe dimensions. If you want to use GLPK, there’s hmatrix-glpk. We use an in-house library, which binds to a proprietary linear programming solver, so we haven’t bothered to release it (I’m also ashamed to say that it only works on Linux :frowning: ).
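To give a rough idea of what “type-safe dimensions” can look like, here is a minimal sketch using phantom types. It is not the in-house library mentioned above, nor the API of any published package. All names (`Quantity`, `MWh`, `EUR`, `EURPerMWh`, `cost`, `totalCost`) are made up for illustration, and the unit algebra is spelled out by hand rather than derived.

```haskell
-- Hypothetical unit tags: no values, they exist only at the type level.
data MWh
data EUR
data EURPerMWh

-- A number tagged with a phantom unit type.
newtype Quantity unit = Quantity Double
  deriving (Show, Eq, Ord)

-- Addition is only defined for matching units.
add :: Quantity u -> Quantity u -> Quantity u
add (Quantity a) (Quantity b) = Quantity (a + b)

-- Scaling by a dimensionless factor keeps the unit.
scale :: Double -> Quantity u -> Quantity u
scale k (Quantity a) = Quantity (k * a)

-- Price times volume gives a cost; this one rule is written out manually.
cost :: Quantity EURPerMWh -> Quantity MWh -> Quantity EUR
cost (Quantity p) (Quantity v) = Quantity (p * v)

-- Example objective value: total cost over (price, volume) pairs.
totalCost :: [(Quantity EURPerMWh, Quantity MWh)] -> Quantity EUR
totalCost = foldr (add . uncurry cost) (Quantity 0)
```

With this, something like `add (Quantity 1 :: Quantity MWh) (Quantity 2 :: Quantity EUR)` is rejected by the type checker, which is the kind of mistake that is easy to make when assembling objective coefficients by hand.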

It would be great to expand the ecosystem – maybe you want to take a crack at it? Best to identify a need you have, and work towards satisfying that need.


hmatrix-glpk – last updated 3 years ago :frowning: I’m seeing that quite a bit. I wanted an interval arithmetic library (Haskell has one) and a range-valued sets library (Haskell has two), but it’s the same story: dormant. This doesn’t bother me that much. I’ve spent about a year with Prolog, and found it very useful for all sorts of use cases despite a lack of packages.

From the command line, one could create Haskell executables to do specific bits of analysis and orchestrate them with more specialised components:

  1. Use GNUPlot for plots – interface by file io.
  2. ETL in SQL + SQLite – interface by file io.
  3. Discrete optimisation using MiniZinc – interface by file io.
  4. SCIP for linear/MIP programming – interface by file io.
  5. Makefile – for orchestration between bits.
  6. Babel (orgmode), for literate programming.
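As a sketch of what “interface by file io” could mean for the GNUPlot item in the list above: the Haskell side only writes a data file and a small plot script, and a Makefile rule (or the shell) runs `gnuplot plot.gp` afterwards. The function names and the file names `points.dat`, `plot.gp` and `plot.png` are assumptions made up for this example.

```haskell
-- Render (x, y) points as whitespace-separated columns, one point per line,
-- which is the plain-text layout gnuplot reads by default.
renderData :: [(Double, Double)] -> String
renderData pts = unlines [show x ++ " " ++ show y | (x, y) <- pts]

-- Render a small gnuplot script that plots the data file to a PNG.
-- Nothing here runs gnuplot itself; that is left to the orchestration layer.
renderScript :: FilePath -> FilePath -> String
renderScript dataFile pngFile = unlines
  [ "set terminal png"
  , "set output '" ++ pngFile ++ "'"
  , "plot '" ++ dataFile ++ "' using 1:2 with lines notitle"
  ]

-- Write both files into the current directory (hypothetical file names).
writePlotFiles :: [(Double, Double)] -> IO ()
writePlotFiles pts = do
  writeFile "points.dat" (renderData pts)
  writeFile "plot.gp" (renderScript "points.dat" "plot.png")
```

Calling `writePlotFiles [(x, x * x) | x <- [0, 0.1 .. 2]]` from `main` or GHCi produces the two files. Keeping the rendering functions pure makes them easy to check without touching the file system.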

I guess the theme with Haskell applied to data analysis/science will be “custom algorithms” :slight_smile:

Haskell has noteworthy auto-diff and linear algebra libraries. That makes optimisers for regression and classification algorithms fairly easy to hand-roll.
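To illustrate the hand-rolling point, here is a minimal sketch of fitting a line by gradient descent, with the gradient coming from forward-mode automatic differentiation. To stay dependency-free it uses a tiny hand-written dual number type rather than an existing auto-diff package (the `ad` package, for instance, would replace the `Dual` and `gradient` parts). All names here (`Dual`, `loss`, `gradient`, `step`, `fit`) are made up for the example.

```haskell
-- A dual number: a value paired with its derivative w.r.t. one chosen input.
data Dual = Dual Double Double

instance Num Dual where
  Dual a a' + Dual b b' = Dual (a + b) (a' + b')
  Dual a a' - Dual b b' = Dual (a - b) (a' - b')
  Dual a a' * Dual b b' = Dual (a * b) (a * b' + a' * b)  -- product rule
  abs (Dual a a')       = Dual (abs a) (a' * signum a)
  signum (Dual a _)     = Dual (signum a) 0
  fromInteger n         = Dual (fromInteger n) 0

-- Lift a constant (its derivative is zero).
constant :: Double -> Dual
constant x = Dual x 0

-- Mean squared error of the line y = w*x + b over the data points.
loss :: [(Double, Double)] -> Dual -> Dual -> Dual
loss pts w b =
  sum [ (w * constant x + b - constant y) ^ (2 :: Int) | (x, y) <- pts ]
    * constant (1 / fromIntegral (length pts))

-- Partial derivatives, obtained by seeding one parameter at a time.
gradient :: [(Double, Double)] -> (Double, Double) -> (Double, Double)
gradient pts (w, b) = (dw, db)
  where
    Dual _ dw = loss pts (Dual w 1) (Dual b 0)
    Dual _ db = loss pts (Dual w 0) (Dual b 1)

-- One gradient descent step with a fixed learning rate.
step :: Double -> [(Double, Double)] -> (Double, Double) -> (Double, Double)
step rate pts (w, b) = (w - rate * dw, b - rate * db)
  where (dw, db) = gradient pts (w, b)

-- Run n steps starting from (0, 0).
fit :: Int -> Double -> [(Double, Double)] -> (Double, Double)
fit n rate pts = iterate (step rate pts) (0, 0) !! n
```

For points lying on y = 2x + 1, such as `[(0,1),(1,3),(2,5),(3,7)]`, `fit 2000 0.05` should end up very close to `(2, 1)`. The learning rate and step count are picked by hand for this toy data; real problems would want a proper stopping rule.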

I feel like most of the building blocks are already available.
It just takes a sub-optimal amount of effort to put it all together. :sweat_smile:

Here is a list of resources:
https://github.com/krispo/awesome-haskell?tab=readme-ov-file#data-science

And here is an example of it being more difficult than expected:
https://adriansieber.com/how-to-create-a-bar-chart-from-a-csv-file-with-haskell/
(OpenAI would have generated the Python + Matplotlib solution in 5 seconds :sweat_smile:)

I have been using Haskell at work for data science-y purposes for ten years. Data science is a vast field, and as you observed, some niches are barely covered at all in the Haskell ecosystem. That said, occasionally someone uses the FFI to provide bindings to well-established tools. Recently Henning Thielemann added some new bindings to linear programming solvers because my company needed them.

A library going years without an update is likely either because it is mature or because the user base is too small for changes to become necessary. A better metric would be open issues/merged pull requests. I have released several libraries which haven’t been updated in a while since they are mature enough for me and nobody else has requested changes.


I remember finding sky blue trades interesting, a (Haskell) blog series on Principal Component Analysis and such for weather patterns.


Around 6-7 years ago, there was a DataHaskell effort. Right now, the Gitter room is almost abandoned :frowning:

I haven’t got oodles of time at the moment, but soon enough I’ll try and redo this in Haskell. It’s a custom – simple – property pricing model but it would require parsing, optimising, visualising, and so on.

I suspect the Haskell aesthetic can offer a unique perspective on the solution. I did a similar exercise with a data analysis in Prolog here, and I was very happy with the results.


I found HaskellR. It’s ostensibly a seamless and highly performant interface to R from Haskell. E.g. here is a clustering example from the linked site:

{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE ScopedTypeVariables #-}
import H.Prelude as H
import Language.R.QQ

import System.Random

main = H.withEmbeddedR defaultConfig $ do
  H.runRegion $ do
    -- Put any complex model here
    std <- io $ newStdGen
    let (xs::[Double]) = take 100 $ randoms std
    d  <- [r| matrix(xs_hs,ncol = 2) |]
    rv <- [r| clusters <- kmeans(d_hs, 2) |]
    [r| par(mar = c(5.1, 4.1, 0, 1));
        plot(d_hs, col = rv_hs$cluster, pch = 20
            , cex = 3, xlab = "x", ylab = "y");
        points(rv_hs$centers, pch = 4, cex = 4, lwd = 4);
      |]
    return ()

Wow…


Can anyone give any insight into the “state” of HaskellR - I almost started using it a couple of times, but this answer has put me off: https://www.reddit.com/r/haskell/comments/li9muq/comment/gn5u4za/?utm_source=share&utm_medium=web2x&context=3

It seems to get minor updates pretty regularly though… (see the code frequency graph for tweag/HaskellR on GitHub)

I see JAX there; you may be interested in Dex (by the same people).
