Resuscitating dataHaskell

Statistician here. Wrote and maintained some of the original A/B testing code at Facebook c. 2013-2014. They wanted to launch some SQL-like DSL that allowed you to generate metrics on the fly to compare across A/B groups, but they used an ANTLR frontend that was nightmarish to install, write and maintain. Here are some of my thoughts.

  1. DSLs. Haskell is so expressive and there are so many cute ways to parse, program and metaprogram. Just having a saner frontend that looks a little more declarative than imperative would be nice*.
  2. Exploiting the monoid-like nature of many computations. For example, sampling is monoidal, and A/B tests are monoidal because of the independence assumption. So Monte Carlo and other independent sampling, as well as meta-analysis, are all monoidal.
  3. Monadic computations that are random but look pure. I think Haskell would be the perfect language to replace something inscrutable like PyMC3, which last I checked was not pleasurable to write**.
-- * imagine code that looks something closer to tidyr than pandas using (>>>)

dataFrame
  >>> selectCols  ["a", "b"]
  >>> filter      ("a", (>= 5))
  >>> computeMean

-- ** something nicer than PyMC3 would be

someDist :: Distribution Float
someDist = do
   x <- normalDistribution (0, 500)
   return x

anotherDist :: Distribution Float
anotherDist = do
   x <- someDist
   -- rejection sampling: redraw until the sample is at least 5
   if   x >= 5
   then return x
   else anotherDist
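
The monoidal point in item 2 above can be made concrete with a small sketch. It uses only base; the names `Summary`, `singleton`, `summarize` and `variance` are made up for illustration and do not come from any existing package. Summaries of independent chunks are merged with the pairwise update of Chan et al., so each chunk can be summarized on its own and the results combined afterwards.

```haskell
-- | Running summary of a sample: count, mean, and sum of squared
-- deviations from the mean. Hypothetical type, for illustration only.
data Summary = Summary
  { sCount :: !Int
  , sMean  :: !Double
  , sM2    :: !Double
  } deriving (Show)

-- | Merge two summaries computed on disjoint chunks of data.
instance Semigroup Summary where
  Summary 0 _ _ <> b = b
  a <> Summary 0 _ _ = a
  Summary na ma m2a <> Summary nb mb m2b = Summary n mean m2
    where
      n     = na + nb
      fa    = fromIntegral na
      fb    = fromIntegral nb
      fn    = fromIntegral n
      delta = mb - ma
      mean  = ma + delta * fb / fn
      m2    = m2a + m2b + delta * delta * fa * fb / fn

-- | The empty sample is the identity for merging.
instance Monoid Summary where
  mempty = Summary 0 0 0

-- | Summary of a single observation.
singleton :: Double -> Summary
singleton x = Summary 1 x 0

-- | Summarize a whole list by merging one-element summaries.
summarize :: [Double] -> Summary
summarize = foldMap singleton

-- | Unbiased sample variance (0 when fewer than two observations).
variance :: Summary -> Double
variance (Summary n _ m2)
  | n < 2     = 0
  | otherwise = m2 / fromIntegral (n - 1)
```

With this, `summarize [1 .. 4] <> summarize [5 .. 8]` should agree with `summarize [1 .. 8]` up to floating-point rounding, which is the property that lets the two arms of an A/B test, or the shards of a Monte Carlo run, be processed separately.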

But in general whatever we build we’ve got to (a) make sure it solves a problem that people need solved and (b) aggressively court those people. Haskell is far closer to R than it is to Python in terms of userbase and adoption, in the sense that both Haskell and R users are long-tail. You’ve got your rabid core developer base, and then everyone else is a casual user that drops in from time to time. The only difference is that R’s casual users are essentially a captive audience, since most of them are statisticians who don’t know how to code. But even if you use Haskell casually, then you must already be somewhat competent at programming, otherwise GHC just screams nonsense at you that you don’t understand.

So the competition is tough, I would say, and from my limited understanding most projects in Haskell die early :frowning: Also, data scientists are mostly interested in things like drawing graphs, querying databases, computing statistics, and running inferences, not in becoming compiler geniuses.

For the examples you give, some of the APIs already look like that.

-- using dataframe
a = F.col @Int "a" -- can be generated with Template Haskell

dataFrame
  |> select      ["a","b"] 
  |> filterWhere (a F.>= F.lit 5)
  |> mean "a" 

-- using monad bayes

dist :: Distribution Double
dist = normal 0 500

I think things are starting to look ergonomic but there’s still a lot of work to be done. There are several areas I care about where a well-designed Haskell solution would stand out:

  • Reactive Dataframes (similar to marimo or Pluto.jl)
  • Type guided program synthesis for interpretable ML/ML for science
  • Reproducible pipelines (general data provenance??)

I genuinely think it’s a matter of putting in concerted, consistent effort.

I just saw this; I hadn’t kept up with the developments until I saw your post. I like what you’ve done with dataframe, it really looks like Tidy but for Haskell. I signed up to do Iris in your document and I can look at a couple of issues in dataframe later, like cleaning up joins.

What exactly do you mean by reproducible and data provenance? In my head, a pipeline is synonymous with a function (composed of functions), which thanks to referential transparency is as reproducible as you could wish. So what in your pipelines is not reproducible?

This might just be bad coding on the part of my team but it’s a mixture of:

  • Most work is done in notebooks and execution order isn’t always clear
  • State is mutable, which exacerbates the previous problem
  • Randomness isn’t always explicit, which makes reproducibility harder
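
To illustrate the last point, here is a minimal sketch of what explicit randomness can look like, using only base. Everything in it (`Rand`, `nextWord`, `uniform`, `sampleUntil`, `evalRand`) is a made-up name for this example, and the generator is a toy linear congruential generator that should not be used for real work. The idea is just that a random computation is a pure function of its seed, so rerunning with the same seed reproduces the same result.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word64)

-- | A random computation: a pure function from a seed to a value
-- and the next seed. Hypothetical type, for illustration only.
newtype Rand a = Rand { runRand :: Word64 -> (a, Word64) }

instance Functor Rand where
  fmap f (Rand g) = Rand $ \s -> let (a, s') = g s in (f a, s')

instance Applicative Rand where
  pure a = Rand $ \s -> (a, s)
  Rand f <*> Rand g = Rand $ \s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s'')

instance Monad Rand where
  Rand g >>= k = Rand $ \s -> let (a, s') = g s in runRand (k a) s'

-- | One step of a toy 64-bit LCG (arithmetic wraps around in Word64).
-- NOT a production-quality generator.
nextWord :: Rand Word64
nextWord = Rand $ \s ->
  let s' = s * 6364136223846793005 + 1442695040888963407
  in (s', s')

-- | Uniform Double in [0, 1), built from the top 53 bits of a word.
uniform :: Rand Double
uniform = fmap (\w -> fromIntegral (w `shiftR` 11) / 9007199254740992) nextWord

-- | Rejection sampling: redraw until the predicate holds.
sampleUntil :: (a -> Bool) -> Rand a -> Rand a
sampleUntil ok dist = do
  x <- dist
  if ok x then pure x else sampleUntil ok dist

-- | Run a random computation from an explicit seed.
evalRand :: Word64 -> Rand a -> a
evalRand seed r = fst (runRand r seed)
```

Because the seed is an ordinary argument, `evalRand 42 (sequence (replicate 5 uniform))` gives the same five numbers every time it is evaluated, regardless of the order in which notebook cells were run.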

Oh wow, I feel honored that one of my blog posts inspired you to work on DataFrame and to revive the dataHaskell website.

Note to myself: Write more blog posts – you never know what they might set in motion! :face_with_tongue:

Thanks for your efforts and keep up the great work!

I’d also love to see dataHaskell become more of a thing; I was excited for the idea the first time and again with the revival.

That said, my contributions which improved the flexibility of computing statistics from the statistics package were rejected for not great reasons. Currently the statistics package requires data be in an unboxed Vector Double - this change makes the same algorithms work for any traversable of Double (and indeed any traversable you can project a Double out of - such as a data frame column), and has no impact on performance.

Low-level primitives which can be efficiently applied to contiguous values in memory, and then combined monoidally, are pretty important building blocks for high-performance statistics applications. So I’d love to see that work merged, to make building on top possible without having to jump through the hoops of producing vectors (which was fine when statistics basically existed for criterion).
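
As a rough sketch of the general idea (this is not the code from the PR, and `Acc` and `meanOf` are hypothetical names), a statistic can be written against any `Foldable` container together with a projection function, instead of being tied to an unboxed vector. For a read-only fold like this, `Foldable` is enough; the thread does not show which constraints the actual change uses.

```haskell
import Data.Foldable (foldl')

-- | Strict accumulator: number of elements seen and their running sum.
data Acc = Acc !Int !Double

-- | Mean of the Doubles projected out of any Foldable container,
-- computed in one strict left fold. Returns Nothing when empty.
meanOf :: Foldable t => (a -> Double) -> t a -> Maybe Double
meanOf proj xs =
  case foldl' step (Acc 0 0) xs of
    Acc 0 _ -> Nothing
    Acc n s -> Just (s / fromIntegral n)
  where
    step (Acc n s) x = Acc (n + 1) (s + proj x)
```

For example, `meanOf id [1, 2, 3]` works on a plain list of Doubles, while `meanOf snd [("a", 1), ("b", 3)]` projects the numeric field out of each row without first copying the data into a vector.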

Sounds like there just had to be more whiteboarding. Let’s see if we can have a small focus group that addresses this and pushes the PR through.

I’ll see if I can clean up the PR and add a few more folds which return monoidal results.