[RFC] Sibyl: Time Series Analysis in Haskell

Hello! I’ve recently begun work on a new time series analysis and forecasting library called Sibyl that will implement functions and models comparable to statsmodels in Python or forecast in R: things like ARIMA/SARIMA, exponential smoothing, classical decomposition, and automatic model selection via the Hyndman-Khandakar algorithm. I’m working with the folks over at DataHaskell (@mchav, @daikonradish) to flesh out the initial specifications and begin drafting design plans.

During initial design discussions, we ran into a roadblock around a major UX choice. My initial approach splits the library into two layers:

  • An unsafe facade layer for notebook users and statisticians, imported with import Sibyl. Functions throw runtime errors on failure rather than returning Either. The goal is R-style convenience.
  • A safe layer for production pipelines, where users import modules individually (import Sibyl.Safe.TimeSeries, import Sibyl.Models.SARIMAX, etc.) and handle error types explicitly.

Here’s what each looks like in practice. A notebook user or scripter might write:

main :: IO ()
main = do
  raw <- D.readCsv "./data/sales.csv"
  result <- fromDataFrame "date" "sales" raw
    |> fit (ARIMA (1, 1, 1))
    |> forecast 12
    |> toDataFrame
  D.writeCsv "./artifacts/forecast.csv" result
-- Nice and convenient!

And the production pipeline version, where error handling is explicit:

module Pipeline where

import qualified Sibyl.Safe.TimeSeries as TS
import qualified Sibyl.Models.SARIMAX  as SARIMAX
import qualified Sibyl.Model           as M
import qualified DataFrame             as D

main :: IO ()
main = do
  raw <- D.readCsv "./data/sales.csv"
  case TS.fromDataFrame "date" "sales" raw of
    Left err     -> putStrLn $ "Bad series: " ++ show err
    Right series ->
      case SARIMAX.fitSARIMAWith settings series of
        Left err    -> putStrLn $ "Fit failed: " ++ show err
        Right model -> do
          M.summarize model
          let fc = SARIMAX.forecastSARIMA 12 model
          D.writeCsv "./artifacts/forecast.csv" (TS.toDataFrame (M.point fc))
          putStrLn "Done."
  where
    settings = SARIMAX.defaultSARIMASettings
      { SARIMAX.sarimaP      = 1, SARIMAX.sarimaD      = 1, SARIMAX.sarimaQ      = 1
      , SARIMAX.sarimaBigP   = 0, SARIMAX.sarimaBigD   = 1, SARIMAX.sarimaBigQ   = 1
      , SARIMAX.sarimaPeriod = 12, SARIMAX.sarimaMethod = SARIMAX.CSSML
      }

A third option has also come up in our discussions: using DataKinds and type families to encode model orders at the type level, so that Fitted ('ARIMA 1 1 1) and Fitted ('ARIMA 2 1 0) are different types. This gives much stronger guarantees: for example, SARIMAX’s requirement for future regressors at forecast time becomes a type-level constraint rather than a runtime check. The tradeoff is that it makes interactive use in GHCi or a notebook harder, since model orders would need to be known at compile time.
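To make the idea concrete, here’s a minimal sketch of one way the encoding could look. Every name here is hypothetical and invented for illustration, not Sibyl’s actual API:

```haskell
{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}

import GHC.TypeNats (Nat)

-- Model orders promoted to the type level via DataKinds.
data ModelSpec = ARIMA Nat Nat Nat

-- A fitted model carries its specification in its type, so
-- Fitted ('ARIMA 1 1 1) and Fitted ('ARIMA 2 1 0) can never be
-- confused with one another.
newtype Fitted (spec :: ModelSpec) = Fitted [Double]

-- Functions can then demand spec-specific evidence as a constraint
-- (e.g. "future regressors must be supplied" for SARIMAX) instead of
-- checking it at runtime. Body is a placeholder, not a real forecast.
forecast :: Int -> Fitted ('ARIMA p d q) -> [Double]
forecast h (Fitted params) = replicate h (sum params)
```

The upside is that an order mismatch is a compile-time error; the downside, as noted above, is that `p`, `d`, and `q` must be literals (or otherwise statically known) at every call site.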

The tension we keep running into is this: the new-to-Haskell R or Python user wants minimal friction, sensible defaults, and something that “just works”. They’d likely use the unsafe facade and never touch Either. The Haskell engineer wants strong guarantees and composability, and wants the type system to do real work rather than just wrap error calls. By trying to serve both audiences, we risk ending up with something that fully satisfies neither: too un-ergonomic for the statistician, not strong enough for the seasoned Haskeller.

So, as we work through the initial stages of this library, I’d love to hear comments and answers from anyone who has thoughts on all this; especially:

  • Is the two-layer approach a reasonable compromise, or is this two half-baked APIs?
  • For those of you who do statistical work in Haskell: what would you want out of a library like this?
  • Are there prior examples in the Haskell ecosystem that handle this well? Libraries that serve both “quick and dirty” and “production-grade” users?
  • Does the DataKinds approach seem worth the added complexity?

The GitHub repo is here if you want to see what exists so far. Happy to answer any questions! I’m also an intermediate Haskell programmer, and it’s possible I’m missing something really obvious in the details of this implementation. I welcome any and all feedback, as every opinion will help me figure out how best to write this library to serve the greatest number of people :slight_smile:

14 Likes

My initial intuition has two recommendations.

Firstly, you could predominantly use the “safe” functions in the package, and then expose a convenience module which uses TH or similar to generate either (error …) id . f variants. This would let people use the safe functions if they wanted, while those who want a grab bag of useful functions can use the wrappers and accept the risks.
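A hand-rolled version of that pattern looks like the following (stand-in types and names, since Sibyl’s real API isn’t settled; TH would just mass-produce the wrapper):

```haskell
-- Stand-ins for Sibyl's real types, just so the sketch is self-contained.
data SibylError = ParseError String deriving Show
newtype TimeSeries = TimeSeries [Double] deriving Show

-- The safe layer returns Either and never throws:
safeFromList :: [Double] -> Either SibylError TimeSeries
safeFromList [] = Left (ParseError "empty series")
safeFromList xs = Right (TimeSeries xs)

-- The convenience facade exposes a partial wrapper derived from it,
-- collapsing the Left case into a runtime error:
unsafeFromList :: [Double] -> TimeSeries
unsafeFromList = either (error . show) id . safeFromList
```

The nice property is that the unsafe layer is purely mechanical: it contains no logic of its own, so there is only one implementation to test.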

Secondly, I see you use a |> piping operator. I wonder whether you could do some kind of effects-system-like thing where you plumb together all your combinators in the scripter fashion, and the way you interpret them is where you get the key difference. For scripters, working in IO is likely OK, so all errors can be chucked into fail or similar immediately, but for those who want stronger guarantees you can let the monad go to Either err or some other custom stack.

This second approach may have the annoying effect of forcing pure functions you wish to compose to live within a Monad, but with sufficiently general constraints (Applicative or Monad) you could specialize to Identity in some cases; maybe these pure functions have their monadic counterparts suffixed with M?
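A rough sketch of the interpret-the-monad idea, with entirely made-up names and a trivial “fit”:

```haskell
-- Stand-in types for the sketch.
data Order = Order Int Int Int deriving Show
data Model = Model Order [Double] deriving Show

-- A combinator polymorphic in its failure monad:
fitM :: MonadFail m => Order -> [Double] -> m Model
fitM order xs
  | length xs < 3 = fail "fitM: need at least 3 observations"
  | otherwise     = pure (Model order xs)

-- Scripters interpret it in IO, where fail throws immediately:
--   model <- fitM (Order 1 1 1) observations
--
-- Careful users interpret it in Maybe, or in a custom stack whose
-- MonadFail instance captures the message, and pattern match on the
-- failure case explicitly.
```

The same pipeline text then serves both audiences; only the monad it is run in changes.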

1 Like

It’s very exciting to see DataHaskell stuff moving again. Scattershot feedback, in no particular order, though I hope it’s useful:

I recommend dropping the parameters from your type aliases, as GHC can only expand a type alias when all of its parameters are provided:

type TS = TimeSeries
type FC = Forecast

I can’t find where you define (|>), but I’d recommend using (&) from Data.Function:

  • It’s in package base, so you don’t pull in another dependency
  • It’s the standard Haskell operator for this concept
  • It reads well (verbally): “fromDataFrame … and fit … and forecast … and toDataFrame”
  • The < and > on operators have an established meaning in Haskell, they usually imply something is “lifted”: (<$>), (<$), ($>), (<*>), (<*), (*>), (<&>), (<$?>), (<&?>) etc.

You may want to think about documenting your type parameters, because m in a constraint often means “some monad” and t often means “some traversable”. Choosing short abbreviations and using them consistently (e.g. mdl or model for “model”?) might help here.

I personally don’t think -XDataKinds is that scary, particularly if you can put in the work to do helpful custom type errors, and -XRequiredTypeArguments means you can ask for the type variables you need, rather than relying on users’ intuition that an inference failure requires a type application. But I’m not a data wrangler and might not be your target audience. I do note that if your type checking is fast enough, changing something and getting a type error is a better experience than changing something, trying to run your code, and having it fall over with a runtime type error. Whether you can get Python- or R-using data people across the initial hump of “the types are there to help you” is a question best answered by testing on real users.

1 Like

Hi! Jireh from dataHaskell here. |>, as you have identified, is simply &! We re-export it for data scientists coming from other data-science-y frameworks that have pipe functionality, e.g. tidyverse. Well spotted :slight_smile:

Also, DataHaskell is popping! Please join our Discord.

2 Likes

Thank you so much for your thoughtful feedback! The snippets I included above are simply mock-ups of what I think it’ll (eventually) look like - Sibyl is nowhere close to being that convenient yet - but it has become standard dataHaskell notation (in dataframe, at least) to use the |> pipe. I think it’s pretty, haha, perhaps because I’m a data wrangler who comes from the pandas and R notebook way of doing things.

The point about the m in the type constraint is a good one. I’ll probably end up switching to mdl and idx (for t) when the final types get figured out.

@mchav said something poignant this morning: “I saw the angel in the marble and carved until I set him free.” I think the goal for my API design work should be to start with the biggest, most “fancy” abstractions possible and eventually move down the ladder until it feels “just right.” I personally (as a beginner) didn’t realize the power of the DataKinds extension, and it seems like people generally like using it … it does feel awfully right, and Haskell-y.

The inbetween option I’m leaning most heavily towards right now is something like:

myModel :: Fitted 'ARIMA -- instead of
myModel :: Fitted ('ARIMA 1 1 1)

But this is still up for debate. I agree with your point especially that:

I do note that if your type checking is fast enough, changing something and getting a type error is a better experience than changing something, trying to run your code, and having it fall over with a runtime type error.

This is why I’m still an advocate for a facade layer that splits the library between a layer of runtime errors on failure and a “safe” layer of type-driven erroring. In a day or two, I’ll have formalized a draft design document and I’ll definitely post it here for comment; but I’m leaning more into Haskell’s abstractions than away from them at the moment.

Also, thank you for the point about the type aliases; I didn’t know that, oops!

I like the split! Can one convert code from convenience to .Safe in a piecewise fashion?

For DataKinds, I guess it depends on whether the type errors can be made to look understandable (like jackdk mentioned).

I prefer |> over &. I find & very odd-looking in pipes, since I’m used to thinking it has something to do with bitwise operations or references. The |> OTOH is also used in the IHP web framework, as well as in OCaml and Julia, and feels a bit like shell pipes, only with a direction added.

1 Like

The problem I see with (|>) is that we already have (<|>), the Alternative operator, so it clashes with established practice. It’s also inconsistent with the established visual language for operators in Haskell: given ($) :: (a -> b) -> a -> b, (<$>) :: Functor f => (a -> b) -> f a -> f b is like a “lifted” version of ($). So when you learn (&) :: a -> (a -> b) -> b, you can immediately tell what (<&>) is going to do: (<&>) :: Functor f => f a -> (a -> b) -> f b. (|>) is a road to nowhere in that regard: there is no (<|>>), and if there were, it would be confused for an operator to do with Alternative. Training new Haskellers to use non-standard operators like (|>) artificially limits their ability to interact with the wider Haskell library space. I don’t endorse that, because the reason I value Haskell is exactly that the language, libraries, and community look for bigger, more powerful ideas that can be expressed elegantly.

I would recommend using a WARNING pragma to smooth the on-ramp for new users, teach them the common Haskell name for the operator, and provide a way to turn the nag off if the user really doesn’t care:

{-# WARNING in "x-pipe-operator" (|>) "|> is conventionally written & in Haskell, and can be found in Data.Function." #-}
(|>) :: a -> (a -> b) -> b
(|>) = (&)
2 Likes

I think & is exported by DataFrame, not Sibyl.

Yeah. The (|>) operator is from DataFrame, I just happen to have picked it up as a programmer around the dataHaskell folks. I think @jackdk makes good points about the operator, but I think it originally existed to keep a similar syntax to Elixir’s pipe operator.

This decision is less of a design decision on my end, but all good points.

Additionally, I’ve just begun drafting Sibyl’s model API, which is uploaded as a doc in the latest commit (sibyl/planning/design_draft.md at main · jackiedorland/sibyl · GitHub) after some revision & discussion with other folks at dataHaskell. I’d appreciate comments on this as well!

1 Like

It essentially works, so far, as an import facade layer. So if you want access to the underlying (safe) functions, you just import the module itself: i.e. import Sibyl.Forecast instead of just import Sibyl. The top-level module exposes all the ‘unsafe’ wrapper functions, and each individual module is safe by default. This might make things a little annoying for users who want to mix both layers in their applications, but you can just import the individual module (probably qualified) and hide its wrapped functions using the hiding keyword. This is still up for debate, though, and I’ve often thought of moving every module to its own Safe domain and renaming the top-level wrapper layer to Unsafe, as in Sibyl.Unsafe and Sibyl.Safe.Module.

1 Like

Hey all, I’ve started a GitHub discussions on the repo if anyone has anything to put over there too: jackiedorland/sibyl · Discussions · GitHub

Hey @jackdk, I introduced it to dataframe. Initially I had co-workers use the library and they all irrationally hated & (and a lot of the symbols in general, e.g. $) - when I started pointing out these symmetries they were a little turned off by the pedantry. Might be that I’m a bad spokesperson for this, but in the name of putting down one less hurdle I made the synonym. Strangely, people don’t take issue with other parts of the syntax - symbols are just a really big thing for the users I’ve worked with. A few have eventually reverted to &, but a bunch still feel like it looks too much like a short-circuiting conjunction.

After we moved to that pipe-looking operator, the pipeline looked “more approachable” too, e.g. in feedback from the data science Discord.

So it was a matter of - what hill do I wanna die on in the name of user acquisition.

That said, the warning sounds like a good idea.

4 Likes

Furthermore, |> is the “snoc” operator used in e.g. Data.Sequence (or for more general types in Lens).

So I agree with jackdk that it is a bad idea to use it instead of &.

1 Like

I’m pretty sure it’s already in use in the flow package.

Don’t let the haters get you down. |> is great!

3 Likes

Yes, it’s a tough spot to be in, because the pitch for Haskell itself is that you have to learn new ways and unlearn some old ones, but we promise that the payoff will be worth it. This is why I suggested preserving the on-ramp with a custom warning group. Thanks for considering it.

I’ve tried to read this but I don’t have enough domain knowledge to offer strong opinions. It does sound like there’s not a lot of data you’d want to keep at the type level, and if some of your models don’t have statically-knowable parameters, then it does seem best to avoid the type-level machinery. You might find that the grenade library provides some idioms you could use, if you do take the type-level route.

If you don’t have laws to describe the typeclass, or if there aren’t many generic functions you can write over the instances, you might also get value from specifying the models themselves as values instead of type-level tags. Not entirely in the record-of-functions “capability” sense, but as inert data that can be read by model-agnostic functions.
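For instance, a toy sketch of the models-as-values idea (all constructors invented for illustration):

```haskell
-- A model specification as ordinary data rather than a type-level tag:
data Spec
  = ARIMA Int Int Int   -- p, d, q
  | SES Double          -- simple exponential smoothing, with alpha
  deriving (Show, Eq)

-- Model-agnostic functions can then inspect any spec uniformly,
-- without typeclass dispatch:
describe :: Spec -> String
describe (ARIMA p d q) =
  "ARIMA(" ++ show p ++ "," ++ show d ++ "," ++ show q ++ ")"
describe (SES alpha) =
  "SES(alpha=" ++ show alpha ++ ")"
```

Because specs are first-class values here, things like automatic model selection become ordinary functions over lists of Spec, at the cost of the compile-time guarantees the DataKinds route would buy.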

Have you considered writing your examples with the BlockArguments extension enabled? That way you can replace all $ with do, which I’m sure will be more palatable to them.

2 Likes

When looking at Sibyl.TimeSeries it seems that the concept of time is just any type t that has Unbox and Ord instances. Does that mean a time series is a collection of singular events along some totally ordered domain?

I work with time series a lot, and the two problems that arise most frequently are

  1. Values associated with time intervals rather than singular time points, and finding intersections between those intervals. Hours, days, weeks, months…
  2. Converting between time representations, when matching time series from different sources.

Only in the most regular scenarios can a time interval be represented by a single LocalTime, e.g. when the sampling is equidistant and the time zone never changes. The implementation of Sibyl.Safe.TimeSeries.diff suggests you are thinking exclusively about this scenario.
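One possible shape for interval-valued series, just to sketch the idea (not a proposal for Sibyl’s actual API):

```haskell
-- A value attached to a half-open interval [start, end),
-- rather than to a single instant.
data Interval t = Interval { ivStart :: t, ivEnd :: t }
  deriving (Show, Eq)

newtype IntervalSeries t a = IntervalSeries [(Interval t, a)]

-- Intersection of two intervals, which is where matching series
-- from different sources would begin:
intersect :: Ord t => Interval t -> Interval t -> Maybe (Interval t)
intersect (Interval a b) (Interval c d)
  | lo < hi   = Just (Interval lo hi)
  | otherwise = Nothing
  where
    lo = max a c
    hi = min b d
```

Keeping t abstract (any Ord) sidesteps problem 2 somewhat, since conversions between time representations stay the user’s responsibility at the boundary.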

2 Likes

Thanks for looking at the library - originally I figured we might do something like a “Frequency” type (Monthly, Daily, Hourly, Minutely, Secondly, etc.), but decided against it because I felt it was too rigid. I’m a student who works with time series in a very pedantic sense, so insight into common use cases for statisticians would be really, really useful.

How would you suggest we represent the type internally? Another issue we ran into was whether to allow nulls in the data, or whether we should make users clean up their data before asking Sibyl for help - which I’m starting to think isn’t the correct approach. It may end up that Sibyl operates on DataFrames themselves and allows for irregularly spaced time data and other unordered data, or intervals of data. Working out how to represent that in Haskell has become a challenge for me during planning… also @mchav, what do you think?

1 Like