Request for feedback: Haskell implementation of series, or labeled arrays

Hello everyone,

At work, we have needed a data structure for series – think pandas Series but for Haskell.

I have developed the package javelin for this purpose. Before I upload it to Hackage (and start following the package versioning policy), I would like to know what the community thinks of it.

Temporarily, the documentation is hosted on GitHub Pages, including a tutorial.

I have a few questions to get the conversation started:

  • Would you use this in its current state? What’s missing to make this useful to you?
  • How do you feel about the restriction to have unique keys (like Data.Map)?
  • The syntax for groupbys was chosen to optimize readability for non-Haskell users:
    xs `groupBy` by `aggregateWith` agg
    which my coworkers preferred, instead of the more usual
    groupBy by agg xs
    Do you have strong feelings about this?

I’m also happy to consider feature requests, which you can suggest here or file on Github.

10 Likes

On the one hand I appreciate the different from-/to- functions (fromList, fromVector, fromSet, etc.), but at the same time, a

class IsSeries a k v where
  toSeries :: a -> Series k v
  fromSeries :: Series k v -> a

might be a nice addition so that you don’t have to change the function whenever you change the type of container you’re doing the transformations from/to.

instance IsSeries [(k, v)] k v where
  toSeries = fromList
  fromSeries = toList

instance IsSeries (Strict.Map k v) k v where
  toSeries = fromStrictMap
  fromSeries = toStrictMap

And maybe also an IsIndex class to do the transformation from/to Indexes? :man_shrugging:


EDIT: Could even add

instance IsSeries (IntMap a) Int a where ...
2 Likes

That’s an excellent idea, thanks!

Thinking about it a bit more… it might even be nice UX to have the type classes’ methods (to-/fromSeries and to-/fromIndex, or whatever you decide) be the default API, and provide a separate module called Data.Series.Explicit that exports all the to-/fromList, to-/fromMaps, etc. :thinking:

I do know some people prefer to be explicit about which type they want to transform from/to, for type inference reasons or to prevent certain types of bugs after refactors.

Type annotations are required to make toSeries and fromSeries work in examples and doctests. In order to help beginners get going quickly (e.g. my colleagues who are just starting to learn Haskell), I think I will leave the more explicit functions available and add the IsSeries typeclass as a nice-to-have

Edit: I realize I should use FunctionalDependencies to help with type inference here
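For reference, a sketch of what that could look like with a functional dependency (this assumes Data.Series exports Series, fromList and toList; the dependency a -> k v lets GHC infer the key and value types from the container):

{-# LANGUAGE FunctionalDependencies #-}
{-# LANGUAGE FlexibleInstances      #-}

import Data.Series (Series, fromList, toList)

class IsSeries a k v | a -> k v where
    toSeries   :: a -> Series k v
    fromSeries :: Series k v -> a

-- Building a Series sorts and deduplicates keys, hence the Ord constraint.
instance Ord k => IsSeries [(k, v)] k v where
    toSeries   = fromList
    fromSeries = toList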

I definitely think I’ll give this a go; lovely work! :slight_smile:

1 Like

The original Javelin was primarily a time-series tool, so I would expect its Haskell counterpart to support that out of the box, too. What is the prevailing type of keys at your workplace?
That said, the select function would cover most of my use-cases for timeseries. Regarding keys, it has proven to be a useful abstraction to regard time keys not as points in time but as time intervals, e.g. the time interval an associated value was aggregated over. That makes selection queries somewhat more complicated, since one must take care of properties like subset inclusion and partial intersection. Furthermore, should the time intervals be considered closed or half-open?
For the former, I have released closed-intervals, which has a less complicated API than e.g. data-interval.
A timeseries fold would then not only fold over the values, but also aggregate the time interval keys into a new time key reflecting the union (convex hull) of the aggregated keys.

foldHull :: ConvexHull k => Fold a b -> Series k a -> Series k b
-- always produces a singleton series

As far as I recall, that was how Javelin was meant to work.
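A minimal, self-contained sketch of that idea, using a homemade ClosedInterval type and a plain list function in place of a Fold (so none of this is the closed-intervals API):

data ClosedInterval t = ClosedInterval t t
    deriving Show

-- The convex hull of two intervals: the smallest interval containing both.
hull :: Ord t => ClosedInterval t -> ClosedInterval t -> ClosedInterval t
hull (ClosedInterval a b) (ClosedInterval c d) = ClosedInterval (min a c) (max b d)

-- Fold the values of an interval-keyed series (here just an association list)
-- and aggregate the keys into the hull of all intervals.
foldHull :: Ord t => ([a] -> b) -> [(ClosedInterval t, a)] -> Maybe (ClosedInterval t, b)
foldHull _   []                = Nothing
foldHull agg kvs@((k0, _) : _) =
    Just (foldr hull k0 (map fst kvs), agg (map snd kvs))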

Once you have done computations with a Series, what do you do with it? Perhaps you might want to add a default rendering function to Pandoc (in a separate package to keep dependencies low), so that users can build documentation of a Javelin computation. That might integrate nicely with my provenience package that was also inspired by Javelin but captures a different aspect of it.

1 Like

The original Javelin was primarily a time-series tool

This is an accidental name clash! I had never heard of the original Javelin before. It appears that the company working on it disappeared before I was born…

We use UTCTime from the time package as the keys most often, although they represent open intervals of fixed width. I’ll take a look at the packages you’ve linked.

I wanted to create another related package which contains time-series-specific functionality. Most importantly, some forms of rolling operations can be optimized. That would be a good place to also expose time-interval functionality, like the foldHull you mention

Neither did I, before someone on haskell-cafe mentioned it to me. It appears that Microsoft Excel is going to become more like the old Javelin, with named ranges, Lambda and array functions in beta testing.

My first impression is that the recurrence relation trick cannot be simulated by laziness, so that’s genuine cleverness!
As long as all intervals are of equal length and non-overlapping, you should be good with UTCTime. The fun starts when intervals can be nested. Then you really need other index structures.

Another thing that constantly bugs me as a timeseries-digesting data scientist is timeseries with stamps in LocalTime without any mention of time zone. Creating a good heuristic for inferring, say, daylight savings time versus standard time in the presence of variable-width intervals is hard. The tz package only goes halfway, for example. A good and universal heuristic may be worth its own package.

1 Like

After considering the feedback here, I have released the first version of javelin on Hackage. Happy hacking

5 Likes

Let me try to paraphrase my stream of consciousness while reading through your announcement and tutorial.

  1. A time series library in Haskell! That’s awesome! Haskell needs more and better tools for data analysis! I hope it’s compatible with everything else that I use for data analysis and simulation currently.
  2. Hmm, comparison with pandas. That’s maybe a good thing because pandas is established. But I also wonder how it will perform memory-wise. My typical problem with data analysis in Python is that it will OOM because it tries to cram the whole dataframe in memory. But Haskell is lazy, so it will maybe not suffer that issue.
  3. Hackage package looks pleasant, slim. There is a tutorial. Great. Github looks good, a bit new, but great. Hopefully the package will be maintained for a long time.
  4. Let’s read the tutorial. Ah, what’s >>> supposed to mean? Is this really a real GHCi session or documentation?
  5. I really hope the show instance is not the table display. Sigh, it is.
  6. I guess this looks like a well-designed library. It’s useful for what it is.
  7. Wait, how do I do numerical integration? Any other time-aware filter like exponential moving average? Linear/Bezier/… interpolation? Where do I ever use that the key of the series represents time?
  8. In my typical application, I never just have one time series. I have a lot of them, and they are indexed by e.g. the data source. How do I deal with that here? I guess at the end I do need multi-dimensional keys, but I don’t understand whether I can do this here.
  9. Is this time series actually lazy? Can I add new values efficiently while dropping or aggregating on the other (old) end?
  10. I guess I need a bigger, fully fleshed-out time series analysis example, maybe a Kalman filter or frequency analysis with multiple data sources, before I can judge whether I can use this.
  11. What’s the data structure behind Series? Is it just a vector and a map with keys → indices?

Thanks for your feedback. Let me try to address the questions:

Hmm, comparison with pandas. That’s maybe a good thing because pandas is established. But I also wonder how it will perform memory-wise. My typical problem with data analysis in Python is that it will OOM because it tries to cram the whole dataframe in memory. But Haskell is lazy, so it will maybe not suffer that issue.

Is this time series actually lazy? Can I add new values efficiently while dropping or aggregating on the other (old) end?

What’s the data structure behind Series? Is it just a vector and a map with keys → indices?

This first implementation stores the keys in a Set and the values in a Vector. Keeping the keys in a Set allows for fast membership testing and searching, while keeping the values in a Vector allows for fast numerical operations.
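To make that concrete, here is a minimal sketch of that kind of structure (not necessarily javelin’s exact definition):

import           Data.Set    (Set)
import qualified Data.Set    as Set
import           Data.Vector (Vector)
import qualified Data.Vector as Vector

-- Keys live in a sorted Set, values in a contiguous Vector; the value at
-- position i belongs to the i-th smallest key.
data Series k a = MkSeries
    { index  :: Set k
    , values :: Vector a
    }

-- Looking up a key is a position lookup in the set followed by an O(1)
-- access into the vector.
lookupKey :: Ord k => k -> Series k a -> Maybe a
lookupKey k (MkSeries ks vs) = do
    i <- Set.lookupIndex k ks
    vs Vector.!? i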

This structure isn’t efficient for certain things; for one, changing the shape of a Series is very slow, because you need to copy the entire array. If you operate solely on the values of the series, then Vector's fusion mechanism may make things a bit faster, but in general, this is very much a ‘dense’ in-memory representation of a series.

One thing that I have seen in polars (rust dataframes) is separate lazy and strict APIs. It would be great to do something like this in javelin, but what I needed now for work is the strict API, so that’s what I built first.

Let’s read the tutorial. Ah, what’s >>> supposed to mean? Is this really a real GHCi session or documentation?

If you use HLS, the >>> lines do indeed run in a real GHCi session. The javelin project makes heavy use of doctest to check that the documentation is perfectly aligned with what you would get in a ‘real’ GHCi session, and doctest is triggered by >>>.
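For example, in a Haddock comment (this is generic doctest usage, not specific to javelin):

-- | The line after '>>>' is run as if typed at a GHCi prompt, and doctest
-- checks that the output printed below it matches.
--
-- >>> sum [1, 2, 3 :: Int]
-- 6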

I really hope the show instance is not the table display. Sigh, it is.

I’ve done this to keep the documentation examples as simple as possible. I’m converting coworkers who have never used Haskell before, so it was important for me that the examples were as easy as possible to reproduce. Note that you can customize the printing of series using the displayWith function.

What else would you have wanted from the Show instance? You could have it be like Map, but I personally find this much harder to read

Wait, how do I do numerical integration? Any other time-aware filter like exponential moving average? Linear/Bezier/… interpolation? Where do I ever use that the key of the series

In my typical application, I never just have one time series. I have a lot of them, and they are indexed by e.g. the data source. How do I deal with that here? I guess at the end I do need multi-dimensional keys, but I don’t understand whether I can do this here.

I guess I need a bigger, fully fleshed-out time series analysis example, maybe a Kalman filter or frequency analysis with multiple data sources, before I can judge whether I can use this.

The javelin package doesn’t contain anything specific to time-series just yet, although some functionality applies to time series (e.g. you can use the windowing function for rolling aggregations, which I use a lot).
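To give a feel for what a rolling aggregation does, here is a minimal sketch over a plain Data.Map; this is not javelin’s windowing API, and rollingWith and its signature are illustrative only:

import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- For each key k, aggregate over all entries whose key lies in the half-open
-- window (k - width, k]. A real implementation would slice the map instead of
-- filtering it from scratch for every key.
rollingWith :: (Ord k, Num k)
            => k               -- window width
            -> (Map k a -> b)  -- aggregation applied to each window
            -> Map k a
            -> Map k b
rollingWith width agg series =
    Map.mapWithKey (\k _ -> agg (window k)) series
  where
    window k = Map.filterWithKey (\k' _ -> k' > k - width && k' <= k) series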

My plan was always to have a separate javelin-timeseries or similar which contains time-series specific functionality. I don’t have a need for it right now, so it doesn’t exist, but I would be very interested in your opinion on what should go in there.

Would you be willing to open a GitHub issue containing feature requests for time-series functionality? That would be very helpful. Or you can post them here

2 Likes

Personally I really like the table output for the Show instance, and it corresponds to the pandas behavior.

1 Like

My recommendation would have been to put display $ in front of every line of the examples.

The Haskeller’s expectation is (afaik) that the result of Show is something that can be parsed as valid Haskell, and evaluates to the original value. Basically, it should be an inverse to Read. This is of course not enforced, and many other instances in the wild break this expectation as well.
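Concretely, the expectation is that this round trip holds (a plain GHCi session, nothing library-specific):

>>> let xs = ('a', [1, 2, 3 :: Int])
>>> read (show xs) == xs
True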

I personally think the right approach is what HyperHaskell (hyper: Display class for the HyperHaskell graphical Haskell interpreter) does: have a separate class for pretty printing. GHCi is maybe not the first choice for exploratory data analysis; a Jupyter notebook might be, though. I don’t know whether the ihaskell kernel depends on Show as well, or whether it could be hooked to a richer type class when displaying the result of a cell.

To go off on a tangent: I know it is tempting to mimic everything pandas does. After all, pandas is wildly successful. But only copying existing pandas features is not enough for success. javelin would need USPs. Haskell interoperability and the ability to store arbitrary Haskell terms in the time series is already a great feature that pandas doesn’t have! But ideally, this is not the only advantage.

windowing is quite useful and one can probably implement all FIR filters with it, but I’m missing something to implement IIR filters space-efficiently. This can probably only work with a lazy (possibly batched) time series.
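To illustrate the distinction with plain lists rather than javelin’s API (fir, ema and windows below are just illustrative):

import Data.List (tails)

-- FIR: each output is a dot product of fixed coefficients with a window of
-- recent inputs, so a windowing/rolling primitive is enough.
fir :: Num a => [a] -> [a] -> [a]
fir coeffs xs = [ sum (zipWith (*) coeffs w) | w <- windows (length coeffs) xs ]
  where
    windows n ys = takeWhile ((== n) . length) (map (take n) (tails ys))

-- IIR (here a simple exponential moving average): each output feeds back into
-- the next one, so it cannot be phrased as a per-window aggregation; a lazy or
-- streaming representation lets it run in constant space.
ema :: Double -> [Double] -> [Double]
ema _     []        = []
ema alpha (x0 : xs) = scanl (\prev x -> alpha * x + (1 - alpha) * prev) x0 xs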

javelin-timeseries sounds like a great idea, then my only minor nitpick would be that you already advertised javelin as a time-series library. Also, I’m not sure why it has to be a separate Hackage package - are you planning on adding extra dependencies, Template Haskell, or similar, that makes it necessary to move these functionalities outside the main package?

For relevant features, I like to think about a prime example. One example I’m thinking about here is an IoT backend server. It receives multiple time series of events or telemetry from different devices, has to ingest them, process them in realtime, and react. So I need an API to lazily build a series (possibly in chunks like ByteString.Lazy, if that helps with performance), ingest it, concatenate it chronologically with other time series, do some aggregation on it, and when I’m done with those values, the used-up older parts of the time series should be garbage collected. All this should happen in nearly constant memory. At least, I should not have to hold the whole time series in memory at once.

You may say that this application is out of scope, and I should use Functional Reactive Programming for this. You may have a point. But I have a different real-life example which should certainly be in scope, and is computationally nearly as hard: in a huge data store I have time series of a thousand devices over a year, approximately at second resolution. I want to compute the temporal mean of the standard deviation over devices of the time series. There is no way to hold all these time series in memory at the same time, so I need some lazy streaming API.
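For concreteness, a rough sketch of the shape of that computation, using the foldl package over plain lazy lists rather than javelin’s API, and assuming the per-timestamp grouping of device readings has already been done:

import qualified Control.Foldl as L

-- Stream over timestamps: reduce each timestamp's device readings to a
-- standard deviation, then feed those into a running mean. With a lazily
-- produced outer list, only one timestamp's readings are live at a time.
temporalMeanOfStd :: [[Double]]  -- one inner list of device readings per timestamp
                  -> Double
temporalMeanOfStd = L.fold L.mean . map (L.fold L.std)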

I would very much discourage using read for anything serious, and the read . show = id invariant is wishful thinking at best. More to the point, this library targets data science, where interactive use is the default. GHCi uses the Show instance to print terms. Why ask a library author to contort themselves and provide additional pretty-printing facilities? Additionally, here we are dealing with potentially large data which should not be printed out (or read back in) in full.

1 Like

But only copying existing pandas features is not enough for success. javelin would need USPs.

What’s a USP?

javelin-timeseries sounds like a great idea, then my only minor nitpick would be that you already advertised javelin as a time-series library.

That is a good point, I’ve edited the post to reflect that it is a general series library. In my mind it remains about timeseries because I use it exclusively for this purpose, but I agree that there isn’t much time-series only functionality

Also, I’m not sure why it has to be a separate hackage package - are you planning on adding extra dependencies, Template Haskell, or similar, that makes it necessary to move these functionalities outside the main package?

Time-series functionality would have to rely on at least the time package (and possibly tz), so there’s that. I also want to ensure a good developer experience by factoring functionality into separate packages, so that compilation is only as long as it needs to be. I could be convinced otherwise.

For relevant features, I like to think about a prime example. (…)

That is a tough example to address. If data is going to be streaming and you want to accumulate it, you need a data structure with fast consing (in front) and fast unsnocing (in the back). This is at odds with the purpose of a series (as I and my industry use them), which is a dense representation that can be manipulated fast. You have the following two facts in conflict:

  • For fast numerical operations, elements should be contiguous in memory => use an array!
  • For fast consing/unsnocing, you shouldn’t have to copy the structure when it changes => don’t use an array!

So in this case, I think the javelin series isn’t the right data structure.

But I have a different real life example which should certainly be in scope (…)

That is a great example of something I want to make possible! As you mention, this requires a lazy API. I created a feature request for a lazy API and I’ll think about how to implement this.

Sorry for being so terse, it’s a Unique Selling Point. So ideally something that might one day make people consider switching from pandas to javelin :smiley: that is of course a very long way to go, and it is unusual (but not unheard of) to change languages just to use a different library. Dreams of the future aside, my point was that javelin should have an outstanding feature that pandas doesn’t have.

That’s a good point. Maybe it makes sense to be abstract in the choice of timestamp representation, since the community seems to be undecided on which time library to use. E.g. there is thyme (a faster time library) and some other packages out there. I started time-domain (a library for time domains and durations) some time ago to go in that direction, but I didn’t incorporate thyme & co yet (PRs welcome) and time-domain isn’t very popular yet.

I agree, a standard vector isn’t cut out for this. But possibly one could cook up a sequence of vectors. Basically, the old batching technique for long series, where the length of the vectors is the batch size. With good tuning of that size, one might be able to get away with an acceptable constant memory footprint and nearly the same speed as a vector.
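A very rough sketch of that batching idea (Chunked, chunkSize and the functions below are made-up names, not an existing API):

import           Data.Sequence (Seq, ViewR (..), viewr, (|>))
import qualified Data.Sequence as Seq
import           Data.Vector   (Vector)
import qualified Data.Vector   as Vector

-- A long series kept as a sequence of bounded vector chunks. Appending copies
-- at most one chunk-sized vector, and old chunks can be dropped from the front
-- and garbage collected.
newtype Chunked a = Chunked (Seq (Vector a))

chunkSize :: Int
chunkSize = 4096  -- the batch size to tune

snocValue :: a -> Chunked a -> Chunked a
snocValue x (Chunked chunks) = Chunked $ case viewr chunks of
    rest :> lastChunk
        | Vector.length lastChunk < chunkSize
            -> rest |> Vector.snoc lastChunk x
    _   -> chunks |> Vector.singleton x

dropOldestChunk :: Chunked a -> Chunked a
dropOldestChunk (Chunked chunks) = Chunked (Seq.drop 1 chunks)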

Fantastic :slight_smile: this is an actual industry use case and I’d love to tell my colleagues that there is a Haskell library to do this job! Just let me point out here that such a series naturally has multiple indices. One is still over time, and the series should optionally be lazy over it, but the other axis (the different devices) can be strict.

In some cases, one could get away with a series over vectors (this is the case where every device produces exactly one value per timestamp, or we resample the device time series first), but often this is not possible as each device will emit values at different times. In that case, one probably has something like Series (Time, DeviceId) a, but one would like to filter for a particular DeviceId.
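For example (with a made-up DeviceId type and a plain Map standing in for a Series), filtering one device out of such a multi-indexed structure could look like:

import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Time       (UTCTime)

type DeviceId = Int  -- hypothetical

-- Select the sub-series of a single device: keep only the keys whose second
-- component matches, then drop the device component from the key.
selectDevice :: DeviceId -> Map (UTCTime, DeviceId) a -> Map UTCTime a
selectDevice wanted =
    Map.mapKeys fst . Map.filterWithKey (\(_, d) _ -> d == wanted)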

1 Like