Serializing Haskell functions to disk

Let me poke at the “no one has to write code” bit: what are the users submitting?

We’re helping users determine the “best” bid/offers for electricity, where “best” could be understood in a variety of ways. For example:

  • Minimize energy costs / maximize energy revenue;
  • Extend effective life of energy storage, which should be cycled carefully;
  • Minimize greenhouse gas emissions.

Most users don’t have the interest, expertise, or data to achieve these results. Therefore, we want to create strategies ourselves on behalf of our users. That’s why “no one has to write code”.

So we will crunch the numbers (analyzing energy consumption/generation patterns and external factors) to create a strategy. The user chooses what to optimize (out of choices we provide), which is easy enough.

So my question of serializing Haskell functions is about creating the strategies asynchronously in the backend, storing the results somewhere, and executing the strategies some time later.

1 Like

As an example, at my previous job, we would train machine-learning models in Python and dump the model as a pickle on Amazon S3. Then, some other service would load the machine-learning model from S3 and execute it.

Was that just weights (i.e. numeric data) or code as well?

Weights and code. pickle is pretty nice in that regard.

But pickle doesn’t actually serialise the code, it just serialises a reference to it, so if you’re not running exactly the same Python program when you deserialise code then something bad is going to happen (example below).

If you want to achieve something similar in Haskell you could just have a Map mapping String identifiers to Haskell functions, and then pass around the Strings outside the process, as desired.

% python3 
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> def f(): print("Hello")
... 
>>> pickle.dumps(f)
b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x01f\x94\x93\x94.'
% python3
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> pickle.loads(b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x01f\x94\x93\x94.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Can't get attribute 'f' on <module '__main__' (built-in)>
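
To make the Map idea concrete, here's a minimal sketch; the strategy names and the function type are made up for illustration:

import Data.Map (Map)
import qualified Data.Map as Map

-- A process-local registry; only the String key ever gets written to disk
-- or sent over the wire.
strategies :: Map String (Double -> Double)
strategies = Map.fromList
  [ ("scale-up",   (* 1.1))
  , ("scale-down", (* 0.9))
  ]

-- "Deserialization" is just a lookup, and it fails loudly if the running
-- program no longer knows the name -- the same failure mode as pickle above.
lookupStrategy :: String -> Maybe (Double -> Double)
lookupStrategy name = Map.lookup name strategies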
3 Likes

Uh, I guess I never realized that we always need to have the classes in scope when deserializing. Thanks for educating me!

2 Likes

Also, distributed closures have limitations: you (or rather the libraries your models use) need to provide a Typeable instance for anything to be serializable, via instance Typeable a => Binary (Closure a) from Control.Distributed.Closure.
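
For context, a Closure is built on top of GHC's static pointers, and only a key pointing at compile-time-known code is ever serialized, never the code itself. A minimal sketch using just GHC.StaticPtr (no distributed-closure involved):

{-# LANGUAGE StaticPointers #-}

import GHC.StaticPtr (StaticPtr, deRefStaticPtr, staticKey)

-- 'static' only accepts closed expressions that are known at compile time.
incr :: StaticPtr (Int -> Int)
incr = static (+ 1)

-- Only the key (a fingerprint) would ever be serialized, so the receiving
-- process must already contain the same code in order to resolve it.
main :: IO ()
main = do
  print (staticKey incr)          -- the fingerprint identifying the code
  print (deRefStaticPtr incr 41)  -- resolving and applying it locally: 42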

Everything’s automatically Typeable, isn’t it?

Every data type you define is Typeable, and I believe all built-in types are too, but existentially or universally quantified types are not.

3 Likes

You could fork hell, which is 1.2k lines of code in a single file and gets you a mini-Haskell for scripting. It’s pretty easy to configure what types are supported, and what primitives, both monomorphic and polymorphic. It has some basic type classes like Eq, Ord, Show and Monad, but you can’t write your own within the object language.

I kept it intentionally one file to artificially limit the size of the implementation, and to make it easy for someone to fork and re-use for another purpose.

You could generate an untyped AST in haskell-src-exts format, which is easy to serialize to/from disk. [The subset of] Haskell syntax is stable over time, unlike some binary format, which is likely to lead you into trouble. GHC’s API and general infrastructure are also fast-moving and wobbly, so I would never use them in a deployed app. If you’re always planning on generating code that has type annotations, then you probably don’t need type inference and could cut out that whole block of code, bringing it down to ~500 lines.
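
A tiny sketch of that serialize-as-source-text round trip with haskell-src-exts; the expression here is hand-written rather than generated, just to keep it short:

import Language.Haskell.Exts (ParseResult (..), parseExp, prettyPrint)

-- Pretty-print the AST to a file and parse it back later; plain source
-- text is the representation that stays stable across versions.
main :: IO ()
main =
  case parseExp "\\t -> t * 2 + 1" of
    ParseFailed _ err -> putStrLn err
    ParseOk expr -> do
      writeFile "strategy.hs" (prettyPrint expr)
      reloaded <- readFile "strategy.hs"
      case parseExp reloaded of
        ParseFailed _ err -> putStrLn err
        ParseOk expr'     -> putStrLn (prettyPrint expr')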

Performance-wise, a basic fib implementation outperforms GHCi slightly. That isn’t a brag or a rigorous benchmark, but it does indicate that performance isn’t bad, which makes sense: the evaluator would fit on a napkin and doesn’t really do anything.

One small note: if performance or distribution becomes a problem, you can generate Haskell code from a Hell AST (e.g. via haskell-src-meta, which parses source into template-haskell abstract syntax) and compile it with GHC, as Hell doesn’t “add” any features that Haskell doesn’t already have, and you could output WASM down the line if needed. That adds the burden of depending on the GHC toolchain, but it’s a path forward nonetheless.

3 Likes

More to the point, do things like Vector a -> a have a Typeable instance? I suspect we’re talking about trainable models in the machine-learning sense here.

That’s a great practical example, and it shows what I mean: the function and the vector are Typeable, but the polymorphic a is slightly problematic. Instead you’d have to use Typeable a => Vector a -> a.
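
A small sketch of what that looks like in practice, with Data.Vector standing in for whatever the model library actually uses:

import Data.Typeable (Typeable, TypeRep, typeRep, Proxy (..))
import Data.Vector (Vector)

-- The fully instantiated type is Typeable with no extra work:
concreteRep :: TypeRep
concreteRep = typeRep (Proxy :: Proxy (Vector Double -> Double))

-- Once 'a' is polymorphic, the constraint has to be carried around explicitly:
polymorphicRep :: Typeable a => Proxy (Vector a -> a) -> String
polymorphicRep p = show (typeRep p)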

1 Like

Based on the description of the problem, I would personally look into defunctionalizing your Strategy, i.e. instead of serializing the function, serialize whatever data you used to construct that function. After all, you’re not reading Haskell code from an input field in the UI or from a database; serialize the thing you are reading, or maybe an intermediate stage between the raw input and the eventual Strategy.

Independent of the language used, serializing functions is very messy conceptually (except maybe for languages like C where functions can’t have closures).

I think of defunctionalization sort of as an incremental way to implement a DSL. Initially there’s only one Strategy, so your DSL’s grammar is essentially the unit type (): that one Strategy is the only thing that’s expressible:

theOnlyStrategy :: UTCTime -> Bid
theOnlyStrategy = the . winning . formula

-- assuming Strategy is just a name for the bidding-function type
type Strategy = UTCTime -> Bid

data DSLv1 = TheStrategy

toStrategy :: DSLv1 -> Strategy
toStrategy TheStrategy = theOnlyStrategy

Say you add two more Strategys in the next release, one parameterized over an integer and another over a boolean. Your DSL is then represented as S1 | S2 Integer | S3 Bool, you get a simple migration from the first version of the DSL, and a new toStrategy from DSLv2:

-- the grammar described above
data DSLv2 = S1 | S2 Integer | S3 Bool

migrate :: DSLv1 -> DSLv2
migrate TheStrategy = S1

toStrategy :: DSLv2 -> Strategy
toStrategy S1 = theUsedToBeOnlyStrategy
toStrategy (S2 i) = ...
toStrategy (S3 b) = ...

This would give you the following benefits:

  • You have complete introspection into serialized strategies: you can display them, design domain-specific UIs for modifying them, statically analyze them, and even translate them to an SMT solver and prove properties about them.
  • New releases can improve existing strategies or fix bugs in them.
  • It’s just simple Haskell data (a quick serialization sketch follows this list); no need to worry about the unspeakable horrors involved in serializing code + runtime closures.
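
To show just how boring the serialization side gets, here's my own sketch using binary's generic deriving (any serialization library would do just as well):

{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import GHC.Generics (Generic)

-- Same DSLv2 as above: plain data, no closures anywhere.
data DSLv2 = S1 | S2 Integer | S3 Bool
  deriving (Show, Generic)

instance Binary DSLv2

-- What hits the disk (or S3) is data describing the strategy,
-- never the compiled strategy function itself.
roundTrip :: DSLv2 -> DSLv2
roundTrip = decode . encode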
9 Likes

That seems to be the best approach for now. Thank you!

1 Like

Also, from an ML angle: if the strategy belongs to a parametric family of functions (e.g. it’s a polynomial of fixed degree), you only need to serialize the coefficients, so the closure effectively only needs to exist at runtime:

eval :: DSLv3 -> (UTCTime -> Double)
eval (S3 i1 i2 i3) time = i3 + i2 * t + i1 * t**2
  where t = realToFrac (utcTimeToPOSIXSeconds time)  -- UTCTime isn't numeric, so convert to seconds first
3 Likes

Potentially of interest:

1 Like

At what point do you figure you’re implementing a Lisp in Haskell?
Might as well use Common Lisp with Coalton.

Bartosz has this art-school-teacher vibe: “lambdas are closures, mmmkay?”

1 Like

I think that’s the C++ programmer in Bartosz speaking. When he says “lambdas are not named functions”, he means named functions in the C++ sense, where they can only be top-level. And when he says lambdas are closures, he means it in the sense that C++ lambdas get turned into a closure object if they have a non-empty capture clause. So I think that bit is him reminding his inner C++ programmer that he can’t just pass around function pointers (which are effectively StaticPtr (a -> b) in Haskell land).