Serializing Haskell functions to disk

Hi all,

I am looking to serialize Haskell functions to disk, but there doesn’t appear to be a simple solution.

I am working on implementing financial exchanges for local electricity. One of the key features is that users do not trade manually, but rather define strategies that we execute on a schedule on their behalf.

If we had infinite engineering time, we would create our own scripting language or re-use an embeddable scripting language like Lua. However, for simplicity, I would rather start by using standard Haskell functions if possible.

Consider a simple, small example where we have strategies like so:

import Data.Time.Clock (UTCTime, getCurrentTime)

data Bid = (...)

-- A strategy takes in the current time and creates some data structure
-- of type 'Bid'
type Strategy = UTCTime -> Bid

runStrategy :: Strategy -> IO Bid
runStrategy strat = do
    now <- getCurrentTime
    pure (strat now)

Each user has a different strategy, which we want to be able to store on disk.

This Stack Overflow question refers to Cloud Haskell, but both the Closure type from distributed-process and the newer GHC.StaticPtr serve a different purpose than serialization to disk.

Has anyone faced this problem before?

2 Likes

Unless you actually intend for users to write their own Haskell functions, consider instead building a tree: a data structure describing the strategy. Then write separate, handwritten parsing and serialization for it; that will help you maintain backwards compatibility.

I don’t know if using Haskell as the outward-facing language is a good idea; something dumb and C-like will probably be way easier to both write and read.
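
For concreteness, a minimal sketch of that idea (the constructors are hypothetical): the strategy is plain data, serialization falls out of an ordinary derived instance, and evaluation is a small interpreter.

-- A hypothetical strategy AST: plain data, trivially serializable.
data Expr
    = Price            -- the current market price
    | Lit Double       -- a constant
    | Add Expr Expr
    | Mul Expr Expr
    deriving (Show, Read)  -- Show/Read as a quick-and-dirty disk format

-- Evaluating an Expr against the current price is a tiny interpreter.
eval :: Double -> Expr -> Double
eval price Price     = price
eval _     (Lit x)   = x
eval price (Add a b) = eval price a + eval price b
eval price (Mul a b) = eval price a * eval price b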

1 Like

Yeah that’s probably the way to go in the future, but I’m trying to avoid writing an interpreter (although it would be fun – I have been waiting for an opportunity to put lessons from Crafting Interpreters to use).

Right now, we are creating these strategies on behalf of users, based on the results of machine learning. So no one has to write Haskell functions.

1 Like

GHC includes an interpreter. You can serialize Haskell source code, validate that the source adheres to some domain-specific monad, and execute it. That’s what we do.
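
For example, a minimal sketch of that approach using the hint package (a thin wrapper over the GHC API), with the strategy simplified to UTCTime -> Double for illustration; the source string would normally be read from disk:

import Data.Time.Clock (UTCTime, getCurrentTime)
import Language.Haskell.Interpreter

-- Interpret strategy source text (e.g. read from disk) as a function.
loadStrategy :: String -> IO (Either InterpreterError (UTCTime -> Double))
loadStrategy src = runInterpreter $ do
    setImports ["Prelude", "Data.Time.Clock"]
    interpret src (as :: UTCTime -> Double)

main :: IO ()
main = do
    strat <- loadStrategy "\\_t -> 42.0"
    now <- getCurrentTime
    either print (\f -> print (f now)) strat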

7 Likes

That’s very interesting! I will take a deeper look tonight.

I think there’s no reason that StaticPtr can’t work equally well for serialization to disk as for sending functions over the wire. My advice would be to investigate that approach and see whether there are any particular obstacles before shooting for something else…

My reading of the GHC.StaticPtr documentation is that the data always lives in memory, and the StaticPtr only helps in locating the memory location.

Do you know of any uses of StaticPtr in the wild that I could look at? Surprisingly, Cloud Haskell doesn’t use it.

Very interesting. Seems similar to my question Modifying a Running Application.

You already posted a good answer on that though.

Couldn’t you use LLVM, Wasm, or similar? This would give you the flexibility to use Haskell or any other language that can compile to the target.

I want to advertise fancy solutions so much, but if I were to start a company with my own money, my honest answer would be to use the battle-tested hslua library for the scripting layer in a Haskell app. Now that is boring and off-topic, so here are a couple of fancy solutions for serializing Haskell functions to disk:

Static pointers

Static pointers are a mechanism for assigning unique fingerprints to Haskell functions. These fingerprints are 128-bit hashes that can be serialized; hence the name “static pointer”, since no code is really serialized and dynamically loaded.
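
To illustrate the mechanism, a small sketch with GHC.StaticPtr; the StaticKey fingerprint is the only thing you would write to disk, and the function is looked back up in the static pointer table at load time:

{-# LANGUAGE StaticPointers #-}
import GHC.StaticPtr

-- Only expressions with no free variables can go under 'static'.
double :: Int -> Int
double x = x * 2

-- The 128-bit fingerprint: this is all that gets serialized.
key :: StaticKey
key = staticKey (static double)

-- At load time, resolve the fingerprint back to the function.
recover :: IO (Maybe (Int -> Int))
recover = fmap (fmap deRefStaticPtr) (unsafeLookupStaticPtr key)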

Caveats:

  • It does not address the issue of dynamic code loading at all. Say you have deployed your backend and you want to load a user-written Haskell script: you need to insert the user script into your project, recompile the monolith, redeploy, and restart.
  • You can only assign static pointers to functions without free variables, not to arbitrary function closures at runtime. So you need to use special combinators to compose existing static pointers to emulate “function closures” that can be serialized. It feels awkward tbh.
  • You can only assign static pointers to monomorphic functions without constraints, so working with type classes is another pain point.
  • The fingerprint logic is very ad hoc, and you should assume that once you recompile your stuff, existing fingerprints you’ve serialized have likely been invalidated.

Using ghci linker

There’s a well-known blog post on Simon Marlow’s blog that explains the idea. It works for your use case as long as you’re willing to dig into GHC internals, but you’re also completely on your own to guarantee ABI stability. Even if the user-pluggable function has a very primitive type like ByteString -> IO ByteString, as long as that function refers to something provided by your backend, it’s prone to ABI changes when you recompile your backend.

Using ghci interpreter

This is much more heavyweight than the above; your backend now depends on the entirety of the GHC API as well as the global package database. The backend sets up a GHC API session, consumes the user-written Haskell script as a string, then loads that string using the same evaluation mechanism as ghci.

A slow but steady memory leak awaits because of CAF retention, the FastString table, etc. If your backend can tolerate periodic restarts, then it’s the quickest way to get going though.
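
A rough sketch of that setup, in the style of the classic ghc-paths examples (the GHC API shifts between releases, so treat the exact names and module layout as an approximation):

import GHC
import GHC.Paths (libdir)           -- from the ghc-paths package
import Unsafe.Coerce (unsafeCoerce)

-- Evaluate an expression string in a fresh GHC session, ghci-style.
evalStrategy :: String -> IO (Int -> Int)
evalStrategy src = runGhc (Just libdir) $ do
    dflags <- getSessionDynFlags
    _ <- setSessionDynFlags dflags
    setContext [IIDecl (simpleImportDecl (mkModuleName "Prelude"))]
    hv <- compileExpr src
    pure (unsafeCoerce hv)

main :: IO ()
main = do
    f <- evalStrategy "(\\x -> x * 2) :: Int -> Int"
    print (f 21)   -- 42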

One common caveat of using the ghci interpreter, or of stuffing a whole ghci into your backend, is the question of how far you trust the user’s input script. The script, whether in compiled shared-object form or in ghci bytecode form, lives in the same address space as your entire backend, and there’s a ton of dirty business a malicious user’s Haskell script can do. Even with Safe Haskell and a very carefully restricted Prelude, it’s trivial to write a script that allocates a ton of memory as a naive DoS attack.

Using ghc wasm backend

Use the GHC wasm backend to compile user Haskell scripts to wasm modules, then run those modules in your Haskell backend. The RPC calls between the wasm side and the host side can be achieved by specifying a set of operations the wasm side may perform on the host side. Since this is a shameless self-plug, allow me to at least highlight the pros before going for the cons:

  • Proper sandboxing: the wasm module may only perform side effects based on the capabilities granted by the host. It’s also trivial to enforce resource constraints; both memory consumption and execution time can be limited, so you have more confidence consuming arbitrary user input.
  • Once the wasm/host interfaces are properly defined, the same wasm modules can be deployed to a fleet of different host runners with different hardware and operating systems. And the ABI stability issue in other solutions doesn’t exist here.
  • The wasm module execution state can be snapshotted and restored much more easily compared to native binaries.

The main caveat that comes to my mind:

  • It’s still in its early days and hasn’t seen enough seed users yet. The bus factor is still essentially 1 at the moment, and Template Haskell support is still missing.

10 Likes

Right, the data does live in memory, and StaticPtr lets you serialize the “pointer” to it – but that’s the basis on which you can serialize any closure to disk or wire, by letting you compose applications.

Two packages you can make use of that provide this are (I think) distributed-static

https://hackage.haskell.org/package/distributed-static-0.3.10/docs/Control-Distributed-Static.html

and distributed-closure
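
For what it’s worth, here is a sketch of the disk round-trip with distributed-closure; per the fingerprint caveat above, the reading process must be built from the same binary as the writing one:

{-# LANGUAGE StaticPointers #-}
import Control.Distributed.Closure (Closure, closure, unclosure)
import Data.Binary (encode, decode)
import qualified Data.ByteString.Lazy as BL

double :: Int -> Int
double x = x * 2

main :: IO ()
main = do
    -- “Serializing the function” really writes just its fingerprint.
    BL.writeFile "strategy.bin" (encode (closure (static double)))
    -- Later, possibly in another process built from the same binary:
    c <- decode <$> BL.readFile "strategy.bin" :: IO (Closure (Int -> Int))
    print (unclosure c 21)   -- 42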

1 Like

Let me poke at the “no one has to write code” bit: what are the users submitting?

We’re helping users determine the “best” bid/offers for electricity, where “best” could be understood in a variety of ways. For example:

  • Minimize energy costs / maximize energy revenue;
  • Extend the effective life of energy storage, which should be cycled carefully;
  • Minimize greenhouse gas emissions.

Most users don’t have the interest, expertise, or data to achieve these results. Therefore, we want to create strategies ourselves on behalf of our users. That’s why “no one has to write code”.

So we will crunch the numbers (analyzing energy consumption/generation patterns and external factors) to create a strategy. The user chooses what to optimize (out of choices we provide), which is easy enough.

So my question of serializing Haskell functions is about creating the strategies asynchronously in the backend, storing the results somewhere, and executing the strategies some time later.

1 Like

As an example, at my previous job, we would train machine-learning models in Python and dump the model as a pickle on Amazon S3. Then some other service would load the machine-learning model from S3 and execute it.

Was that just weights (i.e. numeric data) or code as well?

Weights and code. pickle is pretty nice in that regard.

But pickle doesn’t actually serialise the code, it just serialises a reference to it, so if you’re not running exactly the same Python program when you deserialise code then something bad is going to happen (example below).

If you want to achieve something similar in Haskell, you could just have a Map from String identifiers to Haskell functions, and then pass around the Strings outside the process, as desired (a sketch follows the Python session below).

% python3 
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> def f(): print("Hello")
... 
>>> pickle.dumps(f)
b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x01f\x94\x93\x94.'
% python3
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> pickle.loads(b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x01f\x94\x93\x94.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Can't get attribute 'f' on <module '__main__' (built-in)>
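
A minimal sketch of that Map-of-named-functions idea (names hypothetical; only the String key ever touches the disk):

import Data.Map (Map)
import qualified Data.Map as Map

-- Stable String keys pointing at functions compiled into the binary.
registry :: Map String (Double -> Double)
registry = Map.fromList
    [ ("double-price", (* 2))
    , ("flat-discount", subtract 1)
    ]

-- “Deserializing” a strategy is just a lookup; unknown keys fail safely.
lookupStrategy :: String -> Maybe (Double -> Double)
lookupStrategy name = Map.lookup name registry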
3 Likes

Uh, I guess I never realized that we always need to have the classes in scope when deserializing. Thanks for educating me.

2 Likes

Also, distributed closures have limitations: you (i.e. the libraries your models use) need to provide a Typeable instance to be serializable: Typeable a => Binary (Closure a) (see Control.Distributed.Closure).

Everything’s automatically Typeable, isn’t it?