[ANN] A series of articles on Heftia: The Next Generation of Haskell Effects Management

Thanks

OK, sounds good.

Well, we already know it is possible with the State fragment of Bluefin, because that’s just ST, and that theory has already been worked out. I think it would be fair for you to add a comment saying that the theory of ST can likely be extended to all of Bluefin.

1 Like

Thank you for reading!

Yeah, I don’t really see much value in emphasizing this, so I’m thinking of changing it to something a bit more neutral. Do you have any good ideas?

I agree! It’s just that I happen to be more capable of explaining the former in detail, which is why I wrote extensively about it in the article. As for the latter, I’m not confident I can explain it accurately, so it naturally ended up being only a few sentences.

1 Like

That makes sense! I’ll add a description mentioning that the theory is supported by ST.

1 Like

I have revised the article. I would like to express my gratitude to everyone who offered their advice. Thank you very much!

7 Likes

my request for a compelling example of LogicT on Twitter

I’ve been working on a rule-based Sudoku solver for some time and I’ve found one of the standard reduction rules is best expressed via LogicT. while simple puzzles can be solved via direct inferences that involve no search (i.e. “if a value is known in a region, it can’t possibly occur anywhere else in that region”), the final rule, the one necessary for solving difficult puzzles, is significantly more complicated.

if you watch an experienced solver tackle a difficult puzzle, you will notice that at some point, while stumped, they will say something to the effect of “oh, this cell can’t take value X because if it does, it causes a problem elsewhere in the grid”. this is called a “forcing chain” and it’s the only thing that will allow you to eliminate key, impossible values from critical cells that are preventing all the other rules from making any progress.

of course, when implementing a computer solver, it can’t spot this as a flash of insight. instead, it must 1. locate the most constrained cell on the board and 2. try each value for that cell and search for a contradiction. moreover, the very hardest puzzles require this kind of search to be employed recursively – that is, the solver will get stuck again during the search for contradiction and must again find the most constrained cell on the board and search its possibilities for a value it cannot take. otherwise, no contradictions will be found anywhere on the grid.

this alone can be implemented via List/Logic (well, ReaderT Logic, anyway) rather than ListT/LogicT. however, a search for contradiction cannot return a value – we searched for an option that yielded something isomorphic to [] so there’s no value to get out of the List/Logic monad – yet there’s more information uncovered during this search that’s fruitful. if we happen to check every possible value a cell can take, we can look at other cells to see which values they were forced to take – any possibility for that cell that isn’t on that list is not possible on the final grid and we can remove it from consideration.
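to make the shape of that search concrete, here’s a minimal hypothetical sketch using plain Logic - the actual solver is unpublished, so Grid and propagate below are illustrative stand-ins rather than its real types:

import Control.Applicative (empty)
import Control.Monad (msum)
import Control.Monad.Logic (Logic, observeAll)
import Data.List (minimumBy)
import Data.Ord (comparing)

type Pos  = (Int, Int)
type Grid = [(Pos, [Int])]  -- each cell paired with its remaining candidates

-- Placeholder propagation: in a real solver, Nothing would signal that
-- assigning v at p leads to a contradiction.
propagate :: Pos -> Int -> Grid -> Maybe Grid
propagate p v g = Just [ (q, if q == p then [v] else vs) | (q, vs) <- g ]

-- try every candidate of the most constrained cell and keep the grids
-- that survive propagation; an empty result is a proof by contradiction.
forcedGrids :: Grid -> [Grid]
forcedGrids g = observeAll $ do
  let (p, vs) = minimumBy (comparing (length . snd)) g  -- most constrained cell
  v <- msum (map pure vs)                               -- nondeterministic choice
  maybe empty pure (propagate p v g)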

in order to extract this information, we need a log of the final grids at the end of each search for contradiction. this log must be backtrackable – the final grids yielded by searching a different cell are irrelevant. I found this most straightforward to implement via the logict-state package.
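for illustration only - logict-state packages this up properly - the backtracking behaviour can be sketched by layering StateT over Logic, so that log entries recorded in an abandoned branch are discarded when the search backtracks:

import Control.Monad.Logic (Logic, observeAll)
import Control.Monad.State (StateT, execStateT, modify)

type Grid = [((Int, Int), [Int])]  -- same illustrative representation as above

-- record a final grid reached at the end of a search for contradiction
recordFinalGrid :: Grid -> StateT [Grid] Logic ()
recordFinalGrid g = modify (g :)

-- because StateT sits above Logic, each alternative starts from the state
-- at the choice point: the log is backtrackable.
runWithLog :: StateT [Grid] Logic a -> [[Grid]]
runWithLog m = observeAll (execStateT m [])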

not sure if that’s the kind of answer you were looking for. “LogicT is useful when you’re solving a logic problem” is perhaps a bit too tautological, but logic problems frequently hide at the heart of real-world problems.

I also haven’t yet published the code online – I’m messing around with implementations to improve performance as much as I can, mostly as an exercise, but I can clean it up and post it on github if such an example is helpful for whatever you’re working on. at present I’m just using mtl as the effect system, but that’s mainly because I don’t think any of the fast effect systems handle NonDet and the ones that do are Free monad-based.

4 Likes

Yes, I understand that LogicT is useful for search. What I don’t understand is why such searches can’t be equally-well implemented without it.

1 Like

A full Sudoku solver in Prolog is about 10 lines of code (see A.9.7 SWI-Prolog -- library(clpfd): CLP(FD): Constraint Logic Programming over Finite Domains). I doubt any ad-hoc implementation can beat that. This is just to say that logic programming is very good at solving a certain class of problems.

That being said, if you allow me a bad analogy, currently LogicT in Haskell is like a monad library in an OOP language. You can use it, it may work to a certain extent but you’ll feel the shortcomings of the language and the ecosystem very quickly. So in this sense I agree with you: non-determinism in Haskell is an “exotic” effect. If an effect library doesn’t support it I wouldn’t consider it a huge loss.

In the future it would be nice to have a rich ecosystem of interoperable libraries for logic programming in Haskell. I think Haskell is flexible enough to be “bent” to fit other paradigms. But we are not there today, and without that ecosystem the usefulness of non-determinism is greatly reduced.

1 Like

that’s a fundamentally different kind of solver. the same type of solver in Haskell is about 50-100 lines of code and there are many examples on the Haskell wiki. a rule-based solver doesn’t merely provide the answer - it supplies the proof that the answer is correct and unique. it employs the same process a human solver would along the way.

it’s the difference between a theorem prover and a calculator - I have no doubt that the latter is easier to write in Prolog but the former tends to be written in general purpose languages because the important part isn’t the expression of the constraints but rather explaining each step to a human being. the reference implementations I’m aware of are written in Python, Java, and Rust.

the distinction matters because modern Sudoku puzzles extend the set of constraints with new kinds of rules (e.g. Killer Sudoku which adds cages where the sum of the cells within the cage must total a specific value). people use these solvers while setting a puzzle to ensure that they haven’t added a constraint that makes it impossible to solve the puzzle - and also that the final solution is unique. they work with such a solver interactively, adding new constraints one at a time, watching how a new constraint removes some possibilities from each cell or forces the values of others. it’s very important when doing this to know /why/ each step is true - some of the steps are incredibly complex for a human and you may want to avoid such steps OR you may want to ensure such steps to guarantee the difficulty of your puzzle.

as such, see my explanation in my previous reply about why LogicT is unavoidable here - you literally can’t solve many puzzles without a proof by contradiction under these constraints. the puzzles are intentionally set up to require it.

if all I wanted were the final answer, this is a problem domain so solved that there are twenty different examples on the Haskell wiki. what makes this a fun exercise employing many of Haskell’s strengths is everything else. if you’d like to see such a solver in action, check out f-puzzles.

@tomjaguarpaw

Yes, I understand that LogicT is useful for search. What I don’t understand is why such searches can’t be equally-well implemented without it.

you can but it’s going to be a manual implementation of backtracking to unwind the captured state of the search when the search fails. you can probably do it by walking a tree explicitly.
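for instance, a hypothetical sketch of the hand-rolled version - materialize the search tree and walk it with a recursive function:

-- an explicit search tree; abandoning a subtree is the state-unwinding
-- that LogicT would otherwise perform for us.
data Tree a = Leaf a | Node [Tree a]

-- depth-first search for the first leaf satisfying a predicate
dfs :: (a -> Bool) -> Tree a -> Maybe a
dfs p (Leaf x)  = if p x then Just x else Nothing
dfs p (Node ts) = foldr (\t rest -> maybe rest Just (dfs p t)) Nothing ts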

4 Likes

Yes, I referenced that particular solver as a simple example of how logic programming (and LogicT in Haskell) can be handy in certain cases. My point was that if we had a CLPZ library that worked with LogicT we would probably be able to write the same solver in 15 lines of Haskell.

I certainly didn’t mean it as an alternative to your solver! Maybe I should have chosen a different example.

Currently I don’t see why that can’t be done equally well with for_ loops and exceptions. For example, what in the list monad is

do
  x' <- x
  y' <- y
  whatever

becomes (in Bluefin)

do
  for_ x $ \x' -> do
    withJump $ \jx -> do
      for_ y $ \y' -> do
        withJump $ \jy -> do
          whatever

(Actually, this allows you to jump back to any previous branch, which I think is more powerful than LogicT)

Is the for_ approach somehow doomed in a way I don’t yet see?

Another thing from the article:

Performance always has room for improvement, whereas correctness improvements are often impossible due to foundational interface compatibility. This asymmetry is crucial and will be discussed later.

The first sentence is not true. The easiest example is the polysemy memory leak mentioned above: suppose that you later discover that heftia exhibits the same memory issue and that it is inherent to free-monad-based implementations. Then you either rewrite the internals with IO + delimited continuations (which will most likely require some sort of API breakage) or “it’s over”.

Speaking of memory: benchmarks in effectful also show bytes allocated/copied and peak memory for each benchmark (thanks to tasty-bench) and I just remembered that I saw weird things there in the past. So I went back, applied this patch:

diff --git a/effectful/bench/Main.hs b/effectful/bench/Main.hs
index 2d998bf..43b0994 100644
--- a/effectful/bench/Main.hs
+++ b/effectful/bench/Main.hs
@@ -19,9 +19,9 @@ main :: IO ()
 main = defaultMain
   [ concurrencyBenchmark
   , unliftBenchmark
-  , bgroup "countdown" $ map countdown [1000, 2000, 3000]
-  , bgroup "countdown (extra)" $ map countdownExtra [1000, 2000, 3000]
-  , bgroup "filesize" $ map filesize  [1000, 2000, 3000]
+  , bgroup "countdown" $ map countdown [1000000]
+--  , bgroup "countdown (extra)" $ map countdownExtra [1000000]
+--  , bgroup "filesize" $ map filesize  [1000000]
   ]
 
 countdownExtra :: Integer -> Benchmark
@@ -56,15 +56,15 @@ countdown :: Integer -> Benchmark
 countdown n = bgroup (show n)
   [ bench "reference (pure)" $ nf countdownRef n
   , bench "reference (ST)"   $ nf countdownST n
-  , bgroup "effectful (local/static)"
+{-  , bgroup "effectful (local/static)"
     [ bench "shallow" $ nf countdownEffectfulLocal n
     , bench "deep"    $ nf countdownEffectfulLocalDeep n
-    ]
+    ]-}
   , bgroup "effectful (local/dynamic)"
     [ bench "shallow" $ nf countdownEffectfulDynLocal n
     , bench "deep"    $ nf countdownEffectfulDynLocalDeep n
     ]
-  , bgroup "effectful (local/dynamic/labeled/send)"
+{-  , bgroup "effectful (local/dynamic/labeled/send)"
     [ bench "shallow" $ nf countdownEffectfulLabeledDynSendLocal n
     , bench "deep"    $ nf countdownEffectfulLabeledDynSendLocalDeep n
     ]
@@ -79,7 +79,7 @@ countdown n = bgroup (show n)
   , bgroup "effectful (shared/dynamic/labeled/send)"
     [ bench "shallow" $ nf countdownEffectfulLabeledDynSendShared n
     , bench "deep"    $ nf countdownEffectfulLabeledDynSendSharedDeep n
-    ]
+    ]-}
 #ifdef VERSION_cleff
   , bgroup "cleff (local)"
     [ bench "shallow" $ nf countdownCleffLocal n

to disable unnecessary benchmarks, then ran the benchmarks for effectful, mtl, freer-simple, polysemy and fused-effects on GHC 9.8.4 with cabal run bench -- --stdev Infinity -p LIBRARY (--stdev Infinity forces a single run of each benchmark for fairness). Here is sample data from a run:

$ cabal run bench -- --stdev Infinity -p "mtl"
Configuration is affected by the following files:
- cabal.project
- cabal.project.local
Created semaphore called cabal_semaphore_0 with 16 slots.
All
  countdown
    1000000
      mtl
        shallow: OK
          59.1 ms,          467 MB allocated, 124 KB copied, 6.0 MB peak memory
        deep:    OK
          386  ms,          3.0 GB allocated, 722 KB copied, 6.0 MB peak memory

All 2 tests passed (0.45s)

And here’s the aggregated data for all libraries as a table (courtesy of chatgpt):

Effect System        Run Type  Time     Allocation  Bytes Copied  Peak Memory
effectful (static)   shallow   4.84 ms  15 MB       23 KB         6.0 MB
effectful (static)   deep      4.86 ms  17 MB       69 KB         6.0 MB
effectful (dynamic)  shallow   20.6 ms  130 MB      73 KB         6.0 MB
effectful (dynamic)  deep      20.9 ms  131 MB      84 KB         6.0 MB
mtl                  shallow   60.2 ms  467 MB      93 KB         6.0 MB
mtl                  deep      389 ms   3.0 GB      697 KB        6.0 MB
freer-simple         shallow   160 ms   413 MB      319 MB        187 MB
freer-simple         deep      227 ms   1.2 GB      476 MB        653 MB
polysemy             shallow   191 ms   1.5 GB      216 MB        122 MB
polysemy             deep      283 ms   3.7 GB      318 MB        437 MB
fused-effects        shallow   329 ms   1.0 GB      601 MB        414 MB
fused-effects        deep      898 ms   5.5 GB      713 MB        873 MB

So, the runtime is not the only metric to consider. Memory usage and GC pressure are also important and we see that apart from effectful and mtl (credit where credit’s due, allocation sucks, but at least it doesn’t leak) it’s abysmal.

In particular it looks like memory usage for these libraries increases linearly with the amount of code that’s executed, so I suspect that GHC, instead of treating the body of countdown as a loop, fully unrolls it (Alexis seems to have encountered this before). A disaster.

I don’t have benchmarks for heftia, but I’d be very surprised if it didn’t exhibit the behaviour similar to freer-simple in this regard.

4 Likes

I doubt it’s doomed – it’s what the desugared and simplified core looks like. and that makes sense. machine code has no notion of LogicT. let me see what it looks like if I rewrite my LogicT code in Bluefin using this approach. I’m a little worried about readability/clarity. I’d want to carefully encapsulate what amounts to using goto to encode backtracking.

That would be very interesting to see, thanks!

It’s a jump to a well-defined location with well-understood semantics in all contexts. It is already carefully encapsulated by Bluefin, and the possibility of such jumps is tracked in the type system. It’s actually more explicit than LogicT itself. Still, it could stand to be abstracted further.

In the same vein as an LLVM pass, I think that it is in principle possible to implement the internal code transformation via a GHC plugin (or by intervening directly in GHC). Without changing any interfaces or observable behaviour, we could, for example, detect when the interpreter in use is tail-resumptive, that the target monad of runEff is IO, and that a number of other conditions hold, and then rewrite the Freer implementation into a static dispatch over the raw IO monad instead. Of course, in that scenario we would need to prove that the semantics before and after the optimization are completely identical.
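To illustrate the tail-resumptive condition (a hypothetical sketch; these types are not heftia’s actual API): a handler is tail-resumptive when it invokes the continuation exactly once, in tail position, so no control state needs to be captured and the operation can compile down to a plain function call.

{-# LANGUAGE GADTs, RankNTypes #-}

-- An illustrative effect with a single operation.
data Ask r a where
  AskOp :: Ask r r

-- General handler shape: the operation plus its continuation.
type Handler f m = forall a b. f a -> (a -> m b) -> m b

-- Tail-resumptive: k is invoked exactly once, in tail position.
askHandler :: r -> Handler (Ask r) IO
askHandler r AskOp k = k r

-- A handler that drops k (exceptions) or calls it twice (nondeterminism)
-- is not tail-resumptive and forces the Freer machinery to reify the
-- continuation - exactly the overhead the rewrite aims to avoid.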

I agree.

I’d expect it to show similar behaviour. Thank you for letting me know about this benchmark.


P.S. Regarding heftia’s performance, I’m going to stop promoting it and simply report the results as they are.

1 Like

That’s only really an option when doing whole program analysis, which does not scale.
This kind of specialisation is incompatible with compositional program analysis and hence separate compilation (which is what GHC usually does).

3 Likes

For each interpreter, I believe we can determine independently whether it is tail-resumptive, whether it supports static dispatch, which monad it dispatches to, and so on. Therefore, by attaching metadata about the interpreters to the build artifacts of each module, supporting separate compilation is by no means impossible: we can check whether the conditions are met by inspecting the metadata of the precompiled modules we depend on.

Please note that this is a heuristic. Naturally, the optimization won’t be effective in every case, but it’s expected to contribute to speed-ups in typical scenarios.

Instead of extracting optimization details via program analysis, we might also consider having it specified manually via a custom pragma, for example:

runFoo :: (State Int :> es, Emb IO :> es) => Eff (Foo : es) a -> Eff es a
runFoo = ...
{-# ANN runFoo (TailResumptive, StaticDispatch ["Emb IO", "State Int"]) #-}

In that case, if the requested optimization cannot be applied, we could emit a warning or error, thereby improving the predictability of performance.

In any case, these ideas only demonstrate a theoretical possibility; we won’t know how effective they are in practice until we actually try them. What I really want to say is that it’s too soon to flat-out assert “The first sentence is not true.”

Specialisation is fundamentally incompatible with separate compilation.
In order to specialise an effectful program for a particular effect interpretation (defined in a different module or even library), you need to have access to the intermediate representation of your whole program and then make a copy to instantiate it for the particular effect interpretation. This means that compiling the original, polymorphic IR of your program was completely in vain. Very uncompositional and prone to regress if GHC decides not to expose the unfolding of a big function.

And we haven’t yet begun talking about reliable specialisation of higher-rank polymorphism…

IMO a better solution is to restrict the program by saying “this program can only ever be interpreted using a tail-resumptive effect handler”. Then you can produce efficient code without needing to specialise afterwards, but it also means you cannot interpret your program using non-tail-resumptive handlers.
Koka does this: It differentiates between general control effects (ctl) and tail-resumptive effects (fun), right in the effect declaration, restricting the semantics of possible handlers:

effect raise
  ctl raise( msg : string ) : a

vs.

effect ask<a>
  fun ask() : a

5 Likes

I see, thank you for the detailed explanation.

I have a question stemming from my own misunderstanding: if specialization is fundamentally incompatible with separate compilation and therefore untenable, then by the same reasoning effectful’s static dispatch shouldn’t be considered high-performance either. Yet in practice effectful’s static dispatch is indeed fast. The optimization pass I’m imagining would simply, to use a metaphor, swap out that part of the code from heftia to effectful once it’s confirmed that a rewrite is possible. All the information needed should be present in the metadata, so I don’t understand why one would need to access the IR of other modules. Is the issue the overhead of connecting the interface of Freer with the interface of ReaderT IO?

The purpose of this optimization pass is to avoid the inefficiency of Freer, which stems from the need to convert effects into data and preserve the continuation from that point. To achieve this, the idea is to transform the code into one that uses (ReaderT) IO. This is not a low-level optimization that accesses the IR; the specialisation part relies on existing mechanisms provided by GHC.

It may be that my explanation was lacking. Sorry about that.


Edit: Ah, no, I see. This might actually be doing exactly the same thing as Koka, as you pointed out. And even though I said it uses metadata, I suspect that the dependency direction required for tail‐resumptive optimization is the opposite, meaning it can only be enabled by imposing restrictions via a pragma. Is that what’s going on?

Specialisation copies the original definition of the program and then rewrites it using knowledge from the call context the specialisation applies in. You need to copy the code, hence you need the IR. If you do not copy the code, it’s not specialisation (of some code for the context in which it is used).

An optimisation is always two parts: A static analysis and a program transformation, where the former determines where it is beneficial to apply the latter.
You only describe the program analysis, but not how to exploit your metadata to make the program faster. It won’t magically become faster just by collecting metadata; you have to act on it as well.

For example, your runFoo: what do you expect to happen for runFoo prog where prog is defined in a different module and you cannot access its definition? The only way to make prog somehow faster is by exploiting dynamic dispatch, but it’s exactly that dynamic dispatch that’s the overhead.

It’s the same reason why mtl does not scale, only a bit less pronounced: everything is fine until you fail to specialise your program for the particular transformer stack. And suffice it to say: there will always come a point in the life of a code base where this happens.

I’m not too familiar with effectful’s notion of static dispatch, but it sounds to me like it’s essentially what Koka does: The program states how the State effect is to be interpreted; essentially fixing the effect handler when declaring the program. The specialisation problem is straightforward because the handler is known when the program is compiled, in contrast to when the effect handler is supplied only at the runEff prog site (which may be in a different module/library).
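For reference, a minimal countdown using effectful’s statically dispatched State (via its documented Effectful.State.Static.Local API; the program body itself is just an illustration):

import Effectful (Eff, runPureEff, (:>))
import Effectful.State.Static.Local (State, evalState, get, put)

-- The State implementation is fixed at the effect's definition site,
-- so no handler lookup happens at each get/put.
countdown :: Integer -> Integer
countdown n = runPureEff . evalState n $ go
  where
    go :: State Integer :> es => Eff es Integer
    go = do
      s <- get
      if s <= 0 then pure s else put (s - 1) >> go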

2 Likes

I see, I understand now. You’re right, in this case, dynamic dispatch (though I’m not sure if this usage aligns with how it’s used in effectful) is indeed unavoidable.

I hadn’t envisioned performing that kind of processing as part of the optimization.

The only purpose of gathering metadata was to check whether the relevant code could be transformed into code based on the ReaderT IO mechanism. But this actually inverts the dependency direction, so collecting the metadata may be pointless. So, in other words, we perhaps need Koka’s approach; at least Koka has made it work.

I’m beginning to see, and yes, the clear condition under which we can replace the relevant part of the program with ReaderT IO instead of Freer is when it doesn’t need any continuation information.


Let me explain the optimization I have in mind. For example, suppose we have a program like this:

prog :: (State Int :> es, Emb IO :> es) => Eff es ()
prog = liftIO . print =<< get

Here Eff is defined as a Freer monad. In the case where a non-tail-resumptive algebraic-effects handler like runState might be used, we need the continuation’s context, so no optimization is possible. On the other hand, if we know statically that only tail-resumptive handlers will ever be used, we can rewrite the code as follows:

prog :: (State Int :> es, Emb IO :> es) => Reader.Eff es ()
prog = liftIO . print =<< TR.get

Here Reader.Eff is defined as ReaderT IO, and TR.get fetches and invokes the get handler that has been embedded in the ReaderT environment.

Moreover, if we know that State is handled solely by an IORef-based interpreter, we can optimize even further:

prog :: (State Int :> es, Emb IO :> es) => Reader.Eff es ()
prog = liftIO . print =<< Static.get

In this version, Static.get directly reads from the IORef stored in the ReaderT environment.
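As a hypothetical sketch of what these rewritten primitives could look like (names and types are illustrative, not heftia’s actual API):

import Control.Monad.Reader (ReaderT, asks)
import Control.Monad.IO.Class (liftIO)
import Data.IORef (IORef, readIORef)

-- Tail-resumptive variant: the environment carries the handlers as
-- plain functions, and TR.get just looks one up and runs it.
newtype Handlers = Handlers { hGet :: IO Int }

trGet :: ReaderT Handlers IO Int
trGet = liftIO =<< asks hGet

-- Fully static variant: the environment carries the IORef itself,
-- and Static.get reads it directly.
newtype Env = Env { envState :: IORef Int }

staticGet :: ReaderT Env IO Int
staticGet = liftIO . readIORef =<< asks envState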

And now for the tricky part: whether a non-tail-resumptive handler will be used cannot be determined without looking at the module on the runProg side (the handling site), so there is indeed a separate-compilation issue here. However, in the end, despite the impact on binary size, I think we can simply generate all three definitions and let the handling site’s code generator choose which one to use. In this case, it seems that something like metadata isn’t necessary. (Although for the third static dispatch version, if the number of effects increases, the number of combinations will explode, so some kind of countermeasure will be necessary.)

Roughly speaking, this is the code transformation I’ve been considering.


In any case, this approach may have various issues as well, but I don’t think that there is absolutely no room for improvement in principle.