StateT performance benchmarks?

Liamzy · August 27, 2023, 5:57am

Informally, I find that accum parameter with TCO is the most performant implementation of “state”, as tested with informal naive factorial.

Hell, my benchmark even has Haskell in the region of Rust, ffs, or maybe I misconfigured my TCO benchmark (it probably has more to do with bigInt implementations, Julia actually wins here).

What are actual performance tips for accum parameter vs foldl’ vs StateT vs iterate and iterate’ vs ST / IORef?

tomjaguarpaw · August 27, 2023, 7:01am

You might like to share the code. I for one don’t know what you mean by “accum parameter with TCO”.

ReleaseCandidate · August 27, 2023, 7:11am

Tail Call Optimisation, I guess.

tomjaguarpaw · August 27, 2023, 7:16am

I guess so too, but I’m not sure what that has to do with ‘the most performant implementation of “state”’.

Liamzy · August 27, 2023, 7:39am

This is an old benchmarker:

https://paste.tomsmeding.com/vyxd6Tdd

TBH, one curiosity I have is whether mapAccumL is miswritten, i.e, should it be ditching its custom State type entirely and just using straight accumulating parameters (that’s what I mean by accum param)?

jaror · August 27, 2023, 12:31pm

GHC should be able to optimise direct tail recursion with an accumulating parameter, foldl', and StateT all to the same efficient code. ST / IORef always introduces a layer of indirection and thus memory access in the tight loop (like if you would mark variables as volatile in C) so those will be slower. I don’t know exactly how iterate and iterate' fit into this picture.

To give a concrete example, these three all produce very similar Core (although they’re all slightly different in inconsequential ways):

facAcc :: Int -> Int
facAcc = go 1 where
  go !s  1 = s
  go !s !n = go (s * n) (n - 1)

facFold :: Int -> Int
facFold n = foldl' (*) 1 [1..n]

facState :: Int -> Int
facState n = execState (go n) 1 where
  go 1 = pure ()
  go n = do modify (* n); go (n - 1)

Bodigrim · August 27, 2023, 12:39pm

Seems related: StateT vs. IORef: a benchmark

Liamzy · August 27, 2023, 1:36pm

I distinctly recall that StateT was slower in my experience, but perhaps it was simply my experience or issues with Clang backend / OS / earlier version with GHC.

https://paste.tomsmeding.com/ghkcT0Zx

On recent benchmarks on GHC 9.6.2 on ArchLinux, I’m seeing roughly what’s been described, i.e, State etc are low or zero cost compared to accumulating parameter recursion.

The only weirdness is for_ and foldl’; i.e, there seems to be some sort of cost incurred in generating the list (10-20%), but compared to my Windows experience, it’s significantly lower.

Shouldn’t stream fusion just be turning this into a loop anyways?

jaror · August 27, 2023, 2:15pm

I think 20 is still an extremely small input. The function only takes a few hundred nanoseconds. At that scale there are a lot of confounding factors. If I run your program on an input of 10000 then I get quite consistent timings of about 5ms, although there are a few of around 10ms which I’d ascribe to code layout.

I wish we had a benchmarking tool as described here: https://www.youtube.com/watch?v=r-TLSBdHe1A

Edit: Actually, I think the 5ms/10ms difference might be due to strictness. With your program as written all the facState* benchmarks take 10ms and some become 5ms if I instead use Control.Monad.Trans.State.Strict.

Liamzy · August 27, 2023, 2:35pm

Thoughts?

Pastebin of source:

https://paste.tomsmeding.com/id5tlMaA

jaror · August 27, 2023, 2:43pm

Yeah, that seems consistent with my results. There is indeed a difference. In the Core dump (-ddump-simpl) I do notice this:

Rec {
$wloop
  = \ ds eta ->
      case ds of wild {
        IS x1 ->
          case x1 of {
            __DEFAULT -> $wloop (integerSub wild fac1) (integerMul eta wild);
            1# -> (# eta #)
          };
        IP x1 -> $wloop (integerSub wild fac1) (integerMul eta wild);
        IN x1 -> $wloop (integerSub wild fac1) (integerMul eta wild)
      }
end Rec }

facStateT' = \ n -> case $wloop n fac1 of { (# ww #) -> ww }

facStateT = facStateT'

facState' = facStateT'

facState = facStateT'

So all those functions are actually compiled to exactly the same thing. In the STG dump (-ddump-stg-cg) you can see how $wloop is compiled more clearly:

Rec {
$wloop =
    \r [ds eta]
        case ds<TagProper> of wild {
          IS x1 ->
              case x1<TagProper> of {
                __DEFAULT ->
                    let { sat = \u [] integerMul eta wild;
                    } in 
                      case integerSub wild fac1 of sat { __DEFAULT -> $wloop sat sat; };
                1# -> Solo# [eta];
              };
          IP _ ->
              let { sat = \u [] integerMul eta wild;
              } in 
                case integerSub wild fac1 of sat { __DEFAULT -> $wloop sat sat; };
          IN _ ->
              let { sat = \u [] integerMul eta wild;
              } in 
                case integerSub wild fac1 of sat { __DEFAULT -> $wloop sat sat; };
        };
end Rec }

All those lets are thunk allocations. So the integer multiplications all become thunks which slows things down.

If we use Control.Monad.Trans.State.Strict instead we see another version of $wloop:

Rec {
$wloop1 =
    \r [ds eta]
        let-no-escape {
          $j =
              \r []
                  case integerMul eta ds of vx {
                  __DEFAULT ->
                  case integerSub ds fac1 of sat { __DEFAULT -> $wloop1 sat vx; };
                  };
        } in 
          case ds<TagProper> of {
            IS x1 ->
                case x1<TagProper> of {
                  __DEFAULT -> $j<TagProper>;
                  1# -> Solo# [eta];
                };
            IP _ -> $j<TagProper>;
            IN _ -> $j<TagProper>;
          };
end Rec }

This one does not allocate thunks. But this one is only used by facState' and facStateT'.

In conclusion, you need to use both modify' and the strict State/StateT monad for the best performance in this case.

@sgraf can you tell why the demand analyser is not always good enough to figure out the strictness in these cases? Oh, it must be because that Solo# is lazy. If only it could use information from the use-site where the tuple is immediately unpacked:

facStateT' = \ n -> case $wloop n fac1 of { (# ww #) -> ww }

sgraf · August 27, 2023, 3:16pm

Just from skimming the code listings, you might be thinking of return-pattern specialisation as proposed in #16335: CPR Analysis is too conservative for certain inductive cases · Issues · Glasgow Haskell Compiler / GHC · GitLab.

Demand/CPR analysis would not be able to unbox the result because there can be call sites that would not need to evaluate the field of the Solo# constructor; and WW/DmdAnal does not duplicate code.

Edit:Although as you say, when all use sites are known to unbox and use the state, a smarter DmdAnal should be able to spot this. That’s a long term thing that I’m trying to solve.

sgraf · August 27, 2023, 3:17pm

Try making the field of Solo# strict or use $! on the state for Nested CPR to fire

Liamzy · August 27, 2023, 3:32pm

Just retested; it looks like I forgot to set facStateS’ to use S.modify’ and was accidentally running it on S.modify.

From benchmarks, it should be comparable to accumulator fac, fold fac, and for fac (which incidentally pulls ahead at very large numbers, despite being relatively unperformant at small numbers, and still works decently despite running on default State monad).

There was also an error in the benchmarking code regarding for, wherein the factorial wasn’t computing the correct result.

https://paste.tomsmeding.com/7wtcxovu

Source code above for benchmarks.

https://paste.tomsmeding.com/tOLf1KNH

Final set of benchmarks.

So it does look that Control.Monad.Trans.State.Strict is approximately zero-cost, and can be used recklessly as a replacement for accumulating parameter idiom in concert with modify’. The for_ idiom actually outperforms both folds and straight accumulating parameter for large n.

I read about Strict State Monad and modify’ on Kowainik a long time ago, iirc, but it looks like they moved to a new host and the old post is gone. But strangely enough, I didn’t get these results the last time I benchmarked naive factorial across idioms. Probably user error on my part (forgot to get Control.Monad.Trans.Strict)

Space-leaking monad transformers have been a huge gripe of mine traditionally; i.e, Haskell really emphasizes its monads and unperformant monads with huge performance penalties are somewhat embarrassing.

It’s nice to know that at least StateT.Strict is safe, but that calls to mind, why isn’t StateT.Strict the default on Control.Monad.Trans.State? You can’t really hide modify to get people to use the “correct” modify’ first, but at least pointing Control.Monad.Trans.State to StateT.Strict would make life simpler and Haskell less footgunny; same applies to the mtl library version.

jaror · August 27, 2023, 8:29pm

The problem is that the Solo# (a.k.a. (# _ #)) is not in our code, It comes from the a worker wrapper transformation of the (a, s) tuple in Control.Monad.Trans.State.Strict.StateT. Maybe that package should use a specialised tuple data type that is strict in the state.

I guess using modify' works too, but I think most people would want the state to always be forced, especially if they’ve already chosen to use the .Strict version.

tomjaguarpaw · September 13, 2023, 12:45pm

Agreed, but that would potentially be a silent breaking change so is rather dangerous. We should, however, try to discourage people from using the Lazy versions and from using modify.