Scrap your iteration combinators

I think the iteration combinators are part of the Haskell family vocabulary (PureScript, Elm, etc. re-use them), and I prefer thinking in terms of names rather than explicit loops. I don’t mind efforts to abstract a few names into a broader concept, but I don’t see that as inherently an improvement in all cases.

The lens package also replaces a bunch of standard functions with various types of traversals and prisms and whatnot. So you’re not the first or the last to propose abstracting all the combinators.
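
For instance, here is a minimal sketch with lens (assuming its usual toListOf, folded and _Just optics), recovering catMaybes as an optics pipeline:

import Control.Lens (folded, toListOf, _Just)

-- catMaybes as a fold composed with a prism:
-- catMaybesOptics [Just 1, Nothing, Just 2] == [1, 2]
catMaybesOptics :: [Maybe a] -> [a]
catMaybesOptics = toListOf (folded . _Just)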

4 Likes

Yes, that’s right. I just mean no typeclass constraints rather than no type variables.

1 Like

Thank you! I was using Foldable.toList rather than the Stream one.

Here's test.cabal
cabal-version: 3.12
name: test
version: 0.1.0

executable program
    hs-source-dirs: src
    main-is: Main.hs
    default-language: GHC2024
    ghc-options: -O2 -threaded -rtsopts "-with-rtsopts=-N"
    build-depends:
        base
      , criterion
      , mtl
      , streaming
      , transformers
Here's Main.hs
import Data.Foldable (for_)
import Data.Functor.Identity (Identity, runIdentity)
import qualified Data.Maybe as M

import qualified Streaming.Prelude as SP
import qualified Criterion.Main as C

-- Lazily unfold a pure Stream into a Haskell list.
toList :: SP.Stream (SP.Of a) Identity r -> [a]
toList s = case runIdentity (SP.uncons s) of
  Nothing -> []
  Just (h, rest) -> h : toList rest

-- mapMaybe via streaming: yield only the elements for which f succeeds.
mapMaybe :: (a -> Maybe b) -> [a] -> [b]
mapMaybe f as =
  toList $
    for_ as $ \a ->
      for_ (f a) $ \b ->
        SP.yield b

main :: IO ()
main = t2

t1 :: IO ()
t1 = C.defaultMain [f1]
  where
    f1 = C.bgroup "f1"
       [ C.bench "1" $ C.whnf (sum . map (* 2) . mapMaybe Just) [1..10]
       , C.bench "2" $ C.whnf (sum . map (* 2) . mapMaybe Just) [1..100]
       , C.bench "3" $ C.whnf (sum . map (* 2) . mapMaybe Just) [1..1000]
       , C.bench "4" $ C.whnf (sum . map (* 2) . mapMaybe Just) [1..1000000]
       ]

t2 :: IO ()
t2 = C.defaultMain [g1]
  where
    g1 = C.bgroup "g1"
       [ C.bench "1" $ C.whnf (sum . map (* 2) . M.mapMaybe Just) [1..10]
       , C.bench "2" $ C.whnf (sum . map (* 2) . M.mapMaybe Just) [1..100]
       , C.bench "3" $ C.whnf (sum . map (* 2) . M.mapMaybe Just) [1..1000]
       , C.bench "4" $ C.whnf (sum . map (* 2) . M.mapMaybe Just) [1..1000000]
       ]

I separated t1 from t2 so you can swap which one main runs and recompile, to avoid sharing any partial results whatsoever, just in case.

Here are the results
"Scrap your iteration combinators" mapMaybe

benchmarking f1/1
time                 1.032 μs   (1.031 μs .. 1.032 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.031 μs   (1.029 μs .. 1.034 μs)
std dev              8.137 ns   (5.953 ns .. 10.16 ns)

benchmarking f1/2
time                 10.40 μs   (10.30 μs .. 10.53 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 10.29 μs   (10.26 μs .. 10.38 μs)
std dev              147.4 ns   (61.71 ns .. 303.4 ns)
variance introduced by outliers: 11% (moderately inflated)

benchmarking f1/3
time                 103.2 μs   (103.0 μs .. 103.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 104.2 μs   (103.6 μs .. 105.2 μs)
std dev              2.379 μs   (1.583 μs .. 3.272 μs)
variance introduced by outliers: 18% (moderately inflated)

benchmarking f1/4
time                 104.8 ms   (104.1 ms .. 105.8 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 102.8 ms   (102.0 ms .. 103.6 ms)
std dev              1.233 ms   (1.060 ms .. 1.462 ms)

  79,231,541,536 bytes allocated in the heap
     167,793,032 bytes copied during GC
      40,189,200 bytes maximum residency (14 sample(s))
       6,939,376 bytes maximum slop
             109 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     19715 colls, 19715 par    0.938s   0.753s     0.0000s    0.0010s
  Gen  1        14 colls,    13 par    0.062s   0.121s     0.0086s    0.0332s

  Parallel GC work balance: 3.62% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 8 (8 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   21.188s  ( 20.604s elapsed)
  GC      time    1.000s  (  0.874s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   22.188s  ( 21.479s elapsed)

  Alloc rate    3,739,541,783 bytes per MUT second

  Productivity  95.5% of total user, 95.9% of total elapsed
Naive mapMaybe with fusion rules

benchmarking g1/1
time                 162.4 ns   (162.0 ns .. 162.8 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 163.4 ns   (162.8 ns .. 164.1 ns)
std dev              2.005 ns   (1.691 ns .. 2.435 ns)
variance introduced by outliers: 12% (moderately inflated)

benchmarking g1/2
time                 1.542 μs   (1.537 μs .. 1.548 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.548 μs   (1.542 μs .. 1.555 μs)
std dev              20.73 ns   (16.72 ns .. 25.85 ns)
variance introduced by outliers: 12% (moderately inflated)

benchmarking g1/3
time                 15.37 μs   (15.33 μs .. 15.43 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 15.35 μs   (15.32 μs .. 15.40 μs)
std dev              118.1 ns   (93.06 ns .. 149.3 ns)

benchmarking g1/4
time                 16.61 ms   (16.52 ms .. 16.69 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.58 ms   (16.54 ms .. 16.65 ms)
std dev              115.7 μs   (72.15 μs .. 198.3 μs)

  40,618,353,464 bytes allocated in the heap
     140,681,712 bytes copied during GC
      40,188,896 bytes maximum residency (12 sample(s))
       6,960,160 bytes maximum slop
             107 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10681 colls, 10681 par    0.469s   0.315s     0.0000s    0.0018s
  Gen  1        12 colls,    11 par    0.078s   0.106s     0.0089s    0.0352s

  Parallel GC work balance: 6.06% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 8 (8 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   21.188s  ( 20.537s elapsed)
  GC      time    0.547s  (  0.422s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   21.734s  ( 20.959s elapsed)

  Alloc rate    1,917,090,428 bytes per MUT second

  Productivity  97.5% of total user, 98.0% of total elapsed

6x run time.
2x heap.
2x GC time.

You should do the naive thing and reinvest the gains in running Bluefin logic.

Bonus
-- A naive list mapMaybe, with no fusion rules.
mapMaybeSimp :: (a -> Maybe b) -> [a] -> [b]
mapMaybeSimp f = g . fmap f
  where
    g xs = case xs of
      []           -> []
      Nothing : ys -> g ys
      Just y  : ys -> y : g ys

t3 :: IO ()
t3 = C.defaultMain [h1]
  where
    h1 = C.bgroup "h1"
       [ C.bench "1" $ C.whnf (sum . map (* 2) . mapMaybeSimp Just) [1..10]
       , C.bench "2" $ C.whnf (sum . map (* 2) . mapMaybeSimp Just) [1..100]
       , C.bench "3" $ C.whnf (sum . map (* 2) . mapMaybeSimp Just) [1..1000]
       , C.bench "4" $ C.whnf (sum . map (* 2) . mapMaybeSimp Just) [1..1000000]
       ]

benchmarking h1/1
time                 400.4 ns   (398.2 ns .. 402.9 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 401.2 ns   (399.2 ns .. 403.5 ns)
std dev              7.039 ns   (5.563 ns .. 9.663 ns)
variance introduced by outliers: 20% (moderately inflated)

benchmarking h1/2
time                 3.890 μs   (3.841 μs .. 3.941 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 3.871 μs   (3.850 μs .. 3.893 μs)
std dev              75.65 ns   (63.86 ns .. 95.12 ns)
variance introduced by outliers: 20% (moderately inflated)

benchmarking h1/3
time                 42.17 μs   (41.02 μs .. 44.16 μs)
                     0.989 R²   (0.978 R² .. 0.998 R²)
mean                 43.35 μs   (42.09 μs .. 45.72 μs)
std dev              5.577 μs   (3.236 μs .. 8.143 μs)
variance introduced by outliers: 90% (severely inflated)

benchmarking h1/4
time                 40.56 ms   (40.37 ms .. 40.88 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 40.62 ms   (40.52 ms .. 40.84 ms)
std dev              315.4 μs   (207.3 μs .. 441.7 μs)

  95,201,110,136 bytes allocated in the heap
     174,479,240 bytes copied during GC
      40,190,568 bytes maximum residency (13 sample(s))
       6,938,008 bytes maximum slop
             109 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     23801 colls, 23801 par    1.078s   0.812s     0.0000s    0.0094s
  Gen  1        13 colls,    12 par    0.062s   0.129s     0.0099s    0.0386s

  Parallel GC work balance: 3.58% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 8 (8 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time   20.375s  ( 20.364s elapsed)
  GC      time    1.141s  (  0.941s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   21.516s  ( 21.306s elapsed)

  Alloc rate    4,672,447,123 bytes per MUT second

  Productivity  94.7% of total user, 95.6% of total elapsed
1 Like

Thanks for the benchmark! I’m very surprised that the streaming version of mapMaybe would be slower than the list version. They ought to optimize to the same code. Perhaps streaming is missing some fold/build rewrite optimization rules.
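
For reference, base’s Data.Maybe does ship fold/build rules for the list mapMaybe. Roughly the following (quoted from memory, so details like phase numbers may differ):

import GHC.Exts (build)

-- mapMaybe is rewritten into foldr/build form so it can fuse with
-- neighbouring good producers/consumers, and rewritten back if no
-- fusion happens.
mapMaybeFB :: (b -> r -> r) -> (a -> Maybe b) -> a -> r -> r
mapMaybeFB cons f x next = case f x of
  Nothing -> next
  Just r  -> cons r next

{-# RULES
"mapMaybe"     [~1] forall f xs. mapMaybe f xs
                    = build (\c n -> foldr (mapMaybeFB c f) n xs)
"mapMaybeList" [1]  forall f. foldr (mapMaybeFB (:) f) [] = mapMaybe f
  #-}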

Looking at the generated Core, this is unfortunately very much not the case. I’m honestly a little surprised that (*>) is seemingly not inlined. Might be the recursion in its definition?

Core for mapMaybe using Streaming (slightly reordered for clarity)
mapMaybe
  = \ @a @b f as ->
      letrec {
        go1
          = \ ds ->
              case ds of {
                [] -> mapMaybe1;
                : y ys ->
                  case f y of {
                    Nothing ->
                      $fApplicativeStream_$c*>
                        $fFunctorOf $fMonadIdentity mapMaybe1 (go1 ys);
                    Just x ->
                      case x of conrep { __DEFAULT ->
                      $fApplicativeStream_$c*>
                        $fFunctorOf
                        $fMonadIdentity
                        ($fApplicativeStream_$c*>
                           $fFunctorOf $fMonadIdentity (Step (:> conrep mapMaybe1)) mapMaybe1)
                        (go1 ys)
                      }
                  }
              }; } in
      toList (go1 as)

mapMaybe1 = \ @b -> Return ()

Rec {
toList
  = \ @a @r s ->
      case (poly_loop s) `cast` <Co:9> :: ... of {
        Nothing -> [];
        Just ds -> case ds of { (h, rest) -> : h (toList rest) }
      }
end Rec }

Rec {
poly_loop
  = \ @r @a stream ->
      case stream of {
        Step ds ->
          case ds of { :> a1 rest ->
          (Just (a1, rest)) `cast` <Co:10> :: ...
          };
        Effect m1 -> poly_loop (m1 `cast` <Co:6> :: ...);
        Return ds -> Nothing `cast` <Co:10> :: ...
      }
end Rec }

By contrast, the list version pretty much optimizes to itself, since this is about as efficient as any mapMaybe operation can be (without anything extra fancy like reuse analysis). E.g. mapMaybe shouldn’t even allocate more than the equivalent C code.

Core for Data.Maybe's mapMaybe
Rec {
mapMaybe
  = \ @a @b ds ds1 ->
      case ds1 of {
        [] -> [];
        : x xs ->
          case ds x of {
            Nothing -> mapMaybe ds xs;
            Just r -> : r (mapMaybe ds xs)
          }
      }
end Rec }

Fusion rules do make a difference, but even without them, the list implementation is (unsurprisingly) still ~6x faster than the streaming one.

4 Likes

This is interesting! Thanks for looking. I’m not surprised that *> doesn’t optimize too well, but I am surprised that *> specialized to Identity doesn’t, since it is basically list cons. Once you have that, mapMaybe Just really ought to optimize to the identity function.
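
For context, streaming’s Stream type, as defined in Streaming.Internal (modulo strictness annotations), makes the “basically list cons” reading concrete:

data Stream f m r
  = Step (f (Stream f m r))    -- with f ~ Of a, this is "cons"
  | Effect (m (Stream f m r))  -- an effect yielding the rest of the stream
  | Return r                   -- "nil", carrying the final result

data Of a b = !a :> b

-- So Stream (Of a) Identity r is a list of a (ending in an r), except that
-- Effect/Identity wrappers can appear between the conses, and s1 *> s2 is
-- essentially an append that also has to walk past those wrappers.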

Some have already brought up the rule of least power and how polymorphism can make things harder to read, but I want to add that having names is useful! If I see a name that I do not recognize, I can look it up, see what it does, and understand its role as a building block in the larger program. With nested for-loops, there is no visible separation of such blocks in the program. Of course it is still possible to understand the code and perhaps identify the logical pieces, but it requires more effort.

When we use for_, we need an Applicative f, but we throw away the b in each f b that the body produces. This means that we never needed an Applicative, only a Monoid. In other words, it’s a fold (see the sketch after the note below). Consider foldlM, which takes s -> a -> m s; when we involve StateT the body becomes a -> StateT s m b, or equivalently a -> s -> m (b, s), where the b is entirely unnecessary. So StateT is overpowered here.
Note: This doesn’t mean that one should never use for_, because sometimes the Applicative is not avoidable. A simple example is traverse_ print xs. In other cases, I will choose to use a fold instead of for_.
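
To make the “it’s a fold” claim concrete, here is a minimal sketch using Ap from Data.Monoid (available since base 4.12):

import Data.Functor (void)
import Data.Monoid (Ap (..))

-- for_ with its results discarded is a foldMap into f (), which is a
-- Monoid whenever f is an Applicative (witnessed by Ap).
forDiscarding :: (Foldable t, Applicative f) => t a -> (a -> f b) -> f ()
forDiscarding xs f = getAp (foldMap (Ap . void . f) xs)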

Another example of using overpowered constructs is the loopM example. I’ve never used this combinator, but let’s say I need this behavior so I decide to write it inline. It would look something like

fooLoop :: Bar -> M Quux
fooLoop x = do
  r <- ... do something with x ...
  case r of
    Left x -> fooLoop x
    Right y -> pure y

Now compare it with

fooLoop' :: Bar -> M Quux
fooLoop' s0 =
  runEarlyReturnT $
    flip evalStateT s0 $
      forever $ do
        s <- get
        fs <- (lift . lift) (... do something with s ...)
        s' <- lift (except fs)
        put s'

To understand this, one needs to first understand 1. how ExceptT works, 2. how StateT works, 3. how forever works. These are powerful constructs but none of them are necessary here. I find it far easier to both write and read the version that does not use them.
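
For reference, loopM (from the extra package, if I remember its home correctly) is just a name for that first recursive pattern:

loopM :: Monad m => (a -> m (Either a b)) -> a -> m b
loopM act x = do
  res <- act x
  case res of
    Left next  -> loopM act next  -- keep looping with the new seed
    Right done -> pure done       -- exit with the final result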

2 Likes

Correct! I explain this in detail in “foldl traverses with State, foldr traverses with anything”.

I suppose so, but barely. It’s “overpowered” in a way that >> pure () restores to be exactly-correctly-powered. For Applicative f, f () is a Monoid (witnessed by Ap), and for Monoid m, Writer m () is isomorphic to it, so there’s really not much daylight between these concepts.
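
Concretely, a small sketch of the Monoid direction (Writer from mtl; toMonoid and fromMonoid are hypothetical names):

import Control.Monad.Writer (Writer, execWriter, tell)

-- For a Monoid w, Writer w () is isomorphic to w itself:
toMonoid :: Monoid w => Writer w () -> w
toMonoid = execWriter

fromMonoid :: Monoid w => w -> Writer w ()
fromMonoid = tell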

I prefer to use the Applicative form because writing loop bodies using Monoid is actually quite a pain.

To understand this, one needs to first understand 1. how ExceptT works, 2. how StateT works, 3. how forever works

Yes, three simple concepts that can be understood in isolation and composed freely. Your proposed alternative is to understand not only recursion, which already seems too challenging for many programmers, but tail recursion, to guarantee we don’t use more space than intended!

A point I’ve seen made is that saying “for_ generalises other iteration combinators” misses their point.

Goto generalises for, while, and subroutines, but nobody uses goto. Because the point is not to use the thing with the most power, but to use the thing with the least power that’s still up to the task. That way we can more easily grasp the structure of what’s going on.

It’s similar to the way we use static types to restrict the programs that the compiler will accept, in order to gain greater understanding of what the program actually does.

I agree that it’s annoying to have to memorise a menagerie of different combinators, but it’s not a tradeoff we make for no reason.

8 Likes

Are you suggesting that recursion is tougher to understand and use than monad transformers?
I think we’ll just have to disagree on that :grin:

2 Likes

To use, definitely! I have several non-Haskellers (perhaps not even “programmers” from the point of view of software developers – they “program” hardware instead, in System Verilog) happily writing for and for_ loops (admittedly over Bluefin Eff rather than monad transformers per se, but syntactically they’re very similar). I don’t believe I could get them to write recursive definitions. Furthermore, the for and for_ versions look a lot like what Pythonistas are used to writing every day. So yes, I think “monad transformers” are easier to use than recursion.

Ah, well I agree with this! So I’ve clearly explained myself badly. This is a paragraph from the article:

The final version of extend is the same as the original version, not just in the sense that it calculates the same result, nor even just in the sense that it calculates the same result in the same way, but that it is a transformation of exactly the same code. This implies all the same benefits we expect from pure functional code when it comes to maintenance and refactoring. For example, if extend had been written in Python or Java then the type system wouldn’t catch it if I slipped in a call to delete files from disk, make network connections or launch the missiles; in the “imperative” extend written in Haskell it would: I can only do “State effects on a PPreAssignment”, and “Either effects on a Conflict”.

Hopefully it resolves your complaint, even if it doesn’t do so very clearly. The point is that foldl' restricts power along two dimensions:

  • It processes the list sequentially, one element at a time
  • It can do at most state effects

for_ of a State action restricts power to exactly the same extent (but compositionally):

  • for_ processes the list sequentially, one element at a time
  • State can do at most state effects
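
A minimal sketch of that correspondence (State from mtl; sumFoldl and sumFor are hypothetical names):

import Control.Monad.State (execState, modify')
import Data.Foldable (foldl', for_)

-- Both versions confine the loop body to state effects on the accumulator.
sumFoldl :: [Int] -> Int
sumFoldl = foldl' (+) 0

sumFor :: [Int] -> Int
sumFor xs = flip execState 0 $
  for_ xs $ \x -> modify' (+ x)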

@TeofilC objected that the latter restriction requires “type checking in one’s head”. My counter is that for_ @_ @(State _) achieves the same end without requiring a mental type checker.

(Naturally one can further object that this is messy, but perhaps let’s reserve judgements on syntax until a body of knowledge has been built up around the new style.)

Exactly! And in the style I’m proposing it is actually an explicit use of static types: the choice of the type of the inner applicative/monad. Moreover, those types are composable. When we use foldl', on the other hand, the choice is invisible, implicit and non-composable.

Well then, what’s the reason? When it comes to foldl', mapAccumL, loop, etc., my suggestions really do have the “same power”. (I concede that filter and mapMaybe have a non-trivial property not captured in the choice of applicative/monad.)

And I also don’t actually believe Haskellers take “least power” as seriously as you’re suggesting. I can imagine a Haskeller writing the following, and believing they’re using “least power”:

f a b c = g . foldl' p z
  where
    g = ...

    p = ...

    z = ...

But they’re not actually holding to “least power” (even though they’re writing foldl' instead of for_) because they haven’t restricted the scopes of a, b, c, g, p and z. Does p really need a, b, c, g and z in scope (and itself, recursively)? Probably not! So it’s written in a “more powerful” way than it could be. That’s no different from using a “too powerful” inner monad/applicative in the body of a for_ (this became obvious to me after using Bluefin for a while, because available effects really are the same thing as “in-scope values”).
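
A hypothetical illustration of the scope point:

import Data.List (foldl')

-- In the where version, step can see n, xs and itself, even though it
-- needs none of them:
scaledSum :: Int -> [Int] -> Int
scaledSum n xs = n * foldl' step 0 xs
  where
    step acc x = acc + x

-- "Least power" applied to scope: step' can touch only its arguments.
step' :: Int -> Int -> Int
step' acc x = acc + x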

1 Like

Ah, I see what you’re saying, that makes sense. That’s my fault for skimming. From here I’d retreat to saying that I find editing long type signatures, as I frequently find myself needing to do to comprehend/write effectful code, to be a bit of a pain in the ass. Particularly when I’m just trying to get some sub-function in a where clause to compile.

Compared to your examples, I would have to be extracting more stuff into local bindings with type signatures to confirm to myself what the types of the subexpressions are. I’m just not smart enough to keep them in my head at once.

Another complaint is that we’re sort of swapping one proliferation of combinators for another: you have to memorise evalState and runState and execState and the corresponding transformer versions, and all the other runWhatevers for the other effects you want to use. And flipping them to make the syntax work out. All of that stuff feels like boilerplate to me.

I think it’s possible but difficult for an effect-tracking language to be ergonomic enough for me to find this sort of thing pleasant. E.g. Unison’s implementation of algebraic effects (“abilities”) seems pretty nice. I think it’s sort of inevitable that all Haskell effect system libraries are going to feel a bit bolted-on compared to languages that have one built in. I’m developing a bit of a distaste for monads, honestly.

Yeah, this is what witherable is for, right?
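
For context, witherable’s core method (quoted from memory, so details may differ) names exactly this “effectful mapMaybe” pattern:

wither :: (Witherable t, Applicative f) => (a -> f (Maybe b)) -> t a -> f (t b)

-- Specializing f to Identity recovers a plain mapMaybe over t.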

Sure, as with many techniques, there are diminishing returns when you start getting too zealous with it.

1 Like

I think that’s fair enough. My experience with this style is such that I actually ended up preferring it to “standard iteration combinator style”, so I’m confident at least some others will too.

Yes, sort of, but a decent effect system makes this much easier. effectful decided to flip the args of its run/eval/execState so the user doesn’t have to. I think that’s the right decision. Bluefin followed suit.

Regarding the choice between run/eval/execState, I just always use evalState (mnemonic: “evalState is invaluable”) because it’s the simplest one that generalizes the others.
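
In what sense does evalState generalize the others? A minimal sketch (hypothetical names):

import Control.Monad.State (State, evalState, get)

-- runState and execState are recoverable from evalState plus a final get:
runStateViaEval :: State s a -> s -> (a, s)
runStateViaEval m = evalState ((,) <$> m <*> get)

execStateViaEval :: State s a -> s -> s
execStateViaEval m = evalState (m *> get)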

Regarding runWhatevers, well, that’s just the natural way of using effect systems (and Bluefin makes it even easier I would say, because the runWhatevers are the only way of getting the effect handle in scope in the first place!).

Yes, that’s right.

Bluefin is giving me a better taste for monads. I invite you (and anyone) to try it and see if you feel the same. If you do, drop me a line with any questions or comments!

I agree.