A [solved] benchmarking mystery

I’m playing around with stream fusion and one problematic function I encountered was drop. A simple reproducer does not involve streams at all:

import Test.Tasty.Bench
import Control.DeepSeq

drop' :: Int -> [a] -> [a]
drop' n s =
      case n of
        0 ->
          case s of
            [] -> []
            x : xs -> x : drop' 0 xs -- [1]
        _ ->
          case s of
            [] -> []
            x : xs -> drop' (n - 1) xs

main = do
  let input = replicate 1000000 'a'
  defaultMain
    [ bench "1" $ whnf (rnf . drop' 100000) input
    , bench "2" $ nf (drop' 100000) input
    ]

The main difference between this drop and the drop from the prelude is that this drop still continues traversing the input list after it has reached 0 (where the comment with [1] is), where the prelude drop just returns the remaining list whole. Clearly that could cause some performance difference, but that is not really my focus right now.

What I am much more interested in is the difference between benchmark 1 and benchmark 2. I thought these would be equivalent, but actually the latter is 4-5x slower than the former:

All
  1: OK (2.87s)
    5.18 ms ± 433 μs
  2: OK (1.67s)
    25.7 ms ± 2.5 ms

I’ve been looking at the Core, but I can’t figure out the problem.

Click to see the full Core dump
Rec {
$wdrop' :: forall {a}. Int# -> [a] -> [a]
$wdrop'
  = \ (@a) (ww :: Int#) (s :: [a]) ->
      case ww of ds {
        __DEFAULT ->
          case s of {
            [] -> [];
            : x1 xs -> $wdrop' (-# ds 1#) xs
          };
        0# ->
          case s of {
            [] -> [];
            : x1 xs -> : x1 ($wdrop' 0# xs)
          }
      }
end Rec }

drop' :: forall a. Int -> [a] -> [a]
drop'
  = \ (@a) (n :: Int) (s :: [a]) ->
      case n of { I# ww -> $wdrop' ww s }

x :: Char
x = C# 'a'#

lvl :: [Char]
lvl = : x []

Rec {
$wxs :: Int# -> [Char]
$wxs
  = \ (ww :: Int#) ->
      case ww of ds1 {
        __DEFAULT -> : x ($wxs (-# ds1 1#));
        1# -> lvl
      }
end Rec }

input :: [Char]
input = $wxs 1000000#

g :: Int
g = I# 100000#

Rec {
$wgo :: [Char] -> (# #)
$wgo
  = \ (ds :: [Char]) ->
      case ds of {
        [] -> (##);
        : x1 xs -> case x1 of { C# ipv -> $wgo xs }
      }
end Rec }

go :: [Char] -> ()
go = \ (ds :: [Char]) -> case $wgo ds of { (# #) -> () }

eta :: [Char] -> ()
eta
  = \ (x1 :: [Char]) ->
      case $wgo ($wdrop' 100000# x1) of { (# #) -> () }

Rec {
$s$wbenchLoop
  :: [Char] -> Word64# -> State# RealWorld -> State# RealWorld
$s$wbenchLoop
  = \ (x1 :: [Char]) (ww :: Word64#) (eta2 :: State# RealWorld) ->
      case ww of wild {
        __DEFAULT ->
          case seq# (case $wgo ($wdrop' 100000# x1) of { (# #) -> () }) eta2
          of
          { (# ipv, ipv1 #) ->
          $s$wbenchLoop x1 (subWord64# wild 1##64) ipv
          };
        0##64 -> eta2
      }
end Rec }

eta1 :: [Char] -> [Char]
eta1 = \ (s :: [Char]) -> $wdrop' 100000# s

Rec {
$s$wbenchLoop1
  :: [Char] -> Word64# -> State# RealWorld -> State# RealWorld
$s$wbenchLoop1
  = \ (x1 :: [Char]) (ww :: Word64#) (eta2 :: State# RealWorld) ->
      case ww of wild {
        __DEFAULT ->
          case seq#
                 (let {
                    x2 :: [Char]
                    x2 = eta1 x1 } in
                  case $wgo x2 of { (# #) -> x2 })
                 eta2
          of
          { (# ipv, ipv1 #) ->
          $s$wbenchLoop1 x1 (subWord64# wild 1##64) ipv
          };
        0##64 -> eta2
      }
end Rec }

main3 :: Addr#
main3 = "1"#

main4 :: [Char]
main4 = unpackCString# main3

main5 :: Word64 -> State# RealWorld -> (# State# RealWorld, () #)
main5
  = \ (eta2 :: Word64) (eta3 :: State# RealWorld) ->
      case eta2 of { W64# ww ->
      case $s$wbenchLoop input ww eta3 of ww1 { __DEFAULT ->
      (# ww1, () #)
      }
      }

main6 :: TestTree
main6
  = SingleTest
      $fIsTestBenchmarkable main4 (main5 `cast` <Co:9> :: ...)

main7 :: Addr#
main7 = "2"#

main8 :: [Char]
main8 = unpackCString# main7

main9 :: Word64 -> State# RealWorld -> (# State# RealWorld, () #)
main9
  = \ (eta2 :: Word64) (eta3 :: State# RealWorld) ->
      case eta2 of { W64# ww ->
      case $s$wbenchLoop1 input ww eta3 of ww1 { __DEFAULT ->
      (# ww1, () #)
      }
      }

main10 :: TestTree
main10
  = SingleTest
      $fIsTestBenchmarkable main8 (main9 `cast` <Co:9> :: ...)

main11 :: [Benchmark]
main11 = : main10 []

main12 :: [Benchmark]
main12 = : main6 main11

main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 = defaultMain1 main12

main :: IO ()
main = main1 `cast` <Co:3> :: ...

main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` <Co:3> :: ...)

main :: IO ()
main = main2 `cast` <Co:3> :: ...

The important parts seem to be these:

-- The function that evaluates the strings to normal form
Rec {
$wgo :: [Char] -> (# #)
$wgo
  = \ (ds :: [Char]) ->
      case ds of {
        [] -> (##);
        : x1 xs -> case x1 of { C# ipv -> $wgo xs }
      }
end Rec }

-- Expression that evaluates benchmark 1
seq# (case $wgo ($wdrop' 100000# x1) of { (# #) -> () }) eta2

-- Helper function for benchmark 2
eta1 :: [Char] -> [Char]
eta1 = \ (s :: [Char]) -> $wdrop' 100000# s

-- Expression that evaluates benchmark 2
seq#
  (let {
    x2 :: [Char]
    x2 = eta1 x1 } 
   in
    case $wgo x2 of { (# #) -> x2 })
  eta2

One important difference seems to be that benchmark 2 needs to keep the result x2 fully in memory during the evaluation to normal form. However, I have not been able to reproduce the 4-5x performance difference with a standalone program. I’ve tried a program like this (with -fno-full-laziness):

main = main' ()
main' () = 
  let
    xs = replicate 1000000 'a'
    x = drop' 100000 $!! xs
    !_ = rnf x
  in print (head x)

But this is only about 1.5x slower than the straightforward:

main = deepseq (drop' 100000 $!! replicate 1000000 'a') (return ())

Edit: it does seem to have to do with the memory retention and garbage collection. Running with the -T RTS options shows the memory statistics which are telling:

All
  1: OK (2.67s)
    4.88 ms ± 250 μs,  41 MB allocated, 1.3 KB copied,  53 MB peak memory
  2: OK (1.72s)
    26.0 ms ± 2.6 ms,  41 MB allocated,  40 MB copied, 120 MB peak memory

Edit 2: The reason I can’t reproduce it in a standalone program is that the standalone program also includes the allocation of the gigantic string. That means the relative speedup will be much lower.

I think this solves the mystery. Now the question is if nf in tasty-bench should be changed to discard the output of the function instead of retaining it. And if so, how to actually do that.

2 Likes

The benchmark as written will include the allocation of the string in the first run of drop’.

It doesn’t matter much as the benchmark get’s run many times but input is only allocated once but it’s usually better to either force it manually ahead of time or us the env function.


I think it’s a good thing the benchmarks retain the result. After all usually when you call a function you actually want to use the result.

As you realized you are benchmarking different things if you benchmark whnf (rnf . foo) and nf . foo.

While making nf (or whnf for that matter) not retain the input would improve benchmark numbers it would move it further away from realistic numbers for the use of whatever you are benchmarking.

1 Like

I don’t think it has much to do with whether the result is used or not.

There are plenty situations in which you use the result linearly, for example if you sum the resulting list. In those cases the items do not have to be retained in memory and copied by the garbage collector.

Instead, the difference is mainly just about whether you include the GC copying cost in your benchmark. Since you can’t influence that in any way, I don’t think it makes sense to measure it.

My point was mostly that if you change nf to discard the result as it’s evaluated you remove the users choice about what to measure.

It seems reasonable to add something like nf_ which encapsulates the pattern of wnf (rnf . f) instead of changing nf for when you want to benchmark a process that consumes data as it’s produced.

Instead, the difference is mainly just about whether you include the GC copying cost in your benchmark. Since you can’t influence that in any way, I don’t think it makes sense to measure it.

It’s a question of what you are trying to measure. If your goal is to measure purely Mutator performance then your measurements will be more accurate if you ignore GC costs as best as you can at the risk of measuring based on a less realistic model.

You can often influence GC cost both in benchmarks and real programs by tuning GC parameters, optimizing (or artificially increasing) heap residency and changing allocation rates. So I don’t think it’s that simple.

Often we can make tradeoffs between mutator time and GC costs through allocation/residency. If we do our best to avoid measuring GC cost then all our measurements will be biased towards approaches that do well for mutator time, but they could perform worse once you factor in GC costs.

1 Like

You’re indeed right. I was just trying to avoid the real problem. We do have very real control over GC behavior. It can be seen if we add the Prelude.drop to the benchmark:

  defaultMain
    [ bench "1" $ whnf (rnf . drop' 100000) input
    , bench "2" $ nf (drop' 100000) input
    , bench "3" $ nf (drop  100000) input
    ]

The results are

All
  1: OK (0.33s)
    4.40 ms ± 170 μs,  41 MB allocated, 1.6 KB copied,  53 MB peak memory
  2: OK (0.17s)
    20.3 ms ± 1.6 ms,  41 MB allocated,  37 MB copied, 102 MB peak memory
  3: OK (0.34s)
    1.04 ms ±  58 μs, 1.5 MB allocated,  69 B  copied, 102 MB peak memory

The reason that the version from the prelude is much faster and uses less memory is as I mentioned at the beginning of the thread: my custom drop' walks over the whole list and even reallocates the part that is not dropped. The prelude drop function just returns the original list and does not allocate new cons-cells. Those two functions should give wildly different results in a benchmark.

The real problem I am facing is that this situations seems unavoidable in stream fusion of the drop function. Perhaps the only way out is to design a GHC optimization that detects identity functions of the form:

drop'0 xs =
  case xs of
    [] -> []
    x:xs' -> x : drop'0 xs'

Combined with SpecConstr it should be possible to optimize drop' to become the same as the efficient Prelude.drop.