I’m playing around with stream fusion and one problematic function I encountered was drop
. A simple reproducer does not involve streams at all:
import Test.Tasty.Bench
import Control.DeepSeq
drop' :: Int -> [a] -> [a]
drop' n s =
case n of
0 ->
case s of
[] -> []
x : xs -> x : drop' 0 xs -- [1]
_ ->
case s of
[] -> []
x : xs -> drop' (n - 1) xs
main = do
let input = replicate 1000000 'a'
defaultMain
[ bench "1" $ whnf (rnf . drop' 100000) input
, bench "2" $ nf (drop' 100000) input
]
The main difference between this drop and the drop from the prelude is that this drop still continues traversing the input list after it has reached 0 (where the comment with [1] is), where the prelude drop just returns the remaining list whole. Clearly that could cause some performance difference, but that is not really my focus right now.
What I am much more interested in is the difference between benchmark 1 and benchmark 2. I thought these would be equivalent, but actually the latter is 4-5x slower than the former:
All
1: OK (2.87s)
5.18 ms ± 433 μs
2: OK (1.67s)
25.7 ms ± 2.5 ms
I’ve been looking at the Core, but I can’t figure out the problem.
Click to see the full Core dump
Rec {
$wdrop' :: forall {a}. Int# -> [a] -> [a]
$wdrop'
= \ (@a) (ww :: Int#) (s :: [a]) ->
case ww of ds {
__DEFAULT ->
case s of {
[] -> [];
: x1 xs -> $wdrop' (-# ds 1#) xs
};
0# ->
case s of {
[] -> [];
: x1 xs -> : x1 ($wdrop' 0# xs)
}
}
end Rec }
drop' :: forall a. Int -> [a] -> [a]
drop'
= \ (@a) (n :: Int) (s :: [a]) ->
case n of { I# ww -> $wdrop' ww s }
x :: Char
x = C# 'a'#
lvl :: [Char]
lvl = : x []
Rec {
$wxs :: Int# -> [Char]
$wxs
= \ (ww :: Int#) ->
case ww of ds1 {
__DEFAULT -> : x ($wxs (-# ds1 1#));
1# -> lvl
}
end Rec }
input :: [Char]
input = $wxs 1000000#
g :: Int
g = I# 100000#
Rec {
$wgo :: [Char] -> (# #)
$wgo
= \ (ds :: [Char]) ->
case ds of {
[] -> (##);
: x1 xs -> case x1 of { C# ipv -> $wgo xs }
}
end Rec }
go :: [Char] -> ()
go = \ (ds :: [Char]) -> case $wgo ds of { (# #) -> () }
eta :: [Char] -> ()
eta
= \ (x1 :: [Char]) ->
case $wgo ($wdrop' 100000# x1) of { (# #) -> () }
Rec {
$s$wbenchLoop
:: [Char] -> Word64# -> State# RealWorld -> State# RealWorld
$s$wbenchLoop
= \ (x1 :: [Char]) (ww :: Word64#) (eta2 :: State# RealWorld) ->
case ww of wild {
__DEFAULT ->
case seq# (case $wgo ($wdrop' 100000# x1) of { (# #) -> () }) eta2
of
{ (# ipv, ipv1 #) ->
$s$wbenchLoop x1 (subWord64# wild 1##64) ipv
};
0##64 -> eta2
}
end Rec }
eta1 :: [Char] -> [Char]
eta1 = \ (s :: [Char]) -> $wdrop' 100000# s
Rec {
$s$wbenchLoop1
:: [Char] -> Word64# -> State# RealWorld -> State# RealWorld
$s$wbenchLoop1
= \ (x1 :: [Char]) (ww :: Word64#) (eta2 :: State# RealWorld) ->
case ww of wild {
__DEFAULT ->
case seq#
(let {
x2 :: [Char]
x2 = eta1 x1 } in
case $wgo x2 of { (# #) -> x2 })
eta2
of
{ (# ipv, ipv1 #) ->
$s$wbenchLoop1 x1 (subWord64# wild 1##64) ipv
};
0##64 -> eta2
}
end Rec }
main3 :: Addr#
main3 = "1"#
main4 :: [Char]
main4 = unpackCString# main3
main5 :: Word64 -> State# RealWorld -> (# State# RealWorld, () #)
main5
= \ (eta2 :: Word64) (eta3 :: State# RealWorld) ->
case eta2 of { W64# ww ->
case $s$wbenchLoop input ww eta3 of ww1 { __DEFAULT ->
(# ww1, () #)
}
}
main6 :: TestTree
main6
= SingleTest
$fIsTestBenchmarkable main4 (main5 `cast` <Co:9> :: ...)
main7 :: Addr#
main7 = "2"#
main8 :: [Char]
main8 = unpackCString# main7
main9 :: Word64 -> State# RealWorld -> (# State# RealWorld, () #)
main9
= \ (eta2 :: Word64) (eta3 :: State# RealWorld) ->
case eta2 of { W64# ww ->
case $s$wbenchLoop1 input ww eta3 of ww1 { __DEFAULT ->
(# ww1, () #)
}
}
main10 :: TestTree
main10
= SingleTest
$fIsTestBenchmarkable main8 (main9 `cast` <Co:9> :: ...)
main11 :: [Benchmark]
main11 = : main10 []
main12 :: [Benchmark]
main12 = : main6 main11
main1 :: State# RealWorld -> (# State# RealWorld, () #)
main1 = defaultMain1 main12
main :: IO ()
main = main1 `cast` <Co:3> :: ...
main2 :: State# RealWorld -> (# State# RealWorld, () #)
main2 = runMainIO1 (main1 `cast` <Co:3> :: ...)
main :: IO ()
main = main2 `cast` <Co:3> :: ...
The important parts seem to be these:
-- The function that evaluates the strings to normal form
Rec {
$wgo :: [Char] -> (# #)
$wgo
= \ (ds :: [Char]) ->
case ds of {
[] -> (##);
: x1 xs -> case x1 of { C# ipv -> $wgo xs }
}
end Rec }
-- Expression that evaluates benchmark 1
seq# (case $wgo ($wdrop' 100000# x1) of { (# #) -> () }) eta2
-- Helper function for benchmark 2
eta1 :: [Char] -> [Char]
eta1 = \ (s :: [Char]) -> $wdrop' 100000# s
-- Expression that evaluates benchmark 2
seq#
(let {
x2 :: [Char]
x2 = eta1 x1 }
in
case $wgo x2 of { (# #) -> x2 })
eta2
One important difference seems to be that benchmark 2 needs to keep the result x2
fully in memory during the evaluation to normal form. However, I have not been able to reproduce the 4-5x performance difference with a standalone program. I’ve tried a program like this (with -fno-full-laziness
):
main = main' ()
main' () =
let
xs = replicate 1000000 'a'
x = drop' 100000 $!! xs
!_ = rnf x
in print (head x)
But this is only about 1.5x slower than the straightforward:
main = deepseq (drop' 100000 $!! replicate 1000000 'a') (return ())
Edit: it does seem to have to do with the memory retention and garbage collection. Running with the -T
RTS options shows the memory statistics which are telling:
All
1: OK (2.67s)
4.88 ms ± 250 μs, 41 MB allocated, 1.3 KB copied, 53 MB peak memory
2: OK (1.72s)
26.0 ms ± 2.6 ms, 41 MB allocated, 40 MB copied, 120 MB peak memory
Edit 2: The reason I can’t reproduce it in a standalone program is that the standalone program also includes the allocation of the gigantic string. That means the relative speedup will be much lower.
I think this solves the mystery. Now the question is if nf
in tasty-bench
should be changed to discard the output of the function instead of retaining it. And if so, how to actually do that.