Everyone knows that Data.List.nub has quadratic time complexity.
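For reference, this is essentially the Haskell Report definition of nub (a sketch; the implementation in base is written differently, with an accumulator of already-seen elements, but is still quadratic): every new element gets compared against everything kept so far, which on a list of n distinct elements means O(n²) comparisons.

nub :: Eq a => [a] -> [a]
nub []     = []
nub (x:xs) = x : nub (filter (\y -> x /= y) xs)

The benchmark below feeds nub a list of n distinct Integers, so the full quadratic cost is paid: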
{-# OPTIONS_GHC -O2 #-}
module Main where

import Data.List (nub)
import System.Environment (getArgs)

main :: IO ()
main = do
  n : _ <- getArgs
  print $ sum $ nub [(1 :: Integer) .. read n]
Running this program for small n indeed demonstrates quadratic behaviour: doubling n makes execution roughly 4x longer. And yet something strange happens once the live set grows big enough for a Gen 1 GC to kick in, at least on my aarch64 machine with GHC 9.6. A run with n = 20000 takes 1.2 seconds:
$ ghc Nub.hs && ./Nub 20000 +RTS -s
200010000
3,737,072 bytes allocated in the heap
3,272 bytes copied during GC
44,328 bytes maximum residency (1 sample(s))
25,304 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.001s ( 0.001s elapsed)
MUT time 1.211s ( 1.211s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 1.212s ( 1.213s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,084,968 bytes per MUT second
Productivity 99.9% of total user, 99.8% of total elapsed
One would expect the total execution time for n = 40000 to be 4 × 1.2 = 4.8 seconds. But for some reason it is significantly smaller, under 4 seconds:
$ ghc Nub.hs && ./Nub 40000 +RTS -s
800020000
7,417,072 bytes allocated in the heap
904,264 bytes copied during GC
44,328 bytes maximum residency (1 sample(s))
29,400 bytes maximum slop
7 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.000s 0.000s 0.0004s 0.0004s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.001s ( 0.001s elapsed)
MUT time 3.849s ( 3.847s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.004s elapsed)
Total time 3.850s ( 3.852s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 1,927,260 bytes per MUT second
Productivity 100.0% of total user, 99.9% of total elapsed
All right: if n = 40000 took 3.85 seconds, we would expect n = 80000 to take 4 × 3.85 ≈ 15.4 seconds. But yet again, in practice it is less than 13.5 seconds:
$ ghc Nub.hs && ./Nub 80000 +RTS -s
3200040000
14,777,112 bytes allocated in the heap
5,408,104 bytes copied during GC
2,727,680 bytes maximum residency (2 sample(s))
479,488 bytes maximum slop
11 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2 colls, 0 par 0.001s 0.001s 0.0005s 0.0007s
Gen 1 2 colls, 0 par 0.001s 0.001s 0.0006s 0.0010s
INIT time 0.001s ( 0.001s elapsed)
MUT time 13.493s ( 13.489s elapsed)
GC time 0.002s ( 0.002s elapsed)
EXIT time 0.000s ( 0.004s elapsed)
Total time 13.496s ( 13.495s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 1,095,151 bytes per MUT second
Productivity 100.0% of total user, 100.0% of total elapsed
Can you reproduce this in your environment? Is there a plausible explanation for the effect? The only idea I have is that the Gen 1 GC seems to rearrange memory into a shape that somehow facilitates faster access, but I have never heard of such an effect.
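One way to probe that hypothesis (a sketch only; I have not verified that it isolates the effect) would be to materialise the list, force a major collection so the heap gets copied up front, and only then run nub over it. If a rearranged heap really is faster to traverse, the cheaper per-element cost should show up from the very start:

{-# OPTIONS_GHC -O2 #-}
module Main where

import Data.List (nub)
import System.Environment (getArgs)
import System.Mem (performMajorGC)

main :: IO ()
main = do
  n : _ <- getArgs
  let xs = [(1 :: Integer) .. read n]
  -- Materialise the list first (note: unlike the original program, this
  -- keeps the whole list alive), then force a major GC so the live data
  -- is copied before nub starts walking it.
  print (length xs)
  performMajorGC
  print $ sum $ nub xs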