Subquadratic behaviour of Data.List.nub

Everyone knows that Data.List.nub has quadratic time complexity.
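(For reference, nub is roughly equivalent to the accumulator-based sketch below; this is a simplification for illustration, not the actual Data.List source, and the name nubSketch is mine. Every new element is compared against everything seen so far, hence the quadratic number of comparisons.)

nubSketch :: Eq a => [a] -> [a]
nubSketch = go []
  where
    -- each element is checked against the accumulator of already-seen
    -- elements, so a list of length n costs O(n^2) comparisons
    go _    []     = []
    go seen (x:xs)
      | x `elem` seen = go seen xs
      | otherwise     = x : go (x : seen) xs

The following program sums the result of nub on [1..n]: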

{-# OPTIONS_GHC -O2 #-}

module Main where

import Data.List (nub)
import System.Environment (getArgs)

main :: IO ()
main = do
  n : _ <- getArgs
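  -- summing the result forces the entire output of nub, so all the work happens here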
  print $ sum $ nub [(1::Integer)..read n]

Running this program for small n indeed demonstrates quadratic behaviour: growing n 2x makes execution time 4x longer. And yet something strange happens once the live set grows big enough for Gen 1 GC to kick in, at least on my aarch64 machine with GHC 9.6. A run with n = 20000 takes 1.2 seconds:

$ ghc Nub.hs && ./Nub 20000 +RTS -s
200010000
       3,737,072 bytes allocated in the heap
           3,272 bytes copied during GC
          44,328 bytes maximum residency (1 sample(s))
          25,304 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.211s  (  1.211s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.002s elapsed)
  Total   time    1.212s  (  1.213s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,084,968 bytes per MUT second

  Productivity  99.9% of total user, 99.8% of total elapsed

One would expect that for n = 40000 the total execution time would be 4 x 1.2 = 4.8 seconds. But for some reason it is significantly smaller, less than 4 seconds:

$ ghc Nub.hs && ./Nub 40000 +RTS -s
800020000
       7,417,072 bytes allocated in the heap
         904,264 bytes copied during GC
          44,328 bytes maximum residency (1 sample(s))
          29,400 bytes maximum slop
               7 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.000s   0.000s     0.0004s    0.0004s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    3.849s  (  3.847s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time    3.850s  (  3.852s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,927,260 bytes per MUT second

  Productivity 100.0% of total user, 99.9% of total elapsed

All right: if n = 40000 took 3.85 seconds, we would expect n = 80000 to take 15.4 seconds. But yet again, in practice it is less than 13.5 seconds:

$ ghc Nub.hs && ./Nub 80000 +RTS -s
3200040000
      14,777,112 bytes allocated in the heap
       5,408,104 bytes copied during GC
       2,727,680 bytes maximum residency (2 sample(s))
         479,488 bytes maximum slop
              11 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     0 par    0.001s   0.001s     0.0005s    0.0007s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0006s    0.0010s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time   13.493s  ( 13.489s elapsed)
  GC      time    0.002s  (  0.002s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time   13.496s  ( 13.495s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,095,151 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

Can you reproduce this in your environment? Is there a plausible explanation for the effect? The only idea I have is that it seems like Gen 1 GC rearranges memory into a shape which somehow facilitates faster access, but I’ve never heard of such an effect.


This would probably produce a nice plot.

The only idea I have is that it seems like Gen 1 GC rearranges memory into a shape which somehow facilitates faster access, but I’ve never heard of such an effect.

A copying GC generally improves locality (e.g. it puts the evaluated list elements right next to each other), so is this so surprising?


Would not Gen 0 GC have the same effect?

The output of nub is immediately consumed, and the only intermediate structure allocated is nub's accumulator, which only grows in one direction and should have good memory locality as it is.


I’m getting a relatively consistent quadratic running time:

Here I’ve divided the timing on my machine by the square of the input number. So if the graph slopes down it means subquadratic running time and if it slopes up then there is a superquadratic running time. The absolute values of the Y axis don’t matter.
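For example, the first measurement below is 26330 ps for input size 2, so the normalized value is 26330 / 2² = 6582.5, which is the first entry in the normalized table.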

You can see that for very small values and between 8192 and 16384 I get a subquadratic running time.

Here is the full data:
Name,Mean (ps),2*Stdev (ps)
All.2,26330,1008
All.4,63217,5780
All.8,203436,10432
All.16,728923,41984
All.32,2655315,164392
All.64,10187346,661324
All.128,39837673,2637054
All.256,157491759,5322750
All.512,626240243,41977578
All.1024,2497814450,170288482
All.2048,9985676825,340262352
All.4096,40263872100,1346900912
All.8192,162102463600,8195058550
All.16384,437802215350,887756776
All.32768,2288435824200,130463074886

And here are the normalized numbers shown in the graph:

2	6582.5
4	3951.0625
8	3178.6875
16	2847.35546875
32	2593.0810546875
64	2487.14501953125
128	2431.49859619141
256	2403.13352966309
512	2388.91694259644
1024	2382.1014881134
2048	2380.7708799839
4096	2399.91379380226
8192	2415.51494002342
16384	1630.94034548849
32768	2131.27194363624

Edit: Here’s a similar graph for inputs between 10000 and 100000:

I should say that I’m using slightly different code:

{-# OPTIONS_GHC -O2 #-}

module Main where

import Data.List (nub)
import Test.Tasty.Bench

mkBench :: Integer -> Benchmark
mkBench i = bench (show i) (nf nub [1 .. i])

main :: IO ()
main = defaultMain
  [ mkBench (10000 * i) | i <- [1..10] ]
Full data again:
Name	Mean (ps)	2*Stdev (ps)
10000	189755992200	18678302456
20000	823506612600	89390275256
30000	1408289599800	625788312376
40000	3849603130800	25572802156
50000	4052315431800	668211977916
60000	6085734789000	1163460474448
70000	9689423129800	1381519813834
80000	12056064375400	655208968036
90000	13859477355600	787350295122
100000	17183852247200	5538253026568

Normalized:

10000	1897.559922
20000	2058.7665315
30000	1564.766222
40000	2406.00195675
50000	1620.92617272
60000	1690.48188583333
70000	1977.43329179592
80000	1883.76005865625
90000	1711.04658711111
100000	1718.38522472

Isn’t it vice versa? One would expect the time for size 16384 to be 4x the time for 8192, but in your data it is significantly smaller: 437802215350 << 4 * 162102463600. This is very similar to the effect I observe. Could you possibly dump CSV with +RTS -T to check at which point Gen 1 GC kicks in?
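(For anyone who wants to check when Gen 1 GC kicks in outside of tasty-bench, here is a minimal sketch using GHC.Stats from base; the helper name measure is made up for illustration, and the program has to be started with +RTS -T for the RTS to collect statistics at all.)

import Control.Exception (evaluate)
import Data.List (nub)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Counts Gen 0 and Gen 1 collections around a single run of nub.
measure :: Integer -> IO ()
measure n = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "run with +RTS -T to collect RTS statistics"
    else do
      before <- getRTSStats
      _ <- evaluate (sum (nub [1 .. n]))
      after <- getRTSStats
      let gen1 = major_gcs after - major_gcs before   -- Gen 1 (major) collections
          gen0 = gcs after - gcs before - gen1        -- Gen 0 (minor) collections
      putStrLn $ show n ++ ": Gen 0 colls = " ++ show gen0
                        ++ ", Gen 1 colls = " ++ show gen1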


Yeah, it is getting late… Still, I don’t see subquadratic behavior over many measurements. But there does seem to be a definite increase in speed between ~8000 and ~16000.


Yeah, my measurements are also consistent with the hypothesis until the size reaches five digits. Then something interesting happens.


I’m not sure you’re distinguishing between the timings of the GC runs correctly here? By your output, the first, smallest run has 0 Gen 0 colls and 1 Gen 1 coll (which we may assume happens after the nub run). The second run has 1 coll in each generation, and the third 2 colls in each.


Zooming in gives an even stranger picture:

Raw data:
Name	Mean (ps)	2*Stdev (ps)
All.1000	1847662112	177159080
All.2000	9536155950	694224644
All.3000	21528642150	912182464
All.4000	38458305200	2936567252
All.5000	50410646650	4186486320
All.6000	76720363700	3421926826
All.7000	120096119400	1653369680
All.8000	156642704400	2855620698
All.9000	194840710800	12289458226
All.10000	215075019400	7404635352
All.11000	298364216200	10711984716
All.12000	316697445800	4480285974
All.13000	292753574350	8586432608
All.14000	356030077400	28292269570
All.15000	407389385700	11345002144
All.16000	452861100700	7524605320
All.17000	695306929600	2690261682
All.18000	812954171200	11281412842
All.19000	649293321000	10154074128
All.20000	741282862200	26610316232
All.21000	833292301000	33265133594
All.22000	917515456600	72566174030
All.23000	1015895692200	100362206452
All.24000	1102282092800	104720691122
All.25000	1162594520350	2581650940
All.26000	1351289199600	44543070246
All.27000	1866105929400	80237888026
All.28000	1533975522300	30697405754
All.29000	1635183724800	10089286742
All.30000	1728411313400	58574034116

It does seem like there is a drop around 12000, but there are a few smaller dips before it and two peaks after it too.

Edit: even higher resolution shows no significant dip at all:

Raw data:
Name,Mean (ps),2*Stdev (ps)
All.100,15000820,357432
All.200,96622810,5278434
All.300,216367509,10993054
All.400,332063012,22115194
All.500,598490781,24011456
All.600,743299531,45490846
All.700,1170353700,95663572
All.800,1320356150,86931670
All.900,1670336718,88389530
All.1000,2384527287,175170154
All.1100,2494678337,176028916
All.1200,3433722950,169187778
All.1300,4030343350,342458506
All.1400,4671142300,339256904
All.1500,4648044475,373108402
All.1600,4004924712,90852922
All.1700,4515679106,199900396
All.1800,7378092037,168965904
All.1900,8623028350,684494172
All.2000,9591154675,341072590
All.2100,9155611300,737768714
All.2200,11609682900,772803874
All.2300,12656339400,838667704
All.2400,13779660550,758822636
All.2500,14992343650,1040854090
All.2600,16193392450,796398418
All.2700,15260502100,765619388
All.2800,18878610700,1030876978
All.2900,17843150700,772159440
All.3000,21764972100,1917523546
All.3100,23136407600,1388565566
All.3200,24666859700,1783054468
All.3300,23069795800,2010253732
All.3400,19810009150,1480778102
All.3500,20963096825,671637228
All.3600,30929969100,2120052720
All.3700,31951512400,2813690398
All.3800,33620321500,1708270964
All.3900,26428407950,1914622696
All.4000,27959664750,1935386152
All.4100,29583987500,467321436
All.4200,42301356800,4225830382
All.4300,43274928000,1598833324
All.4400,46902718200,3245096334
All.4500,49264090400,3452105246
All.4600,51806652800,3019386738
All.4700,51488980900,2234788052
All.4800,53566912200,4589240622
All.4900,57689167400,3504628694
All.5000,60155251200,4062236072
All.5100,63063022000,2907666522
All.5200,65590577000,5010169686
All.5300,67706687600,2970733478
All.5400,59977747950,5452992316
All.5500,72600747500,4850762636
All.5600,77436506600,4558143178
All.5700,79562217600,4600521138
All.5800,82302689400,2843320534
All.5900,90075053500,2565489984
All.6000,92508306600,4232940184
All.6100,94984775200,2182779794
All.6200,98872679400,2667520870
All.6300,101887935300,7978677830
All.6400,97862583200,5198806170
All.6500,110568429000,2777502768
All.6600,114546155600,2948189440
All.6700,117216208300,5577650646
All.6800,112249590400,6473867586
All.6900,124882355600,2721557242
All.7000,120917387100,6502063956
All.7100,133871223400,3630237578
All.7200,129753380000,1599584644
All.7300,144266760700,4176502724
All.7400,136155208000,6198306642
All.7500,147473689500,9945949864
All.7600,151184569700,7547612222
All.7700,161402037400,2709735598
All.7800,166438034900,5067092374
All.7900,152154533250,3806317644
All.8000,157192797600,3186460372
All.8100,163055492300,3628895042
All.8200,188934464600,5537757850
All.8300,189903549900,4370863540
All.8400,184304450800,5716539536
All.8500,173794446350,13257084890
All.8600,183063068500,18291673334
All.8700,205221918300,2031969082
All.8800,206575913000,14793219764
All.8900,213396900500,9130576620
All.9000,217163401900,8770564310
All.9100,229656299600,12373432864
All.9200,211247850000,2690566682
All.9300,174422989950,2336669122
All.9400,215301363000,14957157088
All.9500,152077838050,4140678454
All.9600,218670439200,5070013126
All.9700,232410893400,12307028352
All.9800,233820247200,1824884592
All.9900,265928548900,11382955372
All.10000,237297090700,19566002438
All.10100,239766016800,13381044896
All.10200,175731482550,9894241464
All.10300,212439441300,11404034338
All.10400,255639040000,3855956118
All.10500,221165259025,6106558314
All.10600,265989612800,7784295546
All.10700,268942491000,22471959932
All.10800,207461710000,11877819800
All.10900,305813973400,30196064108
All.11000,312453590600,21050951256
All.11100,294376963000,11250726834
All.11200,230738484600,2785350026
All.11300,310036333200,16824097646
All.11400,242317100300,10594403316
All.11500,335274134600,31286838192
All.11600,325159374800,8800119762
All.11700,334253672800,17190092780
All.11800,269685215850,10997058116
All.11900,346995415050,3073040072
All.12000,348292426600,12839637384
All.12100,376417227100,5033877374
All.12200,328716485600,23269964296
All.12300,360934264800,3822601140
All.12400,333331435200,10359642118
All.12500,369793911850,2841625148
All.12600,408875116000,23111732522
All.12700,410088266000,18906138106
All.12800,423777364700,8618599342
All.12900,401377903400,30318265222
All.13000,431294997600,36088829950
All.13100,436820238000,42280865764
All.13200,365083110800,34304920410
All.13300,436920325300,42780513366
All.13400,369969597050,16282908486
All.13500,461317273300,9295534038
All.13600,468786020800,29339722520
All.13700,427575366400,23810531940
All.13800,393190943000,5813096976
All.13900,447747156800,16805095540
All.14000,461030665600,29951482992
All.14100,517421031600,8257681422
All.14200,522814375100,22138453524
All.14300,524052291800,48197873268
All.14400,559885302400,2230980028
All.14500,590621583100,26932617256
All.14600,556051631800,55086989858
All.14700,588254729400,5990261984
All.14800,583084974000,38743580638
All.14900,489599435200,4756476146
All.15000,532846186600,30597782664
All.15100,541322685100,1135476212
All.15200,618021462000,30014834376
All.15300,424545651100,10504934772
All.15400,596776116400,44309642190
All.15500,586590500550,18346082906
All.15600,575977098150,35053655026
All.15700,582811915800,41665511486
All.15800,649452360600,59797509890
All.15900,539325901350,35377846870
All.16000,648028860900,26552301896
All.16100,644804455600,34989135074
All.16200,642015088400,19250607396
All.16300,702977670000,10629307890
All.16400,651139118800,12626500752
All.16500,529009206150,13516992686
All.16600,700040344700,34492704056
All.16700,590189804525,15945379280
All.16800,601404794250,26114637652
All.16900,592972955900,28474864198
All.17000,723411276200,37362800388
All.17100,666147380050,34874521608
All.17200,779851726100,21246497636
All.17300,710261219700,22287746616
All.17400,723777161000,35238519486
All.17500,751578855200,27939587090
All.17600,720099325850,65098324810
All.17700,645457839000,35127760332
All.17800,729854545250,26581779086
All.17900,869110757300,35596466564
All.18000,848949972100,39064242624
All.18100,747223690450,22152728136
All.18200,768062180400,14889122706
All.18300,750862972900,39963947384
All.18400,798050365850,4534770678
All.18500,801055019800,48773421512
All.18600,806262473900,57321764940
All.18700,868697856700,58280458290
All.18800,863397601800,83941992322
All.18900,830161962850,21836534036
All.19000,879345126600,52863195210
All.19100,577988863650,18200427450
All.19200,857970739400,8022614776
All.19300,874988104400,2997433694
All.19400,896660898100,12900382854
All.19500,851107526800,41471765140
All.19600,960266401900,1386239496
All.19700,938666973000,19808834276
All.19800,838629885050,60828870306
All.19900,978242954400,46631119020
All.20000,939650477500,84995381138

Thanks @jaror.

Overall it seems to be a fluke caused by the scheduling of GC. I’ve written the following harness:

#!/usr/bin/env cabal
{- cabal:
build-depends: base, tasty-bench
-}

{-# OPTIONS_GHC -O2 #-}

module Main (main) where

import Data.List
import Test.Tasty.Bench

ns :: [Int]
ns = [2000,4000..80000]

main :: IO ()
main = defaultMain $ map mkBench ns

mkBench :: Int -> Benchmark
mkBench n = cmp $ bench (show n) $ nf (\x -> sum $ nub [1..x]) n
  where
    half = n `quot` 2
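    -- bcompare makes tasty-bench report this benchmark's time as a multiple of
    -- the benchmark whose name matches the pattern, i.e. the run on a list
    -- half as long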
    cmp
      | half `elem` ns = bcompare ("$2==\"" <> show half <> "\"")
      | otherwise = id

It runs nub on lists of increasing length and compares performance against the run on a list half as long (when available).

Now running cabal run NubBench.hs -- --stdev 5 --csv nub5.csv +RTS -T largely demonstrates the effect which originally puzzled me: up to lengths of ~30000 the behaviour is consistently quadratic (2x longer lists take 4x the time to process), and afterwards it is consistently subquadratic, with a multiplier of 3.6x instead of 4x. This seems to be mostly caused by a favourable/unfavourable pattern of GC: starting from lengths of ~30000, a benchmark of nub will often run only 3 iterations, which is not enough to even out the effects of GC.

Rerunning with cabal run NubBench.hs -- --stdev 1 --csv nub1.csv +RTS -T takes longer measurements of nub even on large lists, which smooths out the GC pattern, and indeed the results are in line with the theory: consistently 4x.
