I wrote a naive implementation (naive meaning no mutable data structures); it takes 106 seconds on my machine: 1brc.hs on GitHub
```
$ ghc-9.8 -threaded -rtsopts 1brc.hs && time ./1brc +RTS -s -N8 -A128M
 943,546,402,696 bytes allocated in the heap
   8,316,733,712 bytes copied during GC
      28,490,928 bytes maximum residency (6 sample(s))
      15,721,296 bytes maximum slop
           16199 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       864 colls,   864 par   13.831s   2.104s     0.0024s    0.0222s
  Gen  1         6 colls,     5 par    0.064s   0.610s     0.1017s    0.5977s

  Parallel GC work balance: 75.73% (serial 0%, perfect 100%)

  TASKS: 18 (1 bound, 17 peak workers (17 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.081s  (  0.081s elapsed)
  MUT     time  664.972s  (103.921s elapsed)
  GC      time   13.895s  (  2.714s elapsed)
  EXIT    time    0.003s  (  0.002s elapsed)
  Total   time  678.950s  (106.719s elapsed)

  Alloc rate    1,418,927,204 bytes per MUT second

  Productivity  97.9% of total user, 97.4% of total elapsed

./1brc +RTS -s -N8 -A128M  662.51s user 16.47s system 634% cpu 1:47.01 total
```
Profiling suggests that `Map` becomes the bottleneck here.
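For readers who haven't looked at the challenge: assuming the standard 1BRC input (one `station;temperature` reading per line, one fractional digit per temperature), a pure solution boils down to folding every row into an immutable `Data.Map`. The sketch below is a hypothetical illustration of that shape, not the code from the gist; the names `Stats`, `parseTenths`, `parseLine` and `aggregate` are made up here, and it assumes well-formed input.

```haskell
-- Hypothetical sketch (not the gist's code): pure, Map-based aggregation of
-- 1BRC-style lines "station;temperature", printing min/mean/max per station.
module Main (main) where

import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.List (foldl')
import qualified Data.Map.Strict as M
import Text.Printf (printf)

-- Fields: min, max, sum, count. Temperatures are stored in tenths of a
-- degree so sums stay exact (assumes exactly one fractional digit).
data Stats = Stats !Int !Int !Int !Int
  deriving (Eq, Show)

-- Merge two running statistics; used as the combining function of insertWith.
combine :: Stats -> Stats -> Stats
combine (Stats lo1 hi1 s1 n1) (Stats lo2 hi2 s2 n2) =
  Stats (min lo1 lo2) (max hi1 hi2) (s1 + s2) (n1 + n2)

-- Parse e.g. "-12.3" into -123 (tenths of a degree); assumes valid input.
parseTenths :: BC.ByteString -> Int
parseTenths s = case BC.uncons s of
  Just ('-', rest) -> negate (digits rest)
  _                -> digits s
  where
    digits = BC.foldl' step 0 . BC.filter (/= '.')
    step acc c = acc * 10 + (fromEnum c - fromEnum '0')

-- Split "station;temp" at the first ';'.
parseLine :: BC.ByteString -> (BC.ByteString, Int)
parseLine l =
  let (name, rest) = BC.break (== ';') l
  in (name, parseTenths (BC.drop 1 rest))

-- Fold every row into an immutable Map. Each insertWith performs O(log n)
-- ByteString comparisons and rebuilds the tree nodes along the search path.
aggregate :: [BC.ByteString] -> M.Map BC.ByteString Stats
aggregate = foldl' step M.empty
  where
    step m l =
      let (name, t) = parseLine l
      in M.insertWith combine name (Stats t t t 1) m

main :: IO ()
main = do
  -- Lazy input so the file is streamed rather than read into memory at once.
  input <- BL.getContents
  let rows = map BL.toStrict (filter (not . BL.null) (BL.lines input))
  mapM_ report (M.toList (aggregate rows))
  where
    report :: (BC.ByteString, Stats) -> IO ()
    report (name, Stats lo hi total n) =
      printf "%s=%.1f/%.1f/%.1f\n" (BC.unpack name)
        (fromIntegral lo / 10 :: Double)
        (fromIntegral total / fromIntegral n / 10 :: Double)
        (fromIntegral hi / 10 :: Double)
```

Usage would be something like `./sketch < measurements.txt`. Because the map is persistent, every one of the billion rows pays for a logarithmic search with `ByteString` key comparisons plus fresh tree nodes, so this per-row cost is one plausible place for a `Map`-based version to spend its time.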