Diffarray has always been such a cool concept to me. If you don’t know, it’s a constant-time-update array with a pure interface. The way it works is that when you update an element, the array is mutated under the hood and returned as a new array, while the old handle is swapped out to point at the new array plus a diff recording the changed elements.
This way you can pretend you’re always making a fresh copy when mutating, but you don’t pay the copying cost as long as you use the array linearly.
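To make the mechanism concrete, here’s a minimal sketch of the idea (my own toy reconstruction, not the package’s code: the real implementation mutates an `IOArray` in place and guards the state with an `MVar`, while this sketch simulates the mutation with a pure `(//)` update to stay short):

```haskell
import Data.Array (Array, bounds, listArray, range, (!), (//))
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafePerformIO)

-- A handle is either the current flat array, or a diff node recording
-- "index j used to hold old", layered over a newer version.
data DiffData e
  = Current (Array Int e)
  | Diff Int e (DiffArray e)

newtype DiffArray e = DA (IORef (DiffData e))

fromListDA :: [e] -> DiffArray e
fromListDA xs =
  DA (unsafePerformIO (newIORef (Current (listArray (0, length xs - 1) xs))))
{-# NOINLINE fromListDA #-}

readDA :: DiffArray e -> Int -> e
readDA (DA ref) i = unsafePerformIO $ do
  d <- readIORef ref
  case d of
    Current arr -> pure (arr ! i)
    Diff j old newer
      | i == j    -> pure old              -- shadowed by this diff
      | otherwise -> pure (readDA newer i) -- walk towards the current array
{-# NOINLINE readDA #-}

-- Linear use hits the fast path; updating a stale handle falls back to
-- an O(n) copy, as in the real package.
writeDA :: DiffArray e -> Int -> e -> DiffArray e
writeDA (DA ref) i x = unsafePerformIO $ do
  d <- readIORef ref
  case d of
    Current arr -> do
      -- Hand the storage to the new version and demote the old handle
      -- to a diff node, so old readers still see the old value at i.
      newer <- DA <$> newIORef (Current (arr // [(i, x)]))
      writeIORef ref (Diff i (arr ! i) newer)
      pure newer
    Diff {} -> do
      newRef <- newIORef (Current (flatten (DA ref) // [(i, x)]))
      pure (DA newRef)
{-# NOINLINE writeDA #-}

-- O(n) snapshot of a (possibly stale) version.
flatten :: DiffArray e -> Array Int e
flatten da = listArray b [readDA da i | i <- range b]
  where b = boundsDA da

boundsDA :: DiffArray e -> (Int, Int)
boundsDA (DA ref) = unsafePerformIO $ do
  d <- readIORef ref
  case d of
    Current arr    -> pure (bounds arr)
    Diff _ _ newer -> pure (boundsDA newer)
```

The point is that `writeDA` only does O(1) work when the handle being written is the newest version; every older handle degrades into a chain of diff nodes that reads have to walk.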
Unfortunately this apparently never took off because it turned out to be super slow.
The package in question is here: diffarray: DiffArray
The relevant discussion: #2727: DiffArray performance unusable for advertized purpose · Issues · Glasgow Haskell Compiler / GHC · GitLab
So I had a go at optimizing the implementation. I couldn’t really find a live repository for this package so I just took the source and appended the test from the issue at the bottom.
The first thing I noticed is that the test only really updates the array once per loop iteration, and the tested array is only 1000 elements long, so it isn’t that surprising that the standard immutable array is faster: the copy is probably cheap, and most of the work is reading from the array.
So I commented out the update and focused on read performance.
The test I’m running uses a 10,000-element array and 100,000 loops.
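For reference, the benchmark has roughly this shape (a hypothetical reconstruction, not the gist’s actual code; the names are mine, and the per-iteration write from the original test is the part I commented out):

```haskell
import Data.Array (Array, listArray, (!))
import Data.List (foldl')

-- Build the array once, then run `loops` passes that each read every
-- element and fold the values into an accumulator.
bench :: Int -> Int -> Int
bench n loops = go loops 0
  where
    arr :: Array Int Int
    arr = listArray (0, n - 1) [0 .. n - 1]

    go :: Int -> Int -> Int
    go 0 acc = acc
    go k acc = go (k - 1) (acc + sumReads arr)

    sumReads :: Array Int Int -> Int
    sumReads a = foldl' (\s i -> s + a ! i) 0 [0 .. n - 1]
```

With `n = 10000` and `loops = 100000` that is a billion reads, which is why read-path overhead dominates everything below.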
These are the initial results with Data.Array:
```
   9,248,936 bytes allocated in the heap
     573,368 bytes copied during GC
     243,376 bytes maximum residency (1 sample(s))
     190,800 bytes maximum slop
         103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     2 par    0.003s   0.001s     0.0005s    0.0006s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  Parallel GC work balance: 24.91% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.012s  (  0.096s elapsed)
  MUT     time    1.300s  (  1.297s elapsed)
  GC      time    0.003s  (  0.001s elapsed)
  EXIT    time    0.004s  (  0.006s elapsed)
  Total   time    1.319s  (  1.400s elapsed)

  Alloc rate    7,114,456 bytes per MUT second

  Productivity  98.6% of total user, 92.6% of total elapsed
```
and here are the initial results for the diffarray:
```
  40,003,169,280 bytes allocated in the heap
      50,613,928 bytes copied during GC
         310,120 bytes maximum residency (2 sample(s))
         234,648 bytes maximum slop
             103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9573 colls,  9573 par   11.360s   5.451s     0.0006s    0.0260s
  Gen  1         2 colls,     1 par    0.003s   0.002s     0.0008s    0.0014s

  Parallel GC work balance: 0.43% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.011s  (  0.050s elapsed)
  MUT     time   52.512s  ( 48.944s elapsed)
  GC      time   11.362s  (  5.453s elapsed)
  EXIT    time    0.003s  (  0.004s elapsed)
  Total   time   63.888s  ( 54.450s elapsed)

  Alloc rate    761,786,038 bytes per MUT second

  Productivity  82.2% of total user, 89.9% of total elapsed
```
Ouch. 50 times slower. And it also allocates like crazy for some reason even though it should only be reading.
So I did the following:
- Enabled Strict on the whole module
- Replaced MVar with IORef
- Added the array bounds and element count to the DiffArray structure, to avoid reading the IORef for bounds checks
- Replaced unsafePerformIO with unsafeDupablePerformIO
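Sketched as code, the reshaped read path looks roughly like this (a cut-down, hypothetical illustration of the changes, with the diff chain omitted entirely and field names of my own invention):

```haskell
{-# LANGUAGE Strict #-}
import Data.Array (Array, listArray, (!))
import Data.IORef (IORef, newIORef, readIORef)
import System.IO.Unsafe (unsafeDupablePerformIO)

-- Bounds and size live outside the IORef, so bounds checks never have
-- to dereference it; the IORef replaces the MVar the package used.
data DiffArray e = DiffArray
  { daBounds :: (Int, Int)           -- cached bounds
  , daSize   :: Int                  -- cached element count
  , daRef    :: IORef (Array Int e)  -- payload (diff chain omitted here)
  }

fromListDA :: [e] -> IO (DiffArray e)
fromListDA xs = do
  let n = length xs
  ref <- newIORef (listArray (0, n - 1) xs)
  pure (DiffArray (0, n - 1) n ref)

readDA :: DiffArray e -> Int -> e
readDA da i
  | i < l || i > u = error "readDA: index out of bounds"
  | otherwise      =
      -- unsafeDupablePerformIO skips the once-only locking that
      -- unsafePerformIO does; whether that swap is sound here is
      -- exactly the open safety question below.
      unsafeDupablePerformIO (fmap (! i) (readIORef (daRef da)))
  where
    (l, u) = daBounds da
```

The cached `daBounds` is what removes the IORef dereference from the hot path; the payload read itself still has to go through the ref.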
I haven’t really thought about safety, so I don’t know how valid the MVar-to-IORef and unsafePerformIO-to-unsafeDupablePerformIO swaps are, but here are the new results:
```
  10,368,768 bytes allocated in the heap
     573,520 bytes copied during GC
     243,376 bytes maximum residency (1 sample(s))
     190,800 bytes maximum slop
         103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     2 par    0.003s   0.001s     0.0005s    0.0005s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0003s    0.0003s

  Parallel GC work balance: 44.89% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.012s  (  0.064s elapsed)
  MUT     time    2.100s  (  2.095s elapsed)
  GC      time    0.003s  (  0.001s elapsed)
  EXIT    time    0.006s  (  0.010s elapsed)
  Total   time    2.121s  (  2.171s elapsed)

  Alloc rate    4,937,692 bytes per MUT second

  Productivity  99.0% of total user, 96.5% of total elapsed
```
Pretty close to the basic array performance.
A few reasons why I’m posting this:
- Perhaps there have already been attempts at optimizing the implementation that I’m not aware of, so I’m hesitant to invest more time into this without first checking.
- I’d like to know whether my changes are safe or not.
- This is more or less at the limit of my Haskell optimization knowledge, so perhaps someone can jump in and get this on par with array.
- I’d be interested in a general discussion about diffarrays. They always seemed super useful to me, so it’s always been weird that there’s been no effort to revive them.
Both the implementation (with my changes) and the test are in this gist: diffarray.hs · GitHub
There’s a cabal preamble at the top, so the file should be runnable with `cabal run diffarray.hs -- +RTS -s`
The above results are from GHC 9.4.8, on a mobile i9-13905H chip.