Optimizing diffarray

Diffarray has always been such a cool concept to me. If you don’t know, it’s a constant-time update array with a pure interface. The way it works is that when you update an element, the array is mutated under the hood and returned as a new array, while the old one is swapped out for a pointer to the new array plus the list of changed elements.

This way you can pretend you’re always making a fresh copy when mutating, but you don’t pay the price as long as you use the array linearly.
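In code, the idea looks roughly like this (a schematic sketch of my own, not the package’s actual definitions):

import Data.Array.IO (IOArray)
import Data.IORef (IORef)

data DiffArray e = DA (IORef (DiffArrayData e))

data DiffArrayData e
  = Current (IOArray Int e)        -- the newest version owns the real mutable array
  | Diff (DiffArray e) [(Int, e)]  -- an old version: a pointer towards the newer
                                   -- version plus the old values at the changed indices

An update mutates the underlying array in place, wraps it in a fresh Current, and repoints the old version’s reference at a Diff node, so reads through old references can still recover the values they expect.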

Unfortunately this apparently never took off because it turned out to be super slow.

The package in question is here: diffarray: DiffArray
The relevant discussion: #2727: DiffArray performance unusable for advertized purpose · Issues · Glasgow Haskell Compiler / GHC · GitLab

So I had a go at optimizing the implementation. I couldn’t really find a live repository for this package so I just took the source and appended the test from the issue at the bottom.

The first thing I noticed is that the test only really updates the array once per loop, and that the tested array is only 1000 elements long, so the fact that the standard immutable array performs faster isn’t that surprising. The copy is probably cheap and most of the work is reading from the array.
So I commented out the update and focused on read performance.

The test I’m running uses a 10000-element array and 100000 loops.
These are the initial results with Data.Array:

       9,248,936 bytes allocated in the heap
         573,368 bytes copied during GC
         243,376 bytes maximum residency (1 sample(s))
         190,800 bytes maximum slop
             103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     2 par    0.003s   0.001s     0.0005s    0.0006s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  Parallel GC work balance: 24.91% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.012s  (  0.096s elapsed)
  MUT     time    1.300s  (  1.297s elapsed)
  GC      time    0.003s  (  0.001s elapsed)
  EXIT    time    0.004s  (  0.006s elapsed)
  Total   time    1.319s  (  1.400s elapsed)

  Alloc rate    7,114,456 bytes per MUT second

  Productivity  98.6% of total user, 92.6% of total elapsed

and here are the initial results for the diffarray:

  40,003,169,280 bytes allocated in the heap
      50,613,928 bytes copied during GC
         310,120 bytes maximum residency (2 sample(s))
         234,648 bytes maximum slop
             103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      9573 colls,  9573 par   11.360s   5.451s     0.0006s    0.0260s
  Gen  1         2 colls,     1 par    0.003s   0.002s     0.0008s    0.0014s

  Parallel GC work balance: 0.43% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.011s  (  0.050s elapsed)
  MUT     time   52.512s  ( 48.944s elapsed)
  GC      time   11.362s  (  5.453s elapsed)
  EXIT    time    0.003s  (  0.004s elapsed)
  Total   time   63.888s  ( 54.450s elapsed)

  Alloc rate    761,786,038 bytes per MUT second

  Productivity  82.2% of total user, 89.9% of total elapsed

Ouch. 50 times slower. And it also allocates like crazy for some reason even though it should only be reading.

So I did the following:

  • Enabled Strict on the whole module
  • Replaced MVar with IORef
  • Added array bounds and number of elements to the DiffArray structure to avoid reading the IORef for bound checks
  • Replaced unsafePerformIO with unsafeDupablePerformIO
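Concretely, the reworked shape looks something like this (a sketch with my own field names, not necessarily what’s in the gist):

{-# LANGUAGE Strict #-}
import Data.Array.IO (IOArray)
import Data.IORef (IORef)

data DiffArray i e = DiffArray
  { daBounds :: (i, i)                     -- cached so (!) can bounds-check
  , daSize   :: Int                        --   without reading the IORef
  , daData   :: IORef (DiffArrayData i e)  -- was an MVar
  }

data DiffArrayData i e
  = Current (IOArray i e)
  | Diff (DiffArray i e) [(Int, e)]

-- reads are wrapped in unsafeDupablePerformIO instead of unsafePerformIO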

I haven’t really thought about safety, so I don’t know how valid the MVar-to-IORef and unsafePerformIO-to-unsafeDupablePerformIO swaps are, but here are the new results:

      10,368,768 bytes allocated in the heap
         573,520 bytes copied during GC
         243,376 bytes maximum residency (1 sample(s))
         190,800 bytes maximum slop
             103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     2 par    0.003s   0.001s     0.0005s    0.0005s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0003s    0.0003s

  Parallel GC work balance: 44.89% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.012s  (  0.064s elapsed)
  MUT     time    2.100s  (  2.095s elapsed)
  GC      time    0.003s  (  0.001s elapsed)
  EXIT    time    0.006s  (  0.010s elapsed)
  Total   time    2.121s  (  2.171s elapsed)

  Alloc rate    4,937,692 bytes per MUT second

  Productivity  99.0% of total user, 96.5% of total elapsed

Pretty close to the basic array performance.

A couple of reasons why I’m posting this:

  • Perhaps there have already been attempts at optimizing the implementation that I’m not aware of, so I’m hesitant to invest more time into this without first checking.
  • I’d like to know whether my changes are safe or not.
  • This is more or less at the limit of my Haskell optimization knowledge so perhaps someone can jump in and get this on-par with array.
  • I’d be interested in a general discussion about diffarrays. They always seemed super useful to me, so it always struck me as odd that there was no effort to revive them.

Both the implementation (with my changes) and the test are in this gist: diffarray.hs · GitHub

There’s a cabal preamble at the top so the file should be runnable with cabal run diffarray.hs -- +RTS -s

The above results are from GHC 9.4.8, running on a mobile i9-13905H chip.

8 Likes

This is the first time I’ve heard about diff arrays and it seems pretty cool to me.

There will probably always be some overhead due to the need for allocating on every change, but the garbage collector is also optimized for this use case of high throughput and low liveness, so I’d say you should indeed be able to get close to the performance of mutable arrays if you use them properly.

If I understand it correctly, your benchmark is only testing the performance of reading values from the array. With mutable arrays that would be a single pointer dereference and now the diffarray adds one extra layer of indirection, so it would naively take twice as long. However, one of the pointers will always stay the same (if used linearly) and thus probably be cached on modern CPUs, so perhaps you could get close to matching the performance if you’re only reading from the array.

One thing I’d try is to start with a lower-level API. I’d choose the interface from the primitive package: Data.Primitive.Array (or even just Array# directly if you’re not afraid of unboxed types).

And I would certainly leave out all those boring instances if you’re still just testing things. Just pick one type, like Integer, and do all your tests with that.
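For reference, the basic operations of that API look like this (a minimal sketch, monomorphic in Int):

import Data.Primitive.Array (newArray, readArray, writeArray)

-- newArray/readArray/writeArray are thin wrappers over the MutableArray# primops
demo :: IO Int
demo = do
  arr <- newArray 10 (0 :: Int)  -- 10 elements, all initialised to 0
  writeArray arr 3 42
  readArray arr 3                -- 42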

1 Like

There’s some more overhead with bounds checking, but interestingly the performance gets even closer to array when doing unsafe lookups (in both cases, for array and diffarray alike).

Also, I’ve included the whole source for completeness, in case someone else wants to experiment with writes or unboxed performance.

This is an interesting structure, but I can’t think of where I would want to use it.

If the task is single-threaded and performance is important, I would use a mutable array because there will remain some overhead of using diffarray. If performance is not important, I would simply use a tree-based sequence like Seq.
Also, you have to be careful to not accidentally use old arrays, because indexing and updating become O(n). At least with Seq the operations are guaranteed O(log n). Thinking about it, the list of diffs could be changed to a Map to get O(log n) indexing, but this increases the update cost to O(d log d), so it’s not a clear win.
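To illustrate the Map idea, a single diff layer could be keyed by index (a sketch; the representation and names are made up):

import qualified Data.IntMap.Strict as IntMap

-- Reading through one layer of d diffs becomes O(log d) instead of the
-- O(d) scan of an association list; 'next' is whatever points towards
-- the newer version of the array.
lookThrough :: IntMap.IntMap e -> next -> Int -> Either next e
lookThrough m next i = maybe (Left next) Right (IntMap.lookup i m)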

It’s mainly an issue of ergonomics, I think. Mutable arrays force you into monadic code. If you could get 90% of the performance while remaining pure, I feel this would be worth it in a lot of cases.

4 Likes

Here’s something I threw together quickly:

{-# LANGUAGE MagicHash, UnboxedTuples, UnliftedDatatypes #-}
{-# OPTIONS_GHC -Wall -Wno-name-shadowing #-}
import GHC.Exts hiding (fromList)

import Data.Kind (Type)

data DiffArray a = DA (MutVar# RealWorld (DiffArrayData a))
type DiffArrayData :: Type -> UnliftedType
data DiffArrayData a
  = Here {-# UNPACK #-} !(MutableArray# RealWorld a)
  | There -- (Seq a)

fromList :: [a] -> DiffArray a
fromList xs = DA (runRW# $ \s ->
  -- allocate the backing MutableArray#, fill it, then wrap it in a MutVar#
  case newArray# (case length xs of I# x -> x) undefined s of
  { (# s , arr #) ->
  case newMutVar# (Here arr) (go arr 0# xs s) of
  { (# _ , x #) -> x
  }})
  where
    go _ _ [] s = s
    go arr i (x:xs) s = go arr (i +# 1#) xs (writeArray# arr i x s)

(!) :: DiffArray a -> Int -> a
-- reads pay one extra indirection compared to a plain array: first the
-- MutVar#, then the underlying MutableArray#
(!) (DA arr) (I# i) = runRW# $ \s ->
  case readMutVar# arr s of
    (# s , Here arr #) ->
      case readArray# arr i s of
        (# _ , x #) -> x
    (# _ , There #) -> error "Non-linear use" -- Seq.index seq (I# i)

type Arr = DiffArray Int

-- data Array a = A (MutableArray# RealWorld a)
-- 
-- fromList :: [a] -> Array a
-- fromList xs = A (runRW# $ \s -> 
--   case newArray# (case length xs of I# x -> x) undefined s of
--   { (# s , arr #) -> go arr 0# xs s
--   })
--   where
--     go arr _ [] _ = arr
--     go arr i (x:xs) s = go arr (i +# 1#) xs (writeArray# arr i x s)
-- 
-- (!) :: Array a -> Int -> a
-- (!) (A arr) (I# i) = runRW# $ \s -> 
--   case readArray# arr i s of
--     (# _ , x #) -> x
-- 
-- type Arr = Array Int

maxPos :: Int
maxPos = 10000
maxLoops :: Int
maxLoops = 100000

loop :: Arr -> Int -> Int -> Arr
loop arr i c
    | i > maxPos = arr -- // [(maxPos, c)]
    | otherwise = loop arr (i + 1) $! arr ! i + c

loop2 :: Int -> Arr -> Arr
loop2 i arr
    | i > maxLoops = arr
    | otherwise = loop2 (i + 1) $ loop arr 1 0

main :: IO ()
main = print $ loop2 1 arr ! maxPos

arr :: Arr
arr = fromList (replicate (maxPos + 1) 1)

Normal Array:

1
         610,568 bytes allocated in the heap
           3,272 bytes copied during GC
          44,328 bytes maximum residency (1 sample(s))
          25,304 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0004s    0.0004s

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time    1.143s  (  1.143s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.012s elapsed)
  Total   time    1.146s  (  1.158s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    534,212 bytes per MUT second

  Productivity  99.7% of total user, 98.7% of total elapsed

DiffArray:

1
         610,600 bytes allocated in the heap
           3,272 bytes copied during GC
          44,328 bytes maximum residency (1 sample(s))
          25,304 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0003s    0.0003s

  INIT    time    0.004s  (  0.004s elapsed)
  MUT     time    1.573s  (  1.588s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.011s elapsed)
  Total   time    1.577s  (  1.603s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    388,129 bytes per MUT second

  Productivity  99.7% of total user, 99.0% of total elapsed

That obviously does not implement the diffarray structure properly, but it should not influence the performance characteristics of simple reads, which is what we are benchmarking here.

Edit: strangely, the performance difference disappears pretty much completely if I compile and run this program on my AMD desktop instead of my MacBook.

6 Likes

IOArray is already basically a MutableArray#, just with extra index tracking, which you’d still have to reimplement here to write an IArray instance.
IORef is almost a MutVar#, except it’s a data wrapper rather than a newtype, which adds one layer of indirection. I’ll see if I can just swap that in and see what the performance is like.

Also, I’m not sure you’d still be able to have an unlifted type if you implemented the There branch, since that’s the one with a reference back to the original DiffArray? But I don’t know enough about this so correct me if I’m wrong.

I’ve now put together a slightly better setup with, imo, a better benchmark (it does random-ish reads based on values read from the array).

I’ve now also added a set function to the DiffArray implementation, but I’m not using it yet. Edit: I see now that this implementation is not very good. Edit 2: Fixed now.

On my AMD desktop the results are:

  diffarray: OK
    22.5 μs ± 870 ns,  43 B  allocated,   0 B  copied, 7.0 MB peak memory
  array:     OK
    22.8 μs ± 1.9 μs,  48 B  allocated,   0 B  copied, 7.0 MB peak memory

On my macbook, I get slightly worse performance:

  diffarray: OK
    30.7 μs ± 3.0 μs,  25 B  allocated,   0 B  copied, 7.0 MB peak memory
  array:     OK
    30.8 μs ± 498 ns,  38 B  allocated,   0 B  copied, 7.0 MB peak memory

But it is not very consistent, this was the result of another run:

  diffarray: OK
    29.8 μs ± 2.4 μs,   0 B  allocated,   0 B  copied, 7.0 MB peak memory
  array:     OK
    26.4 μs ± 2.1 μs,   0 B  allocated,   0 B  copied, 7.0 MB peak memory

I’ve now also added “old read” benchmarks, which show how performance degrades if we’re reading from old versions of the array. These are the results:

  array:        OK
    22.2 μs ± 539 ns,  42 B  allocated,   0 B  copied, 6.0 MB peak memory
  diffarray:    OK
    22.4 μs ± 1.6 μs,  63 B  allocated,   0 B  copied, 6.0 MB peak memory
  diffarray 1:  OK
    22.3 μs ± 285 ns,  56 B  allocated,   0 B  copied, 6.0 MB peak memory
  diffarray 2:  OK
    22.6 μs ± 781 ns,  59 B  allocated,   0 B  copied, 6.0 MB peak memory
  diffarray 5:  OK
    22.7 μs ± 1.5 μs,  64 B  allocated,   0 B  copied, 6.0 MB peak memory
  diffarray 7:  OK
    26.3 μs ± 523 ns,  56 B  allocated,   0 B  copied, 6.0 MB peak memory
  diffarray 10: OK
    39.7 μs ± 2.7 μs,  73 B  allocated,   1 B  copied, 6.0 MB peak memory

The number in the benchmark name indicates how many changes (and thus layers of indirection) there are between the original array and the current array.


I’ve now also added a proper quicksort implementation. I haven’t had time to benchmark it yet.


Here are benchmark results for my quicksort:

  quicksort array: OK
    1.31 ms ± 101 μs,  79 KB allocated,  59 B  copied,  11 MB peak memory
  quicksort diffarray:  OK
    4.88 ms ± 429 μs,  18 MB allocated, 1.2 KB copied,  25 MB peak memory

Using the wrong swap implementation causes quite a big slowdown:

  quicksort diffarray:  OK
    16.8 ms ± 1.1 ms,  24 MB allocated,  24 MB copied,  52 MB peak memory
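The wrong implementation isn’t shown here, but the characteristic mistake with diff arrays is reading from an array value after a newer version has already been produced. A sketch in terms of the pure (!) and (//) operations (signatures assumed, and assuming the reads are evaluated strictly):

{-# LANGUAGE BangPatterns #-}

import DiffArray (DiffArray, (!), (//))  -- the gist's module (assumed)

-- After the first (//), the binding 'a' refers to an old version, so
-- the second read has to go through a diff layer:
swapSlow :: DiffArray Int e -> Int -> Int -> DiffArray Int e
swapSlow a i j =
  let !a' = a // [(i, a ! j)]
  in a' // [(j, a ! i)]  -- slow: reads an old version

-- Forcing both reads before producing the new version keeps every
-- access on the newest array:
swapGood :: DiffArray Int e -> Int -> Int -> DiffArray Int e
swapGood a i j =
  let !x = a ! i
      !y = a ! j
  in a // [(i, y), (j, x)]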

If you add an explicit copy before the quicksort the performance improves:

  quicksort diffarray copy: OK
    3.18 ms ± 202 μs,  14 MB allocated,  29 KB copied,  11 MB peak memory

So, diff arrays are about 2.5x slower than monadic mutable arrays if used properly.


We can get better performance if we make the swap operation built-in:

  quicksort array:          OK
    1.39 ms ±  89 μs,  79 KB allocated,  60 B  copied,  11 MB peak memory
  quicksort diffarray:      OK
    3.34 ms ± 190 μs,  14 MB allocated, 1.0 KB copied,  21 MB peak memory
  quicksort diffarray copy: OK
    2.39 ms ±  50 μs,  11 MB allocated,  57 KB copied,  24 MB peak memory

That’s only 1.7x slower.
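A built-in swap can do both writes against the live array and leave behind a single two-entry diff layer instead of two chained one-entry layers. Roughly, against the schematic Current/Diff representation (a sketch of the idea, not my actual implementation, and handling only linear use):

{-# LANGUAGE LambdaCase #-}

import Data.Array.IO (IOArray, readArray, writeArray)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafeDupablePerformIO)

data DiffArray e = DA (IORef (DiffArrayData e))
data DiffArrayData e
  = Current (IOArray Int e)
  | Diff (DiffArray e) [(Int, e)]

swap :: DiffArray e -> Int -> Int -> DiffArray e
swap (DA ref) i j = unsafeDupablePerformIO $
  readIORef ref >>= \case
    Current arr -> do
      x <- readArray arr i
      y <- readArray arr j
      writeArray arr i y                               -- two in-place writes...
      writeArray arr j x
      new <- newIORef (Current arr)
      writeIORef ref (Diff (DA new) [(i, x), (j, y)])  -- ...one diff layer for the old version
      pure (DA new)
    Diff{} -> error "swap: old version (sketch only handles linear use)"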

2 Likes

Which is good!

Consider jaror’s quicksort above.

If you have to be careful to enforce a sequence of operations on your DiffArray, that’s just monadic code, only it’s more error-prone now.

You only have to be careful in some specific places, like tight loops where performance really matters. In other places you can be much more flexible in what you want to optimize. I think some debug tooling could also help a lot.

I think a great use case would be a backtracking solver (e.g. sudoku or chess), because the diff arrays essentially handle the undos for you.

One catch is that you don’t want to do any copying in that case; instead you want to “steal” back the old version of the array.

1 Like

So I’ve replaced the IORef with a custom UnliftedIORef (just newtype wrapping the MutVar# while keeping the IO interface) and this improved the performance to these values:

      10,369,312 bytes allocated in the heap
         573,520 bytes copied during GC
         243,376 bytes maximum residency (1 sample(s))
         190,800 bytes maximum slop
             103 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     2 par    0.002s   0.001s     0.0004s    0.0005s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  Parallel GC work balance: 10.48% (serial 0%, perfect 100%)

  TASKS: 42 (1 bound, 41 peak workers (41 total), using -N20)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.012s  (  0.064s elapsed)
  MUT     time    1.845s  (  1.841s elapsed)
  GC      time    0.002s  (  0.001s elapsed)
  EXIT    time    0.003s  (  0.004s elapsed)
  Total   time    1.862s  (  1.910s elapsed)

  Alloc rate    5,619,843 bytes per MUT second

  Productivity  99.1% of total user, 96.4% of total elapsed

It’s still not on par with array, like your implementation is, but I wonder why that is exactly. I’ve tried unlifting the datatype but this didn’t change the performance in my case. I get that I’m using IOArray instead of MutableArray# directly, but I don’t see why this would matter.

https://gist.github.com/LukaHorvat/711196cb00de4da4c49bb9ed81b061e1/revisions#diff-41d7d60984859615e9046678c77e420ddfe58251c1c93e32cd45089b14d848d8

I think the most important thing is to avoid being polymorphic in the array type. Polymorphism requires a layer of indirection. If you specialize DiffArrayData to IOArray then you can {-# UNPACK #-} it and remove that indirection.

Another easy thing you can do is to make the DiffArrayData unlifted:

type UnliftedIORef :: UnliftedType -> UnliftedType
newtype UnliftedIORef a = UnliftedIORef (MutVar# RealWorld a)

type DiffArrayData :: Type -> Type -> UnliftedType
data DiffArrayData i e
    = Current {-# UNPACK #-} !(IOArray i e)
    | Diff (DiffArray i e) [(Int, e)]

Specializing and unpacking the array does improve the performance a bit (1.8s to 1.6s).
Making the type unlifted didn’t help, though, and it complicated the code a bit, since you can no longer have a readUnliftedIORef :: UnliftedIORef a -> IO a. I made it work with CPS, like for newUnliftedIORef.
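For the record, the CPS-style read looks something like this (a sketch; names are mine):

{-# LANGUAGE MagicHash, UnboxedTuples, UnliftedNewtypes, StandaloneKindSignatures #-}

import GHC.Exts (MutVar#, RealWorld, UnliftedType, readMutVar#)
import GHC.IO (IO (IO))

type UnliftedIORef :: UnliftedType -> UnliftedType
newtype UnliftedIORef a = UnliftedIORef (MutVar# RealWorld a)

-- 'IO a' requires a lifted result type, so instead of returning the
-- unlifted value we pass it to a continuation:
readUnliftedIORef :: UnliftedIORef a -> (a -> IO r) -> IO r
readUnliftedIORef (UnliftedIORef ref) k =
  IO (\s -> case readMutVar# ref s of
              (# s', x #) -> case k x of IO f -> f s')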

At this point you’ve definitely invested more time into this than me. Do you have any plans to package it up? I think it would be a great contribution to the ecosystem.

1 Like

Publishing it as a library is a good idea. I was thinking it would need a lot more polish and worked out use-cases, but shipping fast and extending it later could be a better strategy. I’ll see what I can do.

I’ve uploaded a candidate:

https://hackage.haskell.org/package/fleet-array-0.1.0.0/candidate

5 Likes

I’ve been experimenting with using weak pointers to remove unused versions from the history chain. If a sequence of versions somewhere in the middle of the chain is unused, their diffs could be combined into an IntMap, which has logarithmic-time lookup, so this could significantly improve the time it takes to read from old versions of the array.

I tried to start with a simple experiment: implement a list whose elements disappear when they are no longer directly referenced. I think that nicely models the behaviour we want. Here’s how you could implement that:

{-# LANGUAGE LambdaCase #-}

import System.Mem.Weak
import Data.IORef
import Data.Coerce
import System.Mem
import Control.Concurrent

newtype List a = List (IORef (ListData a))
data ListData a = Nil | Cons a (Weak (List a)) (IORef (ListData a))

nil :: IO (List a)
nil = List <$> newIORef Nil

cons :: a -> List a -> IO (List a)
cons x (List xs) = do
  r1 <- newIORef =<< readIORef xs
  r2 <- newIORef Nil
  -- the weak pointer is keyed on the new cell itself, so it stays alive
  -- only while that cell is directly referenced from outside the list
  w <- mkWeakIORef r2 (pure ())
  writeIORef r2 (Cons x (coerce w) r1)
  pure (List r2)

toList :: List a -> IO [a]
toList (List r) =
  readIORef r >>= \case
    Nil -> pure []
    Cons x w r' -> do
      deRefWeak w >>= \case
        Nothing -> toList (List r')
        Just _ -> (x :) <$> toList (List r')

main :: IO ()
main = do
  xs <- cons 3 =<< cons 2 =<< cons (1 :: Int) =<< nil
  ys <- cons 6 =<< cons 5 =<< cons 4 xs
  print =<< toList xs
  print =<< toList ys
  performGC
  threadDelay 100000
  print =<< toList xs
  print =<< toList ys

From the output you can see that elements that are not directly referenced are removed:

[3,2,1]
[6,5,4,3,2,1]
[3]
[6,3]

Of course this would probably have way too much overhead, but it does show that, in principle, it’s possible to get the GC to help us.

2 Likes

Sounds like a great idea to me. Though to be honest, I think the point of a diff array is to use it linearly. The fact that the old references are still valid always seemed like an unnecessary feature that’s only there so the interface is complete. I would, for example, be fine with a diff array that errored if you tried to access old copies.
In fact, I could see some cases where this would be preferable behavior, because it would let you know immediately that you’re using it wrong instead of just silently performing worse.

Perhaps there could be a linearAt function (and friends) that only works on the head, but this would only be useful when you’re working with the array directly. You couldn’t really use any pre-existing functions that work on any array.
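Against the schematic Current/Diff representation from earlier, such a function could be as simple as this (a sketch):

{-# LANGUAGE LambdaCase #-}

import Data.Array.IO (IOArray, readArray)
import Data.IORef (IORef, readIORef)
import System.IO.Unsafe (unsafeDupablePerformIO)

data DiffArray e = DA (IORef (DiffArrayData e))
data DiffArrayData e
  = Current (IOArray Int e)
  | Diff (DiffArray e) [(Int, e)]

-- fail loudly on non-linear use instead of silently degrading to O(d) reads
linearAt :: DiffArray e -> Int -> e
linearAt (DA ref) i = unsafeDupablePerformIO $
  readIORef ref >>= \case
    Current arr -> readArray arr i
    Diff _ _    -> error "linearAt: array accessed through an old version"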

I would, for example, be fine with a diff array that errored if you tried to access old copies.

… isn’t that just a regular array? :thinking:

This one lets you mutate in O(1) time without a monadic interface.