One of the most important applications of Linear Haskell is to support pure mutable arrays and thereby improve the performance of array-based algorithms. However, while playing with Linear Haskell, an example caught my eye. Suppose we implement Quicksort using the facilities provided by the linear-base package. Compare the very naive implementation:
qs :: Ord a => [a] -> [a]
qs [] = []
qs [x] = [x]
qs (x:xs) = qs [y | y <- xs, y < x] ++ x : qs [y | y <- xs, y >= x]
to the example implementation provided by the linear-base package: linear-base/examples/Simple/Quicksort.hs at master · tweag/linear-base · GitHub
we would expect the linear implementation to perform better. However, the actual results show that the naive algorithm wins! On my device, sorting a random Int list of length 10^7 takes the naive algorithm 23s, while the linear algorithm takes 70s (all tests compiled with the -O2 flag). Something is clearly wrong.
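For reference, the naive version can be exercised with a small self-contained harness. Note that the LCG below is my own stand-in input generator, chosen so the sketch needs no package beyond base; the post's actual benchmark used random Ints and a much larger list:

```haskell
import Data.List (sort)

-- The naive quicksort from above
qs :: Ord a => [a] -> [a]
qs [] = []
qs (x:xs) = qs [y | y <- xs, y < x] ++ x : qs [y | y <- xs, y >= x]

-- A tiny linear congruential generator, so the input is reproducible
-- without depending on the random package (an assumption of this sketch)
lcg :: Int -> Int
lcg s = (6364136223846793005 * s + 1442695040888963407) `mod` (2 ^ 31)

pseudoRandoms :: Int -> Int -> [Int]
pseudoRandoms n seed = take n (tail (iterate lcg seed))

main :: IO ()
main = do
  let xs = pseudoRandoms 10000 42
  -- Sanity check against the standard library sort
  print (qs xs == sort xs)  -- prints True
```

Replacing the sanity check with a timed run over 10^7 elements reproduces the numbers quoted above.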
After some inspection of the code, I think the main culprit of the performance loss is allocation in the innermost loop. Consider the following code snippet:
partition :: Array Int %1 -> Int -> Int -> Int -> (Array Int, Ur Int)
partition arr pivot lx rx
  | rx < lx = (arr, Ur (lx - 1))
  | otherwise =
      Array.read arr lx
        & \(Ur lVal, arr1) ->
          Array.read arr1 rx
            & \(Ur rVal, arr2) -> case (lVal <= pivot, pivot < rVal) of
              (True, True) -> partition arr2 pivot (lx + 1) (rx - 1)
              (True, False) -> partition arr2 pivot (lx + 1) rx
              (False, True) -> partition arr2 pivot lx (rx - 1)
              (False, False) ->
                swap arr2 lx rx
                  & \arr3 -> partition arr3 pivot (lx + 1) (rx - 1)
We see that tuples are spilled all over the place. Tuples are lifted types: they add a layer of indirection over the actual data, and creating them requires heap allocation. Furthermore, Ur is a data type rather than a newtype, which adds further allocation overhead. These expensive allocations should be avoided as much as possible in the inner loop.
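As a minimal illustration of the kind of saving involved (this example is mine, not taken from the linear-base code), a function can return an unboxed pair instead of a lifted one; the unboxed pair is passed back in registers or on the stack, so no heap object is built for the tuple:

```haskell
{-# LANGUAGE UnboxedTuples #-}

-- Unlike the lifted pair (q, r), the unboxed pair (# q, r #) is not a
-- heap object, so returning it allocates nothing for the tuple itself.
divModU :: Int -> Int -> (# Int, Int #)
divModU n d = (# n `div` d, n `mod` d #)

main :: IO ()
main = case divModU 17 5 of
  (# q, r #) -> print (q, r)  -- prints (3,2)
```

Array.read returning something shaped like `(# Ur a, Array a #)` would let the partition loop avoid one allocation per iteration.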
Here, I propose an approach to deal with this kind of overhead in mutable-array code that uses linear types. The code can be found in GitHub - Void-Skeleton/hstest: My personal repository for putting all sorts of Haskell tests (the code there is heavily inspired by linear-base); the main module of this test is ExampleL.hs. To summarize my approach, I used two tricks to optimize away the overhead of tuples and of data Ur.
- The tuples can be trivially optimized away using unboxed tuples.
- Instead of making Ur a GADT, we make Ur a newtype, with newtype Ur a = Ur { unur :: a }. You may worry that a newtype constructor must always be linear. However, in the actual implementation we hide that constructor from the module and export only the following functions:
ur :: a -> Ur a
ur = Ur {- Nonlinear construction -}

-- unur comes from the newtype field
($-) :: (a %p -> b) %1 -> Ur a %1 -> b
($-) f = unsafeMultiplicity (\(Ur x) -> f x)

(&-) :: Ur a %1 -> (a %p -> b) %1 -> b
(&-) x f = f $- x
where unsafeMultiplicity :: (a %p -> b) %1 -> a %q -> b is an unsafe function obtained by wrapping unsafeEqualityProof. The exposed functions of the module correspond exactly to the axiomatic formulation of Ur a: a linear function from Ur a to b is just a regular function from a to b, while at the implementation level Ur a is nothing but a, removing the data constructor overhead.
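To make the shape of that module concrete, here is a minimal non-linear sketch; the linear arrows are dropped so it compiles without -XLinearTypes, and in the real module the Ur constructor is hidden by the export list rather than by comments:

```haskell
-- Sketch only: multiplicities omitted. In the real module the export
-- list is (Ur, ur, unur, ($-), (&-)) -- note: Ur without (..), so the
-- constructor itself is not exported.
newtype Ur a = Ur a

-- Nonlinear construction, standing in for the hidden constructor
ur :: a -> Ur a
ur = Ur

unur :: Ur a -> a
unur (Ur x) = x

-- Consume an Ur by handing its payload to f; in the linear version
-- this is where unsafeMultiplicity adjusts the arrow's multiplicity
($-) :: (a -> b) -> Ur a -> b
f $- (Ur x) = f x

main :: IO ()
main = print ((+ 1) $- ur (41 :: Int))  -- prints 42
```

Because Ur is a newtype, ur, unur, and ($-) all compile to the identity plus an ordinary call: no constructor is ever allocated.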
With those optimizations done, a linear implementation of quicksort sorts 10^7 elements in 13s, roughly five times faster than the linear-base implementation and faster than the naive implementation. I propose these tricks to make Linear Haskell a more viable choice for high-performance applications. I am also interested in the unsafe aspects of these tricks, and in whether I missed something that causes safety problems in my implementation.