One of the most important applications of Linear Haskell is to support pure mutable arrays and thereby improve the performance of array-based algorithms. However, while playing with Linear Haskell, an example caught my eye. Suppose we implement Quicksort using the facilities provided by the linear-base package. Compare the very naive implementation:
qs :: Ord a => [a] -> [a]
qs [] = []
qs [x] = [x]
qs (x:xs) = qs [y | y <- xs, y < x] ++ x : qs [y | y <- xs, y >= x]
to the example implementation provided by the linear-base package: linear-base/examples/Simple/Quicksort.hs at master · tweag/linear-base · GitHub
we would expect the linear implementation to perform better. However, the actual results show that the naive algorithm wins! On my device, sorting a random Int list of length 10^7 takes the naive algorithm 23s, while the linear algorithm takes 70s (all tests compiled with the -O2 flag). Something is clearly wrong.
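For reference, the naive version can be exercised with a small self-contained harness. Note that the LCG below is my own stand-in input generator, chosen so the sketch needs no package beyond base; the post's actual benchmark used random Ints and a much larger list:

```haskell
import Data.List (sort)

-- The naive quicksort from above
qs :: Ord a => [a] -> [a]
qs [] = []
qs (x:xs) = qs [y | y <- xs, y < x] ++ x : qs [y | y <- xs, y >= x]

-- A tiny linear congruential generator, so the input is reproducible
-- without depending on the random package (an assumption of this sketch)
lcg :: Int -> Int
lcg s = (6364136223846793005 * s + 1442695040888963407) `mod` (2 ^ 31)

pseudoRandoms :: Int -> Int -> [Int]
pseudoRandoms n seed = take n (tail (iterate lcg seed))

main :: IO ()
main = do
  let xs = pseudoRandoms 10000 42
  -- Sanity check against the standard library sort
  print (qs xs == sort xs)  -- prints True
```

Replacing the sanity check with a timed run over 10^7 elements reproduces the numbers quoted above.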
After some inspection of the code, I think the main culprit of the performance loss is allocation in the innermost loop. Consider the following code snippet:
partition :: Array Int %1 -> Int -> Int -> Int -> (Array Int, Ur Int)
partition arr pivot lx rx
  | rx < lx = (arr, Ur (lx - 1))
  | otherwise =
      Array.read arr lx
        & \(Ur lVal, arr1) ->
          Array.read arr1 rx
            & \(Ur rVal, arr2) -> case (lVal <= pivot, pivot < rVal) of
              (True, True) -> partition arr2 pivot (lx + 1) (rx - 1)
              (True, False) -> partition arr2 pivot (lx + 1) rx
              (False, True) -> partition arr2 pivot lx (rx - 1)
              (False, False) ->
                swap arr2 lx rx
                  & \arr3 -> partition arr3 pivot (lx + 1) (rx - 1)
We see that tuples are spilled all over the place. Tuples are lifted types: they add a layer of indirection over the actual data, and creating them requires heap allocation. Furthermore, Ur is a data type rather than a newtype, which adds further allocation overhead. These expensive allocations should be avoided as much as possible in the inner loop.
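As a minimal illustration of the kind of saving involved (this example is mine, not taken from the linear-base code), a function can return an unboxed pair instead of a lifted one; the unboxed pair is passed back in registers or on the stack, so no heap object is built for the tuple:

```haskell
{-# LANGUAGE UnboxedTuples #-}

-- Unlike the lifted pair (q, r), the unboxed pair (# q, r #) is not a
-- heap object, so returning it allocates nothing for the tuple itself.
divModU :: Int -> Int -> (# Int, Int #)
divModU n d = (# n `div` d, n `mod` d #)

main :: IO ()
main = case divModU 17 5 of
  (# q, r #) -> print (q, r)  -- prints (3,2)
```

Array.read returning something shaped like `(# Ur a, Array a #)` would let the partition loop avoid one allocation per iteration.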
Here, I propose an approach to deal with this kind of overhead in mutable-array code that uses linear types. The code can be found in GitHub - Void-Skeleton/hstest: My personal repository for putting all sorts of Haskell tests (the code there is heavily inspired by linear-base); the main module of this test is ExampleL.hs. To summarize my approach, I used two tricks to optimize away the overhead of tuples and of data Ur.
- The tuples can be trivially optimized away using unboxed tuples.
- Instead of making Ur a GADT, we make Ur a newtype, with newtype Ur a = Ur { unur :: a }. You may worry that a newtype constructor must always be linear. However, in the actual implementation we hide that constructor from the module and export only the following functions:
ur :: a -> Ur a
ur = Ur {- Nonlinear construction -}

-- unur comes from the newtype field
($-) :: (a %p -> b) %1 -> Ur a %1 -> b
($-) f = unsafeMultiplicity (\(Ur x) -> f x)

(&-) :: Ur a %1 -> (a %p -> b) %1 -> b
(&-) x f = f $- x
where unsafeMultiplicity :: (a %p -> b) %1 -> a %q -> b is an unsafe function obtained by wrapping unsafeEqualityProof. The exposed functions of the module correspond exactly to the axiomatic formulation of Ur a: a linear function from Ur a to b is just a regular function from a to b, while at the implementation level Ur a is nothing but a, removing the data constructor overhead.
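To make the shape of that module concrete, here is a minimal non-linear sketch; the linear arrows are dropped so it compiles without -XLinearTypes, and in the real module the Ur constructor is hidden by the export list rather than by comments:

```haskell
-- Sketch only: multiplicities omitted. In the real module the export
-- list is (Ur, ur, unur, ($-), (&-)) -- note: Ur without (..), so the
-- constructor itself is not exported.
newtype Ur a = Ur a

-- Nonlinear construction, standing in for the hidden constructor
ur :: a -> Ur a
ur = Ur

unur :: Ur a -> a
unur (Ur x) = x

-- Consume an Ur by handing its payload to f; in the linear version
-- this is where unsafeMultiplicity adjusts the arrow's multiplicity
($-) :: (a -> b) -> Ur a -> b
f $- (Ur x) = f x

main :: IO ()
main = print ((+ 1) $- ur (41 :: Int))  -- prints 42
```

Because Ur is a newtype, ur, unur, and ($-) all compile to the identity plus an ordinary call: no constructor is ever allocated.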
With those optimizations done, a linear implementation of quicksort sorts 10^7 elements in 13s, roughly five times faster than the linear-base implementation and faster than the naive implementation. I propose these tricks to make Linear Haskell a more viable choice for high-performance applications. I am also interested in the unsafe aspects of these tricks, and in whether I missed something that causes safety problems in my implementation.