Linearly-typed mutable vector of unboxed values (hack) for NFA benchmark

hellwolf · April 7, 2025, 10:01am

This was prompted and inspired by Let's run some NFAs — 0xd34df00d.me to write a faster linear-type based NFA match function using a mutable vector of unboxed data.

It is made especially to compare fairly with the ST monad version of the NFA match function mentioned in the article.

How it was done

First, it is a hack: A linear-type-safe API wrapping unsafePerformIO under the hood.

However, the ST monad version uses the unsafe APIs (unsafeWrite, unsafeRead, unsafeNew) directly from Data.Vector.Unboxed.Mutable, which arguably is worse.

The linear-type-safe API wrapper over Data.Vector.Unboxed.Mutable is straightforward:

import Prelude.Linear qualified as L
import Unsafe.Linear qualified as L

newtype LinearIOVector a = MkLinearIOVector (L.Ur (VM.IOVector a))

instance L.Consumable (LinearIOVector a) where
  consume (MkLinearIOVector ur) = L.consume ur

new_linear_iovec :: VM.Unbox a => Int -> LinearIOVector a
new_linear_iovec = unsafePerformIO . (MkLinearIOVector . L.Ur <$>) . VM.unsafeNew

read_linear_iovec :: VM.Unbox a => LinearIOVector a %1 -> Int -> (L.Ur a, LinearIOVector a)
read_linear_iovec (MkLinearIOVector (L.Ur vec)) i = L.toLinear unsafePerformIO $ do
  a <- (L.toLinear VM.unsafeRead) vec i
  pure (L.Ur a, MkLinearIOVector (L.Ur vec))

write_linear_iovec :: VM.Unbox a => LinearIOVector a %1 -> Int -> a -> LinearIOVector a
write_linear_iovec (MkLinearIOVector (L.Ur vec)) i a = L.toLinear unsafePerformIO $ do
  (L.toLinear VM.unsafeWrite) vec i a >> pure (MkLinearIOVector (L.Ur vec))

Comparing to ST monad, the match function using the linear-type-safe API is monad/do-notation free:

match_linear2 :: forall q. (VM.Unbox q, StateId q) => NFA q -> BS.ByteString -> MatchResult Int
match_linear2 NFA{..} bs = L.unur L.$ go initState 0 (new_linear_iovec 24_000_000) 0
  where
    go q i !stack s
      | q == finState = stack `L.lseq` L.Ur (SuccessAt i)
      | otherwise = case q `getTrans` transitions of
                      TEps q' -> go q' i stack s
                      TBranch q1 q2 -> go q1 i (write_linear_iovec stack s (q2, i)) (s + 1)
                      TCh ch q'
                        | bs `BS.indexMaybe` i == Just ch -> go q' (i + 1) stack s
                        | s == 0 -> pure Failure
                        | otherwise -> let (L.Ur (q'', i''), stack'') = read_linear_iovec stack (s - 1)
                                       in go q'' i'' stack'' (s - 1)

Performance

Generate dataset: make aa.dat aaz.dat ab.dat abz.dat. This is in line with the data used in the original article.
To run the benchmark: make bench

for i in aa aaz ab abz;do ./nfa-perf.hs linear2 $i.dat +RTS -s;done
("linear2","aa.dat",Failure)
   2,848,059,560 bytes allocated in the heap
          19,400 bytes copied during GC
     288,028,464 bytes maximum residency (2 sample(s))
       1,394,896 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       619 colls,     0 par    0.001s   0.001s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.007s   0.007s     0.0034s    0.0067s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.426s  (  0.425s elapsed)
  GC      time    0.007s  (  0.007s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.434s  (  0.434s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,692,589,459 bytes per MUT second

  Productivity  98.2% of total user, 98.1% of total elapsed

("linear2","aaz.dat",SuccessAt 20000001)
     608,060,256 bytes allocated in the heap
          15,064 bytes copied during GC
     288,028,464 bytes maximum residency (2 sample(s))
       1,394,896 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        77 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.007s   0.007s     0.0035s    0.0070s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.281s  (  0.281s elapsed)
  GC      time    0.007s  (  0.007s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.289s  (  0.289s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,165,434,650 bytes per MUT second

  Productivity  97.3% of total user, 97.3% of total elapsed

("linear2","ab.dat",Failure)
   2,848,059,560 bytes allocated in the heap
          19,416 bytes copied during GC
     288,028,464 bytes maximum residency (2 sample(s))
       1,394,896 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       621 colls,     0 par    0.001s   0.001s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.003s   0.003s     0.0017s    0.0034s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.350s  (  0.350s elapsed)
  GC      time    0.004s  (  0.004s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.355s  (  0.355s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    8,140,847,621 bytes per MUT second

  Productivity  98.7% of total user, 98.6% of total elapsed

("linear2","abz.dat",SuccessAt 20000001)
   1,728,060,256 bytes allocated in the heap
          17,240 bytes copied during GC
     288,028,464 bytes maximum residency (2 sample(s))
       1,394,896 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       349 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.003s   0.003s     0.0016s    0.0031s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.288s  (  0.288s elapsed)
  GC      time    0.004s  (  0.004s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.292s  (  0.292s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,004,032,978 bytes per MUT second

  Productivity  98.6% of total user, 98.6% of total elapsed

for i in aa aaz ab abz;do ./nfa-perf.hs st $i.dat +RTS -s;done
("st","aa.dat",Failure)
     288,058,104 bytes allocated in the heap
           9,544 bytes copied during GC
          43,584 bytes maximum residency (1 sample(s))
          30,144 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.007s   0.007s     0.0069s    0.0069s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.389s  (  0.389s elapsed)
  GC      time    0.007s  (  0.007s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.396s  (  0.396s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    740,982,909 bytes per MUT second

  Productivity  98.1% of total user, 98.1% of total elapsed

("st","aaz.dat",SuccessAt 20000001)
     288,058,800 bytes allocated in the heap
           9,544 bytes copied during GC
          43,584 bytes maximum residency (1 sample(s))
          30,144 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.007s   0.007s     0.0066s    0.0066s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.280s  (  0.280s elapsed)
  GC      time    0.007s  (  0.007s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.287s  (  0.287s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,028,819,377 bytes per MUT second

  Productivity  97.5% of total user, 97.5% of total elapsed

("st","ab.dat",Failure)
     288,058,104 bytes allocated in the heap
           9,544 bytes copied during GC
          43,584 bytes maximum residency (1 sample(s))
          30,144 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.003s   0.003s     0.0035s    0.0035s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.299s  (  0.299s elapsed)
  GC      time    0.003s  (  0.004s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.303s  (  0.303s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    962,118,372 bytes per MUT second

  Productivity  98.7% of total user, 98.6% of total elapsed

("st","abz.dat",SuccessAt 20000001)
     288,058,800 bytes allocated in the heap
           9,544 bytes copied during GC
          43,584 bytes maximum residency (1 sample(s))
          30,144 bytes maximum slop
             282 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         2 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.003s   0.003s     0.0034s    0.0034s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.268s  (  0.268s elapsed)
  GC      time    0.003s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.272s  (  0.272s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,075,067,816 bytes per MUT second

  Productivity  98.6% of total user, 98.5% of total elapsed

This is not a very scientifically rigorous experiment with many test runs, but to hint at the general performance difference:

	aa*	aa*z	ab*	ab*z
linear2	0.434s	0.289s	0.355s	0.292s
ST	0.396s	0.287s	0.303s	0.272s

The good news is that there is almost no difference. If anything, the linear2 version seems to have a little overhead after many runs.

This should not be surprising since essentially, it is the same mutable vector wrapped in different APIs with different type safetiness.

Conclusion

For the use case of mutability, the main advantage of LinearTypes over ST monad is to allow the usage of linear-typ-safe API.

Arguably, some people may also think that being able to write monad/do-notation free code is an advantage. But that seems mostly syntactical.

The downside is that, currently, you are on your own, since no library for linear, mutable unboxed data has been built and widely distributed yet.

Furthermore, I personally believe that LinearTypes offer many more use cases than providing a safe API for mutable data. I think stronger type safety should be its unique selling point. I am still exploring this, and I hope that I will be able to provide more inspiring examples in the future.

The code is available here

Topic		Replies	Views
Violating memory safety with Haskell's value restriction Links	6	709	May 26, 2025
Unboxed equality? Learn	6	639	July 25, 2024
[ANN] vector-hashtables-0.1.0.1 Announcements	4	535	September 10, 2021
Using Immutable Vector extractors on Mutable Vectors Learn	12	960	August 14, 2022
[ANN][RFC] strict-mutable-base - strict variants of mutable data types from base Announcements	8	1270	September 2, 2024

Linearly-typed mutable vector of unboxed values (hack) for NFA benchmark

How it was done

Performance

Conclusion

Related topics