I think I understand what’s going on here but to confirm:
SIMD is an architecture extension with instructions that allow data-parallel computation. I.e., it puts contiguous data into a special fixed-width register and applies a special instruction to all lanes in parallel. To use these data-parallel instructions we need to fit our data types exactly into the underlying SIMD register.
So we first define primitives like FloatX4 etc. that will be aligned to SIMD registers by the Native Code Generator (NCG).
Hence:
```haskell
class Storable a => StorableF f a where
  peekElemOffF :: Ptr a -> Int -> IO (f a)     -- index in scalar elements
  pokeElemOffF :: Ptr a -> Int -> f a -> IO () -- index in scalar elements

instance StorableF X4 Float where ...
instance StorableF X4 Double where ...
```
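A portable sketch of what such an instance might look like, assuming a plain `X4` record standing in for the real SIMD-backed type (the names here are illustrative, not the article's actual implementation, which would use single 128-bit load/store primops instead of four scalar ones):

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}
import Foreign.Marshal.Array (withArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, peekElemOff, pokeElemOff)

-- Hypothetical stand-in for a SIMD-backed 4-lane type.
data X4 a = X4 !a !a !a !a
  deriving (Eq, Show)

class Storable a => StorableF f a where
  peekElemOffF :: Ptr a -> Int -> IO (f a)     -- index in scalar elements
  pokeElemOffF :: Ptr a -> Int -> f a -> IO ()

-- Naive instance: four scalar loads/stores. A real SIMD instance
-- would do one 128-bit load/store instead.
instance StorableF X4 Float where
  peekElemOffF p i =
    X4 <$> peekElemOff p i
       <*> peekElemOff p (i + 1)
       <*> peekElemOff p (i + 2)
       <*> peekElemOff p (i + 3)
  pokeElemOffF p i (X4 a b c d) = do
    pokeElemOff p i a
    pokeElemOff p (i + 1) b
    pokeElemOff p (i + 2) c
    pokeElemOff p (i + 3) d
```

The `Int` index counting scalar elements (not chunks) is what lets the same pointer be addressed by both the scalar and the 4-lane code paths.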
This is called X4 because we can fit 4 of these into our special SIMD registers. And we need to define a class to know where these primitives start and end (I guess so we can copy them?).
And then our map function takes two versions of the same function: one that works with the SIMD type and another that works with a regular type (in case we don't have enough data left to fill a SIMD register). This is eventually abstracted away with Identity so we have one function in our interface.
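The shape of that map can be sketched over plain lists, with a hypothetical `mapSIMD` that applies the 4-lane function to each full chunk and the scalar function to the leftover tail (a simulation for illustration, not the article's actual API, which works over pointers):

```haskell
-- Hypothetical 4-lane chunk; stands in for a real SIMD vector type.
data X4 a = X4 !a !a !a !a

-- Apply the vector function to each full chunk of 4 elements and
-- the scalar function to the remaining 0-3 elements at the end.
mapSIMD :: (X4 a -> X4 b) -> (a -> b) -> [a] -> [b]
mapSIMD vf sf = go
  where
    go (a : b : c : d : rest) =
      let X4 a' b' c' d' = vf (X4 a b c d)
      in a' : b' : c' : d' : go rest
    go rest = map sf rest
```

The scalar fallback only ever runs on the final partial chunk, which is why it can later be derived from the vector function (via Identity) without hurting the hot path.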
And now to the benchmark:
Float is typically faster than Double. Why is this? Is it because Double is larger, so we can fit fewer of them into our registers, so there's less to be gained from data parallelism / there is more overhead than improvement?
Also, is the goal now to have library authors support SIMD, or will it eventually be built into the vector package? That wasn't clear to me — I think the initial part of the article said vector would support it, but that's been abandoned.
Nice read! Thanks for taking time to work on SIMD instructions in GHC
I would maybe request a redo of the NCG benchmarks but with the same GHC version as the other benchmarks (GHC 9.8.4) so it’s easier to compare LLVM vs NCG.
(Or did I miss something and is it not possible to use the same GHC for all benchmarks? Either GHC 9.8.4 for the NCG or also 9.12.1 for the LLVM etc.?)
My understanding is that `Float` is faster than `Double` because more `Float` values (4 values in a 128-bit register) can fit in a SIMD register than `Double` values (2 values).
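The arithmetic behind the lane counts, as a quick sanity check (128-bit SSE registers assumed here; wider AVX registers scale the same way):

```haskell
-- Number of lanes a scalar type of the given bit width gets
-- in a 128-bit SIMD register.
lanes :: Int -> Int
lanes elemBits = 128 `div` elemBits

-- lanes 32 == 4  (Float)
-- lanes 64 == 2  (Double)
```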
My plan does not include making the `vector` package support SIMD. Vectorized `map`/`zipWith`/`fold` would be implemented in a separate library.
> I would maybe request a redo of the NCG benchmarks but with the same GHC version as the other benchmarks (GHC 9.8.4) so it’s easier to compare LLVM vs NCG.
Yeah, that is sensible. I’ll add benchmark results with 9.12.1 later.
I wrote the draft of the article before 9.12.1 was released, so using 9.8 (or 9.10) was a natural choice then.
With GHC 9.12.1, I encountered a problem with the LLVM backend (#25561). I’ll try with 9.12.2 once released.
What’s the rationale for a separate package? This seems important enough to ship with the library.
GHC’s SIMD support is an experimental feature, and the usage of such a feature should be opt-in.
We should experiment with a separate package first.