I think I understand what’s going on here but to confirm:
SIMD is an architecture extension with instructions that allow data-parallel computation. I.e., it puts contiguous data into a special fixed-width register and applies a special instruction to all lanes in parallel. To use these data-parallel instructions we need to fit our data types exactly into the underlying SIMD register.
So we first define primitives like FloatX4 etc. that will be aligned to SIMD registers by the Native Code Generator (NCG).
Hence:
```haskell
class Storable a => StorableF f a where
  peekElemOffF :: Ptr a -> Int -> IO (f a)     -- index in scalar elements
  pokeElemOffF :: Ptr a -> Int -> f a -> IO () -- index in scalar elements

instance StorableF X4 Float where ...
instance StorableF X4 Double where ...
```
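A portable sketch of what such an instance might look like, assuming a plain `X4` record standing in for the real SIMD-backed type (the names here are illustrative, not the article's actual implementation, which would use single 128-bit load/store primops instead of four scalar ones):

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}
import Foreign.Marshal.Array (withArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, peekElemOff, pokeElemOff)

-- Hypothetical stand-in for a SIMD-backed 4-lane type.
data X4 a = X4 !a !a !a !a
  deriving (Eq, Show)

class Storable a => StorableF f a where
  peekElemOffF :: Ptr a -> Int -> IO (f a)     -- index in scalar elements
  pokeElemOffF :: Ptr a -> Int -> f a -> IO ()

-- Naive instance: four scalar loads/stores. A real SIMD instance
-- would do one 128-bit load/store instead.
instance StorableF X4 Float where
  peekElemOffF p i =
    X4 <$> peekElemOff p i
       <*> peekElemOff p (i + 1)
       <*> peekElemOff p (i + 2)
       <*> peekElemOff p (i + 3)
  pokeElemOffF p i (X4 a b c d) = do
    pokeElemOff p i a
    pokeElemOff p (i + 1) b
    pokeElemOff p (i + 2) c
    pokeElemOff p (i + 3) d
```

The `Int` index counting scalar elements (not chunks) is what lets the same pointer be addressed by both the scalar and the 4-lane code paths.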
This is called X4 because we can fit 4 of these into our special SIMD registers. And we need to define a class to know where these primitives start and end (I guess so we can copy them?).
And then our map function takes two versions of the same function: one that works with the SIMD type and another that works with a regular type (in case we don't have enough data left to fill a SIMD register). This is eventually abstracted away with Identity so we have one function in our interface.
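The shape of that map can be sketched over plain lists, with a hypothetical `mapSIMD` that applies the 4-lane function to each full chunk and the scalar function to the leftover tail (a simulation for illustration, not the article's actual API, which works over pointers):

```haskell
-- Hypothetical 4-lane chunk; stands in for a real SIMD vector type.
data X4 a = X4 !a !a !a !a

-- Apply the vector function to each full chunk of 4 elements and
-- the scalar function to the remaining 0-3 elements at the end.
mapSIMD :: (X4 a -> X4 b) -> (a -> b) -> [a] -> [b]
mapSIMD vf sf = go
  where
    go (a : b : c : d : rest) =
      let X4 a' b' c' d' = vf (X4 a b c d)
      in a' : b' : c' : d' : go rest
    go rest = map sf rest
```

The scalar fallback only ever runs on the final partial chunk, which is why it can later be derived from the vector function (via Identity) without hurting the hot path.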
And now to the benchmark:
Float is typically faster than Double. Why is this? Is it because Double is larger, so we can fit fewer of them into our registers, so there's less to be gained from data parallelism / there is more overhead than improvement?
Also, is the goal now to have library authors support SIMD, or will it eventually be built into the vector package? That wasn't clear to me — I think the initial part of the article said vector would support it, but that's been abandoned.
Nice read! Thanks for taking time to work on SIMD instructions in GHC
I would maybe request a redo of the NCG benchmarks but with the same GHC version as the other benchmarks (GHC 9.8.4) so it’s easier to compare LLVM vs NCG.
(Or did I miss something and is it not possible to use the same GHC for all benchmarks? Either GHC 9.8.4 for the NCG or also 9.12.1 for the LLVM etc.?)
My understanding is that `Float` is faster than `Double` because more `Float` values (4 values in a 128-bit register) can fit in a SIMD register than `Double` values (2 values).
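The arithmetic behind the lane counts, as a quick sanity check (128-bit SSE registers assumed here; wider AVX registers scale the same way):

```haskell
-- Number of lanes a scalar type of the given bit width gets
-- in a 128-bit SIMD register.
lanes :: Int -> Int
lanes elemBits = 128 `div` elemBits

-- lanes 32 == 4  (Float)
-- lanes 64 == 2  (Double)
```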
My plan does not include making the `vector` package support SIMD. Vectorized `map`/`zipWith`/`fold` would be implemented in a separate library.
> I would maybe request a redo of the NCG benchmarks but with the same GHC version as the other benchmarks (GHC 9.8.4) so it’s easier to compare LLVM vs NCG.
Yeah, that is sensible. I’ll add benchmark results with 9.12.1 later.
I wrote the draft of the article before 9.12.1 was released, so using 9.8 (or 9.10) was a natural choice then.
With GHC 9.12.1, I encountered a problem with the LLVM backend (#25561). I’ll try with 9.12.2 once released.
What’s the rationale for a separate package? This seems important enough to ship with the library.
GHC’s SIMD support is an experimental feature, and the usage of such a feature should be opt-in.
We should experiment with a separate package first.