How to cook with SIMD?

I’ve been trying to make my rendering code a bit faster with wide ops, but everything I try ends up slower than the dumbest straightforward variant.

E.g. pointwise (+) for Vec4 ByteArray# is a tiny bit faster with FFIng, but loses by a third (!) to Vec4 Float Float Float Float and then using regular (+) per component.
Replacing scalar maths with the new FMA primops made it slower too.

I’ve found what appears to be an ideal target for this - 4-way ray-AABB intersection.
The thing is ~6.5x slower than a Haskell binary tree of AABBs with ordinary math for intersections.
I would be happy with the “explanation” that I simply suck at this or the partitioning is suboptimal. However…
Adding insult to injury, the profiler tells me that the 4-way version is indeed faster than 2-way:

2-way 4-way + SIMD
Entries 3085m 951m Yay, wide trees, less work
Time 95% 68.7% Smaller portion of CPU resources, should be faster, right?
Alloc 60.1% 3.4% 1/20 of allocations, gonna be good
Real time 8s 55s …excuse me?
12 Likes

Out of a (web)search for haskell ghc simd poor performance, this:

seems to be the most recent of the first six “vaguely-relevant” results - does anything there help?

@wiz could you show the code? What kind of SIMD are we talking about?

1 Like

my understanding is that aside from LLVM pigbacking we have no SIMD primops without them any “SIMD” is actually just normal instructions simulating SIMD which will at best be as fast as not wrapping the primitives

Exhibit 1: The basics.

data Vec4 = Vec4 ByteArray#
void Vec4_plus_Vec4_SIMD(float *x, float *y, float *o) {
    _m128 xv = _mm_load_ps(x);
    _m128 yv = _mm_load_ps(y);
    xv = _mm_add_ps(xv, yv);
    _mm_store_ps(o, xv);
}

Didn’t had much hope for this one, but it did made things tiiiiiiny bit faster (15.5ns → 15.0ns).

Now, behold the power of Haskell!

data Vec4' = Vec4' {-# UNPACK #-} Float {-# UNPACK #-} Float {-# UNPACK #-} Float {-# UNPACK #-} Float

plus (Vec4' a0 a1 a2 a3) (Vec4 b0 b1 b2 b3) = Vec4' (a0 + b0) (a1 + b1) (a2 + b2) (a3 + b3)

This only takes 10ns.

Exhibit 2, something more normal.

data Vec3 = Vec3 Float Float Float -- unpacked ofc

normalize v@(Vec3 x y z) =
  if nearZero q || nearZero (abs (1 - q))
    then v
    else Vec3 (x * r) (y * r) (z * r)
  where
    q = dot v v
    r = 1 / sqrt q
    nearZero a = a <= 1e-6

12.5ns

void Vec3_normalize_SIMD(float x, float y, float z, float *O) {
    simde__m128 v4 = simde_mm_set_ps(x, y, z, 0); // init

    simde__m128 q4 = simde_mm_dp_ps(v4, v4, 0xFF); // dot
    float q = simde_mm_cvtss_f32(q4); // extract to scalar for testing
    if (q <= 1e-6 || fabsf(1.0f - q) <= 1e-6) {
      simde_mm_store_ps(O, v4); // bail out
    } else {
      simde__m128 n4 = simde_mm_rsqrt_ss(q4); // r
      simde__m128 x = simde_mm_shuffle_ps(n4, n4, 0); // broadcast r
      simde_mm_store_ps(O, simde_mm_mul_ps(v4, x)); // multiply and store
    }
}

Getting multiple floats out required some wrapping/unwrapping

normalize3 :: Vec3 -> Vec3
normalize3 v = withVec4 out \a' b' c' _ -> vec3 a' b' c'
  where
    out = unsafePerformIO do
      result@(Vec4 o) <- unsafeNewVec4
      withVec3 v \a b c -> vec3_normalize a b c o
      pure result  

35.2ns. Okay, with so much dancing around the representation it is kinda expected.

Exhibit 3: brb… gotta collect more data

Well, my data is already unboxed and stuff. The SIMD was dismissed early in the thread and I want to know specifically how to actually use it.

@wiz does “ps” stand for picoseconds? If yes then something is wrong with your benchmarks, 10 ps is unrealistically fast.

Those are nanoseconds indeed.

My understanding is that to call opaque foreign code Haskell has to save / restore all CPU registers. This is lots of work, more than addition of four Floats, explaining Exhibit 1 and, at least partially, Exhibit 2.

A loop over an array of Floats should reside in C code for vectorised instructions to pay off.

6 Likes