I’ve been trying to make my rendering code a bit faster with wide ops, but everything I try ends up slower than the dumbest straightforward variant.
E.g. pointwise `(+)` for `Vec4 ByteArray#` is a tiny bit faster when going through the FFI, but it loses by a third (!) to a plain `Vec4 Float Float Float Float` with regular `(+)` per component.
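For reference, the "dumb" variant I mean looks roughly like the sketch below. The strict, unpacked fields and the name `addVec4` are assumptions for illustration, not necessarily what the real code uses.

```haskell
-- Plain representation: four Float fields.
-- Strictness and UNPACK are assumptions of this sketch.
data Vec4 = Vec4
  {-# UNPACK #-} !Float
  {-# UNPACK #-} !Float
  {-# UNPACK #-} !Float
  {-# UNPACK #-} !Float
  deriving (Eq, Show)

-- Pointwise addition: one scalar (+) per component, no SIMD, no FFI.
-- addVec4 is a hypothetical name.
addVec4 :: Vec4 -> Vec4 -> Vec4
addVec4 (Vec4 a b c d) (Vec4 e f g h) =
  Vec4 (a + e) (b + f) (c + g) (d + h)
{-# INLINE addVec4 #-}

main :: IO ()
main = print (addVec4 (Vec4 1 2 3 4) (Vec4 10 20 30 40))
-- prints: Vec4 11.0 22.0 33.0 44.0
```

With fields like these GHC can keep the components unboxed after inlining, which may be part of why this version is hard to beat.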
Replacing scalar maths with the new FMA primops made it slower too.
I’ve found what appears to be an ideal target for this: 4-way ray-AABB intersection.
Yet the 4-way SIMD version is ~6.5x slower than a Haskell binary tree of AABBs with ordinary math for intersections.
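By "ordinary math" I mean something along the lines of the usual scalar slab test. The sketch below is a minimal version under assumptions: the types `V3`, `AABB`, `Ray` and the functions `mkRay`, `slab`, `hitAABB` are hypothetical names, and the ray stores a precomputed reciprocal direction.

```haskell
data V3 = V3 !Float !Float !Float

-- Axis-aligned box given by its min and max corners.
data AABB = AABB !V3 !V3

-- Ray as origin plus reciprocal direction (assumed layout).
data Ray = Ray !V3 !V3

-- Build a ray from origin and direction. A zero direction component
-- yields an infinite reciprocal, which the slab test mostly tolerates.
mkRay :: V3 -> V3 -> Ray
mkRay o (V3 dx dy dz) = Ray o (V3 (1 / dx) (1 / dy) (1 / dz))

-- Entry and exit parameters of the ray for one axis slab.
slab :: Float -> Float -> Float -> Float -> (Float, Float)
slab o inv lo hi =
  let t0 = (lo - o) * inv
      t1 = (hi - o) * inv
  in (min t0 t1, max t0 t1)

-- The ray hits the box within [tMin, tMax] if the three per-axis
-- intervals and the ray's own interval overlap.
hitAABB :: Ray -> AABB -> Float -> Float -> Bool
hitAABB (Ray (V3 ox oy oz) (V3 ix iy iz)) (AABB (V3 lx ly lz) (V3 hx hy hz)) tMin tMax =
  let (nx, fx) = slab ox ix lx hx
      (ny, fy) = slab oy iy ly hy
      (nz, fz) = slab oz iz lz hz
      tNear = maximum [tMin, nx, ny, nz]
      tFar  = minimum [tMax, fx, fy, fz]
  in tNear <= tFar

main :: IO ()
main = do
  let box = AABB (V3 1 (-1) (-1)) (V3 2 1 1)
  print (hitAABB (mkRay (V3 0 0 0) (V3 1 0 0)) box 0 100)    -- ray towards the box
  print (hitAABB (mkRay (V3 0 0 0) (V3 (-1) 0 0)) box 0 100) -- ray away from the box
```

This sketch does not handle the NaN case where a zero direction component meets an origin lying exactly on a slab boundary; a production version would need to decide how to treat it.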
I would be happy with the “explanation” that I simply suck at this or the partitioning is suboptimal. However…
Adding insult to injury, the profiler tells me that the 4-way version is indeed faster than 2-way:
| | 2-way | 4-way + SIMD | |
|---|---|---|---|
| Entries | 3085m | 951m | Yay, wide trees, less work |
| Time | 95% | 68.7% | Smaller portion of CPU resources, should be faster, right? |
| Alloc | 60.1% | 3.4% | 1/20 of allocations, gonna be good |
| Real time | 8s | 55s | …excuse me? |