Mysterious performance difference

I’m having trouble understanding why Version 1 of this code performs about 10% better when compiled than Version 2. All I did in Version 2 was tidy up the variables so they are no longer shadowed and remove some redundant helper functions. Any ideas?

Version 1

transform :: (Ord a, Floating a) => a -> a -> a -> a -> a -> a -> a
transform x0 l1 l2 a b x = lin (tra x x0 l1 l2) a b
  where  tra x x0 l1 l2 = (1-sig x x0) * (yeo x l1) + (sig x x0) * (yeo x l2)  -- sigmoid-weighted blend of two Yeo-Johnson transforms
         sig x x0       = 1 / (1 + exp (-50 * (x - x0)))  -- steep logistic sigmoid centered at x0
         lin x a b      = a * x + b  -- affine rescaling
         yeo x l | l /= 0 && x >= 0 = ((x+1)**l - 1) / l  -- Yeo-Johnson power transform
                 | l == 0 && x >= 0 = log (x + 1)
                 | l /= 2 && x  < 0 = -((-x + 1)**(2-l)-1)/(2-l)
                 | otherwise        = -log (-x + 1)

Version 2

transform :: (Ord a, Floating a) => a -> a -> a -> a -> a -> a -> a
transform x0 l1 l2 a b x = a * tra + b
  where tra = (1-sig) * (yeo l1) + (sig) * (yeo l2)
        sig = 1 / (1 + exp (-50 * (x - x0)))
        yeo l | l /= 0 && x >= 0 = ((x+1)**l - 1) / l
              | l == 0 && x >= 0 = log (x + 1)
              | l /= 2 && x  < 0 = -((-x + 1)**(2-l)-1)/(2-l)
              | otherwise        = -log (-x + 1)

Ultimately, measured performance is going to depend on how this function is used in the rest of your program. If you want to analyze how it performs in isolation, you can put it under a microbenchmark framework and iterate on it (make small changes and see which ones make the larger difference), but at the end of the day, for a ±10% difference, it’s going to be difficult to advise you on what to do differently without seeing the rest of the program.
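For the isolated measurement, a minimal sketch using the criterion package might look like the following. Here transform1 and transform2 are hypothetical renamings of your two versions, and the numeric arguments are arbitrary placeholders:

import Criterion.Main (bench, defaultMain, whnf)

main :: IO ()
main = defaultMain
  [ bench "version 1" $ whnf (transform1 0.5 1.0 2.0 3.0 4.0) 0.25  -- apply and evaluate to WHNF each iteration
  , bench "version 2" $ whnf (transform2 0.5 1.0 2.0 3.0 4.0) 0.25
  ]

Note that the literals default to Double here, so this measures the Double-specialized path; that may or may not match how the function is instantiated in your real program.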

Some notes, though:

  • Everything here is lazy. I expect that most, if not all, of the performance difference comes down to how many thunks are being allocated, and when. I recommend measuring allocation counts in addition to wall time, and making all of your bindings strict (see the first sketch after this list).
  • Being generic in your numeric type has a performance cost; GHC could do more optimization if it knew you were working with Double. It may be doing those optimizations already; it depends on how this function is used. I recommend dumping the Core and inspecting it to see whether the generic or the specialized version of these functions is being called; if it’s the generic one, try adding SPECIALIZE pragmas (see the second sketch after this list).
  • In Version 1, some of your inner functions can be floated up to the top level; in Version 2 they cannot, because they close over x and x0. This means that Version 2 shares more than Version 1. Sharing can be a performance boon or a penalty, depending on how expensive the shared computation is relative to the cost of a memory access. In this case, I expect the computation is cheap enough that sharing is a penalty, but making everything strict can change that calculus, which is why I make that recommendation first.
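On the strictness point, here is a minimal sketch of Version 2 with strict bindings via the BangPatterns extension. Since every binding is demanded unconditionally anyway, this only changes when things are evaluated, not what is computed:

{-# LANGUAGE BangPatterns #-}

transform :: (Ord a, Floating a) => a -> a -> a -> a -> a -> a -> a
transform x0 l1 l2 a b x = a * tra + b
  where !tra = (1 - sig) * yeo l1 + sig * yeo l2  -- forced as soon as the result is demanded, instead of thunked
        !sig = 1 / (1 + exp (-50 * (x - x0)))
        yeo l | l /= 0 && x >= 0 = ((x + 1) ** l - 1) / l  -- yeo is a function, so there is nothing to bang here
              | l == 0 && x >= 0 = log (x + 1)
              | l /= 2 && x <  0 = -((-x + 1) ** (2 - l) - 1) / (2 - l)
              | otherwise        = -log (-x + 1)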
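And for the specialization point, a sketch of the pragma (it goes in the module that defines transform; YourModule.hs below is a placeholder), plus how to check what GHC actually did:

{-# SPECIALIZE transform
      :: Double -> Double -> Double -> Double -> Double -> Double -> Double #-}

-- To inspect the Core, compile with
--   ghc -O2 -ddump-simpl -dsuppress-all YourModule.hs
-- and look for a $s-prefixed specialization of transform; call sites that
-- still pass a class dictionary are taking the generic path.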

“If in doubt, bang it out” 🙂

Bang patterns all over the place seem to help. Reminds me a little bit of the cut operator in Prolog.

I’ve definitely written code where bangs make the performance worse!

If in doubt, I measure allocations. They’re a much more stable metric than time, so it’s easier to do A/B testing on small tweaks to see what’s helping and what’s hurting.
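For anyone who wants to do the same: compile with -rtsopts and run with the RTS statistics flag, then compare the “bytes allocated in the heap” line between the two builds (Main.hs is a placeholder):

$ ghc -O2 -rtsopts Main.hs
$ ./Main +RTS -s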

In this case, though, every function uses every argument unconditionally, so there’s absolutely nothing gained by deferring evaluation of the arguments. But GHC doesn’t know that, because these functions are generic: if + or * were non-strict in one or more of their arguments, there might be some benefit to allocating thunks. Informing GHC that no, you (probably) don’t intend to build towers of thunks in some abstract numeric data type is going to make a difference; one way to do that is sketched below.
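A minimal sketch, assuming the generic transform from above is in scope. transformD is a hypothetical name, and whether GHC specializes through the wrapper depends on transform’s unfolding being visible, which is what the INLINABLE pragma is for:

{-# INLINABLE transform #-}  -- expose the unfolding so uses at concrete types can be specialized

-- At Double, (+), (*), and exp are strict operations on a concrete type,
-- so the specialized code has no reason to allocate thunks just in case
-- some exotic Num instance turns out to be lazy in its arguments.
transformD :: Double -> Double -> Double -> Double -> Double -> Double -> Double
transformD = transform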
