How can I troubleshoot slowly growing CPU usage in long running Haskell executable?

Hello,
I’m maintaining a backend for web application written in Haskell.
Mostly it’s just serving bunch of rest APIs using servant.
For couple of weeks I’ve been trying to track down the following issue:
soon after the application is deployed, the CPU usage starts growing very slowly.
By slowly I mean like by 0.5% per day, as illustrated by following chart from our Datadog monitoring:

The application load is relatively low and pretty even (couple of users per day + automated test-like process running every 5 minutes, calling all HTTP endpoints via servant client generated functions).
The growing CPU usage is definitely coming from the Haskell binary (as confirmed by elapsed CPU time).
At the same time the memory usage is staying roughly constant, so there doesn’t seem to be any space leak.

To find out where the CPU usage might be coming from, I compiled the backend binary with stack build --profile ...
and I included the following RTS flags: -p to turn on time profiling and -V (to reduce sampling interval for time profiling, as I’m expecting to run that binary for many days before terminating it with SIGING to have .prof file generated), so overall ghc-options for the executable in cabal file look like this: ghc-options: -threaded -rtsopts "-with-rtsopts=-N -p -V1"

However after enabling profiling as noted above, the issue seems to go away (meaning CPU usage stayed constant for couple of days in row).
So I disabled profiling, and CPU usage started growing again.
So I enabled profiling again and problem seems to be gone away again. It starts to smell of Heisenbug, but how can I be sure?

Does anyone have any experience/useful tips for how to troubleshoot issues like this?

3 Likes

You could try something with the -eventlog option of ghc, that is less invasive than profiling, but it is less granular: threads instead of individual functions (but you can manually specify parts of your code that you want to profile). Here is an old blog post about it: https://well-typed.com/blog/86/. And here is the GHC documentation about it: https://ghc.gitlab.haskell.org/ghc/doc/users_guide/runtime_control.html#rts-eventlog.

4 Likes

This sounds like the classic “idle garbage collection” problem.

I’d recommend that you try passing the -I2 (collect every 2 seconds) or -I0 (disable the idle garbage collection) and report back your results.

2 Likes

Idle GC shouldn’t cause CPU usage to grow over time, should it, given that the memory usage is constant? (I suppose in theory it could if the application was doing extremely strange things with finalisers, but that seems unlikely.) It might cause unnecessary extra CPU usage, though. In current GHC versions it is better to rate-limit idle GCs using -Iw rather than increasing the threshold or disabling idle GC altogether.

@jhrcek when you say “memory usage is staying roughly constant” are you measuring process memory usage, live bytes on the heap or heap size?

5 Likes

@adamgundry When I say memory usage I mean process memory usage, which looks like this:

There’s not much else running on the machine except for the haskell binary and datadog agent. I didn’t look more deeply into memory profiling yet, as I saw memory is not an issue.

Sorry if this is stating the obvious, but I’d audit the code (as well as the code of the deps) for usages of forkIO which could lead to runaway threads. The fact that profiling makes the issue disappear is extremely suspicious though…

1 Like

It should not, but it is a rather common issue, unfortunately.

I am having the same issue with a servant server too. In the latest version of my server the issue got much worse. I upgraded from lts-16.23 to lts-16.31 and also added a custom cache implementation, so I think one of those are the reason for my major upswing.
image
The dipps in the graph is where I restart the process.

Here is a graph for a older month. The issue is still there but I hadn’t really noticed it since it grows so slowly and I have had to restart my server sporadically anyway.

I have not had any time to do any debugging, but will try your profiling trick when I have some time to see if I can reproduce that solution.

1 Like

Thanks for sharing you’re experiencing this issue to.
I assume you’re using the servant version coming from that stackage lts?
We’re using the versions coming from ts-16.17 (servant-16.2 and servant-server-16.2)

Yeah I am using the servant version in the lts. For my second graph from June 2020 I was using lts-10.10. I have a separate server that is running a different unrelated application also using servant and warp, that one is using lts-13.19 and does not have this problem. I have some doubts this issue is related to servant though.

@lorenzo mentioned idle garbage collection. A quick google search on that yielded this post where someone had similar problems:

Its probably worth trying changing the parameters he mention. I am going to try that when I have time to debug the issue.

2 Likes

FYI applying the garbage collection fix seems to have solved the issue on my end. I added -with-rtsopts=-I2 to the ghc-options of the server executable in my stack.yaml. I did this and used the new version on April 7. and my graph now looks like this:
image

Interestingly this also caused my idle CPU usage to go down from ~2.5% to ~1%.

It is still possible that my issue is not fixed since I have had periods without CPU leakage before, but I believe this was due to the Easter holiday and that people were not using the app. Will report back if the problem reappears.

2 Likes

Thank you all very much for useful suggestions!
In the end I tried -Iw10 to increase the minimal pause between idle GCs and the issue seems to be gone (CPU usage remains nicely low and flat).

I just wonder if any of you has any suggestions on appropriate value for the pause between idle GCs. My pick of 10 seconds is not substantiated by any measurements and frankly I’m not sure what I should even measure to determine the right value. Is there any risk of setting the pause too low / too high? Or some useful materials describing this?

2 Likes

Tuning these kinds of RTS flags is a bit of an under-documented area, unfortunately, and it is rather application/workload dependent.

In general, the main purposes for idle GC are:

  1. Avoiding retaining too much memory unnecessarily.
  2. Avoiding GC pauses in busy periods by encouraging major collections to happen when the application is idle instead.
  3. Making sure finalizers run (reasonably) promptly.

The potential downside is that idle GCs can end up doing lots of pointless work if your application receives infrequent requests that do a little allocation relative to the size of the old generation, leading to idle GCs being triggered unnecessarily frequently.

For points 1 and 2, choosing appropriate RTS flag values comes down to measuring your application’s memory usage and relevant performance characteristics (e.g. latency distribution). You can monitor when major GCs happen alongside other metrics, and see if the timing is sensible.

Memory usage over time may or may not matter depending on the application. If your machine is running a Haskell server process and not much else, it probably doesn’t matter if it retains a bit more memory provided it fits within RAM. In multi-tenant setups or for applications deployed to end users, though, it may be more important to free unused memory promptly. (Side note: GHC versions prior to 9.2 were very conservative about returning memory to the OS, but 9.2 will start to return unused memory more aggressively after idle GCs.)

The third point can potentially be an issue if you set very large -I/-Iw values or disable idle GCs entirely. If you have an application that uses ForeignPtr (e.g. wraps a C library, potentially via a dependency), it may rely on finalizers being run to release underlying resources. Thus without idle GCs, your application may end up retaining those resources for a long time after they are no longer reachable via the Haskell heap. Hence the advice now is to use -Iw rather than -I0.

3 Likes