I’m maintaining a backend for a web application written in Haskell.
Mostly it just serves a bunch of REST APIs using servant.
For a couple of weeks I’ve been trying to track down the following issue:
soon after the application is deployed, the CPU usage starts growing very slowly.
By slowly I mean roughly 0.5% per day, as illustrated by the following chart from our Datadog monitoring:
The application load is relatively low and pretty even (a couple of users per day, plus an automated test-like process that runs every 5 minutes and calls all HTTP endpoints via servant-client generated functions).
The growing CPU usage is definitely coming from the Haskell binary (as confirmed by elapsed CPU time).
At the same time the memory usage is staying roughly constant, so there doesn’t seem to be any space leak.
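One way to cross-check both observations from inside the process would be to log the RTS's own counters periodically. The following is only a minimal sketch, not code from the application: `logRtsStats`, `formatStats` and `nsToSeconds` are hypothetical names, and it assumes the binary is started with `+RTS -T` so that `GHC.Stats` collects data.

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Data.Int (Int64)
import GHC.Stats (GCDetails (..), RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.IO (hPutStrLn, stderr)

-- | Convert nanoseconds (the unit used by the RTS statistics) to seconds.
nsToSeconds :: Int64 -> Double
nsToSeconds ns = fromIntegral ns / 1e9

-- | Render one log line: cumulative mutator CPU time, cumulative GC CPU time,
-- and the live heap size measured at the most recent GC.
formatStats :: RTSStats -> String
formatStats s = unwords
  [ "mutator_cpu_s=" ++ show (nsToSeconds (mutator_cpu_ns s))
  , "gc_cpu_s=" ++ show (nsToSeconds (gc_cpu_ns s))
  , "live_bytes=" ++ show (gcdetails_live_bytes (gc s))
  ]

-- | Hypothetical helper: write RTS statistics to stderr every
-- 'intervalSeconds'. Does nothing useful unless run with +RTS -T.
logRtsStats :: Int -> IO ()
logRtsStats intervalSeconds = do
  enabled <- getRTSStatsEnabled
  if enabled
    then forever $ do
      getRTSStats >>= hPutStrLn stderr . formatStats
      threadDelay (intervalSeconds * 1000000)
    else hPutStrLn stderr "RTS stats are disabled; run with +RTS -T"
```

Started from `main` with something like `forkIO (logRtsStats 60)`, this would show whether the extra CPU time accumulates in the mutator or in the garbage collector, without requiring a profiling build.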
To find out where the CPU usage might be coming from, I compiled the backend binary with
stack build --profile ...
and I included the following RTS flags: -p to turn on time profiling and -V1 to set the RTS tick interval, and with it the time-profiling sampling interval, to 1 second, so that fewer samples are taken (I expect to run the binary for many days before terminating it with SIGINT to have the .prof file generated). Overall, the ghc-options for the executable in the cabal file look like this:
ghc-options: -threaded -rtsopts "-with-rtsopts=-N -p -V1"
However, after enabling profiling as described above, the issue seems to go away (the CPU usage stayed constant for a couple of days in a row).
So I disabled profiling, and the CPU usage started growing again.
So I enabled profiling again, and the problem seems to have gone away again. It’s starting to smell like a Heisenbug, but how can I be sure?
Does anyone have any experience/useful tips for how to troubleshoot issues like this?