Optimal -N for core count

We have what we believe is a fairly normal Servant app. Gets some web requests, decodes json, shoves data in redis, fetches data from redis, shoves data in kafka.

We moved this app from regular EC2 to Kubernetes at low traffic, and started tweaking settings when we got higher throughput (currently ~250k requests per minute).

When moving to Kubernetes, our thinking was to give Haskell NUM_CORES and have it behave like it did on an EC2 instance:

  • Set requests and limits to NUM_CORES
  • Set -N${NUM_CORES}
  • Enable CPU affinity by setting --cpu-manager-policy=static

This makes it so the Haskell executable gets exclusive access to NUM_CORES cores in the machine.
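For reference, a sketch of what that setup might look like (the memory values here are made up for illustration; the CPU manager policy is a kubelet flag, not part of the pod spec):

```
# kubelet flag on the node: --cpu-manager-policy=static
#
# Pod spec fragment: integer CPU values with requests == limits put
# the pod in the Guaranteed QoS class, which is what makes the static
# policy give it exclusive cores.
resources:
  requests:
    cpu: "3"
    memory: "6Gi"
  limits:
    cpu: "3"
    memory: "6Gi"

# container command: quiz-engine-http +RTS -N3 -RTS
```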

We noticed the number of threads created by Haskell far exceeds NUM_CORES: over 20 threads with -N3. We did confirm that we can't have more than 3 requests doing CPU work at the same time with -N3, and we understand -N3 shouldn't really mean 3 threads, but we were surprised by how far the count exceeded 3, and thought we could be paying some context-switch cost by setting CPU affinity to 3 cores.

We tried removing CPU affinity by setting requests/limits to 2999m (2999 millicores) and keeping -N3. This lets Haskell use however many cores the OS wants to give it, but not exceed ~300ms of CPU time every 100ms of wall-clock time.
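Spelled out, the CFS quota arithmetic behind that limit looks like this (my sketch, using CFS's default 100ms period):

```haskell
-- CFS quota sketch (mine, not from the thread): a millicore limit
-- translates into a budget of CPU-milliseconds the cgroup may use
-- per 100ms scheduling period.
main :: IO ()
main = do
  let millicores = 2999 :: Int -- cpu limit of "2999m"
      periodMs = 100 :: Int    -- default CFS period
      quotaMs = fromIntegral (millicores * periodMs) / 1000 :: Double
  print quotaMs -- 299.9
```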

This lowered our p99 response time by more than half, and indeed, cut down involuntary context switches to almost 0.

This should mean the ideal -N is not really the number of cores available to an application, but some number below it, even in a regular EC2 instance, so we can account for the excess threads the RTS maintains.

Can anyone confirm this understanding or explain some reasoning behind how to choose optimal -N?

1 Like

You may have been bitten by a parallel GC. Check RTS stats for GC productivity metric.
In that case, tweak runtime memory parameters - adjust -A, try non-parallel GC with -qg.
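If you'd rather read the stats from inside the app than from +RTS -s output, here's a small sketch (mine, using the GHC.Stats API from base; it needs the program to be run with +RTS -T so the RTS collects stats):

```haskell
-- Compute GC productivity from inside the app.
-- Run the program with: my-app +RTS -T -RTS
import GHC.Stats
import Text.Printf (printf)

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "run with +RTS -T to enable RTS stats"
    else do
      s <- getRTSStats
      -- productivity = fraction of CPU time spent in the mutator
      let productivity =
            fromIntegral (mutator_cpu_ns s)
              / fromIntegral (max 1 (cpu_ns s)) :: Double
      printf "productivity: %.2f\n" productivity
```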

1 Like

Thank you for the suggestion!

I looked up the GC Productivity metric:

“Productivity” tells you what percentage of the Total CPU and wall clock elapsed times are spent in the mutator (MUT).

I’ve plotted it for our app before and after:

Both seem very similar. In absolute terms I'm not sure whether 70~80% productivity is OK, but the halving of our p99 doesn't correlate with it.

Oh, hm, this is a fire I caused when I tried -N6 and CPU affinity with 3 cores:

[screenshot: GC productivity graph, 2021-08-26 at 20.24]

It recovered when I moved back to -N3 at 15:00, then we had a small (~10%) dip after that, which correlated with a spike in response times.

So it might still be parallel GC; just small differences in this productivity metric cause big spikes in response time.

1 Like

I read the options in the manual last night but was pretty puzzled about how I'd go about changing them in any way other than at random.

I found this article from Channable which seems helpful, but would be grateful for any further reading material I could use to build a mental model on how to tweak the GC settings to reduce parallel GC overhead.

We’re stuck on GHC 8.8.4 for now, so I’d also appreciate any insight on whether a better course of action would be to prioritize moving to 8.10 to get the incremental garbage collector (no idea whether it’d help me) or to 9.x to get the new non-moving GC.

1 Like

I think it’s hard to provide any concrete guidance here as the workload of your application is still quite vague to me. Have you tried using threadscope to look at an eventlog? That should clarify a lot about what the runtime is doing.

1 Like

Threadscope is perfectly in line with the guidance I’m looking for, thank you! I didn’t know it existed!

I was hoping more for “read this/use this tool to understand how to tune the GC”, less “set -A32m”.

Edit: also, if there’s any other factors besides GC that I should consider for deciding what -N is appropriate for NUM_CORES.

To clear up a couple of things:

  • -N controls the number of threads that can be running Haskell code simultaneously. It doesn’t control the total number of threads; in particular the RTS will spin up extra threads to handle foreign calls as necessary.
  • The number of running threads at any given time might be much larger than -N if some of them are in safe foreign calls. Or it can be lower than -N of course, if your app has some contention somewhere.
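To make the safe-foreign-call point concrete, here's a small sketch (mine, not from the thread): even with a single capability, each blocking safe foreign call occupies its own OS thread, so several of them can sleep in C land simultaneously while Haskell code keeps running.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
-- Why OS threads exceed -N: the RTS spins up an extra OS thread for
-- each capability whose worker is blocked in a *safe* foreign call.
-- Compile with: ghc -threaded Sleep.hs
import Control.Concurrent
import Control.Monad (forM_, replicateM_)
import Foreign.C.Types (CUInt)

foreign import ccall safe "unistd.h sleep"
  c_sleep :: CUInt -> IO CUInt

main :: IO ()
main = do
  done <- newEmptyMVar
  forM_ [1 .. 8 :: Int] $ \_ ->
    forkIO (c_sleep 1 >> putMVar done ())
  replicateM_ 8 (takeMVar done)
  -- With -threaded this takes ~1s, not 8s: eight OS threads were
  -- parked in sleep() at once, regardless of the -N value.
  putStrLn "all foreign calls returned"
```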

You mention “CPU affinity” but there are various ways to do that, how specifically are you enabling affinity?

You might find this (oldish) blog post helpful: http://simonmar.github.io/posts/2016-12-08-Haskell-in-the-datacentre.html


I’m using Kubernetes’ --cpu-manager-policy=static coupled with containers using integer Requests/Limits. I don’t know exactly how Kubernetes configures that for my containers, but I know what the taskset looks like for my process:

# 6600 quiz-engine-http +RTS -M5.7g -N3 -RTS
$ taskset -cp 6600
pid 6600's current affinity list: 5,8,13
$ taskset -p 6600
pid 6600's current affinity mask: 2120
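As a sanity check (my arithmetic, not from the thread), the hex mask does match the affinity list, since 2^5 + 2^8 + 2^13 = 0x2120:

```haskell
-- Rebuild the CPU affinity bitmask from the core list and print it
-- in hex, to compare against taskset's output.
import Data.Bits (shiftL, (.|.))
import Numeric (showHex)

main :: IO ()
main = putStrLn (showHex mask "")
  where
    -- cores 5, 8 and 13, as reported by `taskset -cp`
    mask = foldr (\c acc -> (1 `shiftL` c) .|. acc) 0 [5, 8, 13] :: Int
```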

Thank you! I’ll experiment with different -N and -qn settings and see how our app behaves.

The non-moving, low-latency GC is already part of the GHC 8.10.x series, so grab .6 or .7 and, if your code is OK with it, just test with it too. And good luck to you, since tuning the RTS on multicore while your app is under data pressure is, well, kind of black magic, to me at least. :slight_smile:

1 Like

The old rule of thumb was that -N should always be one or two cores less than those available to the program. Googling for “last core slowdown” should bring up some old discussion of it. Certainly a lot of the information on it may be somewhat out of date, but it could still be illuminating in terms of questions it raises, etc.

1 Like

:books: People often suggest reading the GHC User Guide for such kinds of things, but for me the optimal set of RTS options is not clear from reading the manual alone. Ideally, it would be lovely to have a formula that takes the number of CPU cores and RAM size as inputs and outputs recommended RTS options. However, I’m not aware of such a thing, so here is my interpretation of the user guide (I might be wrong, so any feedback from experts is appreciated).

:construction_worker_woman: The initial idea is that you want to parallelise as much as possible. So setting the -N option sounds reasonable. However, GHC has a garbage collector, which is also parallel by default. And by default, it uses the same number of cores as set by -N. See the -qn option description. So, if you have 8 physical cores (not considering hyperthreading atm) and you specify -N, RTS will allocate 8 threads for your program and 8 threads for parallel GC. And this may exhaust your resources. That’s why using the -qg option (disabling parallel GC) can help to improve the performance of Haskell apps on average.

:eight: I imagine that you probably want the total number of capabilities for both your program and GC to equal the total number of threads. So, if you have 8 cores, you probably want -N6 -qn2 or -N4 -qn4. Some experimenting is required here :slightly_smiling_face:

:hamburger: Adding -A option (e.g. -A128m) also improves performance but at the cost of higher memory consumption (note that the -A value is multiplied by the number provided to -N).

:snail: If you have high requirements for latency, consider using the new GC algorithm with the --nonmoving-gc flag (available since GHC 8.10). The overall throughput of your application can slightly decrease (and memory consumption can increase) in exchange for the lower latency.

:sloth: Also, by default, GHC releases memory to the OS lazily. So you may end up with higher reported memory consumption than your program actually uses. Fortunately, since GHC 9.2, you can add the -Fd4 option to address this situation.
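Pulling the flags from this post together, an invocation on a hypothetical 8-core box might look like the following (shown one flag per line for annotation; in reality it's a single command line, and the values are illustrative, not a recommendation):

```
./my-app +RTS
  -N6              # 6 capabilities for the mutator
  -qn2             # cap the parallel GC at 2 threads (or -qg to disable it)
  -A128m           # bigger nursery; total is -A times -N, here 768 MB
  --nonmoving-gc   # GHC >= 8.10: non-moving GC for lower pause times
  -Fd4             # GHC >= 9.2: return memory to the OS more eagerly
  -RTS
```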


I’ve run some experiments in prod today, at high traffic, with everyone’s suggestions. Threadscope was invaluable here. Thank you so much for the recommendation!

We found promising configs that would enable us to do what we want: use cpu affinity to avoid noisy neighbors while keeping parallelism high in our mutators and not raising p99 response times.

Thanks everyone for the valuable suggestions!

My test runs

Here’s the raw data if you wanna check it out

Deets on our configs

I mention taskset here because I intend to try giving our critical containers exclusive access to a few cores, instead of relying on the kernel’s Completely Fair Scheduler and its notorious throttling behavior.

I considered running more capabilities than cores (-N6 with 3 cores) because we’ve had spikes of request queueing, and hope I can offset those by starting more things in parallel and letting the CPU rest while doing IO. My first attempt at this brought the app down to a crawl, so I gave it another try here with different settings.

Some findings

CPU Affinity (taskset)

Using taskset significantly worsened parallel GC performance with -N${NUM_CORES}, even without core saturation. Our container sat well below 50% CPU usage, yet p99 response times almost doubled compared to letting the container roam free through all cores in the machine, letting CFS Quotas ensure it doesn’t use more than 100*NUM_CORES ms of cpu time every 100ms of wall-clock time.

I think this makes sense with @ChShersh’s tip on -N + -qn == NUM_CORES, but wasn’t obvious from the docs.

Another metric I did not collect thoroughly, but that might help paint a picture of why -N3 with 3 cores is so much worse: I saw a much higher number of involuntary context switches when using taskset to give the app 3 cores, vs letting it roam free and get throttled by the CFS if it used more than 300ms of CPU time every 100ms period. I can’t quite wrap my head around how context-switch cost can be so high as to double our p99 when threads aren’t actively doing enough work to saturate the cores.

Regardless, other findings below helped us make -N numbers equal to or even greater than the taskset core count feasible.

-A was way too low

I think?

With the default 1 MiB at Generation 0 I saw ~500-600 GC runs per second. That’s 1 GC every 2ms or less.

Our app has a 5ms avg response time, so the average request won’t finish without running GC twice!!

What’s a good -A???

I saw @simonmar’s caveat in an SO answer that using -A higher than the L2 cache means we lower the L2 hit rate, and saw AWS’s c5.4xlarge uses an Intel Xeon Platinum 8124M, which should have 1MB of cache per core.

@ChShersh’s comment hadn’t arrived yet, so I didn’t know -A was multiplied by -N, and went with -A3m.

I also tried the max heap size I saw in ThreadScope (128MB), which means we ran ~1 GC per second, but memory usage climbed a bunch, from <200MB to >1GB.

Considering L2 cache sizes, I would have left -A at 1MB, which is the default, but is clearly not enough for this app.

Both 3m and 128m reduced GC overhead enough that I can now afford to use 3 cores in tasksets while using -N3.

Parallel GC

Threadscope reports a “parallel work balance” that was always very low (<27%), except when using very large values for -A, like -A128m, where we reached 42.85%. It kinda makes sense that parallelizing only helps when you have a large working set. I’m not sure if this is generalizable, but for our case it seems like we can take it as a rule not to use parallel GC with a small -A.

At the same time, disabling parallel GC got us pretty good p99 response times, even while doing one GC pause every 2ms. I haven’t measured, but I’d guess we’ll see a much lower number of involuntary context switches in this scenario.

Last core slowdown

I searched for “last core slowdown” and found a blog post from 2010 on haskell dot org, which I can’t link to because I’m A New User Who Can’t Link More Than Two Things Per Post, which claims that with a new GC that might have been released in 2011, the “last core slowdown” should finally be banished.

Notes on profiling and figuring this stuff out, for humans of the future

I ran this test in prod with our app compiled with -eventlog, using +RTS -l -RTS, by throwing it in the load balancer together with our regular containers, and noticed no slowdown in p99 or average response time compared to our regular containers.

The output eventlog file ate around 1MB per second for our app, which was doing 500 requests per second per process. We don’t have a lot of disk space on our worker nodes, so we had to watch out for this and be quick about collecting our findings and dropping the test container.



  • I’m only here because of the title (now that I’ve seen the content, perhaps a more appropriate one would have been e.g. Core counts in Kubernetes: optimal -N )
  • I only know of virtualisation - I’ve never had a need to use it myself (based on what I’ve seen here, the resulting infrastructure seems to resemble a Jenga pile…)

Now that you’ve been warned…some years ago, HaLVM appeared in the search results for another topic (perhaps House-related). If it or something similar was still an active project, would it be of interest?

Considering it allows a Haskell program to run without an OS (smaller Jenga pile!), it raises the question of why it or a successor hasn't gone mainstream (i.e. now available directly in GHC, presumably as an alternate backend)...

@omnibs thanks for posting such a detailed write-up of your results!

One observation that I don’t think has been made yet (but apologies if I missed it somewhere in the thread): parallel GC in current GHC versions tends to perform poorly with more than a small number of cores, because its synchronization relies on contended spinlocks and sched_yield (which context-switches more than ideal). This is much improved in GHC 9.2 thanks to work by Douglas Wilson, who gave a talk about it at HIW last week. The 9.2 release notes have an overview of the changes.


I imagine that you probably want the total number of capabilities for both your program and GC to equal the total number of threads. So, if you have 8 cores, you probably want -N6 -qn2 or -N4 -qn4.

This doesn’t look right. Haskell GC is stop-the-world, so there’s no contention between running and collecting. AFAIK GC threads are simply using a subset of user capabilities.

1 Like

“parallel work balance” that was always very low (<27%), except when using very large values for -A, like -A128m, where we reached 42.85%.

That’s why I always use -qg (i.e. disable parallel GC). I’ve never seen it beaten by the parallel one, for any amount of memory or cores.

1 Like

The answer is naturally - it depends. Obviously if you’re interested in the performance of a certain single-threaded application, other processes just clutter your machine and compete over the shared resources. So let’s look at two cases where this question may be interesting:

  1. You’re running multiple processes (let’s say they’re identical), and you’re interested in the aggregate performance.
  2. You’re running a multithreaded application that can spawn as many threads as possible.

The second case is easier to answer: it (… wait for it …) depends on what you’re running! If you have locks, more threads may lead to higher contention and conflicts. If you’re lock-free (or even some flavors of wait-free), you may still have fairness issues. It also depends on how the work is balanced internally in your application, or how your task schedulers work. There are simply too many possible solutions out there today.

I considered running more capabilities than cores (-N6 with 3 cores) because we’ve had spikes of request queueing, and hope I can offset those by starting more things in parallel and letting the CPU rest while doing IO. My first attempt at this brought the app down to a crawl, so I gave it another try here with different settings.

I don’t think that this ever makes sense. You can spawn off as many “green” threads as you want within your app, and GHC will map them to capabilities as it desires. So you can kick off as many concurrent things as you want. You don’t need to, nor does it generally help to, match your number of explicit green threads to your count of capabilities. The capabilities you set with -N tell GHC’s own scheduler how many OS threads to run Haskell code on, and I don’t see how allocating more of those than you have cores would ever be a good thing for scheduling. Context switches between “green” threads are much cheaper than a full context switch between two OS threads on a single core, if that’s a clear way to say this, so it’s much better to do things that way.
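The green-thread point above can be sketched in a few lines (my example, not from the thread): thousands of forkIO threads are cheap, and the RTS multiplexes them onto however many capabilities -N provides.

```haskell
-- Green threads vastly outnumbering capabilities is the normal,
-- cheap case in GHC; no need to raise -N to match them.
import Control.Concurrent
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  caps <- getNumCapabilities
  done <- newEmptyMVar
  -- fork 1000 green threads; each signals completion via the MVar
  forM_ [1 .. 1000 :: Int] $ \_ -> forkIO (putMVar done ())
  replicateM_ 1000 (takeMVar done)
  putStrLn ("finished 1000 green threads; capabilities = " ++ show caps)
```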

1 Like

I’m normally having 2-4 inflight requests per process, but sometimes I get spikes of 30-40, and everything gets slow. When those spikes happen, CPU does not saturate, which is why I’m poking at capabilities.

On bumping to -N6, and making it viable by disabling the parallel GC, I still don’t see CPU going above 70% and I don’t see a bunch of involuntary context switches, which is what I would expect if I had way too many threads trying to do work at the same time (like when I was using the parallel GC with a small -A).

I get what you’re saying, but with taskset locking at 3 cores and 6 capabilities, when I have 30-40 inflight requests per process, I’m still not seeing any nonvoluntary_ctxt_switches in /proc/[pid]/status, so I’m not paying that price.

This is what my tracing spans look like during those 30-40 inflight request spikes:

The amount of white space between IO calls here tells me this green thread was waiting in line to get back into action, and since it wasn’t the OS scheduler holding it back (because we had no nonvoluntary_ctxt_switches), we can benefit from having a larger pool of OS threads.

Maybe there’s some pervasive locking throughout my app that I can’t spot, or maybe there’s some inefficiency mapping green threads to OS threads and we can benefit from more OS threads at this point.