Optimal -N for core count

We have what we believe is a fairly normal Servant app. Gets some web requests, decodes json, shoves data in redis, fetches data from redis, shoves data in kafka.

We moved this app from regular EC2 to Kubernetes at low traffic, and started tweaking settings when we got higher throughput (currently ~250k requests per minute).

When moving to Kubernetes, our thinking was to give Haskell NUM_CORES and have it behave like it did on an EC2 instance:

  • Set requests and limits to NUM_CORES
  • Set -N${NUM_CORES}
  • Enable CPU affinity by setting --cpu-manager-policy=static

This makes it so the Haskell executable gets exclusive access to NUM_CORES cores in the machine.
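
For context, here's a minimal sketch (not our actual code) of the kind of startup check we could use to confirm what the container actually sees; it assumes Linux (it reads /proc/self/status) and a binary built with -threaded:

-- Hypothetical startup check: log the RTS capability count next to the
-- CPU affinity the kernel actually gave this process (Linux only).
import Control.Concurrent (getNumCapabilities)
import Data.List (isPrefixOf)

main :: IO ()
main = do
  caps <- getNumCapabilities                  -- reflects +RTS -N<n>
  status <- readFile "/proc/self/status"      -- Linux only
  let affinity = [ l | l <- lines status, "Cpus_allowed_list:" `isPrefixOf` l ]
  putStrLn $ "capabilities (-N): " ++ show caps
  mapM_ putStrLn affinity                     -- e.g. "Cpus_allowed_list: 5,8,13"
  -- ... start the actual app here ...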

We noticed the number of threads created by Haskell far exceeds NUM_CORES: over 20 threads with -N3. We did confirm that we are not able to have more than 3 requests doing CPU work at the same time with -N3, and we understand -N3 shouldn’t really mean 3 threads, but we were surprised by how far they exceeded 3, and thought we could be paying some context-switch cost by setting CPU affinity to 3 cores.

We tried removing CPU affinity by setting requests/limits to 2999m (2999 millicores) and keeping -N3. This lets Haskell use however many cores the OS wants to give it, but not exceed 299.9ms of CPU time in every 100ms wall-clock CFS period.

This lowered our p99 response time by more than half, and indeed, cut down involuntary context switches to almost 0.

This should mean the ideal -N is not really the number of cores available to an application, but some number below it, even in a regular EC2 instance, so we can account for the excess threads the RTS maintains.

Can anyone confirm this understanding or explain some reasoning behind how to choose optimal -N?

You may have been bitten by the parallel GC. Check the RTS stats for the GC productivity metric.
If that’s the culprit, tweak the runtime memory parameters: adjust -A, or try the non-parallel GC with -qg.
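
If it helps, here is a minimal sketch of reading that metric from inside the app (assuming it is started with +RTS -T so GHC.Stats is populated):

import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

main :: IO ()
main = do
  ok <- getRTSStatsEnabled
  if not ok
    then putStrLn "run with +RTS -T to enable RTS stats"
    else do
      s <- getRTSStats
      -- productivity = time spent in the mutator / total time
      let cpuProd  = fromIntegral (mutator_cpu_ns s)     / fromIntegral (cpu_ns s)     :: Double
          wallProd = fromIntegral (mutator_elapsed_ns s) / fromIntegral (elapsed_ns s) :: Double
      putStrLn $ "CPU productivity:     " ++ show (cpuProd  * 100) ++ "%"
      putStrLn $ "elapsed productivity: " ++ show (wallProd * 100) ++ "%"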

1 Like

Thank you for the suggestion!

I looked up the GC Productivity metric:

“Productivity” tells you what percentage of the Total CPU and wall clock elapsed times are spent in the mutator (MUT).

I’ve plotted it for our app before and after:

Both seem very similar. In absolutes, I’m not sure if 70~80% productivity is ok, but the p99 slice-in-half doesn’t correlate with it.

Oh, hm, this is a fire I caused when I tried -N6 and CPU affinity with 3 cores:

[screenshot: GC productivity graph, 2021-08-26 at 20.24]

It recovered when I moved back to -N3 at 15:00, then we had a small (~10%) dip after that, which correlated with a spike in response times.

So it might still be the parallel GC; small differences in this productivity metric seem to cause big spikes in response time.

1 Like

I read the options in the manual last night but was pretty puzzled about how I’d go about changing them in any way other than at random.

I found this article from Channable which seems helpful, but would be grateful for any further reading material I could use to build a mental model on how to tweak the GC settings to reduce parallel GC overhead.

We’re stuck on GHC 8.8.4 for now, so I’d also appreciate any insight on whether a better course of action would be to prioritize moving to 8.10 to get the incremental garbage collector (no idea whether it’d help me) or to 9.smth to get the new non-moving GC.

1 Like

I think it’s hard to provide any concrete guidance here as the workload of your application is still quite vague to me. Have you tried using threadscope to look at an eventlog? That should clarify a lot about what the runtime is doing.

1 Like

Threadscope is perfectly in line with the guidance I’m looking for, thank you! I didn’t know it existed!

I was hoping more for “read this/use this tool to understand how to tune the GC”, less “set -A32m”.

Edit: also, whether there are any other factors besides GC that I should consider when deciding what -N is appropriate for NUM_CORES.

To clear up a couple of things:

  • -N controls the number of threads that can be running Haskell code simultaneously. It doesn’t control the total number of threads; in particular the RTS will spin up extra threads to handle foreign calls as necessary.
  • The number of running threads at any given time might be much larger than -N if some of them are in safe foreign calls. Or it can be lower than -N of course, if your app has some contention somewhere.
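
To illustrate the first point, here is a tiny sketch (assuming a -threaded build): the RTS multiplexes however many green threads you fork onto the -N capabilities, spinning up extra OS threads only as needed for foreign calls.

import Control.Concurrent
  (forkIO, getNumCapabilities, myThreadId, threadCapability, threadDelay)
import Control.Monad (forM_)

main :: IO ()
main = do
  caps <- getNumCapabilities              -- whatever was passed via +RTS -N
  putStrLn $ "capabilities: " ++ show caps
  -- fork far more green threads than capabilities; output may interleave
  forM_ [1 .. 20 :: Int] $ \i -> forkIO $ do
    tid <- myThreadId
    (cap, _locked) <- threadCapability tid
    putStrLn $ "green thread " ++ show i ++ " on capability " ++ show cap
  threadDelay 100000                      -- crude wait for the forked threads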

You mention “CPU affinity”, but there are various ways to do that; how specifically are you enabling affinity?

You might find this (oldish) blog post helpful: http://simonmar.github.io/posts/2016-12-08-Haskell-in-the-datacentre.html

3 Likes

I’m using Kubernetes’ --cpu-manager-policy=static coupled with containers using integer Requests/Limits. I don’t know exactly how Kubernetes configures that for my containers, but I know what the taskset looks like for my process:

# 6600 quiz-engine-http +RTS -M5.7g -N3 -RTS
$ taskset -cp 6600
pid 6600's current affinity list: 5,8,13
$ taskset -p 6600
pid 6600's current affinity mask: 2120

Thank you! I’ll experiment with different -N and -qn settings and see how our app behaves.

The non-moving, low-latency GC is already part of the GHC 8.10.x series, so grab .6 or .7 and, if your code is OK with it, test with that too. And good luck to you, since tuning the RTS on multicore while your app is under data pressure is, well, kind of black magic – to me at least. :slight_smile:

1 Like

The old rule of thumb was that -N should always be one or two cores less than those available to the program. Googling for “last core slowdown” should bring up some old discussion of it. Certainly a lot of the information on it may be somewhat out of date, but it could still be illuminating in terms of questions it raises, etc.

1 Like

:books: People often suggest reading GHC User Guide for such kinds of things but for me, the optimal set of RTS options is not clear only from reading the manual. Ideally, it would be lovely to have a formula that takes a number of CPU cores and RAM size as inputs, and outputs recommended RTS options. However, I’m not aware of such a thing, so here is my interpretation of the user guide (I might be wrong, so any feedback from experts is appreciated).

:construction_worker_woman: The initial idea is that you want to parallelise as much as possible. So setting the -N option sounds reasonable. However, GHC has a garbage collector, which is also parallel by default. And by default, it uses the same number of cores as set by -N. See the -qn option description. So, if you have 8 physical cores (not considering hyperthreading atm) and you specify -N, RTS will allocate 8 threads for your program and 8 threads for parallel GC. And this may exhaust your resources. That’s why using the -qg option (disabling parallel GC) can help to improve the performance of Haskell apps on average.

:eight: I imagine that you probably want the total number of capabilities for both your program and GC to equal the total number of threads. So, if you have 8 cores, you probably want -N6 -qn2 or -N4 -qn4. Some experimenting is required here :slightly_smiling_face:

:hamburger: Adding -A option (e.g. -A128m) also improves performance but at the cost of higher memory consumption (note that the -A value is multiplied by the number provided to -N).
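
For example (just arithmetic, not a benchmark): with -N6 -A128m, the allocation areas alone can occupy up to 6 × 128 MiB = 768 MiB of memory.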

:snail: If you have high requirements for latency, consider using the new GC algorithm with the --nonmoving-gc flag (available since GHC 8.10). But the overall performance of your application can slightly decrease (and memory consumption can increase) in exchange for lower latency.

:sloth: Also, by default, GHC releases memory to OS lazily. So you may end up with higher memory consumption than your program actually uses. Fortunately, since GHC 9.2, you can add the -Fd4 option to resolve this situation.

7 Likes

I’ve run some experiments in prod today, at high traffic, with everyone’s suggestions. Threadscope was invaluable here. Thank you so much for the recommendation!

We found promising configs that would enable us to do what we want: use cpu affinity to avoid noisy neighbors while keeping parallelism high in our mutators and not raising p99 response times.

Thanks everyone for the valuable suggestions!

My test runs

Here’s the raw data if you wanna check it out

Deets on our configs

I consider taskset here because I intend to try giving our critical containers exclusive access to a few cores, instead of relying on the kernel’s Completely Fair Scheduler and its notorious throttling behavior.

I considered running more capabilities than cores (-N6 with 3 cores) because we’ve had spikes of request queueing, and hope I can offset those by starting more things in parallel and letting the CPU rest while doing IO. My first attempt at this brought the app down to a crawl, so I gave it another try here with different settings.

Some findings

CPU Affinity (taskset)

Using taskset significantly worsened parallel GC performance with -N${NUM_CORES}, even without core saturation. Our container sat well below 50% CPU usage, yet p99 response times almost doubled compared to letting the container roam free across all cores on the machine, with CFS quotas ensuring it doesn’t use more than NUM_CORES*100ms of CPU time every 100ms of wall-clock time.

I think this makes sense with @ChShersh’s tip on -N + -qn == NUM_CORES, but wasn’t obvious from the docs.

Another metric I did not collect thoroughly, but that might help paint a picture of why -N3 with 3 cores is so much worse: I saw a much higher number of involuntary context switches when using taskset to give the app 3 cores vs. letting it roam free and get throttled by the CFS if it used more than 300ms of CPU time in a 100ms period. I can’t quite wrap my head around how context-switch cost can be so high as to double our p99 when threads aren’t actively doing enough work to saturate the cores.

Regardless, other findings below helped us make -N numbers equal to or even greater than the taskset core count feasible.

-A was way too low

I think?

With the default 1 MiB at Generation 0 I saw ~500-600 GC runs per second. That’s 1 GC every 2ms or less.

Our app has a 5ms avg response time, so the average request won’t finish without running GC twice!!
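
(Back-of-the-envelope, if I understand the model correctly: a minor GC runs roughly every time the allocation area fills, so GC frequency ≈ allocation rate / (-A × -N). With -N3 and the default -A of 1 MiB, 500-600 GCs per second would put us around 1.5-1.8 GB/s of allocation, and tripling -A should roughly cut the GC rate to a third.)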

What’s a good -A???

I saw @simonmar’s caveat in an SO answer that using -A higher than the L2 cache means we lower the L2 hit rate, and saw that AWS’s c5.4xlarge uses an Intel Xeon Platinum 8124M, which should have 1MB of L2 cache per core.

@ChShersh’s comment hadn’t arrived yet, so I didn’t know -A was multiplied by -N, and went with -A3m.

I also tried the max heap size I saw in ThreadScope (128MB), which means we ran ~1 GC per second, but memory usage climbed a bunch, from <200MB to >1GB.

Considering L2 cache sizes, I would have left -A at 1MB, which is the default, but is clearly not enough for this app.

Both 3m and 128m reduced GC overhead enough that I can now afford to use 3 cores in tasksets while using -N3.

Parallel GC

Threadscope reports a “parallel work balance” that was always very low (<27%), except when using very large values for -A, like -A128m, where we reached 42.85%. It kinda makes sense that parallelizing only helps when you have a large working set. I’m not sure if this is generalizable, but for our case it seems like we can take it as a rule not to use parallel GC with a small -A.

At the same time, disabling the parallel GC got us pretty good p99 response times, even while doing one GC pause every 2ms. I haven’t measured, but I’d guess we’ll see a much lower number of involuntary context switches in this scenario.

Last core slowdown

I searched for “last core slowdown” and found a blogpost from 2010 on haskell dot org, which I can’t link to because I’m A New User Who Can’t Link More Than Two Things Per Post, which claims that with a new GC that might have been released in 2011 “last core slowdown” should be finally banished.

Notes on profiling and figuring this stuff out, for humans of the future

I ran this test in prod with our app compiled with -eventlog, using +RTS -l -RTS, by throwing it in the load balancer together with our regular containers, and noticed no slowdown in p99 or average response time compared with the regular containers.

The output eventlog file ate around 1MB per second for our app, which was doing 500 requests per second per process. We don’t have a lot of disk space on our worker nodes, so we had to watch out for this and be quick about collecting our findings and dropping the test container.
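
Also for future humans: a sketch (handler name made up) of dropping your own markers into the eventlog with Debug.Trace, which makes it much easier to line up requests with GC pauses in ThreadScope:

-- Compile with -eventlog (on older GHCs) and run with +RTS -l, as above,
-- for these to land in the log.
import Debug.Trace (traceEventIO, traceMarkerIO)

-- hypothetical request handler, to show where the markers would go
handleRequest :: IO ()
handleRequest = do
  traceMarkerIO "request-start"   -- shows up as a vertical marker in ThreadScope
  traceEventIO  "decode-json"     -- plain events end up in the per-thread trace
  -- ... decode JSON, talk to redis, push to kafka ...
  traceMarkerIO "request-end"

main :: IO ()
main = handleRequest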

3 Likes

Disclaimers:

  • I’m only here because of the title (now that I’ve seen the content, perhaps a more appropriate one would have been e.g. Core counts in Kubernetes: optimal -N )
  • I only know of virtualisation - I’ve never had a need to use it myself (based on what I’ve seen here, the resulting infrastructure seems to resemble a Jenga pile…)

Now that you’ve been warned…some years ago, HaLVM appeared in the search results for another topic (perhaps House-related). If it or something similar was still an active project, would it be of interest?

Considering it allows a Haskell program to run without an OS (smaller Jenga pile!), it begs the question as to why it or a successor hasn't gone mainstream (i.e. now available directly in GHC, presumably as an alternate backend)...

@omnibs thanks for posting such a detailed write-up of your results!

One observation that I don’t think has been made yet (but apologies if I missed it somewhere in the thread): parallel GC in current GHC versions tends to perform poorly with more than a small number of cores, because its synchronization relies on contended spinlocks and sched_yield (which context-switches more than ideal). This is much improved in GHC 9.2 thanks to work by Douglas Wilson, who gave a talk about it at HIW last week. The 9.2 release notes have an overview of the changes.

2 Likes

I imagine that you probably want the total number of capabilities for both your program and GC to equal the total number of threads. So, if you have 8 cores, you probably want -N6 -qn2 or -N4 -qn4.

This doesn’t look right. Haskell GC is stop-the-world, so there’s no contention between running and collecting. AFAIK GC threads are simply using a subset of user capabilities.

1 Like

“parallel work balance” that was always very low (<27%), except when using very large values for -A, like -A128m, where we reached 42.85%.

That’s why I always use -qg (i.e. disable the parallel GC). I’ve never seen it beaten by the parallel one, for any amount of memory or cores used.

1 Like

The answer is naturally - it depends. Obviously if you’re interested in the performance of a certain single-threaded application, other processes just clutter your machine and compete over the shared resources. So let’s look at two cases where this question may be interesting:

  • You’re running multiple processes (let’s say they’re identical), and you’re interested in the aggregated performance.
  • You’re running a multithreaded application that can spawn as many threads as possible.
The second case is easier to answer, it (… wait for it …) depends on what you’re running! If you have locks, more threads may lead to higher contention and conflicts. If you’re lock free (or even some flavors of wait-free), you may still have fairness issues. It also depends on how the work is balanced internally in your application, or how your task schedulers work. There are simply too many possible solutions out there today.

I considered running more capabilities than cores (-N6 with 3 cores) because we’ve had spikes of request queueing, and hope I can offset those by starting more things in parallel and letting the CPU rest while doing IO. My first attempt at this brought the app down to a crawl, so I gave it another try here with different settings.

I don’t think that this ever makes sense. You can spawn off as many “green” threads as you want within your app, and GHC will map them to capabilities as it desires. So you can kick off as many concurrent things as you want. You don’t need to, nor does it generally help to, match your number of explicit green threads to your count of capabilities. The capabilities you set with -N tell GHC’s own scheduler how many OS threads to allocate, and I don’t see how allocating more of those OS threads than you have cores would ever be a good thing for scheduling. Context switches between “green” threads are much cheaper than a full context switch between two OS threads on a single core, if that’s a clear way to say it, so it’s much better to do things that way.

1 Like

I normally have 2-4 in-flight requests per process, but sometimes I get spikes of 30-40, and everything gets slow. When those spikes happen, CPU does not saturate, which is why I’m poking at capabilities.

On bumping to -N6, and making it viable by disabling the parallel GC, I still don’t see CPU going above 70% and I don’t see a bunch of involuntary context switches, which is what I would expect if I had way too many threads trying to do work at the same time (like when I was using the parallel GC with a small -A).

I get what you’re saying, but with taskset locking at 3 cores and 6 capabilities, when I have 30-40 inflight requests per process, I’m still not seeing any nonvoluntary_ctxt_switches in /proc/[pid]/status, so I’m not paying that price.

This is what my tracing spans look like during those 30-40 inflight request spikes:

The amount of white space between IO calls here tells me this green thread was waiting in line to get back in action, and since it wasn’t the OS scheduler holding it back (because we had no nonvoluntary_ctxt_switches), we can benefit from having a larger pool of OS threads.

Maybe there’s some pervasive locking throughout my app that I can’t spot, but maybe there’s some inefficiency mapping green threads to OS threads and we can benefit from more OS threads at this point.
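
One thing I might try next to quantify that waiting (just a sketch I haven’t validated; the 1ms probe interval is arbitrary): a green thread that repeatedly asks to sleep for 1ms and records how late it actually wakes up, as a rough proxy for how long runnable threads queue for a capability.

import Control.Concurrent (threadDelay)
import Control.Monad (forM)
import GHC.Clock (getMonotonicTimeNSec)

main :: IO ()
main = do
  -- ask for a 1ms sleep 1000 times and record how much later we actually wake up
  overshoots <- forM [1 .. 1000 :: Int] $ \_ -> do
    t0 <- getMonotonicTimeNSec
    threadDelay 1000                                       -- request 1ms
    t1 <- getMonotonicTimeNSec
    pure (fromIntegral (t1 - t0) / 1e6 - 1 :: Double)      -- overshoot in ms
  putStrLn $ "avg wake-up overshoot (ms): " ++ show (sum overshoots / 1000)
  putStrLn $ "max wake-up overshoot (ms): " ++ show (maximum overshoots)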