It’s been a long time since I wrote something here. For the past few years I’ve been busy at work, which is where most of my writing happens these days. This particular entry is an offshoot of a production disaster I witnessed, which I then took as an opportunity to dive deep and learn more. What seemed like a convoluted problem at the outset ended up being pretty fascinating in the end.
It’s a bit of a shock to me that even though the majority of Go applications are deployed in containerized environments, there’s very little awareness (from what I’ve seen) of this gotcha (as you’ll come to learn ahead), which can have major consequences.
This is a problem I had already solved for some of our services a few months ago, but when it came to explaining WHY it worked, I realized my understanding was somewhat broken. So, when we faced an issue caused by CPU throttling in another one of our services, I took the opportunity to dive deeper and understand why exactly this worked the way it did. I had a lot of fun learning how Kubernetes enforces CPU limits, and I hope you do too, and that this might save you a lot of headache and heartache before you ever run into this.
In this blog, I’ll attempt to answer the following questions:
- What could happen when a service experiences CPU throttling?
- How are CPU limits enforced by Kubernetes?
- Why would a Go program experience throttling?
- Will setting my requests == limits prevent throttling?
- To limit or not to limit?
- If I use CPU limits, how do I set GOMAXPROCS appropriately?
- Will setting GOMAXPROCS to my CPU limit lower my service throughput?
This is going to be a bit of a lengthy read, and you might have to read a few sections again and again to let them sink in. So, please grab a cup of coffee before I start my yapping.
What could happen when a service experiences CPU throttling?
The possibilities are on a spectrum. On one end, you could see increased latencies in your application. On the other, under extremely high traffic your application stops responding to liveness checks, Kubernetes restarts your pods, connections pile up further on the load balancer, and the overloaded load balancer eventually stops serving traffic to other services in the network too. So yeah, the consequences range from somewhat bearable to outright catastrophe.
The example
We’ll use the following example throughout. For simplicity’s sake, let’s assume we have a single-node Kubernetes cluster with 2 CPU cores. On this cluster, we’re deploying a single service consisting of a single container.
These are the service resource definitions:
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        cpu: "100m"
      limits:
        cpu: "1000m"
For this blog, the memory requests and limits are irrelevant.
What do the notations 100m & 1000m mean exactly?
100m translates to 0.1 (or 10%) of a CPU core.
1000m translates to 1 full CPU core.
Requests & Limits
Requests define the amount of resources guaranteed to your service. If Kubernetes cannot find 100m worth of CPU on any node, it will fail to schedule the pods for your service. Since we have a request of 100m, Kubernetes guarantees we’ll get 0.1 core for our service at all times.
Limits define the maximum amount of resources your service can utilize. Since we have a limit of 1000m, our service has a CPU quota of 1 CPU core; that is the maximum utilization allowed for our service.
How are CPU limits enforced by Kubernetes?
Kubernetes uses the Linux Completely Fair Scheduler (CFS), via cgroups, to monitor and enforce CPU usage and limits on a node. (Memory limits are enforced separately, by the cgroup memory controller, so they’re not CFS’s job.)
The job of CFS here is fairly simple:
- Ensure every container receives at least the CPU time it’s entitled to under resources.requests.cpu
- Prevent any container from using more CPU time than what’s set under resources.limits.cpu
The CFS period is 100ms by default. Within every 100ms period, CFS tracks how much CPU time the service has used and compares it against its quota; once the quota is exhausted, the service is throttled until the next period starts. With a limit of 1000m, your service is allowed to use CPU for the entire 100ms period (since, proportionally, 100ms of CPU time per 100ms period means 100% of one CPU, which is what we meant by 1000m above).
Note: Kubernetes allows changing the CFS period value from its default of 100ms but the configuration is per node (not per service). You can always assume it stays unchanged unless your k8s admin has specified otherwise. I won’t be diving into why this is adjustable or where you might need to adjust it at all.
The quota is calculated as:
Quota per period = CPU Limit * Period
CPU limit is 1000m = 1 core
Period = 100ms (default)
Thus, quota in a single period = 1 * 100ms = 100ms
This essentially means your service gets 100ms of CPU time to itself in every 100ms period, i.e. the equivalent of one full core. This is what they mean when they say 1000m = 1 core.
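If you’re curious what this quota looks like from inside a container, here’s a minimal sketch that reads it straight from the cgroup filesystem and derives the limit in cores. It assumes cgroup v2 mounted at /sys/fs/cgroup (the common default on recent distributions); on cgroup v1 the same numbers live in cpu.cfs_quota_us and cpu.cfs_period_us instead.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// cgroup v2 exposes the CFS quota as "<quota_us> <period_us>" in cpu.max,
	// e.g. "100000 100000" for a 1000m limit, or "max 100000" when no limit is set.
	raw, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("could not read cpu.max (not cgroup v2, or not in a container?):", err)
		return
	}
	fields := strings.Fields(string(raw))
	if fields[0] == "max" {
		fmt.Println("no CPU limit set for this cgroup")
		return
	}
	quota, _ := strconv.ParseFloat(fields[0], 64)  // CPU time allowed per period, in µs
	period, _ := strconv.ParseFloat(fields[1], 64) // CFS period, in µs (usually 100000)
	fmt.Printf("quota=%vµs period=%vµs => limit of %.2f cores\n", quota, period, quota/period)
}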
Go concurrency primer
Before we get into the why, let’s take a little trip to understand how Go models its concurrency. If you’re already aware of this, feel free to skip to the next section.
Go works on the principle of maximizing the concurrency of the application by leveraging parallelism on multi-core hardware. Its runtime scheduler is commonly described in terms of three entities: G, P, and M. Let’s break these down.
- G – Goroutines. These are user-space threads managed by the Go runtime. This is where your code runs. They’re much cheaper to create and use than a machine thread. This isn’t unique to Go; other languages support what are known as green threads, which are akin to goroutines.
- P – Processors. This is a Go construct used to schedule a goroutine onto a machine thread. The Go scheduler is responsible for moving goroutines on and off these Processors. This is the PROC in GOMAXPROCS.
- M – Machine threads. Also known as OS threads or kernel threads, as they operate in kernel space. These are expensive, so the Go scheduler is designed to reuse them as much as possible.
This is a simplified model and there are a lot of nuances to how the Go scheduler works. If you’re interested, I’ve linked an amazing talk by Kavya Joshi down below where she walks us through the design principles of the Go scheduler.
The relevant part for us here is that GOMAXPROCS controls the number of Ps your application has, and that in turn determines how many simultaneously executing Ms your application will have. Put plainly, this decides how many goroutines can be running simultaneously in your application.
During initialization, the Go runtime creates as many Ps as the value of GOMAXPROCS. By default, it’s set to the value returned by runtime.NumCPU(). You can also set it explicitly, either by calling runtime.GOMAXPROCS(n) in code or via the GOMAXPROCS environment variable.
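For reference, here’s a tiny sketch you can drop into a scratch main package to see both values; passing 0 to runtime.GOMAXPROCS reads the current setting without changing it.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// runtime.NumCPU reports the number of logical CPUs visible to the process.
	fmt.Println("NumCPU:     ", runtime.NumCPU())
	// Calling GOMAXPROCS with 0 returns the current value without modifying it.
	fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0))
}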
Why would a Go program experience throttling?
We didn’t set an explicit value for GOMAXPROCS for our service. What do you think the default value of GOMAXPROCS will be when the service is deployed on our cluster?
We specified in our deployment that the CPU request is 100m and the limit is 1000m. However, when the application is running inside a container, the Go runtime isn’t aware that the container is allotted just 1 core (1000m = 1 core). It sees all the available cores on the node. So, GOMAXPROCS is instead set to 2! Why is this? What the frick, Go?!
The Gotcha!
runtime.NumCPU() returns the total number of logical cores on the NODE, not the cores allotted to the CONTAINER! That’s because a CFS quota doesn’t change which CPUs the container is allowed to run on; it only caps how much CPU time it may use. Go derives its CPU count from the process’s CPU affinity (on Linux, via the sched_getaffinity syscall), which a quota leaves untouched. So, looking from inside a container and trying to find the available CPU cores, Go ends up seeing all cores on the node (which can be a VM or a bare-metal server). This is the root of our problems.
How are we throttled?
The CPU utilization will look something like this:
With GOMAXPROCS=2, Go may schedule goroutines onto both cores. If both cores are busy, the service burns through its 100ms quota within the first 50ms of the period, at which point CFS throttles it, i.e. prevents it from using any CPU for the remaining 50ms. This matters because your service can only resume its work when the next 100ms period starts. This is key to understanding how CPU throttling impacts your application latencies.
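You don’t have to take CFS’s word for it, either: the cgroup exposes throttling counters you can read from inside the container. Here’s a minimal sketch, again assuming cgroup v2 mounted at /sys/fs/cgroup (on cgroup v1 the equivalent counters live in the cpu controller’s cpu.stat file).

package main

import (
	"fmt"
	"os"
)

func main() {
	// cpu.stat contains lines such as:
	//   nr_periods 1200       - CFS periods that have elapsed
	//   nr_throttled 340      - periods in which this cgroup was throttled
	//   throttled_usec 950000 - total time spent throttled, in microseconds
	raw, err := os.ReadFile("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		fmt.Println("could not read cpu.stat:", err)
		return
	}
	fmt.Print(string(raw))
}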
Will setting my requests == limit prevent throttling?
Even in this scenario, without an appropriate value for GOMAXPROCS, your Go application still sees all cores on the node and spawns the same number of machine threads to schedule work on. Your usage is still spread across a large number of cores, and since you have a limit, that usage is still measured against your quota. You might still consume your quota very early in a CFS period and be throttled for the remaining duration. Remember, this effect worsens the more cores your node has.
Setting requests == limits ensures Kubernetes gives your container the “Guaranteed” QoS class. You can learn more about what that means here. If you go this route, I’d still recommend configuring a proper value for GOMAXPROCS to ensure efficient usage of the CPU.
A note on limit < GOMAXPROCS
Even if you do set GOMAXPROCS but define a CPU limit worth fewer cores than GOMAXPROCS, you can still be throttled.
Consider that you’ve tuned GOMAXPROCS=1 (it can’t go any lower than this) and your service has a CPU limit of 100m.
Using the quota formula we saw previously, let’s quickly calculate the maximum CPU time we’ll be allowed to use in a single 100ms period.
CPU Limit = 100m = 0.1 core
CPU Period = 100ms
Quota per period = 0.1 * 100ms = 10ms
If your service attempts to use more than 10ms of CPU time in a period, it will be throttled, even though it’s only running on a single core (because GOMAXPROCS=1). Assume one of these operations needs 30ms of CPU time: it gets 10ms of work done, sits throttled for the remaining 90ms of the period, and repeats this until it finishes, roughly 210ms of wall-clock time later. Your CPU usage would then look something like this:
If you do experience this, I’d recommend setting your limit to a whole CPU. This ensures your service gets uninterrupted CPU time.
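To make the latency impact concrete, here’s a back-of-the-envelope sketch; the wallTime helper is hypothetical, just for illustration, and the numbers are the ones from this example.

package main

import (
	"fmt"
	"math"
)

// wallTime estimates how long a CPU-bound operation takes in wall-clock time
// when the cgroup only allows quotaMs of CPU time per periodMs period and the
// work runs on a single core at full speed.
func wallTime(workMs, quotaMs, periodMs float64) float64 {
	periods := math.Ceil(workMs / quotaMs)      // number of CFS periods needed
	lastChunk := workMs - (periods-1)*quotaMs   // CPU time used in the final period
	return (periods-1)*periodMs + lastChunk
}

func main() {
	// 30ms of CPU work under a 100m limit (10ms quota per 100ms period):
	fmt.Printf("%.0fms of wall time\n", wallTime(30, 10, 100)) // ~210ms instead of 30ms
}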
To limit or not to limit?
CFS throttling is enforced by measuring your usage against your limits, and the big idea behind reducing throttling is to set your GOMAXPROCS according to your CPU limit. But what if you don’t set limits for your service at all?
A very popular opinion is that services should set reasonable CPU requests and not set limits. We shouldn’t assume that this alone solves our CPU woes and lets us get away without setting GOMAXPROCS. While most services get throttled because they exceed their CFS quotas, under node pressure CFS must still ensure every other pod on the node gets its request-ed amount of CPU time, and that is when your service gets squeezed back towards its CPU request.
This section from DataDog’s blog on limits & requests discusses cases where you should set limits. I agree with their argument: setting limits brings predictability to your application’s performance, whether or not there’s node pressure, which matters when you’re deploying onto a cluster shared by many services owned by many teams (as in my case).
On the other side of the ring, there are folks arguing for not setting limits. This makes sense to me when I’m the owner of the cluster and aware of what’s running on it at all times, which is unlikely to be the case if a Platform Engineering team owns and manages your cluster.
If you do not set CPU limits, I’d still recommend setting your GOMAXPROCS to somewhere around 1x or 2x your CPU request to get predictable performance.
I leave it up to the reader to decide on this.
I use CPU limits, how do I set GOMAXPROCS appropriately?
There are various ways.
automaxprocs
The one I use is automaxprocs. This is a one-line import into the cmd package of your application which sets GOMAXPROCS at startup by reading the cgroup CPU quota information inside the container.
package main

import _ "go.uber.org/automaxprocs"

func main() {
	// Your application logic here.
}
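If you’d rather control when this happens, or log the value it picks, the same module also exposes an explicit API. Here’s a small sketch using the package’s documented maxprocs.Set and maxprocs.Logger options; treat it as a sketch and check the package docs for your version.

package main

import (
	"log"

	"go.uber.org/automaxprocs/maxprocs"
)

func main() {
	// Set adjusts GOMAXPROCS to match the container's CPU quota and returns an
	// undo function; Logger routes its one-line report to our logger.
	undo, err := maxprocs.Set(maxprocs.Logger(log.Printf))
	if err != nil {
		log.Printf("failed to set GOMAXPROCS: %v", err)
	}
	if undo != nil {
		defer undo()
	}

	// Your application logic here.
}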
Via resourceFieldRef
You could also just set the GOMAXPROCS env var for your service from the resources.limits.cpu parameter using resourceFieldRef:
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
I learned of this from Howard John’s blog. Take a look to know how this works.
Wait for Go runtime to be CFS aware
This is a long-standing open issue with the Go maintainers. There doesn’t seem to be an end in sight, and until it’s resolved we’ll need to be mindful when deploying Go in containers. I expect this to eventually be incorporated into the runtime, given that Go is almost universally deployed in containers and that other runtimes, like the JVM, have already been made CFS-aware.
Will setting GOMAXPROCS to my CPU limit lower my service throughput?
Consider the alternative, where your service has CPU limits but no GOMAXPROCS set. There, your service gets throttled and loses out on CPU time to finish its work. Knowing this, it’s common sense that your service wasn’t getting sustained, continuous CPU time, which limits its ability to process more work. With a proper value for GOMAXPROCS, your service gets uninterrupted CPU time. This should help maximize your throughput and also positively impact your P99 latencies.
Uber’s automaxprocs repository contains results of a benchmark they ran comparing their service’s throughput and P99 latencies, where it’s apparent that setting GOMAXPROCS equal to the CPU quota gives the best results.
A frequently asked question that I come across is:
By setting GOMAXPROCS=1, am I limiting my application to use 1 goroutine only?
And the answer is: No! What you are limiting is how many simultaneously running goroutines your application can have at any given instant. Remember, goroutines can number in the thousands or even millions, but only a small number of them (= GOMAXPROCS) will ever be in the running state at once. The rest will either be waiting to be scheduled or blocked (for several different reasons: blocking OS calls like disk or network I/O, waiting on mutexes, listening on a channel, etc.).
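A quick way to convince yourself of this is a toy program like the one below: even with GOMAXPROCS pinned to 1, thousands of goroutines still run to completion; they just never execute in parallel.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(1) // only one goroutine executes at any instant

	var wg sync.WaitGroup
	var mu sync.Mutex
	counter := 0

	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			counter++
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println("goroutines completed:", counter) // 10000
}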
Takeaways & closing notes
A Go application running in K8s isn’t aware that it’s running inside a container and is therefore subject to cgroup limits it doesn’t know about. This behavior isn’t talked about enough, IMO. The effects of CPU throttling on performance get worse the more cores your node has. In my case, the application was deployed on a 128-core node with a limit of a measly 4 CPUs, so you can extrapolate the above examples and imagine how bad it can get.
This blog from DataDog shows the metrics to watch to monitor CPU usage and throttling for your service. These metrics are available in nearly every metrics stack (including Prometheus). I’d recommend everyone periodically take a look at their service’s CPU/memory metrics to help decide whether re-sizing is required.
Having a good understanding of how things like CPU/Memory quotas work and are enforced under the hood carries a lot more value than simply knowing things for the sake of trivia. It allows you to be mindful of how you size your service’s resource requirements. I’ve found this endeavor to be illuminating and I expect that I will probably need to re-learn this in the future, which is why I’m documenting it here, not just for you, but for a future me!
Through the course of my research, I came across some very good material on how Go & Kubernetes work under the hood. I’ve linked to some of those materials in the next section and I hope you’ll take some time to go through them.
Further reading