What `os.cpu_count()` Gets Wrong in a CPU-Limited Kubernetes Pod

CPU Count vs Cgroup Quota

I gave a pod 500m CPU and then went inside and asked Python how many CPUs it could see. The answer was 20, and that seemed worth understanding.

TL;DR: When a Gunicorn config sizes workers from os.cpu_count() inside a 500m-limited pod, it might see the node’s full CPU count instead of what the cgroup actually allows, and most of what those extra workers do is wait to be scheduled.

I start by checking what Kubernetes reports for the node:

kubectl get node minikube -o jsonpath='{.status.capacity.cpu}{"\n"}{.status.allocatable.cpu}{"\n"}'

20
20

Both values are 20, which means Kubernetes scheduled the pod onto a node that advertises twenty CPUs as both capacity and allocatable. The important trick is that a 500m limit (500 millicores, or half a CPU) does not remove CPUs from the process view; it limits how much CPU time the cgroup is allowed to spend while those CPUs remain visible.

And the YAML is where that promise gets made.

What the YAML Actually Promises

Request vs Limit: Two Jobs

The pod lives in its own namespace, python-cpu-quota-demo, and it’s set up with matching request and limit: both 500m.

kubectl get pod cpu-probe -n python-cpu-quota-demo -o yaml

resources:
  limits:
    cpu: 500m
    memory: 128Mi
  requests:
    cpu: 500m
    memory: 128Mi
qosClass: Guaranteed

These two fields sit next to each other in the spec, but they do very different jobs. The CPU request is for the scheduler before the pod is placed, because Kubernetes needs to know whether the node has room. The CPU limit is for the kernel after the pod is running, because the cgroup needs to know how much CPU time this workload may spend per period.

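Memory is crueler when you cross the line: a container that exceeds its memory limit gets OOM-killed, while one that exceeds its CPU limit is only throttled.

That quiet throttling is the gap I want to inspect from inside the pod.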

What Python Sees vs What the Kernel Enforces

Three APIs, Three Different Questions

I want three values next to each other: what Python reports, what the cgroup says, and what Linux affinity allows for the current process.

kubectl logs cpu-probe -n python-cpu-quota-demo

python 3.13.14
os.cpu_count 20
os.process_cpu_count 20
sched_getaffinity 20
cpu.max 50000 100000
cpu.cfs_quota_us missing
cpu.cfs_period_us missing
cpu.stat usage_usec 106789
user_usec 81012
system_usec 25776
nice_usec 0
nr_periods 2
nr_throttled 2
throttled_usec 8150
nr_bursts 0
burst_usec 0
cpus_allowed_list 0-19

The CPU-count answers agree with each other. Python says 20, and Linux affinity says the process is allowed to run on CPUs 0-19, so that also comes out as 20. Those answers are not wrong, because the process really can be scheduled on any of those logical CPUs.

The cgroup is looking at a different limit:

kubectl exec deployment/gunicorn-cpu-demo -n python-cpu-quota-demo -- cat /sys/fs/cgroup/cpu.max

50000 100000

This pod is using cgroup v2, which is why the value lives in cpu.max:

kubectl exec deployment/gunicorn-cpu-demo -n python-cpu-quota-demo -- stat -fc %T /sys/fs/cgroup

cgroup2fs

On cgroup v2, cpu.max has two fields: quota and period. Here the cgroup can spend 50,000 microseconds of CPU time in each 100,000 microsecond period, which works out to half a CPU:

50000 / 100000 = 0.5 CPU

That is the 500m limit, and now the mismatch is visible: Python sees twenty CPUs, affinity allows twenty CPUs, but the kernel quota allows half a CPU worth of time.
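The same arithmetic can be run in the other direction, from the limit in the YAML to the line I would expect in cpu.max. This is only a sketch: the helper names are mine, and the 100,000 microsecond period is an assumption that happens to match the value this pod reports.

```python
from decimal import Decimal

# Assumption: a 100 ms CFS period, matching the cpu.max output shown above.
DEFAULT_PERIOD_US = 100_000


def cpu_quantity_to_cpus(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to CPUs."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # The 'm' suffix means millicores: 1000m is one CPU.
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def expected_cpu_max(limit: str, period_us: int = DEFAULT_PERIOD_US) -> str:
    """Return the 'quota period' pair one would expect for a given CPU limit."""
    quota_us = int(cpu_quantity_to_cpus(limit) * period_us)
    return f"{quota_us} {period_us}"


print(expected_cpu_max("500m"))  # 50000 100000
```

Decimal keeps the millicore division exact, which avoids float rounding noise for limits like 100m.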

All three answers are technically true because they are answering different questions. Python’s os.cpu_count() is answering “how many logical CPUs are in the system?”, os.process_cpu_count() and affinity are answering “which CPUs can this thread run on?”, and the cgroup is answering “how much CPU time can this group spend?” Worker sizing for a CPU-bound sync service should start from the third question, but a lot of worker formulas read the first answer.
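To put those three questions side by side in one place, a small probe along these lines would do. It is a sketch rather than the exact script behind the log above: the function name cpu_views is hypothetical, os.process_cpu_count() only exists from Python 3.13, and os.sched_getaffinity() is not available on every platform, so both are guarded.

```python
import os
from pathlib import Path


def cpu_views(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> dict:
    """Collect the three CPU answers: system count, usable CPUs, cgroup quota."""
    views = {"os.cpu_count": os.cpu_count()}

    # Question 2a: CPUs usable by this process (Python 3.13+ only).
    process_cpu_count = getattr(os, "process_cpu_count", None)
    views["os.process_cpu_count"] = process_cpu_count() if process_cpu_count else None

    # Question 2b: the affinity mask of the current process (Linux).
    if hasattr(os, "sched_getaffinity"):
        views["sched_getaffinity"] = len(os.sched_getaffinity(0))
    else:
        views["sched_getaffinity"] = None

    # Question 3: the cgroup v2 quota and period, if the file is present.
    path = Path(cpu_max_path)
    views["cpu.max"] = path.read_text().strip() if path.is_file() else "missing"
    return views


for name, value in cpu_views().items():
    print(name, value)
```

Outside a cgroup v2 container the last line will typically print "missing", which is itself a useful signal that there is no quota file to size from.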

So what does Gunicorn do with twenty?

What Gunicorn Does With That Number

By default, if WEB_CONCURRENCY is not set, this installed version of Gunicorn starts with one worker:

kubectl exec deployment/gunicorn-cpu-demo -n python-cpu-quota-demo -- \
  python -c "import os; os.environ.pop('WEB_CONCURRENCY', None); \
  from gunicorn.config import Config; config = Config(); \
  print(config.settings['workers'].default)"

1

The problem starts when an application config overrides that default and calculates worker count from the CPU count Python reports. A common Gunicorn config pattern looks like this:

import multiprocessing

bind = "127.0.0.1:8000"
workers = multiprocessing.cpu_count() * 2 + 1

Inside this pod, multiprocessing.cpu_count() returns 20, which makes that formula produce 41 workers. The demo startup log prints both the what-if calculation and the value actually used:

kubectl logs deployment/gunicorn-cpu-demo -n python-cpu-quota-demo --limit-bytes=5000

os.cpu_count=20
os.process_cpu_count=20
cpu.max=50000 100000
quota_cpus=0.50
gunicorn_formula_from_quota=2
gunicorn_formula_from_os_cpu_count=41
WEB_CONCURRENCY=1

[2026-06-22 15:17:19 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2026-06-22 15:17:19 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2026-06-22 15:17:19 +0000] [1] [INFO] Using worker: sync
[2026-06-22 15:17:19 +0000] [12] [INFO] Booting worker with pid: 12
[22/Jun/2026:15:17:19 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "kube-probe/1.34"

gunicorn_formula_from_os_cpu_count=41 is the dangerous what-if number, while WEB_CONCURRENCY=1 is what this run actually uses. The startup lines confirm that Gunicorn boots one worker, which gives us a small baseline before adding more workers to the same half-CPU budget.
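The two what-if numbers in that log come from feeding different CPU counts into the same formula. Here is a minimal sketch of that comparison. The function name is hypothetical, and flooring the result is my assumption about how the quota-based variant turns 2 x 0.5 + 1 into a whole number.

```python
import math


def formula_workers(cpus: float) -> int:
    """Apply the common '2 x CPUs + 1' pattern, floored and never below 1."""
    return max(1, math.floor(cpus * 2 + 1))


# Fed with what Python reports inside the pod.
print(formula_workers(20))   # 41

# Fed with what the cgroup quota allows.
print(formula_workers(0.5))  # 2
```

Same formula, same pod, and a twentyfold difference in worker count depending on which layer supplies the input.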

To compare those worker counts, I need an endpoint that spends CPU in a boring and repeatable way.

The Endpoint and the Load Setup

The service exposes that endpoint as /burn. Each request runs a fixed Python loop and returns the worker pid with elapsed time:

import os
import time
from urllib.parse import parse_qs

DEFAULT_LOOPS = int(os.getenv("BURN_LOOPS", "1500000"))

def burn(iterations):
    total = 0
    for item in range(iterations):
        total = (total + (item * item)) % 1000003
    return total

def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")

    if path == "/healthz":
        body = b"ok\n"
        start_response("200 OK", [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
        return [body]

    if path != "/burn":
        body = b"use /burn\n"
        start_response("404 Not Found", [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
        return [body]

    query = parse_qs(environ.get("QUERY_STRING", ""))
    iterations = int(query.get("n", [str(DEFAULT_LOOPS)])[0])

    started = time.perf_counter()
    result = burn(iterations)
    duration_ms = (time.perf_counter() - started) * 1000

    body = (
        f"pid={os.getpid()} loops={iterations} "
        f"result={result} duration_ms={duration_ms:.2f}\n"
    ).encode()
    start_response("200 OK", [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
    return [body]

Each benchmark used the same setup. I port-forwarded the service to 127.0.0.1:18080, ran ab for 20 seconds against /burn, and captured /sys/fs/cgroup/cpu.stat from the app container before and after the run. That gave me both sides of the story: what the client saw and what the kernel counted.

With the workload fixed, the next question is what changed when I changed only the process layout inside the same tiny CPU budget.

So Which Worker Count Actually Won?

1 Worker vs 14 Workers: Outcomes

The pod spec, endpoint, CPU limit, and test duration stayed the same, while WEB_CONCURRENCY changed between runs. I also changed the client pressure: the 1-worker run used -c 5, and the 14-worker run used -c 20. That means this is a stress comparison, not a clean single-variable benchmark, but it still answers the question I cared about: what happens when too many workers fight over the same 0.5 CPU?

Workers	Why tested	Completed	ab length mismatches	Req/s	P50 latency	P95 latency
1	Explicit conservative value	101	0	5.03	1,002 ms	1,060 ms
14	Explicit high worker overcommit	46	45	2.18	6,390 ms	11,402 ms

One worker completed 101 responses with a 1,002ms median, while fourteen workers completed 46 responses and pushed the median past 6 seconds. The extra workers did not increase the amount of CPU available to the pod, so they could not buy real throughput, and instead gave the scheduler more runnable processes to pause and resume inside the same 0.5 CPU budget.

The ab length mismatches are also worth reading carefully.

The endpoint returns values like pid and duration_ms, so the response body is not a fixed length across requests. That is why ab reports length mismatches as failed requests in the 14-worker run. I still kept the column in the table so the raw client output is not hidden, but the more important numbers are completed requests, request rate, latency, and the kernel counters below.
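ab remembers the length of the first response and counts any later response with a different length as a Length failure. The sketch below rebuilds the /burn body format from the handler to show how little it takes for the length to change. The pid, result, and duration values are made up for illustration.

```python
def burn_body(pid: int, loops: int, result: int, duration_ms: float) -> bytes:
    """Format a response body the same way the /burn handler does."""
    return (
        f"pid={pid} loops={loops} "
        f"result={result} duration_ms={duration_ms:.2f}\n"
    ).encode()


# Hypothetical values: same worker, same loop count, different timing.
under_a_second = burn_body(pid=12, loops=1500000, result=1, duration_ms=987.65)
over_a_second = burn_body(pid=12, loops=1500000, result=1, duration_ms=1234.56)

# One extra digit in duration_ms is enough to change the body length.
print(len(under_a_second), len(over_a_second))
print(len(under_a_second) != len(over_a_second))  # True
```

That also fits the clean 1-worker run: one pid and durations that stayed in the same digit range would keep every body the same length.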

With one worker, the full ab output:

ab -t 20 -c 5 -s 60 -q http://127.0.0.1:18080/burn

Concurrency Level:      5
Time taken for tests:   20.084 seconds
Complete requests:      101
Failed requests:        0
Requests per second:    5.03 [#/sec] (mean)
Time per request:       994.273 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   1002
  95%   1060
 100%   1299 (longest request)

With 14 workers:

ab -t 20 -c 20 -s 60 -q http://127.0.0.1:18080/burn

Concurrency Level:      20
Time taken for tests:   21.133 seconds
Complete requests:      46
Failed requests:        45
   (Connect: 0, Receive: 0, Length: 45, Exceptions: 0)
Requests per second:    2.18 [#/sec] (mean)
Time per request:       9188.476 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   6390
  95%  11402
 100%  12312 (longest request)
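The mean "Time per request" line in both outputs is not measured per request. ab derives it from concurrency, total time, and completed requests. A quick sketch of that calculation, using the rounded values printed above (so it lands within a fraction of a millisecond of ab's own figure rather than matching it exactly):

```python
def ab_mean_time_per_request_ms(concurrency: int, time_taken_s: float, completed: int) -> float:
    """Mean time per request as ab reports it: concurrency * time / completed."""
    return concurrency * time_taken_s * 1000 / completed


# 1-worker run: -c 5, 20.084 s, 101 completed requests.
print(round(ab_mean_time_per_request_ms(5, 20.084, 101), 1))

# 14-worker run: -c 20, 21.133 s, 46 completed requests.
print(round(ab_mean_time_per_request_ms(20, 21.133, 46), 1))
```

Because concurrency sits in the numerator, part of the jump in that mean comes from the higher -c value, which is one more reason to read this as a stress comparison.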

The client numbers tell us the service got slower, but cpu.stat tells us what the kernel was doing while the clients waited. Before the 14-worker run, the cgroup had 24 throttled periods and just under 1 second of throttled time accumulated since the pod started. After the load test, the same counters looked like this:

nr_periods 328
nr_throttled 272
throttled_usec 285826526

That is 248 newly throttled periods and about 285 seconds of newly accumulated throttled time during roughly 20 seconds of wall-clock load. The number can be much larger than wall time because the kernel tracks throttled time separately on each CPU where the cgroup is throttled and cpu.stat reports the sum. With 14 workers spread across CPUs and competing for the same 0.5 CPU quota, blocked time piles up in parallel.

The 1-worker run is the same workload with fewer processes trying to spend the quota. In that run, the cgroup had 16 throttled periods and about 0.7 seconds of throttled time before the test. After the load test:

nr_periods 830
nr_throttled 220
throttled_usec 10877871

That is 204 newly throttled periods and about 10.2 seconds of newly accumulated throttled time across the run. The single worker still hit the CPU limit, because the endpoint is CPU-bound and the quota is small, but it spent the quota on useful work instead of spreading it across a crowd of workers.
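The before-and-after bookkeeping is easy to script. This is a sketch with hypothetical helper names: it parses cpu.stat text into counters, diffs two snapshots, and computes the share of periods that were throttled. The counters are cumulative since the cgroup was created, so the shares below include the small amount of pre-test activity.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'name value' lines from cgroup v2 cpu.stat into integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def counter_delta(before: dict, after: dict) -> dict:
    """Subtract snapshots, keeping only counters present in both."""
    return {key: after[key] - before[key] for key in after if key in before}


def throttled_share(stats: dict) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


after_14 = parse_cpu_stat("nr_periods 328\nnr_throttled 272\nthrottled_usec 285826526\n")
after_1 = parse_cpu_stat("nr_periods 830\nnr_throttled 220\nthrottled_usec 10877871\n")

print(counter_delta({"nr_throttled": 24}, after_14))  # {'nr_throttled': 248}
print(f"{throttled_share(after_14):.1%}")  # 82.9%
print(f"{throttled_share(after_1):.1%}")   # 26.5%
```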

At this point the kernel counters have already told the story, but dashboards are where people usually look first. So I wanted to know whether Prometheus would show the same failure mode, or whether this would stay hidden unless you went into cpu.stat.

The Prometheus View

This cluster runs kube-prometheus-stack and scrapes cAdvisor Prometheus metrics from the kubelet. Before writing a query, I first checked which container CPU metrics were actually available:

curl -sG 'http://localhost:9090/api/v1/label/__name__/values' \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
cadvisor = [m for m in d['data'] if 'container_cpu' in m]
print('\n'.join(sorted(cadvisor)))
"

container_cpu_cfs_periods_total
container_cpu_cfs_throttled_periods_total
container_cpu_usage_seconds_total

Prometheus gives me throttled periods here, while the raw throttled seconds still have to come from cpu.stat. container_cpu_cfs_throttled_periods_total is present, but container_cpu_cfs_throttled_seconds_total is not exposed by the kubelet cAdvisor on this cluster.

One other detail matters for the query. On this Minikube setup, cAdvisor emits these metrics at pod scope without a container label, which means queries filtering on container="app" return empty results. The actual label set looks like this:

{
  "__name__": "container_cpu_cfs_throttled_periods_total",
  "namespace": "python-cpu-quota-demo",
  "pod": "gunicorn-cpu-demo-694589fb97-45cq2",
  "node": "minikube"
}

There is no container key there, so the throttle ratio query filters by namespace and pod instead:

rate(container_cpu_cfs_throttled_periods_total{
  namespace="python-cpu-quota-demo",
  pod=~"gunicorn-cpu-demo-.*"
}[1m])
/
rate(container_cpu_cfs_periods_total{
  namespace="python-cpu-quota-demo",
  pod=~"gunicorn-cpu-demo-.*"
}[1m])

During the 14-worker load test, that query showed about 83% of measured periods hitting throttling. Because this is a 1-minute rate() query, the exact value moves as the load window ages out of Prometheus’ lookback range:

throttle_ratio: 83.4 % | pod: gunicorn-cpu-demo-694589fb97-45cq2

Both runs throttle because the workload is CPU-bound and the quota is tight, but the cost of that throttling is different. With 14 workers, the run completed 46 requests while the cgroup spent a large share of measured periods throttled. With 1 worker, the cgroup still throttled, but 101 requests completed cleanly because the single worker spent its quota on the loop instead of sharing it across extra processes.
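If you want the throttle ratio from a script instead of the Prometheus UI, the query can be assembled and pointed at the instant-query endpoint (/api/v1/query). The sketch below only builds the PromQL string and the URL. The function names are mine, and the base URL assumes the same localhost:9090 port-forward used for the metric listing.

```python
from urllib.parse import urlencode


def throttle_ratio_query(namespace: str, pod_regex: str, window: str = "1m") -> str:
    """Build the throttled-periods / total-periods ratio, filtered by pod."""
    # No container label here, because this cluster emits the metric at pod scope.
    selector = f'namespace="{namespace}",pod=~"{pod_regex}"'
    return (
        f"rate(container_cpu_cfs_throttled_periods_total{{{selector}}}[{window}])"
        " / "
        f"rate(container_cpu_cfs_periods_total{{{selector}}}[{window}])"
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


promql = throttle_ratio_query("python-cpu-quota-demo", "gunicorn-cpu-demo-.*")
print(promql)
print(instant_query_url("http://localhost:9090", promql))
```

Fetching that URL with curl or urllib should return a JSON body whose result values are the ratio per matching pod.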

Now the fix is less mysterious: choose worker count from the CPU time the cgroup can spend, not from the number of CPUs Python can see.

Reading the Quota Before Sizing Workers

Quota-Aware Worker Sizing Flow

Once you put the pieces next to each other, the chain from YAML to latency is short.

The safer starting point is to read the quota before sizing workers. A quota-aware helper can read cgroup v2 first and fall back to cgroup v1 CFS bandwidth files:

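import math
import os
from pathlib import Path

def cgroup_cpu_quota():
    """Return the CPU budget in CPUs, for example 0.5 for a 500m limit."""
    # cgroup v2: cpu.max holds "<quota> <period>", or "max <period>" when unlimited.
    cpu_max = Path("/sys/fs/cgroup/cpu.max")
    if cpu_max.exists():
        quota, period = cpu_max.read_text().strip().split()
        if quota != "max":
            return int(quota) / int(period)

    # cgroup v1: the CFS bandwidth files live under the cpu controller mount.
    for cgroup_cpu_dir in (
        Path("/sys/fs/cgroup/cpu"),
        Path("/sys/fs/cgroup/cpu,cpuacct"),
    ):
        quota_file = cgroup_cpu_dir / "cpu.cfs_quota_us"
        period_file = cgroup_cpu_dir / "cpu.cfs_period_us"
        if not quota_file.exists() or not period_file.exists():
            continue

        quota = int(quota_file.read_text().strip())
        period = int(period_file.read_text().strip())
        # A quota of -1 means no limit, so only positive values count.
        if quota > 0:
            return quota / period

    # No quota found: fall back to the CPUs this process may run on.
    return len(os.sched_getaffinity(0))

quota_cpus = cgroup_cpu_quota()
# Round down, but never below one worker (0.5 CPU still gets 1).
cpu_bound_workers = max(1, math.floor(quota_cpus))
# An explicit WEB_CONCURRENCY still wins over the computed value.
workers = int(os.getenv("WEB_CONCURRENCY", cpu_bound_workers))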

This helper does two separate things on purpose. It reads the kernel-enforced quota first, then still leaves WEB_CONCURRENCY as an override, because the quota is the right starting point but the workload decides the final number.

Key Takeaways

Inside a CPU-limited Kubernetes pod, Python can report the node’s full CPU count while the kernel enforces a much smaller CPU budget through the cgroup. In this lab, os.cpu_count() returned 20, but cpu.max was 50000 100000, which means the pod was limited to 0.5 CPU.
A Gunicorn worker formula based on os.cpu_count() can produce a worker count that has nothing to do with the pod’s actual CPU quota. In the demo pod, the common workers = multiprocessing.cpu_count() * 2 + 1 pattern would calculate 41 workers from a container that only had half a CPU worth of time.
More workers did not mean more completed work for this CPU-bound sync endpoint. Under the same 500m limit, one worker completed 101 requests with a 1,002ms median latency, while fourteen workers (tested at higher client concurrency) completed 46 requests with a 6,390ms median because they were all sharing the same half-CPU quota.

Wrapping Up

That was the mismatch hiding behind the healthy pod: Python was honest, Kubernetes was quiet, and the worker math was looking at the wrong layer.

Until next time!