How Kubernetes Assigns GPUs: Tracing the Device Plugin Path from Pod Spec to CUDA

I asked Kubernetes for one GPU, the pod ran CUDA, and then I asked the API which physical GPU the container got. The answer was still just nvidia.com/gpu: 1, and that seemed worth following all the way down to the node.

TL;DR: nvidia.com/gpu: 1 is enough for the scheduler to find a node with GPU capacity, but it is not the physical device assignment. In this minikube run, kubelet got the GPU UUID from the NVIDIA device plugin over nvidia-gpu.sock, wrote it to a local checkpoint file, and the container ended up with NVIDIA_VISIBLE_DEVICES set to that UUID while the pod object still showed only the original one-GPU request.

What the Scheduler Sees

I start with what the scheduler can see about GPUs: the node’s advertised capacity.

kubectl get nodes -o json | jq '.items[].status.capacity'

{
  "cpu": "20",
  "ephemeral-storage": "949626612Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32603084Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}

The node advertises one nvidia.com/gpu, so a pod that asks for one GPU can be placed here. That number is enough for scheduling, but it does not say which GPU UUID, which device node, or which CUDA-visible device the container will receive.

After the CUDA pod is running, I check the pod object for the same detail:

kubectl get pod before-cuda-demo -o json | jq '.spec.containers[].resources'

{
  "limits": {
    "nvidia.com/gpu": "1"
  },
  "requests": {
    "nvidia.com/gpu": "1"
  }
}

The pod object still shows the request and limit, not the physical GPU. That means the scheduler has answered only the first question: which node has enough advertised GPU capacity.

The next question is narrower and more local: which device on that node did the container get?

That is where the device plugin socket comes in.

The Device Plugin Socket

[Figure: Device Plugin Registration Flow. Sequence diagram with participants nvidia-plugin, kubelet.sock, device-manager, and node-status; messages: Register GPU, forward reg, ACK, ListAndWatch, update capacity.]

The first place to look is kubelet’s device plugin directory on the node, because device plugins register with kubelet over Unix sockets:

minikube ssh -- sudo ls -la /var/lib/kubelet/device-plugins/

srwxr-xr-x 1 root root 0 Mar 21 14:49 kubelet.sock
srwxr-xr-x 1 root root 0 Mar 21 14:49 nvidia-gpu.sock

There are two sockets here. kubelet.sock is kubelet’s registration socket, and nvidia-gpu.sock is the socket owned by the NVIDIA device plugin.

The plugin is running as a DaemonSet pod in kube-system:

kubectl get pods -n kube-system | grep nvidia

nvidia-device-plugin-daemonset-b4fh7   1/1   Running   1 (5d ago)   6d

When that pod starts, it connects to kubelet.sock and registers itself as the handler for nvidia.com/gpu. I check the kubelet log for the registration line:

minikube ssh -- sudo journalctl -u kubelet | grep "Got registration request"

I0321 20:13:22.532584   77328 server.go:160] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"

That line is the handoff point. From then on, kubelet knows which local plugin owns nvidia.com/gpu, and the plugin can report devices back through the device plugin API.
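
On a node with several device plugins, it helps to reduce the kubelet log to just the resource names that registered. The following is a minimal sketch, not something from the original run: the helper name registered_resources is hypothetical, and it assumes the log line format shown above, with a resourceName="..." field on each registration line.

```shell
# Hypothetical helper: read kubelet log lines on stdin and print each
# resource name that appears on a "Got registration request" line.
# Assumes the klog format shown above (resourceName="..." on the line).
registered_resources() {
  sed -n '/Got registration request/s/.*resourceName="\([^"]*\)".*/\1/p' \
    | sort -u
}

# Example with the registration line from this run.
printf '%s\n' 'I0321 20:13:22.532584   77328 server.go:160] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"' \
  | registered_resources
```

In this lab it could be fed directly from the node: minikube ssh -- sudo journalctl -u kubelet | registered_resources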

Once kubelet has a plugin, the next useful thing to watch is the moment a pod asks for a GPU.

The 5-Millisecond Allocation Path

[Figure: GPU Allocation gRPC Sequence. Sequence diagram with participants kubelet DM, nvidia-gpu.sock, and checkpoint; messages: inspect gpu request, GetPreferredAllocation, GPU-11ba558d, Allocate(GPU-11ba558d), AllocateResponse, write checkpoint.]

Once the pod lands on this node, kubelet has to turn nvidia.com/gpu: 1 into a real device. That happens through device-plugin gRPC calls over nvidia-gpu.sock.

Those calls are not visible at kubelet’s default verbosity. I raise kubelet verbosity on the minikube node so the device manager prints the allocation path:

minikube ssh -- "sudo sed -i 's/verbosity: 0/verbosity: 4/' /var/lib/kubelet/config.yaml && sudo systemctl restart kubelet"

After restarting kubelet and creating the GPU pod again, I filter for the device manager lines that mention the resource name or the checkpoint write:

minikube ssh -- sudo journalctl -u kubelet | grep -E "Looking for needed|Need devices to allocate|GetPreferredAllocation|Making allocation request|Checkpoint file written" | grep -E "nvidia.com/gpu|kubelet_internal_checkpoint"

I0322 09:19:09.136353   53696 manager.go:835] "Looking for needed resources" resourceName="nvidia.com/gpu" pod="default/before-cuda-demo" containerName="cuda" needed=1
I0322 09:19:09.136380   53696 manager.go:600] "Need devices to allocate for pod" deviceNumber=1 resourceName="nvidia.com/gpu" podUID="f2b9f4e5-..." containerName="cuda"
I0322 09:19:09.136391   53696 manager.go:1018] "Issuing a GetPreferredAllocation call for container" resourceName="nvidia.com/gpu" containerName="cuda"
I0322 09:19:09.137452   53696 manager.go:881] "Making allocation request for device plugin" devices=["GPU-11ba558d-3437-2335-2fd8-a78c8502f87f"] resourceName="nvidia.com/gpu" pod="default/before-cuda-demo" containerName="cuda"
I0322 09:19:09.142098   53696 manager.go:502] "Checkpoint file written" checkpoint="kubelet_internal_checkpoint"

Those five lines are the allocation path in miniature. The first timestamp is .136353, and the checkpoint write lands at .142098, so this run spent about 5.7 milliseconds (5,745 microseconds) between “this container needs a GPU” and “the result is written locally.”
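
Subtracting log timestamps by eye is easy to get wrong, so here is a small sketch that does it for any two klog timestamps. The helper name klog_delta_ms is hypothetical (it is not part of kubelet or kubectl), and it assumes both timestamps come from the same day and carry six fractional digits, as they do in the log above.

```shell
# Hypothetical helper: milliseconds between two klog timestamps of the
# form HH:MM:SS.uuuuuu, assuming both were logged on the same day.
klog_delta_ms() {
  awk -v a="$1" -v b="$2" '
    # Convert a timestamp to whole microseconds to avoid float drift.
    function to_us(t,   p) {
      split(t, p, /[:.]/)
      return ((p[1] * 60 + p[2]) * 60 + p[3]) * 1000000 + p[4]
    }
    BEGIN { printf "%.3f\n", (to_us(b) - to_us(a)) / 1000 }'
}

# First device manager line versus the checkpoint write from this run.
klog_delta_ms 09:19:09.136353 09:19:09.142098
```

The same helper works for any other pair of kubelet log lines, as long as both were logged on the same day.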

The first two lines are kubelet reading the container’s resource request. It sees nvidia.com/gpu: 1, so the device manager needs one device for the cuda container.

At .136391, kubelet issues GetPreferredAllocation. The next allocation log line contains the chosen device ID: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f.

This node has a single GPU, so there is only one candidate in this run.

The next relevant log line is the Allocate() request for that device. By the last line, kubelet has written the result to a local checkpoint file, and the allocation is done.
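
To pull only the chosen device IDs out of the log, the "Making allocation request" line can be filtered on its own. This is a sketch under assumptions: the helper name allocated_devices is hypothetical, and the comma-separated, quoted list format for more than one device is my assumption, since this run only ever shows a single ID.

```shell
# Hypothetical helper: read kubelet log lines on stdin and print the
# device IDs from "Making allocation request" lines, one ID per line.
# Assumes devices=["id1","id2"] formatting (only one ID appears in this run).
allocated_devices() {
  sed -n '/Making allocation request/s/.*devices=\[\([^]]*\)].*/\1/p' \
    | tr ',' '\n' \
    | tr -d '" '
}

# Example with the allocation line from this run.
printf '%s\n' 'I0322 09:19:09.137452   53696 manager.go:881] "Making allocation request for device plugin" devices=["GPU-11ba558d-3437-2335-2fd8-a78c8502f87f"] resourceName="nvidia.com/gpu" pod="default/before-cuda-demo" containerName="cuda"' \
  | allocated_devices
```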

The log tells us kubelet wrote a checkpoint, so the next step is to read that file:

minikube ssh -- sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

{
  "Data": {
    "PodDeviceEntries": [{
      "PodUID": "f2b9f4e5-ba3b-418e-ab8c-c1751a0619d8",
      "ContainerName": "cuda",
      "ResourceName": "nvidia.com/gpu",
      "DeviceIDs": { "-1": ["GPU-11ba558d-3437-2335-2fd8-a78c8502f87f"] },
      "AllocResp": "CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS0xMWJhNTU4ZC0zNDM3LTIzMzUtMmZkOC1hNzhjODUwMmY4N2Y="
    }],
    "RegisteredDevices": {
      "nvidia.com/gpu": ["GPU-11ba558d-3437-2335-2fd8-a78c8502f87f"]
    }
  }
}

Now the missing UUID is visible. The checkpoint has the pod UID, the container name, the resource name, and the assigned device ID, all stored on the node rather than in the pod spec.

What Gets Injected Into the Container

The checkpoint also has an AllocResp field. That base64 string is the serialized protobuf response that the device plugin returned to kubelet for this container (the per-container part of the AllocateResponse), so I decode it before looking inside the container:

echo "CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS0xMWJhNTU4ZC0zNDM3LTIzMzUtMmZkOC1hNzhjODUwMmY4N2Y=" | base64 -d | strings

NVIDIA_VISIBLE_DEVICES
GPU-11ba558d-3437-2335-2fd8-a78c8502f87f

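The decoded response contains NVIDIA_VISIBLE_DEVICES and the same GPU UUID from the checkpoint. (Depending on the strings implementation, the UUID line may start with a stray “(”: that is the protobuf length byte 0x28, meaning 40 characters, not part of the value.) That is the first bridge from the Kubernetes resource request to what CUDA will see inside the container.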
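
strings is a blunt tool for a protobuf message. As a rough sketch, the same bytes can be walked field by field in plain shell. The function name decode_alloc_env is hypothetical, and the parser handles only what this particular response needs: length-delimited fields whose lengths fit in one byte (under 128), with the environment map in field 1, which is how the bytes in this checkpoint are laid out.

```shell
# Hypothetical helper: read a base64 AllocResp value on stdin and print
# each environment map entry as KEY=VALUE.
# Sketch only: assumes every field is length-delimited with a one-byte
# length (< 128), which holds for the response in this checkpoint.
decode_alloc_env() {
  base64 -d | od -An -v -tu1 | awk '
    { for (k = 1; k <= NF; k++) b[n++] = $k + 0 }   # bytes as numbers
    END {
      i = 0
      while (i < n) {
        tag = b[i]; size = b[i + 1]; i += 2
        if (tag != 10) { i += size; continue }      # not field 1: skip it
        stop = i + size; key = ""; val = ""
        while (i < stop) {                          # walk one map entry
          ftag = b[i]; fsize = b[i + 1]; i += 2
          s = ""
          for (j = 0; j < fsize; j++) s = s sprintf("%c", b[i + j])
          i += fsize
          if (ftag == 10) key = s                   # entry field 1: key
          else if (ftag == 18) val = s              # entry field 2: value
        }
        print key "=" val
      }
    }'
}

# Example with the AllocResp value from this checkpoint.
printf '%s' 'CkIKFk5WSURJQV9WSVNJQkxFX0RFVklDRVMSKEdQVS0xMWJhNTU4ZC0zNDM3LTIzMzUtMmZkOC1hNzhjODUwMmY4N2Y=' \
  | decode_alloc_env
```

Walking the length prefixes, instead of scanning for printable runs, keeps the length bytes out of the decoded key and value.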

The next bridge is the container runtime. On this minikube node, Docker is configured to use the NVIDIA runtime by default:

minikube ssh -- cat /etc/docker/daemon.json

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime"
    }
  }
}

That tells me which runtime path started the container. Now I want to check the result from inside a running container, because the pod spec did not declare any /dev/nvidia* device nodes.

The vectorAdd sample exits as soon as it finishes, so I create a second pod with the same image and GPU request but a sleep command:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inspect
spec:
  containers:
  - name: cuda
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

This cluster has one advertised GPU, so gpu-inspect can run only after the previous GPU pod releases it. In this run, the vectorAdd pod had completed, and the following kubectl exec commands show that gpu-inspect was running.

First I list the NVIDIA device nodes visible inside the container:

kubectl exec gpu-inspect -- ls /dev/ | grep nvidia

nvidia-uvm
nvidia-uvm-tools
nvidia0
nvidiactl

Those device nodes were not in the pod YAML. They appeared after kubelet, the device plugin response, Docker, and the NVIDIA runtime finished preparing the container.

The allocation response contained an environment variable, so I check the NVIDIA variables next:

kubectl exec gpu-inspect -- env | grep NVIDIA

NVIDIA_VISIBLE_DEVICES=GPU-11ba558d-3437-2335-2fd8-a78c8502f87f
NVIDIA_REQUIRE_CUDA=cuda>=11.7 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 ...
NVIDIA_DRIVER_CAPABILITIES=compute,utility

NVIDIA_VISIBLE_DEVICES matches the UUID in the checkpoint. The same device ID now appears in three places: kubelet’s local checkpoint, the decoded allocation response, and the running container’s environment.

The device nodes are not the whole story, because CUDA also needs driver-related files. I list the NVIDIA-related mount points from /proc/mounts:

kubectl exec gpu-inspect -- cat /proc/mounts | grep nvidia | awk '{print $2}'

/proc/driver/nvidia
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.163.01
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.163.01
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.163.01
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.163.01
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.163.01
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.163.01
/usr/lib/firmware/nvidia/550.163.01/gsp_ga10x.bin
/usr/lib/firmware/nvidia/550.163.01/gsp_tu10x.bin
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia0
/proc/driver/nvidia/gpus/0000:01:00.0

This container has 17 NVIDIA-related mount points. They include host-provided driver binaries, shared libraries, firmware blobs, device nodes, and proc entries.
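
Counting and grouping those paths by hand does not scale to larger listings, so here is a small sketch that does the grouping. The helper name classify_nvidia_mounts and the category rules are my own assumptions based on the paths in this listing, not anything defined by the NVIDIA runtime.

```shell
# Hypothetical helper: read mount-point paths (one per line) on stdin and
# print a count per rough category. The path rules are assumptions drawn
# from the listing above.
classify_nvidia_mounts() {
  awk '
    NF == 0          { next }
    /^\/dev\//       { n["device"]++;   next }
    /^\/proc\//      { n["proc"]++;     next }
    /^\/usr\/bin\//  { n["binary"]++;   next }
    /\/firmware\//   { n["firmware"]++; next }
    /\.so(\.|$)/     { n["library"]++;  next }
                     { n["other"]++ }
    END {
      count = split("binary library firmware device proc other", order, " ")
      for (i = 1; i <= count; i++) printf "%s=%d\n", order[i], n[order[i]]
    }'
}

# Example with three of the paths from the listing.
classify_nvidia_mounts <<'EOF'
/usr/bin/nvidia-smi
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.163.01
/dev/nvidia0
EOF
```

Fed with the full 17-line listing above, this sketch reports 3 binaries, 6 libraries, 2 firmware files, 4 device nodes, and 2 proc entries.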

At this point the ordinary one-GPU path is visible end to end: the pod asks for a count, kubelet gets a UUID from the plugin, and the runtime makes that UUID usable inside the container.

Time-Slicing the Same GPU

So far, one advertised GPU has meant one GPU pod at a time. To see what time-slicing changes, I first check the NVIDIA resource capacity before changing the plugin config:

kubectl get node -o json | jq '.items[0].status.capacity | to_entries[] | select(.key | startswith("nvidia"))'

{
  "key": "nvidia.com/gpu",
  "value": "1"
}

The node reports nvidia.com/gpu: 1. Now I give the NVIDIA device plugin a time-slicing config with three replicas:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 3
EOF

configmap/time-slicing-config created

The ConfigMap does nothing until the plugin reads it, so I redeploy the DaemonSet with that config mounted at /config:

kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0
        name: nvidia-device-plugin-ctr
        env:
        - name: CONFIG_FILE
          value: /config/any
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: config
          mountPath: /config
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: config
        configMap:
          name: time-slicing-config
EOF

The important pieces are CONFIG_FILE=/config/any and the ConfigMap volume mounted at /config. Once the plugin pod is running again, I check the node capacity:

kubectl get nodes -o json | jq '.items[0].status.capacity["nvidia.com/gpu"]'

"3"

The node now advertises three nvidia.com/gpu units, so the scheduler treats it as having three GPU slots. I create three pods that each ask for one:

for i in 1 2 3; do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ts-pod-$i
spec:
  containers:
  - name: cuda
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
done

pod/ts-pod-1 created
pod/ts-pod-2 created
pod/ts-pod-3 created

All three pods were created. Now the question is whether they received different physical devices or the same one, so I print NVIDIA_VISIBLE_DEVICES from each container:

for i in 1 2 3; do
  echo "ts-pod-$i: $(kubectl exec ts-pod-$i -- printenv NVIDIA_VISIBLE_DEVICES)"
done

ts-pod-1: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f
ts-pod-2: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f
ts-pod-3: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f

Before time-slicing, the node reported nvidia.com/gpu: 1; after this ConfigMap, it reported nvidia.com/gpu: 3. The three containers all printed the same UUID, which means this lab showed a scheduling-count change rather than hardware partitioning.
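
With more pods, comparing UUIDs by eye gets tedious. The sketch below counts how many distinct device IDs appear in output shaped like the loop above ("pod-name: device-id" per line). The helper name distinct_devices is hypothetical.

```shell
# Hypothetical helper: read "pod-name: device-id" lines on stdin and
# print how many distinct device IDs appear in the second column.
distinct_devices() {
  awk 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (id in seen) n++; print n }'
}

# Example with the three lines printed in this run.
printf '%s\n' \
  'ts-pod-1: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f' \
  'ts-pod-2: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f' \
  'ts-pod-3: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f' \
  | distinct_devices
```

A result of 1 for three pods is the time-slicing signature: three scheduling slots, one physical device.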

Before continuing, clean up the time-slicing setup and restore the default device plugin:

kubectl delete pod ts-pod-1 ts-pod-2 ts-pod-3
kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system
kubectl delete configmap time-slicing-config -n kube-system
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

Wait for the plugin pod to come back and verify the node reports one GPU again:

kubectl get nodes -o json | jq '.items[0].status.capacity["nvidia.com/gpu"]'

"1"

What Happens When Things Break

So far, kubelet can reach the plugin and allocation works.

The next thing I want to know is what breaks when the plugin disappears. I start by deleting the running NVIDIA device plugin pod:

kubectl get pods -n kube-system | grep nvidia

nvidia-device-plugin-daemonset-bnmkq   1/1   Running   1 (5d ago)   6d

kubectl delete pod nvidia-device-plugin-daemonset-bnmkq -n kube-system

pod "nvidia-device-plugin-daemonset-bnmkq" deleted

The DaemonSet creates a replacement pod, but the interesting part is what kubelet sees during the gap. I check the kubelet logs for removal and registration:

minikube ssh -- sudo journalctl -u kubelet --since "1 min ago" | grep -E "Removed device plugin|Got registration request"

W0322 10:05:41.218903   53696 manager.go:312] "Removed device plugin for resource" resourceName="nvidia.com/gpu"
I0322 10:05:43.894122   53696 server.go:160] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"

In this run, there were about 2.7 seconds between kubelet removing the old plugin and receiving a new registration.

That raises a different question: does a running GPU container lose its device nodes when the plugin pod goes away?

I check that gpu-inspect can see /dev/nvidia0, then delete the plugin pod while that container is running:

kubectl exec gpu-inspect -- ls /dev/nvidia0
/dev/nvidia0

kubectl delete -n kube-system $(kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o name | head -1) --force

After the plugin pod is deleted, I check the same device path again:

kubectl exec gpu-inspect -- ls /dev/nvidia0

/dev/nvidia0

/dev/nvidia0 is still there. In this run, removing the plugin pod did not remove device nodes that had already been mounted into a running container.

That does not mean the plugin is optional. New allocations still need the plugin, so I delete the entire DaemonSet and check what the node reports:

kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system

kubectl get nodes -o json | jq '.items[0].status.allocatable["nvidia.com/gpu"]'

"0"

Allocatable dropped to zero. With no allocatable GPU left, I create a new test pod, gpu-pending-test, with the same one-GPU request (creation not shown here) and check its status:

kubectl get pod gpu-pending-test

NAME               READY   STATUS    RESTARTS   AGE
gpu-pending-test   0/1     Pending   0          9s

The new GPU pod is pending because the scheduler sees zero allocatable GPUs.

The plugin failure path explains new scheduling, but kubelet has another piece of local state: the checkpoint file.

Kubelet stores device allocation data in /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint, so corrupting that file is a direct way to see how kubelet reacts when its local allocation record is unreadable.

I back up the checkpoint, write garbage into it, and restart kubelet:

minikube ssh -- "sudo cp /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.bak"
minikube ssh -- "echo 'CORRUPTED' | sudo tee /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint"
minikube ssh -- "sudo systemctl restart kubelet"

Then I check kubelet’s checkpoint-related logs:

minikube ssh -- sudo journalctl -u kubelet --since "30 sec ago" | grep -i checkpoint

E0328 20:53:56.625089  118745 manager.go:347] "Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date" err="invalid character 'C' looking for beginning of value"

Kubelet found the corrupted file, failed to parse it as JSON, logged an error that device allocation info may not be up to date, and kept running. The running pod still had its GPU mounts because those mounts were created during container startup.
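
The error message points at the very first character of the file, so even a crude pre-check would have caught this particular corruption before restarting kubelet. The sketch below is only that: a first-character sanity check, not a JSON parser. The helper name checkpoint_looks_like_json is hypothetical.

```shell
# Hypothetical pre-check: succeed only if the first non-blank character
# of the given file is "{". This is NOT a JSON parser; it only catches
# gross corruption like the CORRUPTED test above.
checkpoint_looks_like_json() {
  case "$(tr -d ' \t\r\n' < "$1" | cut -c1)" in
    '{') return 0 ;;
    *)   return 1 ;;
  esac
}

# Example against a throwaway file, so no real checkpoint is touched.
sample=$(mktemp)
printf 'CORRUPTED\n' > "$sample"
if checkpoint_looks_like_json "$sample"; then
  echo "looks like JSON"
else
  echo "does not look like JSON"
fi
rm -f "$sample"
```

On the node, the file to check would be /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.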

Restore the backup before continuing:

minikube ssh -- "sudo cp /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.bak /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint"
minikube ssh -- "sudo systemctl restart kubelet"

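One more node-local detail is the socket itself. This test needs the plugin running, so the DaemonSet has to be recreated first if it is still deleted from the previous step. Then I delete the plugin pod again: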

kubectl delete -n kube-system $(kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o name | head -1)

pod "nvidia-device-plugin-daemonset-gvczz" deleted

Then I check whether kubelet saw the old gRPC stream end and a new registration arrive:

minikube ssh -- "sudo journalctl -u kubelet --since '30 sec ago' | grep -E 'ListAndWatch ended|Got registration'"

E0328 20:59:23.736830  118883 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
I0328 20:59:24.920559  118883 server.go:160] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"

The first line is kubelet discovering that its gRPC stream to the old socket hit EOF. The plugin on the other end is gone.

About one second later, the replacement plugin pod has started, created a fresh nvidia-gpu.sock, and registered with kubelet.

The socket file’s modification time shows whether the socket changed; stat -c '%Y' prints it as seconds since the epoch. Before deleting the plugin pod:

minikube ssh -- "sudo stat -c '%Y %n' /var/lib/kubelet/device-plugins/nvidia-gpu.sock"

1774731875 /var/lib/kubelet/device-plugins/nvidia-gpu.sock

After deleting the plugin pod and waiting a few seconds:

minikube ssh -- "sudo stat -c '%Y %n' /var/lib/kubelet/device-plugins/nvidia-gpu.sock"

1774731959 /var/lib/kubelet/device-plugins/nvidia-gpu.sock

The timestamp changed, so this run saw a fresh nvidia-gpu.sock after the replacement plugin started.
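
The comparison can be scripted so it is repeatable across runs. This is a minimal sketch: the helper name socket_age_change is hypothetical, and it takes the two stat values as plain arguments instead of reading them from the node.

```shell
# Hypothetical helper: compare two `stat -c %Y` values (modification time
# in seconds since the epoch) taken before and after a plugin restart.
socket_age_change() {
  before=$1
  after=$2
  if [ "$after" -gt "$before" ]; then
    echo "recreated, $((after - before))s newer"
  else
    echo "unchanged"
  fi
}

# Example with the two timestamps from this run.
socket_age_change 1774731875 1774731959
```

A newer modification time is indirect evidence. Comparing inode numbers with stat -c '%i' before and after would be a more direct check that the file itself was replaced.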

The pattern is the same as the successful allocation path. The plugin socket, checkpoint file, device assignment, and replacement behavior are all node-local details.

Where the Device Mapping Lives

Now the original mismatch is easier to explain. The pod object showed the request, while the node had the physical assignment.

To find the assigned GPU UUID for gpu-inspect, I query the checkpoint file by pod UID:

POD_UID=$(kubectl get pod gpu-inspect -o jsonpath='{.metadata.uid}')

minikube ssh -- "sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint" \
  | jq --arg uid "$POD_UID" '.Data.PodDeviceEntries[] | select(.PodUID==$uid) | {PodUID, DeviceIDs, ResourceName}'

{
  "PodUID": "60921126-5b78-4e9b-867f-371ba9a65349",
  "DeviceIDs": {
    "-1": [
      "GPU-11ba558d-3437-2335-2fd8-a78c8502f87f"
    ]
  },
  "ResourceName": "nvidia.com/gpu"
}

This is the physical GPU UUID that was missing from the pod object.

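The checkpoint is not the only node-local view. Kubelet also exposes the same mapping through the PodResources gRPC API on its own socket, /var/lib/kubelet/pod-resources/kubelet.sock, so I query that too. The command assumes grpcurl is available on the node and that the PodResources api.proto has been copied to /tmp: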

minikube ssh -- "sudo grpcurl -plaintext -unix \
  -import-path /tmp -proto api.proto \
  /var/lib/kubelet/pod-resources/kubelet.sock v1.PodResourcesLister/List" \
  | jq '.podResources[] | select(.name=="gpu-inspect") | .containers[0].devices'

[
  {
    "resourceName": "nvidia.com/gpu",
    "deviceIds": [
      "GPU-11ba558d-3437-2335-2fd8-a78c8502f87f"
    ]
  }
]

Same UUID again. The checkpoint file, the PodResources API, and the container environment all point to GPU-11ba558d-3437-2335-2fd8-a78c8502f87f.

Key Takeaways

  • A running GPU pod can still look generic from the Kubernetes API. In this lab, kubectl get pod before-cuda-demo -o json showed only "nvidia.com/gpu": "1" in the container resources, while the physical GPU UUID did not appear anywhere in the pod spec.

  • The actual GPU assignment showed up in node-local kubelet data rather than the pod object. For gpu-inspect, the checkpoint file and the PodResources gRPC API both reported GPU-11ba558d-3437-2335-2fd8-a78c8502f87f, while the pod object still only described a one-GPU request.

  • The NVIDIA device plugin allocation response was enough to connect the pod request to CUDA-visible hardware. In this run, decoding the checkpoint’s serialized AllocateResponse showed NVIDIA_VISIBLE_DEVICES=GPU-11ba558d-3437-2335-2fd8-a78c8502f87f, and the running container had /dev/nvidia0, /dev/nvidiactl, NVIDIA driver binaries, shared libraries, firmware files, and proc entries mounted in.

  • Time-slicing changed the schedulable GPU count, not the physical GPU each container saw. After configuring three replicas, node capacity changed from "nvidia.com/gpu": "1" to "3", but ts-pod-1, ts-pod-2, and ts-pod-3 all printed the same NVIDIA_VISIBLE_DEVICES value: GPU-11ba558d-3437-2335-2fd8-a78c8502f87f.

Until next time!

G.
