Using NVIDIA GPU worker nodes in a Kubernetes Engine cluster
This guide explains two methods for creating GPU nodes in a cluster created with Kubernetes Engine and for using GPUs in workloads.
- Estimated time required: 15 minutes
- User environment
  - Recommended operating systems: macOS, Ubuntu
  - Region: kr-central-2
- Prerequisites
Before you start
Install Helm, the package manager for Kubernetes. With Helm, you can search, share, and use Kubernetes software efficiently. Install Helm on your local machine using the commands below.
Mac

1. Install using the Homebrew package manager.

brew install helm

2. Verify the installation with the following command.

helm version
# version.BuildInfo{Version:"v3.14.0", GitCommit:"3fc9f4b2638e76f26739cd77c7017139be81d0ea", GitTreeState:"clean", GoVersion:"go1.21.6"}

Linux (Ubuntu)

1. Install using the curl command.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

2. Verify the installation with the following command.

helm version
# version.BuildInfo{Version:"v3.14.0", GitCommit:"3fc9f4b2638e76f26739cd77c7017139be81d0ea", GitTreeState:"clean", GoVersion:"go1.21.6"}
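In addition to Helm, the steps below assume that the kubectl CLI is installed and that your kubeconfig points at the target Kubernetes Engine cluster; installing and configuring kubectl is not covered in this guide. A quick sanity check:

kubectl version --client
kubectl config current-context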
Getting started
The following sections describe two methods for creating GPU nodes in a Kubernetes Engine cluster and using GPUs in workloads.
Type 1. Build NVIDIA GPU worker nodes
Type 1 explains how to build NVIDIA GPU worker nodes in a Kubernetes cluster.
Step 1. Create an NVIDIA GPU node pool in the Kubernetes cluster
This document does not cover how to create a Kubernetes cluster. Create a new cluster or add a GPU node pool to an existing cluster.
Follow these steps in the KakaoCloud console to create a Kubernetes node pool:
1. Access the KakaoCloud console.
2. Go to Kubernetes Engine > Cluster list and either select a cluster or click the [Create cluster] button.
   - If you selected a cluster, go to the Node pool tab on the cluster details page, click the [Create node pool] button, enter the node pool information, and create the node pool.
   - If you clicked [Create cluster], follow the steps in Cluster creation. In Step 2: Node pool settings, enter the node pool information as shown below:
     - Node pool type: GPU
     - Instance type: p2i.6xlarge
     - Node count: 1
     - Volume (GB): 100
3. To use the GPU as a single (non-MIG) GPU instance, include the following script in the user script section:
sudo nvidia-smi -mig 0
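Once provisioning completes, you can confirm that the new GPU node has joined the cluster. This check is optional and assumes your kubeconfig already points at this cluster:

kubectl get nodes -o wide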
Step 2. Configure the NVIDIA GPU worker node environment
The nvidia-k8s-device-plugin automatically exposes and manages GPU resources in the cluster so that Pods can request them.
When GPU worker nodes are created through a node pool, the environment required for GPU resources, such as the NVIDIA driver and container runtime, is configured by default.
1. Use Helm to add nvidia-k8s-device-plugin and expose GPU resources to the Kubernetes cluster:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
--version=0.12.3 \
--namespace nvidia-device-plugin \
--create-namespace \
--generate-name \
nvdp/nvidia-device-plugin
2. Verify that nvidia-k8s-device-plugin has been added to the cluster:
kubectl get all -A | grep nvidia
# nvidia-device-plugin pod/nvidia-device-plugin-daemonset-dpp4b 1/1 Running 0 64m
# nvidia-device-plugin pod/nvidia-device-plugin-daemonset-qprc7 1/1 Running 0 64m
# nvidia-device-plugin daemonset.apps/nvidia-device-plugin-daemonset 2 2 2 2 2 <none> 64m
3. Check the node resources to confirm the GPU resource has been added:
kubectl describe nodes | grep nvidia
# nvidia.com/gpu: 1
# nvidia.com/gpu: 1
# ...
4. Add a GPU test Pod to the Kubernetes cluster:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda10.2"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
5. Verify the test Pod's status:
kubectl get pod/gpu-test
# NAME READY STATUS RESTARTS AGE
# gpu-test 0/1 Completed 0 2m
6. Check the logs of the test Pod:
kubectl logs gpu-test
# GPU 0: NVIDIA A100 80GB PCIe ...
# [Vector addition of 50000 elements]
# Copy input data from the host memory to the CUDA device
# CUDA kernel launch with 196 blocks of 256 threads
# Copy output data from the CUDA device to the host memory
# Test PASSED
# Done
7. Delete the test Pod:
kubectl delete pod/gpu-test
# pod "gpu-test" deleted
Type 2. Build MIG-enabled NVIDIA GPU worker nodes
Type 2 explains how to create GPU worker nodes in a Kubernetes cluster and use MIG.
Some GPUs provided by KakaoCloud (e.g., A100) support MIG (Multi-Instance GPU). Users can create Kubernetes cluster nodes as GPU-enabled types and use MIG strategies to utilize resources more efficiently.
You can check which GPUs are available in the Instance specifications by type.
Step 1. Add GPU worker nodes to the Kubernetes cluster
This document does not cover how to create a Kubernetes cluster. Create a new cluster or add a GPU node pool to an existing cluster.
1. Access the KakaoCloud console.
2. Go to Kubernetes Engine > Cluster list and either select a cluster or click the [Create cluster] button.
   - If you selected a cluster, go to the Node pool tab on the cluster details page, click the [Create node pool] button, and enter the node pool information.
   - If you clicked [Create cluster], follow the steps in Cluster creation. In Step 2: Node pool settings, enter the node pool information as shown below:
     - Node pool type: GPU
     - Instance type: p2i.6xlarge
     - Node count: 1
     - Volume (GB): 100
3. To enable MIG during GPU node provisioning and split the GPU into seven instances, include the following script in the user script section. Profile ID 19 corresponds to the 1g.10gb profile on the A100 80GB, so this creates seven 1g.10gb GPU instances along with their compute instances (-C). An optional check is shown after these steps.
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
4. Create the node pool.
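If you have shell access to the GPU node (for example, over SSH), you can optionally confirm the MIG layout with nvidia-smi; this check is not part of the required steps:

sudo nvidia-smi mig -lgi   # list the GPU instances that were created
sudo nvidia-smi -L         # list the GPU and its MIG devices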
Step 2. Configure the NVIDIA GPU worker node environment
After completing Step 1, MIG-enabled GPU nodes have been added to the cluster.
1. Use Helm to add the nvidia-k8s-device-plugin chart repository:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
2. Install the plugin with the migStrategy flag set to expose MIG resources in the cluster; the mixed strategy exposes each MIG profile as its own resource name (for example, nvidia.com/mig-1g.10gb). For more information, refer to the NVIDIA/k8s-device-plugin GitHub repository.
helm install \
--version=0.12.3 \
--namespace nvidia-device-plugin \
--create-namespace \
--set compatWithCPUManager=true \
--set migStrategy=mixed \
--generate-name \
nvdp/nvidia-device-plugin
3. Verify that the nvidia-k8s-device-plugin pod is running and that MIG resources are added to the cluster:
kubectl describe nodes | grep nvidia
# nvidia.com/mig-1g.10gb: 7
# nvidia.com/mig-1g.10gb: 7
4. Deploy multiple workloads on the cluster to test the GPU configuration and verify resource scheduling:
for i in $(seq 7); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-example-${i}
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.2.2-base-ubuntu20.04
      command: ["nvidia-smi"]
      args: ["-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
done
# pod/mig-example-1 created
# pod/mig-example-2 created
# pod/mig-example-3 created
# pod/mig-example-4 created
# pod/mig-example-5 created
# pod/mig-example-6 created
# pod/mig-example-7 created
5. Verify the status of the created pods:
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# mig-example-1 0/1 Completed 0 29s
# mig-example-2 0/1 Completed 0 26s
# mig-example-3 0/1 Completed 0 22s
# mig-example-4 0/1 Completed 0 19s
# mig-example-5 0/1 Completed 0 15s
# mig-example-6 0/1 Completed 0 12s
# mig-example-7 0/1 Completed 0 9s
6. Check the logs to verify the MIG UUID assigned to each pod:
for i in $(seq 7); do
kubectl logs mig-example-${i} | grep MIG
done
# MIG 1g.10gb Device 0: (UUID: MIG-aaaaa)
# MIG 1g.10gb Device 0: (UUID: MIG-bbbbb)
# MIG 1g.10gb Device 0: (UUID: MIG-ccccc)
7. Delete the example pods:
for i in $(seq 7); do
kubectl delete pod mig-example-${i}
done
# pod "mig-example-1" deleted
# pod "mig-example-2" deleted
# pod "mig-example-3" deleted