Using NVIDIA GPU worker nodes in a Kubernetes Engine cluster
This document explains how to create GPU nodes in a Kubernetes cluster created with Kubernetes Engine (IKE) and shows two ways to use GPUs in workloads.
- Estimated time: 15 minutes
- Recommended operating systems: MacOS, Ubuntu
- Region: kr-central-2
- Prerequisites: Helm (see Prework below)
Prework
Install Helm, the package manager that helps manage Kubernetes packages.
Helm allows you to search for, share, and use software built for Kubernetes. Install Helm by entering the following commands on your local machine.

Mac
- Install Helm using the Homebrew package manager.
brew install helm
- Verify the installation.
helm version
# version.BuildInfo{Version:"v3.14.0", GitCommit:"3fc9f4b2638e76f26739cd77c7017139be81d0ea", GitTreeState:"clean", GoVersion:"go1.21.6"}

Linux (Ubuntu)
- Install Helm using the curl command.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
- Verify the installation.
helm version
# version.BuildInfo{Version:"v3.14.0", GitCommit:"3fc9f4b2638e76f26739cd77c7017139be81d0ea", GitTreeState:"clean", GoVersion:"go1.21.6"}
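The steps that follow also assume kubectl is installed and configured with the kubeconfig for your cluster. A quick sanity check before proceeding (the context name is environment-specific):
kubectl config current-context
kubectl get nodes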
Type 1. Build NVIDIA GPU worker nodes
Type 1 explains how to build NVIDIA GPU worker nodes in a Kubernetes cluster.
Step 1. Create an NVIDIA GPU node pool in the Kubernetes cluster
This document does not cover the process of creating a Kubernetes cluster. You can create a new cluster or add a GPU node pool to an existing cluster.
- Access the KakaoCloud Console.
- In the Kubernetes Engine > Cluster list, select the cluster where you want to work or click the [Create cluster] button.
  - If you select a cluster, go to the cluster's details page, click the Node Pool tab, then click the [Create node pool] button, enter the node pool information, and create the node pool.
  - If you click the [Create cluster] button, follow the step-by-step process to create the cluster. In Step 2: Node Pool Settings, refer to the following node pool information.

| Node Pool Type | Instance Type | Number of Nodes | Volume (GB) |
| --- | --- | --- | --- |
| GPU | p2i.6xlarge | 1 | 100 |

- When creating a GPU node, include the following command in the user script to ensure the node is created as a single GPU instance (MIG disabled).
sudo nvidia-smi -mig 0
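If you want to confirm that the user script took effect, one option is to query the MIG mode directly on the node. This is a minimal sketch, assuming you have SSH access to the GPU node; the output shown is illustrative.
nvidia-smi --query-gpu=mig.mode.current --format=csv
# mig.mode.current
# Disabled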
Step 2. Configure the NVIDIA GPU worker node environment
The nvidia-k8s-device-plugin automatically exposes and manages GPU resources within the cluster, providing the functions needed to manage and run GPU workloads.
When you create a GPU worker node through the node pool, the environment required to use GPU resources, including the driver and container runtime, is set up by default.
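Optionally, you can confirm the preinstalled driver before installing the plugin. A minimal check, assuming SSH access to the worker node:
nvidia-smi
# Lists the node's GPUs along with the preinstalled driver and CUDA versions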
- To expose GPUs in the Kubernetes cluster, use Helm to add the nvidia-k8s-device-plugin to the cluster components.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
  --version=0.12.3 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --generate-name \
  nvdp/nvidia-device-plugin
- Verify that the nvidia-k8s-device-plugin has been added to the cluster.
kubectl get all -A | grep nvidia
# kube-system pod/nvidia-device-plugin-daemonset-dpp4b 1/1 Running 0 64m
# kube-system pod/nvidia-device-plugin-daemonset-qprc7 1/1 Running 0 64m
# kube-system daemonset.apps/nvidia-device-plugin-daemonset 2 2 2 2 2 <none> 64m
- Use the following command to display detailed information about node resources and verify that GPU resources have been added.
kubectl describe nodes | grep nvidia
# nvidia.com/gpu: 1
# nvidia.com/gpu: 1
# ...
- Add a test pod to the Kubernetes cluster to verify GPU functionality.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda10.2"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
- Verify that the task has been completed using the following command.
kubectl get pod/gpu-test
# NAME       READY   STATUS      RESTARTS   AGE
# gpu-test   0/1     Completed   0          2m
- You can check the task execution logs using the following command.
kubectl logs gpu-test
# GPU 0: NVIDIA A100 80GB PCIe ...
# [Vector addition of 50000 elements]
# Copy input data from the host memory to the CUDA device
# CUDA kernel launch with 196 blocks of 256 threads
# Copy output data from the CUDA device to the host memory
# Test PASSED
# Done
- Remove the created test pod.
kubectl delete pod/gpu-test
# pod "gpu-test" deleted
Type 2. Build MIG-enabled NVIDIA GPU worker nodes
Type 2 explains one way to create GPU worker nodes in a Kubernetes cluster and use MIG (Multi-Instance GPU). Some GPUs provided by KakaoCloud (e.g., A100) support MIG. You can create Kubernetes cluster nodes with GPU support and apply a MIG strategy to use resources more efficiently.
You can find the instance types that offer this GPU in the GPU instance documentation.
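The available MIG profiles and their IDs depend on the GPU model. As a reference for the profile ID used later in this guide, you can list the supported profiles on a MIG-enabled node. A sketch, assuming SSH access and MIG mode already enabled:
sudo nvidia-smi mig -lgip
# Lists the GPU instance profiles the GPU supports; on an A100 80GB,
# profile ID 19 corresponds to the 1g.10gb profile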
Step 1. Add GPU worker nodes to the Kubernetes cluster
This document does not cover the process of creating a Kubernetes cluster. You can create a new cluster or add a GPU node pool to an existing cluster.
- Access the KakaoCloud Console.
- In the Kubernetes Engine > Cluster list, select the cluster where you want to work or click the [Create cluster] button.
  - If you select a cluster, go to the cluster's details page, click the Node Pool tab, and click the [Create node pool] button to enter the node pool information.
  - If you click the [Create cluster] button, follow the step-by-step process to enter the information. Refer to the following settings in Step 2: Node Pool Settings of the cluster creation process.

| Node Pool Type | Instance Type | Number of Nodes | Volume (GB) |
| --- | --- | --- | --- |
| GPU | p2i.6xlarge | 1 | 100 |

- When creating the GPU nodes, add a user script that enables MIG during node provisioning and partitions the GPU resources into MIG instances. The following script divides the GPU into 7 instances and creates the corresponding compute instances; you can verify the result with the sketch after this list.
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
- Create the node pool.
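To confirm that the provisioning script partitioned the GPU as intended, you can inspect the node once it is up. A minimal sketch, assuming SSH access to the GPU node:
sudo nvidia-smi mig -lgi
# Should list seven 1g.10gb GPU instances
sudo nvidia-smi mig -lci
# Should list the compute instances created by the -C flag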
Step 2. Configure the NVIDIA GPU worker node environment
Through Step 1, a GPU node with MIG enabled has been added to the cluster.
- To expose the GPU node to the Kubernetes cluster, use Helm to add the nvidia-k8s-device-plugin to the Kubernetes components.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
- You can use the migStrategy flag in the command to enable the use of MIG resources in the cluster. For more information, refer to the NVIDIA/k8s-device-plugin GitHub.
helm install \
  --version=0.12.3 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set compatWithCPUManager=true \
  --set migStrategy=mixed \
  --generate-name \
  nvdp/nvidia-device-plugin
- When you run the Helm command, the nvidia-k8s-device-plugin pod is deployed to the GPU worker node, exposing the GPU to the cluster. Use the following command to display detailed information about node resources and verify that 7 MIG instances have been added.
kubectl describe nodes | grep nvidia
# nvidia.com/mig-1g.10gb: 7
# nvidia.com/mig-1g.10gb: 7
- Deploy multiple workloads to the cluster, each of which prints the GPU it runs on. This helps verify how the cluster schedules MIG resources.
for i in $(seq 7); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-example-${i}
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.2.2-base-ubuntu20.04
      command: ["nvidia-smi"]
      args: ["-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
done
# pod/mig-example-1 created
# pod/mig-example-2 created
# pod/mig-example-3 created
# pod/mig-example-4 created
# pod/mig-example-5 created
# pod/mig-example-6 created
# pod/mig-example-7 created
- Check the created pods and verify that the tasks have been completed as expected using the following command.
kubectl get pods
# NAME            READY   STATUS      RESTARTS   AGE
# mig-example-1   0/1     Completed   0          29s
# mig-example-2   0/1     Completed   0          26s
# mig-example-3   0/1     Completed   0          22s
# mig-example-4   0/1     Completed   0          19s
# mig-example-5   0/1     Completed   0          15s
# mig-example-6   0/1     Completed   0          12s
# mig-example-7   0/1     Completed   0          9s
- Once the tasks are completed, you can check the UUID of the MIG instance that ran each pod through the logs. Enter the following command to view the results.
for i in $(seq 7); do
kubectl logs mig-example-${i} | grep MIG
done
# MIG 1g.10gb Device 0: (UUID: MIG-aaaaa)
# MIG 1g.10gb Device 0: (UUID: MIG-bbbbb)
# MIG 1g.10gb Device 0: (UUID: MIG-ccccc)
# MIG 1g.10gb Device 0: (UUID: MIG-aaaaa)
# MIG 1g.10gb Device 0: (UUID: MIG-bbbbb)
# MIG 1g.10gb Device 0: (UUID: MIG-ddddd)
# MIG 1g.10gb Device 0: (UUID: MIG-eeeee)
- Remove the example pods that were created.
for i in $(seq 7); do
kubectl delete pod mig-example-${i}
done
# pod "mig-example-1" deleted
# pod "mig-example-2" deleted
# pod "mig-example-3" deleted
# pod "mig-example-4" deleted
# pod "mig-example-5" deleted
# pod "mig-example-6" deleted
# pod "mig-example-7" deleted