Set up NVIDIA GPU monitoring environment in K8S cluster
This document explains how to set up an NVIDIA GPU monitoring environment in a Kubernetes cluster. Users can follow this guide to create an NVIDIA GPU worker node pool in a Kubernetes cluster created with Kubernetes Engine and set up a monitoring environment.
- Estimated time: 20 minutes
- User environment
  - Recommended operating system: macOS, Ubuntu
  - Region: kr-central-2
- Prerequisites
  - Helm installed on the local environment
- Reference documentation
Getting started
Step 1. Create GPU node pool in Kubernetes cluster
- Create a worker node pool with NVIDIA GPUs attached to a newly created or existing Kubernetes cluster, referring to the table below:
Node pool type | Instance type | Quantity | Disk (GB) |
---|---|---|---|
GPU | p2i.6xlarge | 1 | 100 |
- To expose GPUs in the Kubernetes cluster, use Helm to add the nvidia-k8s-device-plugin to the cluster. Open your terminal in the local environment and execute the following Helm commands:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
--version=0.12.3 \
--namespace nvidia-device-plugin \
--create-namespace \
--generate-name \
nvdp/nvidia-device-plugin
# NAME: nvidia-device-plugin-*
# LAST DEPLOYED: Fri Dec 9 16:26:48 2022
# NAMESPACE: nvidia-device-plugin
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
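Once the device plugin is running, each GPU worker node should advertise an `nvidia.com/gpu` resource. The following is a minimal sketch for checking this, assuming `kubectl` is already configured for the cluster. The function name `list_gpu_nodes` is illustrative and not part of the original guide.

```shell
# Hypothetical helper: print the name and allocatable GPU count of every
# node that advertises a non-zero nvidia.com/gpu resource.
# Nodes without GPUs show "<none>" in the GPU column and are filtered out.
list_gpu_nodes() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}
```

Calling `list_gpu_nodes` after the Helm install should list the nodes of the GPU node pool. An empty result suggests the device plugin pod is not ready yet or the node has no GPU attached.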
Step 2. Set up monitoring environment
You can use Helm to create and manage a Prometheus stack in the Kubernetes cluster. Installing the kube-prometheus-stack chart deploys Prometheus, the Grafana dashboard, and other monitoring tools in the cluster.
- Create a directory for the example task in your local environment's default download path.
mkdir ~/Downloads/ike-gpu-monitor
cd ~/Downloads/ike-gpu-monitor
- Download the pre-configured prom-values.yaml file for kube-prometheus-stack.
curl -O https://raw.githubusercontent.com/kakaoenterprise/kc-handson-config/k8s-gpu-monitor/prom-values.yaml
- Install the kube-prometheus-stack using the downloaded prom-values.yaml file.
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prom-values.yaml -n kube-system
# NAME: prometheus
# LAST DEPLOYED: Fri Dec 9 16:29:51 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# NOTES:
# kube-prometheus-stack has been installed. Check its status by running:
# kubectl --namespace kube-system get pods -l "release=prometheus"
- Finally, install dcgm-exporter, which provides GPU information and status metrics.
helm repo add gpu-helm-charts \
https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter -n kube-system
# NAME: dcgm-exporter-*
# LAST DEPLOYED: Fri Dec 9 16:31:41 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
# NOTES:
# 1. Get the application URL by running these commands:
# export POD_NAME=$(kubectl get pods -n kube-system -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1670571097" -o jsonpath="{.items[0].metadata.name}")
# kubectl -n kube-system port-forward $POD_NAME 8080:9400 &
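The Helm notes above forward local port 8080 to the exporter's port 9400. With that port-forward running, the exporter's `/metrics` endpoint can be checked directly before going to Grafana. The sketch below assumes the exporter's default metric names `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_GPU_TEMP`. The helper name `filter_gpu_metrics` is hypothetical.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and keep
# only the GPU utilization and GPU temperature samples (comment lines that
# start with "#" are dropped because they do not match the pattern).
filter_gpu_metrics() {
  grep -E '^DCGM_FI_DEV_GPU_(UTIL|TEMP)[{ ]'
}

# Example usage, assuming the port-forward from the Helm notes is active:
#   curl -s http://localhost:8080/metrics | filter_gpu_metrics
```

If the command prints no samples, the exporter pod may still be starting, or it may not be scheduled on a GPU node.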
Step 3. Access Grafana dashboard
Perform port forwarding to access the Grafana dashboard.
- Use the following kubectl command to forward port 30080 on your local environment to the Grafana dashboard endpoint:
kubectl port-forward svc/prometheus-grafana -n kube-system 30080:80 &
- Open a browser on your local environment and navigate to http://localhost:30080. If installed correctly, the Grafana login page will appear.
# macOS; on Ubuntu, use xdg-open instead of open
open http://localhost:30080
Use the default Grafana credentials to log in:
Key | Value |
---|---|
username | admin |
password | prom-operator |
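If the login page does not load, it can help to check whether Grafana is reachable through the port-forward at all. The sketch below polls Grafana's `/api/health` endpoint, which needs no login. The helper name `grafana_is_healthy` is hypothetical, and the default URL assumes the port-forward from this step.

```shell
# Hypothetical helper: succeed (exit status 0) when Grafana's /api/health
# endpoint reports "database": "ok". The first argument optionally overrides
# the base URL; it defaults to the locally forwarded port.
grafana_is_healthy() {
  curl -fsS "${1:-http://localhost:30080}/api/health" |
    grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Example usage:
#   grafana_is_healthy && echo "Grafana is up"
```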
Step 4. Create Grafana data source
Configure the Prometheus data source in Grafana with the following values so that it reads metrics from prometheus-server:
Key | Data |
---|---|
Name | k8se-tutorial |
URL | http://prometheus-kube-prometheus-prometheus:9090 |
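As an alternative to the Grafana UI, the same data source could be created through Grafana's HTTP API (`POST /api/datasources`). This is a sketch under assumptions, not the method this guide prescribes: it assumes the port-forward from Step 3 is still running and that the default credentials are unchanged. The helper names `datasource_payload` and `create_datasource` are hypothetical.

```shell
# Hypothetical helper: build the JSON body for POST /api/datasources
# from a data source name ($1) and a Prometheus URL ($2).
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy"}\n' "$1" "$2"
}

# Hypothetical helper: send the payload to Grafana through the local
# port-forward, authenticating with the default admin credentials.
create_datasource() {
  datasource_payload "$1" "$2" |
    curl -fsS -u admin:prom-operator \
      -H 'Content-Type: application/json' \
      -X POST --data @- \
      http://localhost:30080/api/datasources
}

# Example usage with the values from the table above:
#   create_datasource k8se-tutorial http://prometheus-kube-prometheus-prometheus:9090
```

Building the payload in a separate function keeps the JSON easy to inspect before anything is sent to Grafana.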
Step 5. Apply and check the dashboard
Import the NVIDIA-DCGM-Exporter-Dashboard to visualize GPU metrics.
Enter the Dashboard ID (12239) in the Import Dashboard section and connect it to the previously created data source.
The dashboard will display metrics such as temperature and GPU utilization.
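If the dashboard panels stay empty, one way to narrow down the cause is to ask Prometheus directly whether it has scraped any GPU samples. The sketch below uses the Prometheus HTTP API (`/api/v1/query`). It assumes a local port-forward to the Prometheus service named in the Step 4 URL and the exporter's default metric name `DCGM_FI_DEV_GPU_UTIL`. The helper name `query_prometheus` is hypothetical.

```shell
# Assumed port-forward (service name taken from the data source URL in Step 4):
#   kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n kube-system 9090:9090 &

# Hypothetical helper: run an instant PromQL query ($1) against Prometheus.
# The second argument optionally overrides the base URL.
query_prometheus() {
  curl -fsS -G --data-urlencode "query=$1" "${2:-http://localhost:9090}/api/v1/query"
}

# Example usage:
#   query_prometheus DCGM_FI_DEV_GPU_UTIL
```

A JSON response with `"status":"success"` and an empty `result` array suggests Prometheus is running but has not scraped dcgm-exporter yet, which points to the exporter or its scrape configuration rather than to Grafana.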