Setting up NVIDIA GPU monitoring in a Kubernetes cluster
This guide explains how to set up a monitoring environment for NVIDIA GPUs in a Kubernetes cluster. You will create an NVIDIA GPU worker node pool in a cluster created with Kubernetes Engine, then build a monitoring environment with Prometheus and Grafana.
- Estimated time: 20 minutes
- Recommended OS: macOS, Ubuntu
- Region: kr-central-2
- Prerequisites:
- Helm installed in your local environment
- Reference:
Step 1. Create GPU node pool in Kubernetes cluster
- Create a worker node pool with NVIDIA GPUs attached, in a new or existing Kubernetes cluster, using the following configuration:
Node pool type | Instance type | Quantity | Disk (GB) |
---|---|---|---|
GPU | p2i.6xlarge | 1 | 100 |
- To expose the GPUs to the Kubernetes cluster, install the NVIDIA k8s-device-plugin using Helm. Open your terminal and run the following Helm commands:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
--version=0.12.3 \
--namespace nvidia-device-plugin \
--create-namespace \
--generate-name \
nvdp/nvidia-device-plugin
# NAME: nvidia-device-plugin-*
# LAST DEPLOYED: Fri Dec 9 16:26:48 2022
# NAMESPACE: nvidia-device-plugin
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
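Once the device plugin pod is running, it can help to confirm that the node actually advertises its GPUs before moving on. The following is a minimal sketch, assuming kubectl is already configured for this cluster; the function name list_gpu_nodes is only illustrative. It relies on the nvidia.com/gpu resource name that the NVIDIA device plugin registers.

```shell
# List every node together with the number of GPUs it reports as allocatable.
# Nodes where the device plugin is not (yet) running show <none> in the GPU column.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Run list_gpu_nodes in your terminal. With the node pool from this step, the GPU node should report a non-empty GPU count once the plugin has registered.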
Step 2. Set up monitoring environment
Helm makes it easy to create and manage the Prometheus stack in your Kubernetes cluster. Install kube-prometheus-stack to deploy the Prometheus stack and Grafana dashboards.
- Create a working directory in your local environment for this example:
mkdir ~/Downloads/ike-gpu-monitor
cd ~/Downloads/ike-gpu-monitor
- Download the pre-configured values file, prom-values.yaml, for the kube-prometheus-stack:
curl -O https://raw.githubusercontent.com/kakaoenterprise/kc-handson-config/k8s-gpu-monitor/prom-values.yaml
- Install the kube-prometheus-stack with the downloaded prom-values.yaml:
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prom-values.yaml -n kube-system
# NAME: prometheus
# LAST DEPLOYED: Fri Dec 9 16:29:51 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# NOTES:
# Check its status by running:
# kubectl --namespace kube-system get pods -l "release=prometheus"
- Install dcgm-exporter to provide GPU metrics:
helm repo add gpu-helm-charts \
https://nvidia.github.io/dcgm-exporter/helm-charts
https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter -n kube-system
# NAME: dcgm-exporter-*
# LAST DEPLOYED: Fri Dec 9 16:31:41 2022
# NAMESPACE: kube-system
# STATUS: deployed
# NOTES:
# Get the application URL:
# export POD_NAME=$(kubectl get pods -n kube-system -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1670571097" -o jsonpath="{.items[0].metadata.name}")
# kubectl -n kube-system port-forward $POD_NAME 8080:9400 &
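The port-forward shown in the chart notes above can be used to check that dcgm-exporter is producing GPU metrics before Grafana is involved. The sketch below assumes the exporter is forwarded to local port 8080 as in those notes, and that the exporter's counter set includes DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP; the function names are hypothetical helpers, not part of any tool.

```shell
# Keep only the GPU utilization and temperature samples from
# Prometheus text-format metrics read on stdin (comment lines are dropped).
filter_gpu_metrics() {
  grep -E '^DCGM_FI_DEV_GPU_(UTIL|TEMP)[{ ]'
}

# Fetch metrics from a locally forwarded dcgm-exporter port.
# The port defaults to 8080 (an assumption taken from the chart notes).
show_gpu_metrics() {
  curl -s "http://localhost:${1:-8080}/metrics" | filter_gpu_metrics
}
```

With the port-forward running, show_gpu_metrics should print one line per GPU for each of the two metrics. Empty output suggests the exporter pod is not running on the GPU node or the forward is not active.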
Step 3. Access Grafana dashboard
To access the Grafana dashboard, forward a local port to the Grafana service.
- Run the following kubectl command to forward port 30080 in your local environment to port 80 of the prometheus-grafana service:
kubectl port-forward svc/prometheus-grafana -n kube-system 30080:80 &
- Open your browser and go to http://localhost:30080. If the installation succeeded, Grafana should load. On macOS, you can also open the page from the terminal:
open http://localhost:30080
Use the following credentials to log in to Grafana:
Key | Value |
---|---|
username | admin |
password | prom-operator |
- Once logged in, you will see the Grafana home screen.
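Because the port-forward runs in the background, the browser may open before the tunnel is ready. The following sketch polls Grafana's /api/health endpoint until it answers. The function name wait_for_grafana, the default of 10 attempts, and the one-second delay are illustrative assumptions.

```shell
# Poll Grafana's health endpoint until it responds or the attempts run out.
# $1: base URL (defaults to the forwarded port used in this step)
# $2: maximum number of attempts (assumed default: 10)
wait_for_grafana() {
  base_url="${1:-http://localhost:30080}"
  tries="${2:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return a non-zero status on HTTP errors.
    if curl -fsS "${base_url}/api/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, running wait_for_grafana && open http://localhost:30080 opens the page only after Grafana responds.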
Step 4. Create Grafana data source
prometheus-server, part of the Prometheus stack, collects node metrics from prometheus-node-exporter. Create a data source in Grafana to request metrics from prometheus-server. Use the following details to create the data source:
Key | Data |
---|---|
Name | k8se-tutorial |
URL | http://prometheus-kube-prometheus-prometheus:9090 |
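The same data source can also be created from the terminal through Grafana's HTTP API instead of the UI. This is a rough sketch, assuming the port-forward from Step 3 is still active and the admin credentials listed there are unchanged. The function names and the GRAFANA_URL variable are hypothetical, and the JSON is built with printf, so it only suits names and URLs without quotes or backslashes.

```shell
# Build the JSON body for a Prometheus data source.
# $1: data source name, $2: Prometheus URL as seen from inside the cluster
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy"}\n' "$1" "$2"
}

# POST the payload to Grafana's data source API.
# GRAFANA_URL is an assumed variable; it defaults to the forwarded port from Step 3.
create_datasource() {
  datasource_payload "$1" "$2" |
    curl -fsS -u admin:prom-operator \
      -H 'Content-Type: application/json' \
      -X POST --data @- \
      "${GRAFANA_URL:-http://localhost:30080}/api/datasources"
}
```

With the values from the table above, the call would be create_datasource k8se-tutorial http://prometheus-kube-prometheus-prometheus:9090.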
Step 5. Apply and check the dashboard
Create a dashboard to visualize the data source. Use the NVIDIA-DCGM-Exporter-Dashboard as an example to display dcgm-exporter metrics. Import it by entering its dashboard ID (12239), then select the data source created in Step 4 to create the GPU monitoring dashboard.
You can now monitor GPU metrics like temperature and utilization through the created dashboard.
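If the dashboard panels stay empty, querying Prometheus directly can show whether the GPU metrics are being scraped at all. The sketch below assumes you have forwarded the Prometheus service to local port 9090, for example with kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n kube-system 9090:9090. The PROM_URL variable and the function name are illustrative.

```shell
# Run an instant query against the Prometheus HTTP API.
# $1: metric or PromQL expression (defaults to GPU utilization)
# PROM_URL is an assumed variable pointing at the forwarded Prometheus port.
query_gpu_metric() {
  metric="${1:-DCGM_FI_DEV_GPU_UTIL}"
  curl -fsS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=${metric}"
}
```

A response whose result array is empty means Prometheus has no samples for that metric, which points to the exporter or its scrape configuration rather than to Grafana.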