Configure NVIDIA GPU monitoring environment in K8S cluster
This document explains how to set up an NVIDIA GPU monitoring environment in a Kubernetes cluster. Users can follow this guide to create an NVIDIA GPU worker node pool in a Kubernetes cluster created with Kubernetes Engine and set up a monitoring environment.
- Estimated time: 20 minutes
- Recommended operating system: macOS, Ubuntu
- Prerequisites
  - Helm installed on the local environment
- Reference documentation
Getting started
Step 1. Create GPU node pool in Kubernetes cluster
- In a newly created or existing Kubernetes cluster, create a worker node pool with NVIDIA GPUs, using the settings in the table below:
| Node pool type | Instance type | Quantity | Disk(GB) |
|---|---|---|---|
| GPU | p2i.6xlarge | 1 | 100 |
- To expose GPUs in the Kubernetes cluster, use Helm to add the nvidia-k8s-device-plugin to the cluster. Open a terminal in your local environment and run the following Helm commands:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
--version=0.12.3 \
--namespace nvidia-device-plugin \
--create-namespace \
--generate-name \
nvdp/nvidia-device-plugin
# NAME: nvidia-device-plugin-*
# LAST DEPLOYED: Fri Dec 9 16:26:48 2022
# NAMESPACE: nvidia-device-plugin
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
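Once the device plugin pods are running, each GPU node should advertise an nvidia.com/gpu resource. The following is a minimal sketch for confirming this, assuming kubectl is already configured for this cluster; the helper names list_gpu_nodes and has_gpu_node are hypothetical and not part of any tool.

```shell
# Print every node with the GPU count it reports as allocatable.
# Nodes that do not advertise the resource show "<none>" in the GPU column.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Read that listing on stdin; succeed only if some node reports at least one GPU.
has_gpu_node() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit !found }'
}

# Run the check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  if list_gpu_nodes | has_gpu_node; then
    echo "At least one node exposes nvidia.com/gpu"
  else
    echo "No GPU resources found yet; the device plugin pods may still be starting"
  fi
fi
```

If no node reports a GPU, checking the pods in the nvidia-device-plugin namespace is a reasonable next step.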
Step 2. Set up monitoring environment
Helm makes it straightforward to create and manage a Prometheus stack in the Kubernetes cluster. Install the kube-prometheus-stack chart to deploy Prometheus, Grafana, and the other monitoring components in the cluster.
- Create a directory for the example task in your local environment's default download path.
mkdir ~/Downloads/ike-gpu-monitor
cd ~/Downloads/ike-gpu-monitor
- Download the pre-configured prom-values.yaml file for kube-prometheus-stack.
curl -O https://raw.githubusercontent.com/kakaoenterprise/kc-handson-config/k8s-gpu-monitor/prom-values.yaml
- Install the kube-prometheus-stack using the downloaded prom-values.yaml file.
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prom-values.yaml -n kube-system
# NAME: prometheus
# LAST DEPLOYED: Fri Dec 9 16:29:51 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# NOTES:
# kube-prometheus-stack has been installed. Check its status by running:
# kubectl --namespace kube-system get pods -l "release=prometheus"
- Finally, install dcgm-exporter, which provides GPU information and status metrics.
helm repo add gpu-helm-charts \
https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter -n kube-system
# NAME: dcgm-exporter-*
# LAST DEPLOYED: Fri Dec 9 16:31:41 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
# NOTES:
# 1. Get the application URL by running these commands:
# export POD_NAME=$(kubectl get pods -n kube-system -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1670571097" -o jsonpath="{.items[0].metadata.name}")
# kubectl -n kube-system port-forward $POD_NAME 8080:9400 &
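Before moving on to Grafana, it can help to confirm that dcgm-exporter is actually serving GPU metrics. The sketch below assumes the port-forward from the chart notes above is running (local port 8080 to the exporter's port 9400) and that DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are part of the exporter's enabled metric set; filter_gpu_metrics is a hypothetical helper name.

```shell
# Keep only GPU utilization and temperature samples from a Prometheus-format
# metrics stream, dropping "# HELP" and "# TYPE" comment lines.
filter_gpu_metrics() {
  grep -E '^DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP)[{ ]'
}

# Fetch the exporter's metrics through the local port-forward and filter them.
curl -s --max-time 5 http://localhost:8080/metrics | filter_gpu_metrics ||
  echo "No GPU metrics found (is the port-forward running?)"
```

Each matching line carries one sample per GPU, with labels identifying the device.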
Step 3. Access Grafana dashboard
Perform port forwarding to access the Grafana dashboard.
- Use the following kubectl command to forward port 30080 on your local environment to the Grafana dashboard endpoint:
kubectl port-forward svc/prometheus-grafana -n kube-system 30080:80 &
- Open a browser on your local environment and navigate to http://localhost:30080. If the installation succeeded, the Grafana login page appears. The open command below is for macOS; on Ubuntu, xdg-open serves the same purpose.
open http://localhost:30080
Use the default Grafana credentials to log in:
| Key | Value |
|---|---|
| username | admin |
| password | prom-operator |
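The login can also be checked from the terminal. This is a minimal sketch, assuming the port-forward on local port 30080 is running and that basic authentication is enabled in Grafana (its default); it calls Grafana's /api/org endpoint, and grafana_login_ok is a hypothetical helper name.

```shell
# Succeed only if Grafana accepts the given credentials.
# $1 = Grafana base URL, $2 = username, $3 = password
grafana_login_ok() {
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -u "$2:$3" "$1/api/org")
  [ "$status" = "200" ]
}

if grafana_login_ok http://localhost:30080 admin prom-operator; then
  echo "Grafana login succeeded"
else
  echo "Grafana login failed (is the port-forward running?)"
fi
```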
Step 4. Create Grafana data source
Add a Prometheus data source in Grafana with the following values so that Grafana can query the metrics collected by prometheus-server.
| Key | Data |
|---|---|
| Name | k8se-tutorial |
| URL | http://prometheus-kube-prometheus-prometheus:9090 |
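The same data source can be created without the UI through Grafana's HTTP API. The sketch below is one possible approach, not the tutorial's prescribed method: it assumes the port-forward on local port 30080 and the default credentials from Step 3, and the "access":"proxy" setting (Grafana's server-side access mode) is an assumption added here. The helper names datasource_payload and create_datasource are hypothetical.

```shell
# Emit the JSON body for a Prometheus data source.
# $1 = data source name, $2 = Prometheus URL as seen from inside the cluster
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","access":"proxy","url":"%s"}\n' "$1" "$2"
}

# POST the payload to Grafana. $1 = Grafana base URL, $2 = user:password
create_datasource() {
  datasource_payload "k8se-tutorial" "http://prometheus-kube-prometheus-prometheus:9090" |
    curl -s --max-time 5 -u "$2" \
      -H 'Content-Type: application/json' \
      -X POST --data @- "$1/api/datasources"
}

create_datasource http://localhost:30080 admin:prom-operator ||
  echo "Could not reach Grafana (is the port-forward running?)"
```

Grafana answers with a JSON message either way, so the response body is worth reading to see whether the data source was created or already existed.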
Step 5. Apply and check the dashboard
Import the NVIDIA-DCGM-Exporter-Dashboard to visualize GPU metrics.
Enter the Dashboard ID (12239) in the Import Dashboard section and connect it to the previously created data source.
The dashboard will display metrics such as temperature and GPU utilization.
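If the dashboard panels stay empty, one way to narrow down the cause is to ask Prometheus directly whether it has scraped any GPU samples. The sketch below assumes Prometheus is reachable on local port 9090 (the service name in the comment is inferred from the data source URL in Step 4) and that DCGM_FI_DEV_GPU_TEMP is among the exported metrics; has_samples is a hypothetical helper name.

```shell
# Read a Prometheus /api/v1/query response on stdin; succeed only if the
# result vector contains at least one sample.
has_samples() {
  grep -Eq '"result"[[:space:]]*:[[:space:]]*\[[[:space:]]*\{'
}

# Assumes Prometheus is reachable on local port 9090, for example via:
#   kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n kube-system 9090:9090 &
if curl -s --max-time 5 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_TEMP' | has_samples; then
  echo "Prometheus has GPU temperature samples"
else
  echo "No GPU samples in Prometheus yet"
fi
```

An empty result here points at the exporter or the scrape configuration; a non-empty result points at the Grafana data source or dashboard settings.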