Configure NVIDIA GPU monitoring environment in K8S cluster

This document explains how to set up an NVIDIA GPU monitoring environment in a Kubernetes cluster. Users can follow this guide to create an NVIDIA GPU worker node pool in a Kubernetes cluster created with Kubernetes Engine and set up a monitoring environment.

info

- Estimated time: 20 minutes
- Recommended operating system: macOS, Ubuntu
- Prerequisites
  - Helm installed in the local environment
  - kubectl installed and configured to access the cluster
- Reference documentation
  - Use NVIDIA GPU worker node in Kubernetes Engine cluster

Getting started

Step 1. Create GPU node pool in Kubernetes cluster

Create a worker node pool with NVIDIA GPUs attached to a newly created or existing Kubernetes cluster, referring to the table below:

| Node pool type | Instance type | Quantity | Disk (GB) |
| --- | --- | --- | --- |
| GPU | p2i.6xlarge | 1 | 100 |

To expose GPUs to the Kubernetes cluster, use Helm to install the NVIDIA device plugin (the nvidia-device-plugin chart from the k8s-device-plugin repository). Open a terminal in your local environment and run the following Helm commands:

```
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
    --version=0.12.3 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --generate-name \
    nvdp/nvidia-device-plugin
# NAME: nvidia-device-plugin-*
# LAST DEPLOYED: Fri Dec  9 16:26:48 2022
# NAMESPACE: nvidia-device-plugin
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
```
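
To confirm that the device plugin is advertising GPUs, one option is to list each node's allocatable `nvidia.com/gpu` count. The following is a minimal sketch rather than part of the original procedure: `nodes_with_gpu` is a hypothetical helper name, and the script assumes kubectl is already configured for the cluster.

```shell
# Print the names of nodes whose GPU column is a positive integer.
# Input: the two-column table (header + rows) produced by the kubectl command below.
nodes_with_gpu() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Run against the cluster only when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | nodes_with_gpu
fi
```

Nodes without GPUs show `<none>` in the GPU column and are filtered out. If the GPU worker node is not listed, the device plugin pod may still be starting.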

Step 2. Set up monitoring environment

Helm makes it convenient to create and manage a Prometheus stack in the Kubernetes cluster. Installing the kube-prometheus-stack chart deploys Prometheus, Grafana, and related monitoring components to the cluster.

Create a directory for the example task in your local environment's default download path.
```
mkdir ~/Downloads/ike-gpu-monitor
cd ~/Downloads/ike-gpu-monitor
```

Download the pre-configured prom-values.yaml file for kube-prometheus-stack.

```
curl -O https://raw.githubusercontent.com/kakaoenterprise/kc-handson-config/k8s-gpu-monitor/prom-values.yaml
```

Install the kube-prometheus-stack using the downloaded prom-values.yaml file.

```
helm repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prom-values.yaml -n kube-system
# NAME: prometheus
# LAST DEPLOYED: Fri Dec  9 16:29:51 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# NOTES:
# kube-prometheus-stack has been installed. Check its status by running:
#   kubectl --namespace kube-system get pods -l "release=prometheus"
```
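
Instead of polling the pod list by hand, it may be convenient to block until the release's pods report Ready. The sketch below wraps `kubectl wait` with the same `release=prometheus` label selector that appears in the chart notes; `wait_for_release_pods` is a hypothetical helper name, and the 300-second default timeout is an arbitrary assumption.

```shell
# Block until every pod carrying the release label is Ready.
# $1: namespace, $2: Helm release name, $3: timeout (optional, default 300s)
wait_for_release_pods() {
  kubectl --namespace "$1" wait --for=condition=Ready pods \
    -l "release=$2" --timeout="${3:-300s}"
}
```

For this guide the call would be `wait_for_release_pods kube-system prometheus`. If the selector also matches pods that have already completed, the wait can time out; in that case, fall back to the `get pods` command from the chart notes.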

Finally, install dcgm-exporter, which provides GPU information and status metrics.

```
helm repo add gpu-helm-charts \
    https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter -n kube-system
# NAME: dcgm-exporter-*
# LAST DEPLOYED: Fri Dec  9 16:31:41 2022
# NAMESPACE: kube-system
# STATUS: deployed
# REVISION: 1
# TEST SUITE: None
# NOTES:
# 1. Get the application URL by running these commands:
#   export POD_NAME=$(kubectl get pods -n kube-system -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1670571097" -o jsonpath="{.items[0].metadata.name}")
#   kubectl -n kube-system port-forward $POD_NAME 8080:9400 &
```
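
To check that dcgm-exporter is actually producing GPU metrics, the exporter's metrics endpoint can be read directly once the port-forward from the chart notes is running. The helper below is a small sketch for picking one metric out of the Prometheus text format; `select_metric` is a hypothetical name, and `DCGM_FI_DEV_GPU_TEMP` is used on the assumption that the exporter runs with its default metric set.

```shell
# Print the sample lines for one metric from Prometheus text-format input.
# $1: metric name. Comment lines (# HELP / # TYPE) are skipped because
# they do not start with the metric name.
select_metric() {
  awk -v metric="$1" 'index($0, metric "{") == 1 || index($0, metric " ") == 1'
}
```

With local port 8080 forwarded to the exporter as shown in the notes, `curl -s http://localhost:8080/metrics | select_metric DCGM_FI_DEV_GPU_TEMP` should print one line per GPU.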

Step 3. Access Grafana dashboard

To access the Grafana dashboard, forward port 30080 in your local environment to the Grafana service with the following kubectl command:
```
kubectl port-forward svc/prometheus-grafana -n kube-system 30080:80 &
```
Open a browser in your local environment and navigate to http://localhost:30080. If the installation succeeded, the Grafana login page appears. The `open` command below is for macOS; on Ubuntu, use `xdg-open` instead.
```
open http://localhost:30080
```
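
If the login page does not load immediately, the port-forward or Grafana itself may still be starting. One way to check from the terminal is to poll Grafana's `/api/health` endpoint until it responds. This is a sketch under assumptions: `wait_for_http` is a hypothetical helper, and the attempt count and delay defaults are arbitrary.

```shell
# Poll a URL until it answers with a successful HTTP status.
# $1: URL, $2: max attempts (default 30), $3: seconds between attempts (default 2)
wait_for_http() {
  local url=$1 attempts=${2:-30} delay=${3:-2} i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error responses (4xx/5xx)
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_http http://localhost:30080/api/health && echo "Grafana is reachable"` returns once the forwarded port answers, or fails after the attempts run out.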
Use the default Grafana credentials to log in:

| Key | Value |
| --- | --- |
| username | admin |
| password | prom-operator |

Step 4. Create Grafana data source

Add a Prometheus data source in Grafana so that dashboards can query the metrics collected by the Prometheus server. Use the following values:

| Key | Data |
| --- | --- |
| Name | k8se-tutorial |
| URL | `http://prometheus-kube-prometheus-prometheus:9090` |

[Image: Create Data Source]
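
As an alternative to the UI, the same data source could be created through Grafana's HTTP API (`POST /api/datasources`). The sketch below is not part of the original procedure: the function names are hypothetical, `"access":"proxy"` is an assumed setting (Grafana's server-side access mode), and the name and URL are the values from the table in this step.

```shell
# Build the JSON body for a Prometheus data source.
# $1: data source name, $2: Prometheus URL as seen from inside the cluster
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","access":"proxy","url":"%s"}\n' "$1" "$2"
}

# Send the payload to Grafana (curl -d implies POST).
# $1: Grafana base URL, $2: credentials as user:password
create_datasource() {
  curl -fsS -u "$2" -H 'Content-Type: application/json' \
    -d "$(datasource_payload k8se-tutorial http://prometheus-kube-prometheus-prometheus:9090)" \
    "$1/api/datasources"
}
```

With the port-forward from Step 3 running, `create_datasource http://localhost:30080 admin:prom-operator` would submit the request using the default credentials. Grafana rejects the call if a data source with the same name already exists.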

Step 5. Apply and check the dashboard

Import the NVIDIA-DCGM-Exporter-Dashboard to visualize GPU metrics.

Enter the Dashboard ID (12239) in the Import Dashboard section and connect it to the previously created data source.

[Image: Create Dashboard]

The dashboard will display metrics such as temperature and GPU utilization.
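
If the panels stay empty, it can help to query Prometheus directly and see whether the GPU metrics reached it at all. The following sketch uses the Prometheus HTTP API (`/api/v1/query`); `prom_query` is a hypothetical helper, and it assumes that dcgm-exporter is being scraped and that the Prometheus service named in Step 4 lives in the `kube-system` namespace.

```shell
# Run an instant PromQL query against a Prometheus server.
# $1: Prometheus base URL, $2: PromQL expression
prom_query() {
  # -G sends the URL-encoded data as a query string instead of a request body
  curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
}
```

For example, after `kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n kube-system 9090:9090 &`, running `prom_query http://localhost:9090 'avg(DCGM_FI_DEV_GPU_UTIL)'` returns a JSON document. An empty `result` array suggests that Prometheus is not scraping the exporter.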
