
Set up NVIDIA GPU monitoring environment in K8S cluster

This document explains how to set up an NVIDIA GPU monitoring environment in a Kubernetes cluster. Follow this guide to create an NVIDIA GPU worker node pool in a cluster created with Kubernetes Engine, and then set up a monitoring environment for it.

Getting started

Step 1. Create GPU node pool in Kubernetes cluster

  1. Create a worker node pool with NVIDIA GPUs attached to a newly created or existing Kubernetes cluster, referring to the table below:

    Node pool type   Instance type   Quantity   Disk (GB)
    GPU              p2i.6xlarge     1          100
  2. To expose the GPUs to the Kubernetes cluster, use Helm to install the NVIDIA device plugin (the nvidia-device-plugin chart). Open a terminal in your local environment and run the following Helm commands:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    helm install \
    --version=0.12.3 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --generate-name \
    nvdp/nvidia-device-plugin
    # NAME: nvidia-device-plugin-*
    # LAST DEPLOYED: Fri Dec 9 16:26:48 2022
    # NAMESPACE: nvidia-device-plugin
    # STATUS: deployed
    # REVISION: 1
    # TEST SUITE: None
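
Once the device plugin is running, each GPU node should advertise the `nvidia.com/gpu` resource. The following is a minimal sketch for checking this, assuming `kubectl` is already configured for the cluster; `list_gpu_nodes` is a hypothetical helper name, not part of the original guide.

```shell
# Print every node together with the number of GPUs it reports as allocatable.
# The dot in "nvidia.com" is escaped so kubectl treats it as part of the key.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage:
#   list_gpu_nodes
```

Nodes in the GPU node pool should show a number in the GPU column. Nodes without GPUs typically appear as `<none>`.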

Step 2. Set up monitoring environment

You can use Helm to create and manage a Prometheus stack in the Kubernetes cluster. Installing the kube-prometheus-stack chart deploys Prometheus, Grafana, and related monitoring components to the cluster.

  1. Create a directory for the example task in your local environment's default download path.

    mkdir -p ~/Downloads/ike-gpu-monitor
    cd ~/Downloads/ike-gpu-monitor
  2. Download the pre-configured prom-values.yaml file for kube-prometheus-stack.

    curl -O https://raw.githubusercontent.com/kakaoenterprise/kc-handson-config/k8s-gpu-monitor/prom-values.yaml
  3. Install the kube-prometheus-stack using the downloaded prom-values.yaml file.

    helm repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/kube-prometheus-stack -f prom-values.yaml -n kube-system
    # NAME: prometheus
    # LAST DEPLOYED: Fri Dec 9 16:29:51 2022
    # NAMESPACE: kube-system
    # STATUS: deployed
    # REVISION: 1
    # NOTES:
    # kube-prometheus-stack has been installed. Check its status by running:
    # kubectl --namespace kube-system get pods -l "release=prometheus"
  4. Finally, install dcgm-exporter, which provides GPU information and status metrics.

    helm repo add gpu-helm-charts \
    https://nvidia.github.io/dcgm-exporter/helm-charts
    helm repo update
    helm install --generate-name gpu-helm-charts/dcgm-exporter -n kube-system
    # NAME: dcgm-exporter-*
    # LAST DEPLOYED: Fri Dec 9 16:31:41 2022
    # NAMESPACE: kube-system
    # STATUS: deployed
    # REVISION: 1
    # TEST SUITE: None
    # NOTES:
    # 1. Get the application URL by running these commands:
    # export POD_NAME=$(kubectl get pods -n kube-system -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1670571097" -o jsonpath="{.items[0].metadata.name}")
    # kubectl -n kube-system port-forward $POD_NAME 8080:9400 &
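
To confirm that dcgm-exporter is actually producing GPU metrics, you can read its metrics endpoint through the port-forward shown in the Helm notes above. The sketch below assumes that port-forward is running on local port 8080 and that the exporter publishes the `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_GPU_TEMP` counters; `filter_gpu_metrics` is a hypothetical helper name.

```shell
# Read dcgm-exporter output on stdin and keep only the GPU utilization and
# temperature samples. Comment lines (# HELP / # TYPE) are dropped because
# they do not start with the metric name.
filter_gpu_metrics() {
  grep -E '^DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP)[{ ]'
}

# Usage (assumes the port-forward to 8080 is active):
#   curl -s http://localhost:8080/metrics | filter_gpu_metrics
```

If the command prints nothing, check that the dcgm-exporter pod is scheduled on a GPU node and is in the Running state.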

Step 3. Access Grafana dashboard

Perform port forwarding to access the Grafana dashboard.

  1. Use the following kubectl command to forward port 30080 on your local environment to the Grafana dashboard endpoint:

    kubectl port-forward svc/prometheus-grafana -n kube-system 30080:80 &
  2. Open a browser in your local environment and navigate to http://localhost:30080. If the installation succeeded, the Grafana login page appears. On macOS, you can open the page from the terminal with the following command:

    open http://localhost:30080
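
Because the port-forward runs in the background, the page may be requested before the tunnel is ready. The following is a minimal sketch for checking readiness from the terminal, assuming `curl` is installed and that Grafana serves its standard `/api/health` endpoint; `grafana_ready` is a hypothetical helper name.

```shell
# Succeed (exit status 0) only when Grafana answers its health check and
# reports that its database is reachable. The base URL defaults to the
# port-forward address used in this step.
grafana_ready() {
  curl -fsS "${1:-http://localhost:30080}/api/health" \
    | grep -q '"database": *"ok"'
}

# Usage:
#   grafana_ready && echo "Grafana is up"
```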

Use the default Grafana credentials to log in:

Key        Value
username   admin
password   prom-operator

Step 4. Create Grafana data source

Add a Prometheus data source in Grafana so that it can query the metrics stored by the Prometheus server. Use the following values:

Key    Data
Name   k8se-tutorial
URL    http://prometheus-kube-prometheus-prometheus:9090

Create Data Source
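
The same data source can also be created without the UI through Grafana's HTTP API. The sketch below only builds the request body from the values in the table above; it assumes the Grafana port-forward from Step 3 is active and the default credentials are unchanged. `datasource_payload` is a hypothetical helper name, and `"access": "proxy"` is an assumed setting (Grafana's server-side access mode).

```shell
# Print the JSON body for creating the Prometheus data source.
datasource_payload() {
  cat <<'EOF'
{
  "name": "k8se-tutorial",
  "type": "prometheus",
  "access": "proxy",
  "url": "http://prometheus-kube-prometheus-prometheus:9090"
}
EOF
}

# Usage (POST the body to Grafana's data source endpoint):
#   datasource_payload | curl -fsS -u admin:prom-operator \
#     -H 'Content-Type: application/json' \
#     -X POST --data-binary @- http://localhost:30080/api/datasources
```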

Step 5. Apply and check the dashboard

Import the NVIDIA-DCGM-Exporter-Dashboard to visualize GPU metrics.

On the Import Dashboard page, enter the dashboard ID 12239, load it, and select the data source created in Step 4.

Create Dashboard

The dashboard will display metrics such as temperature and GPU utilization.
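
If a panel stays empty, it can help to query Prometheus directly for the metric behind it. The following is a minimal sketch using the Prometheus HTTP API. It assumes a separate port-forward to the Prometheus service on local port 9090; `prom_query` and the `PROM_URL` variable are hypothetical names introduced for illustration.

```shell
# Run an instant PromQL query against the Prometheus HTTP API and print the
# raw JSON response. The query string is URL-encoded by curl.
prom_query() {
  curl -fsS --get --data-urlencode "query=$1" \
    "${PROM_URL:-http://localhost:9090}/api/v1/query"
}

# Usage (assumes this port-forward is running):
#   kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n kube-system 9090:9090 &
#   prom_query 'DCGM_FI_DEV_GPU_UTIL'
```

An empty `result` array in the response suggests that Prometheus is not scraping dcgm-exporter, rather than a problem with the dashboard itself.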