
Setting up NVIDIA GPU monitoring in K8S cluster

This guide explains how to set up a monitoring environment for NVIDIA GPUs in a Kubernetes cluster. You will create an NVIDIA GPU worker node pool in a cluster created with Kubernetes Engine, then build a monitoring environment on top of it.

Step 1. Create GPU node pool in Kubernetes cluster

  1. Create a worker node pool with NVIDIA GPUs in a new or existing Kubernetes cluster, using the following configuration:

    Node Pool Type    Instance type    Quantity    Disk (GB)
    GPU               p2i.6xlarge      1           100
  2. To expose the GPUs to the Kubernetes cluster, install the NVIDIA k8s-device-plugin with Helm. Open your terminal and run the following Helm commands:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    helm install \
    --version=0.12.3 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --generate-name \
    nvdp/nvidia-device-plugin
    # NAME: nvidia-device-plugin-*
    # LAST DEPLOYED: Fri Dec 9 16:26:48 2022
    # NAMESPACE: nvidia-device-plugin
    # STATUS: deployed
    # REVISION: 1
    # TEST SUITE: None
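
Once the device plugin pods are running, each GPU node should advertise an allocatable nvidia.com/gpu resource. The following is a minimal sketch for checking this, assuming kubectl already points at the cluster. The helper name list_gpu_nodes is hypothetical and not part of the original guide.

```shell
# Hypothetical helper: print each node that advertises allocatable GPUs.
# Nodes without the resource show "<none>" in the GPU column and are skipped.
list_gpu_nodes() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 != "<none>" && $2 + 0 > 0 { print $1, $2 }'
}

# Usage (requires cluster access):
#   list_gpu_nodes
```

If the function prints nothing, the plugin may still be starting, or the node pool may not have joined the cluster yet.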

Step 2. Set up monitoring environment

Helm makes it easy to create and manage the Prometheus stack in your Kubernetes cluster. Install the kube-prometheus-stack chart to deploy Prometheus and Grafana, then install dcgm-exporter to provide GPU metrics.

  1. Create a directory in your local environment for the example tasks:

    mkdir ~/Downloads/ike-gpu-monitor
    cd ~/Downloads/ike-gpu-monitor
  2. Download the pre-configured values file, prom-values.yaml, for the kube-prometheus-stack:

    curl -O https://raw.githubusercontent.com/kakaoenterprise/kc-handson-config/k8s-gpu-monitor/prom-values.yaml
  3. Install the kube-prometheus-stack with the downloaded prom-values.yaml:

    helm repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts
    helm install prometheus prometheus-community/kube-prometheus-stack -f prom-values.yaml -n kube-system
    # NAME: prometheus
    # LAST DEPLOYED: Fri Dec 9 16:29:51 2022
    # NAMESPACE: kube-system
    # STATUS: deployed
    # REVISION: 1
    # NOTES:
    # Check its status by running:
    # kubectl --namespace kube-system get pods -l "release=prometheus"
  4. Install dcgm-exporter to provide GPU metrics:

    helm repo add gpu-helm-charts \
    https://nvidia.github.io/dcgm-exporter/helm-charts
    helm repo update
    helm install --generate-name gpu-helm-charts/dcgm-exporter -n kube-system
    # NAME: dcgm-exporter-*
    # LAST DEPLOYED: Fri Dec 9 16:31:41 2022
    # NAMESPACE: kube-system
    # STATUS: deployed
    # NOTES:
    # Get the application URL:
    # export POD_NAME=$(kubectl get pods -n kube-system -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1670571097" -o jsonpath="{.items[0].metadata.name}")
    # kubectl -n kube-system port-forward $POD_NAME 8080:9400 &
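
Before moving on to Grafana, it can help to confirm that dcgm-exporter is actually serving GPU metrics. The sketch below filters the exporter's Prometheus text output down to utilization and temperature samples. The helper name filter_gpu_metrics is hypothetical, and DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are assumed to be enabled in the exporter's metric set for your chart version.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# only the GPU utilization and temperature samples (comment lines are dropped).
filter_gpu_metrics() {
  grep -E '^DCGM_FI_DEV_GPU_(UTIL|TEMP)[{ ]'
}

# Usage (requires cluster access; 8080:9400 matches the chart notes above):
#   POD_NAME=$(kubectl get pods -n kube-system \
#     -l app.kubernetes.io/name=dcgm-exporter \
#     -o jsonpath='{.items[0].metadata.name}')
#   kubectl -n kube-system port-forward "$POD_NAME" 8080:9400 &
#   sleep 2   # give the port-forward a moment to start
#   curl -s http://localhost:8080/metrics | filter_gpu_metrics
```

One sample line per GPU and metric is expected. An empty result suggests the exporter pod is not scheduled on the GPU node or cannot reach the driver.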

Step 3. Access Grafana dashboard

To access the Grafana dashboard from your local machine, forward a local port to the Grafana service.

  1. Run the following kubectl command to forward local port 30080 to port 80 of the Grafana service:

    kubectl port-forward svc/prometheus-grafana -n kube-system 30080:80 &
  2. Open your browser and go to http://localhost:30080. If the installation succeeded, Grafana should load. On macOS, you can also open the URL from the terminal:

    open http://localhost:30080

    Use the following credentials to log in to Grafana:

    Key         Value
    username    admin
    password    prom-operator
  3. Once logged in, you will see the Grafana home screen.
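
If the page does not load, a quick check from the terminal can tell you whether the port-forward and Grafana itself are working. The following sketch calls Grafana's /api/health endpoint and looks for a healthy database status. The helper name grafana_is_healthy is hypothetical, and the default URL assumes the port-forward from this step is still running.

```shell
# Hypothetical helper: succeed if Grafana's health endpoint reports that its
# database is reachable. The base URL defaults to the port-forward above.
grafana_is_healthy() {
  base_url=${1:-http://localhost:30080}
  curl -fsS "$base_url/api/health" | grep -Eq '"database" *: *"ok"'
}

# Usage (requires the port-forward to be running):
#   grafana_is_healthy && echo "Grafana is up"
```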

Step 4. Create Grafana data source

The Prometheus server, part of the Prometheus stack, collects node metrics from prometheus-node-exporter. To query these metrics from Grafana, create a Prometheus data source with the following details:

Key     Data
Name    k8se-tutorial
URL     http://prometheus-kube-prometheus-prometheus:9090
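
The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request body. The helper name datasource_payload is hypothetical, and the "proxy" access mode is an assumption that matches a cluster-internal Prometheus URL.

```shell
# Hypothetical helper: build the JSON body for Grafana's POST /api/datasources.
# Arguments: data source name, Prometheus URL (neither may contain quotes).
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy"}\n' "$1" "$2"
}

# Usage (requires the port-forward from Step 3 and the admin credentials):
#   datasource_payload k8se-tutorial http://prometheus-kube-prometheus-prometheus:9090 |
#     curl -fsS -u admin:prom-operator -H 'Content-Type: application/json' \
#       -X POST --data @- http://localhost:30080/api/datasources
```

Keeping the payload in a separate function makes it easy to inspect the JSON before sending it.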

Step 5. Apply and check the dashboard

Create a dashboard to visualize the metrics from the data source. This guide uses the NVIDIA-DCGM-Exporter-Dashboard as an example to display dcgm-exporter metrics. In Grafana, import the dashboard by entering its ID (12239), then select the data source created in Step 4 to create the GPU monitoring dashboard.

You can now monitor GPU metrics like temperature and utilization through the created dashboard.
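
If a dashboard panel stays empty, you can check whether Prometheus has the underlying series by querying it directly. The following is a minimal sketch that uses the Prometheus HTTP API (/api/v1/query). The helper name prom_query is hypothetical, the service name is taken from the data source URL in Step 4, and the gpu label in the example query is an assumption about how dcgm-exporter labels its samples.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP
# API and print the raw JSON response. Base URL defaults to a local port-forward.
prom_query() {
  query=$1
  base_url=${2:-http://localhost:9090}
  curl -fsS -G "$base_url/api/v1/query" --data-urlencode "query=$query"
}

# Usage (requires cluster access):
#   kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n kube-system 9090:9090 &
#   prom_query 'avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)'
```

An empty result list in the response means Prometheus is not scraping dcgm-exporter, which points to the scrape configuration rather than to Grafana.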