Metric Exporter Installation

To use AI Insight, Metric Exporter or the monitoring agent must be installed in the target environment. If it is not installed, the resource may appear in Agent Missing status, and key metrics such as GPU utilization, memory usage, temperature, idle ratio, and ECC errors may not be collected.

AI Insight uses DCGM Exporter to collect GPU metrics. Metric collection components and prerequisites differ depending on the target environment, such as Kubernetes Engine or Virtual Machine. Check the items for your environment before installation.

Note

After installation, it may take some time for actual data to appear in AI Insight.

Prerequisites

Prepare the following items for the target environment before installation. Prerequisites refer to the access permissions, tools, files, and target resources required to run the installation commands.

Environment	Prerequisites
Kubernetes Engine	Access to the target cluster with `kubectl`; Helm installed; GPU nodes prepared; `dcgm-exporter-metrics.csv` file prepared
Virtual Machine	VM with NVIDIA GPU prepared; Ubuntu 22.04 x86_64 environment; administrator privileges; network environment that can download external packages

Caution

The dcgm-exporter-metrics.csv file used in Kubernetes Engine must be in the directory from which you run the GPU Operator installation command, because that command reads the file through the relative path ./dcgm-exporter-metrics.csv.
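As a pre-flight check for the caution above, the following minimal sketch fails early when the CSV file is missing or empty. It assumes bash; the function name `require_metrics_file` is hypothetical and not part of any tool in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Helm or GPU Operator): fail early if the
# metrics CSV is missing or empty in the given directory (default: current one).
require_metrics_file() {
  local dir="${1:-.}"
  local file="$dir/dcgm-exporter-metrics.csv"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  echo "found: $file"
}
# Usage: run from the directory where you will run helm install.
#   require_metrics_file .
```

Running the check immediately before the installation command surfaces a wrong working directory before Helm is invoked.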

Installation Method

Follow the procedure for your target environment, Kubernetes Engine or Virtual Machine, to install the metric collection components.

Kubernetes Engine

In Kubernetes Engine, install NVIDIA GPU Operator to configure DCGM Exporter. In this guide, this is described as Metric Exporter installation.

1. Add Helm Repository

Run the following commands to add and update the NVIDIA Helm repository.

Add Helm Repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

2. Check GPU Operator Installation Options

Decide which driver.enabled and mig.strategy values to pass to the GPU Operator installation command. These are not prerequisites; they are installation options that you choose when running the helm install command.

Option	Value	Description
`driver.enabled`	`true`	GPU Operator installs and manages the driver. This is the default.
`driver.enabled`	`false`	Use this when the driver is preinstalled on the node.
`mig.strategy`	`none`	Disables MIG and uses the whole GPU. This is the default.
`mig.strategy`	`single`	Applies the same MIG profile to all GPUs.
`mig.strategy`	`mixed`	Allows different MIG profiles for each GPU.

Note

If you do not use MIG, specify --set mig.strategy=none. In environments that use MIG, select single or mixed according to your operation policy.

3. Install GPU Operator

Change the driver.enabled and mig.strategy values in the following command to match your environment, then install GPU Operator.

Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v25.10.0 \
  --set driver.enabled=<true|false> \
  --set mig.strategy=<none|single|mixed> \
  --set dcgmExporter.config.name=custom-dcgm-exporter-metrics \
  --set dcgmExporter.config.create=true \
  --set-file dcgmExporter.config.data=./dcgm-exporter-metrics.csv \
  --wait

Caution

<true|false> and <none|single|mixed> are placeholders. Before running the command, replace each placeholder, including its angle brackets, with exactly one of the listed values.
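To illustrate how the placeholders resolve, the sketch below validates the two option values and prints the corresponding flags. It assumes bash; the function name `gpu_operator_flags` is hypothetical and only formats text, it does not call Helm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate the two option values from the table in step 2
# and print the matching --set flags. It does not run helm.
gpu_operator_flags() {
  local driver_enabled="$1" mig_strategy="$2"
  case "$driver_enabled" in
    true|false) ;;
    *) echo "driver.enabled must be true or false" >&2; return 1 ;;
  esac
  case "$mig_strategy" in
    none|single|mixed) ;;
    *) echo "mig.strategy must be none, single, or mixed" >&2; return 1 ;;
  esac
  printf -- '--set driver.enabled=%s --set mig.strategy=%s\n' \
    "$driver_enabled" "$mig_strategy"
}
# Usage: nodes with a preinstalled driver and no MIG.
#   gpu_operator_flags false none
#   prints: --set driver.enabled=false --set mig.strategy=none
```

A value left in placeholder form, such as `<true|false>`, is rejected by the first check, which is the mistake the caution above warns about.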

4. Configure DCGM Exporter hostNetwork

After installing GPU Operator, run the following patch commands in order to apply hostNetwork and dnsPolicy to DCGM Exporter.

4-1. Patch ClusterPolicy

Patch ClusterPolicy
kubectl patch clusterpolicy cluster-policy \
  -n gpu-operator \
  --type=merge \
  -p '{
    "spec": {
      "dcgmExporter": {
        "hostNetwork": true,
        "dnsPolicy": "ClusterFirstWithHostNet"
      }
    }
  }'

4-2. Patch DaemonSet

Patch DaemonSet
kubectl patch daemonset nvidia-dcgm-exporter \
  -n gpu-operator \
  --type=merge \
  -p '{
    "spec": {
      "template": {
        "spec": {
          "hostNetwork": true,
          "dnsPolicy": "ClusterFirstWithHostNet"
        }
      }
    }
  }'

5. Verify Installation

Run the following commands to verify that GPU Operator and DCGM Exporter are installed properly.

Check All Pod Status
kubectl get pods -n gpu-operator

Check hostNetwork
kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o yaml | grep -E "hostNetwork|dnsPolicy"

Check GPU Resource Recognition
kubectl describe node GPU_NODE_NAME | grep nvidia.com/gpu

Check Item	Expected Result
Pod status	Pods in the `gpu-operator` namespace are in Running status
hostNetwork	`hostNetwork: true` is displayed
dnsPolicy	`dnsPolicy: ClusterFirstWithHostNet` is displayed
GPU resource	`nvidia.com/gpu` resource is displayed in node details

If the expected results are not met, see If Data Is Not Displayed After Installation.
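The pod status check can be automated with a small filter such as the sketch below. It assumes bash and awk, and it treats Completed as healthy alongside Running on the assumption that one-shot validation pods in the namespace finish in that state; adjust this if your environment differs. The function name `unhealthy_pods` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the name and status of every pod that is neither Running nor Completed.
# Returns non-zero if at least one such pod is found.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
       END { exit bad }'
}
# Usage (requires access to the cluster):
#   kubectl get pods -n gpu-operator --no-headers | unhealthy_pods
```

An empty result with exit status 0 corresponds to the "Pod status" row in the table above.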

Virtual Machine

In Virtual Machine environments, install DCGM, DCGM Exporter, and the monitoring agent, then configure the monitoring agent to collect Prometheus-format GPU metrics exposed by DCGM Exporter.

Caution

The following commands are based on Ubuntu 22.04 x86_64. Package paths or installation commands may differ depending on the operating system or image.

1. Install DCGM

Install NVIDIA CUDA Keyring and DCGM
# Download the NVIDIA CUDA keyring package (repository GPG key and APT source configuration)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

# Install Keyring
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Install DCGM
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager-4-cuda12=1:4.4.1-1 \
                        datacenter-gpu-manager-4-core=1:4.4.1-1 \
                        datacenter-gpu-manager-4-proprietary=1:4.4.1-1
sudo apt-mark hold datacenter-gpu-manager-4-cuda12 \
                   datacenter-gpu-manager-4-core \
                   datacenter-gpu-manager-4-proprietary

2. Enable DCGM Service

Enable the DCGM service and check whether the GPU is recognized on the VM.

Enable and Check DCGM Service
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-dcgm

# Check
sudo systemctl status nvidia-dcgm
sudo dcgmi discovery -l

3. Install DCGM Exporter

Install the packages required to build DCGM Exporter.

Install Dependencies
sudo snap install go --classic
sudo apt-get install -y git make

Download the DCGM Exporter source code and create the metric definition file that DCGM Exporter will expose.

Prepare DCGM Exporter Build
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
git checkout 4.4.1-4.6.0

Create default-counters.csv
cat <<'EOF' > etc/default-counters.csv
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
# DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
# DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations
DCGM_FI_DEV_XID_ERRORS,              gauge,   Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# DCGM Exporter fields

# DCGM_EXP_CLOCK_EVENTS_COUNT, counter, reported clock events
# DCGM_EXP_XID_ERRORS_COUNT, counter, reported XIDs during last window
# DCGM_EXP_GPU_HEALTH_STATUS, counter, DCGM reported health status
# DCGM_EXP_P2P_STATUS, counter, P2P NvLink status

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).

# ECC
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status
# DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows
# DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
# DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
# DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# Static configuration information. These appear as labels on the other metrics
# DCGM_FI_DRIVER_VERSION,        label, Driver Version
# DCGM_FI_NVML_VERSION,          label, NVML Version
DCGM_FI_DEV_BRAND,             label, Device Brand
DCGM_FI_DEV_SERIAL,            label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device

# Datacenter Profiling (DCP) metrics
# NOTE: supported on Nvidia datacenter Volta GPUs and newer
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active.
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned.
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
# DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active.
# DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
# DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
EOF
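Because most lines in the file above are commented out, it can help to list only the fields that are actually enabled. The following is a minimal sketch assuming bash and awk; the function name `enabled_counters` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the field name and metric type of every enabled
# (non-comment, non-blank) line in a counters CSV file.
enabled_counters() {
  awk -F',' '
    /^[ \t]*#/ || /^[ \t]*$/ { next }        # skip comments and blank lines
    {
      name = $1; type = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", type)
      print name, type
    }
  ' "$1"
}
# Usage (from the dcgm-exporter directory):
#   enabled_counters etc/default-counters.csv
```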

Build and install DCGM Exporter.

Build and Install DCGM Exporter
make binary
sudo make install

Register DCGM Exporter Service
sudo tee /etc/systemd/system/dcgm-exporter.service <<'EOF'
[Unit]
Description=NVIDIA DCGM Exporter
After=nvidia-dcgm.service
Requires=nvidia-dcgm.service

[Service]
Type=simple
ExecStart=/usr/bin/dcgm-exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF

Enable DCGM Exporter and check whether metrics are exposed.

Enable and Check DCGM Exporter Service
sudo systemctl daemon-reload
sudo systemctl enable --now dcgm-exporter

# Check
sudo systemctl status dcgm-exporter
curl -s http://localhost:9400/metrics | grep -i "DCGM_FI_DEV_GPU_TEMP"
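To read the values rather than only confirm that the metric name appears, the endpoint output can be filtered as in this sketch. It assumes bash and awk, and that each sample line ends with the value (no trailing timestamp); the function name `metric_values` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text format on stdin and print the last
# whitespace-separated field (assumed to be the sample value) of each series
# whose metric name matches the argument exactly.
metric_values() {
  local metric="$1"
  awk -v m="$metric" '
    /^#/ { next }                                   # skip HELP and TYPE lines
    index($0, m "{") == 1 || $1 == m { print $NF }  # labeled or unlabeled series
  '
}
# Usage (on the VM):
#   curl -s http://localhost:9400/metrics | metric_values DCGM_FI_DEV_GPU_TEMP
```

One line of output per GPU is the expected shape; no output suggests that the field is not enabled in default-counters.csv or that the exporter is not serving metrics.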

4. Install Monitoring Agent and Configure Metric Collection

Install the monitoring agent.

Install Monitoring Agent
wget https://objectstorage.kr-central-2.kakaocloud.com/v1/52867b7dc99d45fb808b5bc874cb5b79/kic-monitoring-agent/package/kic_monitor_agent_1.1.0_amd64.deb
sudo dpkg -i kic_monitor_agent_1.1.0_amd64.deb
sudo systemctl enable kic_monitor_agent

Add a configuration so that the monitoring agent collects Prometheus-format GPU metrics exposed by DCGM Exporter.

Add Prometheus-format Metric Collection Configuration
sudo sed -i '/\[\[inputs\.mem\]\]/i \
[[inputs.prometheus]]\
  urls = ["http://localhost:9400/metrics"]\
  metric_version = 2\
' /etc/kic_monitor_agent/kic_monitor_agent.conf
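Running the sed command above more than once inserts the block each time. The sketch below is one way to check whether the configuration is already present before inserting it. It assumes bash and grep and that the configuration file is readable by the current user; the function name `has_prometheus_input` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file contains both an
# [[inputs.prometheus]] header and the given URL in double quotes.
# It does not verify that the URL sits inside that specific block.
has_prometheus_input() {
  local conf="$1" url="${2:-http://localhost:9400/metrics}"
  grep -qF '[[inputs.prometheus]]' "$conf" && grep -qF "\"$url\"" "$conf"
}
# Usage: skip the sed command when the block is already present.
#   has_prometheus_input /etc/kic_monitor_agent/kic_monitor_agent.conf \
#     && echo "already configured"
```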

Restart the monitoring agent and check the execution logs.

Restart Monitoring Agent and Check Logs
sudo systemctl restart kic_monitor_agent.service
sudo journalctl -u kic_monitor_agent.service -f

5. Verify Installation

Check the following items to verify that DCGM, DCGM Exporter, and the monitoring agent are working properly.

Check Item	Expected Result
DCGM service	`nvidia-dcgm` service is active
GPU recognition	GPU information is displayed by `sudo dcgmi discovery -l`
DCGM Exporter	`dcgm-exporter` service is active
Metric exposure	DCGM metrics are displayed at `localhost:9400/metrics`
Monitoring agent	`kic_monitor_agent` service logs show no errors

If the expected results are not met, see If Data Is Not Displayed After Installation.
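The service rows in the table above can be checked in one pass with a sketch like the following. It assumes bash and systemd; the function name `check_services` is hypothetical, and the check covers only service state, not log contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the state of each named systemd service and
# return non-zero if any of them is not active.
check_services() {
  local rc=0 svc state
  for svc in "$@"; do
    state="$(systemctl is-active "$svc" 2>/dev/null)" || rc=1
    echo "$svc: ${state:-unknown}"
  done
  return "$rc"
}
# Usage (on the VM):
#   check_services nvidia-dcgm dcgm-exporter kic_monitor_agent
```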

If Data Is Not Displayed After Installation

If data is not displayed in AI Insight after installation, or the Agent Missing status persists, check the following items.

Environment	Check Item	Description
Kubernetes Engine	Pod status	Check GPU Operator-related Pod status with `kubectl get pods -n gpu-operator`
Kubernetes Engine	DCGM Exporter status	Check whether the `nvidia-dcgm-exporter` DaemonSet is running properly
Kubernetes Engine	hostNetwork settings	Check whether `hostNetwork: true` and `dnsPolicy: ClusterFirstWithHostNet` are applied
Kubernetes Engine	GPU resource recognition	Check whether the `nvidia.com/gpu` resource is displayed in GPU node details
Virtual Machine	DCGM service status	Check whether the `nvidia-dcgm` service is active
Virtual Machine	DCGM Exporter status	Check whether the `dcgm-exporter` service is active
Virtual Machine	Metric exposure	Check whether DCGM metrics are displayed at `localhost:9400/metrics`
Virtual Machine	Monitoring agent status	Check `kic_monitor_agent` service and logs
Common	Time range	Change the time range in AI Insight and query again
Common	Refresh	Run manual refresh or configure auto refresh, then query again

For more information, see Troubleshooting.
