Metric Exporter Installation
To use AI Insight, Metric Exporter or the monitoring agent must be installed in the target environment. If neither is installed, the resource may appear in Agent Missing status, and key metrics such as GPU utilization, memory usage, temperature, idle ratio, and ECC errors may not be collected.
AI Insight uses DCGM Exporter to collect GPU metrics. Metric collection components and prerequisites differ depending on the target environment, such as Kubernetes Engine or Virtual Machine. Check the items for your environment before installation.
It may take some time after installation for data to appear in AI Insight.
Prerequisites
Prepare the following items for the target environment before installation. Prerequisites refer to the access permissions, tools, files, and target resources required to run the installation commands.
| Environment | Prerequisites |
|---|---|
| Kubernetes Engine | - Access to the target cluster with kubectl - Helm installed - GPU nodes prepared - dcgm-exporter-metrics.csv file prepared |
| Virtual Machine | - VM with NVIDIA GPU prepared - Ubuntu 22.04 x86_64 environment - Administrator privileges - Network environment that can download external packages |
The dcgm-exporter-metrics.csv file used in Kubernetes Engine must be in the directory from which you run the GPU Operator installation command, because the command reads it through the relative path ./dcgm-exporter-metrics.csv.
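As a preflight check for the file-location requirement above, the following sketch confirms that the metrics file exists at the expected relative path and contains at least one metric definition. The function name check_metrics_csv is hypothetical and not part of any NVIDIA or AI Insight tool. The rule that blank lines and lines starting with # do not count as definitions is an assumption based on the DCGM Exporter counters file format.

```shell
#!/bin/sh
# check_metrics_csv is a hypothetical helper name used only in this sketch.
# Argument 1 (optional): path to the metrics file; defaults to the relative
# path that the GPU Operator installation command reads.
check_metrics_csv() {
    file="${1:-./dcgm-exporter-metrics.csv}"
    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 1
    fi
    # Count lines that are neither comments (starting with '#') nor blank.
    # Assumption: only such lines define metrics.
    count=$(grep -c -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$file" || true)
    if [ "$count" -eq 0 ]; then
        echo "no metric definitions in: $file" >&2
        return 1
    fi
    echo "ok: $count metric definitions in $file"
}
```

Calling check_metrics_csv with no argument from the directory where you plan to run helm install checks the default path.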
Installation Method
Follow the section below that matches your target environment, Kubernetes Engine or Virtual Machine, to install the metric collection components.
Kubernetes Engine
In Kubernetes Engine, install NVIDIA GPU Operator to configure DCGM Exporter. This guide refers to that procedure as Metric Exporter installation.
1. Add Helm Repository
Run the following commands to add and update the NVIDIA Helm repository.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
2. Check GPU Operator Installation Options
Decide which driver.enabled and mig.strategy values to pass to the GPU Operator installation command. These are not prerequisites; they are installation options that you choose when you run the helm install command.
| Option | Value | Description |
|---|---|---|
| driver.enabled | true | GPU Operator installs and manages the driver. This is the default. |
| driver.enabled | false | Use this when the driver is preinstalled on the node. |
| mig.strategy | none | Disables MIG and uses the whole GPU. This is the default. |
| mig.strategy | single | Applies the same MIG profile to all GPUs. |
| mig.strategy | mixed | Allows different MIG profiles for each GPU. |
If you do not use MIG, specify --set mig.strategy=none. If you use MIG, choose single or mixed according to your operating policy.
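The option values listed above can be validated before they are passed to helm. The following is a minimal sketch: gpu_operator_flags is a hypothetical helper name, and it only prints the two --set flags rather than running any installation.

```shell
#!/bin/sh
# gpu_operator_flags is a hypothetical helper used only for illustration.
# Argument 1: driver.enabled value (true or false)
# Argument 2: mig.strategy value (none, single, or mixed)
gpu_operator_flags() {
    driver="$1"
    mig="$2"
    case "$driver" in
        true|false) ;;
        *) echo "driver.enabled must be true or false" >&2; return 1 ;;
    esac
    case "$mig" in
        none|single|mixed) ;;
        *) echo "mig.strategy must be none, single, or mixed" >&2; return 1 ;;
    esac
    # Print the flags only; nothing is installed by this function.
    echo "--set driver.enabled=$driver --set mig.strategy=$mig"
}
```

For example, gpu_operator_flags false none prints the flags for a node with a preinstalled driver and no MIG, while an unsupported value returns a non-zero status.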
3. Install GPU Operator
Change the driver.enabled and mig.strategy values in the following command to match your environment, then install GPU Operator.
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--version v25.10.0 \
--set driver.enabled=<true|false> \
--set mig.strategy=<none|single|mixed> \
--set dcgmExporter.config.name=custom-dcgm-exporter-metrics \
--set dcgmExporter.config.create=true \
--set-file dcgmExporter.config.data=./dcgm-exporter-metrics.csv \
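--wait
<true|false> and <none|single|mixed> are placeholders. Before running the command, replace each placeholder, including the angle brackets, with exactly one of the listed values, for example --set driver.enabled=false. If the placeholders are left as-is, the shell interprets the angle brackets and the pipe character as redirection and pipeline operators.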
4. Configure DCGM Exporter hostNetwork
After installing GPU Operator, run the following two patch commands, in the order shown, to apply hostNetwork and dnsPolicy to DCGM Exporter.
4-1. Patch ClusterPolicy
kubectl patch clusterpolicy cluster-policy \
-n gpu-operator \
--type=merge \
-p '{
"spec": {
"dcgmExporter": {
"hostNetwork": true,
"dnsPolicy": "ClusterFirstWithHostNet"
}
}
}'
4-2. Patch DaemonSet
kubectl patch daemonset nvidia-dcgm-exporter \
-n gpu-operator \
--type=merge \
-p '{
"spec": {
"template": {
"spec": {
"hostNetwork": true,
"dnsPolicy": "ClusterFirstWithHostNet"
}
}
}
}'
5. Verify Installation
Run the following commands to verify that GPU Operator and DCGM Exporter are installed properly. Replace GPU_NODE_NAME with the name of a GPU node in the cluster.
kubectl get pods -n gpu-operator
kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o yaml | grep -E "hostNetwork|dnsPolicy"
kubectl describe node GPU_NODE_NAME | grep nvidia.com/gpu
| Check Item | Expected Result |
|---|---|
| Pod status | Pods in the gpu-operator namespace are in Running status (one-off validator Pods may show Completed) |
| hostNetwork | hostNetwork: true is displayed |
| dnsPolicy | dnsPolicy: ClusterFirstWithHostNet is displayed |
| GPU resource | nvidia.com/gpu resource is displayed in node details |
If the expected results are not met, see If Data Is Not Displayed After Installation.
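The Pod status check can be scripted so that problem Pods stand out. The sketch below assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and treats Completed as healthy, on the assumption that one-off validator Pods finish in that state. find_unhealthy_pods is a hypothetical helper name, and the Pod names in the usage example are illustrative.

```shell
#!/bin/sh
# find_unhealthy_pods is a hypothetical helper used only in this sketch.
# It reads `kubectl get pods --no-headers` output on stdin and prints the
# name and status of every Pod that is neither Running nor Completed.
# The exit status is non-zero when at least one such Pod is found.
find_unhealthy_pods() {
    awk '
        $3 != "Running" && $3 != "Completed" { print $1, $3; bad++ }
        END { exit (bad > 0) }
    '
}

# Usage (requires cluster access):
#   kubectl get pods -n gpu-operator --no-headers | find_unhealthy_pods
```

An empty result with exit status 0 means every Pod in the namespace is in one of the two accepted states.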
Virtual Machine
In Virtual Machine environments, install DCGM, DCGM Exporter, and the monitoring agent, then configure the monitoring agent to collect Prometheus-format GPU metrics exposed by DCGM Exporter.
The following commands are based on Ubuntu 22.04 x86_64. Package paths or installation commands may differ depending on the operating system or image.
1. Install DCGM
Register the NVIDIA CUDA repository keyring, then install DCGM packages.
# Download NVIDIA CUDA Keyring (GPG authentication key and repository address)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
# Install Keyring
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Install DCGM
sudo apt-get update
sudo apt-get install -y datacenter-gpu-manager-4-cuda12=1:4.4.1-1 \
datacenter-gpu-manager-4-core=1:4.4.1-1 \
datacenter-gpu-manager-4-proprietary=1:4.4.1-1
sudo apt-mark hold datacenter-gpu-manager-4-cuda12 \
datacenter-gpu-manager-4-core \
datacenter-gpu-manager-4-proprietary
2. Enable DCGM Service
Enable the DCGM service and check whether the GPU is recognized on the VM.
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-dcgm
# Check
sudo systemctl status nvidia-dcgm
sudo dcgmi discovery -l
3. Install DCGM Exporter
Install packages required to build DCGM Exporter.
sudo snap install go --classic
sudo apt-get install -y git make
Download the DCGM Exporter source code, then overwrite etc/default-counters.csv with the metric definition file that lists the metrics DCGM Exporter will expose.
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
git checkout 4.4.1-4.6.0
cat <<'EOF' > etc/default-counters.csv
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES, counter, Total number of bytes received through PCIe RX via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
# DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
# DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Errors and violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# DCGM Exporter fields
# DCGM_EXP_CLOCK_EVENTS_COUNT, counter, reported clock events
# DCGM_EXP_XID_ERRORS_COUNT, counter, reported XIDs during last window
# DCGM_EXP_GPU_HEALTH_STATUS, counter, DCGM reported health status
# DCGM_EXP_P2P_STATUS, counter, P2P NvLink status
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
# ECC
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
# VGPU License status
# DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
# Remapped rows
# DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
# DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
# DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
# Static configuration information. These appear as labels on the other metrics
# DCGM_FI_DRIVER_VERSION, label, Driver Version
# DCGM_FI_NVML_VERSION, label, NVML Version
DCGM_FI_DEV_BRAND, label, Device Brand
DCGM_FI_DEV_SERIAL, label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
# Datacenter Profiling (DCP) metrics
# NOTE: supported on Nvidia datacenter Volta GPUs and newer
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned.
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
# DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active.
# DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
# DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
EOF
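Because a typo in the counters file is easy to miss, the file can be linted before building. The sketch below checks that every active line has the three comma-separated fields described in the file header (DCGM field, Prometheus metric type, help message). lint_counters_csv is a hypothetical helper name, and the accepted type list (gauge, counter, label) is an assumption taken from the types that appear in the file above.

```shell
#!/bin/sh
# lint_counters_csv is a hypothetical helper used only in this sketch.
# Argument 1: path to the counters file, for example etc/default-counters.csv.
# Prints each suspicious line and exits non-zero if any is found.
lint_counters_csv() {
    awk -F',' '
        /^[ \t]*#/ || /^[ \t]*$/ { next }      # skip comments and blank lines
        {
            kind = $2
            gsub(/^[ \t]+|[ \t]+$/, "", kind)  # trim spaces around the type field
            # Assumption: only these three types are valid.
            if (NF < 3 || (kind != "gauge" && kind != "counter" && kind != "label")) {
                printf "line %d: %s\n", NR, $0
                bad++
            }
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Running lint_counters_csv etc/default-counters.csv from the dcgm-exporter directory prints nothing when every active line passes the check.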
Build and install DCGM Exporter.
make binary
sudo make install
Register DCGM Exporter as a systemd service.
sudo tee /etc/systemd/system/dcgm-exporter.service <<'EOF'
[Unit]
Description=NVIDIA DCGM Exporter
After=nvidia-dcgm.service
Requires=nvidia-dcgm.service
[Service]
Type=simple
ExecStart=/usr/bin/dcgm-exporter
Restart=always
[Install]
WantedBy=multi-user.target
EOF
Enable DCGM Exporter and check whether metrics are exposed.
sudo systemctl daemon-reload
sudo systemctl enable --now dcgm-exporter
# Check
sudo systemctl status dcgm-exporter
curl -s http://localhost:9400/metrics | grep -i "DCGM_FI_DEV_GPU_TEMP"
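To see every exposed metric rather than a single one, the endpoint output can be reduced to a list of distinct metric names. This is a minimal sketch that relies on the Prometheus text format, in which sample lines start with the metric name and HELP/TYPE lines start with #. list_dcgm_metrics is a hypothetical helper name.

```shell
#!/bin/sh
# list_dcgm_metrics is a hypothetical helper used only in this sketch.
# It reads Prometheus text-format output on stdin and prints the distinct
# metric names that start with DCGM_, one per line, in sorted order.
list_dcgm_metrics() {
    grep '^DCGM_' |          # keep sample lines, drop '# HELP' and '# TYPE'
        sed 's/[{ ].*$//' |  # strip labels and the sample value
        LC_ALL=C sort -u     # deduplicate across GPUs
}

# Usage (requires a running DCGM Exporter):
#   curl -s http://localhost:9400/metrics | list_dcgm_metrics
```

Comparing this list with the active entries in etc/default-counters.csv shows which configured metrics the GPU does not report.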
4. Install Monitoring Agent and Configure Metric Collection
Install the monitoring agent.
wget https://objectstorage.kr-central-2.kakaocloud.com/v1/52867b7dc99d45fb808b5bc874cb5b79/kic-monitoring-agent/package/kic_monitor_agent_1.1.0_amd64.deb
sudo dpkg -i kic_monitor_agent_1.1.0_amd64.deb
sudo systemctl enable kic_monitor_agent
Add a configuration so that the monitoring agent collects Prometheus-format GPU metrics exposed by DCGM Exporter.
sudo sed -i '/\[\[inputs\.mem\]\]/i \
[[inputs.prometheus]]\
urls = ["http://localhost:9400/metrics"]\
metric_version = 2\
' /etc/kic_monitor_agent/kic_monitor_agent.conf
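Running the command above a second time inserts the same block again. The following sketch shows one way to make the edit repeatable: it skips the insertion when the DCGM Exporter URL is already present. It uses awk and a temporary file instead of sed -i, and add_prometheus_input is a hypothetical helper name, not part of the monitoring agent.

```shell
#!/bin/sh
# add_prometheus_input is a hypothetical helper used only in this sketch.
# Argument 1: path to the agent configuration file.
add_prometheus_input() {
    conf="$1"
    url="http://localhost:9400/metrics"
    # Skip the edit when the DCGM Exporter URL is already configured.
    if grep -q -F "$url" "$conf"; then
        echo "already configured: $conf"
        return 0
    fi
    # The block is inserted before the [[inputs.mem]] section, as in the guide.
    if ! grep -q -F '[[inputs.mem]]' "$conf"; then
        echo "anchor [[inputs.mem]] not found in: $conf" >&2
        return 1
    fi
    tmp=$(mktemp) || return 1
    awk -v url="$url" '
        /\[\[inputs\.mem\]\]/ && !done {
            print "[[inputs.prometheus]]"
            print "urls = [\"" url "\"]"
            print "metric_version = 2"
            print ""
            done = 1
        }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"   # cat keeps the file's owner and mode
    rm -f "$tmp"
    echo "inserted prometheus input into $conf"
}
```

To apply it to /etc/kic_monitor_agent/kic_monitor_agent.conf, run the script with administrator privileges.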
Restart the monitoring agent and follow its logs. Press Ctrl+C to stop following the log output.
sudo systemctl restart kic_monitor_agent.service
sudo journalctl -u kic_monitor_agent.service -f
5. Verify Installation
Check the following items to verify that DCGM, DCGM Exporter, and the monitoring agent are working properly.
| Check Item | Expected Result |
|---|---|
| DCGM service | nvidia-dcgm service is active |
| GPU recognition | GPU information is displayed by sudo dcgmi discovery -l |
| DCGM Exporter | dcgm-exporter service is active |
| Metric exposure | DCGM metrics are displayed at localhost:9400/metrics |
| Monitoring agent | kic_monitor_agent service logs show no errors |
If the expected results are not met, see If Data Is Not Displayed After Installation.
If Data Is Not Displayed After Installation
If data is not displayed in AI Insight after installation, or the Agent Missing status persists, check the following items.
| Environment | Check Item | Description |
|---|---|---|
| Kubernetes Engine | Pod status | Check GPU Operator-related Pod status with kubectl get pods -n gpu-operator |
| Kubernetes Engine | DCGM Exporter status | Check whether the nvidia-dcgm-exporter DaemonSet is running properly |
| Kubernetes Engine | hostNetwork settings | Check whether hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet are applied |
| Kubernetes Engine | GPU resource recognition | Check whether the nvidia.com/gpu resource is displayed in GPU node details |
| Virtual Machine | DCGM service status | Check whether the nvidia-dcgm service is active |
| Virtual Machine | DCGM Exporter status | Check whether the dcgm-exporter service is active |
| Virtual Machine | Metric exposure | Check whether DCGM metrics are displayed at localhost:9400/metrics |
| Virtual Machine | Monitoring agent status | Check the kic_monitor_agent service and logs |
| Common | Time range | Change the time range in AI Insight and query again |
| Common | Refresh | Run manual refresh or configure auto refresh, then query again |
For more information, see Troubleshooting.