Metric Exporter Installation

To use AI Insight, either Metric Exporter or the monitoring agent must be installed in the target environment. If neither is installed, the resource may appear with Agent Missing status, and key metrics such as GPU utilization, memory usage, temperature, idle ratio, and ECC errors may not be collected.

AI Insight uses DCGM Exporter to collect GPU metrics. Metric collection components and prerequisites differ depending on the target environment, such as Kubernetes Engine or Virtual Machine. Check the items for your environment before installation.

Note

It may take some time after installation for data to appear in AI Insight.

Prerequisites

Prepare the following items for the target environment before installation. Prerequisites refer to the access permissions, tools, files, and target resources required to run the installation commands.

Kubernetes Engine:
- Access to the target cluster with kubectl
- Helm installed
- GPU nodes prepared
- dcgm-exporter-metrics.csv file prepared

Virtual Machine:
- VM with NVIDIA GPU prepared
- Ubuntu 22.04 x86_64 environment
- Administrator privileges
- Network environment that can download external packages

Caution

The dcgm-exporter-metrics.csv file used in Kubernetes Engine must be in the current directory where you run the GPU Operator installation command.
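
The Kubernetes Engine prerequisites can be checked from the shell before installation. The following is a minimal sketch, assuming a POSIX-style shell such as Bash; the function name check_k8s_prereqs and the 10-second request timeout are hypothetical choices for illustration and are not part of AI Insight.

```shell
#!/usr/bin/env bash
# Pre-flight check for the Kubernetes Engine prerequisites (illustrative sketch).
# check_k8s_prereqs is a hypothetical helper name, not an AI Insight command.

check_k8s_prereqs() {
  local tool missing=0

  # kubectl and helm must be available on PATH.
  for tool in kubectl helm; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool"
      missing=1
    fi
  done

  # If kubectl exists, confirm that it can reach the target cluster.
  if command -v kubectl >/dev/null 2>&1 &&
     ! kubectl get nodes --request-timeout=10s >/dev/null 2>&1; then
    echo "kubectl cannot reach the cluster"
    missing=1
  fi

  # The metrics file must be in the directory where helm install will run.
  if [ ! -f ./dcgm-exporter-metrics.csv ]; then
    echo "missing file: ./dcgm-exporter-metrics.csv"
    missing=1
  fi

  return "$missing"
}

# Example: run from the directory that holds the CSV file.
# check_k8s_prereqs && echo "prerequisites look OK"
```

The function prints one line per missing item and returns a non-zero status if anything is missing, so it can gate the installation command in a script.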

Installation Method

Follow the instructions for your target environment to install the metric collection components.

Kubernetes Engine

In Kubernetes Engine, install NVIDIA GPU Operator to configure DCGM Exporter. This guide refers to this procedure as Metric Exporter installation.

1. Add Helm Repository

Run the following commands to add and update the NVIDIA Helm repository.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
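
After the repository is updated, it can be useful to confirm that the chart version you plan to install is listed. The sketch below assumes that the output of helm search repo has the chart version in its second column; chart_version_listed is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Check that a chart version is listed in the repository (illustrative sketch).
# chart_version_listed is a hypothetical helper, not a Helm command.

# Succeed if the given version appears in the CHART VERSION column
# of `helm search repo` output read from stdin.
chart_version_listed() {
  awk -v want="$1" '$2 == want { found = 1 } END { exit !found }'
}

# Example:
# helm search repo nvidia/gpu-operator --versions | chart_version_listed v25.10.0 \
#   && echo "chart version found"
```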

2. Check GPU Operator Installation Options

Check the driver.enabled and mig.strategy values to enter in the GPU Operator installation command. These are not prerequisites, but installation options selected when running the helm install command.

Option | Value | Description
driver.enabled | true | GPU Operator installs and manages the driver. This is the default.
driver.enabled | false | Use this when the driver is preinstalled on the node.
mig.strategy | none | Disables MIG and uses the whole GPU. This is the default.
mig.strategy | single | Applies the same MIG profile to all GPUs.
mig.strategy | mixed | Allows different MIG profiles for each GPU.

Note

If you do not use MIG, specify --set mig.strategy=none. In environments that use MIG, select single or mixed according to your operating policy.

3. Install GPU Operator

Change the driver.enabled and mig.strategy values in the following command to match your environment, then install GPU Operator.

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v25.10.0 \
  --set driver.enabled=<true|false> \
  --set mig.strategy=<none|single|mixed> \
  --set dcgmExporter.config.name=custom-dcgm-exporter-metrics \
  --set dcgmExporter.config.create=true \
  --set-file dcgmExporter.config.data=./dcgm-exporter-metrics.csv \
  --wait

Caution

<true|false> and <none|single|mixed> are example placeholders. Before running the command, remove the placeholder notation including angle brackets and enter only one value.
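
One way to avoid leaving a placeholder in the command is to hold the two values in shell variables and validate them before calling helm. The following is a minimal sketch; validate_gpu_operator_options and the variable names are hypothetical, and the example values (driver preinstalled, MIG not used) are assumptions to replace with your own.

```shell
#!/usr/bin/env bash
# Validate the two installation options before running helm install (sketch).
# validate_gpu_operator_options is a hypothetical helper, not part of Helm or GPU Operator.

validate_gpu_operator_options() {
  local driver_enabled="$1" mig_strategy="$2"

  case "$driver_enabled" in
    true|false) ;;
    *) echo "driver.enabled must be true or false, got: $driver_enabled" >&2
       return 1 ;;
  esac

  case "$mig_strategy" in
    none|single|mixed) ;;
    *) echo "mig.strategy must be none, single, or mixed, got: $mig_strategy" >&2
       return 1 ;;
  esac
}

# Example values (assumptions): driver preinstalled on the nodes, MIG not used.
DRIVER_ENABLED=false
MIG_STRATEGY=none

if validate_gpu_operator_options "$DRIVER_ENABLED" "$MIG_STRATEGY"; then
  # Print the two flags to substitute into the helm install command.
  echo "--set driver.enabled=${DRIVER_ENABLED} --set mig.strategy=${MIG_STRATEGY}"
fi
```

A value that still contains the placeholder notation, such as <true|false>, fails validation because it matches none of the allowed values.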

4. Configure DCGM Exporter hostNetwork

After installing GPU Operator, run the following patch commands in order to apply hostNetwork and dnsPolicy to DCGM Exporter.

4-1. Patch ClusterPolicy

kubectl patch clusterpolicy cluster-policy \
  -n gpu-operator \
  --type=merge \
  -p '{
    "spec": {
      "dcgmExporter": {
        "hostNetwork": true,
        "dnsPolicy": "ClusterFirstWithHostNet"
      }
    }
  }'

4-2. Patch DaemonSet

kubectl patch daemonset nvidia-dcgm-exporter \
  -n gpu-operator \
  --type=merge \
  -p '{
    "spec": {
      "template": {
        "spec": {
          "hostNetwork": true,
          "dnsPolicy": "ClusterFirstWithHostNet"
        }
      }
    }
  }'
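
Patching the DaemonSet template triggers a rollout of the DCGM Exporter Pods, so it can help to wait for that rollout before moving on to verification. The following is a minimal sketch; the helper names are hypothetical and the 180-second timeout is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Confirm that the patches took effect (illustrative sketch).
# Helper names are hypothetical; the 180s timeout is an arbitrary example value.

wait_for_dcgm_exporter() {
  # Block until the patched DaemonSet has finished rolling out, or time out.
  kubectl rollout status daemonset/nvidia-dcgm-exporter \
    -n gpu-operator --timeout=180s
}

show_clusterpolicy_dcgm_network() {
  # Print the hostNetwork and dnsPolicy values stored in the ClusterPolicy.
  kubectl get clusterpolicy cluster-policy \
    -o jsonpath='{.spec.dcgmExporter.hostNetwork} {.spec.dcgmExporter.dnsPolicy}{"\n"}'
}

# Example:
# wait_for_dcgm_exporter && show_clusterpolicy_dcgm_network
```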

5. Verify Installation

Run the following commands to verify that GPU Operator and DCGM Exporter are installed properly.

Check All Pod Status
kubectl get pods -n gpu-operator

Check hostNetwork
kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o yaml | grep -E "hostNetwork|dnsPolicy"

Check GPU Resource Recognition (replace GPU_NODE_NAME with the name of a GPU node)
kubectl describe node GPU_NODE_NAME | grep nvidia.com/gpu

Check Item | Expected Result
Pod status | Pods in the gpu-operator namespace are in Running status
hostNetwork | hostNetwork: true is displayed
dnsPolicy | dnsPolicy: ClusterFirstWithHostNet is displayed
GPU resource | nvidia.com/gpu resource is displayed in node details

If the expected results are not met, see If Data Is Not Displayed After Installation.
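
The first two checks can also be evaluated by a script instead of by reading the output. The sketch below defines two hypothetical filter functions that read command output from stdin. Treating a Completed status as healthy is an assumption made here to cover one-shot validation Pods; it is not stated in the table above.

```shell
#!/usr/bin/env bash
# Evaluate the verification output automatically (illustrative sketch).
# Function names are hypothetical; each reads command output from stdin.

# Print the name and status of Pods that are neither Running nor Completed.
# Expects `kubectl get pods --no-headers` output (NAME READY STATUS RESTARTS AGE).
# Assumption: Completed is acceptable for one-shot validation Pods.
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Succeed only if both expected settings appear in the Pod YAML read from stdin.
has_host_network() {
  local yaml
  yaml=$(cat)
  printf '%s\n' "$yaml" | grep -q 'hostNetwork: true' &&
    printf '%s\n' "$yaml" | grep -q 'dnsPolicy: ClusterFirstWithHostNet'
}

# Example usage against a live cluster:
# kubectl get pods -n gpu-operator --no-headers | pods_not_ready
# kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o yaml \
#   | has_host_network && echo "hostNetwork settings applied"
```

An empty result from pods_not_ready means every Pod in the namespace is in one of the two accepted states.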

If Data Is Not Displayed After Installation

If data is not displayed in AI Insight after installation or Agent Missing status persists, check the following items.

Environment | Check Item | Description
Kubernetes Engine | Pod status | Check GPU Operator-related Pod status with kubectl get pods -n gpu-operator
Kubernetes Engine | DCGM Exporter status | Check whether the nvidia-dcgm-exporter DaemonSet is running properly
Kubernetes Engine | hostNetwork settings | Check whether hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet are applied
Kubernetes Engine | GPU resource recognition | Check whether the nvidia.com/gpu resource is displayed in GPU node details
Virtual Machine | DCGM service status | Check whether the nvidia-dcgm service is active
Virtual Machine | DCGM Exporter status | Check whether the dcgm-exporter service is active
Virtual Machine | Metric exposure | Check whether DCGM metrics are displayed at localhost:9400/metrics
Virtual Machine | Monitoring agent status | Check the kic_monitor_agent service and logs
Common | Time range | Change the time range in AI Insight and query again
Common | Refresh | Run manual refresh or configure auto refresh, then query again
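
The Virtual Machine checks in the table can be run together from a shell on the VM. The following is a minimal sketch, assuming that the three services are systemd units, that curl is installed, and that exposed metric names start with DCGM_; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Virtual Machine checks from the table above (illustrative sketch).
# Assumes the services are systemd units and that curl is installed.

# Count DCGM metric samples in Prometheus text read from stdin.
# Assumes metric names start with DCGM_; comment lines start with # and are skipped.
count_dcgm_samples() {
  grep -c '^DCGM_' || true
}

# Print the state of each service and the number of exposed DCGM samples.
check_vm_metrics() {
  local svc
  for svc in nvidia-dcgm dcgm-exporter kic_monitor_agent; do
    printf '%s: %s\n' "$svc" "$(systemctl is-active "$svc" 2>/dev/null || true)"
  done
  printf 'DCGM samples exposed: %s\n' \
    "$(curl -s --max-time 5 localhost:9400/metrics | count_dcgm_samples)"
}

# Example (run on the VM):
# check_vm_metrics
# Recent agent logs, if more detail is needed:
# journalctl -u kic_monitor_agent --since "10 minutes ago"
```

A sample count of 0 suggests that DCGM Exporter is not exposing metrics on port 9400, even if its service reports as active.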

For more information, see Troubleshooting.