View GPU Resource Details

GPU Explorer lets you query the status and metrics that AI Insight collects for GPU resources, broken down by cluster, node, or GPU. After identifying abnormal resources in Overview, use GPU Explorer to analyze their detailed metrics and events.

Go to the GPU Explorer Menu

Go to the KakaoCloud Console.
Go to AI Service > AI Insight.
Expand GPU Explorer in the left menu, then click Cluster, Node, or GPU.

Common Query Features

Cluster, Node, and GPU detail pages provide the following common features.

Feature	Description
Breadcrumb	Displays the current resource path
Time range	Selects the query period: 1 hour, 3 hours, 12 hours, 1 day, or 7 days
Auto refresh	Automatically refreshes page data at the selected interval
Manual refresh	Refreshes data immediately by clicking the refresh icon
Last updated	Displays the last time data was refreshed
Resource selection	Selects the resource to query from the dropdown at the top right

View Cluster Details

On the Cluster page, you can check the status and metrics of GPU resources in a specific cluster.

Click GPU Explorer > Cluster in the left menu.
Select a cluster from the cluster selection dropdown at the top right.
Check the top summary cards for GPU count, average/maximum GPU load, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
Check the number of Active, Warning, Critical, Pending, Idle, and Agent Missing GPUs in the status cards.
Select an abnormal GPU or MIG instance in GPU Map.
Check detailed data in GPU Metrics, GPU Outlier Detection, and GPU Correlation.

Cluster Page Components

Area	Description
Summary cards	Summarizes key GPU metrics in the cluster
Status cards	Displays the number of GPUs by status
GPU Map	Visually displays GPUs and MIG instances in the cluster
GPU Metrics	Displays trends for GPU utilization, memory usage, temperature, idle ratio, and ECC Error
GPU Outlier Detection	Displays average/maximum metrics and peak times by GPU
GPU Correlation	Displays relationships between GPU utilization and temperature, and between GPU utilization and idle time

Check Abnormal GPUs in Cluster

Check the number of Warning or Critical GPUs in the status cards.
Select an abnormal GPU in GPU Map.
Check temperature, ECC Error, and Throttling trends in GPU Metrics.
In GPU Outlier Detection, check whether any GPU has an average or maximum value that deviates from those of the other GPUs.
In GPU Correlation, check whether any GPU has a high temperature relative to its utilization.
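
As a rough illustration of the last two checks, the sketch below flags GPUs whose average temperature stands out from the rest of the cluster, and GPUs that run hot relative to their utilization. The function names, the 10°C margin, and the 30% and 80°C cut-offs are assumptions made for this example; they are not AI Insight's detection logic or thresholds.

```python
from statistics import median

def temperature_outliers(avg_temp_by_gpu, margin_c=10.0):
    """Return GPU IDs whose average temperature differs from the
    cluster median by more than margin_c (hypothetical margin)."""
    if not avg_temp_by_gpu:
        return []
    mid = median(avg_temp_by_gpu.values())
    return sorted(g for g, t in avg_temp_by_gpu.items() if abs(t - mid) > margin_c)

def hot_for_utilization(samples, max_util_pct=30.0, min_temp_c=80.0):
    """samples maps gpu_id -> (avg_util_pct, avg_temp_c).
    Return GPU IDs with low utilization but high temperature
    (both cut-offs are hypothetical)."""
    return sorted(
        g for g, (util, temp) in samples.items()
        if util < max_util_pct and temp >= min_temp_c
    )

# Illustrative values only, not real measurements.
cluster = {
    "gpu-0": (85.0, 72.0),
    "gpu-1": (80.0, 70.0),
    "gpu-2": (15.0, 84.0),
    "gpu-3": (90.0, 74.0),
}
print(temperature_outliers({g: temp for g, (_, temp) in cluster.items()}))
print(hot_for_utilization(cluster))
```

In this sample, `gpu-2` is reported by both checks: its temperature is far from the cluster median, and it is hot despite low utilization.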

View Node Details

On the Node page, you can check the system resource status of a specific node and the status of GPUs connected to that node.

Click GPU Explorer > Node in the left menu.
Select a node from the node selection dropdown at the top right.
Check the top summary cards for total GPU count, average/maximum GPU load, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
Check the number of GPUs by status for the node in the status cards.
Check CPU, memory, disk, and network metrics in Node Status.
If data is not displayed, check whether the node is in Agent Missing status, or review the metric collection configuration.

Node Status Metrics

Metric	Description
Total CPU Usage	Trend of overall node CPU usage
CPU Usage by Core	Trend of CPU usage by core
Total Memory Usage	Trend of overall node memory usage
Disk Read Bytes	Trend of disk read throughput
Disk Write Bytes	Trend of disk write throughput
Network Rx Bytes	Trend of network receive throughput

Check Root Cause on Node

When a GPU is in Warning or Critical status, check node status together with GPU metrics.

Check Item	Purpose
CPU usage	Check whether overall node load affects GPU workloads
Memory usage	Check whether node memory is insufficient
Disk read/write	Check for bottlenecks in data loading or storage
Network receive	Check for network bottlenecks such as training data transfer or distributed training communication
Agent Missing	Check whether the metric collection component for the node or GPU is working properly

View GPU Details

On the GPU page, you can check detailed status and metrics for an individual GPU or MIG instance.

Click GPU Explorer > GPU in the left menu.
Select a GPU from the GPU selection dropdown at the top right.
Check the top summary cards for GPU status, average/maximum GPU utilization, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
Check trends for utilization, memory usage, temperature, idle ratio, throttling, and ECC Error in GPU Metrics.
If MIG is configured, check the status of each MIG instance in the legend or list.

GPU Metrics

Metric	Description	How to Interpret
GPU Utilization Trend	GPU utilization trend	Utilization of 10% or higher is one of the conditions for Active status.
GPU Memory Usage Trend	GPU memory usage trend	Memory usage of 20% or higher is one of the conditions for Active status.
GPU Temperature Trend	GPU temperature trend	A temperature of 85°C or higher sustained for 3 minutes may result in Warning; 90°C or higher sustained for 2 minutes may result in Critical.
GPU Idle Trend	GPU idle ratio trend	Used to identify idle GPUs.
GPU Throttling	Accumulated time or occurrence of clock throttling due to limit conditions	Minor Thermal or Power limits may result in Warning; severe Thermal limits or sustained Reliability limits may result in Critical.
GPU ECC Error Count	Number of ECC Errors in the last 24 hours	A single-bit error (SBE) may result in Warning, and a double-bit error (DBE) may result in Critical.

Check Status Details

Check Idle or Active

Idle or Active status is determined by GPU compute utilization and GPU memory usage.

Status	Criteria
Idle	GPU compute utilization is below 10% and GPU memory usage is below 20%
Active	GPU compute utilization is 10% or higher, or GPU memory usage is 20% or higher

If MIG is enabled, Idle or Active status may be displayed differently for each MIG instance.
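
The criteria in the table above can be expressed as a small function. This is a minimal sketch of the documented thresholds only; the function name `idle_or_active` is hypothetical and is not part of AI Insight.

```python
def idle_or_active(compute_util_pct: float, memory_usage_pct: float) -> str:
    """Apply the documented criteria: Idle only when compute utilization
    is below 10% AND memory usage is below 20%; otherwise Active."""
    if compute_util_pct < 10.0 and memory_usage_pct < 20.0:
        return "Idle"
    return "Active"

print(idle_or_active(5.0, 10.0))   # both below their thresholds
print(idle_or_active(5.0, 25.0))   # memory usage at or above 20%
print(idle_or_active(10.0, 0.0))   # compute utilization at or above 10%
```

Note that a GPU holding a large amount of memory counts as Active even when its compute utilization is close to zero.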

Check Warning

When Warning status is displayed, check the following metrics first.

Metric to Check	Warning Condition
GPU Temperature Trend	GPU temperature of 85°C or higher persists for 3 minutes
GPU ECC Error Count	SBE ECC Error occurs
GPU Throttling	Thermal or Power Violation time increases by 30 seconds or more within the last 5 minutes

Check Critical

When Critical status is displayed, check the following metrics first.

Metric to Check	Critical Condition
GPU Temperature Trend	GPU temperature of 90°C or higher persists for 2 minutes
GPU ECC Error Count	DBE ECC Error occurs
GPU Throttling	Thermal Violation time increases by 180 seconds or more within the last 5 minutes, or Reliability Violation persists

Caution

Critical status requires immediate inspection or action. Check GPU temperature, ECC Error, and Throttling metrics first, and use XID Event Code as reference information when checking workload or node status.
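
To make the Warning and Critical conditions concrete, the following sketch evaluates them against pre-aggregated inputs. It is an illustration under stated assumptions, not AI Insight's implementation: the `GpuHealthInputs` fields and function names are hypothetical, Critical is assumed to take precedence over Warning, and "Reliability Violation persists" is reduced to a boolean supplied by the caller because the source does not define its duration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class GpuHealthInputs:
    # (timestamp in seconds, temperature in °C), sorted by timestamp
    temp_samples: List[Tuple[float, float]] = field(default_factory=list)
    sbe_count: int = 0                         # single-bit ECC errors
    dbe_count: int = 0                         # double-bit ECC errors
    thermal_violation_increase_s: float = 0.0  # growth over the last 5 minutes
    power_violation_increase_s: float = 0.0    # growth over the last 5 minutes
    reliability_violation_persists: bool = False

def sustained_at_or_above(samples, threshold_c, duration_s):
    """True if consecutive samples stay at or above threshold_c for at
    least duration_s. Gaps between samples are not checked (assumption)."""
    run_start = None
    for ts, temp in samples:
        if temp >= threshold_c:
            if run_start is None:
                run_start = ts
            if ts - run_start >= duration_s:
                return True
        else:
            run_start = None
    return False

def health_status(m: GpuHealthInputs) -> Optional[str]:
    """Return "Critical", "Warning", or None when neither applies."""
    if (sustained_at_or_above(m.temp_samples, 90.0, 120)
            or m.dbe_count > 0
            or m.thermal_violation_increase_s >= 180
            or m.reliability_violation_persists):
        return "Critical"
    if (sustained_at_or_above(m.temp_samples, 85.0, 180)
            or m.sbe_count > 0
            or m.thermal_violation_increase_s >= 30
            or m.power_violation_increase_s >= 30):
        return "Warning"
    return None

# 86°C sampled every 30 seconds for 4 minutes (illustrative values).
hot = GpuHealthInputs(temp_samples=[(t, 86.0) for t in range(0, 241, 30)])
print(health_status(hot))
print(health_status(GpuHealthInputs(dbe_count=1)))
```

The first example meets only the 85°C for 3 minutes condition, so it evaluates to Warning; the second has a DBE and evaluates to Critical.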

Interpret Event Metrics

Event	How to Interpret
ECC Error	Total SBE and DBE ECC Errors that occurred in the last 24 hours. XID, Throttling, and Overheat are not included.
XID Event Code	The most recently detected GPU error event code. It is informational only and is not reflected in Warning or Critical status determination. It does not indicate how many times the event occurred.
Throttle Event	Not a simple event count. It may reflect the accumulated time or the occurrence of GPU clock reduction caused by limit conditions.
Overheat	Determined by GPU temperature threshold exceedance or Thermal Violation. It may differ from the actual hardware event count.

Check Status in MIG Environments

In MIG environments, Idle or Active may be displayed differently for each MIG instance, while Warning or Critical may be applied at the physical GPU level.

Category	Description
Idle / Active	May be displayed differently depending on GPU utilization and memory usage of each MIG instance.
Warning / Critical	Temperature, ECC Error, and Throttling are collected at the physical GPU level, so the same status may be applied to MIG instances on the same physical GPU. XID Event Code may also be displayed at the physical GPU level, but it is not reflected in status determination.
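
The split between instance-level and physical-level status can be sketched as follows. This is a simplified model with hypothetical names; in particular, it assumes that a Warning or Critical status on the physical GPU replaces the Idle or Active status of every MIG instance on it, whereas the source only says the same status may be applied.

```python
from typing import Dict, Optional, Tuple

def mig_instance_statuses(
    physical_health: Optional[str],
    instances: Dict[str, Tuple[float, float]],
) -> Dict[str, str]:
    """physical_health: None, "Warning", or "Critical" for the physical GPU.
    instances maps mig_id -> (compute_util_pct, memory_usage_pct)."""
    statuses = {}
    for mig_id, (util, mem) in instances.items():
        if physical_health is not None:
            # Temperature, ECC, and throttling are physical-level signals,
            # so every instance shares the same status (assumption).
            statuses[mig_id] = physical_health
        elif util < 10.0 and mem < 20.0:
            statuses[mig_id] = "Idle"
        else:
            statuses[mig_id] = "Active"
    return statuses

# Hypothetical MIG instance IDs and values.
instances = {"mig-0": (45.0, 60.0), "mig-1": (2.0, 5.0)}
print(mig_instance_statuses(None, instances))
print(mig_instance_statuses("Warning", instances))
```

With a healthy physical GPU, the two instances are reported as Active and Idle respectively; with a physical-level Warning, both are reported as Warning.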

Abnormal Status Check Flow

When checking abnormal status in AI Insight, use the following flow.

In Overview, check whether any GPU is in Warning, Critical, or Agent Missing status.
Select an abnormal resource in GPU Map and check the details panel.
In GPU Explorer > Cluster, check overall GPU metrics and Outlier Detection for the cluster.
In GPU Explorer > Node, check CPU, memory, disk, and network status for the node that contains the GPU.
In GPU Explorer > GPU, check temperature, ECC Error, XID Event Code, and Throttle Event for the individual GPU.

If Data Is Not Displayed

If "No data to display" appears on the GPU Explorer page, check the following:

Category	Check Item	Description
Common	Time range	Change the query time range, then query again
Common	Refresh	Run manual refresh
Common	Agent Missing status	Check whether the target resource is in Agent Missing status
Common	Metric Exporter or monitoring agent	Check the installation status of Metric Exporter or the monitoring agent
Kubernetes Engine	GPU Operator and DCGM Exporter	Check the status of GPU Operator and `nvidia-dcgm-exporter` Pod
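
For the Kubernetes Engine check, one possible approach is to list Pods with `kubectl` and filter for the exporter. The sketch below assumes `kubectl` is installed and configured for the target cluster, and that exporter Pod names contain `nvidia-dcgm-exporter`. The function names, the `gpu-operator` namespace, and the Pod names in the sample data are hypothetical.

```python
import json
import subprocess

def fetch_all_pods():
    """Run kubectl and return the parsed Pod list (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def exporter_pods_not_running(pod_list, name_fragment="nvidia-dcgm-exporter"):
    """Return (namespace, name, phase) for matching Pods whose phase is not
    Running. This checks the Pod phase only, not container readiness."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        phase = pod.get("status", {}).get("phase", "Unknown")
        if name_fragment in meta.get("name", "") and phase != "Running":
            problems.append((meta.get("namespace", ""), meta.get("name", ""), phase))
    return problems

# Sample data shaped like `kubectl get pods -o json` output (made up).
sample = {"items": [
    {"metadata": {"namespace": "gpu-operator", "name": "nvidia-dcgm-exporter-abcde"},
     "status": {"phase": "Running"}},
    {"metadata": {"namespace": "gpu-operator", "name": "nvidia-dcgm-exporter-fghij"},
     "status": {"phase": "Pending"}},
    {"metadata": {"namespace": "default", "name": "trainer-0"},
     "status": {"phase": "Failed"}},
]}
print(exporter_pods_not_running(sample))
```

Against a live cluster, call `exporter_pods_not_running(fetch_all_pods())`. An empty result only means that no matching Pod is in a non-Running phase; it does not confirm that the exporter is deployed.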

For detailed checks by cause, see AI Insight Troubleshooting.
