Query GPU Resource Details
GPU Explorer lets you query the status and metrics of the GPU resources that AI Insight collects, at the cluster, node, or GPU level. After identifying abnormal resources in Overview, use GPU Explorer to analyze their detailed metrics and events.
Go to the GPU Explorer Menu
- Go to the KakaoCloud Console.
- Go to AI Service > AI Insight.
- Expand GPU Explorer in the left menu, then click Cluster, Node, or GPU.
Common Query Features
Cluster, Node, and GPU detail pages provide the following common features.
| Feature | Description |
|---|---|
| Breadcrumb | Displays the current resource path |
| Time range | Select from 1 hour, 3 hours, 12 hours, 1 day, or 7 days |
| Auto refresh | Automatically refreshes page data at the selected interval |
| Manual refresh | Refreshes data immediately by clicking the refresh icon |
| Last updated | Displays the last time data was refreshed |
| Resource selection | Selects the resource to query from the dropdown at the top right |
View Cluster Details
On the Cluster page, you can check the status and metrics of GPU resources in a specific cluster.
- Click GPU Explorer > Cluster in the left menu.
- Select a cluster from the cluster selection dropdown at the top right.
- Check the top summary cards for GPU count, average/maximum GPU load, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
- Check the number of Active, Warning, Critical, Pending, Idle, and Agent Missing GPUs in the status cards.
- Select an abnormal GPU or MIG instance in GPU Map.
- Check detailed data in GPU Metrics, GPU Outlier Detection, and GPU Correlation.
Cluster Page Components
| Area | Description |
|---|---|
| Summary cards | Summarizes key GPU metrics in the cluster |
| Status cards | Displays the number of GPUs by status |
| GPU Map | Visually displays GPUs and MIG instances in the cluster |
| GPU Metrics | Displays trends for GPU utilization, memory usage, temperature, idle ratio, and ECC Error |
| GPU Outlier Detection | Displays average/maximum metrics and peak times by GPU |
| GPU Correlation | Displays relationships between GPU utilization and temperature, and between GPU utilization and idle time |
Check Abnormal GPUs in Cluster
- Check the number of Warning or Critical GPUs in the status cards.
- Select an abnormal GPU in GPU Map.
- Check temperature, ECC Error, and Throttling trends in GPU Metrics.
- In GPU Outlier Detection, check whether a resource has an average or maximum value different from other GPUs.
- In GPU Correlation, check whether any GPU has a high temperature relative to its utilization.
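The outlier check in the steps above can also be reproduced outside the console on exported per-GPU averages. The following is a minimal sketch, not the method GPU Outlier Detection itself uses (which is not described here). The function name `find_outliers`, the GPU names, and the `tolerance` parameter are hypothetical, and comparing each value against the median is an assumption chosen for illustration.

```python
from statistics import median
from typing import Dict, List


def find_outliers(avg_by_gpu: Dict[str, float], tolerance: float) -> List[str]:
    """Return GPU names whose average differs from the group median by more than tolerance.

    Assumption: the median of all GPUs is used as the reference value.
    """
    center = median(avg_by_gpu.values())
    return sorted(
        name for name, value in avg_by_gpu.items()
        if abs(value - center) > tolerance
    )


# Hypothetical average temperatures (in °C) for four GPUs in one cluster.
averages = {"gpu-0": 60.0, "gpu-1": 62.0, "gpu-2": 61.0, "gpu-3": 85.0}
print(find_outliers(averages, tolerance=10.0))
```

A GPU flagged this way is only a candidate for closer inspection; confirm it against the temperature, ECC Error, and Throttling trends in GPU Metrics.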
View Node Details
On the Node page, you can check the system resource status of a specific node and the status of GPUs connected to that node.
- Click GPU Explorer > Node in the left menu.
- Select a node from the node selection dropdown at the top right.
- Check the top summary cards for total GPU count, average/maximum GPU load, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
- Check the number of GPUs by status for the node in the status cards.
- Check CPU, memory, disk, and network metrics in Node Status.
- If data is not displayed, check Agent Missing status or the metric collection configuration.
Node Status Metrics
| Metric | Description |
|---|---|
| Total CPU Usage | Trend of overall node CPU usage |
| CPU Usage by Core | Trend of CPU usage by core |
| Total Memory Usage | Trend of overall node memory usage |
| Disk Read Bytes | Trend of disk read throughput |
| Disk Write Bytes | Trend of disk write throughput |
| Network Rx Bytes | Trend of network receive throughput |
Check Root Cause on Node
When a GPU is in Warning or Critical status, check node status together with GPU metrics.
| Check Item | Purpose |
|---|---|
| CPU usage | Check whether overall node load affects GPU workloads |
| Memory usage | Check whether node memory is insufficient |
| Disk read/write | Check for bottlenecks in data loading or storage |
| Network receive | Check for network bottlenecks such as training data transfer or distributed training communication |
| Agent Missing | Check whether the metric collection component for the node or GPU is working properly |
View GPU Details
On the GPU page, you can check detailed status and metrics for an individual GPU or MIG instance.
- Click GPU Explorer > GPU in the left menu.
- Select a GPU from the GPU selection dropdown at the top right.
- Check the top summary cards for GPU status, average/maximum GPU utilization, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
- Check trends for utilization, memory usage, temperature, idle ratio, throttling, and ECC Error in GPU Metrics.
- If MIG is configured, check the status of each MIG instance in the legend or list.
GPU Metrics
| Metric | Description | How to Interpret |
|---|---|---|
| GPU Utilization Trend | GPU utilization trend | A value of 10% or higher is one of the conditions for Active status. |
| GPU Memory Usage Trend | GPU memory usage trend | A value of 20% or higher is one of the conditions for Active status. |
| GPU Temperature Trend | GPU temperature trend | 85°C or higher sustained for 3 minutes is a Warning condition, and 90°C or higher sustained for 2 minutes is a Critical condition. |
| GPU Idle Trend | GPU idle ratio trend | Used to identify idle GPUs. |
| GPU Throttling | Accumulated time or occurrence of clock throttling due to limit conditions | Minor Thermal/Power limits may cause Warning, and severe Thermal limits or sustained Reliability limits may cause Critical. |
| GPU ECC Error Count | Number of ECC Errors in the last 24 hours | An SBE (single-bit error) is a Warning condition, and a DBE (double-bit error) is a Critical condition. |
Check Status Details
Check Idle or Active
Idle or Active status is determined by GPU compute utilization and GPU memory usage.
| Status | Criteria |
|---|---|
| Idle | GPU compute utilization is below 10% and GPU memory usage is below 20% |
| Active | GPU compute utilization is 10% or higher, or GPU memory usage is 20% or higher |
If MIG is enabled, Idle or Active status may be displayed differently for each MIG instance.
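The Idle and Active criteria in the table above can be expressed as a single check. This is a minimal sketch of the documented thresholds. The function and constant names are hypothetical, and it assumes both inputs are percentages in the range 0 to 100.

```python
IDLE_UTIL_THRESHOLD = 10.0  # percent, from the criteria table
IDLE_MEM_THRESHOLD = 20.0   # percent, from the criteria table


def classify_activity(gpu_util_pct: float, gpu_mem_pct: float) -> str:
    """Return "Idle" or "Active" for one GPU or MIG instance."""
    # Idle requires BOTH values to be below their thresholds.
    if gpu_util_pct < IDLE_UTIL_THRESHOLD and gpu_mem_pct < IDLE_MEM_THRESHOLD:
        return "Idle"
    # Otherwise at least one value reached its threshold, so the GPU is Active.
    return "Active"


print(classify_activity(5.0, 10.0))   # both below their thresholds
print(classify_activity(2.0, 35.0))   # memory usage alone reaches its threshold
```

Because Active only needs one of the two conditions, a GPU with loaded model weights but no running computation is still classified as Active.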
Check Warning
When Warning status is displayed, check the following metrics first.
| Metric to Check | Warning Condition |
|---|---|
| GPU Temperature Trend | GPU temperature of 85°C or higher persists for 3 minutes |
| GPU ECC Error Count | SBE ECC Error occurs |
| GPU Throttling | Thermal or Power limit occurs |
Check Critical
When Critical status is displayed, check the following metrics first.
| Metric to Check | Critical Condition |
|---|---|
| GPU Temperature Trend | GPU temperature of 90°C or higher persists for 2 minutes |
| GPU ECC Error Count | DBE ECC Error occurs |
| GPU Throttling | Severe Thermal Throttling or sustained Reliability Violation occurs |
Critical status requires immediate inspection or action. Check GPU temperature, ECC Error, and Throttling metrics first, and use XID Event Code as reference information when checking workload or node status.
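The temperature and ECC conditions in the Warning and Critical tables can be sketched as follows. This is an illustration under stated assumptions, not the logic AI Insight runs: the function names and the sample format (timestamp in seconds, temperature in °C) are hypothetical, a threshold counts as sustained only when every consecutive sample in the window meets it, and the Throttling conditions are left out because their exact thresholds are not given here.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, temperature in °C)


def sustained_at_or_above(samples: List[Sample], threshold: float, duration_s: float) -> bool:
    """Return True if temperature stays >= threshold for duration_s seconds in a row."""
    run_start: Optional[float] = None
    for ts, temp in sorted(samples):
        if temp >= threshold:
            if run_start is None:
                run_start = ts  # first sample of a new hot streak
            if ts - run_start >= duration_s:
                return True
        else:
            run_start = None  # streak broken, start over
    return False


def health_status(samples: List[Sample], sbe_count: int = 0, dbe_count: int = 0) -> Optional[str]:
    """Return "Critical", "Warning", or None using temperature and ECC conditions only."""
    # Critical: 90°C or higher for 2 minutes, or any DBE ECC Error.
    if dbe_count > 0 or sustained_at_or_above(samples, 90.0, 120):
        return "Critical"
    # Warning: 85°C or higher for 3 minutes, or any SBE ECC Error.
    if sbe_count > 0 or sustained_at_or_above(samples, 85.0, 180):
        return "Warning"
    return None


# Hypothetical samples taken every 30 seconds over 3 minutes at 86°C.
samples = [(float(t), 86.0) for t in range(0, 181, 30)]
print(health_status(samples))
```

Critical is evaluated before Warning so that a GPU meeting both sets of conditions reports the more severe status.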
Interpret Event Metrics
| Event | How to Interpret |
|---|---|
| ECC Error | Total SBE and DBE ECC Errors that occurred in the last 24 hours. XID, Throttling, and Overheat are not included. |
| XID Event Code | Last detected GPU error event code. It is informational and is not reflected in Warning/Critical status determination. Do not use it as an occurrence count. |
| Throttle Event | Not a simple event count. It may be displayed based on accumulated time or occurrence of GPU clock reduction due to limit conditions. |
| Overheat | Determined by GPU temperature threshold exceedance or Thermal Violation. It may differ from the actual hardware event count. |
Check Status in MIG Environments
In MIG environments, Idle or Active may be displayed differently for each MIG instance, while Warning or Critical may be applied at the physical GPU level.
| Category | Description |
|---|---|
| Idle / Active | May be displayed differently depending on GPU utilization and memory usage of each MIG instance. |
| Warning / Critical | Temperature, ECC Error, and Throttling are collected at the physical GPU level, so the same status may be applied to MIG instances on the same physical GPU. XID Event Code may also be displayed at the physical GPU level, but it is not reflected in status determination. |
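The split between instance-level and physical-level status can be illustrated with a short sketch. The function name and the instance names are hypothetical, and the rule that a physical-level Warning or Critical takes precedence over each instance's Idle or Active status is an assumption made for illustration.

```python
from typing import Dict, Optional, Tuple


def mig_instance_statuses(
    physical_health: Optional[str],
    instance_metrics: Dict[str, Tuple[float, float]],
) -> Dict[str, str]:
    """Return a status per MIG instance.

    physical_health: "Warning", "Critical", or None for the physical GPU.
    instance_metrics: {instance name: (utilization %, memory usage %)}.
    """
    statuses = {}
    for name, (util_pct, mem_pct) in instance_metrics.items():
        if physical_health is not None:
            # Assumption: a physical-level status applies to every instance on the GPU.
            statuses[name] = physical_health
        elif util_pct < 10.0 and mem_pct < 20.0:
            statuses[name] = "Idle"
        else:
            statuses[name] = "Active"
    return statuses


# Hypothetical MIG instances on one physical GPU.
metrics = {"mig-0": (50.0, 5.0), "mig-1": (1.0, 2.0)}
print(mig_instance_statuses(None, metrics))
print(mig_instance_statuses("Warning", metrics))
```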
Abnormal Status Check Flow
When checking abnormal status in AI Insight, use the following flow.
- In Overview, check whether any GPU is in Warning, Critical, or Agent Missing status.
- Select an abnormal resource in GPU Map and check the details panel.
- In GPU Explorer > Cluster, check overall GPU metrics and Outlier Detection for the cluster.
- In GPU Explorer > Node, check CPU, memory, disk, and network status for the node that contains the GPU.
- In GPU Explorer > GPU, check temperature, ECC Error, XID Event Code, and Throttle Event for the individual GPU.
If Data Is Not Displayed
If "No data to display" appears on the GPU Explorer page, check the following:
- Change the query time range.
- Run manual refresh.
- Check whether the target resource is in Agent Missing status.
- Check the installation status of Metric Exporter or the monitoring agent.
- In Kubernetes Engine environments, check the status of GPU Operator and the nvidia-dcgm-exporter Pod.
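For the Kubernetes check above, one option is to inspect the Pod list as JSON. The sketch below is a hedged example: the function name `unhealthy_pods` is hypothetical, it assumes the exporter Pods have names starting with `nvidia-dcgm-exporter`, and it assumes the input is the output of `kubectl get pods --all-namespaces -o json`. The namespace in the sample data is also an assumption.

```python
import json
from typing import List, Tuple


def unhealthy_pods(pod_list_json: str, name_prefix: str = "nvidia-dcgm-exporter") -> List[Tuple[str, str]]:
    """Return (name, phase) for matching Pods that are not Running with all containers ready."""
    pods = json.loads(pod_list_json).get("items", [])
    problems = []
    for pod in pods:
        name = pod["metadata"]["name"]
        if not name.startswith(name_prefix):
            continue  # not an exporter Pod
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            problems.append((name, phase))
    return problems


# Hypothetical Pod list with one healthy and one pending exporter Pod.
sample = json.dumps({"items": [
    {"metadata": {"name": "nvidia-dcgm-exporter-abcde", "namespace": "gpu-operator"},
     "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
    {"metadata": {"name": "nvidia-dcgm-exporter-fghij", "namespace": "gpu-operator"},
     "status": {"phase": "Pending"}},
]})
print(unhealthy_pods(sample))
```

An empty result means every matching Pod is running and ready. If no Pods match the prefix at all, the result is also empty, so confirm separately that the exporter is deployed.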