Query GPU Resource Details

In GPU Explorer, you can query the status and metrics of the GPU resources that AI Insight collects, by cluster, node, or GPU. After identifying abnormal resources in Overview, use GPU Explorer to analyze their detailed metrics and events.

Go to the GPU Explorer Menu

  1. Go to the KakaoCloud Console.
  2. Go to AI Service > AI Insight.
  3. Expand GPU Explorer in the left menu, then click Cluster, Node, or GPU.

Common Query Features

Cluster, Node, and GPU detail pages provide the following common features.

Feature | Description
Breadcrumb | Displays the current resource path
Time range | Select from 1 hour, 3 hours, 12 hours, 1 day, or 7 days
Auto refresh | Automatically refreshes page data at the selected interval
Manual refresh | Refreshes data immediately by clicking the refresh icon
Last updated | Displays the last time data was refreshed
Resource selection | Selects the resource to query from the dropdown at the top right

View Cluster Details

On the Cluster page, you can check the status and metrics of GPU resources in a specific cluster.

  1. Click GPU Explorer > Cluster in the left menu.
  2. Select a cluster from the cluster selection dropdown at the top right.
  3. Check the top summary cards for GPU count, average/maximum GPU load, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
  4. Check the number of Active, Warning, Critical, Pending, Idle, and Agent Missing GPUs in the status cards.
  5. Select an abnormal GPU or MIG instance in GPU Map.
  6. Check detailed data in GPU Metrics, GPU Outlier Detection, and GPU Correlation.

Cluster Page Components

Area | Description
Summary cards | Summarizes key GPU metrics in the cluster
Status cards | Displays the number of GPUs by status
GPU Map | Visually displays GPUs and MIG instances in the cluster
GPU Metrics | Displays trends for GPU utilization, memory usage, temperature, idle ratio, and ECC Error
GPU Outlier Detection | Displays average/maximum metrics and peak times by GPU
GPU Correlation | Displays relationships between GPU utilization and temperature, and between GPU utilization and idle time

Check Abnormal GPUs in Cluster

  1. Check the number of Warning or Critical GPUs in the status cards.
  2. Select an abnormal GPU in GPU Map.
  3. Check temperature, ECC Error, and Throttling trends in GPU Metrics.
  4. In GPU Outlier Detection, check whether any GPU has an average or maximum value that differs noticeably from the other GPUs.
  5. In GPU Correlation, check whether any GPU has a high temperature relative to its utilization.
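
The comparison in step 4 can also be reproduced offline on exported values. The following is a minimal sketch, not the method AI Insight uses: it assumes you have one average value per GPU and flags values that sit far from the group median using a modified z-score. The function name, the GPU IDs, and the 3.5 cutoff are all illustrative assumptions.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the GPU IDs whose value deviates strongly from the group median.

    values: dict mapping a GPU ID to one metric value (for example, average temperature).
    threshold: cutoff on the modified z-score. 3.5 is a common rule of thumb,
    not a value taken from AI Insight.
    """
    if len(values) < 3:
        return []  # too few GPUs to compare meaningfully
    med = median(values.values())
    # Median absolute deviation: a spread estimate that one extreme GPU cannot distort.
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # More than half of the GPUs report the same value, so report any that differ.
        return sorted(g for g, v in values.items() if v != med)
    return sorted(
        g for g, v in values.items() if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical average temperatures (°C) for four GPUs in one cluster.
avg_temp = {"gpu-0": 61.0, "gpu-1": 63.0, "gpu-2": 62.0, "gpu-3": 88.0}
print(find_outliers(avg_temp))  # prints ['gpu-3']
```

A median-based measure is used here because a single overheating GPU would inflate a mean and standard deviation and could hide itself.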

View Node Details

On the Node page, you can check the system resource status of a specific node and the status of GPUs connected to that node.

  1. Click GPU Explorer > Node in the left menu.
  2. Select a node from the node selection dropdown at the top right.
  3. Check the top summary cards for total GPU count, average/maximum GPU load, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
  4. Check the number of GPUs by status for the node in the status cards.
  5. Check CPU, memory, disk, and network metrics in Node Status.
  6. If data is not displayed, check whether the node is in Agent Missing status, or review the metric collection configuration.

Node Status Metrics

Metric | Description
Total CPU Usage | Trend of overall node CPU usage
CPU Usage by Core | Trend of CPU usage by core
Total Memory Usage | Trend of overall node memory usage
Disk Read Bytes | Trend of disk read throughput
Disk Write Bytes | Trend of disk write throughput
Network Rx Bytes | Trend of network receive throughput

Check Root Cause on Node

When a GPU is in Warning or Critical status, check node status together with GPU metrics.

Check Item | Purpose
CPU usage | Check whether overall node load affects GPU workloads
Memory usage | Check whether node memory is insufficient
Disk read/write | Check for bottlenecks in data loading or storage
Network receive | Check for network bottlenecks such as training data transfer or distributed training communication
Agent Missing | Check whether the metric collection component for the node or GPU is working properly

View GPU Details

On the GPU page, you can check detailed status and metrics for an individual GPU or MIG instance.

  1. Click GPU Explorer > GPU in the left menu.
  2. Select a GPU from the GPU selection dropdown at the top right.
  3. Check the top summary cards for GPU status, average/maximum GPU utilization, average/maximum GPU memory usage, average/maximum GPU temperature, average/maximum GPU idle ratio, and ECC Error count.
  4. Check trends for utilization, memory usage, temperature, idle ratio, throttling, and ECC Error in GPU Metrics.
  5. If MIG is configured, check the status of each MIG instance in the legend or list.

GPU Metrics

Metric | Description | How to Interpret
GPU Utilization Trend | GPU utilization trend | May be used as one of the Active conditions when 10% or higher.
GPU Memory Usage Trend | GPU memory usage trend | May be used as one of the Active conditions when 20% or higher.
GPU Temperature Trend | GPU temperature trend | May be included in Warning when 85°C or higher persists for 3 minutes, or Critical when 90°C or higher persists for 2 minutes.
GPU Idle Trend | GPU idle ratio trend | Used to identify idle GPUs.
GPU Throttling | Accumulated time or occurrence of clock throttling due to limit conditions | Minor Thermal/Power limits may cause Warning, and severe Thermal limits or sustained Reliability limits may cause Critical.
GPU ECC Error Count | Number of ECC Errors in the last 24 hours | SBE may be included in Warning, and DBE may be included in Critical.

Check Status Details

Check Idle or Active

Idle or Active status is determined based on GPU compute utilization and GPU memory usage.

Status | Criteria
Idle | GPU compute utilization is below 10% and GPU memory usage is below 20%
Active | GPU compute utilization is 10% or higher, or GPU memory usage is 20% or higher

If MIG is enabled, Idle or Active status may be displayed differently for each MIG instance.
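
The Idle and Active criteria can be expressed as a small rule. The following is a minimal sketch that assumes both inputs are percentages in the range 0 to 100; the function name is hypothetical and is not part of AI Insight.

```python
def classify_activity(compute_util_pct, memory_usage_pct):
    """Return "Active" or "Idle" for one GPU or MIG instance.

    Active: compute utilization is 10% or higher, OR memory usage is 20% or higher.
    Idle:   compute utilization is below 10% AND memory usage is below 20%.
    """
    if compute_util_pct >= 10 or memory_usage_pct >= 20:
        return "Active"
    return "Idle"


print(classify_activity(3.0, 25.0))  # prints Active (memory usage alone is enough)
print(classify_activity(3.0, 5.0))   # prints Idle
```

Note that a GPU holding a loaded model in memory can be classified as Active even when its compute utilization is near zero.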

Check Warning

When Warning status is displayed, check the following metrics first.

Metric to Check | Warning Condition
GPU Temperature Trend | GPU temperature of 85°C or higher persists for 3 minutes
GPU ECC Error Count | SBE ECC Error occurs
GPU Throttling | Thermal or Power limit occurs

Check Critical

When Critical status is displayed, check the following metrics first.

Metric to Check | Critical Condition
GPU Temperature Trend | GPU temperature of 90°C or higher persists for 2 minutes
GPU ECC Error Count | DBE ECC Error occurs
GPU Throttling | Severe Thermal Throttling or sustained Reliability Violation occurs

Caution

Critical status requires immediate inspection or action. Check GPU temperature, ECC Error, and Throttling metrics first, and use XID Event Code as reference information when checking workload or node status.
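
The temperature and ECC conditions for Warning and Critical can be checked against exported samples. The following is a minimal sketch with several assumptions: samples are (timestamp in seconds, temperature in °C) pairs sorted by time, "persists" means every sample in the window is at or above the threshold, and all function names are hypothetical. How AI Insight evaluates persistence internally is not documented here. Throttling is left out because the conditions above give no numeric thresholds for it.

```python
def sustained_at_or_above(samples, threshold_c, duration_s):
    """True if the temperature stayed >= threshold_c for at least duration_s seconds.

    samples: list of (timestamp_seconds, temperature_c), sorted by timestamp.
    """
    run_start = None  # timestamp where the current above-threshold run began
    for ts, temp in samples:
        if temp >= threshold_c:
            if run_start is None:
                run_start = ts
            if ts - run_start >= duration_s:
                return True
        else:
            run_start = None  # the run is broken; start over
    return False


def gpu_severity(samples, sbe_count=0, dbe_count=0):
    """Return "Critical", "Warning", or None based on temperature and ECC only."""
    if dbe_count > 0 or sustained_at_or_above(samples, 90, 120):
        return "Critical"
    if sbe_count > 0 or sustained_at_or_above(samples, 85, 180):
        return "Warning"
    return None


# Hypothetical samples every 30 seconds: 86°C held for 3 minutes.
warm = [(t, 86.0) for t in range(0, 181, 30)]
print(gpu_severity(warm))  # prints Warning
```

Critical is tested first so that a GPU meeting both sets of conditions is reported at the higher severity.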

Interpret Event Metrics

Event | How to Interpret
ECC Error | Total SBE and DBE ECC Errors that occurred in the last 24 hours. XID, Throttling, and Overheat are not included.
XID Event Code | Last detected GPU error event code. It is informational and is not reflected in Warning/Critical status determination. Do not use it as an occurrence count.
Throttle Event | Not a simple event count. It may be displayed based on accumulated time or occurrence of GPU clock reduction due to limit conditions.
Overheat | Determined by GPU temperature threshold exceedance or Thermal Violation. It may differ from the actual hardware event count.

Check Status in MIG Environments

In MIG environments, Idle or Active may be displayed differently for each MIG instance, while Warning or Critical may be applied at the physical GPU level.

Category | Description
Idle / Active | May be displayed differently depending on GPU utilization and memory usage of each MIG instance.
Warning / Critical | Temperature, ECC Error, and Throttling are collected at the physical GPU level, so the same status may be applied to MIG instances on the same physical GPU. XID Event Code may also be displayed at the physical GPU level, but it is not reflected in status determination.
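
The split between per-instance and per-GPU status can be illustrated as follows. This is a minimal sketch under stated assumptions: the instance names and the function are hypothetical, inputs are percentages, and the two values are returned side by side because this page does not state which one the console shows when both apply.

```python
def mig_instance_statuses(physical_health, instances):
    """Pair each MIG instance's own activity with the shared physical GPU health.

    physical_health: "Warning", "Critical", or None, evaluated once per physical GPU.
    instances: dict mapping an instance name to (compute_util_pct, memory_usage_pct).
    Returns a dict mapping each instance name to (activity, health).
    """
    result = {}
    for name, (util, mem) in instances.items():
        # Idle/Active uses the instance's own utilization and memory usage.
        activity = "Active" if util >= 10 or mem >= 20 else "Idle"
        # Warning/Critical is inherited unchanged from the physical GPU.
        result[name] = (activity, physical_health)
    return result


# Two hypothetical MIG instances on one physical GPU that is in Warning.
statuses = mig_instance_statuses("Warning", {"mig-0": (45.0, 60.0), "mig-1": (0.0, 2.0)})
print(statuses)
```

In this example the two instances differ in activity (Active and Idle) but share the same Warning health value, because temperature, ECC Error, and Throttling come from the physical GPU.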

Abnormal Status Check Flow

When checking abnormal status in AI Insight, use the following flow.

  1. In Overview, check whether any GPU is in Warning, Critical, or Agent Missing status.
  2. Select an abnormal resource in GPU Map and check the details panel.
  3. In GPU Explorer > Cluster, check overall GPU metrics and Outlier Detection for the cluster.
  4. In GPU Explorer > Node, check CPU, memory, disk, and network status for the node that contains the GPU.
  5. In GPU Explorer > GPU, check temperature, ECC Error, XID Event Code, and Throttle Event for the individual GPU.

If Data Is Not Displayed

If "No data to display" appears on the GPU Explorer page, check the following:

  • Change the query time range.
  • Run a manual refresh.
  • Check whether the target resource is in Agent Missing status.
  • Check the installation status of Metric Exporter or the monitoring agent.
  • In Kubernetes Engine environments, check the status of the GPU Operator and the nvidia-dcgm-exporter Pod.