AI Insight Overview

KakaoCloud AI Insight is a GPU monitoring service that lets you check the status and key metrics of GPU resources by cluster, node, and GPU. You can see the overall GPU status at a glance, quickly identify GPUs that show signs of abnormality, and analyze the causes on the detail pages.

AI Insight provides metrics required for GPU operations, such as GPU utilization, GPU memory usage, temperature, idle ratio, ECC Error, XID Event Code, and Throttling. You can monitor both Kubernetes Engine (KE)-based GPU nodes and Virtual Machine (VM)-based GPU nodes. In environments with MIG configured, you can also check status by MIG instance.

Note

To check metrics in AI Insight, Metric Exporter or the monitoring agent must be installed in the target environment. If it is not installed or is not working properly, the resource may appear in Agent Missing status and GPU metrics may not be collected.

Key Features

Feature	Description
Check overall GPU status	Check the total number of GPUs, clusters, and nodes, along with average GPU utilization, average memory usage, average temperature, and ECC Error count, on the Overview page.
Check GPU status	Check the number of GPUs in Active, Idle, Warning, Critical, Pending, and Agent Missing status.
GPU Map	Visualize resources by GPU, cluster, or node and explore resources by status.
GPU Explorer	Check detailed metrics and events by Cluster, Node, and GPU.
GPU event analysis	Identify the causes of abnormalities using ECC Error, XID Event Code, Throttling, and Overheat information.
Check MIG instances	Check utilization and status by instance for GPUs with MIG enabled.
Check node system metrics	Check CPU, memory, disk, and network metrics for VM or KE nodes.

GPU Status Criteria

AI Insight displays GPU status based on collected GPU metrics and node status. If multiple status conditions are met at the same time, the status with the highest severity is displayed first.

Status	Description
Active	Normal operating status where GPU compute or memory is in use.
Idle	Idle status where both GPU compute and memory usage are low.
Warning	Status where abnormal signs are detected, such as increased GPU temperature, SBE ECC Error, or minor Thermal/Power Throttling.
Critical	Status requiring immediate inspection, such as excessive GPU temperature increase, DBE ECC Error, severe Thermal Throttling, or Reliability Violation.
Pending	Status where the node containing the GPU is in an inactive lifecycle state, such as stopped, booting, rebooting, or resizing.
Agent Missing	Status where metrics cannot be collected because Metric Exporter or the monitoring agent is not installed or is not working properly.

Note

XID Event Code is displayed as an informational metric on the GPU details page. Currently, XID Event Code is not reflected in Warning or Critical status determination.
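
The precedence rule above can be illustrated with a minimal Python sketch. It is not AI Insight's implementation. The severity order among the four metric-based statuses and the event key names (`sbe_ecc_error`, `dbe_ecc_error`, `reliability_violation`, `xid_event`) are assumptions made for illustration. Pending and Agent Missing are left out because this page does not state where they rank.

```python
# Assumed severity order, lowest to highest. The exact ranking is not published
# on this page, so treat it as a placeholder.
ASSUMED_SEVERITY = ("Idle", "Active", "Warning", "Critical")

# Event-to-status mapping taken from the status table above.
# The dictionary keys are hypothetical names, not AI Insight identifiers.
EVENT_STATUS = {
    "sbe_ecc_error": "Warning",
    "dbe_ecc_error": "Critical",
    "reliability_violation": "Critical",
}


def displayed_status(base_status, events):
    """Return the highest-severity status among all conditions that are met."""
    candidates = [base_status]
    for event in events:
        # Events without a mapping (for example an XID Event Code) are
        # informational only and do not change the status.
        status = EVENT_STATUS.get(event)
        if status is not None:
            candidates.append(status)
    return max(candidates, key=ASSUMED_SEVERITY.index)


print(displayed_status("Active", ["sbe_ecc_error", "dbe_ecc_error"]))
print(displayed_status("Active", ["xid_event"]))
```

Under these assumptions, a GPU that reports both an SBE and a DBE ECC Error resolves to Critical, while a GPU that only reports an XID event keeps its base status.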

Console Menu

AI Insight consists of the following pages.

Menu	Description
Overview	Page for checking summary status, GPU counts by status, and GPU Map for all GPU resources.
GPU Explorer > Cluster	Page for checking status, metrics, outliers, and correlations for GPU resources in a specific cluster.
GPU Explorer > Node	Page for checking GPU status and node system metrics, such as CPU, memory, disk, and network, for a specific node.
GPU Explorer > GPU	Page for checking detailed utilization, memory usage, temperature, idle ratio, Throttling, and ECC Error trends for an individual GPU or MIG instance.

Workflow

You can use AI Insight with the following flow.

1. Install Metric Exporter or the monitoring agent in the target environment.
2. Check the overall GPU status and GPU counts by status on the AI Insight Overview page.
3. If any resource is in Warning, Critical, or Agent Missing status, select it from GPU Map or the list.
4. In GPU Explorer, check detailed metrics and events by Cluster, Node, and GPU.
5. Depending on the cause, check GPU temperature, ECC Error, Throttling, and node system resource status together. Use XID Event Code as reference information.

Prerequisites

The components required for metric collection differ depending on the target environment.

Target Environment	Required Configuration	Reference
Kubernetes Engine	Install Metric Exporter based on GPU Operator and DCGM Exporter	Metric Exporter Installation
Virtual Machine	Install DCGM, DCGM Exporter, and the monitoring agent, and configure Prometheus input	Metric Exporter Installation

Caution

If Metric Exporter or the monitoring agent is not installed or is not working properly, metrics such as GPU utilization, GPU memory usage, temperature, and ECC Error are not collected.
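
When a resource stays in Agent Missing status, one way to narrow down the cause is to inspect what the exporter actually exposes. DCGM Exporter publishes metrics in the Prometheus text format, and the sketch below parses such text and reports which expected metric names are absent. The sample values and labels are made up, and the list of required metrics is an assumption, since this page does not state which exporter metrics AI Insight reads.

```python
import re

# Illustrative sample in Prometheus text exposition format.
# The values and label sets are invented for this example.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 0
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 71
'''

# Matches: metric name, optional {label="value",...} block, numeric value.
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def missing_metrics(samples, required):
    """Return the required metric names that never appear in the samples."""
    seen = {name for name, _, _ in samples}
    return sorted(set(required) - seen)


# Assumed list of metrics to look for; adjust it to your exporter configuration.
REQUIRED = ["DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_FB_USED"]
print(missing_metrics(parse_metrics(SAMPLE), REQUIRED))
```

In practice the text would come from the exporter's metrics endpoint rather than from a hard-coded string. An empty result suggests the exporter is producing the expected series, so the problem is more likely in the agent or its Prometheus input configuration.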

Related Documents

Document	Description
Key Concepts	Describes AI Insight components, GPU status, key metrics, and event metrics.
Metric Exporter Installation	Describes how to install components for collecting GPU metrics in KE and VM environments.
Check Overall GPU Status	Describes how to check the overall GPU status on the Overview page.
View GPU Resource Details	Describes how to view Cluster, Node, and GPU details.
