Key Concepts

Data Catalog is a centralized repository that stores and manages metadata for user data. Metadata is not shared between catalogs created in different VPCs (subnets), and all catalogs are operated in high availability (HA) mode.

Apache Hive Metastore-Compatible Catalog

  • The default catalog type compatible with Apache Hive Metastore.
  • Can be created by specifying a VPC (subnet).
  • Stores, modifies, and deletes metadata, such as table definitions and storage paths, for user data.
  • Manages Hive-format databases and tables.

Apache Iceberg Catalog

  • Supports the Apache Iceberg table format.

Overview of Apache Iceberg

What Is Iceberg?

Apache Iceberg is an open table format designed to reliably manage large-scale analytic datasets as a single SQL table. It enables multiple engines, such as Spark and Trino, to safely read from and write to the same table concurrently.

Key Features of Iceberg

  • SQL-Friendly Operations: Supports expressing row-level changes using standard SQL (DML/DDL) and provides a unified table format across multiple engines. Supported features may vary by engine.
  • Schema and Partition Evolution: Allows adding, renaming, or deleting columns and changing partition layouts without recreating the table.
  • Hidden Partitioning: Manages partition transformations (e.g., day(ts), bucket(id)) as metadata, eliminating the need to manually handle directory structures.
  • Time Travel and Rollback: Every write creates a snapshot, which makes it possible to query or roll back to a specific point in time or snapshot (not yet available; support is to be added).
  • Operational Optimization (Compaction, etc.): Reduces metadata overhead and improves scan performance by optimizing file layouts such as merging small files.
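
To make the hidden partitioning idea above more concrete, the following is a minimal sketch of what a day(ts) transform computes. In the Iceberg table specification, the day transform maps a timestamp to the number of days since 1970-01-01. The function name `day_transform` is hypothetical and used only for illustration; it is not part of any Iceberg or Data Catalog API.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Return the number of days from 1970-01-01 (UTC) for a timestamp.

    Illustrative helper only: engines apply this transform internally,
    so users filter on `ts` and never manage day directories themselves.
    """
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware timestamp")
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


# Two events on the same UTC day fall into the same partition value.
morning = datetime(2024, 1, 1, 8, 30, tzinfo=timezone.utc)
evening = datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)
print(day_transform(morning), day_transform(evening))
```

Because the transform is stored as table metadata, a query that filters on the timestamp column can be pruned to the matching day values without the user referencing a separate partition column.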

Using Iceberg with the Data Catalog Service

  • The Data Catalog service manages metadata (schema, snapshots, partition transformations, etc.) of Iceberg tables.
  • Query execution is performed in compute engines such as Spark or Trino, which reference the REST catalog URI and warehouse (Object Storage) settings.
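
As a rough illustration of how a compute engine references the REST catalog URI and warehouse, the sketch below builds the Spark properties that Apache Iceberg uses for a REST catalog. The property keys are standard Iceberg Spark settings; the catalog name, URI, and warehouse values are placeholders, and the helper `iceberg_rest_catalog_conf` is hypothetical. Authentication and Object Storage endpoint settings are omitted because this page does not describe them.

```python
def iceberg_rest_catalog_conf(catalog_name: str, rest_uri: str, warehouse: str) -> dict[str, str]:
    """Build Spark properties for an Iceberg REST catalog (illustrative helper)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions in Spark.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a Spark catalog backed by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Selects the REST catalog implementation.
        f"{prefix}.type": "rest",
        # Placeholder: the REST catalog URI of your Iceberg catalog.
        f"{prefix}.uri": rest_uri,
        # Placeholder: the warehouse (Object Storage) setting.
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_rest_catalog_conf(
    catalog_name="my_catalog",             # assumed name
    rest_uri="https://<rest-catalog-uri>",  # placeholder
    warehouse="<warehouse>",                # placeholder
)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed lines can be passed to `spark-submit` or set on a Spark session builder; the exact values to use depend on your catalog and bucket.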

Supported Scope and Limitations

  • Format Version: Supports Iceberg v2. Iceberg v1 tables can be created and queried but may have limited functionality.
  • Catalog Count Limit: Each project (or account) can create only one Iceberg catalog.
  • Object Storage: General buckets (S3-compatible) are recommended. Classic buckets (Swift-based) are not supported.

Database

A database in the Data Catalog is a container that stores tables. Its supported scope depends on the parent catalog type.

  • Used to organize metadata tables.
  • Each table belongs to exactly one database.
  • The database list in the KakaoCloud console displays all databases within the project.
  • Supported types:
    • Standard Database: Hive-format database
    • Iceberg Database: Iceberg-format database

Table

A table in the Data Catalog represents metadata describing data stored in a data store. You can create tables in the KakaoCloud console, where their metadata values are displayed in the table list.

  • Includes lower-level metadata such as schema, partition, and table properties.
  • Can be manually created and modified.
  • When using the Data Catalog as a metastore for the Hadoop ecosystem, you can also edit metadata for migrated tables.
  • Behavioral differences by catalog type:
    • Standard Table: Hive-based tables supporting Avro, JSON, Parquet, ORC, CSV, and TEXT formats. Queried by Hive, Trino, etc.
    • Iceberg Table: Iceberg-format tables supporting Avro, Parquet, and ORC, providing Iceberg-specific features.
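
The format differences between the two table types can be captured in a small lookup, for example to validate a request before creating a table. This is a sketch based only on the formats listed above; `SUPPORTED_FORMATS` and `is_format_supported` are hypothetical names, not part of the Data Catalog API.

```python
# File formats per table type, as listed on this page.
SUPPORTED_FORMATS = {
    "standard": {"AVRO", "JSON", "PARQUET", "ORC", "CSV", "TEXT"},
    "iceberg": {"AVRO", "PARQUET", "ORC"},
}


def is_format_supported(table_type: str, file_format: str) -> bool:
    """Return True if the file format is listed for the given table type."""
    formats = SUPPORTED_FORMATS.get(table_type.lower())
    if formats is None:
        raise ValueError(f"unknown table type: {table_type!r}")
    return file_format.upper() in formats


print(is_format_supported("standard", "csv"))
print(is_format_supported("iceberg", "csv"))
```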

Crawler

In the Data Catalog, a crawler scans MySQL data, extracts metadata, and automatically updates the Data Catalog to simplify data discovery. You can create crawlers from the KakaoCloud console, and tables generated by crawlers appear in the table list.

  • The schema extracted by the crawler is stored as a Data Catalog table.
    The table name takes the form Prefix + MySQL database name + "_" + table name.
  • Crawler execution history is retained for up to 90 days; records older than 90 days are automatically deleted.
  • You can schedule crawler executions.

Caution: Crawlers are not supported for Iceberg-type catalogs.
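
The crawler naming rule and the 90-day history retention can be sketched as follows. Both helpers are hypothetical illustrations of the rules described in this section; the service applies them itself, and the exact moment at which old history records are deleted is an assumption here.

```python
from datetime import datetime, timedelta, timezone

# Crawler execution history is retained for up to 90 days.
HISTORY_RETENTION = timedelta(days=90)


def crawler_table_name(prefix: str, mysql_database: str, mysql_table: str) -> str:
    """Compose the table name as Prefix + MySQL database name + "_" + table name."""
    return f"{prefix}{mysql_database}_{mysql_table}"


def is_history_retained(executed_at: datetime, now: datetime) -> bool:
    """Return True if an execution record is still within the retention window.

    Assumption: a record that is exactly 90 days old is still retained.
    """
    return now - executed_at <= HISTORY_RETENTION


# Example values ("crawl_", "shop", "orders") are made up for illustration.
print(crawler_table_name("crawl_", "shop", "orders"))
```

For instance, with the prefix `crawl_`, the MySQL table `orders` in database `shop` would be stored as `crawl_shop_orders`.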

Resource Status and Lifecycle

You can check the status of catalogs, databases, and tables in the Data Catalog.
Creating a catalog initializes a centralized repository that stores and manages metadata for your data assets.
(Creation takes approximately 10 minutes.)
Catalogs are fully managed, and each catalog reports its current state, such as running or terminated, so that you can track it.

The statuses for each resource type are as follows:

[Figure: Catalog lifecycle]

Catalog Status

Standard Catalog Status
Status        Description
INIT          The catalog has just been created.
PROVISIONING  The catalog is provisioning VMs for use.
RUNNING       The catalog is running and available.
PENDING       The catalog is performing failover to recover from an error state.
FATAL         The catalog has encountered an unrecoverable error.
TERMINATING   The catalog is releasing hardware resources for termination.
TERMINATED    The catalog has been terminated and is no longer available.

Iceberg Catalog Status
Status        Description
RUNNING       The Iceberg catalog is running and available.
FATAL         The Iceberg catalog has encountered an unrecoverable error.
TERMINATED    The Iceberg catalog has been terminated and is no longer available.

Database and Table Status

Databases and tables change status based on create, modify, and delete operations.
Each status affects how the resources are managed and determines which operations are available.
A table’s state also depends on the state of its parent database.
For example, a table can only be created or modified when the database is in the ACTIVE or ALTERING state.

Info: The statuses for Standard and Iceberg databases and tables are the same.

Status    Description
CREATING  Database or table is being created.
ALTERING  Database or table is being modified.
DELETING  Database or table is being deleted.
ACTIVE    Database or table is available for use.
INACTIVE  Database or table is unavailable.
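
The dependency between a table operation and the state of its parent database can be expressed as a simple check. The sketch below models the statuses listed above as an enum and encodes the rule that a table can only be created or modified when the database is ACTIVE or ALTERING. The names `ResourceStatus` and `can_create_or_modify_table` are hypothetical and only illustrate the rule; they are not part of the service API.

```python
from enum import Enum


class ResourceStatus(Enum):
    """Database and table statuses as listed on this page."""
    CREATING = "CREATING"
    ALTERING = "ALTERING"
    DELETING = "DELETING"
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"


# Parent database states in which tables can be created or modified.
TABLE_WRITABLE_DATABASE_STATUSES = frozenset(
    {ResourceStatus.ACTIVE, ResourceStatus.ALTERING}
)


def can_create_or_modify_table(database_status: ResourceStatus) -> bool:
    """Return True if the parent database state allows table create/modify."""
    return database_status in TABLE_WRITABLE_DATABASE_STATUSES


for status in ResourceStatus:
    print(f"{status.value}: {can_create_or_modify_table(status)}")
```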

Crawler Status

Crawlers change status based on create, modify, run, and delete operations, and they are also affected by the state of the associated database and of MySQL.
For example, a crawler can only be created or run when MySQL is in the Available state.

Status    Description
CREATING  Crawler is being created.
ALTERING  Crawler is being modified.
DELETING  Crawler is being deleted.
ACTIVE    Crawler is active and available.
RUNNING   Crawler is currently running.
INACTIVE  Crawler is inactive (e.g., when the associated database is deleted), but crawler history remains viewable.