2 posts tagged with "iceberg"

Hadoop Eco adds features for operational efficiency in data lake architecture

· 5 min read
Evan (진은용)
Service Manager
HDE update

Enterprises designing large-scale, cloud-based data lake architectures have reached a point where simply accumulating data is no longer enough; operational efficiency must be maximized as well. Achieving that efficiency requires a balanced set of core elements: high-performance processing, flexible separation of compute resources, and robust data governance.

If this balance breaks down, compound problems follow: batch jobs delay real-time analytics queries, or it becomes hard to tell where the needed data lives and how reliable it is.

KakaoCloud Hadoop Eco (HDE) recently carried out a large-scale update to solve these problems and improve the processing power and operational management capabilities of analytics environments. Based on the release of the new HDE-2.3.0 version, this update includes major changes such as improved integration with Iceberg catalogs, a next-generation metastore, and the introduction of task nodes optimized for workloads.

In this post, we briefly introduce how these improvements can be used within HDE to improve analytics workflows.

🚀 New HDE-2.3.0 version and powerful components added

With this update, HDE-2.3.0 is now available, and the JupyterLab, Impala, and Kudu components have been added to support data analytics and processing workflows.

Create HDE cluster

  • JupyterLab: Provides a web-based programming and shell environment, offering a development environment where data exploration and analysis code can be executed immediately within cluster nodes.
  • Impala: A query engine that runs fast interactive queries against data stores such as Kudu, using Hive Metastore for table metadata.
  • Kudu: Serves as a columnar data store that supports low-latency reads and writes.

In addition, Druid, a core component of Dataflow-type clusters, has been upgraded to v33.0.0, and Superset has been upgraded to v5.0.0, further improving performance and stability.

💡 View the Hadoop Eco component list

⚙️ Securing cluster structure flexibility: introducing task nodes

One of the tricky parts of cluster operations is separating batch processing and interactive processing resources to minimize mutual interference. In this update, the newly introduced task node effectively reduces operational burden.

Task node settings

  • Role separation: Task nodes are mainly used as dedicated compute resources for executing large-scale batch computation jobs (YARN Jobs). By separating their role from worker nodes, they ensure the stability of core data processing resources and effectively prevent performance degradation caused by resource contention.
  • More accurate capacity planning: With the introduction of task nodes, the method for calculating YARN available resources has been changed to include the number and flavor of task nodes. This makes cluster capacity planning more accurate and predictable.
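To make the capacity-planning point concrete, here is a minimal sketch of how schedulable YARN resources might be estimated once task nodes are counted alongside worker nodes. The node counts, flavor sizes, and the `reserved_gb_per_node` overhead are illustrative assumptions; the post does not publish HDE's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A group of identical nodes. All values here are hypothetical."""
    name: str
    count: int
    vcores_per_node: int
    memory_gb_per_node: int


def yarn_capacity(groups, reserved_gb_per_node=4):
    """Estimate total vcores and memory (GB) that YARN could schedule.

    reserved_gb_per_node is an assumed per-node overhead for the OS and
    daemons; it is not a documented HDE value.
    """
    vcores = sum(g.count * g.vcores_per_node for g in groups)
    memory_gb = sum(
        g.count * (g.memory_gb_per_node - reserved_gb_per_node) for g in groups
    )
    return vcores, memory_gb


workers = NodeGroup("worker", count=3, vcores_per_node=8, memory_gb_per_node=32)
tasks = NodeGroup("task", count=2, vcores_per_node=16, memory_gb_per_node=64)

without_tasks = yarn_capacity([workers])
with_tasks = yarn_capacity([workers, tasks])
print(without_tasks)  # (24, 84)
print(with_tasks)     # (56, 204)
```

Comparing the two results shows why both the number and the flavor of task nodes matter when sizing a cluster for batch workloads.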

⚠️ Note when using task nodes: Task nodes can be configured only when creating a cluster, and they cannot be added to a cluster that was created without them, so decide at the initial design stage whether you need them. If a cluster was created with task nodes, you can still reduce their count to 0 and increase it again later.

🧊 Iceberg catalog integration, now with one click

As KakaoCloud Data Catalog officially supports the Apache Iceberg format, Iceberg catalog integration when creating a Hadoop Eco cluster has been dramatically simplified.

Iceberg catalog integration

With this improvement, the Hadoop Eco console lets you directly select and connect a Data Catalog Iceberg catalog in the external metastore integration setting during cluster creation. This minimizes human error, shortens integration time, and lets you start analytics work immediately.

In addition, an option has been added so that users can choose whether to automatically retain data during the data retention period (90 days) after cluster deletion. This feature can be used to prevent unnecessary metadata retention costs and clarify governance.

This Hadoop Eco update is not just a feature expansion. It further strengthens the operational efficiency of data lake architecture around three axes: stable metadata governance, high-performance interactive analytics environments, and flexible compute resource management.

Operate analytics workflows more efficiently and systematically with KakaoCloud's new Hadoop Eco.

Thank you.

👉 Start KakaoCloud now

Latest service updates for stronger operational reliability - Iceberg, PITR, SMS

· 4 min read
Mia (정혜원)
Technical Contents Manager
update

One of the most important values in cloud operations is stability. Stability is not only about preventing problems: it also depends on how quickly and flexibly problems can be recovered from and resolved when they occur, and on how well they can be anticipated and prepared for in advance.

Through recent updates across several services, KakaoCloud has further strengthened this important value of Operational Reliability. We focused on improving users' operating experience around safe data recovery, efficient system maintenance, and fast failure notification systems.

In this post, we take a closer look at three notable improvements that can substantially improve operational reliability.


🧊 1. Iceberg format support for data integrity

One notable change in the recent update is that Data Catalog now officially supports the Apache Iceberg format. Apache Iceberg, originally developed at Netflix, is an open-source table format for large-scale data that tracks change history as snapshots, which enables querying past states (Time Travel) and restoring to a specific point in time.

You can now select the Iceberg catalog type in KakaoCloud Data Catalog. With Iceberg added alongside the existing Hive Metastore-based Standard type, version management and point-in-time recovery have become much simpler even in large-scale data environments. Even if data is lost or corrupted, you can easily restore a table to a previous state, and the catalog works right away with major analytics engines such as Spark and Trino.
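The idea behind time travel can be illustrated with a small, simplified model: every commit produces a snapshot, and reading "as of" a timestamp means picking the latest snapshot committed at or before that time. The `Snapshot` class and `snapshot_as_of` function below are hypothetical names for illustration only; they are not the Iceberg API or the Data Catalog API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Toy stand-in for a table snapshot (not Iceberg's real metadata)."""
    snapshot_id: int
    committed_at: datetime
    rows: tuple


def snapshot_as_of(history, ts) -> Optional[Snapshot]:
    """Return the snapshot that was current at time ts, or None if the
    table had no committed snapshot yet."""
    candidates = [s for s in history if s.committed_at <= ts]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.committed_at)


history = [
    Snapshot(1, datetime(2025, 1, 1, 9, 0), ("a",)),
    Snapshot(2, datetime(2025, 1, 1, 12, 0), ("a", "b")),
    Snapshot(3, datetime(2025, 1, 1, 15, 0), ()),  # accidental delete
]

# Read the table as it was one hour before the accidental delete.
before_delete = snapshot_as_of(history, datetime(2025, 1, 1, 14, 0))
print(before_delete.snapshot_id, before_delete.rows)  # 2 ('a', 'b')
```

Because older snapshots stay addressable, recovering from the bad commit is a matter of pointing the table back at snapshot 2 rather than reloading data.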

Through this update, KakaoCloud Data Catalog fully supports the integrity and resilience of large-scale data at a practical operational level, and is expected to further improve data reliability across analytics environments.

📝 Learn more about Apache Iceberg catalogs

⏪ 2. Stronger recovery reliability with point-in-time recovery (PITR)

Databases are among the most important elements of cloud operational stability, so the reliability of their recovery features matters a great deal. In this MySQL update, the long-awaited Point-in-Time Recovery (PITR) feature has been added.

Based on automatic backups and Binary Logs, you can specify a desired point in time and restore a new instance group to the state at that time. Because you can now specify the recovery point down to the second, you can respond very flexibly to data loss caused by mistakes or errors.
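Conceptually, this kind of recovery combines two steps: start from the most recent backup taken at or before the target time, then replay binary log events up to that time. The sketch below models that planning step in plain Python. The `BinlogEvent` class, the `plan_pitr` function, and the sample timestamps are hypothetical; this is not KakaoCloud's implementation or a MySQL API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BinlogEvent:
    """Toy stand-in for one binary log event."""
    at: datetime
    statement: str


def plan_pitr(backups, events, target):
    """Return (base_backup_time, events_to_replay) for restoring to target.

    Assumes a backup taken at time T already contains every event with a
    timestamp at or before T.
    """
    usable = [b for b in backups if b <= target]
    if not usable:
        raise ValueError("no backup exists at or before the target time")
    base = max(usable)
    ordered = sorted(events, key=lambda e: e.at)
    replay = [e for e in ordered if base < e.at <= target]
    return base, replay


backups = [datetime(2025, 3, 1), datetime(2025, 3, 2)]
events = [
    BinlogEvent(datetime(2025, 3, 1, 10, 0, 0), "INSERT"),
    BinlogEvent(datetime(2025, 3, 2, 9, 0, 0), "UPDATE"),
    BinlogEvent(datetime(2025, 3, 2, 9, 0, 5), "DROP TABLE"),  # the mistake
]

# Restore to one second before the mistaken DROP TABLE.
base, replay = plan_pitr(backups, events, datetime(2025, 3, 2, 9, 0, 4))
print(base, [e.statement for e in replay])
```

Second-level precision matters in this example: moving the target one second later would replay the `DROP TABLE` as well.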

💡 Please note! For service stability, point-in-time recovery currently supports only the single (non-HA) availability configuration. If a high availability (HA) configuration is required, we recommend adding instances after recovery is complete.

In addition, security groups can now be modified while instances are running, which gives more flexible network control. Account management procedures have also been improved so that the same password policies apply when accounts are managed through procedures. These detailed improvements to security and recovery features are important changes that substantially increase stability in real operating environments.

📝 Learn more about MySQL point-in-time recovery

📩 3. Notification speed is response speed

The outcome of responding to issues depends on how quickly operators recognize the system status. In this update, Maintenance introduced a new SMS notification feature in addition to existing email. When a maintenance task fails or an important event occurs, a notification is immediately sent to the registered mobile phone number. Now, even if you do not check email, you can recognize and respond to problem situations in real time.

💡 Please note! SMS notifications are sent only for events that require quick action, and project administrators must register valid contact information in advance.
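The routing rule described above can be sketched as a small decision function: email is always used, and SMS is added only when the event needs quick action and a phone number has been registered. The `Contact` and `Event` types, the `urgent` flag, and the assumption that email is always sent are hypothetical; the post does not specify which events qualify or how Maintenance implements this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contact:
    """Registered contact details for a project administrator."""
    email: str
    phone: Optional[str] = None  # None means no number was registered


@dataclass(frozen=True)
class Event:
    kind: str
    urgent: bool  # hypothetical flag for "requires quick action"


def channels_for(event, contact):
    """Return the notification channels to use for this event."""
    channels = ["email"]  # assumed: the existing email channel stays on
    if event.urgent and contact.phone:
        channels.append("sms")
    return channels


admin = Contact(email="admin@example.com", phone="+82-10-0000-0000")
no_phone = Contact(email="ops@example.com")

print(channels_for(Event("maintenance_failed", urgent=True), admin))
print(channels_for(Event("maintenance_scheduled", urgent=False), admin))
print(channels_for(Event("maintenance_failed", urgent=True), no_phone))
```

The third call shows why registering contact information in advance matters: without a phone number, even an urgent event falls back to email only.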

📝 Learn more about Maintenance


These three updates were made in different services, but they all point in the same direction. Data can be restored safely without loss, security settings have become more flexible, and failures can be detected faster. This is the operational resilience KakaoCloud aims for. Stability improvements covering the entire operational process, from data to notifications, will continue.

KakaoCloud will continue improving technical completeness so that customers' operating environments become more stable and predictable. We appreciate your continued interest and support.

👉 Start KakaoCloud now