
2 posts tagged with "kafka"


Building a Kafka-based real-time data pipeline

· 3 min read
Erin (오예진)
Cloud Engineer
Tutorial new release

Services continuously generate logs, user events, and transaction information. Storing this data is important, but it becomes a truly "meaningful flow" only when it can be analyzed quickly.

The Kafka-based real-time data pipeline tutorial series introduced here is a hands-on guide to implementing this "flow of data" on KakaoCloud, step by step.

This series consists of three parts and guides you step by step through the entire process, from receiving real-time messages to storage and analysis. It is designed so that you can connect Kafka, Object Storage, Data Catalog, and Data Query, understand the overall structure through which data flows, and implement it directly.

Figure: Architecture for building a real-time data pipeline

Part 1: Build a structure for receiving Kafka messages

In the first tutorial, you create a Kafka cluster and configure an environment for sending and receiving messages through topics. You create Kafka topics, configure producers and consumers, and send and receive messages to establish the foundation for real-time data collection. This process focuses on understanding the basic structure of an event-driven system and creating the starting point of message flow.
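The produce/consume relationship described above can be pictured with a toy in-memory model. This is only a sketch of the concept, not a Kafka client: the `Topic` class, the topic name, and the group names are hypothetical, and a real cluster adds partitions, replication, and network I/O.

```python
from collections import defaultdict


class Topic:
    """Toy in-memory stand-in for a single-partition Kafka topic."""

    def __init__(self, name):
        self.name = name
        self.log = []                    # append-only message log
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def produce(self, value):
        """Append a message and return the offset it was written at."""
        self.log.append(value)
        return len(self.log) - 1

    def consume(self, group, max_messages=10):
        """Return unread messages for a group and advance that group's offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


topic = Topic("user-events")
topic.produce({"user": "a", "action": "login"})
topic.produce({"user": "b", "action": "purchase"})

first_read = topic.consume("storage-loader")
second_read = topic.consume("storage-loader")  # nothing new for this group
other_group = topic.consume("dashboard")       # independent offset, sees everything
```

Because each consumer group tracks its own offset, the same messages can feed several downstream uses without the producer knowing about any of them.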

👉 View the message processing through Kafka tutorial

Part 2: Store received messages in Object Storage

The second tutorial covers the flow of periodically collecting messages received through Kafka and storing them in Object Storage. Messages are collected at regular intervals and written as a single file per interval, and the stored files are used later as data sources for analysis. This part also prompts you to consider where streaming ends and batch begins, and how file formats and directory structures should be designed.
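As a rough illustration of interval-based batching, the sketch below groups messages into fixed time windows and produces one newline-delimited JSON file body per window. The five-minute window, the `ts_ms` field, the `kafka/user-events` prefix, and the `dt=/hr=` key layout are all assumptions made for this example, not settings taken from the tutorial, and the actual upload step is left out.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def window_start(ts_ms, window_seconds):
    """Round an epoch-millisecond timestamp down to the start of its window."""
    seconds = ts_ms // 1000
    return seconds - (seconds % window_seconds)


def object_key(prefix, start_seconds):
    """Build a date/hour-partitioned object key (this layout is an assumption)."""
    t = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    return f"{prefix}/dt={t:%Y-%m-%d}/hr={t:%H}/events-{t:%Y%m%dT%H%M%S}.jsonl"


def batch_messages(messages, prefix="kafka/user-events", window_seconds=300):
    """Group messages by time window and return {object_key: file_body}."""
    windows = defaultdict(list)
    for msg in messages:
        windows[window_start(msg["ts_ms"], window_seconds)].append(msg)
    return {
        object_key(prefix, start): "\n".join(json.dumps(m, sort_keys=True) for m in msgs) + "\n"
        for start, msgs in sorted(windows.items())
    }


messages = [
    {"ts_ms": 1_700_000_000_000, "user": "a", "action": "login"},
    {"ts_ms": 1_700_000_060_000, "user": "b", "action": "purchase"},
    {"ts_ms": 1_700_000_400_000, "user": "a", "action": "logout"},
]
files = batch_messages(messages)
```

A shorter window makes data available for analysis sooner but produces more, smaller files; a longer window does the opposite. That trade-off is the streaming/batch boundary in practice.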

👉 View the tutorial for loading Kafka data into Object Storage

Part 3: Real-time analysis with Data Catalog and Data Query

The final tutorial configures an environment where data stored in Object Storage is registered in Data Catalog and SQL-based analysis can be performed through Data Query. Tables registered in the catalog are managed by partition, and new data can be automatically reflected through periodic synchronization settings. The most important part of this stage is converting real-time data collected through Kafka into a structure that can be analyzed immediately without a separate complex pipeline.
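To make partition management and periodic synchronization more concrete, here is a minimal sketch that works out which partitions exist in storage but are not yet known to the catalog. It assumes object keys use Hive-style `name=value` path segments; the key names, the `registered` set, and both helper functions are hypothetical and do not represent the Data Catalog API.

```python
def partition_values(key):
    """Extract Hive-style name=value path segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        name, sep, value = segment.partition("=")
        if sep and name and value:
            parts[name] = value
    return parts


def new_partitions(keys, registered):
    """Return partitions found in storage but missing from the catalog, sorted."""
    found = {tuple(sorted(partition_values(k).items())) for k in keys}
    found.discard(())  # ignore keys that carry no partition segments
    return sorted(found - set(registered))


keys = [
    "kafka/user-events/dt=2023-11-14/hr=22/events-20231114T221000.jsonl",
    "kafka/user-events/dt=2023-11-14/hr=22/events-20231114T222000.jsonl",
    "kafka/user-events/dt=2023-11-14/hr=23/events-20231114T230000.jsonl",
]
registered = {(("dt", "2023-11-14"), ("hr", "22"))}
pending = new_partitions(keys, registered)
```

A periodic synchronization job would run a comparison like this on a schedule and register whatever appears in `pending`, so queries can restrict themselves to the partitions they need.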

👉 View the tutorial for analyzing Kafka messages using Data Catalog and Data Query


This real-time data pipeline tutorial series is more than a simple code example. It is written around an architecture and settings that can be carried over to production environments. By following the entire process of receiving Kafka messages, storing them in Object Storage, and connecting them to analysis with Data Catalog and Data Query, you can quickly build practical intuition for designing real-time services, monitoring systems, and event-based statistics pipelines.

If you are designing a Kafka-based real-time data pipeline for the first time or want to expand an existing pipeline on KakaoCloud, this tutorial will be a good reference.

🖥️ Try it now!
View the Kafka-based real-time data pipeline tutorial series at a glance

Building a CDC Pipeline with Kafka

· 4 min read
Analytics Use Cases

Hello. In this post, we introduce how to build a CDC (Change Data Capture) pipeline for real-time data synchronization using KakaoCloud services.

CDC is a technology that captures changes such as INSERT, UPDATE, and DELETE as they occur in a database and delivers them to other systems, which makes real-time data synchronization and processing possible. It is widely used for real-time data sharing between microservices, supplying up-to-date data for real-time analytics, and improving the reliability and speed of data backups.

Importance of CDC for real-time synchronization

Consider the order system of a large online shopping mall. During a special sale for a popular product, Customer A purchases the last item in stock. In a system without CDC, there may be a delay before the change in the inventory database is reflected in other systems. If Customer B orders and pays for the same product during this delay, that order must later be canceled due to insufficient inventory. If this happens repeatedly, it damages both customer satisfaction and trust in the business.

If CDC technology had been applied in advance, the database change would have been detected immediately after Customer A's purchase was completed and reflected in real time across all related systems, including inventory management, product display, and payment systems. In this process, the product could immediately be displayed as "sold out," preventing unnecessary additional orders from Customer B.
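The scenario can be sketched in a few lines. The event shape used here (`op`, `key`, `row`) is a simplified assumption for illustration and not the format of any particular CDC tool, and the function names are hypothetical.

```python
def apply_change(replica, event):
    """Apply one change event to a downstream copy keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Merge changed columns into whatever the copy already holds.
        replica[key] = {**replica.get(key, {}), **event["row"]}
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return replica


def display_status(replica, product_id):
    """What the product page would show, based on the synchronized copy."""
    row = replica.get(product_id)
    if row is None:
        return "unavailable"
    return "sold out" if row["stock"] <= 0 else "in stock"


display_copy = {}
apply_change(display_copy, {"op": "insert", "key": "sku-1",
                            "row": {"name": "Limited sneaker", "stock": 1}})
before_purchase = display_status(display_copy, "sku-1")

# Customer A buys the last item: the inventory database emits an UPDATE.
apply_change(display_copy, {"op": "update", "key": "sku-1", "row": {"stock": 0}})
after_purchase = display_status(display_copy, "sku-1")
```

The product display never queries the inventory database directly. It only applies the stream of changes, so its view is stale only for as long as an event takes to arrive.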

In this way, CDC contributes to improving both business operational efficiency and customer satisfaction by immediately reflecting database changes. For this reason, many companies are adopting CDC solutions to improve data management and system integration.

KakaoCloud provides various managed services for building CDC pipelines. By using these services, you can easily build a stable and cost-effective CDC pipeline. The following are the core services required to build a CDC pipeline.

  • MySQL: KakaoCloud provides an enterprise-grade managed MySQL service. Backups, real-time monitoring, and security patches are handled automatically, and high availability with automatic failure handling keeps database operations stable.

  • Advanced Managed Kafka: Advanced Managed Kafka is KakaoCloud's fully managed Apache Kafka service. It automatically configures and manages high-performance infrastructure for large-scale real-time data streaming, and cluster operation and monitoring are automated, enabling a stable message brokering service.

  • Hadoop Eco: Hadoop Eco is a data analytics ecosystem that makes it easy and fast to perform various tasks using large-scale data. It provides various open-source components in the Hadoop ecosystem as fully managed services, reducing the burden of building and operating complex big data environments.

Building a CDC Pipeline with Kafka

A detailed configuration example for a CDC pipeline built on these services is available as a tutorial in the KakaoCloud technical documentation.

The Building a CDC Pipeline with Kafka tutorial explains how to set up a CDC pipeline using MySQL as the managed database service, Advanced Managed Kafka for real-time data streaming, and Hadoop Eco for data analytics.

The following architecture shows the overall flow of the tutorial: Debezium detects data changes in MySQL and delivers them in real time through Kafka, and the changes are then analyzed in Druid and visualized with Superset.

Figure: KakaoCloud CDC pipeline architecture
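Between Kafka and the analytics layer, change events usually need to be flattened into plain rows. The sketch below does this for a Debezium-style envelope with `before`, `after`, `source`, `op`, and `ts_ms` fields. The sample event, the `_op`/`_table`/`_ts_ms` column names, and the function itself are illustrative assumptions, not the tutorial's configuration.

```python
import json

# Debezium operation codes mapped to readable names.
OPERATIONS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def flatten_change_event(raw):
    """Turn a Debezium-style change event into one flat row for analytics."""
    event = json.loads(raw)
    payload = event.get("payload", event)  # envelope may or may not be wrapped with a schema
    op = payload["op"]
    # Deletes carry the last known row in "before"; other operations use "after".
    row = payload["before"] if op == "d" else payload["after"]
    return {
        **row,
        "_op": OPERATIONS.get(op, op),
        "_table": payload["source"]["table"],
        "_ts_ms": payload["ts_ms"],
    }


raw_event = json.dumps({
    "payload": {
        "before": {"id": 7, "stock": 1},
        "after": {"id": 7, "stock": 0},
        "source": {"db": "shop", "table": "inventory"},
        "op": "u",
        "ts_ms": 1700000000000,
    }
})
flat = flatten_change_event(raw_event)
```

Keeping the operation type and timestamp as columns lets the analytics side distinguish current state from change history.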

KakaoCloud CDC pipelines can be used effectively in various business environments, such as real-time inventory management, user behavior analytics, and event-driven systems. The Building a CDC Pipeline with Kafka tutorial provides a useful guide for implementing these cases and applying them to real business environments.

Closing

CDC pipelines have become an essential element for supporting real-time data synchronization and analytics. With KakaoCloud managed services, you can build stable and scalable CDC pipelines easily and efficiently.

For more details and usage methods, see Building a CDC Pipeline with Kafka.

Thank you!