Key Concepts

KakaoCloud's Hadoop Eco is a cloud platform service designed to run distributed processing workloads using open-source frameworks such as Hadoop, Hive, HBase, Spark, Trino, and Kafka. KakaoCloud provisions Hadoop, HBase, Trino, and Dataflow clusters on Virtual Machines.
The key concepts of the Hadoop Eco service are as follows.

Cluster

A cluster is a group of nodes provisioned using Virtual Machines.

Cluster types

Hadoop Eco offers the following types: Core Hadoop, HBase, Trino, and Dataflow.

| Type | Description |
| --- | --- |
| Core Hadoop | Includes Hadoop, Hive, Spark, and Tez<br/>Stores data in HDFS and analyzes it using Hive and Spark |
| HBase | Includes Hadoop and HBase<br/>Stores data in HDFS and provides NoSQL services using HBase |
| Trino | Includes Hadoop, Trino, Hive, and Tez<br/>Stores data in HDFS and analyzes it using Trino and Hive |
| Dataflow | Includes Hadoop, Kafka, Druid, and Superset<br/>Collects data using Kafka and analyzes it using Druid and Superset |
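For example, on a Dataflow-type cluster, data typically enters through Kafka before being analyzed with Druid and Superset. The following is a minimal sketch using the kafka-python client; the broker hostname and topic name are placeholders, not values fixed by Hadoop Eco.

```python
# Minimal sketch: send an event to Kafka on a Dataflow-type cluster.
# Assumes the kafka-python package and a broker reachable at the address
# below; the hostname and topic name are placeholders, not fixed values.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="HadoopMST-cluster-1:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Druid can then ingest this topic for analysis in Superset.
producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()
```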

Cluster availability types

To ensure operational stability, cluster availability is categorized into Standard (Single) and High Availability (HA) types.

| Availability type | Description |
| --- | --- |
| Standard (Single) | One master node and multiple worker nodes<br/>If the master node fails, HDFS and YARN may stop working |
| High Availability (HA) | Three master nodes and multiple worker nodes<br/>HDFS and YARN are configured for HA, allowing automatic recovery if a master node fails |
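In an HA cluster, the active NameNode can be identified by querying the standard Hadoop JMX endpoint on each master that runs a NameNode. A minimal sketch, assuming HDE 2.x (NameNode web port 9870) and placeholder master hostnames:

```python
# Minimal sketch: identify the active NameNode in an HA cluster by
# querying the standard Hadoop JMX endpoint on each master that runs a
# NameNode. Assumes HDE 2.x (NameNode web UI on port 9870) and
# placeholder hostnames.
import requests

NAMENODE_HOSTS = ["HadoopMST-cluster-1", "HadoopMST-cluster-2"]

for host in NAMENODE_HOSTS:
    url = f"http://{host}:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
    try:
        bean = requests.get(url, timeout=5).json()["beans"][0]
        print(f"{host}: {bean['State']}")  # "active" or "standby"
    except requests.RequestException as exc:
        print(f"{host}: unreachable ({exc})")
```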

Cluster versions

The Hadoop Eco version determines which component versions are installed. HDE clusters support the Core Hadoop type for data analysis and the HBase type for HDFS-based NoSQL services; from HDE 1.1.2, the Trino and Dataflow types are also supported.
HDE 2.0.1 supports Hadoop 3.x, HBase 2.x, and Hive 3.x versions.

Components installed by cluster type per version

(Image: components installed for the Core Hadoop type, by HDE version)

Cluster lifecycle

A Hadoop Eco cluster moves through a defined set of states over its lifecycle, allowing users to monitor and manage operational and job statuses. After the initial creation request, a cluster transitions through installation, operation, and deletion stages. Cluster and instance statuses may differ depending on user operations.

(Image: Cluster lifecycle)

Cluster and node statuses

| Status | Description |
| --- | --- |
| Initializing | Metadata is stored and a VM creation request is sent |
| Creating | VMs are being created |
| Installing | Hadoop Eco components are being installed on the created VMs |
| Starting | Hadoop Eco components are starting |
| Running | All components are running and the cluster is operational |
| Running (Scale out initializing) | A VM creation request has been sent for cluster expansion |
| Running (Scale out creating) | Expansion VMs are being created |
| Running (Scale out installing) | Hadoop Eco components are being installed on the new VMs |
| Running (Scale out starting) | Components are starting on the new VMs |
| Running (Scale out running) | Cluster operation with the expanded VMs is being verified |
| Running (Scale in initializing) | VMs targeted for removal are being verified during cluster reduction |
| Running (Scale in ready) | Components on the target VMs are shutting down |
| Running (Scale in starting) | Component shutdown on the target VMs is being verified |
| Running (Scale in terminating) | Target VMs are being deleted |
| Failed to scale out | VM creation for expansion failed |
| Failed to scale out vm | Component installation or startup failed on an expansion VM |
| Failed to scale in | VM deletion during scale-in failed |
| Failed to scale in vm | Component shutdown failed on a scale-in VM |
| Terminating | Cluster is being terminated |
| Terminated (User) | Cluster was terminated by the user |
| Terminated (UserCommand) | Cluster was terminated after successful job scheduling |
| Terminated (Scale in) | The cluster was scaled in successfully |
| Terminated (Error) | Cluster was terminated due to an error |
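Automation that creates or scales clusters usually polls these statuses until the cluster leaves its transitional states. The REST endpoint and response field in the sketch below are hypothetical, for illustration only; refer to the Hadoop Eco API documentation for the actual interface.

```python
# Minimal sketch: poll a cluster's status until it leaves transitional
# states. The REST endpoint and response shape below are HYPOTHETICAL,
# used only to illustrate the lifecycle above; consult the Hadoop Eco
# API reference for the actual interface.
import time

import requests

TRANSITIONAL = {"Initializing", "Creating", "Installing", "Starting"}

def wait_until_stable(api_base: str, cluster_id: str, token: str) -> str:
    while True:
        resp = requests.get(
            f"{api_base}/clusters/{cluster_id}",   # hypothetical endpoint
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,
        )
        status = resp.json()["status"]             # hypothetical field
        if status not in TRANSITIONAL:
            return status                          # e.g. "Running" or a failure state
        time.sleep(30)
```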

Instance and cluster states

Instances are KakaoCloud Virtual Machines that make up a cluster. The states of instances and clusters may differ.

  • If the master node instance is not in the Active state, a Single availability cluster cannot function properly.
  • For HA availability, the cluster can operate normally if at least one master node instance (Master 1 or 2) is in the Active state.

Components

The following are the components running in a Hadoop Eco cluster.

Core Hadoop

| Location | Component | Address |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Before HDE 2.0.0: http://{HadoopMST-cluster-1}:50070<br/>HDE 2.0.0 and later: http://{HadoopMST-cluster-1}:9870 |
| Master 1 | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
| Master 1 | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
| Master 1 | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
| Master 1 | SparkHistoryServer | http://{HadoopMST-cluster-1}:18082 |
| Master 1 | SparkThriftServer | http://{HadoopMST-cluster-1}:20000 |
| Master 1 | Tez UI | http://{HadoopMST-cluster-1}:9999 |
| Master 1 | HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 |
| Master 1 | Hue | http://{HadoopMST-cluster-1}:8888 |
| Master 1 | Zeppelin | http://{HadoopMST-cluster-1}:8180 |
| Master 1 | Oozie | http://{HadoopMST-cluster-1}:11000 |
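The HiveServer2 address above (port 10002) is its web UI; client connections go to the Thrift port instead, assumed here to be the Hive default of 10000. A minimal PyHive sketch:

```python
# Minimal sketch: run a query against HiveServer2 on a Core Hadoop
# cluster using PyHive. Port 10002 in the table above is the HS2 web UI;
# clients connect to the Thrift port, assumed here to be the default 10000.
from pyhive import hive

conn = hive.Connection(host="HadoopMST-cluster-1", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
for (name,) in cursor.fetchall():
    print(name)
cursor.close()
conn.close()
```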

HBase

| Location | Component | Address |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Before HDE 2.0.0: http://{HadoopMST-cluster-1}:50070<br/>HDE 2.0.0 and later: http://{HadoopMST-cluster-1}:9870 |
| Master 1 | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
| Master 1 | HMaster | http://{HadoopMST-cluster-1}:16010 |
| Master 1 | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
| Master 1 | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
| Master 1 | Hue | http://{HadoopMST-cluster-1}:8888 |
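Applications usually reach HBase through its client APIs rather than the web UIs above. The sketch below uses happybase and assumes an HBase Thrift server is running on its default port 9090; the Thrift server is not listed in the component table, so treat this as an assumption about the cluster's configuration.

```python
# Minimal sketch: write and read a row in HBase using happybase.
# Assumes an HBase Thrift server is running (default port 9090); the
# Thrift server is not listed in the component table above, so this is
# an assumption about the cluster's configuration. The table name is a
# placeholder.
import happybase

conn = happybase.Connection(host="HadoopMST-cluster-1", port=9090)
table = conn.table("example_table")                  # placeholder table name
table.put(b"row-1", {b"cf:greeting": b"hello"})
print(table.row(b"row-1"))
conn.close()
```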

Instance

Instances can be viewed in the Cluster list and are managed similarly to standard VMs.

info

For stable operation, master node instances should have at least 16 GB and worker node instances at least 32 GB.

Volume

A volume is the default storage configured when creating an instance and determines the capacity of HDFS. For stable HDFS operation, an appropriate size must be selected.
For detailed information about volumes, refer to the Create and manage volumes document.
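To confirm that the provisioned volumes yield the HDFS capacity you expect, the NameNode's standard JMX metrics can be queried. A minimal sketch, assuming HDE 2.x (NameNode web port 9870) and a placeholder hostname:

```python
# Minimal sketch: check HDFS capacity and remaining space via the
# NameNode's standard JMX metrics. Assumes HDE 2.x (NameNode web UI on
# port 9870) and a placeholder hostname.
import requests

url = ("http://HadoopMST-cluster-1:9870/jmx"
       "?qry=Hadoop:service=NameNode,name=FSNamesystemState")
bean = requests.get(url, timeout=5).json()["beans"][0]

gib = 1024 ** 3
print(f"capacity:  {bean['CapacityTotal'] / gib:.1f} GiB")
print(f"remaining: {bean['CapacityRemaining'] / gib:.1f} GiB")
print(f"live datanodes: {bean['NumLiveDataNodes']}")
```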

Network and security

All instances created in Hadoop Eco are provided within a VPC environment. To configure a cluster, a security group must be created and inbound rules configured to allow communication between components.
For detailed information on network and security settings, refer to the Security group documentation.