Key Concepts

KakaoCloud's Hadoop Eco is a cloud platform service for distributed processing workloads built on open-source frameworks such as Hadoop, Hive, HBase, Spark, Trino, and Kafka. It provisions Hadoop, HBase, Trino, and Dataflow clusters on KakaoCloud Virtual Machines. The key concepts of Hadoop Eco are as follows.

Cluster

A cluster is a collection of nodes provisioned using Virtual Machines.

Cluster types

Hadoop Eco offers the following cluster types: Core Hadoop, HBase, Trino, and Dataflow.

| Type | Description |
| --- | --- |
| Core Hadoop | Includes Hadoop, Hive, Spark, and Tez; data is stored in HDFS and analyzed with Hive and Spark |
| HBase | Includes Hadoop and HBase; data is stored in HDFS and NoSQL services are provided through HBase |
| Trino | Includes Hadoop, Trino, Hive, and Tez; data is stored in HDFS and analyzed with Trino and Hive |
| Dataflow | Includes Hadoop, Kafka, Druid, and Superset; data is collected through Kafka and analyzed with Druid and Superset |
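On a Core Hadoop cluster, for example, the typical flow is to store data in HDFS and analyze it with Spark. The sketch below shows that flow in PySpark, assuming it runs on a master node where Spark is installed; the HDFS path, view, and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Build a session against the cluster's YARN/Hive setup. enableHiveSupport()
# lets Spark read tables registered in the Hive metastore.
spark = (
    SparkSession.builder
    .appName("hde-core-hadoop-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a CSV file previously uploaded to HDFS (the path is hypothetical).
df = spark.read.option("header", "true").csv("hdfs:///user/ubuntu/events.csv")

# Run a simple aggregation with Spark SQL (column names are hypothetical).
df.createOrReplaceTempView("events")
spark.sql(
    "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type"
).show()
```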

Cluster availability types

To ensure operational stability, availability types include Standard (Single) and High Availability (HA).

| Availability type | Description |
| --- | --- |
| Standard (Single) | One master node and multiple worker nodes; because there is a single master node, HDFS and YARN may stop functioning if it fails |
| High Availability (HA) | Three master nodes and multiple worker nodes; HDFS and YARN are configured for HA, and the master role recovers automatically on failure |
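In an HA cluster, you can check which master currently holds the active NameNode through Hadoop's standard JMX endpoint. A minimal sketch, assuming the first two master nodes run the NameNodes; the hostnames are placeholders, and port 50070 replaces 9870 below HDE-2.0.0.

```python
import requests

# First two master nodes in an HA cluster (placeholder hostnames).
MASTERS = ["HadoopMST-cluster-1", "HadoopMST-cluster-2"]

for host in MASTERS:
    # Standard Hadoop JMX endpoint; use port 50070 below HDE-2.0.0.
    url = f"http://{host}:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
    try:
        beans = requests.get(url, timeout=5).json()["beans"]
        print(host, "->", beans[0]["State"])  # "active" or "standby"
    except requests.RequestException as exc:
        print(host, "-> unreachable:", exc)
```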

Cluster versions

The Hadoop Eco (HDE) version determines the versions of the components installed. HDE clusters support the Core Hadoop type for data analysis and the HBase type for HDFS-based NoSQL services; the Trino and Dataflow types are available from HDE 1.1.2 onward. HDE 2.0.1 supports Hadoop 3.x, HBase 2.x, and Hive 3.x.

(Figure: Components installed by cluster type, per version)

Cluster lifecycle

A Hadoop Eco cluster moves through a defined set of states over its lifecycle, which lets users check and manage both the cluster's operational status and its job status. After the initial creation request, the lifecycle proceeds through installation, operation, and deletion, and the states of the cluster and its instances vary with user actions.

(Figure: Cluster lifecycle)

Cluster and node states

| State | Description |
| --- | --- |
| Initializing | Metadata is being saved and VM creation has been requested |
| Creating | VM creation is in progress |
| Installing | Hadoop Eco components are being installed on the created VMs |
| Starting | Hadoop Eco components are starting |
| Running | All components are running and the cluster is operational |
| Running(Scale out initializing) | VM creation has been requested to scale out the cluster |
| Running(Scale out creating) | VM creation is in progress |
| Running(Scale out installing) | Hadoop Eco components are being installed on the new VMs |
| Running(Scale out starting) | Components on the new VMs are starting |
| Running(Scale out running) | Verifying that the existing cluster and the newly added VMs operate correctly |
| Running(Scale in initializing) | Verifying whether the target VMs can be deleted for scale-in |
| Running(Scale in ready) | Shutting down components on the VMs being removed |
| Running(Scale in starting) | Checking that the component shutdown on the VMs being removed succeeded |
| Running(Scale in terminating) | Deleting the VMs |
| Failed to scale out | Failed to create the scale-out VMs |
| Failed to scale out vm | Failed to install or start components on the scale-out VMs |
| Failed to scale in | Failed to delete the scale-in VMs |
| Failed to scale in vm | Failed to shut down components cleanly on the scale-in VMs |
| Terminating | Cluster termination is in progress |
| Terminated(User) | Cluster terminated by the user |
| Terminated(UserCommand) | Cluster terminated after its scheduled job completed successfully |
| Terminated(Scale in) | Scale-in completed and the VMs were terminated successfully |
| Terminated(Error) | Cluster terminated due to an error |
| Terminated(Failed to create vm) | An error occurred during VM creation |
| Terminated(Failed to destroy vm) | An error occurred during VM termination |
| Terminated(Check time over) | Cluster creation exceeded the time limit |
| Terminated(Install error) | Cluster terminated because component installation or startup failed |
| Terminated(Failed to scale out) | VMs terminated because scale-out failed |
| Terminated(Failed to scale in) | VMs force-terminated after scale-in failure |
| Terminated(User deleted VM) | The user manually deleted a VM belonging to the Hadoop Eco cluster |
| Pending | Waiting to accept Hadoop Eco creation requests once the Open API is enabled |
| Processing | A Hadoop Eco creation or job-scheduling request made through the Open API is in progress |
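When a cluster is driven through the Open API (the Pending and Processing states above), a client typically polls the cluster state until it reaches Running or one of the Terminated states. The sketch below is illustrative only: the endpoint URL, authentication headers, and response field name are hypothetical stand-ins, so take the real values from the Open API reference.

```python
import time

import requests

# Hypothetical endpoint and auth headers; consult the Hadoop Eco
# Open API reference for the real ones.
API = "https://hadoop-eco.example.com/v2/clusters/{cluster_id}"
HEADERS = {"Credential-ID": "...", "Credential-Secret": "..."}


def wait_for_final_state(cluster_id: str, interval_s: int = 30) -> str:
    """Poll until the cluster reaches Running or any Terminated state."""
    while True:
        resp = requests.get(API.format(cluster_id=cluster_id),
                            headers=HEADERS, timeout=10)
        state = resp.json()["status"]  # field name is an assumption
        print("cluster state:", state)
        if state == "Running" or state.startswith("Terminated"):
            return state
        time.sleep(interval_s)
```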

Instance and cluster states

Instances are the KakaoCloud Virtual Machines that make up a cluster. The state of an instance can differ from the state of its cluster, for example:

  • If the availability type is Single, the cluster cannot operate correctly when the master node instance is not in the Active state.
  • If the availability type is HA, the cluster can operate correctly as long as at least one of the first two master node instances is Active.

Components

The components running in a Hadoop Eco cluster are as follows:

Core Hadoop

| Location | Component | URL |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br />HDE-2.0.0 and above: http://{HadoopMST-cluster-1}:9870 |
|  | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
|  | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
|  | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
|  | SparkHistoryServer | http://{HadoopMST-cluster-1}:18082 |
|  | SparkThriftServer | http://{HadoopMST-cluster-1}:20000 |
|  | Tez UI | http://{HadoopMST-cluster-1}:9999 |
|  | HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 |
|  | Hue | http://{HadoopMST-cluster-1}:8888 |
|  | Zeppelin | http://{HadoopMST-cluster-1}:8180 |
|  | Oozie | http://{HadoopMST-cluster-1}:11000 |
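Most of these components are plain HTTP services, so liveness can be checked programmatically. For instance, the YARN ResourceManager on port 8088 exposes the standard Hadoop REST API; the hostname below is a placeholder for your first master node.

```python
import requests

# Standard YARN ResourceManager REST API (placeholder hostname).
url = "http://HadoopMST-cluster-1:8088/ws/v1/cluster/metrics"
metrics = requests.get(url, timeout=5).json()["clusterMetrics"]
print("active NodeManagers:", metrics["activeNodes"])
print("running applications:", metrics["appsRunning"])
```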

HBase

| Location | Component | URL |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br />HDE-2.0.0 and above: http://{HadoopMST-cluster-1}:9870 |
|  | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
|  | HMaster | http://{HadoopMST-cluster-1}:16010 |
|  | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
|  | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
|  | Hue | http://{HadoopMST-cluster-1}:8888 |
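The HMaster UI on port 16010 also serves Hadoop-style JMX metrics, which makes for a quick HBase health check. A sketch, assuming a reachable master node; the hostname is a placeholder and the exact bean and field names can vary across HBase versions.

```python
import requests

# JMX bean exposed by the HMaster web UI (placeholder hostname).
url = ("http://HadoopMST-cluster-1:16010/jmx"
       "?qry=Hadoop:service=HBase,name=Master,sub=Server")
bean = requests.get(url, timeout=5).json()["beans"][0]
print("live region servers:", bean["numRegionServers"])
print("dead region servers:", bean["numDeadRegionServers"])
```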

Trino

| Location | Component | URL |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br />HDE-2.0.0 and above: http://{HadoopMST-cluster-1}:9870 |
|  | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
|  | Trino Coordinator | http://{HadoopMST-cluster-1}:8780 |
|  | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
|  | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
|  | Tez UI | http://{HadoopMST-cluster-1}:9999 |
|  | HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 |
|  | Hue | http://{HadoopMST-cluster-1}:8888 |
|  | Zeppelin | http://{HadoopMST-cluster-1}:8180 |
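The Trino coordinator on port 8780 can be queried with the official Python client (pip install trino). A minimal sketch; the hostname, user, catalog, and schema are example values that assume the Hive connector is configured on the cluster.

```python
from trino.dbapi import connect

conn = connect(
    host="HadoopMST-cluster-1",  # placeholder master hostname
    port=8780,                   # Trino coordinator port from the table above
    user="ubuntu",               # example user
    catalog="hive",              # assumes the Hive connector is configured
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```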

Dataflow

| Location | Component | URL |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br />HDE-2.0.0 and above: http://{HadoopMST-cluster-1}:9870 |
|  | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
|  | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
|  | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
|  | Kafka Broker | http://{HadoopMST-cluster-1}:9092 |
|  | Druid Master | http://{HadoopMST-cluster-1}:3001 |
|  | Druid Broker | http://{HadoopMST-cluster-1}:3002 |
|  | Druid Router | http://{HadoopMST-cluster-1}:3008 |
|  | Superset | http://{HadoopMST-cluster-1}:4000 |
|  | Hue | http://{HadoopMST-cluster-1}:8888 |
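The Dataflow pipeline in code: publish events to the Kafka broker on port 9092, then query them through Druid's SQL endpoint via the router on port 3008. A sketch using kafka-python and requests; the hostname, topic, and datasource names are illustrative, and Kafka-to-Druid ingestion must already be configured for the query to return rows.

```python
import json

import requests
from kafka import KafkaProducer  # pip install kafka-python

# Publish a JSON event to the Kafka broker (topic name is illustrative).
producer = KafkaProducer(
    bootstrap_servers="HadoopMST-cluster-1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "click"})
producer.flush()

# Query the ingested data through Druid's SQL endpoint; the router on
# port 3008 proxies SQL queries to the broker.
resp = requests.post(
    "http://HadoopMST-cluster-1:3008/druid/v2/sql",
    json={"query": "SELECT COUNT(*) AS cnt FROM events"},
    timeout=10,
)
print(resp.json())
```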

Instance

Instances can be checked from the Cluster List, and they behave in the same way as standard VMs.

info

For stable operation, master node instances should have at least 16 GB of RAM and worker node instances at least 32 GB.

Volume

A volume is the base storage on which the image is placed when an instance is created, and its size determines the HDFS capacity. Choosing an appropriate size is essential for stable HDFS operation. For detailed information on volumes, refer to the Volume Creation and Management document.
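Because HDFS replicates each block (three copies by default), the usable HDFS capacity is well below the sum of the worker volumes. A back-of-the-envelope sketch with example numbers:

```python
# Example values; adjust to your cluster.
workers = 4
volume_gb_per_worker = 500
replication_factor = 3  # stock HDFS default (dfs.replication)

raw_gb = workers * volume_gb_per_worker
usable_gb = raw_gb / replication_factor
print(f"raw: {raw_gb} GB, usable HDFS capacity: about {usable_gb:.0f} GB")
```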

Network and security

All instances created by Hadoop Eco run inside a VPC. Before a cluster can be created, a security group must exist with the inbound rules its components require. For detailed information on network and security settings, refer to the Security Group document.
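The component URLs listed above double as an inbound-rule checklist for the security group. A small sketch that probes, from a host inside the VPC, whether selected Core Hadoop UI ports answer; the hostname and port list are examples.

```python
import socket

# Example subset of Core Hadoop UI ports from the tables above.
PORTS = {
    "HDFS NameNode UI": 9870,  # 50070 below HDE-2.0.0
    "YARN ResourceManager": 8088,
    "HiveServer2 UI": 10002,
    "Hue": 8888,
}

host = "HadoopMST-cluster-1"  # placeholder master hostname
for name, port in PORTS.items():
    with socket.socket() as s:
        s.settimeout(3)
        reachable = s.connect_ex((host, port)) == 0
        print(f"{name:22s} :{port} -> {'open' if reachable else 'blocked or closed'}")
```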