
Key Concepts

KakaoCloud's Hadoop Eco is a cloud platform service for executing distributed processing tasks using open-source frameworks such as Hadoop, Hive, HBase, Spark, Trino, and Kafka. It provides provisioning services for Hadoop, HBase, Trino, and Dataflow using KakaoCloud's Virtual Machines. The key concepts of the Hadoop Eco service are as follows:

Cluster

A cluster is a set of nodes provisioned using Virtual Machines.

Cluster bundles

Hadoop Eco provides Core Hadoop, HBase, Trino, Dataflow, and Custom bundles.

| Bundle | Description |
| --- | --- |
| Core Hadoop | Hadoop, Hive, Spark, and Tez are installed. Data is stored in HDFS and analyzed using Hive and Spark. |
| HBase | Hadoop and HBase are installed. Data is stored in HDFS, and NoSQL services are provided through HBase. |
| Trino | Hadoop, Trino, Hive, and Tez are installed. Data is stored in HDFS and analyzed using Trino and Hive. |
| Dataflow | Hadoop, Kafka, Druid, and Superset are installed. Data is collected via Kafka and analyzed using Druid and Superset. |
| Custom | Hadoop and ZooKeeper are installed by default, and users can freely select additional components to install. Changing the components of any other bundle automatically switches it to a Custom bundle. |
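
On a Core Hadoop cluster, for example, a typical workflow reads data from HDFS and analyzes it with Spark SQL. The following is a minimal PySpark sketch; the HDFS path, hostname, and NameNode RPC port (8020) are illustrative assumptions, not values fixed by Hadoop Eco.

```python
# Minimal PySpark sketch for a Core Hadoop bundle cluster.
# The HDFS URI, path, and port below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-analysis-example")
    .enableHiveSupport()  # lets Spark SQL read Hive tables as well
    .getOrCreate()
)

# Read a CSV file stored in HDFS (replace host/port/path with your cluster's).
df = spark.read.option("header", "true").csv(
    "hdfs://HadoopMST-cluster-1:8020/data/events.csv"
)
df.createOrReplaceTempView("events")

# Analyze with Spark SQL.
spark.sql("SELECT count(*) AS n FROM events").show()

spark.stop()
```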

Cluster availability types

To ensure operational stability, Hadoop Eco provides Standard (Single) and High Availability (HA) types.

| Availability type | Description |
| --- | --- |
| Standard (Single) | Consists of 1 Master node and multiple Worker nodes. Because there is only one Master node, HDFS and YARN may stop operating if it fails. |
| High Availability (HA) | Consists of 3 Master nodes and multiple Worker nodes. HDFS and YARN are configured for HA and automatically recover the Master on failure. |

Cluster versions

The installed component versions are determined by the Hadoop Eco (HDE) version. HDE clusters offer the Core Hadoop bundle for data analysis and the HBase bundle for HDFS-based NoSQL services; starting from HDE version 1.1.2, the Trino and Dataflow bundles are also available. HDE version 2.0.1 supports Hadoop 3.x, HBase 2.x, and Hive 3.x.


Components installed by bundle per cluster version

[Image: Core Hadoop components by cluster version]

Cluster lifecycle

A Hadoop Eco cluster moves through a defined lifecycle of states, which you can use to check its operational and task status and to perform management functions. The lifecycle covers states such as installation, operation, and deletion after the initial creation request. Depending on how the user operates the cluster, the status of the cluster and the statuses of its instances may differ.

[Image: Cluster lifecycle]

Cluster and node status

| Status | Description |
| --- | --- |
| Initializing | User request meta-information is stored, and VM creation has been requested. |
| Creating | VMs are being created. |
| Installing | Hadoop Eco components are being installed on the created VMs. |
| Starting | Hadoop Eco components are being started. |
| Running | All components are running and the cluster is in operation. |
| Running(Scale out initializing) | VM creation has been requested for a cluster expansion. |
| Running(Scale out creating) | Expansion VMs are being created. |
| Running(Scale out installing) | Hadoop Eco components are being installed on the created VMs. |
| Running(Scale out starting) | Components are being started. |
| Running(Scale out running) | Verifying that the existing cluster and the expanded VMs operate correctly. |
| Running(Scale in initializing) | A cluster reduction request was received; checking whether the target VMs can be deleted. |
| Running(Scale in ready) | Terminating components on the VMs targeted for reduction. |
| Running(Scale in starting) | Checking whether component termination on the target VMs succeeded. |
| Running(Scale in terminating) | VMs are being deleted. |
| Failed to scale out | Failed to create VMs for expansion. |
| Failed to scale out vm | Failed to install or start components on the expansion target VMs. |
| Failed to scale in | Failed while deleting the reduction target VMs. |
| Failed to scale in vm | Failed to terminate components normally on the reduction target VMs. |
| Terminating | The cluster is being terminated. |
| Terminated(User) | The cluster was terminated by the user. |
| Terminated(UserCommand) | The cluster was terminated because task scheduling finished normally. |
| Terminated(Scale in) | The cluster was scaled in and the target VMs were terminated normally. |
| Terminated(Error) | The cluster was terminated due to an error. |
| Terminated(Failed to create vm) | An error occurred during VM creation. |
| Terminated(Failed to destroy vm) | An error occurred during VM termination. |
| Terminated(Check time over) | The cluster creation time limit was exceeded while starting components. |
| Terminated(Install error) | Component installation or startup failed during cluster creation, so the cluster was terminated. |
| Terminated(Failed to scale out) | VMs were terminated because cluster expansion failed. |
| Terminated(Failed to scale in) | VMs were forcibly terminated after components failed to terminate normally. |
| Terminated(User deleted VM) | The user directly deleted a Hadoop Eco cluster VM. |
| Pending | Hadoop Eco creation can be requested after Open API activation. |
| Processing | Hadoop Eco creation and job scheduling are in progress after Open API activation. |
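
When a cluster is driven through the Open API, a client typically polls the cluster status until it leaves the transitional states. The sketch below is hypothetical: the endpoint path, header name, and response field are placeholders, since the actual Hadoop Eco Open API schema is not shown here.

```python
# Hypothetical polling loop for cluster status via the Open API.
# ENDPOINT, the auth header, and the "status" field are placeholders,
# not the documented Hadoop Eco API schema.
import time
import requests

ENDPOINT = "https://hadoop-eco.example.com/v2/clusters/{cluster-id}"  # placeholder
HEADERS = {"X-Auth-Token": "YOUR_TOKEN"}  # placeholder auth header

def wait_until_running(timeout_s: int = 3600, interval_s: int = 30) -> str:
    """Poll until the cluster reaches Running, or a terminal/failed state."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(ENDPOINT, headers=HEADERS, timeout=10).json()["status"]
        if status == "Running":
            return status
        if status.startswith(("Terminated", "Failed")):
            raise RuntimeError(f"cluster ended in state: {status}")
        time.sleep(interval_s)  # still Initializing/Creating/Installing/Starting
    raise TimeoutError("cluster did not reach Running in time")
```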

Instance and cluster states

Instances are the KakaoCloud Virtual Machines that make up a cluster, and an instance's status can differ from the cluster's status.
Cases where the statuses differ include:

  • If the availability type is Standard (Single) and the Master node instance is not in the Active state, the cluster cannot operate normally.
  • If the availability type is HA, the cluster can still operate normally as long as one of the instances for Master node 1 or 2 is in the Active state; the sketch below shows how to check which Master currently holds the active NameNode.
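
A minimal way to perform that check is to query each Master's NameNode JMX endpoint, as sketched below (assuming HDE 2.0.0 or higher, where the NameNode web UI listens on port 9870; the hostnames are placeholders).

```python
# Query each Master's NameNode JMX endpoint to find the active NameNode.
# Hostnames are placeholders; port 9870 assumes HDE 2.0.0 or higher
# (use 50070 below HDE 2.0.0).
import requests

MASTERS = ["HadoopMST-cluster-1", "HadoopMST-cluster-2"]  # HA NameNode hosts

for host in MASTERS:
    url = f"http://{host}:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
    try:
        bean = requests.get(url, timeout=5).json()["beans"][0]
        print(f"{host}: {bean['State']}")  # 'active' or 'standby'
    except requests.RequestException as exc:
        print(f"{host}: unreachable ({exc})")
```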

Components

The list of components running on a Hadoop Eco cluster is as follows:

Core Hadoop

| Location | Component | Address |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br/>HDE-2.0.0 or higher: http://{HadoopMST-cluster-1}:9870 |
| | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
| | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
| | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
| | SparkHistoryServer | http://{HadoopMST-cluster-1}:18082 |
| | SparkThriftServer | http://{HadoopMST-cluster-1}:20000 |
| | TezUI | http://{HadoopMST-cluster-1}:9999 |
| | HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 |
| | Hue | http://{HadoopMST-cluster-1}:8888 |
| | Zeppelin | http://{HadoopMST-cluster-1}:8180 |
| | Oozie | http://{HadoopMST-cluster-1}:11000 |
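
Most of these components serve HTTP endpoints at the addresses above. As a quick liveness check, the sketch below calls the YARN ResourceManager's standard REST path /ws/v1/cluster/info; the hostname is a placeholder for your Master 1 address.

```python
# Query the YARN ResourceManager REST API for basic cluster info.
# The hostname is a placeholder; /ws/v1/cluster/info is the standard
# Hadoop ResourceManager REST path.
import requests

host = "HadoopMST-cluster-1"  # replace with your Master 1 address
info = requests.get(f"http://{host}:8088/ws/v1/cluster/info", timeout=5).json()
print(info["clusterInfo"]["state"])                   # e.g. 'STARTED'
print(info["clusterInfo"]["resourceManagerVersion"])  # installed YARN version
```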

HBase

| Location | Component | Address |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br/>HDE-2.0.0 or higher: http://{HadoopMST-cluster-1}:9870 |
| | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
| | HMaster | http://{HadoopMST-cluster-1}:16010 |
| | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
| | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
| | Hue | http://{HadoopMST-cluster-1}:8888 |
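
Like the other web UIs, the HMaster UI on port 16010 also exposes a JMX servlet that a script can poll to confirm the master is up. A sketch, assuming the standard HBase Master metrics bean; the hostname is a placeholder.

```python
# Check the HBase HMaster via its web UI's JMX servlet on port 16010.
# The hostname is a placeholder; the bean name is the standard HBase
# Master metrics bean.
import requests

host = "HadoopMST-cluster-1"
url = f"http://{host}:16010/jmx?qry=Hadoop:service=HBase,name=Master,sub=Server"
beans = requests.get(url, timeout=5).json()["beans"]
if beans:
    print("HMaster is up; averageLoad =", beans[0].get("averageLoad"))
```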

Trino

| Location | Component | Address |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br/>HDE-2.0.0 or higher: http://{HadoopMST-cluster-1}:9870 |
| | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
| | Trino Coordinator | http://{HadoopMST-cluster-1}:8780 |
| | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
| | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
| | TezUI | http://{HadoopMST-cluster-1}:9999 |
| | HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 |
| | Hue | http://{HadoopMST-cluster-1}:8888 |
| | Zeppelin | http://{HadoopMST-cluster-1}:8180 |
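
SQL can be submitted to the Trino Coordinator on port 8780 with the trino Python client, as sketched below; the hostname, user, catalog, and schema are placeholder assumptions.

```python
# Query Trino through the coordinator on port 8780 using the
# `trino` Python client (pip install trino). The hostname, user,
# catalog, and schema below are placeholders.
import trino

conn = trino.dbapi.connect(
    host="HadoopMST-cluster-1",
    port=8780,
    user="hadoop",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW TABLES")
for (table_name,) in cur.fetchall():
    print(table_name)
```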

Dataflow

| Location | Component | Address |
| --- | --- | --- |
| Master 1 | HDFS NameNode | Below HDE-2.0.0: http://{HadoopMST-cluster-1}:50070<br/>HDE-2.0.0 or higher: http://{HadoopMST-cluster-1}:9870 |
| | YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 |
| | TimelineServer | http://{HadoopMST-cluster-1}:8188 |
| | JobHistoryServer | http://{HadoopMST-cluster-1}:19888 |
| | Kafka Broker | http://{HadoopMST-cluster-1}:9092 |
| | Druid Master | http://{HadoopMST-cluster-1}:3001 |
| | Druid Broker | http://{HadoopMST-cluster-1}:3002 |
| | Druid Router | http://{HadoopMST-cluster-1}:3008 |
| | Superset | http://{HadoopMST-cluster-1}:4000 |
| | Hue | http://{HadoopMST-cluster-1}:8888 |
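
On a Dataflow cluster, data enters through Kafka; per the table above, the broker listens on port 9092. A minimal producer sketch using the kafka-python package; the hostname and topic name are placeholders.

```python
# Send a message to the Kafka broker on port 9092 using kafka-python
# (pip install kafka-python). The hostname and topic are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="HadoopMST-cluster-1:9092")
producer.send("events", b'{"user": 42, "action": "click"}')
producer.flush()  # block until the message is actually delivered
producer.close()
```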

Instance

Instances can be viewed in the Cluster List and managed in the same way as general Virtual Machines.

info

For stable operation, it is recommended that Master node instances have at least 16GB of memory and Worker node instances have at least 32GB.

Volume

A volume is the base storage on which an instance's image is configured at creation time, and it determines the cluster's HDFS capacity. Select an appropriate size to ensure stable HDFS operation; a rough sizing calculation is sketched below. For more details on volumes, refer to the Create and Manage Volumes guide.
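
As a rough rule, usable HDFS capacity is the total Worker volume, minus some non-HDFS overhead, divided by the HDFS replication factor (Hadoop's default dfs.replication is 3). The sketch below illustrates the arithmetic; the node count, volume size, and overhead margin are example assumptions.

```python
# Rough HDFS capacity estimate. The node count, volume size, and
# non-HDFS overhead margin are example assumptions; the replication
# factor of 3 is the Hadoop default (dfs.replication).
def usable_hdfs_tb(workers: int, volume_tb: float,
                   replication: int = 3, overhead: float = 0.2) -> float:
    raw = workers * volume_tb * (1 - overhead)  # reserve space for logs, temp, OS
    return raw / replication

print(usable_hdfs_tb(workers=5, volume_tb=2.0))  # -> ~2.67 TB usable
```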

Network and Security

All instances created in Hadoop Eco are provided in a VPC environment. To configure a cluster, you must create security groups and set inbound rules for the ports used by the cluster's components; a simple way to verify that a port is reachable is sketched below. For more details on network and security settings, refer to the Security Groups guide.
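
Once inbound rules are in place, a plain TCP reachability check confirms whether a component port is open from a given network location. A sketch, with a placeholder hostname and ports taken from the component tables above:

```python
# Check whether component ports are reachable after setting inbound
# rules (the hostname and port list are placeholders for your cluster).
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (9870, 8088, 8888):  # NameNode UI, ResourceManager, Hue
    print(port, "open" if port_open("HadoopMST-cluster-1", port) else "closed")
```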