Key Concepts
KakaoCloud's Hadoop Eco is a cloud platform service designed to execute distributed processing tasks using open-source frameworks such as Hadoop, Hive, HBase, Spark, Trino, and Kafka. KakaoCloud provides provisioning services for Hadoop, HBase, Trino, and Dataflow using Virtual Machines.
The key concepts of the Hadoop Eco service are as follows.
Cluster
A cluster is a group of nodes provisioned using Virtual Machines.
Cluster types
Hadoop Eco offers the following types: Core Hadoop, HBase, Trino, and Dataflow.
Type | Description |
---|---|
Core Hadoop | Includes Hadoop, Hive, Spark, and Tez - Stores data in HDFS and analyzes it using Hive and Spark |
HBase | Includes Hadoop and HBase - Stores data in HDFS and provides NoSQL services using HBase |
Trino | Includes Hadoop, Trino, Hive, and Tez - Stores data in HDFS and analyzes it using Trino and Hive |
Dataflow | Includes Hadoop, Kafka, Druid, and Superset - Collects data using Kafka and analyzes it using Druid and Superset |
Cluster availability types
To ensure operational stability, cluster availability is categorized into Standard (Single) and High Availability (HA) types.
Availability type | Description |
---|---|
Standard (Single) | Composed of one master node and multiple worker nodes - If the master node fails, HDFS and YARN may stop working |
High Availability (HA) | Composed of three master nodes and multiple worker nodes - HDFS and YARN are configured for HA, allowing automatic recovery in case of a master node failure |
Cluster versions
The version of Hadoop Eco determines the component versions installed. A Hadoop Eco (HDE) cluster supports the Core Hadoop type for data analysis and the HBase type for HDFS-based NoSQL services; from HDE 1.1.2, the Trino and Dataflow types are also supported.
HDE 2.0.1 supports Hadoop 3.x, HBase 2.x, and Hive 3.x versions.
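Because HDE 2.0.0 and later bundle Hadoop 3.x, the HDFS NameNode web UI port differs by version (50070 before HDE 2.0.0, 9870 from HDE 2.0.0, as listed in the component tables below). A minimal Python sketch of that mapping (the helper name is illustrative, not part of the service):

```python
def namenode_ui_port(hde_version: str) -> int:
    """Return the HDFS NameNode web UI port for a given HDE version.

    Versions before HDE 2.0.0 bundle Hadoop 2.x (UI on 50070);
    HDE 2.0.0 and later bundle Hadoop 3.x (UI on 9870).
    """
    parts = tuple(int(p) for p in hde_version.split("."))
    return 50070 if parts < (2, 0, 0) else 9870

print(namenode_ui_port("1.1.2"))  # 50070
print(namenode_ui_port("2.0.1"))  # 9870
```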
Components installed by cluster type per version
- Core Hadoop
- HBase
- Trino
- Dataflow
Cluster lifecycle
A Hadoop Eco cluster has various states and lifecycles, allowing users to monitor and manage operational and task statuses. After the initial creation request, its status transitions through installation, operation, and deletion stages. The statuses of clusters and instances may differ depending on user operations.
Cluster and node statuses
Status | Description |
---|---|
Initializing | Metadata is stored, and a VM creation request is sent |
Creating | VM is being created |
Installing | Hadoop Eco components are being installed on the created VM |
Starting | Hadoop Eco components are starting |
Running | All components are running, and the cluster is operational |
Running(Scale out initializing) | VM creation request due to cluster expansion |
Running(Scale out creating) | VM is being created |
Running(Scale out installing) | Hadoop Eco components are being installed on the new VM |
Running(Scale out starting) | Components are starting |
Running(Scale out running) | Verifying cluster operation with the expanded VMs |
Running(Scale in initializing) | Verifying VMs for removal during cluster reduction |
Running(Scale in ready) | Components on target VMs are shutting down |
Running(Scale in starting) | Verifying component shutdown on target VMs |
Running(Scale in terminating) | VMs are being deleted |
Failed to scale out | VM creation for expansion failed |
Failed to scale out vm | Component installation or startup failed on the expansion VM |
Failed to scale in | VM deletion during scale-in failed |
Failed to scale in vm | Component shutdown failed on the scale-in VM |
Terminating | Cluster is being terminated |
Terminated(User) | Cluster was terminated by the user |
Terminated(UserCommand) | Cluster was terminated after its scheduled job completed successfully |
Terminated(Scale in) | Cluster scaled down successfully |
Terminated(Error) | Cluster was terminated due to an error |
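The statuses above fall into a few coarse families, which is useful when monitoring a cluster programmatically. A hedged Python sketch (the grouping is my own reading of the status table, not an official classification):

```python
def status_family(status: str) -> str:
    """Map a Hadoop Eco cluster status string to a coarse family.

    Grouping follows the status table: any "Running(...)" status is
    operational (including scale in/out sub-states), "Terminated(...)"
    is final, "Failed ..." needs attention, and the remaining statuses
    (Initializing, Creating, Installing, Starting, Terminating) are
    transitional.
    """
    if status.startswith("Running"):
        return "operational"
    if status.startswith("Terminated"):
        return "terminal"
    if status.startswith("Failed"):
        return "failed"
    return "transitional"

print(status_family("Running(Scale out creating)"))  # operational
print(status_family("Terminated(User)"))             # terminal
```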
Instance and cluster states
Instances are KakaoCloud Virtual Machines that make up a cluster. The states of instances and clusters may differ.
- If the master node instance is not in the `Active` state, a `Single` availability cluster cannot function properly.
- For `HA` availability, the cluster can operate normally if at least one master node instance (Master 1 or 2) is in the `Active` state.
Components
The following are the components running in a Hadoop Eco cluster.
Core Hadoop
- Standard (Single)
- HA (High Availability)
Location | Component | Address |
---|---|---|
Master 1 | HDFS Namenode | Before HDE 2.0.0: http://{HadoopMST-cluster-1}:50070 HDE 2.0.0 and later: http://{HadoopMST-cluster-1}:9870 |
YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 | |
TimelineServer | http://{HadoopMST-cluster-1}:8188 | |
JobHistoryServer | http://{HadoopMST-cluster-1}:19888 | |
SparkHistoryServer | http://{HadoopMST-cluster-1}:18082 | |
SparkThriftServer | http://{HadoopMST-cluster-1}:20000 | |
Tez UI | http://{HadoopMST-cluster-1}:9999 | |
HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 | |
Hue | http://{HadoopMST-cluster-1}:8888 | |
Zeppelin | http://{HadoopMST-cluster-1}:8180 | |
Oozie | http://{HadoopMST-cluster-1}:11000 |
Location | Component | Address |
---|---|---|
Master 1 | HDFS Namenode | Before HDE 2.0.0: http://{HadoopMST-cluster-1}:50070 HDE 2.0.0 and later: http://{HadoopMST-cluster-1}:9870 |
YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 | |
HiveServer2 (HS2) | http://{HadoopMST-cluster-1}:10002 | |
Master 2 | HDFS Namenode | Before HDE 2.0.0: http://{HadoopMST-cluster-2}:50070 HDE 2.0.0 and later: http://{HadoopMST-cluster-2}:9870 |
YARN ResourceManager | http://{HadoopMST-cluster-2}:8088 | |
HiveServer2 (HS2) | http://{HadoopMST-cluster-2}:10002 | |
Master 3 | TimelineServer | http://{HadoopMST-cluster-3}:8188 |
JobHistoryServer | http://{HadoopMST-cluster-3}:19888 | |
SparkHistoryServer | http://{HadoopMST-cluster-3}:18082 | |
SparkThriftServer | http://{HadoopMST-cluster-3}:20000 | |
Tez UI | http://{HadoopMST-cluster-3}:9999 | |
HiveServer2 (HS2) | http://{HadoopMST-cluster-3}:10002 | |
Hue | http://{HadoopMST-cluster-3}:8888 | |
Zeppelin | http://{HadoopMST-cluster-3}:8180 | |
Oozie | http://{HadoopMST-cluster-3}:11000 |
HBase
- Standard (Single)
Location | Component | Address |
---|---|---|
Master 1 | HDFS Namenode | Before HDE 2.0.0: http://{HadoopMST-cluster-1}:50070 HDE 2.0.0 and later: http://{HadoopMST-cluster-1}:9870 |
YARN ResourceManager | http://{HadoopMST-cluster-1}:8088 | |
HMaster | http://{HadoopMST-cluster-1}:16010 | |
TimelineServer | http://{HadoopMST-cluster-1}:8188 | |
JobHistoryServer | http://{HadoopMST-cluster-1}:19888 | |
Hue | http://{HadoopMST-cluster-1}:8888 |
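The addresses in the tables above all follow the same `http://{host}:{port}` pattern, so the UI URLs can be derived from the component's port. A small Python sketch (the port table is copied from the Core Hadoop Standard table above; the helper name is my own):

```python
# Component web UI ports from the Core Hadoop (Standard) table above.
# The NameNode UI uses 50070 instead of 9870 before HDE 2.0.0.
CORE_HADOOP_UI_PORTS = {
    "HDFS Namenode": 9870,
    "YARN ResourceManager": 8088,
    "TimelineServer": 8188,
    "JobHistoryServer": 19888,
    "SparkHistoryServer": 18082,
    "SparkThriftServer": 20000,
    "Tez UI": 9999,
    "HiveServer2 (HS2)": 10002,
    "Hue": 8888,
    "Zeppelin": 8180,
    "Oozie": 11000,
}

def component_url(host: str, component: str) -> str:
    """Build the web UI address for a component on the given master host."""
    return f"http://{host}:{CORE_HADOOP_UI_PORTS[component]}"

print(component_url("HadoopMST-cluster-1", "YARN ResourceManager"))
# http://HadoopMST-cluster-1:8088
```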
Instance
Instances can be viewed in the Cluster list and are managed similarly to standard VMs.
For stable operation, at least 16 GB of memory is recommended for master node instances and 32 GB for worker node instances.
Volume
A volume is the default storage configured when creating an instance and determines the capacity of HDFS. For stable HDFS operation, an appropriate size must be selected.
For detailed information about volumes, refer to the Create and manage volumes document.
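When sizing volumes, note that usable HDFS capacity is roughly the total raw worker storage divided by the HDFS replication factor (3 by default in HDFS). A rough Python estimate, assuming no OS or non-HDFS reserved space (a simplification):

```python
def usable_hdfs_capacity_gib(worker_count: int,
                             volume_gib: int,
                             replication: int = 3) -> float:
    """Estimate usable HDFS capacity in GiB.

    Raw worker storage divided by the replication factor (HDFS default 3).
    Space reserved for the OS and non-HDFS files is ignored here, so the
    real figure will be somewhat lower.
    """
    return worker_count * volume_gib / replication

print(usable_hdfs_capacity_gib(4, 300))  # 400.0
```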
Network and security
All instances created in Hadoop Eco are provided within a VPC environment. To configure a cluster, a security group must be created, and Inbound rules must be configured for component communication.
For detailed information on network and security settings, refer to the Security group document.
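The inbound rules typically need to allow the component web UI ports listed in the tables above. A hypothetical sketch of generating one rule per port (the rule field names are illustrative only and do not reflect the KakaoCloud security group API):

```python
# Component UI ports from the Core Hadoop tables above (9870 becomes
# 50070 before HDE 2.0.0).
UI_PORTS = [9870, 8088, 8188, 19888, 18082, 20000, 9999, 10002, 8888, 8180, 11000]

def inbound_rules(source_cidr: str) -> list[dict]:
    """Generate one TCP inbound rule per component UI port.

    Field names ("protocol", "port", "source") are illustrative, not
    the actual KakaoCloud security group schema.
    """
    return [
        {"protocol": "tcp", "port": port, "source": source_cidr}
        for port in UI_PORTS
    ]

rules = inbound_rules("10.0.0.0/16")
print(len(rules))  # 11
```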