Create cluster
A cluster is a set of nodes provisioned using Virtual Machines.
The process to create a cluster in the Hadoop Eco service is as follows.
It takes approximately 20-25 minutes to create a cluster.
Step 1. Configure cluster
Set the basic information for creating a Hadoop cluster.
- Go to KakaoCloud console > Analytics > Hadoop Eco menu.
- In the Cluster menu, select the [Create cluster] button located at the top right.
- In Step 1: Configure cluster, enter the information and select the [Next] button.
Item | Category | Description |
---|---|---|
Cluster name | | Example: my-cluster-01 - Cluster names must be unique within the same project - VMs are created based on the cluster name - Master node: created in the format HadoopMST-{ClusterName}-{Number} (e.g., HadoopMST-my-cluster-01-1) - Worker node: created in the format HadoopWRK-{ClusterName}-{Number} (e.g., HadoopWRK-my-cluster-01-1) |
Cluster configuration | Cluster version | Select the cluster version |
 | Cluster type | Select the cluster type based on the cluster version - For a detailed explanation, refer to Cluster version and type |
 | Cluster availability | Provides Standard and High availability types for operational stability - Standard (Single, 1 master node instance): Resource Manager and Name Node run in 1 instance; creates a single master node, suitable for small-scale tasks - High availability (HA, 3 master node instances): Resource Manager and Name Node run in HA mode; creates 3 master nodes, allowing uninterrupted tasks even during reboots |
Administrator settings | Admin ID | Enter the admin ID |
 | Admin password | Enter the admin password - For details on resetting the password, refer to Hue password reset - When Ranger is applied, a specific password creation rule must be followed; for details, refer to Ranger application |
 | Confirm admin password | Enter the same admin password |
VPC settings | VPC | Select a VPC and subnet - Select Management page to go to VPC - A public IP accessible from external sources can be assigned after instance creation in the Assign public IP menu |
 | Security group | Create new security group: enter a name to create a new security group (inbound/outbound rules required for Hadoop Eco are set automatically) - Select an existing security group: check its inbound/outbound rules (select the [Refresh] icon to fetch network information) - For a detailed explanation, refer to Security group |
Cluster bundles and components
Cluster version | Cluster bundle | Component options |
---|---|---|
Hadoop Eco 1.0.1 (Trino, Dataflow types not supported) | Core hadoop | Hadoop 2.10.1, Hive 2.3.2, Hue 4.11.0, Oozie 5.2.1, Spark 2.4.6, Tez 0.9.2, Zeppelin 0.10.0, Zookeeper 3.5.7 |
 | Hbase | Hadoop 2.10.1, HBase 1.4.13, Hue 4.11.0, Zookeeper 3.5.7 |
Hadoop Eco 1.1.2 | Core hadoop | Hadoop 2.10.2, Flink 1.14.4, Hive 2.3.9, Hue 4.11.0, Oozie 5.2.1, Spark 2.4.8, Sqoop 1.4.7, Tez 0.9.2, Zeppelin 0.10.1, Zookeeper 3.8.0 |
 | Hbase | Hadoop 2.10.2, HBase 1.7.1, Hue 4.11.0, Zookeeper 3.8.0 |
 | Trino | Hadoop 2.10.2, Hive 2.3.9, Hue 4.11.0, Tez 0.9.2, Trino 377, Zeppelin 0.10.1, Zookeeper 3.8.0 |
 | Dataflow | Hadoop 2.10.2, Druid 25.0.0, Hue 4.11.0, Kafka 3.4.0, Superset 2.1.1, Zookeeper 3.8.0 |
Hadoop Eco 2.0.1 | Core hadoop | Hadoop 3.3.4, Flink 1.15.1, Hive 3.1.3, Hue 4.11.0, Oozie 5.2.1, Spark 3.2.2, Sqoop 1.4.7, Tez 0.10.1, Zeppelin 0.10.1, Zookeeper 3.8.0 |
 | Hbase | Hadoop 3.3.4, HBase 2.4.13, Hue 4.11.0, Zookeeper 3.8.0 |
 | Trino | Hadoop 3.3.4, Hive 3.1.3, Hue 4.11.0, Tez 0.10.1, Trino 393, Zeppelin 0.10.1, Zookeeper 3.8.0 |
 | Dataflow | Hadoop 3.3.4, Druid 25.0.0, Hue 4.11.0, Kafka 3.4.0, Superset 2.1.1, Zookeeper 3.8.0 |
Hadoop Eco 2.1.0 | Core hadoop | Hadoop 3.3.6, Flink 1.20.0, Hive 3.1.3, Hue 4.11.0, Oozie 5.2.1, Spark 3.5.2, Sqoop 1.4.7, Tez 0.10.2, Zeppelin 0.10.1, Zookeeper 3.9.2 |
 | Hbase | Hadoop 3.3.6, HBase 2.6.0, Hue 4.11.0, Zookeeper 3.9.2 |
 | Trino | Hadoop 3.3.6, Hive 3.1.3, Hue 4.11.0, Tez 0.10.2, Trino 393, Zeppelin 0.10.1, Zookeeper 3.9.2 |
 | Dataflow | Hadoop 3.3.6, Druid 25.0.0, Hue 4.11.0, Kafka 3.8.0, Superset 2.1.1, Zookeeper 3.9.2 |
Hadoop Eco 2.2.0 | Core hadoop | Hadoop 3.3.6, Flink 1.20.0, Hive 3.1.3, Hue 4.11.0, Oozie 5.2.1, Spark 3.5.2, Sqoop 1.4.7, Tez 0.10.2, Zeppelin 0.10.1, Zookeeper 3.9.2 |
 | Hbase | Hadoop 3.3.6, HBase 2.6.0, Hue 4.11.0, Phoenix 5.2.1, Zookeeper 3.9.2 |
 | Trino | Hadoop 3.3.6, Hive 3.1.3, Hue 4.11.0, Tez 0.10.2, Trino 436, Zeppelin 0.10.1, Zookeeper 3.9.2 |
 | Dataflow | Hadoop 3.3.6, Druid 25.0.0, Hue 4.11.0, Kafka 3.8.0, Superset 2.1.1, Zookeeper 3.9.2 |
- Hadoop Eco 1.0.0, 1.1.0, 1.1.1, and 2.0.0 versions are not supported.
- When default components of a bundle are modified, it is automatically treated as a custom bundle.
Configure security group
When creating a cluster, a security group with necessary component ports is automatically generated.
- Based on the selected subnet ID, a security group is automatically created. If a cluster already exists in the same subnet, the existing security group is reused.
- Example security group name: HDE-{%subnet_ID%}
- You can apply existing security groups additionally through the extra security group setting.
Ports configured in the auto-generated security group are as follows.
Inbound rules
Protocol | Source | Port | Description |
---|---|---|---|
ALL | VPC subnet CIDR | ALL | Hadoop Eco internal |
Outbound rules
Outbound is set to allow all traffic.
Step 2. Configure instance
Configure master and worker instances, storage, and network.
Enter the required information in Step 2: Configure instance, then click the [Next] button.
- After cluster creation, instance and disk volume settings cannot be changed.
- Adding master/worker instances or disk volumes will be supported in the future.
Item | Category | Description |
---|---|---|
Master node config | Master node count | Fixed based on cluster availability - Standard (Single) type: 1 - HA type: 3 |
 | Master node type | Choose from supported instance types - Hardware configuration varies by instance type |
 | Disk volume type / size | - Volume type: only SSD supported (others to be supported later) - Size: 50–5,120 GB |
Worker node config | Worker node count | Can be set based on purpose, within project quota limits |
 | Worker node type | Choose from supported instance types - Hardware configuration varies by instance type |
 | Disk volume type / size | - Volume type: only SSD supported (others to be supported later) - Size: 50–5,120 GB |
Total YARN usage | YARN Core | Calculated as 'Number of worker nodes × vCPUs per node' |
 | YARN Memory | Calculated as 'Worker node count × memory per node × YARN allocation ratio (0.8)' - See the worked example below this table |
Key pair | | Select the key pair for the instances - Use an existing or newly created KakaoCloud key pair - See Create new key pair for details - Click Admin page to navigate to Virtual Machine > Key pair |
User script (optional) | | Script that automatically configures the environment at instance startup via user data |
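To illustrate the YARN totals, suppose a hypothetical cluster of 5 worker nodes whose instance type provides 8 vCPUs and 64 GB of memory each (the actual figures depend on the instance type you choose): the cluster would report 5 × 8 = 40 YARN cores and 5 × 64 GB × 0.8 = 256 GB of YARN memory.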
Create new key pair
To create a key pair during cluster creation, follow the steps below:
- Select Create new key pair and enter a key pair name.
- Click the [Create and download key pair] button.
- A private key file with a .pem extension will be downloaded using the specified key pair name.
Be sure to store the downloaded private key file securely.
Step 3. Advanced settings (optional)
1. Configure service integration (optional)
Apply settings for cluster service integration. Integration is available with KakaoCloud's Data Catalog, MySQL, and MemStore services.
If service integration is not configured, Standard (Single) type installs MySQL on master node 1, and HA type installs MySQL on all 3 master nodes for use as a metastore.
In Configure service integration, choose whether to install the monitoring agent and configure the desired service integration.
Item | Description |
---|---|
Install monitoring agent | Select whether to install the monitoring agent |
Integrate external storage | Hive metastore: None / Integrate with Data Catalog / Integrate with MySQL (only available if Hive is selected as a component) - Superset cache store: None / Integrate with MemStore (only available if Superset is selected as a component) |
Install monitoring agent
When the monitoring agent is installed, node monitoring becomes available in the Hadoop Eco > Cluster details > Monitoring tab:
- CPU usage (%) per node
- Memory usage (%) per node
Integrate with Data Catalog
- Prepare a pre-created Data Catalog for integration. For details on creating a catalog, refer to Create catalog.
- In Configure service integration (optional), select Data Catalog integration.
- Confirm Hadoop network/subnet info in the integration section, then select a desired catalog.
Integrate with MySQL
- Prepare a pre-created MySQL instance for integration. For details, see Create MySQL instance group.
- In Configure service integration (optional), select MySQL integration:
- Choose the instance where MySQL is installed.
- After selecting the instance, enter the database name, MySQL ID, and password.
Integrate with MemStore
MemStore integration is only available when the Dataflow bundle or Superset component is selected.
- Create a MemStore instance. For instructions, refer to Create MemStore cluster.
- In Configure service integration (optional), select MemStore integration:
- Choose the MemStore to integrate in the MemStore name field.
- Depending on whether cluster mode is used in MemStore, fields for Superset Cache DB ID and Superset Query Cache DB ID may appear:
- If cluster mode is enabled: no additional input required
- If cluster mode is disabled: you can enter Superset Cache DB ID and Superset Query Cache DB ID, or leave them blank to use the default (0, 1)
2. Configure cluster details (optional)
You can configure HDFS block size, replication factor, and other cluster component settings. HDFS settings take precedence over cluster component settings.
Item | Description |
---|---|
HDFS settings | HDFS block size: sets the dfs.blocksize value in hdfs-site.xml; can be set between 1–1,024 MB (default: 128 MB) - HDFS replication factor: sets the dfs.replication value in hdfs-site.xml; can be set between 1–500 and must not exceed the number of worker nodes |
Cluster configuration (optional) | Enter component-specific configurations for the cluster - Either upload a JSON file or enter the values directly (a minimal example follows this table) - For object storage integration, see Integrate with Object Storage |
Log storage settings | Choose whether to configure log storage |
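As a minimal sketch of what the Cluster configuration (optional) field accepts (the full format is described under Cluster configuration - component settings below), the following JSON sets the HDFS block size and replication factor through the hdfs-site classification. The property names are standard hdfs-site.xml settings; the values are illustrative only (134217728 bytes = 128 MB), and anything entered in the HDFS settings fields above takes precedence over the same properties supplied here.
{
  "configurations":
  [
    {
      "classification": "hdfs-site",
      "properties": {
        "dfs.blocksize": "134217728",
        "dfs.replication": "2"
      }
    }
  ]
}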
Cluster configuration - component settings
Cluster configuration is defined using a JSON file in key-value pair format. The configurations field is a list, where classification represents the config file name and properties contains the configuration parameters. The basic format is shown below.
-- Format
{
"configurations":
[
{
"classification": "file name",
"properties": {
"property name": "value"
}
}
]
}
-- Example
{
"configurations":
[
{
"classification": "core-site",
"properties": {
"dfs.blocksize": "67108864"
}
}
]
}
Depending on the configuration file name, the format is classified as env, xml, properties, or user-env.
Format | Description |
---|---|
env | Input keys are mapped to predefined values; only specific keys can be modified |
xml | Input key-value pairs are written as XML elements (<name> , <value> ) |
properties | Input key-value pairs are written as plain text property entries |
user-env | Adds user-specific environment variables using the format user-env:<username> |
User-provided configurations are inserted into the appropriate files based on the classification name.
env
Classification | File path | Setting name | Sample value |
---|---|---|---|
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048 |
 | | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m" |
 | | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m" |
 | | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m" |
 | | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m" |
 | | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m" |
 | | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m" |
 | | hadoop_env_hadoop_log_level | INFO, DRFA, console |
 | | hadoop_env_hadoop_security_log_level | INFO, DRFAS |
 | | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT |
mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000 |
hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048 |
 | | hive_env_hiveserver2_heapsize | 2048 |
 | | hive_env_hadoop_client_opts | "-Xmx2048m" |
hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m" |
 | | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m" |
spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g |
trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G |
xml
Classification | File path | Reference link | Note |
---|---|---|---|
core-site | /etc/hadoop/conf | core-default.xml | |
hdfs-site | /etc/hadoop/conf | hdfs-default.xml | |
httpfs-site | /etc/hadoop/conf | ServerSetup.html | |
mapred-site | /etc/hadoop/conf | mapred-default.xml | |
yarn-site | /etc/hadoop/conf | yarn-default.xml | |
capacity-scheduler | /etc/hadoop/conf | CapacityScheduler.html | YARN scheduler config |
tez-site | /etc/tez/conf | TezConfiguration.html | |
hive-site | /etc/hive/conf | Hive configuration properties | |
hiveserver2-site | /etc/hive/conf | Setting up hiveserver2 | Hiveserver2-specific |
properties
Classification | File path | Description |
---|---|---|
spark-defaults | /etc/spark/conf | Spark configuration in key–tab–value format |
user-env
Classification | File path | Description |
---|---|---|
user-env:profile | /etc/profile | Add global environment variables |
user-env:[username] | ~/.bashrc | Add environment variables to user's bashrc |
XML format
<configuration>
<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>41000-43000</value>
</property>
</configuration>
ENV format
...
export HADOOP_HEAPSIZE="3001"
...
Properties format
spark.driver.memory 4000M
spark.network.timeout 800
user-env format
{
"configurations": [
{
"classification": "user-env:profile",
"properties": {
"env": "FOO=profile\nVAR=profile\nexport U=profile"
}
},
{
"classification": "user-env:ubuntu",
"properties": {
"env": "FOO=foo\nVAR=var\nexport U=N"
}
}
]
}
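As described in the user-env table above, each newline-separated line in the env value is appended to the target file: in this example, the user-env:profile entry adds FOO=profile, VAR=profile, and export U=profile to /etc/profile, and the user-env:ubuntu entry appends its lines to the ubuntu user's ~/.bashrc.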
Sample example
{
"configurations":
[
{
"classification": "mapred-site", -- xml format
"properties":
{
"yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
}
},
{
"classification": "hadoop-env", -- env format
"properties":
{
"hadoop_env_hadoop_heapsize": 3001,
"hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m"
}
},
{
"classification": "spark-defaults", -- properties format
"properties":
{
"spark.driver.memory": "4000M",
"spark.network.timeout": "800s"
}
},
{
"classification": "user-env:profile", -- user-env format
"properties":
{
"env": "FOO=profile\nVAR=profile\nexport U=profile"
}
},
{
"classification": "user-env:ubuntu",
"properties":
{
"env": "FOO=foo\nVAR=var\nexport U=N"
}
}
]
}
Configure log storage
When using log storage settings, you need to set the log storage path.
- Set log storage settings to Enable.
- Select the Object Storage bucket to use for log storage.
- After checking the path, enter a new path if you want to change it.
- Deleting stored logs may cause the Spark History Server to malfunction.
3. Configure job scheduling (optional)
If you select the Core Hadoop bundle or Hive/Spark components in the cluster configuration step,
you can specify jobs to run after the cluster is created.
Configure Hive job scheduling
Set the scheduling for Hive jobs.
When selecting a bucket in Hive options, the Storage Object Manager and Storage Object Creator roles can upload objects but do not have Object Storage bucket access permission, so objects cannot be viewed in the console. However, objects can still be read through the Object Storage API.
- In Step 3: configure job scheduling, select Hive job as the job type.
- Enter scheduling information for the Hive job.
Category | Description |
---|---|
Job type | Hive job: runs Hive job after cluster creation |
Execution file | Execution file type - File: select Object Storage bucket and register executable file (only .hql files allowed) - Text: write Hive query and pass it to the job |
Hive options | Write option values to pass to the job (refer to Hive options) - File: select Object Storage bucket and upload Hive options file - Text: write Hive option values and pass to the job |
Job end action | Select action on job end - Wait on failure: cluster stops only if job succeeds - Always wait: cluster does not stop regardless of job success or failure - Always stop: cluster stops regardless of job success or failure |
Save scheduling logs | Select whether to save scheduling log files - Do not save: do not save scheduling logs - Save to Object Storage: save scheduling log files in the selected bucket * Log files are stored in bucket-name/log/ path in yyyy-mm-dd.log format |
Hive options
Hive options are Hive configuration properties used when running Hive jobs.
You can write options as shown below. For detailed information on Hive configuration properties, see the official documentation.
--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m
Configure Spark job scheduling
Set the scheduling for Spark jobs.
- In Step 3: configure job scheduling, select Spark job as the job type.
- Enter scheduling information for the Spark job.
Category | Description |
---|---|
Job type | Spark job: runs Spark job after cluster creation |
Execution file | Select Object Storage bucket and register executable file - only .jar files allowed |
Spark options (optional) | Write option values to pass to the job |
Arguments (optional) | Write arguments to pass to the executing .jar file - File: select Object Storage bucket and upload arguments file - Text: write arguments and pass to the job |
Deploy mode | Set Spark execution mode - choose between {client / cluster} |
Job end action | Select action on job end - Wait on failure: cluster stops only if job succeeds - Always wait: cluster does not stop regardless of job success or failure - Always stop: cluster stops regardless of job success or failure |
Save scheduling logs | Select whether to save scheduling log files - Do not save: do not save scheduling logs - Save to Object Storage: save scheduling log files in the selected bucket * Log files are stored in bucket-name/log/ path in yyyy-mm-dd.log format |
Spark options
Spark options are the settings passed to spark-submit when running Spark job files.
You can write options as shown below. For detailed information on the options, see the official documentation.
Including the --deploy-mode option here may cause errors. Use the Deploy mode setting available on the screen instead.
--class org.apache.spark.examples.SparkPi --master yarn
4. Configure security details (optional)
Apply cluster security features through Kerberos and Ranger settings.
Security features cannot be used together with the Data Catalog integration.
Item | Description |
---|---|
Kerberos | To install, select Install and enter the following items - Kerberos realm name: only uppercase English letters, numbers, and periods (.) are allowed (1–50 characters) - KDC (Key Distribution Center) password: automatically set to the administrator password configured in the cluster setup step |
Ranger | To install, select Install - Ranger password: automatically set to the administrator password configured in the cluster setup step |