Create cluster
A cluster is a set of nodes provisioned using Virtual Machines.
The process to create a cluster in the Hadoop Eco service is as follows.
It takes approximately 20-25 minutes to create a cluster.
Step 1. Configure cluster
Set the basic information for creating a Hadoop cluster.
- Go to the KakaoCloud console > Analytics > Hadoop Eco menu.
- In the Cluster menu, click the [Create cluster] button located at the top right.
- In Step 1: Configure cluster, enter the information and click the [Next] button.
| Item | Category | Description |
|---|---|---|
| Cluster name | | Example: my-cluster-01 - Cluster names must be unique within the same project - VMs are created based on the cluster name - Master node: created in the format HadoopMST-{ClusterName}-{Number} - Worker node: created in the format HadoopWRK-{ClusterName}-{Number} |
| Cluster configuration | Cluster version | Select the cluster version |
| | Cluster type | Select the cluster type based on the cluster version - For a detailed explanation, refer to Cluster version and type |
| | Cluster availability | Provides Standard and High availability types for operational stability - Standard (Single, 1 master node instance): Resource Manager and NameNode run in 1 instance - Creates a single master node, suitable for small-scale tasks - High availability (HA) (3 master node instances): Resource Manager and NameNode run in HA mode - Creates 3 master nodes, allowing uninterrupted tasks even during reboots |
| Administrator settings | Admin ID | Enter the admin ID |
| | Admin password | Enter the admin password - For details on resetting the password, refer to Hue password reset - When Ranger is applied, a specific password creation rule must be followed; for details, refer to Ranger application |
| | Confirm admin password | Enter the same admin password |
| VPC settings | | Select a VPC and subnet - Click Management page to go to VPC - A public IP accessible from external sources can be assigned after instance creation in the Assign public IP menu |
| Security group | | Create new security group: enter a name to create a new security group (inbound/outbound rules for Hadoop Eco are set automatically) - Select an existing security group: check its inbound/outbound rules - Click the [Refresh] icon to fetch network information - For a detailed explanation, refer to Security group |
Configure security group
To install a cluster in the Hadoop Eco service, you must configure the ports used by its components in the security group; certain ports must be open for the components to be set up. You can check the policies applied to instances in the list of applied policies.
- When a security group is automatically created, its default name is set based on the creation date, cluster version, and type.
  - Example security group name: {cluster_name}-HDE-{version}-{type}
- Only one security group can be set when creating an instance; additional security groups can be configured after the instance is created.

A security group is automatically created during cluster creation. The port rules set in the automatically created security group are as follows.

Inbound rules

Protocol | Source | Port number | Policy description |
---|---|---|---|
ALL | VPC subnet CIDR | ALL | Hadoop Eco internal |

Outbound rules

Outbound traffic is set to 'Allow all'.
Cluster version and type
| Cluster version | Cluster type | Options |
|---|---|---|
| Hadoop Eco 1.0.1 (Trino and Dataflow types not supported) | Core Hadoop | Core Hadoop (Hadoop 2.10.1 HDFS, YARN, Spark 2.4.6) - Apache Spark and Hive, which can be used with Hadoop, are installed together |
| | HBase | HBase (Hadoop 2.10.1, HBase 1.4.13) - Apache HBase, a distributed database based on Hadoop, is installed together |
| Hadoop Eco 1.1.2 | Core Hadoop | Core Hadoop (Hadoop 2.10.2 HDFS, YARN, Spark 2.4.8) |
| | HBase | HBase (Hadoop 2.10.2, HDFS, YARN, HBase 1.7.1) |
| | Trino | Trino (Hadoop 2.10.2, HDFS, YARN, Trino 377) |
| | Dataflow | Dataflow (Hadoop 2.10.2, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
| Hadoop Eco 2.0.1 | Core Hadoop | Core Hadoop (Hadoop 3.3.4 HDFS, YARN, Spark 3.2.2) |
| | HBase | HBase (Hadoop 3.3.4 HDFS, YARN, HBase 2.4.13) |
| | Trino | Trino (Hadoop 3.3.4, HDFS, YARN, Trino 393) |
| | Dataflow | Dataflow (Hadoop 3.3.4, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
| Hadoop Eco 2.1.0 | Core Hadoop | Core Hadoop (Hadoop 3.3.6 HDFS, YARN, Spark 3.5.2) |
| | HBase | HBase (Hadoop 3.3.6 HDFS, YARN, HBase 2.6.0) |
| | Trino | Trino (Hadoop 3.3.6, HDFS, YARN, Trino 393) |
| | Dataflow | Dataflow (Hadoop 3.3.6, HDFS, YARN, Kafka 3.8.0, Druid 25.0.0, Superset 2.1.1) |
Hadoop Eco versions 1.0.0, 1.1.0, 1.1.1, and 2.0.0 are not supported.
Step 2. Configure instance
Configure the master and worker instances, storage, and network.
In Step 2: Configure instance, enter the information and click the [Next] button.
- After creating the cluster, instance and disk volume configurations cannot be changed.
- Adding master/worker instances and disk volumes will be supported in the future.
| Category | Details | Description |
|---|---|---|
| Master node settings | Master node instance count | Fixed based on cluster availability - Standard (Single) type: 1 instance - HA type: 3 instances |
| | Master node instance type | Choose from the supported instance types - Hardware configuration depends on the selected instance type |
| | Disk volume type/size | - Volume type: currently only the SSD type is supported (other types will be supported in the future) - Volume size: 50 ~ 5,120 GB |
| Worker node settings | Worker node instance count | Choose the number of instances based on purpose; the total number is limited by the project's quota |
| | Worker node instance type | Choose from the supported instance types - Hardware configuration depends on the selected instance type |
| | Disk volume type/size | - Volume type: currently only the SSD type is supported (other types will be supported in the future) - Volume size: 50 ~ 5,120 GB |
| Total YARN usage | YARN Core | Result of 'number of worker nodes x vCPU count per node' (see the worked example below this table) |
| | YARN Memory | Result of 'number of worker nodes x memory size per node x YARN allocation ratio (0.8)' |
| Key pair | | Choose a key pair to apply to the instances - Select an existing KakaoCloud key pair or create a new one - To create a new key pair, refer to Create new key pair - Click Management Page to navigate to Virtual Machine > Key pairs |
| User script (optional) | | A script that runs as user data to automatically configure the environment when the instance starts |
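The Total YARN usage figures are simple products of the worker node count and each node's resources. For example, with 3 worker nodes of a hypothetical 8 vCPU / 64 GB instance type (illustrative values only, not a specific KakaoCloud offering), YARN Core = 3 x 8 = 24 cores and YARN Memory = 3 x 64 GB x 0.8 = 153.6 GB.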
Create new key pair
To create a key pair during cluster creation, follow these steps:
- Select Create new key pair and enter the key pair name.
- Click the [Create and download key pair] button.
- A private key file with a `.pem` extension is downloaded under the entered key pair name.

Please keep the downloaded private key file in a safe place.
Step 3. Configure detailed settings (optional)
1. Configure service integration (optional)
Apply settings for cluster service integration. You can configure integration with the Data Catalog service provided by KakaoCloud.
If service integration is not configured, the Standard (Single) type installs MySQL on master node 1, and the HA type installs MySQL on master node 3, to be used as the metastore.
In Service integration settings, choose whether to install the monitoring agent and configure service integration:
Category | Description |
---|---|
Monitoring agent installation | Select whether to install the monitoring agent |
Service integration | Choose one of: Do not integrate, Data Catalog integration, External Hive Metastore integration, or MemStore integration |
Install Monitoring Agent
When the monitoring agent is installed, additional node monitoring can be viewed under the Hadoop Eco > Cluster Details page > Monitoring tab.
- CPU usage per node (%)
- Memory usage per node (%)
Integrate Data Catalog
- To integrate with Data Catalog, prepare a Data Catalog created in advance. For more details on creating a catalog, refer to Create catalog.
- To integrate with Data Catalog, select Data Catalog integration in Service integration settings (optional).
- In the Data Catalog integration section, check the Hadoop network/subnet information and select the desired catalog.
Integrate external Hive metastore
- To integrate with an external Hive metastore, create MySQL. For more details on creating MySQL, refer to Create MySQL instance group.
- To integrate with MySQL, select External Hive Metastore integration in Service integration settings (optional).
- In the service integration section, select the instance where MySQL is installed.
- After selecting the instance, enter the MySQL database name, MySQL ID, and password.
Integrate MemStore
MemStore integration is only available for Hadoop Eco - Dataflow types. The MemStore integration button will not be displayed for other cluster types.
- To integrate with MemStore, create MemStore. For more details on creating MemStore, refer to Create MemStore cluster.
- To integrate with MemStore, select MemStore integration in Service integration settings (optional).
- In the MemStore name field, select the MemStore to integrate.
- Depending on whether MemStore cluster mode is used, fields for Superset Cache DB ID and Superset Query Cache DB ID will appear.
  - When cluster mode is used: no additional input fields.
  - When cluster mode is not used: you can set the Superset Cache DB ID and Superset Query Cache DB ID; if not set, they are automatically configured to 0 and 1.
2. Configure cluster details (optional)
You can configure the HDFS block size, replication factor, and other cluster settings. The HDFS settings take precedence over the cluster configuration settings.
Category | Description |
---|---|
HDFS settings | HDFS Block Size - Set the dfs.blocksize value in hdfs-site.xml - Block size can be set between 1 and 1,024 MB (default: 128 MB) HDFS Replication Factor - Set the dfs.replication value in hdfs-site.xml - Replication factor can be set between 1 and 500 - The replication factor must not exceed the number of worker node instances |
Cluster configuration settings (optional) | Enter the settings for the components that configure the cluster - Upload a JSON file or enter the settings manually - For details on Object Storage integration, refer to Integrate with Object Storage |
Log storage settings | Select whether to use Log Storage Settings |
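For reference, the same HDFS properties can also be expressed through the cluster configuration settings described in the next section. The snippet below is a minimal, illustrative sketch only: the `hdfs-site` classification and `dfs.blocksize` appear elsewhere in this guide, while `dfs.replication` is assumed here as the standard Hadoop property name for the replication factor. Values entered in the dedicated HDFS settings fields above take precedence over such overrides.

```json title="Illustrative hdfs-site override"
{
  "configurations": [
    {
      "classification": "hdfs-site",
      "properties": {
        "dfs.blocksize": "134217728",
        "dfs.replication": "2"
      }
    }
  ]
}
```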
Configure cluster - Component settings
Cluster configuration settings are made with a JSON file in key-value pair format. The `configurations` list contains `classification` for the file name and `properties` for the setting names. The basic input format is as follows:
```json title="Input method"
{
  "configurations": [
    {
      "classification": "filename",
      "properties": {
        "setting_name": "setting_value"
      }
    }
  ]
}
```
```json title="Example"
{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "dfs.blocksize": "67108864"
      }
    }
  ]
}
```
You can categorize the setting file name into `env`, `xml`, and `properties` formats, plus the special `user-env` classification.
Format | Description |
---|---|
env | The value of each entered key is written into the configuration file; keys have fixed, predefined names, so modify the key that corresponds to the value you want to change. |
xml | The key-value pair is transformed into XML elements, with the key as <name> and the value as <value> . |
properties | The key-value pair is transformed into the defined format based on the name of the key. |
user-env | Creates a user environment variable with <username> . |
Based on the classification you enter, your settings are added to the configuration file generated at the corresponding location below.
#### env \{#env}
| Classification | Location | Setting | Sample Value |
|---|---|---|---|
| hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048 |
| | | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m" |
| | | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m" |
| | | hadoop_env_hadoop_log_level | INFO, DRFA, console |
| | | hadoop_env_hadoop_security_log_level | INFO, DRFAS |
| | | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT |
| mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000 |
| hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048 |
| | | hive_env_hiveserver2_heapsize | 2048 |
| | | hive_env_hadoop_client_opts | "-Xmx2048m" |
| hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m" |
| | | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m" |
| spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g |
| trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G |
#### xml \{#xml}
| Classification | Location | Reference Location | Notes |
| -------------------- | ----------------- | ------------------------------------------------------------------------------- | ------ |
| core-site | /etc/hadoop/conf | [core-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-common/core-default.xml) | |
| hdfs-site | /etc/hadoop/conf | [hdfs-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml) | |
| httpfs-site | /etc/hadoop/conf | [ServerSetup.html](https://hadoop.apache.org/docs/r2.10.1/hadoop-hdfs-httpfs/ServerSetup.html) | |
| mapred-site | /etc/hadoop/conf | [mapred-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) | |
| yarn-site | /etc/hadoop/conf | [yarn-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml) | |
| capacity-scheduler | /etc/hadoop/conf | [CapacityScheduler.html](https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) | yarn scheduler setup |
| tez-site | /etc/tez/conf | [TezConfiguration.html](https://tez.apache.org/releases/0.9.2/tez-api-javadocs/configs/TezConfiguration.html) | |
| hive-site | /etc/hive/conf | [Hive configuration properties](https://cwiki.apache.org/confluence/display/hive/configuration+properties) | |
| hiveserver2-site | /etc/hive/conf | [Setting up hiveserver2](https://cwiki.apache.org/confluence/display/hive/setting+up+hiveserver2) | hiveserver2-specific settings |
#### properties \{#properties}
| Classification | Location | Description |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| spark-defaults | /etc/spark/conf | Spark configuration values are entered as key[tab]value pairs. |
#### user-env \{#user-env}
| Classification | Location | Description |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| user-env:profile | /etc/profile | Add global variables |
| user-env:[username] | ~/.bashrc | Add environment variables to user's bashrc |
##### xml format \{#xml-format}
```xml title="xml format"
<configuration>
<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>41000-43000</value>
</property>
</configuration>
```

##### env format \{#env-format}
```bash title="env format"
...
export HADOOP_HEAPSIZE="3001"
...
```

##### properties format \{#properties-format}
```properties title="properties format"
spark.driver.memory 4000M
spark.network.timeout 800
```

##### user-env format \{#user-env-format}
```json title="user-env format"
{
  "configurations": [
    {
      "classification": "user-env:profile",
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}
```
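In this user-env example, the user-env:profile entry adds the variables globally via /etc/profile, while the user-env:ubuntu entry adds them to the ubuntu user's ~/.bashrc, as described in the user-env table above.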
##### Sample example \{#sample-example}
```json title="Sample example"
{
  "configurations": [
    {
      "classification": "mapred-site", -- xml format
      "properties": {
        "yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
      }
    },
    {
      "classification": "hadoop-env", -- env format
      "properties": {
        "hadoop_env_hadoop_heapsize": 3001,
        "hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m"
      }
    },
    {
      "classification": "spark-defaults", -- properties format
      "properties": {
        "spark.driver.memory": "4000M",
        "spark.network.timeout": "800s"
      }
    },
    {
      "classification": "user-env:profile", -- user-env format
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}
```
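This sample combines the four formats: the mapred-site entry is rendered as XML name/value elements, the hadoop-env entries become environment variable settings, the spark-defaults entries are written as key[tab]value pairs, and the user-env entries add environment variables to /etc/profile and to the ubuntu user's ~/.bashrc. The `-- xml format` style annotations are explanatory labels only; remove them if you paste this JSON directly, since JSON does not support comments.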
Log storage settings
When using log storage settings, you need to configure the log storage path.
- Set the log storage settings to use.
- Choose the Object Storage bucket for log storage.
- After checking the path, enter a new path if you wish to modify it.

Deleting the stored logs may cause the Spark History Server to malfunction.
3. Configure job scheduling (optional)
If you select Core Hadoop as the cluster type in Step 1: Configure cluster, proceed to configure job scheduling.
When configuring job scheduling, choose from Hive job, Spark job, or None. If you select Hive job or Spark job, you can configure the job scheduling. If you choose None, no job scheduling will be configured.
- When configuring job scheduling, click the [refresh] icon to fetch Object Storage bucket information.
- Click Management Page to navigate to the Object Storage service.
Configure Hive job scheduling
Configure the scheduling for Hive jobs.
When selecting a bucket for Hive options, users with the Storage Object Manager or Storage Object Creator role can upload objects but cannot view objects in the Object Storage bucket from the console. However, the objects can still be accessed through the Object Storage API.
- In Step 3: Job Scheduling Configuration, select Hive job as the job type.
- Enter the scheduling information for the Hive job.

| Category | Description |
|---|---|
| Job type | Hive job: Execute the Hive job after cluster creation |
| Execution file | Execution file type - File: Select a file to execute from an Object Storage bucket; only `.hql` files are allowed - Text: Write the Hive queries to execute |
| Hive options | Provide options for the job (refer to Hive options below) - File: Select an option file from Object Storage - Text: Write the Hive option values for the job |
| Job completion action | Select the action when the job finishes - Wait on failure: Only shut down the cluster if the job succeeds - Always wait: Do not shut down the cluster regardless of job success or failure - Always shutdown: Shut down the cluster regardless of job success or failure |
| Scheduling log file storage | Choose whether to save scheduling log files - Do not save: Do not save scheduling logs - Save to Object Storage: Save logs to a selected bucket; logs are stored in bucket-name/log/ in `yyyy-mm-dd.log` format |
Hive options
Hive options refer to Hive configuration properties used when executing a Hive job.
These can be written as follows, and further details about Hive configuration properties can be found in the official documentation.
```bash
--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m
```
Configure Spark job scheduling
Configure the scheduling for Spark jobs.
- In Step 3: Job Scheduling Configuration, select Spark job as the job type.
- Enter the scheduling information for the Spark job.

| Category | Description |
|---|---|
| Job type | Spark job: Execute the Spark job after cluster creation |
| Execution file | Select a file to execute from an Object Storage bucket - Only `.jar` files are allowed |
| Spark options (optional) | Provide options for the job |
| Arguments (optional) | Provide arguments to be passed to the `.jar` file - File: Select an argument file from Object Storage - Text: Write the arguments for the job |
| Deployment mode | Choose the mode to run Spark - Choose between client and cluster |
| Job completion action | Select the action when the job finishes - Wait on failure: Only shut down the cluster if the job succeeds - Always wait: Do not shut down the cluster regardless of job success or failure - Always shutdown: Shut down the cluster regardless of job success or failure |
| Scheduling log file storage | Choose whether to save scheduling log files - Do not save: Do not save scheduling logs - Save to Object Storage: Save logs to a selected bucket; logs are stored in bucket-name/log/ in `yyyy-mm-dd.log` format |
Spark options
Spark options refer to configuration settings passed to `spark-submit` when executing the Spark job file.
They can be written as follows, and further details can be found in the official documentation.
Note that including the `--deploy-mode` argument may cause errors; since the deploy mode can be selected in the UI, use the UI option to select it.
```bash
--class org.apache.spark.examples.SparkPi --master yarn
```
4. Configure security details (optional)
Apply cluster security features using Kerberos and Ranger.
Category | Description |
---|---|
Kerberos | To install, select Install and enter the following information: - Kerberos Realm Name: only uppercase English letters, numbers, and dots (.) are allowed (1-50 characters) - KDC (Key Distribution Center) Password: automatically set to the admin password entered in the cluster configuration step |
Ranger | To install, select Install. - Ranger Password: automatically set to the admin password entered in the cluster configuration step |