
Create cluster

A cluster is a set of nodes provisioned using Virtual Machines.
The process to create a cluster in the Hadoop Eco service is as follows.

info

It takes approximately 20-25 minutes to create a cluster.

Step 1. Configure cluster

Set the basic information for creating a Hadoop cluster.

  1. Go to KakaoCloud console > Analytics > Hadoop Eco menu.

  2. In the Cluster menu, select the [Create cluster] button located at the top right.

  3. In Step 1: Configure cluster, enter the information and select the [Next] button.

    Item | Category | Description
    Cluster name |  | Example: my-cluster-01
      - Cluster names must be unique within the same project
      - VMs are named based on the cluster name
        - Master node: created in the format HadoopMST-{cluster name}-{number}
        - Worker node: created in the format HadoopWRK-{cluster name}-{number}
    Cluster configuration | Cluster version | Select the cluster version
     | Cluster type | Select the cluster type based on the cluster version
      - For a detailed explanation, refer to Cluster version and type
     | Cluster availability | Standard and high availability types are provided for operational stability
      - Standard (Single, 1 master node instance): Resource Manager and NameNode run on a single instance
        - Creates a single master node; suitable for small-scale tasks
      - High availability (HA, 3 master node instances): Resource Manager and NameNode run in HA mode
        - Creates 3 master nodes, allowing tasks to continue uninterrupted even during reboots
    Administrator settings | Admin ID | Enter the admin ID
     | Admin password | Enter the admin password
      - For details on resetting the password, refer to Hue password reset
      - When Ranger is applied, a specific password creation rule must be followed; for details, refer to Ranger application
     | Confirm admin password | Enter the same admin password again
    VPC settings |  | Select a VPC and subnet
      - Select Management page to go to VPC
      - A public IP for external access can be assigned after instance creation via the Assign public IP menu
    Security group |  | Create new security group: enter a name to create a new security group
      - Inbound/outbound rules required for Hadoop Eco are set automatically
      Select an existing security group: check its inbound/outbound rules
      - Select the [Refresh] icon to fetch the latest network information
      - For a detailed explanation, refer to Security group

Cluster bundles and components

Cluster version | Cluster bundle | Component options
Hadoop Eco 1.0.1 (Trino and Dataflow types not supported) | Core hadoop | Hadoop 2.10.1, Hive 2.3.2, Hue 4.11.0, Oozie 5.2.1, Spark 2.4.6, Tez 0.9.2, Zeppelin 0.10.0, Zookeeper 3.5.7
 | Hbase | Hadoop 2.10.1, HBase 1.4.13, Hue 4.11.0, Zookeeper 3.5.7
Hadoop Eco 1.1.2 | Core hadoop | Hadoop 2.10.2, Flink 1.14.4, Hive 2.3.9, Hue 4.11.0, Oozie 5.2.1, Spark 2.4.8, Sqoop 1.4.7, Tez 0.9.2, Zeppelin 0.10.1, Zookeeper 3.8.0
 | Hbase | Hadoop 2.10.2, HBase 1.7.1, Hue 4.11.0, Zookeeper 3.8.0
 | Trino | Hadoop 2.10.2, Hive 2.3.9, Hue 4.11.0, Tez 0.9.2, Trino 377, Zeppelin 0.10.1, Zookeeper 3.8.0
 | Dataflow | Hadoop 2.10.2, Druid 25.0.0, Hue 4.11.0, Kafka 3.4.0, Superset 2.1.1, Zookeeper 3.8.0
Hadoop Eco 2.0.1 | Core hadoop | Hadoop 3.3.4, Flink 1.15.1, Hive 3.1.3, Hue 4.11.0, Oozie 5.2.1, Spark 3.2.2, Sqoop 1.4.7, Tez 0.10.1, Zeppelin 0.10.1, Zookeeper 3.8.0
 | Hbase | Hadoop 3.3.4, HBase 2.4.13, Hue 4.11.0, Zookeeper 3.8.0
 | Trino | Hadoop 3.3.4, Hive 3.1.3, Hue 4.11.0, Tez 0.10.1, Trino 393, Zeppelin 0.10.1, Zookeeper 3.8.0
 | Dataflow | Hadoop 3.3.4, Druid 25.0.0, Hue 4.11.0, Kafka 3.4.0, Superset 2.1.1, Zookeeper 3.8.0
Hadoop Eco 2.1.0 | Core hadoop | Hadoop 3.3.6, Flink 1.20.0, Hive 3.1.3, Hue 4.11.0, Oozie 5.2.1, Spark 3.5.2, Sqoop 1.4.7, Tez 0.10.2, Zeppelin 0.10.1, Zookeeper 3.9.2
 | Hbase | Hadoop 3.3.6, HBase 2.6.0, Hue 4.11.0, Zookeeper 3.9.2
 | Trino | Hadoop 3.3.6, Hive 3.1.3, Hue 4.11.0, Tez 0.10.2, Trino 393, Zeppelin 0.10.1, Zookeeper 3.9.2
 | Dataflow | Hadoop 3.3.6, Druid 25.0.0, Hue 4.11.0, Kafka 3.8.0, Superset 2.1.1, Zookeeper 3.9.2
Hadoop Eco 2.2.0 | Core hadoop | Hadoop 3.3.6, Flink 1.20.0, Hive 3.1.3, Hue 4.11.0, Oozie 5.2.1, Spark 3.5.2, Sqoop 1.4.7, Tez 0.10.2, Zeppelin 0.10.1, Zookeeper 3.9.2
 | Hbase | Hadoop 3.3.6, HBase 2.6.0, Hue 4.11.0, Phoenix 5.2.1, Zookeeper 3.9.2
 | Trino | Hadoop 3.3.6, Hive 3.1.3, Hue 4.11.0, Tez 0.10.2, Trino 436, Zeppelin 0.10.1, Zookeeper 3.9.2
 | Dataflow | Hadoop 3.3.6, Druid 25.0.0, Hue 4.11.0, Kafka 3.8.0, Superset 2.1.1, Zookeeper 3.9.2
info
  • Hadoop Eco 1.0.0, 1.1.0, 1.1.1, and 2.0.0 versions are not supported.
  • When default components of a bundle are modified, it is automatically treated as a custom bundle.

Configure security group

When creating a cluster, a security group with necessary component ports is automatically generated.

  • Based on the selected subnet ID, a security group is automatically created. If a cluster already exists in the same subnet, the existing security group is reused.
  • Example security group name: HDE-{subnet ID}
  • You can additionally apply existing security groups through the extra security group setting.

Ports configured in the auto-generated security group are as follows.

Protocol | Source | Port | Description
ALL | VPC subnet CIDR | ALL | Hadoop Eco internal

Step 2. Configure instance

Configure master and worker instances, storage, and network.

Enter the required information in Step 2: Configure instance, then click the [Next] button.

  • After cluster creation, instance and disk volume settings cannot be changed.
  • Adding master/worker instances or disk volumes will be supported in the future.
Item | Category | Description
Master node config | Master node count | Fixed based on cluster availability
  - Standard (Single) type: 1
  - HA type: 3
 | Master node type | Choose from the supported instance types
  - Hardware configuration varies by instance type
 | Disk volume type / size | Volume type: only SSD is supported (other types to be supported later)
  - Size: 50–5,120 GB
Worker node config | Worker node count | Set according to purpose, within project quota limits
 | Worker node type | Choose from the supported instance types
  - Hardware configuration varies by instance type
 | Disk volume type / size | Volume type: only SSD is supported (other types to be supported later)
  - Size: 50–5,120 GB
Total YARN usage | YARN Core | Calculated as 'worker node count × vCPUs per node'
 | YARN Memory | Calculated as 'worker node count × memory per node × YARN allocation ratio (0.8)'
  - For example, 4 worker nodes with 32 GB of memory each yield 4 × 32 GB × 0.8 = 102.4 GB
Key pair |  | Select the key pair for the instances
  - Use an existing KakaoCloud key pair or create a new one
  - See Create new key pair for details
  - Click Admin page to navigate to Virtual Machine > Key pair
User script (optional) |  | Script passed as user data to configure the environment automatically at instance startup (see the sketch after this table)
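
For reference, a minimal user script sketch is shown below. It assumes an Ubuntu-based node image; the package and environment variable are illustrative only and are not required by Hadoop Eco.

#!/bin/bash
# Illustrative user-data script: runs once at instance startup.
# Install an extra OS utility (example only).
apt-get update -y
apt-get install -y htop
# Register a custom environment variable for all users (example only).
echo 'export MY_CUSTOM_VAR=example' >> /etc/profile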

Create new key pair

To create a key pair during cluster creation, follow the steps below:

  1. Select Create new key pair and enter a key pair name.
  2. Click the [Create and download key pair] button.
  3. A private key file with a .pem extension will be downloaded using the specified key pair name.
info

Be sure to store the downloaded private key file securely.
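
For example, on a Linux or macOS client you would typically restrict the key file's permissions before using it for SSH access. The key pair name, OS account, and node address below are placeholders, not values provided by the service.

# Allow only the owner to read the private key (required by most SSH clients)
chmod 400 my-keypair.pem
# Connect to a cluster node with the key (account name and address are placeholders)
ssh -i my-keypair.pem ubuntu@<master-node-ip>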

Step 3. Advanced settings (optional)

1. Configure service integration (optional)

Apply settings for cluster service integration. Integration is available with KakaoCloud's Data Catalog, MySQL, and MemStore services.

info

If service integration is not configured, Standard (Single) type installs MySQL on master node 1, and HA type installs MySQL on all 3 master nodes for use as a metastore.

In Configure service integration, choose whether to install the monitoring agent and configure the desired service integration.

Item | Description
Install monitoring agent | Select whether to install the monitoring agent
Integrate external storage | Hive metastore: None / Integrate with Data Catalog / Integrate with MySQL
  - Hive metastore integration is only available when Hive is selected as a component
 | Superset cache store: None / Integrate with MemStore
  - Superset cache integration is only available when Superset is selected as a component

Install monitoring agent

When the monitoring agent is installed, node monitoring becomes available in the Hadoop Eco > Cluster details > Monitoring tab:

  • CPU usage (%) per node
  • Memory usage (%) per node

Integrate with Data Catalog

  1. Prepare a pre-created Data Catalog for integration. For details on creating a catalog, refer to Create catalog.
  2. In Configure service integration (optional), select Data Catalog integration.
    • Confirm Hadoop network/subnet info in the integration section, then select a desired catalog.

Integrate with MySQL

  1. Prepare a pre-created MySQL instance for integration. For details, see Create MySQL instance group.
  2. In Configure service integration (optional), select MySQL integration:
    1. Choose the instance where MySQL is installed.
    2. After selecting the instance, enter the database name, MySQL ID, and password.

Integrate with MemStore

info

MemStore integration is only available when the Dataflow bundle or Superset component is selected.

  1. Create a MemStore instance. For instructions, refer to Create MemStore cluster.
  2. In Configure service integration (optional), select MemStore integration:
    1. Choose the MemStore to integrate in the MemStore name field.
    2. Depending on whether cluster mode is used in MemStore, fields for Superset Cache DB ID and Superset Query Cache DB ID may appear:
      • If cluster mode is enabled: no additional input required
      • If cluster mode is disabled: you can enter Superset Cache DB ID and Superset Query Cache DB ID, or leave them blank to use the default (0, 1)

2. Configure cluster details (optional)

You can configure HDFS block size, replication factor, and other cluster component settings. HDFS settings take precedence over cluster component settings.

Item | Description
HDFS settings | HDFS block size
  - Sets the dfs.blocksize value in hdfs-site.xml
  - Can be set between 1 and 1,024 MB (default: 128 MB)
 | HDFS replication factor
  - Sets the dfs.replication value in hdfs-site.xml
  - Can be set between 1 and 500
  - The replication factor must not exceed the number of worker nodes
Cluster configuration (optional) | Enter component-specific configurations for the cluster
  - Either upload a JSON file or enter the values directly
  - For object storage integration, see Integrate with Object Storage
Log storage settings | Choose whether to configure log storage

Cluster configuration - component settings

Cluster configuration

Cluster configuration is defined using a JSON file in key-value pair format. The configurations field is a list, where classification represents the config file name and properties contains the configuration parameters. The basic format is shown below.

JSON file format

-- Format
{
  "configurations": [
    {
      "classification": "file name",
      "properties": {
        "property name": "value"
      }
    }
  ]
}

-- Example
{
  "configurations": [
    {
      "classification": "hdfs-site",
      "properties": {
        "dfs.blocksize": "67108864"
      }
    }
  ]
}

Depending on the configuration file name, the format is classified as env, xml, properties, or user-env.

Format | Description
env | Input keys are mapped to predefined values; only specific keys can be modified
xml | Input key-value pairs are written as XML elements (<name>, <value>)
properties | Input key-value pairs are written as plain-text property entries
user-env | Adds user-specific environment variables using the format user-env:<username>

User-provided configurations are inserted into the appropriate files based on the classification name.

env

Classification | File path | Setting name | Sample value
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048
 |  | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m"
 |  | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m"
 |  | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m"
 |  | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m"
 |  | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m"
 |  | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m"
 |  | hadoop_env_hadoop_log_level | INFO, DRFA, console
 |  | hadoop_env_hadoop_security_log_level | INFO, DRFAS
 |  | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT
mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000
hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048
 |  | hive_env_hiveserver2_heapsize | 2048
 |  | hive_env_hadoop_client_opts | "-Xmx2048m"
hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m"
 |  | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m"
spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g
trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G

xml

Classification | File path | Reference link | Note
core-site | /etc/hadoop/conf | core-default.xml |
hdfs-site | /etc/hadoop/conf | hdfs-default.xml |
httpfs-site | /etc/hadoop/conf | ServerSetup.html |
mapred-site | /etc/hadoop/conf | mapred-default.xml |
yarn-site | /etc/hadoop/conf | yarn-default.xml |
capacity-scheduler | /etc/hadoop/conf | CapacityScheduler.html | YARN scheduler config
tez-site | /etc/tez/conf | TezConfiguration.html |
hive-site | /etc/hive/conf | Hive configuration properties |
hiveserver2-site | /etc/hive/conf | Setting Up HiveServer2 | HiveServer2-specific settings

properties

Classification | File path | Description
spark-defaults | /etc/spark/conf | Spark configuration in key–tab–value format

user-env

Classification | File path | Description
user-env:profile | /etc/profile | Add global environment variables
user-env:[username] | ~/.bashrc | Add environment variables to the user's .bashrc

XML format

<configuration>
  <property>
    <name>yarn.app.mapreduce.am.job.client.port-range</name>
    <value>41000-43000</value>
  </property>
</configuration>

ENV format

...
export HADOOP_HEAPSIZE="3001"
...

Properties format

spark.driver.memory              4000M
spark.network.timeout            800

user-env format

{
  "configurations": [
    {
      "classification": "user-env:profile",
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}

Sample example

{
  "configurations": [
    {
      "classification": "mapred-site",        -- xml format
      "properties": {
        "yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
      }
    },
    {
      "classification": "hadoop-env",         -- env format
      "properties": {
        "hadoop_env_hadoop_heapsize": 3001,
        "hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m"
      }
    },
    {
      "classification": "spark-defaults",     -- properties format
      "properties": {
        "spark.driver.memory": "4000M",
        "spark.network.timeout": "800s"
      }
    },
    {
      "classification": "user-env:profile",   -- user-env format
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}

Configure log storage

When using log storage settings, you need to set the log storage path.

  1. Set log storage settings to enable.

  2. Select the Object Storage bucket to use for log storage.

  3. Check the default path; to store logs under a different path, enter the new path.

caution
  • Deleting stored logs may cause the Spark History Server to malfunction.

3. Configure job scheduling (optional)

If you select the Core Hadoop bundle or Hive/Spark components in the cluster configuration step,
you can specify jobs to run after the cluster is created.

Configure Hive job scheduling

Set the scheduling for Hive jobs.

info

When selecting a bucket for Hive options, the Storage Object Manager and Storage Object Creator roles can upload objects but do not have Object Storage bucket access permission, so the objects cannot be viewed in the console. They can still be read via the Object Storage API.

  1. In Step 3: configure job scheduling, select Hive job as the job type.
  2. Enter scheduling information for the Hive job.
Category | Description
Job type | Hive job: runs a Hive job after cluster creation
Execution file | Execution file type
  - File: select an Object Storage bucket and register the executable file (only .hql files are allowed)
  - Text: write a Hive query and pass it to the job
Hive options | Write option values to pass to the job (refer to Hive options)
  - File: select an Object Storage bucket and upload a Hive options file
  - Text: write Hive option values and pass them to the job
Job end action | Select the action to take when the job ends
  - Wait on failure: the cluster stops only if the job succeeds
  - Always wait: the cluster does not stop, regardless of job success or failure
  - Always stop: the cluster stops, regardless of job success or failure
Save scheduling logs | Select whether to save scheduling log files
  - Do not save: scheduling logs are not saved
  - Save to Object Storage: scheduling log files are saved in the selected bucket
    * Log files are stored under the bucket-name/log/ path in yyyy-mm-dd.log format

Hive options

Hive options are Hive configuration properties passed when running a Hive job.
You can write options as shown below. For detailed information on Hive configuration properties, see the official documentation.

--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m

Configure Spark job scheduling

Set the scheduling for Spark jobs.

  1. In Step 3: configure job scheduling, select Spark job as the job type.
  2. Enter scheduling information for the Spark job.
Category | Description
Job type | Spark job: runs a Spark job after cluster creation
Execution file | Select an Object Storage bucket and register the executable file
  - Only .jar files are allowed
Spark options (optional) | Write option values to pass to the job
Arguments (optional) | Write arguments to pass to the executed .jar file
  - File: select an Object Storage bucket and upload an arguments file
  - Text: write arguments and pass them to the job
Deploy mode | Set the Spark execution mode
  - Choose between client and cluster
Job end action | Select the action to take when the job ends
  - Wait on failure: the cluster stops only if the job succeeds
  - Always wait: the cluster does not stop, regardless of job success or failure
  - Always stop: the cluster stops, regardless of job success or failure
Save scheduling logs | Select whether to save scheduling log files
  - Do not save: scheduling logs are not saved
  - Save to Object Storage: scheduling log files are saved in the selected bucket
    * Log files are stored under the bucket-name/log/ path in yyyy-mm-dd.log format

Spark options

Spark options are the settings passed to spark-submit when running the Spark job file.
You can write options as shown below. For detailed information on the options, see the official documentation.

info

Including the --deploy-mode option here may cause errors. Use the Deploy mode setting available on the screen instead.

--class org.apache.spark.examples.SparkPi --master yarn

4. Configure security details (optional)

Apply cluster security features through Kerberos and Ranger settings.

info

Security features cannot be used together with the Data Catalog integration.

Item | Description
Kerberos | To install Kerberos, select Install and enter the following items.
  Kerberos realm name
  - Only uppercase English letters, numbers, and periods (.) are allowed (1–50 characters)
  KDC (Key Distribution Center) password
  - Automatically set to the administrator password configured in the cluster setup step
Ranger | To install Ranger, select Install.
  Ranger password
  - Automatically set to the administrator password configured in the cluster setup step