
Create cluster

A cluster is a set of nodes provisioned using Virtual Machines.
The process to create a cluster in the Hadoop Eco service is as follows.

info

It takes approximately 20-25 minutes to create a cluster.

Step 1. Configure cluster

Set the basic information for creating a Hadoop cluster.

  1. Go to KakaoCloud console > Analytics > Hadoop Eco menu.

  2. In the Cluster menu, click the [Create cluster] button located at the top right.

  3. In Step 1: Configure cluster, enter the information and click the [Next] button.

| Item | Category | Description |
| --- | --- | --- |
| Cluster name | | Example: my-cluster-01<br/>- Cluster names must be unique within the same project<br/>- VMs are created based on the cluster name<br/>- Master node: created in the format HadoopMST-{ClusterName}-{Number}<br/>- Worker node: created in the format HadoopWRK-{ClusterName}-{Number} |
| Cluster configuration | Cluster version | Select the cluster version |
| | Cluster type | Select the cluster type based on the cluster version<br/>- For a detailed explanation, refer to Cluster version and type |
| | Cluster availability | Provides Standard and High availability types for operational stability<br/>- Standard (Single, 1 master node instance): Resource Manager and NameNode run on 1 instance; creates a single master node, suitable for small-scale tasks<br/>- High availability (HA, 3 master node instances): Resource Manager and NameNode run in HA mode; creates 3 master nodes, allowing tasks to continue uninterrupted even during reboots |
| Administrator settings | Admin ID | Enter the admin ID |
| | Admin password | Enter the admin password<br/>- For details on resetting the password, refer to Hue password reset<br/>- When Ranger is applied, a specific password creation rule must be followed; for details, refer to Ranger application |
| | Confirm admin password | Enter the same admin password again |
| VPC settings | | Select a VPC and subnet<br/>- Click Management page to go to VPC<br/>- A public IP accessible from external networks can be assigned after instance creation in the Assign public IP menu |
| | Security group | Create new security group: enter a name to create a new security group<br/>- Inbound/outbound rules for Hadoop Eco are set automatically<br/>Select an existing security group: check its inbound/outbound rules<br/>- Click the [Refresh] icon to fetch network information<br/>- For a detailed explanation, refer to Security group |

Configure security group

To install a cluster in the Hadoop Eco service, you must configure the ports used by its components in the security group; certain ports must be open for the components to be configured. You can check the policies applied to instances in the list of applied policies.

  • When a security group is automatically created, the default name of the security group is automatically set based on the creation date, cluster version, and type.
  • Example of a security group name: {cluster_name}-HDE-{version}-{type}
  • Only one security group can be set when creating an instance, and additional security group configurations can be made after the instance is created.

A security group is automatically created during cluster creation. The port information set in the automatically created security group is as follows.

| Protocol | Packet Source | Port Number | Policy Description |
| --- | --- | --- | --- |
| ALL | VPC subnet CIDR | ALL | Hadoop Eco internal |

Cluster version and type

| Cluster version | Cluster type | Options |
| --- | --- | --- |
| Hadoop Eco 1.0.1<br/>(Trino, Dataflow types not supported) | Core Hadoop | Core Hadoop (Hadoop 2.10.1 HDFS, YARN, Spark 2.4.6)<br/>- Apache Spark and Hive, which can be used with Hadoop, are installed together |
| | HBase | HBase (Hadoop 2.10.1, HBase 1.4.13)<br/>- Apache HBase, a distributed database based on Hadoop, is installed together |
| Hadoop Eco 1.1.2 | Core Hadoop | Core Hadoop (Hadoop 2.10.2 HDFS, YARN, Spark 2.4.8) |
| | HBase | HBase (Hadoop 2.10.2 HDFS, YARN, HBase 1.7.1) |
| | Trino | Trino (Hadoop 2.10.2 HDFS, YARN, Trino 377) |
| | Dataflow | Dataflow (Hadoop 2.10.2 HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
| Hadoop Eco 2.0.1 | Core Hadoop | Core Hadoop (Hadoop 3.3.4 HDFS, YARN, Spark 3.2.2) |
| | HBase | HBase (Hadoop 3.3.4 HDFS, YARN, HBase 2.4.13) |
| | Trino | Trino (Hadoop 3.3.4 HDFS, YARN, Trino 393) |
| | Dataflow | Dataflow (Hadoop 3.3.4 HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
| Hadoop Eco 2.1.0 | Core Hadoop | Core Hadoop (Hadoop 3.3.6 HDFS, YARN, Spark 3.5.2) |
| | HBase | HBase (Hadoop 3.3.6 HDFS, YARN, HBase 2.6.0) |
| | Trino | Trino (Hadoop 3.3.6 HDFS, YARN, Trino 393) |
| | Dataflow | Dataflow (Hadoop 3.3.6 HDFS, YARN, Kafka 3.8.0, Druid 25.0.0, Superset 2.1.1) |
info

Versions Hadoop Eco 1.0.0, 1.1.0, 1.1.1, and 2.0.0 are not supported.

Step 2. Configure instance

Configure the master and worker instances, storage, and network.

In Step 2: Configure instance, enter the information and click the [Next] button.

  • After creating the cluster, instance and disk volume configurations cannot be changed.
  • Adding master/worker instances and disk volumes will be supported in the future.

| Category | Details | Description |
| --- | --- | --- |
| Master node settings | Master node instance count | Fixed based on cluster availability<br/>- Standard (Single) type: 1 instance<br/>- HA type: 3 instances |
| | Master node instance type | Choose from the supported instance types<br/>- Hardware configuration depends on the selected instance type |
| | Disk volume type/size | - Volume type: currently only the SSD type is supported (other types will be supported in the future)<br/>- Volume size: 50~5,120 GB |
| Worker node settings | Worker node instance count | Choose the number of instances based on purpose; the total number is determined by the project's quota |
| | Worker node instance type | Choose from the supported instance types<br/>- Hardware configuration depends on the selected instance type |
| | Disk volume type/size | - Volume type: currently only the SSD type is supported (other types will be supported in the future)<br/>- Volume size: 50~5,120 GB |
| Total YARN usage | YARN Core | Result of number of worker nodes × vCPU count per node (see the worked example below the table) |
| | YARN Memory | Result of number of worker nodes × memory size per node × YARN allocation ratio (0.8) (see the worked example below the table) |
| Key pair | | Choose a key pair to apply to the instance<br/>- Select an existing KakaoCloud key pair or create a new one<br/>- For creating a new key pair, refer to Create new key pair<br/>- Click Management page to navigate to Virtual Machine > Key pairs |
| User script (optional) | | A script that runs as user data to automatically configure the environment when the instance starts (see the sketch below the table) |
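
As a worked example of Total YARN usage (the node count and sizes here are hypothetical): with 4 worker nodes of 8 vCPU and 32 GB memory each, YARN Core = 4 × 8 = 32 cores and YARN Memory = 4 × 32 GB × 0.8 = 102.4 GB.

For the user script field, the following is a minimal sketch of a user data script, assuming an Ubuntu-based image on which the script runs as root at first boot; the package and file names are purely illustrative.

```bash title="User script sketch (illustrative)"
#!/bin/bash
# Runs once as user data when the instance starts.
set -e

# Example only: install a small utility and leave a marker file.
apt-get update -y
apt-get install -y htop                 # hypothetical package choice
date > /var/log/hde-first-boot.log      # hypothetical marker file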

Create new key pair

To create a key pair during cluster creation, follow these steps:

  1. Select Create new key pair and enter the key pair name.
  2. Click the [Create and download key pair] button.
  3. A private key file with a .pem extension will be downloaded with the entered key pair name.
info

Please keep the downloaded private key file in a safe place.
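
The downloaded key can later be used to connect to the cluster nodes over SSH. A minimal sketch, assuming a Linux/macOS client, a reachable master node IP, and that `ubuntu` is the login account (the key file name, IP, and account are illustrative and depend on your cluster):

```bash title="Key pair usage sketch (illustrative)"
# Restrict permissions on the downloaded private key; ssh refuses world-readable keys.
chmod 400 my-keypair.pem

# Connect to the master node. The IP and login account are hypothetical;
# the account name depends on the cluster image.
ssh -i my-keypair.pem ubuntu@10.0.0.10
```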

Step 3. Configure detailed settings (optional)

1. Configure service integration (optional)

Apply settings for cluster service integration. You can configure integration with the Data Catalog service provided by KakaoCloud.

info

If service integration is not configured, MySQL is installed as the metastore on master node 1 for the Standard (Single) type and on master node 3 for the HA type.

In Service integration settings, choose whether to install the monitoring agent and configure service integration:

| Category | Description |
| --- | --- |
| Monitoring agent installation | Select whether to install the monitoring agent |
| Service integration | Select among Do not integrate, Data Catalog integration, External Hive Metastore integration, and MemStore integration |

Install Monitoring Agent

When the monitoring agent is installed, additional node monitoring can be viewed under the Hadoop Eco > Cluster Details page > Monitoring tab.

  • CPU usage per node (%)
  • Memory usage per node (%)

Integrate Data Catalog

  1. To integrate with Data Catalog, a catalog must be created in advance. For more details on creating a catalog, refer to Create catalog.

  2. To integrate with Data Catalog, select Data Catalog integration in Service Integration Settings (optional).

    • In the Data Catalog integration section, check the Hadoop network/subnet information and select the desired catalog.

Integrate external Hive metastore

  1. To integrate with an external Hive metastore, a MySQL instance group must be created in advance. For more details, refer to Create MySQL instance group.

  2. To integrate with MySQL, select "External Hive Metastore Integration" in Service Integration Settings (optional).

    1. In the service integration section, select the instance where MySQL is installed.
    2. After selecting the instance, enter the MySQL database name, MySQL ID, and password.
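
The database name and account you enter here must exist on the MySQL instance. If your setup requires creating them in advance, the following is a minimal sketch, assuming the MySQL endpoint is reachable from within the VPC; the host, account names, passwords, and database name are hypothetical placeholders, and your environment may already provide them.

```bash title="Hive metastore DB preparation sketch (illustrative)"
# Hypothetical connection details; replace with your MySQL instance group endpoint and account.
MYSQL_HOST="mysql.example.com"
MYSQL_ADMIN_PW="admin-password"

# Create a database and account for the metastore to use (names are illustrative).
mysql -h "$MYSQL_HOST" -u admin -p"$MYSQL_ADMIN_PW" <<'SQL'
CREATE DATABASE IF NOT EXISTS hive_metastore DEFAULT CHARACTER SET utf8;
CREATE USER IF NOT EXISTS 'hive'@'%' IDENTIFIED BY 'hive-password';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
SQL
```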

Integrate MemStore

info

MemStore integration is only available for Hadoop Eco - Dataflow types. The MemStore integration button will not be displayed for other cluster types.

  1. To integrate with MemStore, create MemStore. For more details on creating MemStore, refer to Create MemStore cluster.

  2. To integrate with MemStore, select MemStore integration in Service Integration Settings (optional).

    1. In the MemStore name field, select the MemStore to integrate.
    2. Depending on whether MemStore cluster mode is used, fields for Superset Cache DB ID and Superset Query Cache DB ID may appear.
      • When cluster mode is used: no additional input fields appear
      • When cluster mode is not used: you can set the Superset Cache DB ID and Superset Query Cache DB ID; if left unset, they are automatically configured to 0 and 1

2. Configure cluster details (optional)

You can configure the HDFS block size, replication factor, and other cluster settings. The HDFS settings take precedence over the cluster configuration settings.

| Category | Description |
| --- | --- |
| HDFS settings | HDFS block size<br/>- Sets the dfs.blocksize value in hdfs-site.xml<br/>- The block size can be set between 1 and 1,024 MB (default: 128 MB)<br/><br/>HDFS replication factor<br/>- Sets the dfs.replication value in hdfs-site.xml<br/>- The replication factor can be set between 1 and 500<br/>- The replication factor must not exceed the number of worker node instances |
| Cluster configuration settings (optional) | Enter the settings for the components that make up the cluster<br/>- Upload a JSON file or enter the settings manually<br/>- For details on Object Storage integration, refer to Integrate with Object Storage |
| Log storage settings | Select whether to use log storage settings |
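
For reference, the console field above expresses the HDFS block size in MB, while dfs.blocksize in hdfs-site.xml is usually given in bytes: the default 128 MB corresponds to 134217728 bytes, and the value 67108864 used in the component settings example below corresponds to 64 MB (64 × 1024 × 1024 bytes).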

Configure cluster - Component settings

info

Cluster configuration settings are provided as a JSON document of key-value pairs. Each entry in the configurations list contains a classification identifying the target configuration file and a properties map of setting names and values. The basic input format is as follows:

```json title="Input format"
{
  "configurations": [
    {
      "classification": "filename",
      "properties": {
        "setting_name": "setting_value"
      }
    }
  ]
}
```

```json title="Example"
{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "dfs.blocksize": "67108864"
      }
    }
  ]
}
```

The configuration file names are categorized into env, xml, properties, and user-env formats.

| Format | Description |
| ------ | ----------- |
| env | The value of the entered key is written into the configuration file; to change a fixed name value, modify the corresponding key. |
| xml | Each key-value pair is transformed into an XML element, with the key as `<name>` and the value as `<value>`. |
| properties | Each key-value pair is transformed into the defined format based on the key name. |
| user-env | Creates user environment variables for the specified `<username>`. |

The user's settings are added to the configuration file generated at the location determined by the classification entered by the user.

#### env \{#env}

| Classification | Location | Setting | Sample Value |
| -------------- | -------- | ------- | ------------ |
| hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048 |
| | | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m" |
| | | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m" |
| | | hadoop_env_hadoop_log_level | INFO, DRFA, console |
| | | hadoop_env_hadoop_security_log_level | INFO, DRFAS |
| | | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT |
| mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000 |
| hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048 |
| | | hive_env_hiveserver2_heapsize | 2048 |
| | | hive_env_hadoop_client_opts | "-Xmx2048m" |
| hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m" |
| | | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m" |
| spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g |
| trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G |


#### xml \{#xml}

| Classification | Location | Reference Location | Notes |
| -------------------- | ----------------- | ------------------------------------------------------------------------------- | ------ |
| core-site | /etc/hadoop/conf | [core-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-common/core-default.xml) | |
| hdfs-site | /etc/hadoop/conf | [hdfs-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml) | |
| httpfs-site | /etc/hadoop/conf | [ServerSetup.html](https://hadoop.apache.org/docs/r2.10.1/hadoop-hdfs-httpfs/ServerSetup.html) | |
| mapred-site | /etc/hadoop/conf | [mapred-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) | |
| yarn-site | /etc/hadoop/conf | [yarn-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml) | |
| capacity-scheduler | /etc/hadoop/conf | [CapacityScheduler.html](https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) | yarn scheduler setup |
| tez-site | /etc/tez/conf | [TezConfiguration.html](https://tez.apache.org/releases/0.9.2/tez-api-javadocs/configs/TezConfiguration.html) | |
| hive-site | /etc/hive/conf | [Hive configuration properties](https://cwiki.apache.org/confluence/display/hive/configuration+properties) | |
| hiveserver2-site | /etc/hive/conf | [Setting up hiveserver2](https://cwiki.apache.org/confluence/display/hive/setting+up+hiveserver2) | hiveserver2-specific settings |

#### properties \{#properties}

| Classification | Location | Description |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| spark-defaults | /etc/spark/conf | Spark configuration values are entered as key[tab]value pairs. |

#### user-env \{#user-env}

| Classification | Location | Description |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| user-env:profile | /etc/profile | Add global variables |
| user-env:[username] | ~/.bashrc | Add environment variables to user's bashrc |


##### xml format \{#xml-format}

```xml title="xml format"
<configuration>

<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>41000-43000</value>
</property>

</configuration>
```

##### env format \{#env-format}

```bash title="env format"
...
export HADOOP_HEAPSIZE="3001"
...
```

##### properties format \{#properties-format}

```properties title="properties format"
spark.driver.memory              4000M
spark.network.timeout            800
```

##### user-env format \{#user-env-format}

```json title="user-env format"
{
  "configurations": [
    {
      "classification": "user-env:profile",
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}
```
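
In the user-env format example above, the user-env:profile entry appends FOO, VAR, and U to /etc/profile as global variables, while the user-env:ubuntu entry appends them to the ubuntu user's ~/.bashrc, following the locations listed in the user-env table.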

##### Sample example \{#sample-example}

The sample below combines the formats described above: mapred-site uses the xml format, hadoop-env the env format, spark-defaults the properties format, and user-env:profile / user-env:ubuntu the user-env format.

```json title="Sample example"
{
  "configurations": [
    {
      "classification": "mapred-site",
      "properties": {
        "yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
      }
    },
    {
      "classification": "hadoop-env",
      "properties": {
        "hadoop_env_hadoop_heapsize": 3001,
        "hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m"
      }
    },
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.driver.memory": "4000M",
        "spark.network.timeout": "800s"
      }
    },
    {
      "classification": "user-env:profile",
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}
```

Log storage settings

When using log storage settings, you need to configure the log storage path.

  1. Set Log storage settings to use.

  2. Choose the Object Storage bucket for log storage.

  3. After checking the path, if you wish to modify it, enter the new path.

caution
  • Deleting the stored logs may cause the Spark History Server to malfunction.

3. Configure job scheduling (optional)

If you select Core Hadoop as the cluster type in Step 1: Configure cluster, proceed to configure job scheduling.
When configuring job scheduling, choose Hive job, Spark job, or None. If you select Hive job or Spark job, you can configure the corresponding job schedule; if you select None, no job scheduling is configured.

  • When configuring job scheduling, click the [Refresh] icon to fetch Object Storage bucket information.
  • Click Management Page to navigate to the Object Storage service.

Configure Hive job scheduling

Configure the scheduling for Hive jobs.

info

When selecting a bucket in Hive options, Storage Object Managers and Storage Object Creators can upload objects, but they do not have access to view objects in the Object Storage bucket in the console. However, objects can be accessed through the Object Storage API.

  1. In Step 3: Job Scheduling Configuration, select Hive job as the job type.

  2. Enter the scheduling information for the Hive job.

| Category | Description |
| --- | --- |
| Job type | Hive job: execute a Hive job after cluster creation |
| Execution file | Execution file type<br/>- File: select a file to execute from an Object Storage bucket; only .hql files are allowed<br/>- Text: write Hive queries to execute the job |
| Hive options | Provide options for the job (refer to Hive options)<br/>- File: select an option file from Object Storage<br/>- Text: write Hive option values for the job |
| Job completion action | Select the action when the job finishes<br/>- Wait on failure: shut down the cluster only if the job succeeds<br/>- Always wait: do not shut down the cluster regardless of job success or failure<br/>- Always shutdown: shut down the cluster regardless of job success or failure |
| Scheduling log file storage | Choose whether to save scheduling log files<br/>- Do not save: do not save scheduling logs<br/>- Save to Object Storage: save logs to a selected bucket<br/>* Logs will be stored in bucket-name/log/ in yyyy-mm-dd.log format |

Hive options

Hive options refer to Hive configuration properties used when executing a Hive job.
These can be written as follows, and further details about Hive configuration properties can be found in the official documentation.

```
--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m
```

Configure Spark job scheduling

Configure the scheduling for Spark jobs.

  1. In Step 3: Job Scheduling Configuration, select Spark job as the job type.

  2. Enter the scheduling information for the Spark job.

| Category | Description |
| --- | --- |
| Job type | Spark job: execute a Spark job after cluster creation |
| Execution file | Select a file to execute from an Object Storage bucket<br/>- Only .jar files are allowed |
| Spark options (optional) | Provide options for the job |
| Arguments (optional) | Provide arguments to be passed to the .jar file<br/>- File: select an argument file from Object Storage<br/>- Text: write arguments for the job |
| Deployment mode | Choose the mode in which Spark runs<br/>- Choose between client and cluster |
| Job completion action | Select the action when the job finishes<br/>- Wait on failure: shut down the cluster only if the job succeeds<br/>- Always wait: do not shut down the cluster regardless of job success or failure<br/>- Always shutdown: shut down the cluster regardless of job success or failure |
| Scheduling log file storage | Choose whether to save scheduling log files<br/>- Do not save: do not save scheduling logs<br/>- Save to Object Storage: save logs to a selected bucket<br/>* Logs will be stored in bucket-name/log/ in yyyy-mm-dd.log format |

Spark options

Spark options refer to configuration settings to be passed when executing Spark job files using spark-submit.
They can be written as follows, and further details can be found in the official documentation.

info

Note that including the --deploy-mode option here may cause errors. Since the deploy mode can be selected in the UI, use the UI to select it.

```
--class org.apache.spark.examples.SparkPi --master yarn
```

4. Configure security details (optional)

Apply cluster security features using Kerberos and Ranger.

| Category | Description |
| --- | --- |
| Kerberos | If you want to install it, select Install and enter the following information.<br/><br/>Kerberos Realm Name<br/>- Only uppercase English letters, numbers, and dots (.) are allowed (1-50 characters)<br/><br/>KDC (Key Distribution Center) Password<br/>- Automatically set to the admin password set in the cluster setup step |
| Ranger | If you want to install it, select Install.<br/><br/>Ranger Password<br/>- Automatically set to the admin password set in the cluster setup step |
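
For reference, the realm name and admin credentials configured here are what you would later use to obtain Kerberos tickets on the cluster nodes. A minimal sketch, assuming the admin principal is formed from the admin ID and the realm entered above (both names below are illustrative):

```bash title="Kerberos ticket sketch (illustrative)"
# Obtain and inspect a Kerberos ticket on a cluster node (principal is hypothetical).
kinit admin@HADOOP.LOCAL
klist
```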