
Create cluster

A cluster is a group of nodes provisioned using virtual machines.
The steps to create a cluster in the Hadoop Eco service are as follows:

info

Cluster creation takes approximately 20–25 minutes.

Step 1. Configure cluster

Set basic information for creating a Hadoop cluster.

  1. Go to the KakaoCloud console > Analytics > Hadoop Eco.

  2. Click the [Create cluster] button located at the top right of the Cluster menu.

  3. In Step 1: Configure cluster, enter the required information and click the [Next] button.

    Cluster name
    - Example: my-cluster-01
    - Duplicate cluster names cannot be used within the same project
    - VMs are created based on the cluster name (see the naming sketch after this list)
      - Master node: HadoopMST-{cluster name}-{number} format
      - Worker node: HadoopWRK-{cluster name}-{number} format
    Cluster setup
    - Cluster version: select the cluster version
    - Cluster type: select the cluster type based on the version
      - For details, refer to Cluster version and type
    - Cluster availability: provides Standard and High Availability types for operational stability
      - Standard (Single, 1 master node instance): runs one ResourceManager and one NameNode
        - Suitable for small workloads with a single master node
      - High Availability (HA, 3 master node instances): runs the ResourceManager and NameNode in HA mode
        - Ensures uninterrupted operation even during a reboot or failure
    Admin setup
    - Admin ID: enter the admin ID
    - Admin password: enter the admin password
      - For password reset instructions, refer to Reset Hue password
      - When Ranger is applied, specific password creation rules apply; for details, refer to Ranger
    - Confirm password: re-enter the admin password
    VPC setup
    - Select the VPC and Subnet
    - Click the [Refresh] icon to retrieve network information
    - Click Management page to navigate to VPC > Network
    - Public IPs accessible from outside can be assigned after the instance is created, via the [Public IP connection] menu
    Security group
    - Create new security group: enter a security group name to create a new one
      - Automatically configures inbound/outbound rules for Hadoop Eco
    - Select existing security group: verify its inbound/outbound rules
      - Click the [Refresh] icon to update network information
      - For details, refer to Security group
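
For reference, the node-name patterns above can be previewed for a given cluster name. A minimal Python sketch, for illustration only (the cluster name, node counts, and 1-based numbering are assumptions, not guaranteed by the service):

# Illustrative preview of master/worker VM names following the patterns above.
cluster_name = "my-cluster-01"     # example cluster name from the list above
master_count, worker_count = 3, 4  # assumed HA cluster with 4 worker nodes

masters = [f"HadoopMST-{cluster_name}-{i}" for i in range(1, master_count + 1)]
workers = [f"HadoopWRK-{cluster_name}-{i}" for i in range(1, worker_count + 1)]
print(masters)  # ['HadoopMST-my-cluster-01-1', 'HadoopMST-my-cluster-01-2', 'HadoopMST-my-cluster-01-3']
print(workers)  # ['HadoopWRK-my-cluster-01-1', ..., 'HadoopWRK-my-cluster-01-4']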

Configure security group

To install a cluster in the Hadoop Eco service, the ports used by its components must be opened in the security group. You can preview the policies applied to the instances in the applied policy list.

  • When a security group is automatically created, the default security group name is generated based on the creation date, cluster version, and type.
  • Example security group name: {cluster_name}-HDE-{version}-{type}
  • Only one security group can be set when creating an instance. Additional security group configurations can be made after creation.

When creating a cluster, the security group is automatically generated. The port information configured in the automatically generated security group is as follows:

Protocol | Source | Port number | Policy description
ALL | VPC subnet CIDR | ALL | Hadoop Eco internal
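
If you attach an existing security group instead of the automatically generated one, you may want to confirm that the required component ports are reachable between nodes before installing. A minimal Python sketch under assumed values (the IP address and port below are placeholders, not service defaults):

# Checks whether a TCP connection to a node and port succeeds within a timeout.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("10.0.0.10", 8088))  # e.g., a node's private IP and the YARN ResourceManager web UI port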

Cluster version and type

Hadoop Eco 1.0.1 (Trino, Dataflow not supported)
- Core Hadoop: Hadoop 2.10.1 HDFS, YARN, Spark 2.4.6
  - Installs the Apache Spark and Hive analytics engines integrated with Hadoop
- HBase: Hadoop 2.10.1, HBase 1.4.13
  - Installs Apache HBase, a distributed database based on Hadoop
Hadoop Eco 1.1.2
- Core Hadoop: Hadoop 2.10.2 HDFS, YARN, Spark 2.4.8
- HBase: Hadoop 2.10.2, HDFS, YARN, HBase 1.7.1
- Trino: Hadoop 2.10.2, HDFS, YARN, Trino 377
- Dataflow: Hadoop 2.10.2, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1
Hadoop Eco 2.0.1
- Core Hadoop: Hadoop 3.3.4 HDFS, YARN, Spark 3.2.2
- HBase: Hadoop 3.3.4 HDFS, YARN, HBase 2.4.13
- Trino: Hadoop 3.3.4, HDFS, YARN, Trino 393
- Dataflow: Hadoop 3.3.4, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1
Hadoop Eco 2.1.0
- Core Hadoop: Hadoop 3.3.6 HDFS, YARN, Spark 3.5.2
- HBase: Hadoop 3.3.6 HDFS, YARN, HBase 2.6.0
- Trino: Hadoop 3.3.6, HDFS, YARN, Trino 393
- Dataflow: Hadoop 3.3.6, HDFS, YARN, Kafka 3.8.0, Druid 25.0.0, Superset 2.1.1
info

Hadoop Eco versions 1.0.0, 1.1.0, 1.1.1, and 2.0.0 are not supported.

Step 2. Configure instance

Configure master and worker instances, storage, and network.

Enter the required information in Step 2: Configure instance and click the [Next] button.

  • After cluster creation, instance and disk volume settings cannot be changed.
  • Adding master/worker instances and disk volumes will be supported in the future.
Master node setup
- Master node instance count: fixed based on the cluster availability type
  - Standard (Single) type: 1 instance
  - HA type: 3 instances
- Master node instance type: select from the supported instance types
  - Hardware configuration depends on the selected instance type
- Disk volume type/size
  - Volume type: currently, only the SSD type is supported (other types will be supported later)
  - Volume size: 50–5,120 GB
Worker node setup
- Worker node instance count: choose the number of worker nodes based on purpose; the total count depends on the project quota
- Worker node instance type: select from the supported instance types
  - Hardware configuration depends on the selected instance type
- Disk volume type/size
  - Volume type: currently, only the SSD type is supported (other types will be supported later)
  - Volume size: 50–5,120 GB
Total YARN usage
- YARN Core: calculated as 'Number of worker nodes × vCPUs per node'
- YARN Memory: calculated as 'Number of worker nodes × Memory per node × YARN allocation ratio (0.8)'
- See the calculation sketch after this list
Key pair
- Select a key pair to apply to the instances
- Choose an existing key pair or create a new one in KakaoCloud
  - For new key pair creation, refer to Create new key pair
- Click the [Refresh] icon to update key pair information
- Click Management page to navigate to Virtual Machine > Key pair
User script (optional)
- A script that automatically configures the initial environment by running user data during instance startup
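
The Total YARN usage figures follow directly from the two formulas above. A minimal Python sketch with assumed instance specs (the vCPU and memory values are placeholders; use the actual specifications of your selected worker instance type):

# Rough calculation of the Total YARN usage values described above.
worker_nodes = 4
vcpus_per_node = 8           # assumed vCPUs of the worker instance type
memory_gb_per_node = 32      # assumed memory (GB) of the worker instance type
yarn_allocation_ratio = 0.8  # ratio stated above

yarn_core = worker_nodes * vcpus_per_node
yarn_memory_gb = worker_nodes * memory_gb_per_node * yarn_allocation_ratio
print(f"YARN Core: {yarn_core}, YARN Memory: {yarn_memory_gb:.1f} GB")  # YARN Core: 32, YARN Memory: 102.4 GB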

Create new key pair

Follow these steps to create a key pair in Create cluster:

  1. Select Create new key pair and enter a key pair name.

  2. Click the [Create and download key pair] button.

  3. A private key file with the .pem extension, named after the entered key pair name, will be downloaded.

info

Keep the downloaded private key file secure.

Step 3. Detailed settings (optional)

1. Configure service integration (optional)

Apply settings for cluster service integration. You can configure integration with the Data Catalog service provided by KakaoCloud.

info

If service integration is not configured, Standard (Single) type will install MySQL on Master Node 1, and HA type will install MySQL on all three master nodes for use as a metastore.

In Service integration settings, select whether to install the Monitoring agent and configure the integration settings.

Item | Description
Monitoring agent installation | Choose whether to install the monitoring agent
Service integration | Choose between No integration, Data Catalog integration, or External Hive metastore integration

Install monitoring agent

When the monitoring agent is installed, you can view additional monitoring data under Hadoop Eco > Cluster details > Monitoring tab.

  • CPU usage (%) per node
  • Memory usage (%) per node

Data Catalog integration

  1. Prepare a pre-created Data Catalog for integration. For detailed instructions on creating a catalog, refer to Create catalog.

  2. In Service integration settings (optional), select Data Catalog integration.

    • In the Data Catalog integration section, verify the Hadoop network/Subnet information and select the desired catalog.

External Hive metastore integration

  1. Create a MySQL instance for external Hive metastore integration. For detailed instructions on creating MySQL, refer to Create MySQL instance group.

  2. In Service integration settings (optional), select External Hive metastore integration.

    1. In the service integration section, select the Instance where MySQL is installed.
    2. After selecting the instance, enter the Database name, MySQL ID, and Password for the MySQL database.
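
Before entering these values, you can optionally verify that the MySQL database is reachable with the credentials you plan to use. A minimal Python sketch, assuming the PyMySQL package is installed on a host that can reach the instance; the host, database name, and credentials below are placeholders:

# Optional pre-check (not part of the console flow): connect to the external
# Hive metastore database and print the server version.
import pymysql

conn = pymysql.connect(
    host="10.0.0.10",            # private IP of the MySQL instance
    port=3306,
    user="metastore_user",
    password="your-password",
    database="hive_metastore",
    connect_timeout=5,
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print("Connected to MySQL:", cur.fetchone()[0])
conn.close()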

2. Configure cluster details (optional)

You can apply HDFS block size, replication settings, and cluster configuration settings.
HDFS configuration values take precedence over cluster configuration settings.

HDFS settings
- HDFS block size
  - Sets the dfs.blocksize value in hdfs-site.xml
  - Block size range: 1–1,024 MB (default: 128 MB)
- HDFS replication count
  - Sets the dfs.replication value in hdfs-site.xml
  - Replication count range: 1–500
  - The replication count must not exceed the number of worker node instances
Cluster configuration settings (optional)
- Enter settings for cluster components
- Upload a JSON file or enter configuration values directly
- For Object Storage integration, refer to Integrate with Object Storage
Cluster configuration settings

Cluster configuration settings are provided as key-value pairs in JSON format. The configurations field is a list in which classification specifies the configuration file name and properties contains the configuration names and values.
The default input method is as follows:

JSON file configuration
-- Input method
{
  "configurations": [
    {
      "classification": "Configuration file name",
      "properties": {
        "Setting name": "Setting value"
      }
    }
  ]
}

-- Example
{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "dfs.blocksize": "67108864"
      }
    }
  ]
}
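
If you prefer to generate the configuration file rather than write it by hand, the same structure can be assembled programmatically. A minimal Python sketch (not an official tool; the output file name is arbitrary) that reproduces the example above:

# Builds the cluster configuration JSON shown above and writes it to a file
# that can be uploaded in the console.
import json

configurations = {
    "configurations": [
        {
            "classification": "core-site",
            "properties": {"dfs.blocksize": "67108864"},  # same example value as above
        }
    ]
}

with open("cluster-config.json", "w") as f:
    json.dump(configurations, f, indent=2)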

Settings can be grouped into env, xml, properties, and user-env formats based on the configuration file name.

Format | Description
env | The value of the entered key is converted into the configuration file. The key itself remains unchanged; only the predefined value for that key is updated.
xml | The entered key-value pairs are converted into XML elements, where the key becomes <name> and the value becomes <value>.
properties | The entered key-value pairs are converted into a predefined format based on the key name.
user-env | Adds user environment variables for the user specified as user-env:<username>.

User settings are applied to the appropriate configuration file location based on the specified classification.

env

Classification | Location | Configuration | Sample value
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m"
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m"
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m"
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m"
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m"
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m"
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_log_level | INFO, DRFA, console
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_security_log_level | INFO, DRFAS
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT
mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000
hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048
hive-env | /etc/hive/conf | hive_env_hiveserver2_heapsize | 2048
hive-env | /etc/hive/conf | hive_env_hadoop_client_opts | "-Xmx2048m"
hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m"
hbase-env | /etc/hbase/conf | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m"
spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g
trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G

xml

Classification | Location | Reference location | Remark
core-site | /etc/hadoop/conf | core-default.xml
hdfs-site | /etc/hadoop/conf | hdfs-default.xml
httpfs-site | /etc/hadoop/conf | ServerSetup.html
mapred-site | /etc/hadoop/conf | mapred-default.xml
yarn-site | /etc/hadoop/conf | yarn-default.xml
capacity-scheduler | /etc/hadoop/conf | CapacityScheduler.html | YARN scheduler settings
tez-site | /etc/tez/conf | TezConfiguration.html
hive-site | /etc/hive/conf | Hive configuration properties
hiveserver2-site | /etc/hive/conf | Setting up HiveServer2 | Only for HiveServer2

properties

Classification | Location | Description
spark-defaults | /etc/spark/conf | Spark configuration values are converted into key[tab]value format.

user-env

Classification | Location | Description
user-env:profile | /etc/profile | Adds global environment variables
user-env:[username] | ~/.bashrc | Adds environment variables to the user's .bashrc
xml format
<configuration>
  <property>
    <name>yarn.app.mapreduce.am.job.client.port-range</name>
    <value>41000-43000</value>
  </property>
</configuration>
env format
...
export HADOOP_HEAPSIZE="3001"
...
properties format
spark.driver.memory              4000M
spark.network.timeout 800
user-env format
{
  "configurations": [
    {
      "classification": "user-env:profile",
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}

Sample examples
{
  "configurations": [
    {
      "classification": "mapred-site",      -- xml format
      "properties": {
        "yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
      }
    },
    {
      "classification": "hadoop-env",       -- env format
      "properties": {
        "hadoop_env_hadoop_heapsize": 3001,
        "hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m"
      }
    },
    {
      "classification": "spark-defaults",   -- properties format
      "properties": {
        "spark.driver.memory": "4000M",
        "spark.network.timeout": "800s"
      }
    },
    {
      "classification": "user-env:profile", -- user-env format
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}

3. Configure job scheduling (optional)

If Core Hadoop was selected as the Cluster type in Step 1: Configure cluster, you can proceed with job scheduling settings.

During job scheduling configuration, select one of the following job types: Hive job, Spark job, or No configuration.
If Hive job or Spark job is selected, you can configure job scheduling. If No configuration is selected, job scheduling will not be set.

  • When configuring job scheduling, click the [Refresh] icon to fetch Object Storage bucket information.
  • Click Management page to navigate to the Object Storage service.

Configure Hive job scheduling

Configure scheduling for Hive jobs.

info

When selecting a bucket in Hive options, users with the Storage object manager or Storage object creator role can upload objects but cannot view objects in the bucket through the console. However, they can still access the objects using the Object Storage API.

  1. In Step 3: Configure job scheduling, select Hive job as the Job type.

  2. Enter the scheduling information for the Hive job.

    Job type
    - Hive job: Execute Hive tasks after cluster creation.
    Execution file
    - Execution file type
      - File: Select an Object Storage bucket and register the execution file (only .hql files are supported).
      - Text: Write Hive queries directly for execution.
    Hive options
    - Enter option values to be passed to the job (refer to Hive options below).
      - File: Select an Object Storage bucket and register a Hive option file.
      - Text: Write Hive option values directly.
    Job termination action
    - Select the action upon job termination.
      - Wait on failure: Only terminate the cluster if the job succeeds.
      - Always wait: Do not terminate the cluster regardless of success or failure.
      - Always terminate: Terminate the cluster on both success and failure.
    Save scheduling log file
    - Choose whether to save scheduling log files.
      - Do not save: Do not save scheduling logs.
      - Save to Object Storage: Save scheduling log files to the desired bucket.
        * Logs are stored in the selected bucket_name/log/ path in yyyy-mm-dd.log format.

Hive options

Hive options refer to Hive configuration properties used during Hive job execution.
They can be written as follows. For detailed information about Hive configuration properties, refer to the official documentation.

--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m
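
The option string can also be assembled from a plain mapping of Hive configuration properties. A minimal Python sketch, for illustration only (the property values are the same example values as above):

# Builds a --hiveconf option string in the format shown above.
hive_conf = {
    "hive.tez.container.size": "2048",
    "hive.tez.java.opts": "-Xmx1600m",
}
options = " ".join(f"--hiveconf {key}={value}" for key, value in hive_conf.items())
print(options)  # --hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m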

Configure Spark job scheduling

Configure scheduling for Spark jobs.

  1. In Step 3: Configure job scheduling, select Spark job as the Job type.

  2. Enter the scheduling information for the Spark job.

    Job type
    - Spark job: Execute Spark tasks after cluster creation.
    Execution file
    - Select an Object Storage bucket and register the execution file.
    - Only files with the .jar extension can be registered.
    Spark options (optional)
    - Write options to pass to the job.
    - Refer to Spark options.
    Arguments (optional)
    - Write arguments to pass to the .jar file being executed.
      - File: Select an Object Storage bucket and register the argument file.
      - Text: Write arguments directly.
    Deployment mode
    - Select the mode to run Spark.
    - Choose between client or cluster.
    Job termination action
    - Select the action upon job termination.
      - Wait on failure: Only terminate the cluster if the job succeeds.
      - Always wait: Do not terminate the cluster regardless of success or failure.
      - Always terminate: Terminate the cluster on both success and failure.
    Save scheduling log file
    - Choose whether to save scheduling log files.
      - Do not save: Do not save scheduling logs.
      - Save to Object Storage: Save scheduling log files to the desired bucket.
        * Logs are stored in the selected bucket_name/log/ path in yyyy-mm-dd.log format.

Spark options

Spark options refer to configuration values passed when executing Spark job files using spark-submit.
They can be written as shown below, and for more details on configuration values, refer to the official documentation.

info

Note that an error may occur if the --deploy-mode option is included in the Spark options. Because the deployment mode is selected in the interface, use the Deployment mode setting instead.

--class org.apache.spark.examples.SparkPi --master yarn