Create cluster
A cluster is a group of nodes provisioned using virtual machines.
The steps to create a cluster in the Hadoop Eco service are as follows:
Cluster creation takes approximately 20–25 minutes.
Step 1. Configure cluster
Set basic information for creating a Hadoop cluster.
- Go to the KakaoCloud console > Analytics > Hadoop Eco.
- Click the [Create cluster] button located at the top right of the Cluster menu.
- In Step 1: Configure cluster, enter the required information and click the [Next] button.
Item | Category | Description |
---|---|---|
Cluster name | | Example: my-cluster-01 - Duplicate cluster names cannot be used within the same project - VMs are created based on the cluster name - Master node: HadoopMST-{cluster name}-{number} format - Worker node: HadoopWRK-{cluster name}-{number} format |
Cluster setup | Cluster version | Select the cluster version |
Cluster setup | Cluster type | Select the cluster type based on the version - For details, refer to Cluster version and type |
Cluster setup | Cluster availability | Provides standard and high-availability types for operational stability - Standard (Single, 1 master node instance): Runs one resource manager and one name node; suitable for small workloads with a single master node - High Availability (HA, 3 master node instances): Runs the resource manager and name node in HA mode; ensures uninterrupted operations even during reboot or failure |
Admin setup | Admin ID | Enter the admin ID |
Admin setup | Admin password | Enter the admin password - For password reset instructions, refer to Reset Hue password - When Ranger is applied, follow specific password creation rules; for details, refer to Ranger |
Admin setup | Confirm password | Re-enter the admin password |
VPC setup | VPC | Select the VPC and Subnet - Click the [Refresh] icon to retrieve network information - Click Management page to navigate to VPC > Network - Public IPs accessible from outside can be assigned after creating the instance via the [Public IP connection] menu |
VPC setup | Security group | Create new security group: Enter a security group name to create a new one - Automatically configures inbound/outbound rules for Hadoop Eco; Select existing security group: Verify inbound/outbound rules - Click the [Refresh] icon to update network information - For details, refer to Security group |
Configure security group
To create a cluster in the Hadoop Eco service, the ports used by its components must be configured in the security group; certain ports must be open for the components to be set up. You can preview the policies applied to the instances in the applied policy list.
- When a security group is automatically created, the default security group name is generated based on the creation date, cluster version, and type.
- Example security group name: {cluster_name}-HDE-{version}-{type}
- Only one security group can be set when creating an instance. Additional security group configurations can be made after creation.
When creating a cluster, the security group is automatically generated. The port information configured in the automatically generated security group is as follows:
- Inbound policy
Protocol | Source | Port number | Policy description |
---|---|---|---|
ALL | VPC subnet CIDR | ALL | hadoop eco internal |
- Outbound policy
Outbound is set to 'Allow all.'
Cluster version and type
Cluster version | Cluster type | Option |
---|---|---|
Hadoop Eco 1.0.1 (Trino, Dataflow not supported) | Core Hadoop | Core Hadoop (Hadoop 2.10.1 HDFS, YARN, Spark 2.4.6) - Installs Apache Spark and Hive analytics engines integrated with Hadoop |
Hadoop Eco 1.0.1 | HBase | HBase (Hadoop 2.10.1, HBase 1.4.13) - Installs Apache HBase, a distributed database based on Hadoop |
Hadoop Eco 1.1.2 | Core Hadoop | Core Hadoop (Hadoop 2.10.2 HDFS, YARN, Spark 2.4.8) |
Hadoop Eco 1.1.2 | HBase | HBase (Hadoop 2.10.2, HDFS, YARN, HBase 1.7.1) |
Hadoop Eco 1.1.2 | Trino | Trino (Hadoop 2.10.2, HDFS, YARN, Trino 377) |
Hadoop Eco 1.1.2 | Dataflow | Dataflow (Hadoop 2.10.2, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
Hadoop Eco 2.0.1 | Core Hadoop | Core Hadoop (Hadoop 3.3.4 HDFS, YARN, Spark 3.2.2) |
Hadoop Eco 2.0.1 | HBase | HBase (Hadoop 3.3.4 HDFS, YARN, HBase 2.4.13) |
Hadoop Eco 2.0.1 | Trino | Trino (Hadoop 3.3.4, HDFS, YARN, Trino 393) |
Hadoop Eco 2.0.1 | Dataflow | Dataflow (Hadoop 3.3.4, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
Hadoop Eco 2.1.0 | Core Hadoop | Core Hadoop (Hadoop 3.3.6 HDFS, YARN, Spark 3.5.2) |
Hadoop Eco 2.1.0 | HBase | HBase (Hadoop 3.3.6 HDFS, YARN, HBase 2.6.0) |
Hadoop Eco 2.1.0 | Trino | Trino (Hadoop 3.3.6, HDFS, YARN, Trino 393) |
Hadoop Eco 2.1.0 | Dataflow | Dataflow (Hadoop 3.3.6, HDFS, YARN, Kafka 3.8.0, Druid 25.0.0, Superset 2.1.1) |
Hadoop Eco versions 1.0.0, 1.1.0, 1.1.1, and 2.0.0 are not supported.
Step 2. Configure instance
Configure master and worker instances, storage, and network.
Enter the required information in Step 2: Configure instance and click the [Next] button.
- After cluster creation, instance and disk volume settings cannot be changed.
- Adding master/worker instances and disk volumes will be supported in the future.
Item | Category | Description |
---|---|---|
Master node setup | Master node instance count | Fixed based on cluster availability type - Standard (Single) type: 1 instance - HA type: 3 instances |
Master node setup | Master node instance type | Select from supported instance types - Hardware configuration depends on the selected instance type |
Master node setup | Disk volume type/size | - Volume type: Currently, only SSD type is supported (other types will be supported later) - Volume size: 50–5,120GB |
Worker node setup | Worker node instance count | Choose the number of worker nodes based on purpose; total count depends on project quota |
Worker node setup | Worker node instance type | Select from supported instance types - Hardware configuration depends on the selected instance type |
Worker node setup | Disk volume type/size | - Volume type: Currently, only SSD type is supported (other types will be supported later) - Volume size: 50–5,120GB |
Total YARN usage | YARN Core | Calculated as 'Number of worker nodes × vCPUs per node' (see the worked example below this table) |
Total YARN usage | YARN Memory | Calculated as 'Number of worker nodes × Memory per node × YARN allocation ratio (0.8)' |
Key pair | | Select a key pair to apply to instances - Choose from existing or newly created key pairs in KakaoCloud - For new key pair creation, refer to Create new key pair - Click the [Refresh] icon to update key pair information * Click Management page to navigate to Virtual Machine > Key pair |
User script (optional) | | A script that automatically configures the initial environment by running user data during instance startup (see the sketch below this table) |
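As a worked example of the Total YARN usage values above: with 5 worker nodes that each provide 8 vCPUs and 32 GB of memory (illustrative figures, not a recommendation), YARN Core = 5 × 8 = 40 cores and YARN Memory = 5 × 32 GB × 0.8 = 128 GB.
The user script runs as user data when each instance starts, so it can prepare the initial environment before any jobs run. Below is a minimal sketch of such a script, assuming a shell-based user-data format and an Ubuntu-based image; the variable and package names are placeholders, not values required by Hadoop Eco.
#!/bin/bash
# Illustrative initial-environment setup executed at instance startup
set -e
# Register a system-wide environment variable for later jobs (placeholder name/value)
echo 'export MY_ENV=production' >> /etc/profile
# Install an extra OS package that jobs may depend on (placeholder package)
apt-get update -y && apt-get install -y jq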
Create new key pair
Follow these steps to create a key pair in Create cluster:
- Select Create new key pair and enter a key pair name.
- Click the [Create and download key pair] button.
- A private key file with the .pem extension, named after the entered key pair name, will be downloaded (see the connection example below).
Keep the downloaded private key file secure.
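On a local macOS or Linux machine, the downloaded key can be protected and then used to reach a master node over SSH once a public IP has been attached. The key file name, login account, and address below are placeholders; the actual login account depends on the cluster image (ubuntu is assumed here).
# Restrict permissions on the downloaded private key (most SSH clients require this)
chmod 400 my-cluster-key.pem
# Connect to a master node through its public IP (placeholder account and address)
ssh -i my-cluster-key.pem ubuntu@<master-node-public-ip>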
Step 3. Detailed settings (optional)
1. Configure service integration (optional)
Apply settings for cluster service integration. You can configure integration with the Data Catalog service provided by KakaoCloud.
If service integration is not configured, Standard (Single) type will install MySQL on Master Node 1, and HA type will install MySQL on all three master nodes for use as a metastore.
In Service integration settings, select whether to install the Monitoring agent and configure the integration settings.
Item | Description |
---|---|
Monitoring agent installation | Choose whether to Install monitoring agent |
Service integration | Choose between No integration, Data Catalog integration, or External Hive metastore integration |
Install monitoring agent
When the monitoring agent is installed, you can view additional monitoring data under Hadoop Eco > Cluster details > Monitoring tab.
- CPU usage (%) per node
- Memory usage (%) per node
Data Catalog integration
- Prepare a pre-created Data Catalog for integration. For detailed instructions on creating a catalog, refer to Create catalog.
- In Service integration settings (optional), select Data Catalog integration.
- In the Data Catalog integration section, verify the Hadoop network/Subnet information and select the desired catalog.
External Hive metastore integration
- Create a MySQL instance for external Hive metastore integration. For detailed instructions on creating MySQL, refer to Create MySQL instance group.
- In Service integration settings (optional), select External Hive metastore integration.
- In the service integration section, select the Instance where MySQL is installed.
- After selecting the instance, enter the Database name, MySQL ID, and Password for the MySQL database.
2. Configure cluster details (optional)
You can apply HDFS block size, replication settings, and cluster configuration settings.
HDFS configuration values take precedence over cluster configuration settings.
Item | Description |
---|---|
HDFS settings | HDFS block size - Set the dfs.blocksize value in hdfs-site.xml - Set the block size between 1 and 1,024 MB (default: 128 MB); HDFS replication count - Set the dfs.replication value in hdfs-site.xml - Set the replication count between 1 and 500 - The replication count must not exceed the number of worker node instances (see the verification example below this table) |
Cluster configuration settings (optional) | Enter settings for cluster components - Upload a JSON file or directly input configuration values - For Object Storage integration, refer to Integrate with Object Storage |
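Once the cluster is running, the HDFS values applied by these settings can be checked from a shell on a cluster node. The commands below are standard Hadoop CLI calls and assume SSH access to a master node.
# Print the effective HDFS block size (in bytes) and replication factor
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication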
Cluster configuration settings are provided as key-value pairs in JSON format. The configurations field is a list, with classification specifying the configuration file name and properties containing the configuration names and values.
The default input method is as follows:
-- Input method
{
"configurations":
[
{
"classification": "Configuration file name",
"properties": {
"Setting name": "Setting value"
}
}
]
}
-- Example
{
"configurations":
[
{
"classification": "core-site",
"properties": {
"dfs.blocksize": "67108864"
}
}
]
}
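Because the configuration is supplied either as an uploaded JSON file or as pasted text, it can help to validate the document locally before creating the cluster. The checks below use common command-line tools (python3 or jq, whichever is installed); the file name is a placeholder.
# Either command prints the parsed JSON, or an error if the file is malformed
python3 -m json.tool cluster-config.json
jq . cluster-config.json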
Settings are categorized into env, xml, properties, and user-env formats based on the configuration file name.
Format | Description |
---|---|
env | The value of the entered key is converted into the configuration file. The key remains unchanged, and only the predefined value for the key is updated. |
xml | The entered key-value pairs are converted into XML elements, where the key is represented as <name> and the value as <value> . |
properties | The entered key-value pairs are converted into a predefined format based on the key name. |
user-env | Adds user environment variables; the target user is specified in the classification as user-env:<username>. |
User settings are applied to the appropriate configuration file location based on the specified classification.
env
Classification | Location | Configuration | Sample value |
---|---|---|---|
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048 |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m" |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m" |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m" |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m" |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m" |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m" |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_log_level | INFO, DRFA, console |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_security_log_level | INFO, DRFAS |
hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT |
mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000 |
hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048 |
hive-env | /etc/hive/conf | hive_env_hiveserver2_heapsize | 2048 |
hive-env | /etc/hive/conf | hive_env_hadoop_client_opts | "-Xmx2048m" |
hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m" |
hbase-env | /etc/hbase/conf | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m" |
spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g |
trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G |
xml
classification | Location | Reference location | Remark |
---|---|---|---|
core-site | /etc/hadoop/conf | core-default.xml | |
hdfs-site | /etc/hadoop/conf | hdfs-default.xml | |
httpfs-site | /etc/hadoop/conf | ServerSetup.html | |
mapred-site | /etc/hadoop/conf | mapred-default.xml | |
yarn-site | /etc/hadoop/conf | yarn-default.xml | |
capacity-scheduler | /etc/hadoop/conf | CapacityScheduler.html | YARN scheduler settings |
tez-site | /etc/tez/conf | TezConfiguration.html | |
hive-site | /etc/hive/conf | Hive configuration properties | |
hiveserver2-site | /etc/hive/conf | Setting up hiveserver2 | Only for hiveserver2 |
properties
classification | Location | Description |
---|---|---|
spark-defaults | /etc/spark/conf | Spark configuration values are converted into key[tab]value format during input. |
user-env
classification | Location | Description |
---|---|---|
user-env:profile | /etc/profile | Add global environment variables |
user-env:[username] | ~/.bashrc | Add environment variables to the user's bashrc |
xml format
<configuration>
<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>41000-43000</value>
</property>
</configuration>
env format
...
export HADOOP_HEAPSIZE="3001"
...
properties format
spark.driver.memory 4000M
spark.network.timeout 800
user-env format
{
"configurations": [
{
"classification": "user-env:profile",
"properties": {
"env": "FOO=profile\nVAR=profile\nexport U=profile"
}
},
{
"classification": "user-env:ubuntu",
"properties": {
"env": "FOO=foo\nVAR=var\nexport U=N"
}
}
]
}
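Based on the example above, the entries would be appended to the target files roughly as follows (an illustration of the expected result inferred from the table above, not captured output; exact placement within each file may differ).
# Appended to /etc/profile (classification user-env:profile)
FOO=profile
VAR=profile
export U=profile
# Appended to the ubuntu user's ~/.bashrc (classification user-env:ubuntu)
FOO=foo
VAR=var
export U=N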
Sample examples
{
"configurations":
[
{
"classification": "mapred-site", -- xml format
"properties":
{
"yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
}
},
{
"classification": "hadoop-env", -- env format
"properties":
{
"hadoop_env_hadoop_heapsize": 3001,
"hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m",
}
},
{
"classification": "spark-defaults", -- properties format
"properties":
{
"spark.driver.memory": "4000M",
"spark.network.timeout": "800s"
}
},
{
"classification": "user-env:profile", -- user-env format
"properties":
{
"env": "FOO=profile\nVAR=profile\nexport U=profile"
}
},
{
"classification": "user-env:ubuntu",
"properties":
{
"env": "FOO=foo\nVAR=var\nexport U=N"
}
}
]
}
3. Configure job scheduling (optional)
If Core Hadoop was selected as the Cluster type in Step 1: Configure cluster, you can proceed with job scheduling settings.
During job scheduling configuration, select one of the following job types: Hive job, Spark job, or No configuration.
If Hive job or Spark job is selected, you can configure job scheduling. If No configuration is selected, job scheduling will not be set.
- When configuring job scheduling, click the [Refresh] icon to fetch Object Storage bucket information.
- Click Management page to navigate to the Object Storage service.
Configure Hive job scheduling
Configure scheduling for Hive jobs.
When selecting a bucket in Hive options, Storage object manager and Storage object creator can upload objects but cannot view objects in the bucket through the console. However, they can access objects using the Object Storage API.
- In Step 3: Configure job scheduling, select Hive job as the Job type.
- Enter the scheduling information for the Hive job.
Category | Description |
---|---|
Job type | Hive job: Execute Hive tasks after cluster creation. |
Execution file | Execution file type - File: Select an Object Storage bucket and register the execution file (only .hql files are supported). - Text: Write Hive queries directly for execution. |
Hive options | Enter option values to be passed to the job (Hive options for reference). - File: Select an Object Storage bucket and register a Hive option file. - Text: Write Hive option values directly. |
Job termination action | Select the action upon job termination. - Wait on failure: Only terminate the cluster if the job succeeds. - Always wait: Do not terminate the cluster regardless of success or failure. - Always terminate: Terminate the cluster on both success and failure. |
Save scheduling log file | Choose whether to save scheduling log files. - Do not save: Do not save scheduling logs. - Save to Object Storage: Save scheduling log files to the desired bucket. * Logs are stored in the selected bucket_name/log/ path in yyyy-mm-dd.log format. |
Hive options
Hive options refer to Hive configuration properties used during Hive job execution.
They can be written as follows. For detailed information about Hive configuration properties, refer to the official documentation.
--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m
Configure Spark job scheduling
Configure scheduling for Spark jobs.
- In Step 3: Configure job scheduling, select Spark job as the Job type.
- Enter the scheduling information for the Spark job.
Category | Description |
---|---|
Job type | Spark job: Execute Spark tasks after cluster creation. |
Execution file | Select an Object Storage bucket and register the execution file. - Only files with the .jar extension can be registered. |
Spark options (optional) | Write options to pass to the job. - Refer to Spark options. |
Arguments (optional) | Write arguments to pass to the .jar file being executed. - File: Select an Object Storage bucket and register the argument file. - Text: Write arguments directly. |
Deployment mode | Select the mode to run Spark. - Choose between client or cluster. |
Job termination action | Select the action upon job termination. - Wait on failure: Only terminate the cluster if the job succeeds. - Always wait: Do not terminate the cluster regardless of success or failure. - Always terminate: Terminate the cluster on both success and failure. |
Save scheduling log file | Choose whether to save scheduling log files. - Do not save: Do not save scheduling logs. - Save to Object Storage: Save scheduling log files to the desired bucket. * Logs are stored in the selected bucket_name/log/ path in yyyy-mm-dd.log format. |
Spark options
Spark options refer to configuration values passed when executing Spark job files using spark-submit.
They can be written as shown below, and for more details on configuration values, refer to the official documentation.
However, an error may occur if the --deploy-mode option is included. Since the deploy mode can be selected in the interface, use the Deployment mode setting there instead.
--class org.apache.spark.examples.SparkPi --master yarn
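Additional resource options can be appended on the same line. The line below extends the sample above with standard spark-submit options; the executor count and memory values are illustrative, and --deploy-mode is intentionally omitted as noted above.
--class org.apache.spark.examples.SparkPi --master yarn --num-executors 4 --executor-memory 2g --conf spark.network.timeout=800s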