Create cluster
A cluster is a set of nodes provisioned using Virtual Machines.
The process to create a cluster in the Hadoop Eco service is as follows.
It takes approximately 20-25 minutes to create a cluster.
Step 1. Configure cluster
Set the basic information for creating a Hadoop cluster.
- Go to the KakaoCloud console > Analytics > Hadoop Eco menu.
- In the Cluster menu, click the [Create cluster] button located at the top right.
- In Step 1: Configure cluster, enter the information and click the [Next] button.
| Item | Category | Description |
|---|---|---|
| Cluster name | | Example: my-cluster-01 - Cluster names must be unique within the same project - VMs are created based on the cluster name - Master node: created in the format HadoopMST-{ClusterName}-{Number} - Worker node: created in the format HadoopWRK-{ClusterName}-{Number} |
| Cluster configuration | Cluster version | Select the cluster version |
| | Cluster type | Select the cluster type based on the cluster version - For a detailed explanation, refer to Cluster version and type |
| | Cluster availability | Provides Standard and High availability types for operational stability - Standard (Single, 1 master node instance): Resource Manager and NameNode run in 1 instance - Creates a single master node, suitable for small-scale tasks - High availability (HA) (3 master node instances): Resource Manager and NameNode run in HA mode - Creates 3 master nodes, allowing uninterrupted tasks even during reboots |
| Administrator settings | Admin ID | Enter the admin ID |
| | Admin password | Enter the admin password - For details on resetting the password, refer to Hue password reset - When Ranger is applied, a specific password creation rule must be followed; for details, refer to Ranger application |
| | Confirm admin password | Enter the same admin password |
| VPC settings | | Select a VPC and subnet - Click Management page to go to VPC - A public IP accessible from external sources can be assigned after instance creation in the Assign public IP menu |
| Security group | | Create new security group: enter a name to create a new security group (inbound/outbound rules for Hadoop Eco are set automatically) - Select an existing security group: check its inbound/outbound rules - Click the [Refresh] icon to fetch network information - For a detailed explanation, refer to Security group |
Configure security group
To install a cluster in the Hadoop Eco service, you must configure the ports used by its components in the security group; certain ports must be open for the components to be set up. You can check the policies applied to instances in the list of applied policies.
- When a security group is automatically created, its default name is set based on the creation date, cluster version, and type.
  - Example security group name: {cluster_name}-HDE-{version}-{type}
- Only one security group can be set when creating an instance; additional security groups can be configured after the instance is created.

A security group is automatically created during cluster creation. The port rules set in the automatically created security group are as follows.

Inbound rules

Protocol | Source | Port number | Policy description |
---|---|---|---|
ALL | VPC subnet CIDR | ALL | Hadoop Eco internal |

Outbound rules

Outbound traffic is set to 'Allow all'.
Cluster version and type
| Cluster version | Cluster type | Options |
|---|---|---|
| Hadoop Eco 1.0.1 (Trino and Dataflow types not supported) | Core Hadoop | Core Hadoop (Hadoop 2.10.1 HDFS, YARN, Spark 2.4.6) - Apache Spark and Hive, which can be used with Hadoop, are installed together |
| | HBase | HBase (Hadoop 2.10.1, HBase 1.4.13) - Apache HBase, a distributed database based on Hadoop, is installed together |
| Hadoop Eco 1.1.2 | Core Hadoop | Core Hadoop (Hadoop 2.10.2 HDFS, YARN, Spark 2.4.8) |
| | HBase | HBase (Hadoop 2.10.2, HDFS, YARN, HBase 1.7.1) |
| | Trino | Trino (Hadoop 2.10.2, HDFS, YARN, Trino 377) |
| | Dataflow | Dataflow (Hadoop 2.10.2, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
| Hadoop Eco 2.0.1 | Core Hadoop | Core Hadoop (Hadoop 3.3.4 HDFS, YARN, Spark 3.2.2) |
| | HBase | HBase (Hadoop 3.3.4 HDFS, YARN, HBase 2.4.13) |
| | Trino | Trino (Hadoop 3.3.4, HDFS, YARN, Trino 393) |
| | Dataflow | Dataflow (Hadoop 3.3.4, HDFS, YARN, Kafka 3.4.0, Druid 25.0.0, Superset 2.1.1) |
| Hadoop Eco 2.1.0 | Core Hadoop | Core Hadoop (Hadoop 3.3.6 HDFS, YARN, Spark 3.5.2) |
| | HBase | HBase (Hadoop 3.3.6 HDFS, YARN, HBase 2.6.0) |
| | Trino | Trino (Hadoop 3.3.6, HDFS, YARN, Trino 393) |
| | Dataflow | Dataflow (Hadoop 3.3.6, HDFS, YARN, Kafka 3.8.0, Druid 25.0.0, Superset 2.1.1) |
Hadoop Eco versions 1.0.0, 1.1.0, 1.1.1, and 2.0.0 are not supported.
Step 2. Configure instance
Configure the master and worker instances, storage, and network.
In Step 2: Configure instance, enter the information and click the [Next] button.
- After creating the cluster, instance and disk volume configurations cannot be changed.
- Adding master/worker instances and disk volumes will be supported in the future.
| Category | Details | Description |
|---|---|---|
| Master node settings | Master node instance count | Fixed based on cluster availability - Standard (Single) type: 1 instance - HA type: 3 instances |
| | Master node instance type | Choose from the supported instance types - Hardware configuration depends on the selected instance type |
| | Disk volume type/size | - Volume type: currently only the SSD type is supported (other types will be supported in the future) - Volume size: 50 ~ 5,120 GB |
| Worker node settings | Worker node instance count | Choose the number of instances based on purpose; the total number is limited by the project's quota |
| | Worker node instance type | Choose from the supported instance types - Hardware configuration depends on the selected instance type |
| | Disk volume type/size | - Volume type: currently only the SSD type is supported (other types will be supported in the future) - Volume size: 50 ~ 5,120 GB |
| Total YARN usage | YARN Core | Result of 'number of worker nodes x vCPU count per node' (see the worked example below this table) |
| | YARN Memory | Result of 'number of worker nodes x memory size per node x YARN allocation ratio (0.8)' |
| Key pair | | Choose a key pair to apply to the instances - Select an existing KakaoCloud key pair or create a new one - To create a new key pair, refer to Create new key pair - Click Management Page to navigate to Virtual Machine > Key pairs |
| User script (optional) | | A script that runs as user data to automatically configure the environment when the instance starts |
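The Total YARN usage figures are simple products of the worker node count and each node's resources. For example, with 3 worker nodes of a hypothetical 8 vCPU / 64 GB instance type (illustrative values only, not a specific KakaoCloud offering), YARN Core = 3 x 8 = 24 cores and YARN Memory = 3 x 64 GB x 0.8 = 153.6 GB.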
Create new key pair
To create a key pair during cluster creation, follow these steps:
- Select Create new key pair and enter the key pair name.
- Click the [Create and download key pair] button.
- A private key file with a `.pem` extension is downloaded under the entered key pair name.

Please keep the downloaded private key file in a safe place.
Step 3. Configure detailed settings (optional)
1. Configure service integration (optional)
Apply settings for cluster service integration. You can configure integration with the Data Catalog service provided by KakaoCloud.
If service integration is not configured, the Standard (Single) type installs MySQL on master node 1, and the HA type installs MySQL on master node 3, to be used as the metastore.
In Service integration settings, choose whether to install the monitoring agent and configure service integration:
Category | Description |
---|---|
Monitoring agent installation | Select whether to install the monitoring agent |
Service integration | Choose one of: Do not integrate, Data Catalog integration, External Hive Metastore integration, or MemStore integration |
Install Monitoring Agent
When the monitoring agent is installed, additional node monitoring can be viewed under the Hadoop Eco > Cluster Details page > Monitoring tab.
- CPU usage per node (%)
- Memory usage per node (%)
Integrate Data Catalog
- To integrate with Data Catalog, prepare a Data Catalog created in advance. For more details on creating a catalog, refer to Create catalog.
- To integrate with Data Catalog, select Data Catalog integration in Service integration settings (optional).
- In the Data Catalog integration section, check the Hadoop network/subnet information and select the desired catalog.
Integrate external Hive metastore
- To integrate with an external Hive metastore, create MySQL. For more details on creating MySQL, refer to Create MySQL instance group.
- To integrate with MySQL, select External Hive Metastore integration in Service integration settings (optional).
- In the service integration section, select the instance where MySQL is installed.
- After selecting the instance, enter the MySQL database name, MySQL ID, and password.
Integrate MemStore
MemStore integration is only available for Hadoop Eco - Dataflow types. The MemStore integration button will not be displayed for other cluster types.
- To integrate with MemStore, create MemStore. For more details on creating MemStore, refer to Create MemStore cluster.
- To integrate with MemStore, select MemStore integration in Service integration settings (optional).
- In the MemStore name field, select the MemStore to integrate.
- Depending on whether MemStore cluster mode is used, fields for Superset Cache DB ID and Superset Query Cache DB ID will appear.
  - When cluster mode is used: no additional input fields.
  - When cluster mode is not used: you can set the Superset Cache DB ID and Superset Query Cache DB ID; if not set, they are automatically configured to 0 and 1.
2. Configure cluster details (optional)
You can configure the HDFS block size, replication factor, and other cluster settings. The HDFS settings take precedence over the cluster configuration settings.
Category | Description |
---|---|
HDFS settings | HDFS Block Size - Set the dfs.blocksize value in hdfs-site.xml - Block size can be set between 1 and 1,024 MB (default: 128 MB) HDFS Replication Factor - Set the dfs.replication value in hdfs-site.xml - Replication factor can be set between 1 and 500 - The replication factor must not exceed the number of worker node instances |
Cluster configuration settings (optional) | Enter the settings for the components that configure the cluster - Upload a JSON file or enter the settings manually - For details on Object Storage integration, refer to Integrate with Object Storage |
Log storage settings | Select whether to use Log Storage Settings |
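For reference, the same HDFS properties can also be expressed through the cluster configuration settings described in the next section. The snippet below is a minimal, illustrative sketch only: the `hdfs-site` classification and `dfs.blocksize` appear elsewhere in this guide, while `dfs.replication` is assumed here as the standard Hadoop property name for the replication factor. Values entered in the dedicated HDFS settings fields above take precedence over such overrides.

```json title="Illustrative hdfs-site override"
{
  "configurations": [
    {
      "classification": "hdfs-site",
      "properties": {
        "dfs.blocksize": "134217728",
        "dfs.replication": "2"
      }
    }
  ]
}
```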
Configure cluster - Component settings
Cluster configuration settings are made with a JSON file in key-value pair format. The `configurations` list contains `classification` for the file name and `properties` for the setting names. The basic input format is as follows:
```json title="Input method"
{
  "configurations": [
    {
      "classification": "filename",
      "properties": {
        "setting_name": "setting_value"
      }
    }
  ]
}
```
```json title="Example"
{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "dfs.blocksize": "67108864"
      }
    }
  ]
}
```
You can categorize the setting file name into `env`, `xml`, and `properties` formats, plus the special `user-env` classification.
Format | Description |
---|---|
env | The value of each entered key is written into the configuration file; keys have fixed, predefined names, so modify the key that corresponds to the value you want to change. |
xml | The key-value pair is transformed into XML elements, with the key as <name> and the value as <value> . |
properties | The key-value pair is transformed into the defined format based on the name of the key. |
user-env | Creates a user environment variable with <username> . |
Based on the classification you enter, your settings are added to the configuration file generated at the corresponding location below.
#### env \{#env}
| Classification | Location | Setting | Sample Value |
|---|---|---|---|
| hadoop-env | /etc/hadoop/conf | hadoop_env_hadoop_heapsize | 2048 |
| | | hadoop_env_hadoop_namenode_heapsize | "-Xmx2048m" |
| | | hadoop_env_hadoop_jobtracker_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_tasktracker_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_shared_hadoop_namenode_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_datanode_heapsize | "-Xmx1024m" |
| | | hadoop_env_hadoop_zkfc_opts | "-Xmx1024m" |
| | | hadoop_env_hadoop_log_level | INFO, DRFA, console |
| | | hadoop_env_hadoop_security_log_level | INFO, DRFAS |
| | | hadoop_env_hadoop_audit_log_level | INFO, RFAAUDIT |
| mapred-env | /etc/hadoop/conf | mapred_env_hadoop_job_historyserver_heapsize | 2000 |
| hive-env | /etc/hive/conf | hive_env_hive_metastore_heapsize | 2048 |
| | | hive_env_hiveserver2_heapsize | 2048 |
| | | hive_env_hadoop_client_opts | "-Xmx2048m" |
| hbase-env | /etc/hbase/conf | hbase_env_hbase_master_heapsize | "-Xmx2048m" |
| | | hbase_env_hbase_regionserver_heapsize | "-Xmx2048m" |
| spark-defaults | /etc/spark/conf | spark_defaults_spark_driver_memory | 2g |
| trino-config | /etc/trino/conf | trino_jvm_config_heap | -Xmx10G |
#### xml \{#xml}
| Classification | Location | Reference Location | Notes |
| -------------------- | ----------------- | ------------------------------------------------------------------------------- | ------ |
| core-site | /etc/hadoop/conf | [core-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-common/core-default.xml) | |
| hdfs-site | /etc/hadoop/conf | [hdfs-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml) | |
| httpfs-site | /etc/hadoop/conf | [ServerSetup.html](https://hadoop.apache.org/docs/r2.10.1/hadoop-hdfs-httpfs/ServerSetup.html) | |
| mapred-site | /etc/hadoop/conf | [mapred-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) | |
| yarn-site | /etc/hadoop/conf | [yarn-default.xml](https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml) | |
| capacity-scheduler | /etc/hadoop/conf | [CapacityScheduler.html](https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) | yarn scheduler setup |
| tez-site | /etc/tez/conf | [TezConfiguration.html](https://tez.apache.org/releases/0.9.2/tez-api-javadocs/configs/TezConfiguration.html) | |
| hive-site | /etc/hive/conf | [Hive configuration properties](https://cwiki.apache.org/confluence/display/hive/configuration+properties) | |
| hiveserver2-site | /etc/hive/conf | [Setting up hiveserver2](https://cwiki.apache.org/confluence/display/hive/setting+up+hiveserver2) | hiveserver2-specific settings |
#### properties \{#properties}
| Classification | Location | Description |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| spark-defaults | /etc/spark/conf | Spark configuration values are entered as key[tab]value pairs. |
#### user-env \{#user-env}
| Classification | Location | Description |
| -------------------- | --------------- | ----------------------------------------------------------------------- |
| user-env:profile | /etc/profile | Add global variables |
| user-env:[username] | ~/.bashrc | Add environment variables to user's bashrc |
##### xml format \{#xml-format}
```xml title="xml format"
<configuration>
<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>41000-43000</value>
</property>
</configuration>
```

##### env format \{#env-format}
```bash title="env format"
...
export HADOOP_HEAPSIZE="3001"
...
```

##### properties format \{#properties-format}
```properties title="properties format"
spark.driver.memory 4000M
spark.network.timeout 800
```

##### user-env format \{#user-env-format}
```json title="user-env format"
{
  "configurations": [
    {
      "classification": "user-env:profile",
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}
```
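In this user-env example, the user-env:profile entry adds the variables globally via /etc/profile, while the user-env:ubuntu entry adds them to the ubuntu user's ~/.bashrc, as described in the user-env table above.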
##### Sample example \{#sample-example}
```json title="Sample example"
{
  "configurations": [
    {
      "classification": "mapred-site", -- xml format
      "properties": {
        "yarn.app.mapreduce.am.job.client.port-range": "41000-43000"
      }
    },
    {
      "classification": "hadoop-env", -- env format
      "properties": {
        "hadoop_env_hadoop_heapsize": 3001,
        "hadoop_env_hadoop_namenode_heapsize": "-Xmx3002m"
      }
    },
    {
      "classification": "spark-defaults", -- properties format
      "properties": {
        "spark.driver.memory": "4000M",
        "spark.network.timeout": "800s"
      }
    },
    {
      "classification": "user-env:profile", -- user-env format
      "properties": {
        "env": "FOO=profile\nVAR=profile\nexport U=profile"
      }
    },
    {
      "classification": "user-env:ubuntu",
      "properties": {
        "env": "FOO=foo\nVAR=var\nexport U=N"
      }
    }
  ]
}
```
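This sample combines the four formats: the mapred-site entry is rendered as XML name/value elements, the hadoop-env entries become environment variable settings, the spark-defaults entries are written as key[tab]value pairs, and the user-env entries add environment variables to /etc/profile and to the ubuntu user's ~/.bashrc. The `-- xml format` style annotations are explanatory labels only; remove them if you paste this JSON directly, since JSON does not support comments.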
Log storage settings
When using log storage settings, you need to configure the log storage path.
- Set the log storage settings to use.
- Choose the Object Storage bucket for log storage.
- After checking the path, enter a new path if you wish to modify it.

Deleting the stored logs may cause the Spark History Server to malfunction.
3. Configure job scheduling (optional)
If you select Core Hadoop as the cluster type in Step 1: Configure cluster, proceed to configure job scheduling.
When configuring job scheduling, choose from Hive job, Spark job, or None. If you select Hive job or Spark job, you can configure the job scheduling. If you choose None, no job scheduling will be configured.
- When configuring job scheduling, click the [refresh] icon to fetch Object Storage bucket information.
- Click Management Page to navigate to the Object Storage service.
Configure Hive job scheduling
Configure the scheduling for Hive jobs.
When selecting a bucket for Hive options, users with the Storage Object Manager or Storage Object Creator role can upload objects but cannot view objects in the Object Storage bucket from the console. However, the objects can still be accessed through the Object Storage API.
- In Step 3: Job Scheduling Configuration, select Hive job as the job type.
- Enter the scheduling information for the Hive job.

| Category | Description |
|---|---|
| Job type | Hive job: Execute the Hive job after cluster creation |
| Execution file | Execution file type - File: Select a file to execute from an Object Storage bucket; only `.hql` files are allowed - Text: Write the Hive queries to execute |
| Hive options | Provide options for the job (refer to Hive options below) - File: Select an option file from Object Storage - Text: Write the Hive option values for the job |
| Job completion action | Select the action when the job finishes - Wait on failure: Only shut down the cluster if the job succeeds - Always wait: Do not shut down the cluster regardless of job success or failure - Always shutdown: Shut down the cluster regardless of job success or failure |
| Scheduling log file storage | Choose whether to save scheduling log files - Do not save: Do not save scheduling logs - Save to Object Storage: Save logs to a selected bucket; logs are stored in bucket-name/log/ in `yyyy-mm-dd.log` format |
Hive options
Hive options refer to Hive configuration properties used when executing a Hive job.
These can be written as follows, and further details about Hive configuration properties can be found in the official documentation.
```bash
--hiveconf hive.tez.container.size=2048 --hiveconf hive.tez.java.opts=-Xmx1600m
```
Configure Spark job scheduling
Configure the scheduling for Spark jobs.
- In Step 3: Job Scheduling Configuration, select Spark job as the job type.
- Enter the scheduling information for the Spark job.

| Category | Description |
|---|---|
| Job type | Spark job: Execute the Spark job after cluster creation |
| Execution file | Select a file to execute from an Object Storage bucket - Only `.jar` files are allowed |
| Spark options (optional) | Provide options for the job |
| Arguments (optional) | Provide arguments to be passed to the `.jar` file - File: Select an argument file from Object Storage - Text: Write the arguments for the job |
| Deployment mode | Choose the mode to run Spark - Choose between client and cluster |
| Job completion action | Select the action when the job finishes - Wait on failure: Only shut down the cluster if the job succeeds - Always wait: Do not shut down the cluster regardless of job success or failure - Always shutdown: Shut down the cluster regardless of job success or failure |
| Scheduling log file storage | Choose whether to save scheduling log files - Do not save: Do not save scheduling logs - Save to Object Storage: Save logs to a selected bucket; logs are stored in bucket-name/log/ in `yyyy-mm-dd.log` format |
Spark options
Spark options refer to configuration settings passed to `spark-submit` when executing the Spark job file.
They can be written as follows, and further details can be found in the official documentation.
Note that including the `--deploy-mode` argument may cause errors; since the deploy mode can be selected in the UI, use the UI option to select it.
```bash
--class org.apache.spark.examples.SparkPi --master yarn
```
4. Configure security details (optional)
Apply cluster security features using Kerberos and Ranger.
Category | Description |
---|---|
Kerberos | To install, select Install and enter the following information: - Kerberos Realm Name: only uppercase English letters, numbers, and dots (.) are allowed (1-50 characters) - KDC (Key Distribution Center) Password: automatically set to the admin password entered in the cluster configuration step |
Ranger | To install, select Install. - Ranger Password: automatically set to the admin password entered in the cluster configuration step |