Component list
The following explains how to use the components installed in a Hadoop Eco cluster.
Hive
Run the Hive CLI and enter queries.
hive
hive (default)> CREATE TABLE tbl (
> col1 STRING
> ) STORED AS ORC
> TBLPROPERTIES ("orc.compress"="SNAPPY");
OK
Time taken: 1.886 seconds
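As a quick check, you can insert a row into the new table and read it back. This is a minimal follow-up sketch, assuming the tbl table created above:
-- insert a single sample row, then query it
hive (default)> INSERT INTO tbl VALUES ('sample');
hive (default)> SELECT col1 FROM tbl;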
Hive - Zeppelin integration
- Attach a public IP to the master node where Zeppelin is installed, then access the Zeppelin UI. Refer to Zeppelin for access details.
- Run Hive in Zeppelin:
  - Click Notebook > Create new note from the top menu and select the hive interpreter in the popup.
  - Enter and execute the following commands in the Zeppelin notebook (a sketch of such a paragraph follows below).
Hive query results in Zeppelin
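The notebook contents are shown only in the screenshot above. As a minimal sketch, assuming the hive interpreter is bound to the note and the tbl table created earlier exists, a paragraph could look like this:
%hive
-- query the table created earlier through the Hive interpreter
SELECT col1 FROM tbl;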
Beeline
- Run the Beeline CLI.
- Choose between direct access to HiveServer2 or access via Zookeeper to connect to HiveServer2.
  - Direct access to HiveServer2
  - Access via Zookeeper
!connect jdbc:hive2://[server-name]:10000/default;
!connect jdbc:hive2://[master-1]:2181,[master-2]:2181,[master-3]:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kepbp_hiveserver2
- Enter the query.
  - Ensure that you enter the correct username before running a query. The default user for Hadoop Eco is ubuntu.
  - If executed as a user that cannot access HDFS, errors may occur during query execution.
Run query using Beeline CLI
######################################
# Direct access to HiveServer2
######################################
beeline> !connect jdbc:hive2://10.182.50.137:10000/default;
Connecting to jdbc:hive2://10.182.50.137:10000/default;
Enter username for jdbc:hive2://10.182.50.137:10000/default:
Enter password for jdbc:hive2://10.182.50.137:10000/default:
Connected to: Apache Hive (version 2.3.2)
Driver: Hive JDBC (version 2.3.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://10.182.50.137:10000/default> show databases;
+----------------+
| database_name |
+----------------+
| db_1 |
| default |
+----------------+
2 rows selected (0.198 seconds)
#####################################
# Access via Zookeeper
#####################################
beeline> !connect jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kepbp_hiveserver2
Connecting to jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kepbp_hiveserver2
Enter username for jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/:
Enter password for jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/:
22/09/06 05:40:52 [main]: INFO jdbc.HiveConnection: Connected to hadoopmst-hadoop-ha-1:10000
Connected to: Apache Hive (version 2.3.9)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoopmst-hadoop-ha-1:2181,ha> show databases;
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (2.164 seconds)
0: jdbc:hive2://hadoopmst-hadoop-ha-1:2181,ha>
#####################################
# User input (Enter username: )
#####################################
beeline> !connect jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default;
Connecting to jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default;
Enter username for jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default: ubuntu
Enter password for jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default:
Connected to: Apache Hive (version 2.3.9)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bigdata-hadoop-master-1.kep.k> create table t2 (`value` string,`product_num` int) partitioned by (manufacture_date string) STORED AS ORC;
No rows affected (0.527 seconds)
0: jdbc:hive2://bigdata-hadoop-master-1.kep.k> insert into t2 partition(manufacture_date='2019-01-01') values('asdf', '123');
No rows affected (26.56 seconds)
0: jdbc:hive2://bigdata-hadoop-master-1.kep.k>
Hue
Hue is a user interface provided for Hadoop Eco clusters.
Access it by clicking Quick Link on the cluster details page. Log in using the admin credentials set during cluster creation.
Hue provides a Hive editor and several browsers.
- Browser types: File browser, Table browser, Job browser
Hue access ports by cluster availability
Cluster availability | Access port |
---|---|
Standard (Single) | Port 8888 on Master Node 1 |
HA | Port 8888 on Master Node 3 |
Hue user login
Oozie
Oozie is a workflow management tool provided when the Hadoop Eco cluster type is Core Hadoop.
Oozie access ports by cluster availability
Cluster availability | Access port |
---|---|
Standard (Single) | Port 11000 on Master Node 1 |
HA | Port 11000 on Master Node 3 |
Oozie workflow list
Oozie workflow job information
Oozie workflow job details
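Besides the web UI, workflow jobs can also be listed from a terminal with the Oozie CLI. A minimal sketch, assuming the oozie client is on the PATH of the master node and the server runs on the port shown above (the URL is an assumption):
# list the five most recent workflow jobs
$ oozie jobs -oozie http://localhost:11000/oozie -len 5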
Zeppelin
Zeppelin is a user interface provided when the Hadoop Eco cluster type is Core Hadoop or Trino.
Access Zeppelin by clicking Quick Link on the cluster details page.
Zeppelin access ports by cluster availability
Cluster availability | Access port |
---|---|
Standard (Single) | Port 8180 on Master Node 1 |
HA | Port 8180 on Master Node 3 |
Zeppelin user interface
Interpreter
An interpreter is an environment that directly executes programming language source code.
Zeppelin supports interpreters for Spark, Hive, and Trino.
Using Zeppelin
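Within a note, the interpreter that runs a paragraph is selected by prefixing the paragraph with the interpreter name. A minimal sketch, assuming the interpreters configured in the cluster are named spark, hive, and trino:
%hive
-- runs through the Hive interpreter
SHOW DATABASES;

%spark
// runs through the Spark interpreter (Scala)
sc.version

%trino
-- runs through the Trino interpreter
SHOW CATALOGS;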
Spark
- Run the spark-shell CLI on the master node and enter the test code:
Run query using spark-shell CLI
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.2
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val count = sc.parallelize(1 to 90000000).filter { _ =>
| val x = math.random
| val y = math.random
| x*x + y*y < 1
| }.count()
count: Long = 70691442
- Run a Spark example file using the spark-submit command on the master node.
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/spark/examples/jars/spark-examples_*.jar 100
Access Spark History Server
- Attach a public IP to the master node based on the cluster's high availability and access the Spark History Server.
Cluster availability | Access port |
---|---|
Standard (Single) | Port 18082 on Master Node 1 |
HA | Port 18082 on Master Node 3 |
- Verify information in the History Server.
Spark History Server interface
Spark - Zeppelin integration
- Attach a public IP to the master node where Zeppelin is installed, then access the Zeppelin UI. Refer to Zeppelin for access details.
- Run Spark-shell in Zeppelin:
  - Click Notebook > Create new note from the top menu and select the spark interpreter in the popup.
  - Enter and execute the following commands in the Zeppelin notebook (examples of all three cases are sketched after this list).
Spark query results in Zeppelin
- Run PySpark in Zeppelin:
  - Click Notebook > Create new note from the top menu and select the spark interpreter in the popup.
  - Enter and execute the following commands in the Zeppelin notebook.
PySpark query results in Zeppelin
- Run Spark-submit in Zeppelin:
  - Click Notebook > Create new note from the top menu and select the spark interpreter in the popup.
  - Enter and execute the following commands in the Zeppelin notebook.
Spark-submit results in Zeppelin
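The notebook contents are shown only in the screenshots above. The following is a hedged sketch of what each type of paragraph might look like, assuming paragraphs prefixed with %spark and %pyspark are bound to the Spark interpreter and %sh to the shell interpreter:
%spark
// Scala paragraph: Monte Carlo estimate of pi, same idea as the spark-shell example
val n = 1000000
val count = sc.parallelize(1 to n).filter { _ =>
  val x = math.random
  val y = math.random
  x * x + y * y < 1
}.count()
println(4.0 * count / n)

%pyspark
# Python paragraph: count even numbers in a small RDD
print(sc.parallelize(range(100)).filter(lambda n: n % 2 == 0).count())

%sh
# Shell paragraph: submit the bundled SparkPi example to YARN
spark-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/spark/examples/jars/spark-examples_*.jar 100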
Tez
Tez is a component that performs distributed data processing for Hive, replacing the MapReduce (MR) engine on the YARN framework.
You can verify Tez application information via the Tez Web UI.
- To access the Tez Web UI, attach a public IP to either Master Node 1 or Master Node 3 based on the cluster's high availability.
- Add the hostname and public IP of Master Node 1 or 3 to your /etc/hosts file (a sample entry is shown below the table). Once all configurations are complete, access it as follows:
Cluster availability | Access port |
---|---|
Standard (Single) | Port 9999 on Master Node 1 |
HA | Port 9999 on Master Node 3 |
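A sample /etc/hosts entry, assuming a hypothetical public IP and one of the master node hostnames used elsewhere in this guide:
# /etc/hosts — map the master node's public IP to its hostname (values are placeholders)
203.0.113.10 hadoopmst-hadoop-ha-1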
Tez Web UI interface
Trino
- Run the Trino CLI. Enter the server information for the node where the Trino Coordinator is installed.
Cluster Availability | Connection Port |
---|---|
Standard (Single) | Port 8780 of Master Node 1 |
HA | Port 8780 of Master Node 3 |
- Enter the query.
Running a Query Using Trino CLI
$ trino --server http://hadoopmst-trino-ha-3:8780
trino> SHOW CATALOGS;
Catalog
---------
hive
system
(2 rows)
Query 20220701_064104_00014_9rp8f, FINISHED, 2 nodes
Splits: 12 total, 12 done (100.00%)
0.23 [0 rows, 0B] [0 rows/s, 0B/s]
trino> SHOW SCHEMAS FROM hive;
Schema
--------------------
default
information_schema
(2 rows)
Query 20220701_064108_00015_9rp8f, FINISHED, 3 nodes
Splits: 12 total, 12 done (100.00%)
0.23 [2 rows, 35B] [8 rows/s, 155B/s]
trino> select * from hive.default.t1;
col1
------
a
b
c
(3 rows)
Query 20220701_064113_00016_9rp8f, FINISHED, 1 node
Splits: 5 total, 5 done (100.00%)
0.23 [3 rows, 16B] [13 rows/s, 71B/s]
Access Trino Web UI
- Access the Trino Web UI via port 8780 of the server where the Trino Coordinator is installed.
Cluster Availability | Connection Port |
---|---|
Standard (Single) | Port 8780 of Master Node 1 |
HA | Port 8780 of Master Node 3 |
- Check the Trino query history and statistics information.
Trino Web Screen
Integrate Trino with Zeppelin
- Select the interpreter as Trino in Zeppelin.
Trino-Zeppelin Integration
- Enter the query you want to run in Trino, and you will be able to see the results (a sketch follows below).
Trino-Zeppelin Query Results
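As a minimal sketch, assuming the Trino interpreter is registered under the name trino, a notebook paragraph could be:
%trino
-- list schemas in the hive catalog through the Trino interpreter
SHOW SCHEMAS FROM hive;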
Use Kafka
Apache Kafka is a large-scale real-time data streaming platform provided when the Hadoop Eco cluster type is Dataflow.
Kafka installed on the Hadoop Eco cluster can be executed via commands in the terminal. For more detailed information beyond the example below, please refer to the official Kafka documentation.
- Create a Topic
Create a Topic
$ /opt/kafka/bin/kafka-topics.sh --create --topic my-topic --bootstrap-server $(hostname):9092
- Write events to the topic
Write events to the topic
$ echo '{"time":'$(date +%s)', "id": 1, "msg": "1st event"}' | /opt/kafka/bin/kafka-console-producer.sh --topic my-topic --bootstrap-server $(hostname):9092
$ echo '{"time":'$(date +%s)', "id": 2, "msg": "2nd event"}' | /opt/kafka/bin/kafka-console-producer.sh --topic my-topic --bootstrap-server $(hostname):9092
$ echo '{"time":'$(date +%s)', "id": 3, "msg": "3rd event"}' | /opt/kafka/bin/kafka-console-producer.sh --topic my-topic --bootstrap-server $(hostname):9092
- Check events
Check events
$ /opt/kafka/bin/kafka-console-consumer.sh --topic my-topic --from-beginning --bootstrap-server $(hostname):9092
{"time":1692604787, "id": 1, "msg": "1st event"}
{"time":1692604792, "id": 2, "msg": "2nd event"}
{"time":1692604796, "id": 3, "msg": "3rd event"}
Use Druid
Apache Druid is a real-time analytics database designed for fast, slice-and-dice analysis of large datasets; it is provided when the Hadoop Eco cluster type is Dataflow. Druid supports various types of data sources, making it suitable for building data pipelines.
Access the UI provided by the router via the Druid quick link on the cluster detail page.
The Druid UI connection ports for each Hadoop Eco cluster type are as follows:
Druid UI Connection Ports by Cluster Type
Cluster Availability | Connection Port |
---|---|
Standard (Single) | Port 3008 of Master Node 1 |
HA | Port 3008 of Master Node 3 |
Explore Druid UI
- Ingestion Task List
- Data Source List
- Query
Load data from Kafka
- In the Druid UI, select Streaming under the Load data tab. Choose Apache Kafka from the available data sources.
- In Bootstrap servers, enter the host and port information of the Kafka cluster you want to connect to in the format <hostname>:<port>. Enter the name of the topic you want to load in the Topic field.
- After selecting the topic, choose the appropriate data format and review the spec on the last page before submitting the ingestion (a sketch of such a spec follows this list).
- On the Ingestion page, check the status of the supervisor you created. Click the magnifying glass icon in the Actions field to view detailed information such as logs and payload.
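The wizard assembles and submits a Kafka supervisor spec for you. The sketch below shows roughly what a minimal spec for the example events produced earlier might look like; the column names, bootstrap server, and the router endpoint used for submission are assumptions and should be adjusted to your data:
# kafka-supervisor.json — hypothetical minimal spec for the JSON events from the Kafka example above
$ cat > kafka-supervisor.json <<'EOF'
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "my-topic",
      "consumerProperties": { "bootstrap.servers": "<kafka-broker-host>:9092" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": { "type": "kafka" },
    "dataSchema": {
      "dataSource": "my-topic",
      "timestampSpec": { "column": "time", "format": "posix" },
      "dimensionsSpec": { "dimensions": ["id", "msg"] },
      "granularitySpec": { "segmentGranularity": "hour", "queryGranularity": "none", "rollup": false }
    }
  }
}
EOF
# Submitting through the router is an alternative to the wizard (assumes the router proxies the Overlord API)
$ curl -X POST -H 'Content-Type: application/json' -d @kafka-supervisor.json http://<druid-router-host>:3008/druid/indexer/v1/supervisor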
Superset
Apache Superset is a data visualization and exploration platform provided when the Hadoop Eco cluster type is Dataflow. Superset integrates with various data sources (MySQL, SQLite, Hive, Druid, ...) and provides a variety of analytical tools such as query execution, graph creation, and dashboards.
Access the UI provided by the router via the Superset link on the cluster detail page.
The Superset UI connection ports for each Hadoop Eco cluster type are as follows:
Superset UI Connection Ports by Cluster Type
Cluster Availability | Connection Port |
---|---|
Standard (Single) | Port 4000 of Master Node 1 |
HA | Port 4000 of Master Node 3 |
Explore Superset UI
- Dashboard List
- Dashboard: You can create and save a dashboard, and get a link to share or save it as an image.
- Chart: You can create and save charts using physical datasets registered in the connected databases or virtual datasets saved through queries.
- Query Lab: You can run queries on the data in the connected databases and view the results.
- Data: You can manage connected databases, created datasets, saved queries, and the list of queries that have been executed.
Run superset commands
The commands provided by Superset can be executed in the terminal in the following order:
# Set environment variables
$ export $(cat /opt/superset/.env)
# Example of running a Superset command
$ /opt/superset/superset-venv/bin/superset --help
Loaded your LOCAL configuration at [/opt/superset/superset_config.py]
Usage: superset [OPTIONS] COMMAND [ARGS]...
This is a management script for the Superset application.
Options:
--version Show the flask version
--help Show this message and exit.
Commands:
compute-thumbnails Compute thumbnails
db Perform database migrations.
export-dashboards Export dashboards to ZIP file
export-datasources Export datasources to ZIP file
fab FAB flask group commands
import-dashboards Import dashboards from ZIP file
import-datasources Import datasources from ZIP file
import-directory Imports configs from a given directory
init Inits the Superset application
load-examples Loads a set of Slices and Dashboards and a...
load-test-users Loads admin, alpha, and gamma user for...
re-encrypt-secrets
routes Show the routes for the app.
run Run a development server.
set-database-uri Updates a database connection URI
shell Run a shell in the app context.
superset This is a management script for the Superset...
sync-tags Rebuilds special tags (owner, type, favorited...
update-api-docs Regenerate the openapi.json file in docs
update-datasources-cache Refresh sqllab datasources cache
version Prints the current version number
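For example, after setting the environment variables above, the version subcommand listed in the help output can be run from the same virtual environment (a minimal usage sketch):
# print the installed Superset version
$ /opt/superset/superset-venv/bin/superset version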
Flink on yarn session
Apache Flink is an open-source, unified stream processing and batch processing framework developed by the Apache Software Foundation. Here’s how to run Flink in Yarn session mode.
Run flink
When running Flink in session mode, you can adjust the resources Flink can use by setting options.
Run Option | Description |
---|---|
-jm | Job manager memory size |
-tm | Task manager memory size |
-s | Number of task slots per task manager |
-n | Number of task managers |
-nm | Application name |
-d | Background mode |
yarn-session.sh \
-jm 2048 \
-tm 2048 \
-s 2 \
-n 3 \
-nm yarn-session-jobs
Flink interface
When Flink is running, it executes on YARN nodes and the Web UI starts; you can access it at the address shown in the startup logs.
2022-07-07 23:15:33,775 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
2022-07-07 23:15:33,800 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/leader/rest_server/connection_info'}.
JobManager Web Interface: http://hadoopwrk-logan-표준(Single)-2:8082
Flink Web
Run flink job
After starting Flink in YARN session mode, you can submit user jobs and check the results in the Web UI.
Flink Job Execution
$ flink run /opt/flink/examples/batch/WordCount.jar
Executing WordCount example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 43aee6d7d947b1ce848e01c094801ab4
Program execution finished
Job with JobID 43aee6d7d947b1ce848e01c094801ab4 has finished.
Job Runtime: 7983 ms
Accumulator Results:
- 4e3b3f0ae1b13d861a25c3798bc15c04 (java.util.ArrayList) [170 elements]
(a,5)
(action,1)
(after,1)
(against,1)
Kakao HBase Tools
HBase Tools is an open-source project developed by Kakao that provides utilities for managing, monitoring, and backing up HBase tables.
For more details, please refer to the Kakao Tech Blog.
Category | Description |
---|---|
hbase-manager Module | Provides region batch management, split, merge, and major compaction - Region Assignment Management - Advanced Split - Advanced Merge - Advanced Major Compaction |
hbase-table-stat Module | Performance monitoring - Table Metrics Monitoring |
hbase-snapshot Module | Backup and restore data stored in HBase - Table Snapshot Management |
Run kakao hbase tools
Provide Zookeeper host information to fetch HBase data.
# hbase-manager
java -jar /opt/hbase/lib/tools/hbase-manager-1.2-1.5.7.jar <command> <zookeeper host name> <table name>
# hbase-snapshot
java -jar /opt/hbase/lib/tools/hbase-snapshot-1.2-1.5.7.jar <zookeeper host name> <table name>
# hbase-table-stat
java -jar /opt/hbase/lib/tools/hbase-table-stat-1.2-1.5.7.jar <zookeeper host name> <table name>
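For example, to watch the metrics of a hypothetical table my_table, pass one of the Zookeeper quorum hosts and the table name (both values here are placeholders):
# monitor table metrics for my_table via the Zookeeper host on master node 1
$ java -jar /opt/hbase/lib/tools/hbase-table-stat-1.2-1.5.7.jar hadoopmst-hadoop-ha-1 my_table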
hbase-manager
An admin tool for HBase management.
hbase-manager
hbase-table-stat
A tool for checking the current usage status of a table.
hbase-table-stat
hbase-snapshot
A tool for creating table data snapshots.
hbase-snapshot