
Component list

The following explains how to use the components installed in a Hadoop Eco cluster.

Hive

Run the Hive CLI and enter queries.

Run query using Hive CLI
hive
hive (default)> CREATE TABLE tbl (
> col1 STRING
> ) STORED AS ORC
> TBLPROPERTIES ("orc.compress"="SNAPPY");
OK
Time taken: 1.886 seconds
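
To verify the table, you can insert a row and read it back in the same session. This is a minimal follow-up sketch (query output omitted):

Insert and query the table using Hive CLI
hive (default)> INSERT INTO tbl VALUES ('a');
hive (default)> SELECT * FROM tbl;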

Hive - integrate with Zeppelin

  1. Attach a public IP to the master node where Zeppelin is installed, then access the Zeppelin UI.
    For how to access Zeppelin, refer to Zeppelin.

  2. How to run Hive in Zeppelin

    • Click Notebook > Create new note from the top menu and select the Hive interpreter in the popup.
    • You can enter the following in the Zeppelin notebook and run it.

    Image
    Check Hive-Zeppelin query result
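
    For reference, a Hive paragraph in the note might look like the following. This is a minimal sketch and assumes the Hive interpreter is exposed as %hive in this deployment:

    Hive query in a Zeppelin paragraph
    %hive
    SHOW DATABASES;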

Beeline

  1. Run the Beeline CLI.

  2. Connect to HiveServer2 using either direct access or Zookeeper service discovery (both forms are shown in the examples below).

    !connect jdbc:hive2://[server name]:10000/default;
    !connect jdbc:hive2://[zookeeper host 1]:2181,[zookeeper host 2]:2181,[zookeeper host 3]:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=[namespace]
  3. Enter the query.

    • Before executing the query, be sure to enter the correct username. The default user in Hadoop Eco is ubuntu.

    • If the query is executed as a user who cannot access HDFS, an error may occur.

      Run query using Beeline CLI
      ######################################
      # Direct access to HiveServer2
      ######################################
      beeline> !connect jdbc:hive2://10.182.50.137:10000/default;
      Connecting to jdbc:hive2://10.182.50.137:10000/default;
      Enter username for jdbc:hive2://10.182.50.137:10000/default:
      Enter password for jdbc:hive2://10.182.50.137:10000/default:
      Connected to: Apache Hive (version 2.3.2)
      Driver: Hive JDBC (version 2.3.2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      0: jdbc:hive2://10.182.50.137:10000/default> show databases;
      +----------------+
      | database_name |
      +----------------+
      | db_1 |
      | default |
      +----------------+
      2 rows selected (0.198 seconds)

      #####################################
      # Access via Zookeeper
      #####################################
      beeline> !connect jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kepbp_hiveserver2
      Connecting to jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kepbp_hiveserver2
      Enter username for jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/:
      Enter password for jdbc:hive2://hadoopmst-hadoop-ha-1:2181,hadoopmst-hadoop-ha-2:2181,hadoopmst-hadoop-ha-3:2181/:
      22/09/06 05:40:52 [main]: INFO jdbc.HiveConnection: Connected to hadoopmst-hadoop-ha-1:10000
      Connected to: Apache Hive (version 2.3.9)
      Driver: Hive JDBC (version 2.3.9)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      0: jdbc:hive2://hadoopmst-hadoop-ha-1:2181,ha> show databases;
      +----------------+
      | database_name |
      +----------------+
      | default |
      +----------------+
      1 row selected (2.164 seconds)
      0: jdbc:hive2://hadoopmst-hadoop-ha-1:2181,ha>

      #####################################
      # User input (Enter username: )
      #####################################
      beeline> !connect jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default;
      Connecting to jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default;
      Enter username for jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default: ubuntu
      Enter password for jdbc:hive2://bigdata-hadoop-master-1.kep.k9d.in:10000/default:
      Connected to: Apache Hive (version 2.3.9)
      Driver: Hive JDBC (version 2.3.9)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      0: jdbc:hive2://bigdata-hadoop-master-1.kep.k> create table t2 (`value` string,`product_num` int) partitioned by (manufacture_date string) STORED AS ORC;
      No rows affected (0.527 seconds)
      0: jdbc:hive2://bigdata-hadoop-master-1.kep.k> insert into t2 partition(manufacture_date='2019-01-01') values('asdf', '123');
      No rows affected (26.56 seconds)
      0: jdbc:hive2://bigdata-hadoop-master-1.kep.k>

Access Hue

Hue is a user interface provided for Hadoop Eco clusters.
Click the quick link on the cluster details page to access it. Log in using the administrator credentials set at cluster creation.

info

Hue provides both a Hive query editor and several browsers.

  • Browser types: file browser, table browser, job browser

Hue access ports by cluster type

Cluster availability | Access port
Standard (Single)    | Port 8888 of master node 1
HA                   | Port 8888 of master node 3

Image
Hue user login

Access Oozie

Oozie is a workflow management tool provided for clusters of type Core Hadoop in Hadoop Eco.

Oozie access ports by cluster type

Cluster availability | Access port
Standard (Single)    | Port 11000 of master node 1
HA                   | Port 11000 of master node 3
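
Besides the web UI, the Oozie server can also be checked from a terminal on the master node with the Oozie CLI. A minimal sketch, assuming the oozie client is on the PATH and the server listens on the default endpoint shown above:

Check Oozie server status
$ oozie admin -oozie http://localhost:11000/oozie -status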

Image
Oozie workflow list

Image
Oozie workflow job info

Image
Oozie workflow job details

Zeppelin

Zeppelin is a user interface available in Hadoop Eco clusters of type Core Hadoop or Trino.
Access it via the quick link on the cluster details page.

Zeppelin access ports by high availability setting

Cluster availability | Access port
Standard (Single)    | Port 8180 of master node 1
HA                   | Port 8180 of master node 3

Image
Zeppelin user interface

Interpreter

An interpreter is a plug-in that allows Zeppelin to run code or queries against a specific language or processing engine.
Zeppelin provides interpreters for Spark, Hive, and Trino.

Image
Using Zeppelin

Spark

  1. Run the spark-shell CLI on the master node and enter the test code.

    Run query using spark-shell CLI
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
          /_/

    Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
    Type in expressions to have them evaluated.
    Type :help for more information.


    scala> val count = sc.parallelize(1 to 90000000).filter { _ =>
    | val x = math.random
    | val y = math.random
    | x*x + y*y < 1
    | }.count()
    count: Long = 70691442
  2. Execute a Spark example file using the spark-submit command on the master node.

    $ spark-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/spark/examples/jars/spark-examples_*.jar 100

Access spark history server

  1. Depending on the cluster’s high availability configuration, attach a public IP to the master node and access the spark history server.

    Cluster availability | Access port
    Standard (Single)    | Port 18082 on master node 1
    HA                   | Port 18082 on master node 3
  2. Check job information via the history server.

    Image

Integrate spark with Zeppelin

  1. Attach a public IP to the master node where Zeppelin is installed, then access the Zeppelin UI.
    For access instructions, refer to Zeppelin.

  2. Run spark-shell in Zeppelin

    • From the top menu, click Notebook > Create new note, and select the spark interpreter in the popup.
    • Enter the code in the Zeppelin notebook as shown below.

    Image
    Check Spark-Zeppelin query result
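
    As an illustration, a Spark (Scala) paragraph could reuse the same calculation as the spark-shell example above, just with a smaller sample. A minimal sketch, assuming the Spark interpreter is selected with the %spark directive:

    Spark paragraph in a Zeppelin note
    %spark
    val count = sc.parallelize(1 to 1000000).filter { _ =>
      val x = math.random
      val y = math.random
      x*x + y*y < 1
    }.count()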

  3. Run pyspark in Zeppelin

    • From the top menu, click Notebook > Create new note, and select the spark interpreter in the popup.
    • Enter the code in the Zeppelin notebook as shown below.

    Image
    Check PySpark-Zeppelin query result
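
    Similarly, a PySpark paragraph can be written after the %pyspark directive. A minimal sketch of the same Monte Carlo estimate (sc is provided by the interpreter):

    PySpark paragraph in a Zeppelin note
    %pyspark
    import random
    count = sc.parallelize(range(1000000)) \
        .filter(lambda _: random.random() ** 2 + random.random() ** 2 < 1) \
        .count()
    print(4.0 * count / 1000000)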

  4. Run spark-submit in Zeppelin

    • From the top menu, click Notebook > Create new note, and select the spark interpreter in the popup.
    • Enter the code in the Zeppelin notebook as shown below.

    Image
    Check spark-submit result in Zeppelin
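
    Because spark-submit is a shell command, one option is to run it through Zeppelin's shell interpreter. A minimal sketch reusing the SparkPi example above, assuming the %sh interpreter is enabled in this deployment:

    spark-submit from a Zeppelin paragraph
    %sh
    spark-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/spark/examples/jars/spark-examples_*.jar 100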

Tez

Tez is a component that replaces the MapReduce engine in Hive to perform distributed data processing based on Yarn.
You can check information about Tez applications through the Tez Web UI.

  1. To access the Tez Web UI, attach a public IP to either master node 1 or master node 3 depending on the cluster’s high availability configuration.

  2. Then, add the hostname and public IP of master node 1 or 3 to the user's /etc/hosts file.
    Once all settings are complete, you can access the UI as shown below.

    Cluster availability | Access port
    Standard (Single)    | Port 9999 on master node 1
    HA                   | Port 9999 on master node 3
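
    For reference, the /etc/hosts entry described in step 2 might look like the following; the IP address and hostname below are placeholders for your own values:

    # /etc/hosts on the machine used to open the Tez Web UI
    203.0.113.10  hadoopmst-hadoop-ha-1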

    Image
    Tez Web UI

Trino

  1. Launch the Trino CLI. Enter the server information based on the location where the Trino coordinator is installed.

    Cluster availability | Access port
    Standard (Single)    | Port 8780 on master node 1
    HA                   | Port 8780 on master node 3
  2. Enter the following queries.

    Run query using Trino CLI
    $ trino --server http://hadoopmst-trino-ha-3:8780
    trino> SHOW CATALOGS;
    Catalog
    ---------
    hive
    system
    (2 rows)

    Query 20220701_064104_00014_9rp8f, FINISHED, 2 nodes
    Splits: 12 total, 12 done (100.00%)
    0.23 [0 rows, 0B] [0 rows/s, 0B/s]

    trino> SHOW SCHEMAS FROM hive;
    Schema
    --------------------
    default
    information_schema
    (2 rows)

    Query 20220701_064108_00015_9rp8f, FINISHED, 3 nodes
    Splits: 12 total, 12 done (100.00%)
    0.23 [2 rows, 35B] [8 rows/s, 155B/s]

    trino> select * from hive.default.t1;
    col1
    ------
    a
    b
    c
    (3 rows)

    Query 20220701_064113_00016_9rp8f, FINISHED, 1 node
    Splits: 5 total, 5 done (100.00%)
    0.23 [3 rows, 16B] [13 rows/s, 71B/s]

Access Trino web UI

  1. Access the Trino web UI via port 8780 on the server where the Trino coordinator is installed.

    Cluster availability | Access port
    Standard (Single)    | Port 8780 on master node 1
    HA                   | Port 8780 on master node 3
  2. View Trino query history and statistics.

    Image
    Trino web interface

Integrate Trino with Zeppelin

  1. In Zeppelin, select Trino as the interpreter.

    Image
    Trino-Zeppelin integration

  2. Enter the desired query for Trino and view the results.

    Image
    Trino-Zeppelin query result
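
    For reference, a Trino paragraph might reuse one of the CLI queries shown earlier. A minimal sketch, assuming the Trino interpreter configured in the previous step is selected with the %trino directive:

    Trino query in a Zeppelin paragraph
    %trino
    SHOW SCHEMAS FROM hive;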

Kafka

Apache Kafka is a large-scale real-time data streaming platform provided when the Hadoop Eco cluster type is set to Dataflow.
Kafka installed on a Hadoop Eco cluster can be operated using terminal commands. For further details beyond the examples below, refer to the official Kafka documentation.

  1. Create topic

    Create topic
    $ /opt/kafka/bin/kafka-topics.sh --create --topic my-topic --bootstrap-server $(hostname):9092
  2. Write events to topic

    Write events to topic
    $ echo '{"time":'$(date +%s)', "id": 1, "msg": "first event"}' | /opt/kafka/bin/kafka-console-producer.sh --topic my-topic --bootstrap-server $(hostname):9092
    $ echo '{"time":'$(date +%s)', "id": 2, "msg": "second event"}' | /opt/kafka/bin/kafka-console-producer.sh --topic my-topic --bootstrap-server $(hostname):9092
    $ echo '{"time":'$(date +%s)', "id": 3, "msg": "third event"}' | /opt/kafka/bin/kafka-console-producer.sh --topic my-topic --bootstrap-server $(hostname):9092
  3. Check events

    Check events
    $ /opt/kafka/bin/kafka-console-consumer.sh --topic my-topic --from-beginning --bootstrap-server $(hostname):9092  
    {"time":1692604787, "id": 1, "msg": "first event"}
    {"time":1692604792, "id": 2, "msg": "second event"}
    {"time":1692604796, "id": 3, "msg": "third event"}

Druid

Apache Druid is a real-time analytics database designed for fast, partitioned analysis of large datasets, and is available when the Hadoop Eco cluster type is set to Dataflow.
Druid supports various types of data sources and is well suited for building data pipelines.
Access the Druid UI from the cluster detail page via the quick link provided by the router.
UI access ports differ based on the Hadoop Eco cluster type.

Druid UI access ports by cluster type

Cluster availability | Access port
Standard (Single)    | Port 3008 on master node 1
HA                   | Port 3008 on master node 3

Explore Druid UI

  • Ingestion jobs list
    Image

  • Datasource list
    Image

  • Query
    Image

Load data from Kafka

  1. In the Druid UI, go to the Load data tab and select Streaming. From the list of available data sources, choose Apache Kafka.
    Image

  2. In the Bootstrap servers field, enter the Kafka host and port in the format <hostname>:<port>. In the Topic field, enter the name of the topic you want to ingest.
    Image

  3. After selecting the topic, choose the desired data shape and schema on the next pages. On the final page, review the ingestion spec and submit the ingestion.
    Image

  4. In the ingestion page, check the status of the created supervisor. Click the magnifying glass icon in the Actions column to view detailed logs and payload information.
    Image

Superset

Apache Superset is a data visualization and exploration platform available when the Hadoop Eco cluster type is set to Dataflow.
Superset supports various data sources (MySQL, SQLite, Hive, Druid, ...) and provides tools for query building, charting, dashboards, and more.

Access the Superset UI through the Superset link provided on the cluster detail page.
Superset UI access ports vary by Hadoop Eco cluster type.

Superset UI access ports by cluster type

Cluster availability | Access port
Standard (Single)    | Port 4000 on master node 1
HA                   | Port 4000 on master node 3

Explore Superset UI

  • Dashboard list
    Image

  • Dashboard: You can create and save dashboards, export them as images, or generate shareable links.
    Image

  • Charts: Create and save charts using registered physical datasets or virtual datasets based on queries from connected databases.
    Image

  • SQL Lab: Run queries on data from connected databases and check the results.
    Image

  • Data: Manage connected databases, created datasets, saved queries, and query history.
    Image

Run superset commands

Superset commands can be executed in the terminal as shown below.

Superset commands
# Set environment variables
$ export $(cat /opt/superset/.env)

# Example: run superset command
$ /opt/superset/superset-venv/bin/superset --help
Loaded your LOCAL configuration at [/opt/superset/superset_config.py]
Usage: superset [OPTIONS] COMMAND [ARGS]...

This is a management script for the Superset application.

Options:
--version Show the flask version
--help Show this message and exit.

Commands:
compute-thumbnails Compute thumbnails
db Perform database migrations.
export-dashboards Export dashboards to ZIP file
export-datasources Export datasources to ZIP file
fab FAB flask group commands
import-dashboards Import dashboards from ZIP file
import-datasources Import datasources from ZIP file
import-directory Imports configs from a given directory
init Inits the Superset application
load-examples Loads a set of Slices and Dashboards and a...
load-test-users Loads admin, alpha, and gamma user for...
re-encrypt-secrets
routes Show the routes for the app.
run Run a development server.
set-database-uri Updates a database connection URI
shell Run a shell in the app context.
superset This is a management script for the Superset...
sync-tags Rebuilds special tags (owner, type, favorited...
update-api-docs Regenerate the openapi.json file in docs
update-datasources-cache Refresh sqllab datasources cache
version Prints the current version number
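
As a concrete example, the version command listed above can be run the same way after sourcing the environment file:

Check Superset version
$ export $(cat /opt/superset/.env)
$ /opt/superset/superset-venv/bin/superset version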

Flink

Apache Flink is an open-source, unified stream and batch processing framework developed by the Apache Software Foundation.
Below is how to run Flink in Yarn session mode.

When running Flink in session mode, you can control the resources available to Flink using the options below.

Option | Description
-jm    | JobManager memory size
-tm    | TaskManager memory size
-s     | Number of CPU cores
-n     | Number of TaskManagers
-nm    | Application name
-d     | Run in background
Run flink
yarn-session.sh \
-jm 2048 \
-tm 2048 \
-s 2 \
-n 3 \
-nm yarn-session-jobs

Once the session starts, Flink runs on a Yarn node and its web UI can be accessed at the JobManager address printed in the startup log:

Flink interface
2022-07-07 23:15:33,775 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: CONNECTED
2022-07-07 23:15:33,800 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/leader/rest_server/connection_info'}.
JobManager Web Interface: http://hadoopwrk-logan-Standard(Single)-2:8082

Image
Flink web interface

After starting Flink in Yarn session mode, user jobs can be submitted and monitored via the web interface.

Image
Flink job execution

Run flink job
$ flink run /opt/flink/examples/batch/WordCount.jar
Executing WordCount example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 43aee6d7d947b1ce848e01c094801ab4
Program execution finished
Job with JobID 43aee6d7d947b1ce848e01c094801ab4 has finished.
Job Runtime: 7983 ms
Accumulator Results:
- 4e3b3f0ae1b13d861a25c3798bc15c04 (java.util.ArrayList) [170 elements]

(a,5)
(action,1)
(after,1)
(against,1)

Kakao HBase Tools

HBase Tools is an open-source project developed by Kakao.

info

For more details, refer to the Kakao Tech Blog.

Module | Description
hbase-manager Module | Region assignment management, splits, merges, and major compactions
  - Region Assignment Management
  - Advanced Split
  - Advanced Merge
  - Advanced Major Compaction
hbase-table-stat Module | Performance monitoring
  - Table Metrics Monitoring
hbase-snapshot Module | Backup and restore HBase data
  - Table Snapshot Management

Run Kakao HBase Tools

Pass the Zookeeper host information to retrieve HBase data.

Run Kakao HBase Tools
# hbase-manager
java -jar /opt/hbase/lib/tools/hbase-manager-1.2-1.5.7.jar <command> <zookeeper host name> <table name>

# hbase-snapshot
java -jar /opt/hbase/lib/tools/hbase-snapshot-1.2-1.5.7.jar <zookeeper host name> <table name>

# hbase-table-stat
java -jar /opt/hbase/lib/tools/hbase-table-stat-1.2-1.5.7.jar <zookeeper host name> <table name>

hbase-manager

Admin tool for managing HBase.

Image
hbase-manager

hbase-table-stat

Tool for checking current table status.

Image
hbase-table-stat

hbase-snapshot

Tool for generating table data snapshots.

Image
hbase-snapshot