
Integrate with Hadoop Eco

This guide explains how to integrate Hadoop Eco with the Data Catalog service.

info

For detailed instructions on creating Hadoop Eco clusters, refer to Create cluster.

Step 1. Integrate Data Catalog with Hadoop Eco

Follow the steps below to connect to Object Storage and integrate Data Catalog while creating a Hadoop Eco cluster.

  1. To connect to Object Storage, add core-site.xml information at Create Hadoop Cluster > Step 3: Detailed Settings (Optional).

    Object Storage Connection Information
    {
      "configurations": [
        {
          "classification": "core-site",
          "properties": {
            "fs.swifta.service.kic.credential.id": "credential_id",
            "fs.swifta.service.kic.credential.secret": "credential_secret",
            "fs.s3a.access.key": "access_key",
            "fs.s3a.secret.key": "secret_key",
            "fs.s3a.buckets.create.region": "kr-central-2",
            "fs.s3a.endpoint.region": "kr-central-2",
            "fs.s3a.endpoint": "objectstorage.kr-central-2.kakaocloud.com",
            "s3service.s3-endpoint": "objectstorage.kr-central-2.kakaocloud.com"
          }
        }
      ]
    }
  2. To integrate with Data Catalog, go to Create Cluster > Step 5: Service Integration Settings (Optional).

    • Select Data Catalog integration under the service integration options.
    • Check Hadoop network/subnet information and choose the desired catalog.

Step 2. Use components

This section explains how to use components after integrating Hadoop Eco with the Data Catalog service.

info

For detailed explanations on using Hadoop Eco components, refer to the Using Components document.

Step 3. Create tables and insert data using queries

After integrating Hadoop Eco with Data Catalog, use Hive, Spark, and Trino to create tables and access data.

Create tables and insert data using Hive queries

The following examples show how to create tables and insert data in various formats using Hive queries.

text format
$ hive (data_table)> CREATE EXTERNAL TABLE text_table (
> col1 string
> )
> LOCATION 'swifta://kbc-test.kc/data_table/text_table';

OK
Time taken: 5.351 seconds
$ hive (data_table)>
> INSERT INTO TABLE text_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.text_table stats: [numFiles=1, totalSize=16]
OK
col1
Time taken: 31.864 seconds
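
The example above uses a text-format table. The same pattern applies to the other formats; the following is a minimal sketch for a Parquet table, assuming the same bucket and database as the text example above (replace the table name and path with your own values).

parquet format
-- Minimal sketch; the path and table name follow the text_table example above
CREATE EXTERNAL TABLE parquet_table (
  col1 string
)
STORED AS PARQUET
LOCATION 'swifta://kbc-test.kc/data_table/parquet_table';

-- Insert a few test rows, as in the text-format example
INSERT INTO TABLE parquet_table VALUES ('a'), ('b'), ('c');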

Check data using Spark queries

Use spark-shell to check the content of data tables created with Hive.

caution

All of the following test queries are expected to work; however, compatibility issues can occur depending on the version or table format. The following versions and formats are currently known to have issues:

  • Spark 2: ORC, JSON
  • Spark 3: JSON

Check table
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("use data_table").show()
....
scala> spark.sql("show tables").show()
+----------+-------------+-----------+
| namespace|    tableName|isTemporary|
+----------+-------------+-----------+
|data_table|   avro_table|      false|
|data_table|    csv_table|      false|
|data_table|   json_table|      false|
|data_table|    orc_table|      false|
|data_table|parquet_table|      false|
|data_table|   text_table|      false|
+----------+-------------+-----------+
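
The listing above only confirms that the tables registered in the catalog are visible from Spark. To check the actual contents of a table, a query like the following can be run in the same spark-shell session; the table name data_table.text_table follows the Hive example above.

scala> spark.sql("SELECT * FROM data_table.text_table").show()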

Check data using Trino queries

Use the Trino CLI to check the contents of the data tables created with Hive.

  $ trino --server http://$(hostname):8780

trino> show catalogs;
Catalog
---------
hive
system
(2 rows)

trino> show schemas in hive;
Schema
--------------------
...
default
information_schema
kbc_hive_test
....
(8 rows)

trino> show tables in hive.kbc_hive_test;
Table
------------------
datatype_avro
datatype_csv
datatype_json
datatype_orc
datatype_parquet
datatype_text
(6 rows)
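
The statements above only list the available schemas and tables. To check the contents of a table from the same Trino session, a query like the following can be used; the schema and table names follow the listing above and should be replaced with your own.

trino> SELECT * FROM hive.kbc_hive_test.datatype_text LIMIT 10;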

Step 4. Add table partitions

You can add partition information to tables integrated with Hadoop Eco using Hive queries.

Insert partition data

Create a partitioned table in Hive and insert data into its partitions to add partition information. Use the INSERT statement to insert data into a partition, then verify the partition information, as shown below.

 # Create partitioned table
CREATE EXTERNAL TABLE text_table (
  col1 string
)
PARTITIONED BY (yymmdd STRING)
LOCATION 'swifta://kbc-test.kc/data_table/text_table';
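
After creating the partitioned table, partition data can be inserted and the registered partitions checked with queries such as the following sketch; the partition value '20240101' is an arbitrary example.

 # Insert data into a partition and verify the partition information
INSERT INTO TABLE text_table PARTITION (yymmdd='20240101') VALUES ('a'), ('b'), ('c');

SHOW PARTITIONS text_table;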

MSCK REPAIR

After creating directories in Object Storage that match the table's partition layout and creating an EXTERNAL table on top of them, you can use Hive's MSCK REPAIR TABLE command to add the partition information.

 # Check Object Storage data
$ hadoop fs -ls swifta://kbc-test.kc/tables/orders/
Found 7 items
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1992
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1993
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1994
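
With the directories above in place and the EXTERNAL table created, the partition metadata can be registered with a command such as the following; the table name orders is an assumption matching the swifta://kbc-test.kc/tables/orders/ location above.

 # Register partitions found in Object Storage and check the result
MSCK REPAIR TABLE orders;

SHOW PARTITIONS orders;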