
Integrate With Hadoop Eco

This page explains how to integrate Hadoop Eco with the Data Catalog service.

info

For detailed instructions on creating a Hadoop Eco cluster, see Create a cluster.

Step 1. Integrate Data Catalog in Hadoop Eco

When creating a cluster in Hadoop Eco, first integrate with Object Storage, then set up the Data Catalog.
Connection details differ by catalog type (Standard, Iceberg). Select the appropriate type below.

  1. To integrate with Object Storage, add core-site.xml properties in Hadoop Eco > Create Cluster > Step 3: Advanced settings (optional).
  • For issuing S3 credentials, see Authentication.

    Object Storage connection settings
    {
      "configurations": [
        {
          "classification": "core-site",
          "properties": {
            "fs.s3a.service.kic.credential.id": "credential_id",
            "fs.s3a.service.kic.credential.secret": "credential_secret",
            "fs.s3a.access.key": "access_key",
            "fs.s3a.secret.key": "secret_key",
            "fs.s3a.buckets.create.region": "kr-central-2",
            "fs.s3a.endpoint.region": "kr-central-2",
            "fs.s3a.endpoint": "objectstorage.kr-central-2.kakaocloud.com",
            "s3service.s3-endpoint": "objectstorage.kr-central-2.kakaocloud.com"
          }
        }
      ]
    }
  2. To integrate with the Data Catalog, configure Create Cluster > Step 3: Service integration settings (optional).
    • In Service Integration, select Integrate with Data Catalog.
    • In Data Catalog integration, verify the Hadoop network/subnet, then select the desired catalog.

Step 2. Use components

This section describes how to use components after Hadoop Eco has been integrated with the Data Catalog service.

info

For details on using Hadoop Eco components, see Use components.

Step 3. Create tables and insert data using queries

After integrating Hadoop Eco and Data Catalog, create tables and access data using Hive, Spark, and Trino.

Create tables and insert data with Hive queries

This section shows how to create tables in several formats and insert data using Hive queries. The TEXT format is shown first; a sketch for another format follows the example.

TEXT format
$ hive (data_table)> CREATE EXTERNAL TABLE text_table (
> col1 string
> )
> LOCATION 's3a://kbc-test.kc/data_table/text_table';

OK
Time taken: 5.351 seconds
$ hive (data_table)>
> INSERT INTO TABLE text_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.text_table stats: [numFiles=1, totalSize=16]
OK
col1
Time taken: 31.864 seconds
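The other formats follow the same pattern and differ mainly in the STORED AS clause. The sketch below is a hedged example for a Parquet-backed table; the table name parquet_table and its LOCATION path are illustrative placeholders rather than part of the original walkthrough.

 # Hedged example: create a Parquet-format table and insert the same sample rows
CREATE EXTERNAL TABLE parquet_table (
col1 string
)
STORED AS PARQUET
LOCATION 's3a://kbc-test.kc/data_table/parquet_table';

INSERT INTO TABLE parquet_table VALUES ('a'), ('b'), ('c');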

Check data with Spark queries

Use spark-shell to check the contents of the tables created with Hive.

caution

Most of the test queries below work as expected; however, some version and table format combinations have known compatibility issues that are currently being fixed. The problematic combinations are:

  • spark2: orc, json
  • spark3: json

Check tables
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.2
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("use data_table").show()
....
scala> spark.sql("show tables").show()
+----------+-------------+-----------+
| namespace|    tableName|isTemporary|
+----------+-------------+-----------+
|data_table|   avro_table|      false|
|data_table|    csv_table|      false|
|data_table|   json_table|      false|
|data_table|    orc_table|      false|
|data_table|parquet_table|      false|
|data_table|   text_table|      false|
+----------+-------------+-----------+
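To check table contents rather than just the table list, you can run a SELECT from the same spark-shell session. A minimal sketch, assuming the text_table created earlier with Hive; the rows shown simply reflect the values inserted in the Hive example.

scala> spark.sql("select * from data_table.text_table").show()
+----+
|col1|
+----+
|   a|
|   b|
|   c|
+----+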

Check data with Trino queries

Use the Trino CLI to check the contents of tables created with Hive.

  $ trino --server http://$(hostname):8780

trino> show catalogs;
Catalog
---------
hive
system
(2 rows)

trino> show schemas in hive;
Schema
--------------------
...
default
information_schema
kbc_hive_test
....
(8 rows)

trino> show tables in hive.kbc_hive_test;
Table
------------------
datatype_avro
datatype_csv
datatype_json
datatype_orc
datatype_parquet
datatype_text
(6 rows)
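You can also read table contents directly from the Trino CLI. A minimal sketch, assuming the schema and table names from the listing above; the column layout depends on how those tables were defined, so the output is omitted here.

trino> select * from hive.kbc_hive_test.datatype_text limit 10;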

Step 4. Add table partitions

You can use Hive queries to add partition information to tables integrated with Hadoop Eco.

Insert partition data

Create a partitioned table in Hive, then use an INSERT statement to load partition data and verify the partition information, as sketched below.

 # Create a partitioned table
CREATE EXTERNAL TABLE text_table (
col1 string
) PARTITIONED BY (yymmdd STRING)
LOCATION 's3a://kbc-test.kc/data_table/text_table';
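
A minimal sketch of inserting data into a partition and verifying the partition metadata, assuming the partitioned text_table above; the partition value yymmdd='240101' is an illustrative placeholder.

 # Insert data into a specific partition (partition value is illustrative)
INSERT INTO TABLE text_table PARTITION (yymmdd='240101') VALUES ('a'), ('b'), ('c');

 # Verify the partition information
SHOW PARTITIONS text_table;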

MSCK REPAIR

If you create directories in Object Storage that match the table’s partition scheme and then create an EXTERNAL table, you can register the partitions using Hive’s MSCK command, as sketched after the listing below.

 # Check data in Object Storage
$ hadoop fs -ls s3a://kbc-test.kc/tables/orders/
Found 7 items
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 s3a://kbc-test.kc/tables/orders/year=1992
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 s3a://kbc-test.kc/tables/orders/year=1993
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 s3a://kbc-test.kc/tables/orders/year=1994
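
Once the partition directories exist, you can register them with the MSCK command. A minimal sketch, assuming a table named orders defined over the s3a://kbc-test.kc/tables/orders/ location shown above.

 # Register the year=... directories above as partitions of the orders table
MSCK REPAIR TABLE orders;

 # Verify that the partitions were added
SHOW PARTITIONS orders;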