Integrate With Hadoop Eco
This page explains how to integrate Hadoop Eco with the Data Catalog service.
For detailed instructions on creating a Hadoop Eco cluster, see Create a cluster.
Step 1. Integrate Data Catalog in Hadoop Eco
When creating a cluster in Hadoop Eco, first integrate with Object Storage, then set up the Data Catalog.
Connection details differ by catalog type (Standard, Iceberg). Select the appropriate type below.
- Integrate with a Standard catalog
- Integrate with an Iceberg catalog
- To integrate with Object Storage, add core-site.xml properties in Hadoop Eco > Create Cluster > Step 3: Advanced settings (optional).
- For issuing S3 credentials, see Authentication.
Object Storage connection settings

{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "fs.s3a.service.kic.credential.id": "credential_id",
        "fs.s3a.service.kic.credential.secret": "credential_secret",
        "fs.s3a.access.key": "access_key",
        "fs.s3a.secret.key": "secret_key",
        "fs.s3a.buckets.create.region": "kr-central-2",
        "fs.s3a.endpoint.region": "kr-central-2",
        "fs.s3a.endpoint": "objectstorage.kr-central-2.kakaocloud.com",
        "s3service.s3-endpoint": "objectstorage.kr-central-2.kakaocloud.com"
      }
    }
  ]
}
- To integrate with the Data Catalog, configure Create Cluster > Step 3: Service integration settings (optional).
- In Service Integration, select Integrate with Data Catalog.
- In Data Catalog integration, verify the Hadoop network/subnet, then select the desired catalog.
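Once the cluster is up with these settings, a quick way to confirm the Object Storage connection is to list your bucket from a cluster node. The bucket name below is an example; replace it with your own:

# Replace kbc-test.kc with your own bucket; a successful listing confirms the core-site settings were applied
$ hadoop fs -ls s3a://kbc-test.kc/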
- To integrate with Object Storage, add settings for your engine in Hadoop Eco > Create Cluster > Step 3: Advanced settings (optional).
- For issuing S3 credentials, see Authentication.
- If using Trino: add iceberg-trino.properties
- If using Spark: add iceberg-spark-defaults.conf
Object Storage connection settings

{
"configurations": [
{
"classification": "iceberg-setting",
"properties": {
"enabled": "true"
}
},
{
"classification": "iceberg-trino.properties",
"properties": {
"iceberg.rest.catalog.uri": "https://data-catalog-iceberg.kr-central-2.kakaocloud.com",
"iceberg.rest.catalog.warehouse": "s3://bucketName/prefix",
"iceberg.rest.catalog.credential.id": "credential_id",
"iceberg.rest.catalog.credential.secret": "credential_secret",
"iceberg.rest.catalog.s3.endpoint": "https://objectstorage.kr-central-2.kakaocloud.com",
"iceberg.rest.catalog.s3.region": "kr-central-2",
"iceberg.rest.catalog.s3.access": "access",
"iceberg.rest.catalog.s3.secret.key": "key"
}
},
{
"classification": "iceberg-spark-defaults.conf",
"properties": {
"spark.sql.catalog.datacatalog.uri": "https://data-catalog-iceberg.kr-central-2.kakaocloud.com",
"spark.sql.catalog.datacatalog.warehouse": "s3a://bucketName/prefix",
"spark.sql.catalog.datacatalog.s3.endpoint": "https://objectstorage.kr-central-2.kakaocloud.com",
"spark.sql.catalog.datacatalog.s3.region": "kr-central-2",
"spark.sql.catalog.datacatalog.s3.access.key.id": "id",
"spark.sql.catalog.datacatalog.s3.secret.access.key": "key",
"spark.sql.catalog.datacatalog.header.credential_id": "credential_id",
"spark.sql.catalog.datacatalog.header.credential_secret": "credential_secret",
"spark.hadoop.fs.s3a.access.key": "access_key",
"spark.hadoop.fs.s3a.secret.key": "secret_key"
}
}
]
}
- To integrate with the Data Catalog, configure Create Cluster > Step 3: Service integration settings (optional).
- In Service Integration, select Integrate with Data Catalog.
- In Data Catalog integration, verify the Hadoop network/subnet, then select the desired catalog.
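After the cluster is created, you can work with Iceberg tables through the configured catalog. The sketch below assumes the Spark catalog is exposed under the name datacatalog, as in iceberg-spark-defaults.conf above, and that the cluster handles the Iceberg catalog registration; the namespace and table names are illustrative:

$ spark-sql
spark-sql> -- "datacatalog" comes from iceberg-spark-defaults.conf; demo/events are example names
spark-sql> CREATE NAMESPACE IF NOT EXISTS datacatalog.demo;
spark-sql> CREATE TABLE datacatalog.demo.events (id BIGINT, name STRING) USING iceberg;
spark-sql> INSERT INTO datacatalog.demo.events VALUES (1, 'a'), (2, 'b');
spark-sql> SELECT * FROM datacatalog.demo.events;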
Step 2. Use components
This section describes how to use components after integrating Hadoop Eco with the Data Catalog service.
For details on using Hadoop Eco components, see Use components.
Step 3. Create tables and insert data using queries
After integrating Hadoop Eco and Data Catalog, create tables and access data using Hive, Spark, and Trino.
Create tables and insert data with Hive queries
This section shows how to create data in multiple formats and insert data using Hive queries.
- TEXT
- Avro
- Parquet
- CSV
- ORC
- JSON
$ hive (data_table)> CREATE EXTERNAL TABLE text_table (
> col1 string
> )
> LOCATION 's3a://kbc-test.kc/data_table/text_table';
OK
Time taken: 5.351 seconds
$ hive (data_table)>
> INSERT INTO TABLE text_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.text_table stats: [numFiles=1, totalSize=16]
OK
col1
Time taken: 31.864 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE avro_table (
> col1 string
> )
    > STORED AS AVRO
> LOCATION 's3a://kbc-test.kc/data_table/avro_table';
OK
Time taken: 3.032 seconds
$ hive (data_table)>
> INSERT INTO TABLE avro_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.avro_table stats: [numFiles=1, totalSize=266]
OK
col1
Time taken: 29.613 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE parquet_table (
> col1 string
> )
> STORED AS PARQUET
> LOCATION 's3a://kbc-test.kc/data_table/parquet_table';
OK
Time taken: 1.041 seconds
$ hive (data_table)>
> INSERT INTO TABLE parquet_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.parquet_table stats: [numFiles=1, totalSize=266]
OK
col1
Time taken: 26.808 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE csv_table (
> col1 string
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS textfile
> LOCATION 's3a://kbc-test.kc/data_table/csv_table';
OK
Time taken: 5.459 seconds
$ hive (data_table)>
> INSERT INTO TABLE csv_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.csv_table stats: [numFiles=1, totalSize=22]
OK
col1
Time taken: 27.47 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE orc_table (
> col1 string
> )
> STORED AS ORC
> LOCATION 's3a://kbc-test.kc/data_table/orc_table';
OK
Time taken: 1.423 seconds
$ hive (data_table)>
> INSERT INTO TABLE orc_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.orc_table stats: [numFiles=1, totalSize=216]
OK
col1
Time taken: 22.871 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE json_table (
> col1 string
> )
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS textfile
> LOCATION 's3a://kbc-test.kc/data_table/json_table';
OK
Time taken: 3.917 seconds
$ hive (data_table)>
> INSERT INTO TABLE json_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.json_table stats: [numFiles=1, totalSize=35]
OK
col1
Time taken: 22.638 seconds
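After running the statements above, you can confirm that all six tables were registered in the catalog. The abbreviated output below assumes the same data_table database:

$ hive (data_table)> SHOW TABLES;
OK
avro_table
csv_table
json_table
orc_table
parquet_table
text_table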
Check data with Spark queries
Use spark-shell to check the contents of the tables created with Hive.
All of the test queries below are expected to work; however, there are known compatibility issues with certain Spark versions and table formats, and fixes are in progress. The problematic combinations are:
- spark2: orc, json
- spark3: json
- Check tables
- Query data
$ spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.2
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sql("use data_table").show()
....
scala> spark.sql("show tables").show()
+----------+-------------+-----------+
| namespace| tableName|isTemporary|
+----------+-------------+-----------+
|data_table| avro_table| false|
|data_table| csv_table| false|
|data_table| json_table| false|
|data_table| orc_table| false|
|data_table|parquet_table| false|
|data_table| text_table| false|
+----------+-------------+-----------+
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.text_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.avro_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.parquet_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.csv_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.orc_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.json_table group by col1").show()
.....
+----+--------+
|col1|count(1)|
+----+--------+
| c| 1|
| b| 1|
| a| 1|
+----+--------+
Check data with Trino queries
Use the Trino CLI to check the contents of Hive tables registered in the Data Catalog. The example below queries a separate test schema, kbc_hive_test.
- Check tables
- Query data
$ trino --server http://$(hostname):8780
trino> show catalogs;
Catalog
---------
hive
system
(2 rows)
trino> show schemas in hive;
Schema
--------------------
...
default
information_schema
kbc_hive_test
....
(8 rows)
trino> show tables in hive.kbc_hive_test;
Table
------------------
datatype_avro
datatype_csv
datatype_json
datatype_orc
datatype_parquet
datatype_text
(6 rows)
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_avro group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_csv group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_json group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_orc group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_parquet group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_text group by col_bigint;
.....
col_bigint | _col1
------------+-------
1 | 6
(1 row)
Step 4. Add table partitions
You can add partition information to tables integrated with Hadoop Eco using Hive queries.
Insert partition data
Create a partitioned table in Hive, insert data into a partition with an INSERT statement, and then verify the partition information.
- Create a partitioned table
- Insert temporary data
- Verify partitioned data
# Create a partitioned table
CREATE EXTERNAL TABLE text_table (
col1 string
) PARTITIONED BY (yymmdd STRING)
LOCATION 's3a://kbc-test.kc/data_table/text_table';
# Insert temporary data
INSERT INTO TABLE text_table PARTITION (yymmdd)
SELECT 'a', '240101';
# Verify partitioned data
hive (default)> show partitions text_table;
OK
partition
yymmdd=240101
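A fully dynamic INSERT like the one above can be rejected when Hive runs in strict dynamic-partition mode. If that happens, enable dynamic partitioning for the session before re-running the INSERT (these are standard Hive properties):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;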
MSCK REPAIR
If you create directories in Object Storage to match the table’s partition scheme and then create an EXTERNAL table, you can add partition information using Hive’s MSCK command.
- Check data in Object Storage
- Create a Hive table
- Run the MSCK command
- Verify partitioned data
# Check data in Object Storage
$ hadoop fs -ls s3a://kbc-test.kc/tables/orders/
Found 7 items
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 s3a://kbc-test.kc/tables/orders/year=1992
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 s3a://kbc-test.kc/tables/orders/year=1993
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 s3a://kbc-test.kc/tables/orders/year=1994
# Create a Hive table
CREATE EXTERNAL TABLE orders (
col1 string
) PARTITIONED BY (year STRING)
LOCATION 's3a://kbc-test.kc/tables/orders';
# Run the MSCK command
MSCK REPAIR TABLE orders;
# Verify partitioned data
hive (default)> show partitions orders;
OK
partition
year=1992
year=1993
year=1994
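If you only need to register a specific directory rather than scanning the whole table location, ALTER TABLE ... ADD PARTITION is an alternative to MSCK REPAIR. The example below reuses the paths from the listing above:

# Register a single partition directory explicitly
ALTER TABLE orders ADD IF NOT EXISTS PARTITION (year='1993')
LOCATION 's3a://kbc-test.kc/tables/orders/year=1993';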