Integrate with Hadoop Eco
Explain how to integrate Hadoop Eco with Data Catalog service.
For detailed instructions on creating Hadoop Eco clusters, refer to Create cluster.
Step 1. Integrate Data Catalog with Hadoop Eco
Follow these steps to integrate Data Catalog after creating a cluster and connecting to Object Storage in Hadoop Eco.
-
To connect to Object Storage, add core-site.xml information at Create Hadoop Cluster > Step 3: Detailed Settings (Optional).
- For issuing credentials for S3 integration, refer to Issue credentials document.
Connect to Object Storage{
"configurations":
[
{
"classification": "core-site",
"properties":
{
"fs.swifta.service.kic.credential.id": "credential_id",
"fs.swifta.service.kic.credential.secret": "credential_secret",
"fs.s3a.access.key": "access_key",
"fs.s3a.secret.key": "secret_key",
"fs.s3a.buckets.create.region": "kr-central-2",
"fs.s3a.endpoint.region": "kr-central-2",
"fs.s3a.endpoint": "objectstorage.kr-central-2.kakaocloud.com",
"s3service.s3-endpoint": "objectstorage.kr-central-2.kakaocloud.com"
}
}
]
} -
To integrate with Data Catalog, go to Create Cluster > Step 5: Service Integration Settings (Optional).
- Select Data Catalog integration under the service integration options.
- Check Hadoop network/subnet information and choose the desired catalog.
Step 2. Use components
Explain how to use components after integrating Hadoop Eco with Data Catalog service.
For detailed explanations on using Hadoop Eco components, refer to Using Components document.
Step 3. Create tables and insert data using queries
After integrating Hadoop Eco with Data Catalog, use Hive, Spark, and Trino to create tables and access data.
Create tables and insert data using Hive queries
Explain how to create and insert data in various formats using Hive queries.
- text
- avro
- parquet
- csv
- orc
- json
$ hive (data_table)> CREATE EXTERNAL TABLE text_table (
> col1 string
> )
> LOCATION 'swifta://kbc-test.kc/data_table/text_table';
OK
Time taken: 5.351 seconds
$ hive (data_table)>
> INSERT INTO TABLE text_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.text_table stats: [numFiles=1, totalSize=16]
OK
col1
Time taken: 31.864 second
$ hive (data_table)> CREATE EXTERNAL TABLE avro_table (
> col1 string
> )
> STORED AS PARQUET
> LOCATION 'swifta://kbc-test.kc/data_table/avro_table';
OK
Time taken: 3.032 seconds
$ hive (data_table)>
> INSERT INTO TABLE avro_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.avro_table stats: [numFiles=1, totalSize=266]
OK
col1
Time taken: 29.613 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE parquet_table (
> col1 string
> )
> STORED AS PARQUET
> LOCATION 'swifta://kbc-test.kc/data_table/parquet_table';
OK
Time taken: 1.041 seconds
$ hive (data_table)>
> INSERT INTO TABLE parquet_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.parquet_table stats: [numFiles=1, totalSize=266]
OK
col1
Time taken: 26.808 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE csv_table (
> col1 string
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS textfile
> LOCATION 'swifta://kbc-test.kc/data_table/csv_table';
OK
Time taken: 5.459 seconds
$ hive (data_table)>
> INSERT INTO TABLE csv_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.csv_table stats: [numFiles=1, totalSize=22]
OK
col1
Time taken: 27.47 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE orc_table (
> col1 string
> )
> STORED AS ORC
> LOCATION 'swifta://kbc-test.kc/data_table/orc_table';
OK
Time taken: 1.423 seconds
$ hive (data_table)>
> INSERT INTO TABLE orc_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.orc_table stats: [numFiles=1, totalSize=216]
OK
col1
Time taken: 22.871 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE json_table (
> col1 string
> )
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS textfile
> LOCATION 'swifta://kbc-test.kc/data_table/json_table';
OK
Time taken: 3.917 seconds
$ hive (data_table)>
> INSERT INTO TABLE json_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.json_table stats: [numFiles=1, totalSize=35]
OK
col1
Time taken: 22.638 seconds
Check data using Spark queries
Use spark-shell to check the content of data tables created with Hive.
The following test queries should all work, but there are compatibility issues depending on versions or table formats. Currently, the following versions and formats are known to have issues:
- spark2: orc, json
- spark3: json
- Check table
- Check data
$ spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.2
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sql("use data_table").show()
....
scala> spark.sql("show tables").show()
+----------+-------------+-----------+
| namespace| tableName|isTemporary|
+----------+-------------+-----------+
|data_table| avro_table| false|
|data_table| csv_table| false|
|data_table| json_table| false|
|data_table| orc_table| false|
|data_table|parquet_table| false|
|data_table| text_table| false|
+----------+-------------+-----------+
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.text_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.avro_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.parquet_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.csv_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.orc_table group by col1").show()
or
$ scala> spark.sql("SELECT col1, count(*) FROM data_table.json_table group by col1").show()
.....
+----+--------+
|col1|count(1)|
+----+--------+
| c| 1|
| b| 1|
| a| 1|
+----+--------+
Check data using Trino queries
Use the Trino CLI to check the contents of the data tables created with Hive.
- Check table
- Check data
$ trino --server http://$(hostname):8780
trino> show catalogs;
Catalog
---------
hive
system
(2 rows)
trino> show schemas in hive;
Schema
--------------------
...
default
information_schema
kbc_hive_test
....
(8 rows)
trino> show tables in hive.kbc_hive_test;
Table
------------------
datatype_avro
datatype_csv
datatype_json
datatype_orc
datatype_parquet
datatype_text
(6 rows)
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_avro group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_csv group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_json group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_orc group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_parquet group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_text group by col_bigint;
.....
col_bigint | _col1
------------+-------
1 | 6
(1 row)
Step 4. Add table partitions
You can add partition information to tables integrated with Hadoop Eco using Hive queries.
Insert partition data
Create partitioned tables in Hive and insert partition data to add partition information. Use the INSERT statement to insert partition data into the table and verify the information.
- Create partitioned table
- 임시 데이터 입력
- Check partitioned data
# Create partitioned table
CREATE EXTERNAL TABLE text_table (
col1 string
) PARTITIONED BY (yymmdd STRING)
LOCATION 'swifta://kbc-test.kc/data_table/text_table';
# Insert temporary data
INSERT INTO TABLE text_table PARTITION (yymmdd)
SELECT 'a', '240101';
# Check partitioned data
hive (default)> show partitions text_table10;
OK
partition
yymmdd=240101
MSCK REPAIR
After creating directories in Object Storage that match the partition information of the table and creating an EXTERNAL table, you can use Hive's MSCK command to add partition information.
- Check Object Storage data
- Create Hive table
- Run MSCK command
- Check partitioned data
# Check Object Storage data
$ hadoop fs -ls swifta://kbc-test.kc/tables/orders/
Found 7 items
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1992
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1993
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1994
# Create Hive table
CREATE EXTERNAL TABLE orders (
col1 string
) PARTITIONED BY (year STRING)
LOCATION 'swifta://kbc-test.kc/tables';
# Run MSCK command
MSCK REPAIR TABLE orders;
# Check partitioned data
hive (default)> show partitions orders;
OK
partition
year=1992