Integrate with Hadoop Eco
This document explains how to integrate Hadoop Eco with the Data Catalog service.
For detailed instructions on creating Hadoop Eco clusters, refer to the Create cluster document.
Step 1. Integrate Data Catalog with Hadoop Eco
Follow these steps to integrate Data Catalog after creating a cluster and connecting it to Object Storage in Hadoop Eco.
- To connect to Object Storage, add the core-site.xml information at Create Hadoop Cluster > Step 3: Detailed Settings (Optional). A connectivity check is sketched at the end of this step.
- To issue credentials for S3 integration, refer to the Issue credentials document.
Object Storage connection information
{
"configurations":
[
{
"classification": "core-site",
"properties":
{
"fs.swifta.service.kic.credential.id": "credential_id",
"fs.swifta.service.kic.credential.secret": "credential_secret",
"fs.s3a.access.key": "access_key",
"fs.s3a.secret.key": "secret_key",
"fs.s3a.buckets.create.region": "kr-central-2",
"fs.s3a.endpoint.region": "kr-central-2",
"fs.s3a.endpoint": "objectstorage.kr-central-2.kakaocloud.com",
"s3service.s3-endpoint": "objectstorage.kr-central-2.kakaocloud.com"
}
}
]
}
- To integrate with Data Catalog, go to Create Cluster > Step 5: Service Integration Settings (Optional).
- Select Data Catalog integration under the service integration options.
- Check Hadoop network/subnet information and choose the desired catalog.
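Once the cluster is created with the settings above, you can verify the Object Storage connection from a terminal on the cluster. The following is a minimal sketch; kbc-test is the example bucket used throughout this document, so replace it with your own bucket, and the data_table path is only a hypothetical location for the tables created in Step 3.
# List the bucket root to confirm the swifta connector and credentials work
$ hadoop fs -ls swifta://kbc-test.kc/
# Create a path for the example tables used in Step 3 (hypothetical path)
$ hadoop fs -mkdir -p swifta://kbc-test.kc/data_table/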
Step 2. Use components
This section explains how to use components after integrating Hadoop Eco with the Data Catalog service.
For detailed explanations on using Hadoop Eco components, refer to the Using Components document.
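The examples in Step 3 below are run from a terminal on the cluster's master node. The following is a minimal sketch that assumes SSH access to the master node (the address is hypothetical) and simply starts each component's CLI; the Trino port matches the one used in Step 3.
$ ssh ubuntu@<master-node-ip>
$ hive
$ spark-shell
$ trino --server http://$(hostname):8780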
Step 3. Create tables and insert data using queries
After integrating Hadoop Eco with Data Catalog, use Hive, Spark, and Trino to create tables and access data.
Create tables and insert data using Hive queries
This section explains how to create tables and insert data in the following formats using Hive queries.
- text
- avro
- parquet
- csv
- orc
- json
$ hive (data_table)> CREATE EXTERNAL TABLE text_table (
> col1 string
> )
> LOCATION 'swifta://kbc-test.kc/data_table/text_table';
OK
Time taken: 5.351 seconds
$ hive (data_table)>
> INSERT INTO TABLE text_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.text_table stats: [numFiles=1, totalSize=16]
OK
col1
Time taken: 31.864 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE avro_table (
> col1 string
> )
    > STORED AS AVRO
> LOCATION 'swifta://kbc-test.kc/data_table/avro_table';
OK
Time taken: 3.032 seconds
$ hive (data_table)>
> INSERT INTO TABLE avro_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.avro_table stats: [numFiles=1, totalSize=266]
OK
col1
Time taken: 29.613 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE parquet_table (
> col1 string
> )
> STORED AS PARQUET
> LOCATION 'swifta://kbc-test.kc/data_table/parquet_table';
OK
Time taken: 1.041 seconds
$ hive (data_table)>
> INSERT INTO TABLE parquet_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.parquet_table stats: [numFiles=1, totalSize=266]
OK
col1
Time taken: 26.808 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE csv_table (
> col1 string
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS textfile
> LOCATION 'swifta://kbc-test.kc/data_table/csv_table';
OK
Time taken: 5.459 seconds
$ hive (data_table)>
> INSERT INTO TABLE csv_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.csv_table stats: [numFiles=1, totalSize=22]
OK
col1
Time taken: 27.47 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE orc_table (
> col1 string
> )
> STORED AS ORC
> LOCATION 'swifta://kbc-test.kc/data_table/orc_table';
OK
Time taken: 1.423 seconds
$ hive (data_table)>
> INSERT INTO TABLE orc_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.orc_table stats: [numFiles=1, totalSize=216]
OK
col1
Time taken: 22.871 seconds
$ hive (data_table)> CREATE EXTERNAL TABLE json_table (
> col1 string
> )
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS textfile
> LOCATION 'swifta://kbc-test.kc/data_table/json_table';
OK
Time taken: 3.917 seconds
$ hive (data_table)>
> INSERT INTO TABLE json_table VALUES ('a'), ('b'), ('c');
.....
Table data_table.json_table stats: [numFiles=1, totalSize=35]
OK
col1
Time taken: 22.638 seconds
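To confirm where each table's data is stored, you can describe the table in Hive or list its Object Storage location directly. This is a minimal sketch using the text_table created above; the Location field in the DESCRIBE output should match the swifta:// path given in the CREATE TABLE statement.
$ hive (data_table)> DESCRIBE FORMATTED text_table;
$ hadoop fs -ls swifta://kbc-test.kc/data_table/text_table/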
Check data using Spark queries
Use spark-shell to check the contents of the data tables created with Hive.
The test queries below should all work, but there are known compatibility issues with certain Spark versions and table formats:
- spark2: orc, json
- spark3: json
- Check table
- Check data
$ spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.2.2
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_262)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sql("use data_table").show()
....
scala> spark.sql("show tables").show()
+----------+-------------+-----------+
| namespace| tableName|isTemporary|
+----------+-------------+-----------+
|data_table| avro_table| false|
|data_table| csv_table| false|
|data_table| json_table| false|
|data_table| orc_table| false|
|data_table|parquet_table| false|
|data_table| text_table| false|
+----------+-------------+-----------+
scala> spark.sql("SELECT col1, count(*) FROM data_table.text_table GROUP BY col1").show()
or
scala> spark.sql("SELECT col1, count(*) FROM data_table.avro_table GROUP BY col1").show()
or
scala> spark.sql("SELECT col1, count(*) FROM data_table.parquet_table GROUP BY col1").show()
or
scala> spark.sql("SELECT col1, count(*) FROM data_table.csv_table GROUP BY col1").show()
or
scala> spark.sql("SELECT col1, count(*) FROM data_table.orc_table GROUP BY col1").show()
or
scala> spark.sql("SELECT col1, count(*) FROM data_table.json_table GROUP BY col1").show()
.....
+----+--------+
|col1|count(1)|
+----+--------+
| c| 1|
| b| 1|
| a| 1|
+----+--------+
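The same check can also be written with the DataFrame API instead of a SQL string. This is a minimal sketch against the text_table created above; any of the other tables can be substituted.
scala> spark.table("data_table.text_table").groupBy("col1").count().show()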
Check data using Trino queries
Use the Trino CLI to check the contents of the data tables created with Hive.
- Check table
- Check data
$ trino --server http://$(hostname):8780
trino> show catalogs;
Catalog
---------
hive
system
(2 rows)
trino> show schemas in hive;
Schema
--------------------
...
default
information_schema
kbc_hive_test
....
(8 rows)
trino> show tables in hive.kbc_hive_test;
Table
------------------
datatype_avro
datatype_csv
datatype_json
datatype_orc
datatype_parquet
datatype_text
(6 rows)
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_avro group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_csv group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_json group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_orc group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_parquet group by col_bigint;
or
trino> SELECT col_bigint, count(*) FROM hive.kbc_hive_test.datatype_text group by col_bigint;
.....
col_bigint | _col1
------------+-------
1 | 6
(1 row)
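For repeated checks, the Trino CLI can preselect the catalog and schema, or run a single query non-interactively with --execute. This is a minimal sketch using the hive catalog and the kbc_hive_test schema from the example above.
$ trino --server http://$(hostname):8780 --catalog hive --schema kbc_hive_test
$ trino --server http://$(hostname):8780 --execute "SELECT count(*) FROM hive.kbc_hive_test.datatype_text"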
Step 4. Add table partitions
You can add partition information to tables integrated with Hadoop Eco using Hive queries.
Insert partition data
Create a partitioned table in Hive and use an INSERT statement to add partition data, then verify the partition information.
- Create partitioned table
- Insert temporary data
- Check partitioned data
# Create partitioned table
CREATE EXTERNAL TABLE text_table (
col1 string
) PARTITIONED BY (yymmdd STRING)
LOCATION 'swifta://kbc-test.kc/data_table/text_table';
# Insert temporary data
INSERT INTO TABLE text_table PARTITION (yymmdd)
SELECT 'a', '240101';
# Check partitioned data
hive (default)> show partitions text_table;
OK
partition
yymmdd=240101
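The INSERT above is a dynamic partition insert: the yymmdd partition value comes from the SELECT rather than being fixed in the PARTITION clause. Depending on the cluster's Hive configuration, such inserts can be rejected in strict mode; the settings below are the standard Hive way to allow them for the current session and are not specific to Data Catalog.
# Allow fully dynamic partition inserts in the current Hive session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;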
MSCK REPAIR
After creating directories in Object Storage that match the table's partition layout and creating an EXTERNAL table on top of them, you can use Hive's MSCK REPAIR command to register the partition information.
- Check Object Storage data
- Create Hive table
- Run MSCK command
- Check partitioned data
# Check Object Storage data
$ hadoop fs -ls swifta://kbc-test.kc/tables/orders/
Found 7 items
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1992
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1993
drwxrwxrwx - ubuntu ubuntu 0 1970-01-01 00:00 swifta://kbc-test.kc/tables/orders/year=1994
# Create Hive table
CREATE EXTERNAL TABLE orders (
col1 string
) PARTITIONED BY (year STRING)
LOCATION 'swifta://kbc-test.kc/tables/orders';
# Run MSCK command
MSCK REPAIR TABLE orders;
# Check partitioned data
hive (default)> show partitions orders;
OK
partition
year=1992
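As an alternative to MSCK REPAIR, individual partitions can be registered explicitly with ALTER TABLE. This is a minimal sketch for the orders table above, using the year=1992 directory shown in the Object Storage listing.
# Register a single partition without scanning the whole table location
ALTER TABLE orders ADD IF NOT EXISTS PARTITION (year='1992')
LOCATION 'swifta://kbc-test.kc/tables/orders/year=1992';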