Analysis of loaded web server logs using Hadoop Eco scheduling
This tutorial automates the periodic analysis of logs loaded into Object Storage by using the Hadoop Eco scheduling feature. You can easily set up the Hadoop cluster to run only when it is needed, which makes Hadoop resource utilization more efficient.
- Estimated time: 60 minutes
- User environment
- Operating system: macOS, Ubuntu
- Region: kr-central-2
- Reference documents
Before you start
To follow this hands-on exercise, you need to have your access key and a VM access key pair ready.
Step 1. Check the log file uploaded to Object Storage
Create the example log file required for this hands-on exercise by following the referenced tutorial, then check that the example log file was successfully uploaded to Object Storage.
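If you prefer to confirm the upload from a terminal instead of the console, one option is an S3-compatible client pointed at the Object Storage endpoint. The sketch below is only an illustration: it assumes the S3-compatible API is available for your project, uses the AWS CLI, and ${OBJECT_STORAGE_ENDPOINT} is a placeholder for your endpoint URL.
# Sketch: list the uploaded log objects with an S3-compatible client.
# ${OBJECT_STORAGE_ENDPOINT} is a placeholder; the bucket and directory follow this guide (hands-on/log/nginx).
aws s3 ls s3://hands-on/log/nginx/ --recursive --endpoint-url ${OBJECT_STORAGE_ENDPOINT}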
Step 2. Create Data Catalog Resource
Data Catalog is a fully managed service that helps you discover and efficiently manage organization and user data assets within KakaoCloud. In this step, create the Catalog, Database, and Table that make up the Data Catalog.
- Catalog is a fully managed central repository within the VPC. To use the Data Catalog service, first create a Catalog.

  | Item | Setting |
  |------|---------|
  | Name | hands_on |
  | VPC | ${any} |
  | Subnet | ${public} |
- When the status of the created Catalog becomes Running, create a Database. A database in Data Catalog is a container that stores tables.

  | Item | Setting |
  |------|---------|
  | Catalog | hands_on |
  | Name | hands_on_db |
  | Path: Bucket | hands-on |
  | Path: Directory | nginx |

- Create a table, which represents the metadata in the Data Catalog.
  - Origin data table

    | Item | Setting |
    |------|---------|
    | Database | hands_on_db |
    | Table name | handson_log_original |
    | Data storage path: Bucket name | hands-on |
    | Data storage path: Directory | log/nginx |
    | Data type | CSV |

  - Schema

    | Partition key | Column number | Field name | Data type |
    |---------------|---------------|------------|-----------|
    | off | 1 | log | string |
    | on | - | date_id | string |
    | on | - | hour_id | string |
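For reference, since date_id and hour_id are partition keys, the example logs are expected to sit under Hive-style key=value directories beneath log/nginx; the MSCK REPAIR TABLE statement in Step 3 discovers partitions from such paths. The layout below is only an illustration with made-up date, hour, and file names.
hands-on/log/nginx/date_id=2024-01-01/hour_id=00/access.log
hands-on/log/nginx/date_id=2024-01-01/hour_id=01/access.log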
Step 3. Create Hadoop Eco resource
Hadoop Eco is a KakaoCloud service for executing distributed processing tasks using open source frameworks. Here's how to create a Hadoop Eco resource:
- Select Hadoop Eco in the KakaoCloud Console. Click the [Create cluster] button and create a cluster with the settings below.

  | Item | Setting |
  |------|---------|
  | Cluster Name | hands-on |
  | Cluster Version | Hadoop Eco 2.0.0 |
  | Cluster Type | Core Hadoop |
  | Cluster Availability | Standard |
  | Administrator ID | ${ADMIN_ID} |
  | Administrator Password | ${ADMIN_PASSWORD} |
- Set the master node and worker node instances.

  | Classification | Master Node | Worker Node |
  |----------------|-------------|-------------|
  | Instance Count | 1 | 2 |
  | Instance Type | m2a.xlarge | m2a.xlarge |
  | Volume Size | 50GB | 100GB |

- Set the key pair and network configuration (VPC, Subnet) so that you can access the instances via SSH. Then click the [Create security group] button.
- Open the scheduling settings, select Hive, and enter the query below. The query cleans the default-format nginx logs, aggregates user requests per URL, and stores the results.
-- Set the database
USE hands_on_db;
-- Allow fully dynamic partition INSERTs (date_id is supplied dynamically below)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Repair the table
MSCK REPAIR TABLE handson_log_original;
-- Create empty refined_json table with JSON format using JsonSerDe
CREATE EXTERNAL TABLE IF NOT EXISTS refined_json (
remote_addr STRING,
request_method STRING,
request_url STRING,
status STRING,
request_time STRING,
day_time STRING)
PARTITIONED BY (date_id STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;
-- Populate refined_json table
INSERT INTO refined_json PARTITION (date_id)
SELECT
split(log, ' ')[0] AS remote_addr,
split(log, ' ')[5] AS request_method,
split(log, ' ')[6] AS request_url,
split(log, ' ')[8] AS status,
split(log, ' ')[9] AS request_time,
regexp_extract(split(log, ' ')[3], '\\d{2}:\\d{2}:\\d{2}', 0) AS day_time,
date_id
FROM
handson_log_original;
-- Create empty urlcount table with JSON format using JsonSerDe
CREATE EXTERNAL TABLE IF NOT EXISTS urlcount (
request_url STRING,
status STRING,
count BIGINT)
PARTITIONED BY (date_id STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;
-- Populate urlcount table
INSERT INTO urlcount PARTITION (date_id)
SELECT
request_url,
status,
count(*) AS count,
date_id
FROM
refined_json
GROUP BY
request_url,
status,
date_id;
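For reference, because the result tables use JsonSerDe with TEXTFILE, each aggregated row is written as one JSON line, and the date_id partition value appears in the output path rather than inside the record. The values below are illustrative only.
{"request_url": "/index.html", "status": "200", "count": 42}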
- Set up the cluster according to the settings below.

  | Item | Setting |
  |------|---------|
  | HDFS block size | 128 MB |
  | HDFS replica count | 2 |
  | Cluster configuration settings | Click [Enter directly] and enter the code below |

  Cluster configuration settings - Object Storage integration:

{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "fs.swifta.service.kic.credential.id": "${ACCESS_KEY}",
        "fs.swifta.service.kic.credential.secret": "${ACCESS_SECRET_KEY}"
      }
    }
  ]
}
- Set up the monitoring agent, Data Catalog service integration, and so on.

  | Item | Description |
  |------|-------------|
  | Install monitoring agent | Do not install |
  | Service integration | Data Catalog integration |
  | Data catalog name | Select the [hands_on] catalog created in Step 2 |

- Review the entered information and create the cluster.
Step 4. Hadoop Eco scheduling cron job setup
- Create a virtual machine to run the cron job.

  | Item | Setting |
  |------|---------|
  | Type | Virtual Machine |
  | Quantity | 1 |
  | Name | cron-vm |
  | Image | Ubuntu 20.04 |
  | Flavor | m2a.large |
  | Volume | 20 GB |

  info: The instance that runs the cron job sends requests to the external network, so set up a Security Group and network environment that allow it to send and receive traffic to and from the external network.
- Connect via SSH to the instance that will run the cron job. You can connect by attaching a public IP or by going through a bastion host.
ssh -i ${PRIVATE_KEY_FILE} ubuntu@${CRON_VM_ENDPOINT}
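If the connection is rejected because of key file permissions, restrict the private key first; this is a general SSH requirement rather than something specific to this guide.
# ssh refuses private keys that are readable by other users.
chmod 400 ${PRIVATE_KEY_FILE}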
- Install the jq package to handle JSON-formatted data easily in the shell.

sudo apt-get update -y
sudo apt-get install -y jq
- Connect to the instance via ssh, then create an environment variable file by referring to the table below. The cluster ID can be found in the cluster details.

cat << \EOF | sudo tee /tmp/env.sh
#!/bin/bash
export CLUSTER_ID="${CLUSTER_ID}"
export HADOOP_API_KEY="${HADOOP_API_KEY}"
export ACCESS_KEY="${ACCESS_KEY}"
export ACCESS_SECRET_KEY="${ACCESS_SECRET_KEY}"
EOF

  | Environment Variable Key | Environment Variable Value |
  |--------------------------|----------------------------|
  | ${CLUSTER_ID} | Cluster ID |
  | ${HADOOP_API_KEY} | Hadoop API Key |
  | ${ACCESS_KEY} | Access key |
  | ${ACCESS_SECRET_KEY} | Secret access key |

- Write a script that requests cluster creation using the information in the environment variable file. For more information about the Hadoop Cluster API, see Hadoop Eco API.
cat << \EOF | sudo tee /tmp/exec_hadoop.sh
#!/bin/bash
. /tmp/env.sh
curl -X POST "https://hadoop-eco.kr-central-1.kakaoi.io/hadoop-eco/v1/cluster/${CLUSTER_ID}" \
-H "Hadoop-Eco-Api-Key:${HADOOP_API_KEY}" \
-H "Credential-ID:${ACCESS_KEY}" \
-H "Credential-Secret:${ACCESS_SECRET_KEY}" \
-H "Content-Type: application/json"
EOF
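The jq package installed earlier can be used to make the response easier to read. This is only a formatting aid; no particular response fields of the Hadoop Eco API are assumed.
# Optional: run the request script and pretty-print whatever JSON the API returns.
bash /tmp/exec_hadoop.sh | jq .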
- Use the cron package to automatically run the script at regular intervals.
sudo apt update -y
sudo apt install -y cron

- Register a crontab entry that runs the Hadoop cluster creation script every day at midnight.
cat << EOF > tmp_crontab
0 0 * * * /bin/bash /tmp/exec_hadoop.sh
EOF
sudo crontab tmp_crontab
rm tmp_crontab
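If you also want to keep the script output for troubleshooting, an optional variant of the crontab entry redirects it to a log file; the log path below is only an example.
# Optional variant: append stdout and stderr of the nightly run to a log file.
0 0 * * * /bin/bash /tmp/exec_hadoop.sh >> /tmp/exec_hadoop.log 2>&1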
- Check if the cron job is registered.
sudo crontab -l
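If registration succeeded, the listing should include the entry created above:
0 0 * * * /bin/bash /tmp/exec_hadoop.sh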
- Run the command below and check that the cluster is created.
bash /tmp/exec_hadoop.sh
- Check the cluster's job results. The result logs are stored in the Object Storage bucket that was configured when the Hadoop cluster was created.

- Check the analysis results saved in the bucket and log directory selected in Object Storage.