Web server log analysis using Hadoop Eco
Using Hadoop Eco, you can easily build a web server log analysis environment on KakaoCloud. This document provides a hands-on tutorial linking Hadoop Eco, Object Storage, and Data Catalog.
- Estimated time: 60 minutes
- User Environment
- Recommended OS: any
- Region: kr-central-2
Prerequisites
To follow this tutorial, you need an access key and a key pair.
Step 1. Upload log file to Object Storage
Upload the web server log to be used for analysis to Object Storage.
- The sample log file uses the default log format of the Nginx web server. Open a local terminal and run the command below to save the sample log file to your Downloads folder. An optional command for previewing the log fields follows the block.
```bash title="Create sample log file"
cat << EOF > ~/Downloads/access.log
172.16.0.174 - - [02/Mar/2023:03:04:05 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
172.16.0.174 - - [02/Mar/2023:03:04:07 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
172.16.0.174 - - [02/Mar/2023:03:04:30 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
172.16.0.174 - - [02/Mar/2023:03:48:54 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
172.16.0.174 - - [02/Mar/2023:03:48:57 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
172.16.0.174 - - [02/Mar/2023:03:48:59 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
172.16.0.174 - - [02/Mar/2023:03:49:34 +0000] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15"
EOF
```
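Before uploading, you can optionally preview how the space-separated fields of each log line map to the columns extracted later in Step 4 (the `split(log, " ")` indices 0, 3, 5, 6, 8, and 9). This is a local sanity check only and is not required for the tutorial.

```bash title="Optional: preview the fields used later in Step 4"
# awk fields $1, $4, $6, $7, $9, $10 correspond to split(log, " ") indices
# 0, 3, 5, 6, 8, and 9: remote_addr, time_local, request_method, request_url,
# status, and the final field this tutorial stores as request_time.
awk '{print $1, $4, $6, $7, $9, $10}' ~/Downloads/access.log | head -n 3
```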
- Upload the log file to Object Storage. Using the values in the table below, create a bucket and directory, then upload the sample log file saved in your local Downloads folder.

| Item | Value |
| --- | --- |
| Bucket name | hands-on |
| Directory | /nginx/ori/date_id=2023-02-01/hour_id=00 |
| File name | access.log |
- Verify that the sample log file has been uploaded to Object Storage. (A command-line alternative for the upload and check is sketched below.)
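If you prefer the command line over the console, the same upload and check can be done with an S3-compatible client, provided your project has S3-compatible credentials for Object Storage. The `${OBJECT_STORAGE_ENDPOINT}` variable and the use of the AWS CLI here are illustrative assumptions; the console flow above is sufficient.

```bash title="Optional: upload via an S3-compatible CLI (illustrative)"
# Assumes ${OBJECT_STORAGE_ENDPOINT} points at the S3-compatible endpoint of
# KakaoCloud Object Storage and the AWS CLI is configured with matching credentials.
aws s3 cp ~/Downloads/access.log \
  "s3://hands-on/nginx/ori/date_id=2023-02-01/hour_id=00/access.log" \
  --endpoint-url "${OBJECT_STORAGE_ENDPOINT}"

# Confirm the object is in place.
aws s3 ls "s3://hands-on/nginx/ori/date_id=2023-02-01/hour_id=00/" \
  --endpoint-url "${OBJECT_STORAGE_ENDPOINT}"
```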
Step 2. Create Data Catalog resource
Data Catalog is a fully managed service that helps you discover and efficiently manage organizational and user data assets within KakaoCloud. It consists of catalogs, databases, and tables.
- A catalog is a fully managed central repository within a VPC and must be created before the Data Catalog can be used. Create one with the values below.

| Item | Value |
| --- | --- |
| Name | hands_on |
| VPC | ${any} |
| Subnet | ${public} |
- When the status of the created catalog becomes `Running`, create a database. A Data Catalog database is a container that stores tables.

| Item | Value |
| --- | --- |
| Catalog | hands_on |
| Name | hands_on_db |
| Path: Bucket | hands-on |
| Path: Directory | nginx |
- Create the Original Data Table, Refined Data Table, and Result Data Table, which hold the table metadata in the Data Catalog. Use the values below for each table; an approximate Hive DDL sketch for the original-data table follows this list.

Original data table

| Item | Value |
| --- | --- |
| Database | hands_on_db |
| Table name | handson_table_original |
| Data storage path: Bucket name | hands-on |
| Data storage path: Directory | nginx/ori |
| Data type | CSV |

Schema

| Partition key | Column number | Field name | Data type |
| --- | --- | --- | --- |
| off | 1 | log | string |
| on | - | date_id | string |
| on | - | hour_id | string |

Refined data table

| Item | Value |
| --- | --- |
| Database | hands_on_db |
| Table name | handson_table_orc |
| Data storage path: Bucket name | hands-on |
| Data storage path: Directory | nginx/orc |
| Data type | ORC |

Schema

| Partition key | Column number | Field name | Data type |
| --- | --- | --- | --- |
| off | 1 | remote_addr | string |
| off | 2 | date_time | string |
| off | 3 | request_method | string |
| off | 4 | request_url | string |
| off | 5 | status | string |
| off | 6 | request_time | string |
| on | - | date_id | string |
| on | - | hour_id | string |

Result data table

| Item | Value |
| --- | --- |
| Database | hands_on_db |
| Table name | handson_table_request_url_count |
| Data storage path: Bucket name | hands-on |
| Data storage path: Directory | nginx/request_url_count |
| Data type | JSON |

Schema

| Partition key | Column number | Field name | Data type |
| --- | --- | --- | --- |
| off | 1 | request_url | string |
| off | 2 | status | string |
| off | 3 | count | int |
| on | - | date_id | string |
| on | - | hour_id | string |
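For reference, the original-data table defined above corresponds roughly to the Hive DDL below. This is only a sketch showing how the single `log` column and the `date_id`/`hour_id` partition keys map to the bucket layout from Step 1; in this tutorial the table is created through the Data Catalog console, not with DDL, and the CSV data type is approximated here as a plain text table.

```bash title="Reference: approximate DDL for the original-data table (illustrative)"
# Illustrative only; run from a Hive-enabled node if you want to compare.
# Column, partition, and path names match the values entered in the console above.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS hands_on_db.handson_table_original (
  log STRING
)
PARTITIONED BY (date_id STRING, hour_id STRING)
STORED AS TEXTFILE
LOCATION 'swifta://hands-on.kic/nginx/ori/';
"
```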
Step 3. Create Hadoop Eco resource
Hadoop Eco executes distributed processing tasks using open source frameworks.
- Go to KakaoCloud Console > Analytics > Hadoop Eco. Click the [Create cluster] button and create a cluster with the values below.

| Item | Value |
| --- | --- |
| Cluster name | hands-on |
| Cluster version | Hadoop Eco 2.0.0 |
| Cluster type | Core Hadoop |
| Cluster availability | Standard |
| Administrator ID | ${ADMIN_ID} |
| Administrator password | ${ADMIN_PASSWORD} |
- Configure the master node and worker node instances. Set the key pair and network (VPC, subnet) so that you can reach the cluster over SSH from your environment, and create a new security group.

| Category | Master node | Worker node |
| --- | --- | --- |
| Number of instances | 1 | 2 |
| Instance type | m2a.2xlarge | m2a.2xlarge |
| Volume size | 50 GB | 100 GB |
- Configure the cluster.

| Item | Setting |
| --- | --- |
| Task scheduling settings | Not selected |
| HDFS block size | 128 |
| HDFS replication count | 2 |
| Cluster configuration settings | Select [Direct input] and enter the code below |

```json title="Cluster configuration settings - Object Storage integration"
{
  "configurations": [
    {
      "classification": "core-site",
      "properties": {
        "fs.swifta.service.kic.credential.id": "${ACCESS_KEY}",
        "fs.swifta.service.kic.credential.secret": "${ACCESS_SECRET_KEY}"
      }
    }
  ]
}
```
- Configure the service integration.

| Item | Setting |
| --- | --- |
| Install monitoring agent | Do not install |
| Service integration | Data Catalog integration |
| Data Catalog name | Select the [hands_on] catalog created in Step 2 |
- Check the entered information and create the cluster.
Step 4. Extract original data and write to refined table
- Connect to the master node of the created Hadoop cluster over SSH. (An optional check of the Object Storage integration follows below.)

```bash title="Connect to master node"
ssh -i ${PRIVATE_KEY_FILE} ubuntu@${HADOOP_MST_NODE_ENDPOINT}
```

Caution: the master node is created with a private IP only and cannot be reached from the public network. Attach a public IP to it or connect through a bastion host.
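Once connected, you can optionally confirm that the Object Storage integration configured in Step 3 works before running any jobs. This check assumes the swifta credentials entered in the cluster configuration were applied and that the bucket path from Step 1 exists.

```bash title="Optional: verify Object Storage access from the master node"
# Lists the uploaded log file through the swifta connector configured in Step 3.
hdfs dfs -ls swifta://hands-on.kic/nginx/ori/date_id=2023-02-01/hour_id=00/
```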
- Use Apache Hive for data extraction. Apache Hive makes it easy to read, write, and manage large datasets residing in distributed storage using SQL. Start the Hive CLI.

```bash title="Run Apache Hive"
hive
```
- Switch to the database created in the Data Catalog step.

```sql title="Set database"
use hands_on_db;
```
- Add the partitions of the original log table, then confirm on the console Data Catalog page that the partition information for the original data table has been added.

```sql title="Add original log table partitions"
msck repair table handson_table_original;
```
- Run the SQL below in the Hive CLI to extract the original log data and write it to the refined table.

```sql title="Run SQL command"
INSERT INTO TABLE handson_table_orc PARTITION(date_id, hour_id)
SELECT remote_addr,
       from_unixtime(unix_timestamp(time_local, '[dd/MMM/yyyy:HH:mm:ss')) as date_time,
       request_method,
       request_url,
       status,
       request_time,
       date_id,
       hour_id
FROM (
    SELECT split(log, " ")[0] AS remote_addr,
           split(log, " ")[3] AS time_local,
           split(log, " ")[5] AS request_method,
           split(log, " ")[6] AS request_url,
           split(log, " ")[8] AS status,
           split(log, " ")[9] AS request_time,
           date_id,
           hour_id
    FROM handson_table_original
) R;
```
- Exit the Hive CLI. (An optional check of the refined table follows below.)

```sql title="End the job"
exit;
```
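As an optional check that is not part of the original steps, you can confirm from the master node shell that the refined table now has partitions and rows, for example with `hive -e`.

```bash title="Optional: verify the refined table"
# Non-interactive checks run from the master node shell.
hive -e "USE hands_on_db; SHOW PARTITIONS handson_table_orc;"
hive -e "USE hands_on_db; SELECT * FROM handson_table_orc LIMIT 5;"
```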
Step 5. Extract required data and create table
- Start the Spark shell.

```bash title="Run spark-shell"
spark-shell
```
- Use Spark to process the data written to the refined table. Write the counts per `request_url` and `status` to the result table in `json` format. (A quick way to check the output files follows below.)

```scala title="Aggregate the refined data"
spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
spark.sql("use hands_on_db").show()
spark.sql("SELECT request_url, status, count(*) as count FROM handson_table_orc GROUP BY request_url, status").write.format("json").option("compression", "gzip").save("swifta://hands-on.kic/nginx/request_url_count/date_id=2023-02-01/hour_id=01")
```
Step 6. Check results using Hue
Hue (Hadoop User Experience) is a web-based user interface used with Apache Hadoop clusters. Hue provides easy access to Hadoop data and integrates with many components of the Hadoop ecosystem.
- Access the Hue page by browsing to port `8888` on the master node of the Hadoop cluster. Once connected, log in with the administrator ID and password set when you created the Hadoop cluster.

```bash title="Access Hue"
open http://${HADOOP_MST_NODE_ENDPOINT}:8888
```

Caution: the node is created with a private IP only and cannot be reached from the public network. Use a public IP or a bastion host to connect. (One possible SSH tunnel approach is sketched below.)
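One common way to reach the Hue port from a local machine, assuming you can already SSH to the master node directly or through a bastion, is an SSH local port forward. The variable names below match the earlier SSH step; this is an illustrative alternative, not a required part of the tutorial.

```bash title="Optional: reach Hue through an SSH tunnel"
# Forwards local port 8888 to the master node's Hue port; keep the tunnel open
# and browse to http://localhost:8888.
ssh -i ${PRIVATE_KEY_FILE} -N -L 8888:localhost:8888 ubuntu@${HADOOP_MST_NODE_ENDPOINT}
```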
- After accessing the page, you can run Hive queries. Switch to the database created in the Data Catalog step.

```sql title="Set database"
use hands_on_db;
```
- Add a partition to the result table by running the following command.

```sql title="Run Hive query"
msck repair table handson_table_request_url_count;
```
- Open the Data Catalog console and confirm that the partition has been added to the result table.
- Query the data saved in the result table. First add the `jar` file needed to process `json` data, then run the query.

```sql title="Add library for JSON data processing"
add jar /opt/hive/lib/hive-hcatalog-core-3.1.3.jar;
```

```sql title="View saved data"
select * from handson_table_request_url_count order by count limit 10;
```
- You can view the query results as a graph on the Hue page.