Web server log analysis using Hadoop Eco

Using Hadoop Eco, you can easily build a web server log analysis environment on KakaoCloud. This document provides a hands-on tutorial linking Hadoop Eco, Object Storage, and Data Catalog.

Basic information
  • Estimated time: 60 minutes
  • User Environment
    • Recommended OS: any
    • Region: kr-central-2

Prerequisites

To proceed with this tutorial, you need an access key and a key pair.

Step 1. Upload log file to Object Storage

Upload the web server log to be used for analysis to Object Storage.

  1. The sample log file uses the default Nginx access log format. Open a local terminal and run the command below to save the sample log file to your local Downloads folder.

    cat << EOF > ~/Downloads/access.log
    172.16.0.174 - - [02/Mar/2023:03:04:05 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:04:07 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:04:30 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:48:54 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:48:57 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:48:59 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:49:34 +0000] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15"
    EOF
  2. Upload the log file to Object Storage. Create a bucket and directory as shown in the table below, then upload the sample log file saved in your local Downloads folder.

    Item          Value
    Bucket name   hands-on
    Directory     /nginx/ori/date_id=2023-02-01/hour_id=00
    File name     access.log
  3. Verify that the example log file has been uploaded to Object Storage.

    Upload log file

Step 2. Create Data Catalog resource

Data Catalog is a fully managed service that helps you identify and efficiently manage organization and user data assets within KakaoCloud. It consists of Catalog, Database, and Table.

  1. A catalog is a fully managed central repository within a VPC and must be created before using Data Catalog.

    Item     Value
    Name     hands_on
    VPC      ${any}
    Subnet   ${public}
  2. When the status of the created catalog becomes Running, create a database. The database of the Data Catalog is a container that stores tables.

    Item              Value
    Catalog           hands_on
    Name              hands_on_db
    Path: Bucket      hands-on
    Path: Directory   nginx
  3. Create the origin, refined, and result data tables, which are the table metadata of the Data Catalog. The table below shows the settings for the origin data table; an approximate HiveQL sketch of all three tables follows it.

    Item                             Value
    Database                         hands_on_db
    Table name                       handson_table_original
    Data storage path: Bucket name   hands-on
    Data storage path: Directory     nginx/ori
    Data type                        CSV

    Schema
    Partition key   Column number   Field name   Data type
    off             1               log          string
    on              -               date_id      string
    on              -               hour_id      string
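
    The tables themselves are created in the console. For reference, the HiveQL below is a rough sketch of what the three tables correspond to: the refined and result table names and columns follow the queries used later in this tutorial, while the exact DDL generated by Data Catalog (SerDe, table properties) and the refined table's directory are assumptions.

    Approximate HiveQL for the three tables (illustrative)
    -- Origin table: one raw Nginx log line per row, partitioned by date and hour
    CREATE EXTERNAL TABLE handson_table_original (
      log string
    )
    PARTITIONED BY (date_id string, hour_id string)
    LOCATION 'swifta://hands-on.kic/nginx/ori';

    -- Refined table: parsed fields stored as ORC
    -- (the nginx/orc directory is an assumption; use the directory configured in Data Catalog)
    CREATE EXTERNAL TABLE handson_table_orc (
      remote_addr    string,
      date_time      string,
      request_method string,
      request_url    string,
      status         string,
      request_time   string
    )
    PARTITIONED BY (date_id string, hour_id string)
    STORED AS ORC
    LOCATION 'swifta://hands-on.kic/nginx/orc';

    -- Result table: aggregated counts written by Spark as gzip-compressed JSON
    CREATE EXTERNAL TABLE handson_table_request_url_count (
      request_url string,
      status      string,
      `count`     bigint
    )
    PARTITIONED BY (date_id string, hour_id string)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 'swifta://hands-on.kic/nginx/request_url_count';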

Step 3. Create Hadoop Eco resource

Hadoop Eco executes distributed processing tasks using open source frameworks.

  1. Go to KakaoCloud Console > Analytics > Hadoop Eco, click the [Create cluster] button, and create a cluster with the settings below.

    Item                     Value
    Cluster name             hands-on
    Cluster version          Hadoop Eco 2.0.0
    Cluster type             Core Hadoop
    Cluster availability     Standard
    Administrator ID         ${ADMIN_ID}
    Administrator password   ${ADMIN_PASSWORD}
  2. Set up the master node and worker node instances.

    • Set the key pair and network configuration (VPC, Subnet) to an environment you can reach over SSH, then create a new security group.

    Category              Master node   Worker node
    Number of instances   1             2
    Instance type         m2a.2xlarge   m2a.2xlarge
    Volume size           50GB          100GB
  3. Set up the cluster.

    Item                             Setting
    Task scheduling settings         Not selected
    HDFS block size                  128
    HDFS replication count           2
    Cluster configuration settings   Select [Direct Input] and enter the code below

    Cluster Configuration Settings - Object Storage Integration
    {
      "configurations": [
        {
          "classification": "core-site",
          "properties": {
            "fs.swifta.service.kic.credential.id": "${ACCESS_KEY}",
            "fs.swifta.service.kic.credential.secret": "${ACCESS_SECRET_KEY}"
          }
        }
      ]
    }
  4. Set up service linkage.

    Item                       Setting value
    Install monitoring agent   Do not install
    Service linkage            Data Catalog linkage
    Data Catalog name          Select [hands_on] created in Step 2
  5. Check the entered information and create a cluster.

Step 4. Extract original data and write to refined table

  1. Connect to the master node of the created Hadoop cluster using ssh.

    Connect to master node
    ssh -i ${PRIVATE_KEY_FILE} ubuntu@${HADOOP_MST_NODE_ENDPOINT}
    caution

    The created master node has only a private IP and cannot be reached from a public network. Connect by attaching a public IP or by going through a bastion host.

  2. Use Apache Hive for data extraction. Apache Hive facilitates reading, writing, and managing large data sets residing in distributed storage using SQL.

    Apache Hive
    hive
  3. Set the database to be worked on to the database created in the Data Catalog creation step.

    Database setup
    use hands_on_db;
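    -- Optional check (illustrative): with the Data Catalog linkage working,
    -- the tables created in Step 2 should be listed here.
    show tables;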
  4. Add the partitions of the original log table. You can then confirm the added partition information on the Data Catalog page of the console.

    Add original log table partition
    msck repair table handson_table_original; 
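    -- Optional check (illustrative): list the partitions Hive now recognizes
    show partitions handson_table_original;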

    Add original log table partition

  5. Execute the SQL command below in the Hive CLI to extract the original log data and write it to the refined table.

    Execute SQL command
    INSERT INTO TABLE handson_table_orc PARTITION(date_id, hour_id)
    SELECT remote_addr,
           -- 'dd' (day of month) is the correct pattern for entries like [02/Mar/2023:03:04:05
           from_unixtime(unix_timestamp(time_local, '[dd/MMM/yyyy:HH:mm:ss')) AS date_time,
           request_method,
           request_url,
           status,
           request_time,
           date_id,
           hour_id
    FROM (
        SELECT split(log, " ")[0] AS remote_addr,
               split(log, " ")[3] AS time_local,
               split(log, " ")[5] AS request_method,
               split(log, " ")[6] AS request_url,
               split(log, " ")[8] AS status,
               split(log, " ")[9] AS request_time,
               date_id,
               hour_id
        FROM handson_table_original
    ) R;
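    -- Optional check (illustrative): preview a few parsed rows from the refined table
    SELECT remote_addr, date_time, request_url, status
    FROM handson_table_orc
    WHERE date_id = '2023-02-01'
    LIMIT 10;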
  6. Exit the Hive CLI.

    End the job
    exit;

Step 5. Extract required data and create table

  1. Run Spark shell.

    Execute
    spark-shell
  2. Use Spark to aggregate the data written to the refined table, and write the counts per request_url and status to the result table location in JSON format.

    Calculate the written data
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
    spark.sql("use hands_on_db").show()
    spark.sql("SELECT request_url, status, count(*) as count FROM handson_table_orc GROUP BY request_url, status").write.format("json").option("compression", "gzip").save("swifta://hands-on.kic/nginx/request_url_count/date_id=2023-02-01/hour_id=01")

Step 6. Check results using Hue

Hue (Hadoop User Experience) is a web-based user interface for Apache Hadoop clusters. It provides easy access to Hadoop data and integrates with various components of the Hadoop ecosystem.

  1. Access the Hue page by opening port 8888 on the master node of the Hadoop cluster in a browser. If the connection succeeds, log in with the administrator ID and password set when creating the Hadoop cluster.

    Hue Access
    open http://${HADOOP_MST_NODE_ENDPOINT}:8888
    caution

    The created node has only a private IP and cannot be reached from a public network. Connect by attaching a public IP or by going through a bastion host.

  2. After accessing the page, you can run a Hive query. Set the database to be worked on to the database created in the Data Catalog creation step.

    Database setup
    use hands_on_db;
  3. Add a partition to the result table by executing the following command.

    Execute Hive query
    msck repair table handson_table_request_url_count;
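    -- Optional check (illustrative): confirm the partition was registered
    show partitions handson_table_request_url_count;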
  4. Access the Data Catalog console and check the addition of the partition to the result table.

  5. Query the data saved in the result table. First, add a JAR file so Hive can process the JSON data.

    Add library for JSON data processing
    add jar /opt/hive/lib/hive-hcatalog-core-3.1.3.jar;

    View saved data
    select * from handson_table_request_url_count order by count limit 10;
  6. You can view the query results as a graph on the Hue page.

    Check results