
Automate web server log analysis using Hadoop Eco scheduling

This document introduces how to use Hadoop Eco's scheduling feature to automate periodic analysis of log files uploaded to Object Storage.

info

About this scenario

In this scenario, you will learn how to use Hadoop Eco's distributed processing and scheduling capabilities to analyze web server logs periodically stored in Object Storage. This automated process will help efficiently manage and process log data, allowing users to handle recurring analysis tasks seamlessly.

Getting started

This tutorial covers the entire process, including checking uploaded log files in Object Storage, processing data with Hadoop Eco, and automating tasks using scheduling. By the end, you'll understand how to store, manage, process, and automate periodic log analysis tasks.

Step 1. Check the log file uploaded to Object Storage

Create the sample log files by following the previous hands-on tutorial, then verify that the example log files were successfully uploaded to Object Storage.

Example log file upload
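
If you prefer checking from a terminal instead of the console, the sketch below lists the uploaded objects. It assumes KakaoCloud Object Storage is reachable through an S3-compatible endpoint and that the AWS CLI is configured with credentials that can read the hands-on bucket; the endpoint URL and profile name are placeholders, so verify them against the Object Storage documentation before running.

List uploaded log files (optional sketch)
  # Sketch: list the nginx log objects under the hands-on bucket through an
  # S3-compatible endpoint. ${OBJECT_STORAGE_ENDPOINT} and ${OBJECT_STORAGE_PROFILE}
  # are assumptions for this example, not values from the tutorial.
  aws s3 ls "s3://hands-on/log/nginx/" --recursive \
    --endpoint-url "${OBJECT_STORAGE_ENDPOINT}" \
    --profile "${OBJECT_STORAGE_PROFILE}"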

Step 2. Create Data Catalog resource

Data Catalog is a fully managed service in KakaoCloud that helps you understand and efficiently manage your data assets. In this step, you'll create a Catalog, a Database, and a Table.

  1. Create a Catalog, which acts as a fully managed central repository within your VPC.

    Item | Value
    Name | hands_on
    VPC | ${any}
    Subnet | ${public}
  2. Once the Catalog status changes to Running, create a Database. A database acts as a container for tables.

    Item | Value
    Catalog | hands_on
    Name | hands_on_db
    Path: Bucket | hands-on
    Path: Directory | nginx
  3. Create the metadata table for the uploaded log data.

    Item | Value
    Database | hands_on_db
    Table name | handson_log_original
    Path: Bucket | hands-on
    Path: Directory | log/nginx
    Data type | CSV

Step 3. Create Hadoop Eco resource

Hadoop Eco is a KakaoCloud service for executing distributed processing tasks using an open-source framework.

  1. Go to KakaoCloud Console > Analytics > Hadoop Eco, select [Create cluster], and configure the cluster with the following details:

    Item | Value
    Cluster name | hands-on
    Cluster version | Hadoop Eco 2.0.0
    Cluster type | Core Hadoop
    Cluster availability | Standard
    Admin ID | ${ADMIN_ID}
    Admin password | ${ADMIN_PASSWORD}
  2. Configure Master node and Worker nodes.

    Configuration | Master node | Worker node
    Number of instances | 1 | 2
    Instance type | m2a.xlarge | m2a.xlarge
    Volume size | 50 GB | 100 GB
  3. Set up key pairs and network configurations for SSH access. Select Create new security group.

    Network configuration and security group setup

  4. Open the scheduling settings, select Hive, and enter the following query (an optional check of the space-separated field indices it relies on is sketched right after this list):

    Refine and aggregate log data
    -- Set the database
    USE hands_on_db;

    -- The INSERT statements below use only dynamic partition columns; depending
    -- on the cluster's Hive defaults, these settings may be required.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Repair the table so new partitions in Object Storage are registered
    MSCK REPAIR TABLE handson_log_original;

    -- Create refined JSON table
    CREATE EXTERNAL TABLE IF NOT EXISTS refined_json (
        remote_addr    STRING,
        request_method STRING,
        request_url    STRING,
        status         STRING,
        request_time   STRING,
        day_time       STRING)
    PARTITIONED BY (date_id STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
    STORED AS TEXTFILE;

    -- Populate refined_json table from the raw log table
    INSERT INTO refined_json PARTITION (date_id)
    SELECT
        split(log, ' ')[0] AS remote_addr,
        split(log, ' ')[5] AS request_method,
        split(log, ' ')[6] AS request_url,
        split(log, ' ')[8] AS status,
        split(log, ' ')[9] AS request_time,
        regexp_extract(split(log, ' ')[3], '\\d{2}:\\d{2}:\\d{2}', 0) AS day_time,
        date_id
    FROM handson_log_original;

    -- Create urlcount table
    CREATE EXTERNAL TABLE IF NOT EXISTS urlcount (
        request_url STRING,
        status      STRING,
        count       BIGINT)
    PARTITIONED BY (date_id STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
    STORED AS TEXTFILE;

    -- Populate urlcount table with per-URL, per-status request counts
    INSERT INTO urlcount PARTITION (date_id)
    SELECT request_url, status, count(*) AS count, date_id
    FROM refined_json
    GROUP BY request_url, status, date_id;
  5. Configure Cluster settings:

    Item | Value
    HDFS block size (MB) | 128
    HDFS replication | 2
  6. Verify all settings and create the cluster.
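
The Hive query in this step picks fields out of each raw log line with split(log, ' ') and fixed indices (0, 3, 5, 6, 8, 9), which assumes the log format produced in the earlier tutorial. Before relying on the schedule, you can optionally confirm the mapping against one of your own log lines with the sketch below; sample_access.log is a hypothetical local copy of an uploaded log file.

Check field indices against a sample log line (optional sketch)
  # Print each whitespace-separated field of the first log line together with its
  # zero-based index, matching the indexing used by Hive's split().
  head -n 1 sample_access.log | awk '{ for (i = 1; i <= NF; i++) printf "[%d] %s\n", i - 1, $i }'

If an index does not point at the column you expect, adjust the SELECT expressions in the query accordingly.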

Step 4. Configure Hadoop Eco scheduling cron job

  1. Create a virtual machine for running the cron job.

    Item | Value
    Type | Virtual Machine
    Quantity | 1
    Name | cron-vm
    Image | Ubuntu 20.04
    Flavor | m2a.large
    Volume | 20 GB
    info

    The virtual machine executing the cron job must communicate with external networks. Ensure the security group and network configuration allow outbound requests.

  2. SSH into the virtual machine. Use a public IP or Bastion host to establish a connection.

    Connect via SSH
    ssh -i ${PRIVATE_KEY_FILE} ubuntu@${CRON_VM_ENDPOINT}
  3. Install jq for easy JSON data processing.

    Install jq
    sudo apt-get update -y
    sudo apt-get install -y jq
  4. Create an environment variable file to store cluster-related configurations. Replace placeholder values with your specific settings.

    Create environment variable file
    cat << \EOF | sudo tee /tmp/env.sh
    #!/bin/bash
    export CLUSTER_ID="${CLUSTER_ID}"
    export HADOOP_API_KEY="${HADOOP_API_KEY}"
    export ACCESS_KEY="${ACCESS_KEY}"
    export ACCESS_SECRET_KEY="${ACCESS_SECRET_KEY}"
    EOF
    Environment variable key | Value
    ${CLUSTER_ID} | Cluster ID
    ${HADOOP_API_KEY} | Hadoop API key
    ${ACCESS_KEY} | Access key
    ${ACCESS_SECRET_KEY} | Access secret key
  5. Create a script to request cluster creation using the environment variables (a variant that also logs the API response with jq is sketched at the end of this tutorial).

    Create Hadoop API script
    cat << \EOF | sudo tee /tmp/exec_hadoop.sh
    #!/bin/bash

    . /tmp/env.sh

    curl -X POST "https://hadoop-eco.kr-central-1.kakaoi.io/hadoop-eco/v1/cluster/${CLUSTER_ID}" \
    -H "Hadoop-Eco-Api-Key:${HADOOP_API_KEY}" \
    -H "Credential-ID:${ACCESS_KEY}" \
    -H "Credential-Secret:${ACCESS_SECRET_KEY}" \
    -H "Content-Type: application/json"
    EOF
  6. Install cron to enable periodic script execution.

    Install cron
    sudo apt update -y
    sudo apt install -y cron
  7. Configure a cron job to run the Hadoop script daily at midnight (a variant of the crontab entry that redirects output to a log file is shown at the end of this tutorial).

    Set cron job
    cat << EOF > tmp_crontab
    0 0 * * * /bin/bash /tmp/exec_hadoop.sh
    EOF
    sudo crontab tmp_crontab
    rm tmp_crontab
  8. Verify the cron job registration.

    List cron jobs
    sudo crontab -l
  9. Manually execute the Hadoop script to ensure it works as expected.

    Test Hadoop script
    bash /tmp/exec_hadoop.sh

    Verify cluster creation

  10. Check the results of the cluster tasks. Logs are stored in the Object Storage bucket specified during cluster creation.

    Verify cluster operation result

  11. Verify stored results in the selected bucket and log directory in Object Storage.

    Verify bucket and log directory operation result
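
As an optional refinement of the script from step 5, the sketch below also records each API response so that scheduled runs leave a trace. It uses the jq installed in step 3 to pretty-print the response; the log file path /tmp/exec_hadoop.log is an assumption for this example, not part of the original tutorial.

Variant of the Hadoop API script with response logging (sketch)
  cat << \EOF | sudo tee /tmp/exec_hadoop.sh
  #!/bin/bash

  . /tmp/env.sh

  # Call the Hadoop Eco API and capture the raw response body.
  RESPONSE=$(curl -s -X POST "https://hadoop-eco.kr-central-1.kakaoi.io/hadoop-eco/v1/cluster/${CLUSTER_ID}" \
    -H "Hadoop-Eco-Api-Key:${HADOOP_API_KEY}" \
    -H "Credential-ID:${ACCESS_KEY}" \
    -H "Credential-Secret:${ACCESS_SECRET_KEY}" \
    -H "Content-Type: application/json")

  # Append a timestamp and the pretty-printed response to a log file;
  # fall back to the raw text if the response is not valid JSON.
  echo "[$(date '+%Y-%m-%d %H:%M:%S')]" >> /tmp/exec_hadoop.log
  echo "${RESPONSE}" | jq . >> /tmp/exec_hadoop.log 2>/dev/null || echo "${RESPONSE}" >> /tmp/exec_hadoop.log
  EOF

After recreating the script, run bash /tmp/exec_hadoop.sh once and inspect /tmp/exec_hadoop.log to confirm that the request succeeded.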
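
By default, cron tries to mail a job's output and discards it when no mail agent is configured, so a failing scheduled script can go unnoticed. If you want the cron run from step 7 to leave its own trace independent of the script, the crontab entry can redirect its output, as in the sketch below; the log path is again an assumption.

Cron entry with output redirection (sketch)
  cat << EOF > tmp_crontab
  0 0 * * * /bin/bash /tmp/exec_hadoop.sh >> /tmp/exec_hadoop_cron.log 2>&1
  EOF
  sudo crontab tmp_crontab
  rm tmp_crontab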