Analysis of loaded web server logs using Hadoop Eco scheduling

Automate the periodic analysis of logs loaded into Object Storage by using the Hadoop Eco scheduling function. You can easily set up Hadoop cluster execution for exactly the situations you need, making efficient use of Hadoop resources.

Basic information

Before you start

To proceed with this hands-on exercise, check that you have an access key and a VM key pair.

Step 1. Check the log file uploaded to Object Storage

Create the example log file required for this hands-on exercise by following the referenced tutorial first, and then check that the example log file was successfully uploaded to Object Storage. A sample log line is shown below for reference.

Upload sample log file
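
The scheduled query in Step 3 splits each log line on spaces and picks fields by position, so the uploaded file is expected to be a space-separated nginx access log. The line below is a hypothetical example used only for illustration; the exact format is produced by the upload tutorial. The comment shows which whitespace-separated indexes the query reads.

    # Hypothetical example line; indexes used by the Step 3 query:
    #   [0] remote_addr  [3] timestamp  [5] request_method  [6] request_url  [8] status  [9] request_time
    172.16.0.10 - - [01/Jan/2024:00:00:01 +0900] "GET /index.html HTTP/1.1" 200 0.005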

Step 2. Create Data Catalog Resource

Data Catalog is a fully managed service that helps you discover and efficiently manage organizational and user data assets within KakaoCloud. In this step, create the Catalog, Database, and Table that make up the Data Catalog.

  1. Catalog is a fully managed central repository within the VPC. To use the Data Catalog service, first create a Catalog.

    Item      Setting
    Name      hands_on
    VPC       ${any}
    Subnet    ${public}
  2. When the status of the created Catalog becomes Running, create a Database. The database of the Data Catalog is a container that stores tables.

    Item               Setting
    Catalog            hands_on
    Name               hands_on_db
    Path: Bucket       hands-on
    Path: Directory    nginx
  3. Create a table, which holds the metadata of the Data Catalog. See the note below the table for the expected layout of the data storage path.

    Item                              Setting
    Database                          hands_on_db
    Table name                        handson_table_original
    Data storage path: Bucket name    hands-on
    Data storage path: Directory      log/nginx
    Data type                         CSV
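
    The scheduled query in Step 3 runs MSCK REPAIR TABLE and writes dynamic date_id partitions, which suggests the source table is partitioned by date_id. Under that assumption, the data under log/nginx is expected to follow a Hive-style partition layout such as the hypothetical one below; the actual directory and file names come from the upload tutorial.

    hands-on/                                       # bucket
      log/nginx/                                    # data storage path
        date_id=2024-01-01/access_2024-01-01.log    # hypothetical partition directories and files
        date_id=2024-01-02/access_2024-01-02.log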

Step 3. Create Hadoop Eco resource

Hadoop Eco is a KakaoCloud service for executing distributed processing tasks using open source frameworks. Here's how to create a Hadoop Eco resource:

  1. Select Hadoop Eco in the KakaoCloud Console. Click the [Create cluster] button and create a cluster with the settings below.

    Item                      Setting
    Cluster Name              hands-on
    Cluster Version           Hadoop Eco 2.0.0
    Cluster Type              Core Hadoop
    Cluster Availability      Standard
    Administrator ID          ${ADMIN_ID}
    Administrator Password    ${ADMIN_PASSWORD}
  2. Set the master node and worker node instances.

    Classification    Master Node    Worker Node
    Instance Count    1              2
    Instance Type     m2a.xlarge     m2a.xlarge
    Volume Size       50 GB          100 GB
  3. Set the key pair and network configuration (VPC, Subnet) so that you can access the cluster over SSH from your environment. Then click the [Create security group] button.

    Network Configuration and Security Group Settings

  4. Open the scheduling settings, select Hive, and then enter the query below. The query refines the default-format nginx logs, aggregates request counts per URL and status, and stores the results. A sketch for spot-checking the results follows the query.

    -- Set the database
    USE hands_on_db;

    -- Repair the table
    MSCK REPAIR TABLE handson_table_original;

    -- Create empty refined_json table with JSON format using JsonSerDe
    CREATE EXTERNAL TABLE IF NOT EXISTS refined_json (
        remote_addr STRING,
        request_method STRING,
        request_url STRING,
        status STRING,
        request_time STRING,
        day_time STRING)
    PARTITIONED BY (date_id STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
    STORED AS TEXTFILE;

    -- Populate refined_json table
    INSERT INTO refined_json PARTITION (date_id)
    SELECT
        split(log, ' ')[0] AS remote_addr,
        split(log, ' ')[5] AS request_method,
        split(log, ' ')[6] AS request_url,
        split(log, ' ')[8] AS status,
        split(log, ' ')[9] AS request_time,
        regexp_extract(split(log, ' ')[3], '\\d{2}:\\d{2}:\\d{2}', 0) AS day_time,
        date_id
    FROM
        handson_table_original;

    -- Create empty urlcount table with JSON format using JsonSerDe
    CREATE EXTERNAL TABLE IF NOT EXISTS urlcount (
        request_url STRING,
        status STRING,
        count BIGINT)
    PARTITIONED BY (date_id STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
    STORED AS TEXTFILE;

    -- Populate urlcount table
    INSERT INTO urlcount PARTITION (date_id)
    SELECT
        request_url,
        status,
        count(*) AS count,
        date_id
    FROM
        refined_json
    GROUP BY
        request_url,
        status,
        date_id;
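
    After the scheduled job has run, you can spot-check the aggregated results from a cluster node. The sketch below assumes the Hive CLI is available on the master node and that the job has already populated the urlcount table; adjust the LIMIT as needed.

    # Minimal sketch: run on the master node to inspect the aggregated results.
    hive -e '
    USE hands_on_db;
    SELECT request_url, status, `count`, date_id
    FROM urlcount
    ORDER BY `count` DESC
    LIMIT 10;'

    If the scheduled job fails with a dynamic partition error, the query may additionally need hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict at the top, depending on the cluster's Hive defaults.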
  5. Set up the cluster according to the conditions below.

    Item                              Setting
    HDFS block size                   128
    HDFS replica count                2
    Cluster configuration settings    Click [Enter directly] and enter the code below
    Cluster configuration settings - Object Storage linkage
    {
      "configurations": [
        {
          "classification": "core-site",
          "properties": {
            "fs.swifta.service.kic.credential.id": "${ACCESS_KEY}",
            "fs.swifta.service.kic.credential.secret": "${ACCESS_SECRET_KEY}"
          }
        }
      ]
    }
  6. Set up the monitoring agent, Data Catalog service integration, and other options.

    Item                        Description
    Install monitoring agent    Do not install
    Service integration         Data Catalog integration
    Data Catalog name           Select [hands_on], the catalog created in Step 2
  7. Check the entered information and create a cluster.

Step 4. Hadoop Eco scheduling cron job setup

  1. Create a virtual machine to run the cron job.

    Item        Setting
    Type        Virtual Machine
    Quantity    1
    Name        cron-vm
    Image       Ubuntu 20.04
    Flavor      m2a.large
    Volume      20 GB
    info

    The instance that runs the cron job sends requests to the external network. Therefore, configure a Security Group and network environment that allow it to send and receive traffic to and from the external network.

  2. Connect via SSH to the instance that will run the cron job. You can connect by attaching a public IP or by going through a bastion host.

    ssh -i ${PRIVATE_KEY_FILE} ubuntu@${CRON_VM_ENDPOINT}
  3. Install the jq package to easily handle JSON formatted data in the shell.

    sudo apt-get update -y
    sudo apt-get install -y jq
  4. While connected to the instance over SSH, create the environment variable file that the script will load, referring to the table below. The cluster ID can be checked in the cluster details. See the note after the table about filling in the placeholder values.

    cat << \EOF | sudo tee /tmp/env.sh
    #!/bin/bash
    export CLUSTER_ID="${CLUSTER_ID}"
    export HADOOP_API_KEY="${HADOOP_API_KEY}"
    export ACCESS_KEY="${ACCESS_KEY}"
    export ACCESS_SECRET_KEY="${ACCESS_SECRET_KEY}"
    EOF

    Environment Variable Key    Environment Variable Value
    ${CLUSTER_ID}               Cluster ID
    ${HADOOP_API_KEY}           Hadoop API key
    ${ACCESS_KEY}               Access key
    ${ACCESS_SECRET_KEY}        Secret access key
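
    Because the heredoc delimiter is quoted (\EOF), the ${...} placeholders above are written to /tmp/env.sh literally. Replace them with the actual values from the table before continuing, for example by editing the file directly:

    # Open the file, substitute the real cluster ID and keys, then verify the contents.
    sudo vi /tmp/env.sh
    cat /tmp/env.sh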
  5. Write a script that requests cluster creation using the information in the environment variable file. For more information about the Hadoop cluster API, see the Hadoop Eco API documentation. A note on checking the API response follows the script.

    cat << \EOF | sudo tee /tmp/exec_hadoop.sh
    #!/bin/bash

    . /tmp/env.sh

    curl -X POST "https://hadoop-eco.kr-central-1.kakaoi.io/hadoop-eco/v1/cluster/${CLUSTER_ID}" \
    -H "Hadoop-Eco-Api-Key:${HADOOP_API_KEY}" \
    -H "Credential-ID:${ACCESS_KEY}" \
    -H "Credential-Secret:${ACCESS_SECRET_KEY}" \
    -H "Content-Type: application/json"
    EOF
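
    When you run the script manually (as in step 9 below), the jq package installed earlier can pretty-print the JSON response, which makes it easier to see whether the request was accepted. A minimal usage sketch; the exact response fields depend on the Hadoop Eco API:

    # Pipe the API response through jq for readability.
    bash /tmp/exec_hadoop.sh | jq '.'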
  6. Use the cron package to automatically run the script at regular intervals.

    sudo apt update -y
    sudo apt install -y cron
  7. Register the script that creates the Hadoop cluster in crontab so that it runs every day at midnight. See the note after the commands for the schedule format.

    cat << EOF > tmp_crontab
    0 0 * * * /bin/bash /tmp/exec_hadoop.sh
    EOF
    sudo crontab tmp_crontab
    rm tmp_crontab
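
    For reference, the five cron fields are minute, hour, day of month, month, and day of week, evaluated in the server's local time zone, so 0 0 * * * runs every day at 00:00. A commented variant that also keeps a log of each run is sketched below (hypothetical log path):

    # Field order: minute  hour  day-of-month  month  day-of-week
    # Optional variant that appends each run's output to a log file:
    # 0 0 * * * /bin/bash /tmp/exec_hadoop.sh >> /tmp/exec_hadoop.log 2>&1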
  8. Check if the cron job is registered.

    sudo crontab -l
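
    If the registration succeeded, the output should contain the entry registered in the previous step:

    0 0 * * * /bin/bash /tmp/exec_hadoop.sh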
  9. Run the command below and check if the cluster is created.

    bash /tmp/exec_hadoop.sh

    Check cluster creation

  10. Check the cluster's task results. The result logs are stored in the Object Storage bucket that was configured when creating the Hadoop cluster.

    Check cluster task results

  11. Check the task results saved in the bucket and log directory selected in Object Storage.

    Check bucket and log directory task results