Analyze web server logs using Hadoop Eco

This document shows how to set up an environment for analyzing web server logs by integrating Hadoop Eco, Object Storage, and Data Catalog.

info

About this scenario

This scenario guides you through the process of building an environment for analyzing web server logs by integrating Hadoop Eco, Object Storage, and Data Catalog. Users will learn how to efficiently analyze large log datasets to extract meaningful insights.

Key topics include:

  • Uploading and managing log files using Object Storage
  • Creating data catalogs and tables with Data Catalog
  • Configuring a data analysis environment using a Hadoop Eco cluster

Getting started

Step 1. Upload log files to Object Storage

Upload web server logs to Object Storage for analysis.

  1. The example log file uses the default Nginx access log format. Open your local terminal and run the following command to save the example log file in your local Downloads folder (~/Downloads).

    cat << EOF > ~/Downloads/access.log
    172.16.0.174 - - [02/Mar/2023:03:04:05 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:04:07 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:04:30 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:48:54 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:48:57 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:48:59 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    172.16.0.174 - - [02/Mar/2023:03:49:34 +0000] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15"
    EOF
  2. Upload the log file to Object Storage. Refer to the table below to create a bucket and directory and upload the example log file saved in your local downloads folder.

    Item         Value
    Bucket name  hands-on
    Directory    /nginx/ori/date_id=2023-02-01/hour_id=00
    File name    access.log
  3. Verify that the example log file has been uploaded to Object Storage.

    Log file upload

Step 2. Create Data Catalog resource

Data Catalog is a fully managed KakaoCloud service that helps you understand and efficiently manage the data assets of your organization and its users. It consists of Catalog, Database, and Table components.

  1. Create a Catalog as a fully managed central repository within the VPC before using the Data Catalog service.

    Item    Value
    Name    hands_on
    VPC     ${any}
    Subnet  ${public}
  2. Once the created catalog is in the Running state, create a Database. A database in Data Catalog serves as a container for storing tables.

    Item             Value
    Catalog          hands_on
    Name             hands_on_db
    Path: Bucket     hands-on
    Path: Directory  nginx
  3. Create the metadata tables: Original Data Table, Refined Data Table, and Result Data Table. The values below are for the original data table; create the refined table (handson_table_orc) and the result table (handson_table_request_url_count), which are used in later steps, in the same way with their own names, directories, and data types. (A conceptual Hive DDL sketch of the original data table follows this list.)

    Item             Value
    Database         hands_on_db
    Table name       handson_table_original
    Path: Bucket     hands-on
    Path: Directory  nginx/ori
    Data type        CSV

    Schema

    Partition key  Column number  Field name  Data type
    off            1              log         string
    on             -              date_id     string
    on             -              hour_id     string
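
For orientation, the original data table registered above is conceptually equivalent to a Hive external table with a single string column holding each raw log line, partitioned by the date_id=/hour_id= directory layout. The DDL below is only a sketch: the Data Catalog console creates this metadata for you, and the exact CSV SerDe options it registers are not shown here.

    Conceptual Hive DDL for the original data table (sketch)
    -- Sketch only: Data Catalog registers this metadata through the console.
    -- The swifta:// location mirrors the bucket and directory above; SerDe details are
    -- omitted and treated as console-managed.
    CREATE EXTERNAL TABLE handson_table_original (
      log STRING                                      -- one full Nginx access-log line per row
    )
    PARTITIONED BY (date_id STRING, hour_id STRING)   -- matches the date_id=/hour_id= directories
    LOCATION 'swifta://hands-on.kic/nginx/ori';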

Step 3. Create Hadoop Eco resource

Hadoop Eco is a KakaoCloud service designed to execute distributed processing tasks using an open-source framework.

  1. Go to KakaoCloud Console > Analytics > Hadoop Eco, select the [Create cluster] button, and configure the cluster with the following details:

    Item                  Value
    Cluster name          hands-on
    Cluster version       Hadoop Eco 2.0.0
    Cluster type          Core Hadoop
    Cluster availability  Standard
    Admin ID              ${ADMIN_ID}
    Admin password        ${ADMIN_PASSWORD}
  2. Configure the master node and worker node instances.

    • Set up key pairs and network configuration (VPC, subnet) to ensure SSH access to the nodes. Select [Create new security group] for proper access control.

    Configuration        Master node  Worker node
    Number of instances  1            2
    Instance type        m2a.2xlarge  m2a.2xlarge
    Volume size          50 GB        100 GB
  3. Configure the Cluster settings.

    Item                   Value
    Task scheduling        None
    HDFS block size        128 (MB)
    HDFS replication       2
    Cluster configuration  Select [Manual input] and enter the following code:

    Cluster configuration for Object Storage integration
    {
      "configurations": [
        {
          "classification": "core-site",
          "properties": {
            "fs.swifta.service.kic.credential.id": "${ACCESS_KEY}",
            "fs.swifta.service.kic.credential.secret": "${ACCESS_SECRET_KEY}"
          }
        }
      ]
    }
  4. Configure Service integration.

    Item                      Value
    Monitoring agent install  Do not install
    Service integration       Enable Data Catalog integration
    Data Catalog name         Select the previously created [hands_on]
  5. Review the entered information and create the cluster.

Step 4. Extract original data and write to refined table

  1. Connect to the master node of the Hadoop cluster using SSH.

    Connect to master node
    ssh -i ${PRIVATE_KEY_FILE} ubuntu@${HADOOP_MST_NODE_ENDPOINT}
    caution

    The created master node uses a private IP and cannot be accessed directly from a public network. You may need to use a bastion host or configure a public IP for access.

  2. Use Apache Hive to extract data. Apache Hive simplifies reading, writing, and managing large datasets stored in distributed storage using SQL.

    Start Apache Hive
    hive
  3. Set the working database to the one created in the Data Catalog step.

    Set database
    use hands_on_db;
  4. Add partitions to the original log table. Verify the added partition details on the Data Catalog Console.

    Add original log table partitions
    msck repair table handson_table_original;

    Add original log table partition

  5. Run the following SQL query in the Hive CLI to extract fields from the original log data and write them to the refined table. (An optional spot-check of the result follows after this list.)

    Run SQL query
    -- Split each raw log line on spaces, pull out the fields of interest, and insert them
    -- into the refined table, carrying the date_id/hour_id partitions through.
    INSERT INTO TABLE handson_table_orc PARTITION(date_id, hour_id)
    SELECT remote_addr,
           from_unixtime(unix_timestamp(time_local, '[dd/MMM/yyyy:HH:mm:ss')) AS date_time,
           request_method,
           request_url,
           status,
           request_time,
           date_id,
           hour_id
    FROM (
        SELECT split(log, " ")[0] AS remote_addr,    -- client IP
               split(log, " ")[3] AS time_local,     -- e.g. [02/Mar/2023:03:04:05
               split(log, " ")[5] AS request_method,
               split(log, " ")[6] AS request_url,
               split(log, " ")[8] AS status,
               split(log, " ")[9] AS request_time,
               date_id,
               hour_id
        FROM handson_table_original
    ) R;
  6. Exit HiveCLI after completing the tasks.

    Exit HiveCLI
    exit;
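
Optionally, before running the exit command above (or after reopening Hive), you can spot-check the work from the Hive CLI: confirm that the partitions registered by msck repair are visible and that rows actually landed in the refined table.

    Optional spot-check in the Hive CLI
    -- List the partitions registered for the original log table
    show partitions handson_table_original;
    -- Peek at a few rows written to the refined table by the INSERT above
    select * from handson_table_orc limit 5;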

Step 5. Extract required data and create table

  1. Start the Spark shell.

    Start Spark shell
    spark-shell
  2. Use Spark to process the refined table data: count requests grouped by request_url and status, and save the results in JSON format under the result table's path (a conceptual sketch of that table's definition follows this list).

    Process and write data
    // Use Hive's SerDe for metastore ORC tables instead of Spark's native ORC reader
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
    spark.sql("use hands_on_db").show()
    // Count requests by URL and status, then write gzip-compressed JSON under the result table's path
    val requestUrlCount = spark.sql("SELECT request_url, status, count(*) as count FROM handson_table_orc GROUP BY request_url, status")
    requestUrlCount.write.format("json").option("compression", "gzip").save("swifta://hands-on.kic/nginx/request_url_count/date_id=2023-02-01/hour_id=01")

Step 6. Check results using Hue

Hue (Hadoop User Experience) is a web-based user interface designed for use with Apache Hadoop clusters. It allows easy access to Hadoop data and seamless integration with various Hadoop ecosystem components.

  1. Access the Hue page. Open your browser and connect to port 8888 on the Hadoop cluster's master node. Log in with the admin ID and password configured during cluster creation.

    Access Hue
    open http://${HADOOP_MST_NODE_ENDPOINT}:8888
    caution

    The created nodes use private IPs and cannot be accessed directly from a public network. Use a bastion host or configure a public IP for access.

  2. In the Hue interface, execute Hive queries. Set the working database to the one created in the Data Catalog step.

    Set database
    use hands_on_db;
  3. Add partitions to the result table using the following command.

    Add partitions to result table
    msck repair table handson_table_request_url_count;
  4. Verify the partition details in the Data Catalog console.

  5. Query the result table. To read the JSON-formatted data, first add the JSON SerDe library (hive-hcatalog-core).

    Add library for JSON data processing
    add jar /opt/hive/lib/hive-hcatalog-core-3.1.3.jar;
  6. Retrieve and display the stored data from the result table. (A descending-order variant of this query is sketched after this list.)

    Query result data
    select * from handson_table_request_url_count order by count limit 10;
  7. View the query results as a graph on the Hue interface.

    View results
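
The query in the previous step lists the least-requested URLs first. To put the busiest URLs at the top of the results (and of the Hue graph), a descending sort such as the sketch below can be used instead.

    Query result data sorted by request count (sketch)
    select request_url, status, count
    from handson_table_request_url_count
    order by count desc
    limit 10;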