Real-time web server log analysis and monitoring using Hadoop Eco Dataflow type

Using KakaoCloud Hadoop Eco, you can easily build an environment to analyze and monitor web server logs in real time. This document provides a hands-on tutorial using Hadoop Eco Dataflow type.

info
  • Estimated time required: 60 minutes
  • User environment
    • Recommended OS: Mac OS, Ubuntu
    • Region: kr-central-2

About this scenario

This tutorial guides you through implementing real-time web server log analysis and monitoring with the Hadoop Eco Dataflow type. By working through data collection, preprocessing, analysis, and visualization, you can understand the basic principles of building a real-time data pipeline and gain hands-on experience with real-time analysis and monitoring using the Dataflow type of Hadoop Eco.

The main contents of this scenario are as follows:

  • Preprocessing and analysis of log data using Filebeat and Kafka
  • Building a monitoring dashboard through real-time data visualization using Druid and Superset

Getting started

The hands-on steps for setting up the Hadoop Eco Dataflow type and building a real-time web server log analysis and monitoring environment are as follows:

Step 1. Create Hadoop Eco

  1. Select the Hadoop Eco menu in the KakaoCloud Console.

  2. Click the [Create cluster] button and create a Hadoop Eco cluster as follows.

    Item                   | Setting
    Cluster Name           | hands-on-dataflow
    Cluster Version        | Hadoop Eco 2.0.1
    Cluster Type           | Dataflow
    Cluster Availability   | Standard
    Administrator ID       | ${ADMIN_ID}
    Administrator Password | ${ADMIN_PASSWORD}
    caution

    The administrator ID and password must be stored safely as they are required to access Superset, a data exploration and visualization platform.

  3. Set up the master node and worker node instances.

    • Set the key pair and network configuration (VPC, subnet) so that you can connect to the nodes via SSH from your environment.

      Category            | Master Node | Worker Node
      Number of Instances | 1           | 2
      Instance Type       | m2a.xlarge  | m2a.xlarge
      Volume Size         | 50 GB       | 100 GB
    • Next, select Create security group.

  4. Configure the remaining settings as follows.

    • Task scheduling settings
      Item                     | Setting value
      Task scheduling settings | Not selected
    • Cluster detailed settings
      Item                           | Setting value
      HDFS block size                | 128
      HDFS replica count             | 2
      Cluster configuration settings | Not set
    • Service linkage settings
      Item                          | Setting value
      Monitoring agent installation | Not installed
      Service linkage               | Not linked
  5. After checking the entered information, click the [Create] button to create a cluster.

Step 2. Configure security group

When a Hadoop Eco cluster is created, the newly created security group has no inbound policy by default, for security reasons.

Add inbound rules to the security group so that you can access the cluster.

Click Hadoop Eco Cluster List > Created Cluster > Cluster Information > Security Group Link. Click the [Manage inbound rules] button and set the inbound policy as shown below.

Check my public IP

The inbound rules below allow access only from your current public IP. You can check your public IP from a terminal as shown in the example below.
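
For example, using ifconfig.me, a public IP echo service unrelated to KakaoCloud (any equivalent service works):

# Print the public IP of the machine you are working from
curl -s https://ifconfig.me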

tip

If a ‘bad permissions’ error occurs due to a key file permission issue, you can solve the problem by adding the sudo command.

Protocol | Packet Source               | Port Number | Policy Description
TCP      | {your public IP address}/32 | 22          | ssh connection
TCP      | {your public IP address}/32 | 80          | NGINX
TCP      | {your public IP address}/32 | 4000        | Superset
TCP      | {your public IP address}/32 | 3008        | Druid

Step 3. Configure web server and log pipeline

Configure the log pipeline on the master node of the created Hadoop Eco cluster using the Nginx web server and Filebeat. Filebeat periodically scans the log files and forwards newly written log entries to Kafka.

  1. Connect to the master node of the created Hadoop cluster using ssh.
Connect to the master node
chmod 400 ${PRIVATE_KEY_FILE}
ssh -i ${PRIVATE_KEY_FILE} ubuntu@${HADOOP_MST_NODE_ENDPOINT}
caution

The created master node has only a private IP and cannot be reached directly from a public network. To connect, associate a public IP with the node or go through a bastion host.
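
If you use a bastion host, you can tunnel the SSH connection through it. The sketch below assumes the bastion is reachable at ${BASTION_PUBLIC_IP} as user ubuntu with the same key pair, and ${HADOOP_MST_NODE_PRIVATE_IP} is a placeholder for the master node's private IP; adjust these to your environment.

Connect via a bastion host (example)
# -J (ProxyJump) routes the connection through the bastion host
ssh -i ${PRIVATE_KEY_FILE} -J ubuntu@${BASTION_PUBLIC_IP} ubuntu@${HADOOP_MST_NODE_PRIVATE_IP}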

  2. Install the Nginx web server and jq, which is used later to view the JSON-format logs.

    sudo apt update -y
    sudo apt install nginx -y
    sudo apt install jq -y
    info

    If a purple "Pending kernel upgrade" or "Daemons using outdated libraries" screen appears during installation, simply press Enter to continue.

  3. Install and configure GeoIP to collect client region information for API requests in the logs.

    sudo apt install libnginx-mod-http-geoip geoip-database gzip
    cd /usr/share/GeoIP
    sudo wget https://centminmod.com/centminmodparts/geoip-legacy/GeoLiteCity.gz
    sudo gunzip GeoLiteCity.gz
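
    Optionally, confirm that the GeoIP databases are in place. The country database (GeoIP.dat) comes from the geoip-database package, and the city database is the file extracted above; exact file names can vary by package version.

    # List the GeoIP database files available to Nginx
    ls -l /usr/share/GeoIP/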
  4. Set the Nginx access log format.

    Edit nginx settings
    cat << 'EOF' | sudo tee /etc/nginx/nginx.conf
    user www-data;
    worker_processes auto;
    pid /run/nginx.pid;
    include /etc/nginx/modules-enabled/*.conf;

    events {
        worker_connections 768;
    }

    http {

        # Basic Settings
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        types_hash_max_size 2048;

        # SSL Settings
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;

        # Logging Settings
        geoip_country /usr/share/GeoIP/GeoIP.dat;
        # City database extracted in the previous step; provides $geoip_latitude and $geoip_longitude
        geoip_city /usr/share/GeoIP/GeoLiteCity;

        log_format nginxlog_json escape=json
            '{'
            '"remote_addr":"$remote_addr",'
            '"remote_user":"$remote_user",'
            '"http_user_agent":"$http_user_agent",'
            '"host":"$host",'
            '"hostname":"$hostname",'
            '"request":"$request",'
            '"request_method":"$request_method",'
            '"request_uri":"$request_uri",'
            '"status":"$status",'
            '"time_iso8601":"$time_iso8601",'
            '"time_local":"$time_local",'
            '"uri":"$uri",'
            '"http_referer":"$http_referer",'
            '"body_bytes_sent":"$body_bytes_sent",'
            '"geoip_country_code": "$geoip_country_code",'
            '"geoip_latitude": "$geoip_latitude",'
            '"geoip_longitude": "$geoip_longitude"'
            '}';

        access_log /var/log/nginx/access.log nginxlog_json;
        error_log /var/log/nginx/error.log;

        # Virtual Host Configs
        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
    }
    EOF
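    Before restarting, you can optionally check that the new configuration is valid:

    Validate the Nginx configuration (optional)
    sudo nginx -t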
    Restart Nginx and check status
    sudo systemctl restart nginx
    sudo systemctl status nginx
  5. Access the webpage and check the access log.

    • Access the webpage. If the web server is running normally, the default Nginx welcome page is displayed.
    http://{MASTER_NODE_PUBLIC_IP}

    Verify Nginx access

    • Verify that logs are being recorded normally on the master node instance.

      tail /var/log/nginx/access.log | jq
    • Example of connection log

      {
      "remote_addr": "220.12x.8x.xx",
      "remote_user": "",
      "http_user_agent": "",
      "host": "10.xx.xx.1x",
      "hostname": "host-172-30-4-5",
      "request": "GET http://210.109.8.104:80/php/scripts/setup.php HTTP/1.0",
      "request_method": "GET",
      "request_uri": "/php/scripts/setup.php",
      "status": "404",
      "time_iso8601": "2023-11-15T06:24:49+00:00",
      "time_local": "15/Nov/2023:06:24:49 +0000",
      "uri": "/php/scripts/setup.php",
      "http_referer": "",
      "body_bytes_sent": "162",
      "geoip_country_code": "KR",
      "geoip_latitude": "37.3925",
      "geoip_longitude": "126.9269"
      }
      caution

      Nginx uses UTC by default, so the time_iso8601 and time_local fields are recorded in UTC, which may differ from KST (UTC+9).
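
    • (Optional) If no requests have been made yet, you can generate a few yourself so that new log entries appear. The loop below is only an illustration; requests sent from the master node itself carry local client information.

      Generate sample requests
      # Send 10 requests to the local Nginx and show the newest log lines
      for i in $(seq 1 10); do curl -s -o /dev/null http://localhost/; done
      tail -n 3 /var/log/nginx/access.log | jq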

  6. Install and configure Filebeat.

    Install Filebeat
    cd ~
    sudo curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.9.1-linux-x86_64.tar.gz
    tar xzvf filebeat-8.9.1-linux-x86_64.tar.gz
    ln -s filebeat-8.9.1-linux-x86_64 filebeat
    Filebeat configuration (integration with Kafka on Hadoop Eco cluster worker node)
    cat << EOF | sudo tee ~/filebeat/filebeat.yml

    ########################### Filebeat Configuration ##############################
    filebeat.config.modules:
      # Glob pattern for configuration loading
      path: \${path.config}/modules.d/*.yml
      # Set to true to enable config reloading
      reload.enabled: false

    # ================================== Outputs ===================================
    output.kafka:
      hosts: ["${WORKER-NODE1-HOSTNAME}:9092","${WORKER-NODE2-HOSTNAME}:9092"]
      topic: 'nginx-from-filebeat'
      partition.round_robin:
        reachable_only: false
      required_acks: 1
      compression: gzip
      max_message_bytes: 1000000

    # ================================= Processors =================================
    processors:
      - add_host_metadata:
          when.not.contains.tags: forwarded
      - decode_json_fields:
          fields: ["message"]
          process_array: true
          max_depth: 2
          target: log
          overwrite_keys: true
          add_error_key: false

    EOF
caution

Do not modify ${path.config}. In output.kafka > hosts, enter the worker node hostnames, not their IP addresses.

  • Example: host-172-16-0-0:9092
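
If you are unsure of a worker node's hostname, you can also print it directly on the node. The command below is a sketch: ${WORKER_NODE_IP} is a placeholder for an address of the worker node reachable from your environment (for example through the same public IP or bastion setup used for the master node).

Check a worker node's hostname (optional)
ssh -i ${PRIVATE_KEY_FILE} ubuntu@${WORKER_NODE_IP} hostname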
Filebeat Configuration (Filebeat Nginx Module Configuration)
cat << EOF | sudo tee ~/filebeat/modules.d/nginx.yml
- module: nginx
  access:
    enabled: true

  error:
    enabled: false

  ingress_controller:
    enabled: false
EOF
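
Before creating the service, you can optionally verify the configuration and the Kafka connection with Filebeat's built-in test commands:

Test the Filebeat configuration (optional)
cd ~/filebeat
# Validate filebeat.yml
./filebeat test config -c ./filebeat.yml
# Check that the Kafka output defined in filebeat.yml is reachable
./filebeat test output -c ./filebeat.yml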
Filebeat Configuration (Creating a Filebeat Service)
cat << 'EOF' | sudo tee /etc/systemd/system/filebeat.service
[Unit]
Description=Filebeat sends log files to Kafka.
Documentation=https://www.elastic.co/products/beats/filebeat
Wants=network-online.target
After=network-online.target

[Service]
User=ubuntu
Group=ubuntu
ExecStart=/home/ubuntu/filebeat/filebeat -c /home/ubuntu/filebeat/filebeat.yml -path.data /home/ubuntu/filebeat/data
Restart=always

[Install]
WantedBy=multi-user.target

EOF
Run Filebeat
sudo systemctl daemon-reload
sudo systemctl enable filebeat
sudo systemctl start filebeat
sudo systemctl status filebeat
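
To confirm that log events are reaching Kafka, you can consume a few messages from the topic on a worker node. The exact location of the Kafka CLI scripts on a Hadoop Eco worker node is not covered here; the example below assumes they are on the PATH, so adjust the paths to your installation.

Check the Kafka topic on a worker node (optional)
# List topics and read a few messages from nginx-from-filebeat
kafka-topics.sh --bootstrap-server localhost:9092 --list
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic nginx-from-filebeat --from-beginning --max-messages 5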

Step 4. Connect to Druid and set up

  1. Connect to Druid through Hadoop Eco Cluster > Cluster Information > [Druid URL].

    http://{MASTER_NODE_PUBLIC_IP}:3008
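
    If the page does not load, first make sure the inbound rule for port 3008 is applied. You can also check whether the endpoint responds at all; /status is a standard Druid health endpoint, but whether it is exposed through this port depends on the Hadoop Eco proxy setup, so treat this as an optional sanity check.

    # Optional: check that the Druid endpoint responds
    curl -s http://{MASTER_NODE_PUBLIC_IP}:3008/status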
  2. Click the Load Data > Streaming button at the top of the main screen. Click the [Edit spec] button at the top right.

    Druid Settings Button

  3. In the JSON below, replace bootstrap.servers with the hostnames of the Hadoop Eco worker nodes, paste the content, and click the [Submit] button.

caution

You must enter the worker node hostnames, not IP addresses, in bootstrap.servers.
You can check the hostname in VM Instance > Details. Example: host-172-16-0-2

JSON
{
"type": "kafka",
"spec": {
"ioConfig": {
"type": "kafka",
"consumerProperties": {
"bootstrap.servers": "{WORKER-NODE1-HOSTNAME}:9092,{WORKDER-NODE2-HOSTNAME}:9092"
},
"topic": "nginx-from-filebeat",
"inputFormat": {
"type": "json",
"flattenSpec": {
"fields": [
{
"name": "agent.ephemeral_id",
"type": "path",
"expr": "$.agent.ephemeral_id"
},
{
"name": "agent.id",
"type": "path",
"expr": "$.agent.id"
},
{
"name": "agent.name",
"type": "path",
"expr": "$.agent.name"
},
{
"name": "agent.type",
"type": "path",
"expr": "$.agent.type"
},
{
"name": "agent.version",
"type": "path",
"expr": "$.agent.version"
},
{
"name": "ecs.version",
"type": "path",
"expr": "$.ecs.version"
},
{
"name": "event.dataset",
"type": "path",
"expr": "$.event.dataset"
},
{
"name": "event.module",
"type": "path",
"expr": "$.event.module"
},
{
"name": "event.timezone",
"type": "path",
"expr": "$.event.timezone"
},
{
"name": "fileset.name",
"type": "path",
"expr": "$.fileset.name"
},
{
"name": "host.architecture",
"type": "path",
"expr": "$.host.architecture"
},
{
"name": "host.containerized",
"type": "path",
"expr": "$.host.containerized"
},
{
"name": "host.hostname",
"type": "path",
"expr": "$.host.hostname"
},
{
"name": "host.id",
"type": "path",
"expr": "$.host.id"
},
{
"name": "host.ip",
"type": "path",
"expr": "$.host.ip"
},
{
"name": "host.mac",
"type": "path",
"expr": "$.host.mac"
},
{
"name": "host.name",
"type": "path",
"expr": "$.host.name"
},
{
"name": "host.os.codename",
"type": "path",
"expr": "$.host.os.codename"
},
{
"name": "host.os.family",
"type": "path",
"expr": "$.host.os.family"
},
{
"name": "host.os.kernel",
"type": "path",
"expr": "$.host.os.kernel"
},
{
"name": "host.os.name",
"type": "path",
"expr": "$.host.os.name"
},
{
"name": "host.os.platform",
"type": "path",
"expr": "$.host.os.platform"
},
{
"name": "host.os.type",
"type": "path",
"expr": "$.host.os.type"
},
{
"name": "host.os.version",
"type": "path",
"expr": "$.host.os.version"
},
{
"name": "input.type",
"type": "path",
"expr": "$.input.type"
},
{
"name": "log.body_bytes_sent",
"type": "path",
"expr": "$.log.body_bytes_sent"
},
{
"name": "log.file.path",
"type": "path",
"expr": "$.log.file.path"
},
{
"name": "log.geoip_country_code",
"type": "path",
"expr": "$.log.geoip_country_code"
},
{
"name": "log.geoip_latitude",
"type": "path",
"expr": "$.log.geoip_latitude"
},
{
"name": "log.geoip_longitude",
"type": "path",
"expr": "$.log.geoip_longitude"
},
{
"name": "log.host",
"type": "path",
"expr": "$.log.host"
},
{
"name": "log.hostname",
"type": "path",
"expr": "$.log.hostname"
},
{
"name": "log.http_referer",
"type": "path",
"expr": "$.log.http_referer"
},
{
"name": "log.http_user_agent",
"type": "path",
"expr": "$.log.http_user_agent"
},
{
"name": "log.offset",
"type": "path",
"expr": "$.log.offset"
},
{
"name": "log.remote_addr",
"type": "path",
"expr": "$.log.remote_addr"
},
{
"name": "log.remote_user",
"type": "path",
"expr": "$.log.remote_user"
},
{
"name": "log.request",
"type": "path",
"expr": "$.log.request"
},
{
"name": "log.request_method",
"type": "path",
"expr": "$.log.request_method"
},
{
"name": "log.request_uri",
"type": "path",
"expr": "$.log.request_uri"
},
{
"name": "log.status",
"type": "path",
"expr": "$.log.status"
},
{
"name": "log.time_iso8601",
"type": "path",
"expr": "$.log.time_iso8601"
},
{
"name": "log.time_local",
"type": "path",
"expr": "$.log.time_local"
},
{
"name": "log.uri",
"type": "path",
"expr": "$.log.uri"
},
{
"name": "service.type",
"type": "path",
"expr": "$.service.type"
},
{
"name": "$.@metadata.beat",
"type": "path",
"expr": "$['@metadata'].beat"
},
{
"name": "$.@metadata.pipeline",
"type": "path",
"expr": "$['@metadata'].pipeline"
},
{
"name": "$.@metadata.type",
"type": "path",
"expr": "$['@metadata'].type"
},
{
"name": "$.@metadata.version",
"type": "path",
"expr": "$['@metadata'].version"
}
]
}
},
"useEarliestOffset": true
},
"tuningConfig": {
"type": "kafka"
},
"dataSchema": {
"dataSource": "nginx-from-filebeat",
"timestampSpec": {
"column": "@timestamp",
"format": "iso"
},
"dimensionsSpec": {
"dimensions": [
"host.name",
{
"name": "log.body_bytes_sent",
"type": "float"
},
"log.file.path",
"log.geoip_country_code",
"log.geoip_latitude",
"log.geoip_longitude",
"log.host",
"log.hostname",
"log.http_referer",
"log.http_user_agent",
"log.offset",
"log.remote_addr",
"log.remote_user",
"log.request",
"log.request_method",
"log.request_uri",
{
"name": "log.status",
"type": "long"
},
"log.time_iso8601",
"log.time_local",
"log.uri"
]
},
"granularitySpec": {
"queryGranularity": "none",
"rollup": false
},
"transformSpec": {
"filter": {
"type": "not",
"field": {
"type": "selector",
"dimension": "log.status",
"value": null
}
}
}
}
}
}
  4. You can check the connection status with Kafka in the Ingestion tab. If the Status is RUNNING as shown in the image below, the connection is working normally.

    Check Druid status
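
    Once the supervisor is RUNNING and data has been ingested, you can also run a quick query to confirm that rows are arriving. The example below uses Druid's SQL API (/druid/v2/sql); as with the endpoint check in step 1, whether it is reachable through port 3008 depends on the Hadoop Eco proxy setup.

    # Optional: count ingested rows via the Druid SQL API
    curl -s -X POST http://{MASTER_NODE_PUBLIC_IP}:3008/druid/v2/sql -H 'Content-Type: application/json' -d '{"query": "SELECT COUNT(*) AS cnt FROM \"nginx-from-filebeat\""}'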

Step 5. Superset connection and settings

You can monitor data in real time through Superset.

  1. Access Superset through Hadoop Eco Cluster > Cluster Information > [Superset URL]. Log in using the administrator ID and password you entered when creating the cluster.

    http://{MASTER_NODE_PUBLIC_IP}:4000
  2. Click the [Datasets] button on the top menu. Then click the [+ DATASET] button on the top right to import the dataset from Druid.

  3. Set the database and schema as shown below. Then click the [CREATE DATASET AND CREATE CHART] button.

    Item     | Setting value
    DATABASE | druid
    SCHEMA   | druid
    TABLE    | nginx-from-filebeat
  4. Select the desired charts and click the [CREATE NEW CHART] button.

  5. Enter the data and settings you want to monitor, click the [CREATE CHART] button to create the chart, and then click the [SAVE] button at the top right to save it.

  6. You can add the created chart to the dashboard and monitor it as shown below.

Check the dashboard