
Monitor Kafka consumer lag using Prometheus

This tutorial demonstrates how to collect Kafka consumer lag metrics with Kafka Exporter and Prometheus, and how to trigger alerts through Alert Center when the lag exceeds a defined threshold.

Basic information
  • Estimated time: 60 minutes
  • Recommended OS: macOS, Ubuntu
  • Prerequisites:

About this scenario

In this tutorial, you'll monitor Kafka consumer lag with Prometheus and Kafka Exporter, then configure Alert Center to send an alert when the lag exceeds a threshold.

This scenario includes:

  • Installing Kafka Exporter and Prometheus Agent
  • Collecting and viewing Kafka lag metrics
  • Setting threshold-based alerts with Alert Center

Terminology
  • Kafka lag: The number of messages that a consumer group has not yet processed for a given topic. It shows how far behind the group is, which helps identify system bottlenecks or failures.
    Lag = latest offset in the Kafka partition - offset committed by the consumer group
  • Kafka Exporter: Collects metrics (including lag) from Kafka and exposes them in a Prometheus-compatible format.
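As a worked example of the lag formula, the sketch below computes the lag of a single partition from two offsets. The function name partition_lag and the offset values are made up for illustration; they are not taken from a real cluster.

```shell
# Hypothetical helper: lag for one partition = latest offset - committed offset.
partition_lag() {
  latest_offset=$1
  committed_offset=$2
  echo $(( latest_offset - committed_offset ))
}

# Illustrative offsets only.
partition_lag 120 95    # prints 25: the group is 25 messages behind
partition_lag 300 300   # prints 0: the group is fully caught up
```

The lag of a consumer group for a topic is then the sum of these per-partition values.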
Note

KakaoCloud's Advanced Managed Prometheus cannot access Kafka clusters directly. Therefore, you need to install a Prometheus Agent on the VM running Kafka (or in the same network) to collect metrics and forward them to the Managed Prometheus workspace.

Before you start

Before starting this tutorial, please follow the steps in Message processing through Kafka to set up a working Kafka producer-consumer environment.

Getting started

Step 1. Create consumer group

The kafka_consumergroup_lag metric collected by Kafka Exporter is measured per consumer group, based on how far each group has consumed messages from each partition.

Run the following command to create a consumer group in Kafka:

# Move to Kafka directory
cd ~/kafka

# Create consumer group
bin/kafka-console-consumer.sh \
--bootstrap-server ${BOOTSTRAP_SERVER} \
--topic ${TOPIC_NAME} \
--group ${GROUP_NAME} \
--from-beginning
Environment variable    Description
BOOTSTRAP_SERVER        Kafka cluster bootstrap server from KakaoCloud Console
TOPIC_NAME              Pre-created Kafka topic name
GROUP_NAME              Consumer group name, e.g. lag-group
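Once the consumer group exists, its lag can also be read directly from Kafka with the kafka-consumer-groups.sh tool that ships with Kafka. The sketch below is a minimal helper that sums the LAG column of the --describe output. The function name sum_describe_lag, the topic name demo, and the sample rows are assumptions for illustration; the helper finds the LAG column by its header name rather than by position.

```shell
# Hypothetical helper: sum the LAG column of
# `kafka-consumer-groups.sh --describe` output read from stdin.
sum_describe_lag() {
  awk '
    # Until the header row is seen, look for the column named LAG.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Data rows: add numeric lag values; "-" (no committed offset) is skipped.
    $lag_col ~ /^[0-9]+$/ { total += $lag_col }
    END { print total + 0 }
  '
}

# Illustrative sample output (not captured from a real cluster).
sum_describe_lag <<'EOF'
GROUP      TOPIC  PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
lag-group  demo   0          40              47              7    -            -     -
lag-group  demo   1          38              43              5    -            -     -
EOF
# prints 12

# Usage against a real cluster (run from ~/kafka):
#   bin/kafka-consumer-groups.sh --bootstrap-server "${BOOTSTRAP_SERVER}" \
#     --describe --group "${GROUP_NAME}" | sum_describe_lag
```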

Step 2. Install Kafka Exporter

  1. Install and run kafka_exporter to expose Kafka metrics for Prometheus to scrape.
# Move to install directory
cd ~/Downloads

# Download Kafka Exporter
wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.9.0/kafka_exporter-1.9.0.linux-amd64.tar.gz
tar -xvf kafka_exporter-1.9.0.linux-amd64.tar.gz
cd kafka_exporter-1.9.0.linux-amd64

# Start Kafka Exporter
./kafka_exporter \
--kafka.server=${BROKER1} \
--kafka.server=${BROKER2} \
--log.level=info
Environment variable    Description
BROKER1                 Kafka broker IP and port from KakaoCloud Console, e.g. 10.0.x.x:9092
BROKER2                 Another broker IP and port
  2. Check if metrics are exposed correctly:

    curl http://localhost:9308/metrics | grep kafka_consumergroup_lag

Step 3. Install and configure local Prometheus Agent

To scrape metrics from Kafka Exporter, install Prometheus locally and configure it as an agent.

Note

Before installation, make sure to create a Prometheus workspace in the KakaoCloud Console.

  1. Download and extract Prometheus:

    cd ~/kafka
    wget https://github.com/prometheus/prometheus/releases/download/v2.33.1/prometheus-2.33.1.linux-amd64.tar.gz
    tar xvfz prometheus-2.33.1.linux-amd64.tar.gz
    cd prometheus-2.33.1.linux-amd64
  2. Create and open a Prometheus Agent config file:

    sudo mkdir -p /etc/prometheus
    sudo vi /etc/prometheus/prometheus-agent.yaml
  3. Add the following configuration to the YAML file:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'kafka_exporter'
        static_configs:
          - targets: ['localhost:9308']

    remote_write:
      - url: "${WRITE_ENDPOINT}"
        headers:
          Credential-ID: '${CREDENTIAL_ID}'
          Credential-Secret: '${CREDENTIAL_SECRET}'
    Environment variable    Description
    WRITE_ENDPOINT          Write endpoint of the workspace from KakaoCloud Console
    CREDENTIAL_ID           Access Key ID
    CREDENTIAL_SECRET       Secret Access Key
  4. Start Prometheus with the config file:

    cd ~/kafka/prometheus-2.33.1.linux-amd64
    ./prometheus --config.file=/etc/prometheus/prometheus-agent.yaml > prom.log 2>&1 &
  5. Verify Prometheus is running:

    curl http://localhost:9090
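The ${WRITE_ENDPOINT}, ${CREDENTIAL_ID}, and ${CREDENTIAL_SECRET} entries in the configuration above are placeholders that must be replaced with real values. Assuming Prometheus reads the file literally, without expanding shell-style variables, one way to fill them in is a small sed-based helper like the sketch below. The function name render_placeholders, the template file name, and the example URL are hypothetical.

```shell
# Hypothetical helper: replace ${NAME} placeholders in a template read from
# stdin with the values of the shell variables named as arguments.
# The body runs in a subshell, so its working variables do not leak.
render_placeholders() (
  script=""
  for name in "$@"; do
    eval "value=\${$name}"
    # Escape characters that are special in a sed replacement (\ & |).
    escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
    script="${script}s|\\\${${name}}|${escaped}|g;"
  done
  sed -e "$script"
)

# Illustrative value only; use the write endpoint from KakaoCloud Console.
WRITE_ENDPOINT='https://example.invalid/write'
printf 'url: "${WRITE_ENDPOINT}"\n' | render_placeholders WRITE_ENDPOINT
# prints: url: "https://example.invalid/write"

# Usage (template file name is an assumption):
#   render_placeholders WRITE_ENDPOINT CREDENTIAL_ID CREDENTIAL_SECRET \
#     < prometheus-agent.template.yaml | sudo tee /etc/prometheus/prometheus-agent.yaml
```

Values containing newlines are not handled by this sketch.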

Step 4. Set alerts in Alert Center

Configure alert rules based on Kafka Lag metrics.

Note

At least one notification channel must be registered before creating an alert policy. See Create and manage notification channels for details.

  1. Go to KakaoCloud Console > Management > Alert Center.

  2. Click the Alert policy (project) tab, then click Create alert policy.

  3. Choose Advanced Managed Prometheus for the condition type.

  4. Select the previously created workspace.

  5. Enter the following alert rule script:

    groups:
      - name: kafkaConsumergroupAlert
        rules:
          - alert: HighConsumergroupLag
            expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) >= 10
            for: 1m
            annotations:
              summary: "Kafka Consumergroup Lag >= 10"
              description: "consumer group: {{ $labels.consumergroup }} / topic: {{ $labels.topic }} / sum of lag: {{ $value }}"
  6. Click [Next], then select the notification channel.

  7. Click [Next] again and enter a name for the alert policy.

  8. Review and click [Create] to complete the alert setup.
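To see what the alert expression evaluates before relying on Alert Center, the same aggregation can be approximated locally against the exporter output. The sketch below sums kafka_consumergroup_lag per (consumergroup, topic) pair and prints the pairs at or above a threshold, mirroring sum(...) by (consumergroup, topic) >= 10. The function name lag_over_threshold, the topic name demo, and the sample metric lines are assumptions for illustration; the for: 1m duration is not modeled.

```shell
# Hypothetical helper: read Prometheus exposition text from stdin and print
# "consumergroup topic lag" for every pair whose summed lag >= threshold.
lag_over_threshold() {
  threshold=$1
  awk -v threshold="$threshold" '
    /^kafka_consumergroup_lag[{]/ {
      group = topic = ""
      # Extract the label values between the quotes.
      if (match($0, /consumergroup="[^"]*"/)) group = substr($0, RSTART + 15, RLENGTH - 16)
      if (match($0, /topic="[^"]*"/))         topic = substr($0, RSTART + 7, RLENGTH - 8)
      sum[group " " topic] += $NF   # the sample value is the last field
    }
    END { for (key in sum) if (sum[key] >= threshold) print key, sum[key] }
  '
}

# Illustrative sample (not captured from a real exporter).
lag_over_threshold 10 <<'EOF'
kafka_consumergroup_lag{consumergroup="lag-group",partition="0",topic="demo"} 7
kafka_consumergroup_lag{consumergroup="lag-group",partition="1",topic="demo"} 5
kafka_consumergroup_lag{consumergroup="other-group",partition="0",topic="demo"} 3
EOF
# prints: lag-group demo 12

# Usage against the running exporter:
#   curl -s http://localhost:9308/metrics | lag_over_threshold 10
```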

Step 5. Trigger alert for testing

To verify the alert, simulate lag by stopping the consumer while the producer continues to send messages.

  1. Run the consumer and ensure lag is 0:

    cd ~/kafka
    bin/kafka-console-consumer.sh \
    --bootstrap-server ${BOOTSTRAP_SERVER} \
    --topic ${TOPIC_NAME} \
    --group ${GROUP_NAME} \
    --from-beginning
    Environment variable    Description
    BOOTSTRAP_SERVER        Kafka cluster bootstrap server
    TOPIC_NAME              Kafka topic name
    GROUP_NAME              Consumer group name, e.g. lag-group
  2. Press Ctrl+C to stop the consumer, then start a producer and send new messages:

    cd ~/kafka
    bin/kafka-console-producer.sh \
    --bootstrap-server ${BOOTSTRAP_SERVER} \
    --topic ${TOPIC_NAME}
    Environment variable    Description
    BOOTSTRAP_SERVER        Kafka cluster bootstrap server
    TOPIC_NAME              Kafka topic name

    At the producer prompt, enter at least 10 messages so that the lag reaches the alert threshold:

    > test-1
    > test-2
    > test-3
    ...
    > test-12
  3. The alert fires only if the lag stays at or above the threshold for at least 1 minute, as defined by for: 1m. Wait at least a minute, then confirm that the notification arrives.
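Instead of typing the test messages by hand, they can be generated and piped into the console producer, which reads one message per line from standard input. The sketch below is one way to do that; the function name gen_test_messages is hypothetical.

```shell
# Hypothetical helper: print test-1 .. test-N, one message per line.
gen_test_messages() {
  count=$1
  i=1
  while [ "$i" -le "$count" ]; do
    echo "test-$i"
    i=$(( i + 1 ))
  done
}

gen_test_messages 3
# prints test-1, test-2, and test-3 on separate lines

# Usage (run from ~/kafka while the consumer is stopped):
#   gen_test_messages 12 | bin/kafka-console-producer.sh \
#     --bootstrap-server "${BOOTSTRAP_SERVER}" --topic "${TOPIC_NAME}"
```

Sending 12 messages keeps the lag above the threshold of 10 used in the alert rule.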