Tutorial series | Kubeflow traffic prediction model

1. Explore data and develop model

📈 Analyze log data and develop a machine learning–based prediction model.

Basic information
  • Estimated time: 60 minutes
  • Recommended OS: macOS, Ubuntu

About this scenario

Using Notebooks, one of the core components of Kubeflow, you will develop a machine learning model that predicts traffic volume over time from load balancer log data. This tutorial walks you through data preprocessing, visualization, feature engineering, model training, and performance evaluation.

Key topics include:

  • Organizing and visualizing log data on an hourly basis
  • Performing feature engineering that reflects periodicity
  • Exploring and evaluating ML models using Scikit-learn
  • Saving the trained model for use in the serving phase

Before you start

1. Prepare Kubeflow environment

You must have a Kubeflow environment set up in advance. Refer to the prerequisites to make sure that a CPU-based node pool and PVC volumes are already created.

2. Launch Notebook instance

  1. Go to the Kubeflow dashboard and select the Notebook menu on the left.

  2. Click the [+ New Notebook] button and configure it as follows:

    | Item | Setting |
    | --- | --- |
    | Name | kc-lb-pred-handson |
    | Notebook | JupyterLab |
    | Image | kc-kubeflow/jupyter-scipy:v1.8.0.py311.1a |
    | CPU / RAM | 0.5 / 1 |
    | Workspace Volume | (User-defined) |
    | Data Volumes | (See table below) |
    | Affinity / Tolerations | (CPU node pool) / None |
  3. For data volumes, use the PVCs created during the prerequisites stage and set their Mount Path as shown below:

    | Item | Setting 1 | Setting 2 | Setting 3 |
    | --- | --- | --- | --- |
    | Type | Kubernetes Volume | Kubernetes Volume | Kubernetes Volume |
    | Name | dataset-pvc | model-pvc | artifact-pvc |
    | Mount path | /home/jovyan/dataset | /home/jovyan/models | /home/jovyan/artifacts |
  4. Once the Notebook is created, click the [Connect] button to access the JupyterLab environment.
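
After connecting, you can optionally verify from a notebook cell that the three data volumes are mounted at the expected paths. This is a minimal check; the paths correspond to the mount paths configured in the table above.

Check mounted volumes (optional)
import os

# Mount paths configured for the data volumes above
mount_paths = [
    '/home/jovyan/dataset',
    '/home/jovyan/models',
    '/home/jovyan/artifacts',
]

for path in mount_paths:
    # Each path should exist as a directory if its PVC was mounted correctly
    print(path, 'mounted:', os.path.isdir(path))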

Getting started

Step 1. Prepare log data

In this tutorial, you will build a model that predicts the number of log occurrences over time using load balancer logs. Instead of using raw log fields, the target variable is the aggregated log count per time unit.

Network load balancer (NLB) logs are collected every 30 minutes, and each log record is written as a single line of JSON.

Synthetic data will be used instead of real-world logs. Download the file from the link below and upload it to the /home/jovyan path in JupyterLab:

Log data field structure
| Field | Description |
| --- | --- |
| project_id | Project ID |
| time | Timestamp of log generation |
| lb_id | Load balancer resource ID |
| listener_id | Listener ID |
| client_port | Client IP and port |
| destination_port | Destination IP and port |
| tls_cipher | OpenSSL-style cipher group |
| tls_protocol_version | TLS protocol version |
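
Each line of the log file is one JSON object with the fields above. The snippet below prints a purely hypothetical record, with every value invented for illustration only, so you can see the expected one-object-per-line shape; the actual values in the downloaded file will differ.

Example record shape (hypothetical values)
import json

# Hypothetical record: all values below are invented for illustration only
example_record = {
    "project_id": "example-project-id",
    "time": "2024/04/01 00:00:12:345678",
    "lb_id": "example-lb-id",
    "listener_id": "example-listener-id",
    "client_port": "203.0.113.10:51234",
    "destination_port": "10.0.0.5:443",
    "tls_cipher": "ECDHE-RSA-AES128-GCM-SHA256",
    "tls_protocol_version": "TLSv1.2",
}

# Each line of the log file is one such object serialized as JSON
print(json.dumps(example_record))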


Step 2. Preprocess NLB data

  1. Load the log file and convert each JSON record into a row of a pandas DataFrame.

    Load and aggregate logs
    import json
    import pandas as pd # Import module for data preprocessing

    file_path = '/home/jovyan/nlb-raw.txt' # NLB file path
    raw_data = []
    # Load NLB dataset
    with open(file_path) as f:
        for line in f:  # Read each line of JSON-formatted text
            data = json.loads(line)  # Convert JSON string to Python dict
            raw_data.append(data)

    raw_df = pd.json_normalize(raw_data) # Convert key-value pairs into tabular format using pandas DataFrame
  2. Aggregate log counts into 30-minute intervals and structure the result as a time-based DataFrame.

    Aggregate logs
    # Round timestamps to 30-minute intervals
    log_time_sr = pd.to_datetime(raw_df['time'], format='%Y/%m/%d %H:%M:%S:%f').dt.floor('30min')

    # Count logs per 30-minute interval and store as a dictionary
    log_count_dict = log_time_sr.value_counts(dropna=False).to_dict()

    # Generate time range (30-minute intervals)
    time_range = pd.date_range(start='2024-04-01', end='2024-05-01', freq='30min')

    # Create DataFrame with timestamps and corresponding log counts
    df = pd.DataFrame({'datetime': time_range})
    df['count'] = df['datetime'].apply(lambda x: log_count_dict.get(x, 0))
  3. Preview the result of log aggregation.

    Preview result
    df.head(5)

Sample output

|   | datetime | count |
| --- | --- | --- |
| 0 | 2024-04-01 00:00:00 | 26 |
| 1 | 2024-04-01 00:30:00 | 29 |
| 2 | 2024-04-01 01:00:00 | 51 |
| 3 | 2024-04-01 01:30:00 | 32 |
| 4 | 2024-04-01 02:00:00 | 69 |
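
As an optional sanity check, you can confirm that the aggregated DataFrame covers the full month at 30-minute resolution:

Optional sanity check
# 30 days x 48 half-hour slots, plus the 2024-05-01 00:00 endpoint -> 1441 rows
print(len(df))
# Total number of log records counted within the time range
print(df['count'].sum())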

Step 3. Analyze NLB data

Visualize the log count data to identify basic traffic distribution patterns.

  1. Import necessary modules and add time-related columns to the dataset.

    Add time-related columns
    import matplotlib.pyplot as plt

    df['week'] = df['datetime'].dt.isocalendar().week
    df['dow'] = df['datetime'].dt.day_of_week
    df['hour'] = df['datetime'].dt.hour
  2. Plot the log count changes over the entire period.

    Time series of total log counts
    plt.figure(figsize=(15, 5))
    plt.title('NLB Log Count')
    plt.xlabel('Date')
    plt.ylabel('Count')
    plt.plot(df['datetime'], df['count'])
    plt.show()


    The chart reveals a periodic pattern in overall traffic volume.

  3. Create a comparison plot to observe weekly trends in log volume.

    Compare log counts by week over a 4-week span:

    Weekly log trends
    fig, axs = plt.subplots(2, 2, figsize=(20, 10))
    fig.suptitle('Log Count by Week - 4 weeks')

    for i in range(4):
        axs[i//2][i%2].set_ylim(0, max(df['count']))
        axs[i//2][i%2].plot(df[df['week'] == 14+i]['datetime'], df[df['week'] == 14+i]['count'])
        axs[i//2][i%2].set_title(f'Week {1+i}')

    plt.show()


    The log count graph shows repeating traffic patterns depending on the day of the week.

  4. Visualize the distribution of log counts by hour and day of the week using heatmaps.

    Hour-Day heatmap
    fig, axs = plt.subplots(2, 2, figsize=(15, 8))
    fig.suptitle('Log Count Heatmap by Week')

    for i in range(4):
        week = i + 14
        df_grouped = df[df['week'] == week].groupby(["hour", "dow"])["count"].sum().reset_index()
        df_heatmap = df_grouped.pivot(index="dow", columns="hour", values="count")
        axs[i//2][i%2].set_title(f'Week {i+1}')
        axs[i//2][i%2].imshow(df_heatmap, cmap='hot', interpolation='nearest')
        axs[i//2][i%2].set_xticks(range(24))
        axs[i//2][i%2].set_xticklabels(range(24))
        axs[i//2][i%2].set_yticks(range(7))
        axs[i//2][i%2].set_yticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
        axs[i//2][i%2].set_xlabel('Hour')
        axs[i//2][i%2].set_ylabel('DoW')

    plt.show()


    The heatmaps reveal concentrated traffic patterns on specific weekdays and times of day.

Step 4. Feature engineering

Log counts show a periodic pattern depending on the time of day and day of the week. Since 11 PM is followed by midnight, and Sunday is followed by Monday, these cyclical characteristics must be encoded in a way that the model can learn. To do so, we apply cyclic encoding to transform the time-related variables before training the machine learning model.

The hour and day-of-week values are cyclic by nature, so we use sine and cosine functions to encode their periodicity. Since the data is collected every 30 minutes, a full day is divided into 48 time intervals.

Create features with cyclic encoding
import numpy as np

time_sr = df['datetime'].apply(lambda x: x.hour * 2 + x.minute // 30)
dow_sr = df['datetime'].dt.dayofweek

dataset = pd.DataFrame()
dataset['datetime'] = df['datetime'] # for convenience in later steps

# Time-related features x1, x2
dataset['x1'] = np.sin(2*np.pi*time_sr/48)
dataset['x2'] = np.cos(2*np.pi*time_sr/48)
# Day-of-week features x3, x4
dataset['x3'] = np.sin(2*np.pi*dow_sr/7)
dataset['x4'] = np.cos(2*np.pi*dow_sr/7)

# Target variable (label)
dataset['y'] = df['count']

# Save dataset for later steps
dataset.to_csv('/home/jovyan/dataset/nlb-sample.csv', index=False)

dataset.head(5)
Sample of generated training dataset
|   | datetime | x1 | x2 | x3 | x4 | y |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024-04-01 00:00:00 | 0.000000 | 1.000000 | 0.0 | 1.0 | 26 |
| 1 | 2024-04-01 00:30:00 | 0.130526 | 0.991445 | 0.0 | 1.0 | 29 |
| 2 | 2024-04-01 01:00:00 | 0.258819 | 0.965926 | 0.0 | 1.0 | 51 |
| 3 | 2024-04-01 01:30:00 | 0.382683 | 0.923880 | 0.0 | 1.0 | 32 |
| 4 | 2024-04-01 02:00:00 | 0.500000 | 0.866025 | 0.0 | 1.0 | 69 |

As shown above, the dataset consists of four features and a target variable:

  • x1, x2: time (hour + minute) encoded using sine/cosine
  • x3, x4: day of week encoded using sine/cosine
  • y: number of logs at each timestamp
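
As a quick optional illustration of why cyclic encoding helps: with the raw slot index, 23:30 (slot 47) and 00:00 (slot 0) look maximally far apart, but after the sine/cosine transform they become neighboring points on the unit circle, while times half a day apart remain distant. The encode_slot helper below is introduced only for this illustration.

Illustrate cyclic encoding (optional)
import numpy as np

def encode_slot(slot):
    # Map a half-hour slot index (0-47) to a point on the unit circle
    return np.array([np.sin(2*np.pi*slot/48), np.cos(2*np.pi*slot/48)])

# 23:30 (slot 47) vs. 00:00 (slot 0): small distance, because they are adjacent in time
print(np.linalg.norm(encode_slot(47) - encode_slot(0)))

# 00:00 (slot 0) vs. 12:00 (slot 24): large distance, because they are half a day apart
print(np.linalg.norm(encode_slot(0) - encode_slot(24)))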

Step 5. Train, evaluate, and save ML model

  1. Before training the model, the dataset is split into training and validation sets for evaluation purposes. In this tutorial, data through week 16 (up to April 21, 2024) is used for training, and data from April 22, 2024 onward is used for validation.

    Splitting training and validation data
    train_df = dataset[dataset['datetime'].dt.isocalendar().week < 17]
    test_df = dataset[dataset['datetime'].dt.isocalendar().week >= 17]
  2. To streamline performance analysis, two helper functions are defined: one that prints evaluation scores and one that visualizes the prediction results.

    Model evaluation and prediction result visualization
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    def print_metrics(y, y_pred):
        mse = mean_squared_error(y, y_pred)
        mae = mean_absolute_error(y, y_pred)
        r2 = r2_score(y, y_pred)
        print(f"MSE: {mse:.4f}\nMAE: {mae:.4f}\nR2: {r2:.4f}")

    def draw_predictions(test_df, predictions, model_name=None):
        plt.figure(figsize=(20, 5))
        if model_name:
            plt.title(model_name)
        plt.xlabel('Date')
        plt.ylabel('Count')
        plt.plot(test_df['datetime'], test_df['y'], label='Actual')
        plt.plot(test_df['datetime'], predictions, label='Prediction')
        plt.legend()
        plt.show()
  3. We apply various machine learning models from Scikit-learn and compare their prediction accuracy. All models are trained using the same four input features (x1, x2, x3, x4). The resulting prediction graphs visually illustrate how closely each model’s predictions align with actual traffic.

    Linear Regression
    from sklearn.linear_model import LinearRegression

    params = {
        # Refer to the model documentation when setting parameters
    }
    model = LinearRegression(**params)
    model.fit(train_df[['x1', 'x2', 'x3', 'x4']], train_df['y'])
    predictions = model.predict(test_df[['x1', 'x2', 'x3', 'x4']])
    print_metrics(test_df['y'], predictions)
    draw_predictions(test_df=test_df,
                     predictions=predictions,
                     model_name='Linear Regression')


    Gaussian Naive Bayes
    from sklearn.naive_bayes import GaussianNB
    params = {
        # Set parameters with reference to the model documentation
    }
    model = GaussianNB(**params)
    model.fit(train_df[['x1', 'x2', 'x3', 'x4']], train_df['y'])
    predictions = model.predict(test_df[['x1', 'x2', 'x3', 'x4']])
    print_metrics(test_df['y'], predictions)
    draw_predictions(test_df=test_df,
                     predictions=predictions,
                     model_name='Gaussian Naive Bayes')


    Random Forest Regressor
    from sklearn.ensemble import RandomForestRegressor

    params = {
        # Set parameters with reference to the model documentation
    }
    model = RandomForestRegressor(**params)
    model.fit(train_df[['x1', 'x2', 'x3', 'x4']], train_df['y'])
    predictions = model.predict(test_df[['x1', 'x2', 'x3', 'x4']])
    print_metrics(test_df['y'], predictions)
    draw_predictions(test_df=test_df,
                     predictions=predictions,
                     model_name='Random Forest')


    Gradient Boosting Regressor
    from sklearn.ensemble import GradientBoostingRegressor

    params = {
        # Set parameters with reference to the model documentation
    }
    model = GradientBoostingRegressor(**params)
    model.fit(train_df[['x1', 'x2', 'x3', 'x4']], train_df['y'])
    predictions = model.predict(test_df[['x1', 'x2', 'x3', 'x4']])
    print_metrics(test_df['y'], predictions)
    draw_predictions(test_df=test_df,
                     predictions=predictions,
                     model_name='Gradient Boosting')


    Tip

    For details on the models used in this tutorial, refer to the scikit-learn documentation for each model.
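
    Optionally, the four models above can also be trained and scored in a single loop so the metrics are easier to compare side by side. The sketch below is a minimal example using default parameters; it reuses the print_metrics helper and the train/test DataFrames defined earlier.

    Compare models in one loop (optional)
    from sklearn.linear_model import LinearRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    features = ['x1', 'x2', 'x3', 'x4']
    candidates = {
        'Linear Regression': LinearRegression(),
        'Gaussian Naive Bayes': GaussianNB(),
        'Random Forest': RandomForestRegressor(),
        'Gradient Boosting': GradientBoostingRegressor(),
    }

    for name, candidate in candidates.items():
        candidate.fit(train_df[features], train_df['y'])  # Train on the training weeks
        preds = candidate.predict(test_df[features])      # Predict on the validation period
        print(f'--- {name} ---')
        print_metrics(test_df['y'], preds)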

  4. The trained model is saved as a file using the joblib library.

  5. Since the saved model file will be used in the upcoming model deployment tutorial, it is stored in the mounted model-pvc volume at the path /home/jovyan/models.

    Save model
    import os
    import joblib

    model_dir = '/home/jovyan/models/lb-predictor'
    os.makedirs(model_dir, exist_ok=True)

    model_path = os.path.join(model_dir, 'model.joblib')
    joblib.dump(model, model_path) # Save model
    # model = joblib.load(model_path) # Load model
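
    Before moving on to the serving tutorial, you can optionally reload the saved file and confirm that it produces the same predictions as the in-memory model. This is a minimal check; loaded_model is simply whichever model was last assigned to model above.

    Verify the saved model (optional)
    import numpy as np

    loaded_model = joblib.load(model_path)  # Reload the model saved above

    # Predictions from the reloaded model should match the in-memory model
    reloaded_predictions = loaded_model.predict(test_df[['x1', 'x2', 'x3', 'x4']])
    print(np.allclose(reloaded_predictions, model.predict(test_df[['x1', 'x2', 'x3', 'x4']])))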