Implementing a Predictive Model with Kubeflow Notebook

This guide explains how to implement a taxi fare prediction model using TLC Trip Record Data in the KakaoCloud Kubeflow environment.

Basic information

Before starting

This tutorial walks through implementing a taxi fare prediction model in a Jupyter Notebook in the KakaoCloud Kubeflow environment. By working through the steps from data preprocessing to model training and evaluation, you will learn the fundamentals of building a machine learning model on real data and gain experience training models and building pipelines in Kubeflow.

About this scenario

In this scenario, you will implement a taxi fare prediction model using TLC Trip Record Data within Kubeflow. The key topics covered in this scenario are:

  • Creating and utilizing Jupyter Notebook instances in Kubeflow
  • Performing data preprocessing and exploratory data analysis (EDA)
  • Implementing and training a simple machine learning model in the notebook
  • Automating the model training process by building a pipeline in Kubeflow

Prework

1. Prepare training dataset

This tutorial uses the publicly available TLC Trip Record Data from New York City and provides a pipeline manifest file for the exercise, allowing you to practice data preprocessing and training.

Item | Description
Goal | Implement a taxi fare prediction model
Data information | Yellow taxi fare data from the New York City Taxi and Limousine Commission (2009–2015). Includes pickup and drop-off times and locations, trip distances, fares, payment types, passenger counts, and more.

Original dataset information
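
As a rough illustration of the kind of preprocessing this dataset calls for, the sketch below drops implausible trips and derives a trip duration feature using only the Python standard library. The field names (`pickup_datetime`, `dropoff_datetime`, `trip_distance`, `fare_amount`, `passenger_count`), the timestamp format, the filtering rules, and the sample rows are all assumptions made for illustration. The actual column names vary by year in the TLC files, and the exercise notebooks may preprocess the data differently.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def clean_trip(row):
    """Return a feature dict for one trip, or None if the row looks invalid."""
    try:
        pickup = datetime.strptime(row["pickup_datetime"], TIME_FORMAT)
        dropoff = datetime.strptime(row["dropoff_datetime"], TIME_FORMAT)
        distance = float(row["trip_distance"])
        fare = float(row["fare_amount"])
        passengers = int(row["passenger_count"])
    except (KeyError, ValueError):
        return None  # missing field or unparseable value

    duration_min = (dropoff - pickup).total_seconds() / 60.0
    # Illustrative sanity checks: all quantities must be strictly positive.
    if duration_min <= 0 or distance <= 0 or fare <= 0 or passengers <= 0:
        return None

    return {
        "trip_distance": distance,
        "duration_min": duration_min,
        "pickup_hour": pickup.hour,
        "passenger_count": passengers,
        "fare_amount": fare,
    }


def preprocess(rows):
    """Clean every row and keep only the valid ones."""
    cleaned = (clean_trip(row) for row in rows)
    return [trip for trip in cleaned if trip is not None]


# Made-up rows for demonstration only; these are not real TLC records.
sample_rows = [
    {"pickup_datetime": "2015-01-15 19:05:39",
     "dropoff_datetime": "2015-01-15 19:23:39",
     "trip_distance": "1.59", "fare_amount": "12.0", "passenger_count": "1"},
    {"pickup_datetime": "2015-01-15 20:00:00",
     "dropoff_datetime": "2015-01-15 20:10:00",
     "trip_distance": "2.0", "fare_amount": "-5.0", "passenger_count": "1"},
    {"pickup_datetime": "not a timestamp",
     "dropoff_datetime": "2015-01-15 20:10:00",
     "trip_distance": "2.0", "fare_amount": "8.0", "passenger_count": "2"},
]

trips = preprocess(sample_rows)
print(len(trips), "valid trip(s) kept")
```

Only the first sample row survives: the second has a negative fare and the third has an unparseable pickup time.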

2. Set up the Kubeflow environment

This tutorial uses a notebook in a CPU node pool environment.

If the Kubeflow service or the appropriate environment is not ready, refer to the Create Jupyter Notebook document to create a CPU-based notebook.

Notebook practice

This tutorial provides two practice scenarios: training a prediction model in the notebook and building a pipeline to train the prediction model.

Practice 1. Train a predictive model in the notebook

  1. Download the nyc_taxi_pytorch_run_in_notebook.ipynb file needed for the exercise.

  2. Access the Kubeflow notebook instance created in the previous step. Click the [Upload File] button in the top left to upload the example file.

    Image. Upload file to Jupyter notebook console

  3. Once the upload is complete, you can check the example file in the left tab. Select the uploaded example file and review the exercise content in the right-hand display area.

    Image. (Practice 1) Example file upload complete

  4. Follow the exercise steps and run the model training.
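
To give a feel for what "training and evaluating a model" involves before opening the notebook, here is a minimal sketch of a fare regression fitted by gradient descent. It is a plain-Python stand-in, not the notebook's code: the file name suggests the exercise uses PyTorch, whereas this sketch uses no framework. The data is synthetic, and the single-feature model, learning rate, and epoch count are illustrative assumptions.

```python
import math
import random


def train_linear_model(distances, fares, lr=0.01, epochs=3000):
    """Fit fare = w * distance + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(distances)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(distances, fares):
            err = (w * x + b) - y
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def rmse(w, b, distances, fares):
    """Root mean squared error of the linear model on the given data."""
    squared = [((w * x + b) - y) ** 2 for x, y in zip(distances, fares)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data (not TLC records): fare = 2.5 + 2.0 * distance + noise.
rng = random.Random(0)
distances = [rng.uniform(0.5, 10.0) for _ in range(200)]
fares = [2.5 + 2.0 * d + rng.gauss(0.0, 0.5) for d in distances]

# Hold out the last 40 samples for evaluation.
train_x, test_x = distances[:160], distances[160:]
train_y, test_y = fares[:160], fares[160:]

w, b = train_linear_model(train_x, train_y)
test_rmse = rmse(w, b, test_x, test_y)
print(f"w={w:.2f}, b={b:.2f}, test RMSE={test_rmse:.2f}")
```

Evaluating on held-out samples rather than on the training set is the same idea the notebook applies at larger scale: the error reported should reflect trips the model has not seen.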

Practice 2. Create a pipeline and train a predictive model in the notebook

  1. Download the nyc_taxi_pytorch_build_pipeline_cpu.ipynb file needed for the exercise.

  2. Access the Kubeflow notebook instance created in the previous step. Click the [Upload File] button in the top left to upload the example file.

    Image. Upload file to Jupyter notebook console

  3. Once the upload is complete, you can view the example file in the left tab. Select the uploaded example file and review the exercise content in the right-hand display area.

  4. In the notebook, enter the environment variable values required for the exercise: the private IP of the load balancer connected to Kubeflow, and the email and password used to sign in to Kubeflow.

    Image. (Practice 2) Example file upload complete

  5. Follow the exercise steps and run the model training.
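
The connection values entered in step 4 could be gathered and validated along the following lines. This is a sketch under assumptions: the variable names `KUBEFLOW_HOST`, `KUBEFLOW_USERNAME`, and `KUBEFLOW_PASSWORD` are hypothetical, as is the choice of plain HTTP for the private IP, and the example notebook may define these values differently. The code only checks and normalizes the settings; it does not connect to Kubeflow.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class KubeflowSettings:
    host: str
    username: str
    password: str


def load_settings(env=os.environ):
    """Read connection settings from environment variables (hypothetical names)."""
    required = ("KUBEFLOW_HOST", "KUBEFLOW_USERNAME", "KUBEFLOW_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))

    host = env["KUBEFLOW_HOST"].strip()
    # Assumption: a bare private IP is reached over plain HTTP.
    if not host.startswith(("http://", "https://")):
        host = "http://" + host

    return KubeflowSettings(
        host=host.rstrip("/"),
        username=env["KUBEFLOW_USERNAME"],
        password=env["KUBEFLOW_PASSWORD"],
    )


# Placeholder values only; 192.0.2.10 is an address reserved for documentation.
settings = load_settings({
    "KUBEFLOW_HOST": "192.0.2.10",
    "KUBEFLOW_USERNAME": "user@example.com",
    "KUBEFLOW_PASSWORD": "change-me",
})
print(settings.host)
```

Failing early with a clear message when a value is missing is usually easier to debug than an authentication error raised partway through a pipeline submission.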

Delete resources (Optional)

When the exercise is complete or the service is no longer in use, it is recommended to delete the resources as follows:

  1. In the Runs tab of the Kubeflow dashboard, you can check the status of the runs. Once the task is complete, move the run to Archived and click the Delete button.

    Image. Delete run

  2. After deleting the run, confirm that the pod has been deleted.

    Image. Confirm pod deletion