Implementing a Predictive Model with Kubeflow Notebook
This guide explains how to implement a taxi fare prediction model using TLC Trip Record Data in the KakaoCloud Kubeflow environment.
- Estimated time: 10 minutes
- Recommended OS: macOS, Ubuntu
- Region: kr-central-2
Before starting
This tutorial provides instructions on how to implement a taxi fare prediction model using Jupyter Notebook in the KakaoCloud Kubeflow environment. By practicing steps from data preprocessing to model training and evaluation, you will understand the fundamentals of building a machine learning model with real data and gain experience in training models and building pipelines in Kubeflow.
About this scenario
In this scenario, you will implement a taxi fare prediction model using TLC Trip Record Data within Kubeflow. The key topics covered in this scenario are:
- Creating and utilizing Jupyter Notebook instances in Kubeflow
- Performing data preprocessing and exploratory data analysis (EDA)
- Implementing and training a simple machine learning model in the notebook
- Automating the model training process by building a pipeline in Kubeflow
Prework
1. Prepare training dataset
This tutorial uses the publicly available TLC Trip Record Data from New York City and provides a pipeline manifest file for the exercise, allowing you to practice data preprocessing and training.
| Item | Description |
| --- | --- |
| Goal | Implement a taxi fare prediction model |
| Data information | Yellow taxi fare data from the New York City Taxi and Limousine Commission (2009–2015), including pickup and drop-off times and locations, trip distances, fares, payment types, passenger counts, and more |
| Original dataset | TLC Trip Record Data published by the New York City Taxi and Limousine Commission |
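The notebook exercises cover preprocessing and exploratory analysis of this data. Purely as an illustration of the kind of steps involved, the sketch below loads a trip-record file and derives a duration feature. The file name and column names follow the public TLC yellow-taxi schema (`tpep_pickup_datetime`, `fare_amount`, and so on) and are assumptions that may not match the exercise file exactly.

```python
import pandas as pd

# Hypothetical local file; the exercise notebook may load the data differently.
df = pd.read_csv(
    "yellow_tripdata_sample.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

# Basic EDA: dataset size and summary statistics of the fare column.
print(df.shape)
print(df["fare_amount"].describe())

# Derive trip duration in minutes and drop obviously invalid records.
df["trip_duration_min"] = (
    df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
).dt.total_seconds() / 60
df = df[
    (df["fare_amount"] > 0)
    & (df["trip_distance"] > 0)
    & (df["trip_duration_min"].between(1, 180))
]

# Features and target used for model training.
features = df[["trip_distance", "passenger_count", "trip_duration_min"]]
target = df["fare_amount"]
```

Filtering out zero or negative fares and implausible trip durations keeps the regression target well behaved before training.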
2. Set up the Kubeflow environment
This tutorial uses a notebook in a CPU node pool environment.
If the Kubeflow service or the appropriate environment is not ready, refer to the Create Jupyter Notebook document to create a CPU-based notebook.
Notebook practice
This tutorial provides two practice scenarios: training a prediction model in the notebook and building a pipeline to train the prediction model.
Practice 1. Train a predictive model in the notebook
1. Download the nyc_taxi_pytorch_run_in_notebook.ipynb file needed for the exercise.
2. Access the Kubeflow notebook instance created in the previous step, then click the [Upload File] button in the top left to upload the example file.
3. Once the upload is complete, the example file appears in the left tab. Select the uploaded example file and review the exercise content in the right-hand display area.
4. Follow the exercise steps and run the model training (a rough training sketch follows this list).
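The actual training code is provided in nyc_taxi_pytorch_run_in_notebook.ipynb. As a minimal, self-contained stand-in for what a PyTorch fare-regression loop looks like, the sketch below trains a small feed-forward model on synthetic data; the real notebook's features, model architecture, and hyperparameters will differ.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for preprocessed features (distance, passengers, duration).
torch.manual_seed(0)
X = torch.rand(1000, 3) * torch.tensor([20.0, 5.0, 60.0])  # rough feature scales
y = (2.5 + 2.0 * X[:, 0] + 0.5 * X[:, 2] + torch.randn(1000)).unsqueeze(1)

loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# A small feed-forward regressor; the exercise notebook's model may differ.
model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

In the exercise, the input tensors would come from the preprocessed TLC features rather than random data.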
Practice 2. Create a pipeline and train a predictive model in the notebook
1. Download the nyc_taxi_pytorch_build_pipeline_cpu.ipynb file needed for the exercise.
2. Access the Kubeflow notebook instance created in the previous step, then click the [Upload File] button in the top left to upload the example file.
3. Once the upload is complete, the example file appears in the left tab. Select the uploaded example file and review the exercise content in the right-hand display area.
4. For the exercise, enter the environment variable information, including the private IP of the load balancer connected to Kubeflow and the email and password used to sign in to Kubeflow (see the authentication sketch after this list).
5. Follow the exercise steps and run the model training.
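The environment variables from step 4 feed a common Kubeflow Pipelines authentication pattern: log in through Dex behind the load balancer, capture the session cookie, and create a KFP client with it. The sketch below shows that pattern under assumed variable names (KUBEFLOW_LB_PRIVATE_IP, KUBEFLOW_EMAIL, KUBEFLOW_PASSWORD, KUBEFLOW_NAMESPACE); the exercise notebook's actual variable names, namespace, and KFP SDK version may differ.

```python
import os
import requests
import kfp

# Assumed environment variable names; in the exercise these hold the values you entered.
HOST = f"http://{os.environ['KUBEFLOW_LB_PRIVATE_IP']}"  # load balancer private IP
USERNAME = os.environ["KUBEFLOW_EMAIL"]
PASSWORD = os.environ["KUBEFLOW_PASSWORD"]
NAMESPACE = os.environ.get("KUBEFLOW_NAMESPACE", "kubeflow-user-example-com")

# Log in through Dex and capture the auth session cookie.
session = requests.Session()
resp = session.get(HOST)
session.post(resp.url, data={"login": USERNAME, "password": PASSWORD})
cookie = session.cookies.get_dict()["authservice_session"]

# Create a KFP client against the pipelines API using the cookie.
client = kfp.Client(
    host=f"{HOST}/pipeline",
    namespace=NAMESPACE,
    cookies=f"authservice_session={cookie}",
)

# `fare_pipeline` is a placeholder for the pipeline function defined in the notebook.
# run = client.create_run_from_pipeline_func(fare_pipeline, arguments={})
```

Once the client authenticates, the notebook can compile its pipeline and submit runs, which then appear under the Runs tab of the Kubeflow dashboard.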
Delete resources (Optional)
When the exercise is complete or the service is no longer in use, it is recommended to delete the resources as follows:
1. In the Runs tab of the Kubeflow dashboard, check the status of the runs. Once the task is complete, move the run to Archived and click the Delete button.
2. After deleting the run, confirm that the pod has been deleted (one way to check from a notebook is sketched below).
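If you prefer to check from a notebook rather than the dashboard, the Kubernetes Python client can list the pods remaining in your profile namespace. This sketch assumes the notebook's service account is allowed to list pods and uses a placeholder namespace name.

```python
from kubernetes import client, config

# Inside a Kubeflow notebook the in-cluster service account config is available;
# fall back to the local kubeconfig when running elsewhere.
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

# Placeholder: replace with your own Kubeflow profile namespace.
namespace = "kubeflow-user-example-com"
pods = client.CoreV1Api().list_namespaced_pod(namespace=namespace)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```

A listing without the run's pods confirms the cleanup; running kubectl get pods -n <namespace> from a terminal gives the same answer.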