Parallel training with Kubeflow MIG
This tutorial shows how to implement a parallel training model using MIG (Multi-Instance GPU) instances and the Training Operator in the Kubeflow environment provided by KakaoCloud.
- Estimated time: 10 minutes
- Recommended OS: macOS, Ubuntu
- Note: In private network environments, training file downloads may not work properly.
About this scenario
This tutorial explains how to implement parallel training models using the MIG (Multi-Instance GPU) configuration in KakaoCloud, leveraging Kubeflow Notebooks and Pipelines. You will learn how to manage GPU resources efficiently and reduce training time by applying parallel processing.
The scenario uses the Fashion MNIST dataset to walk through the implementation of a parallel training model using Kubeflow’s MIG feature and Training Operator.
Main topics include:
- Optimizing GPU usage through MIG configuration
- Building a distributed training environment with the Training Operator in Kubeflow
- Training a prediction model using the Fashion MNIST dataset
- Improving training efficiency and managing GPU resources
Supported tools
Tool | Version | Description | Supported frameworks |
---|---|---|---|
Training Operator | v1-e1434f6 | Supports distributed training for various deep learning frameworks and enables fast model training across multiple GPUs | TensorFlow, PyTorch, Apache MXNet, XGBoost, Message Passing Interface (MPI) |
For more details on the Training Operator, refer to the official Kubeflow > Training Operator documentation.
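For orientation, the sketch below shows roughly what a Training Operator job that consumes MIG slices can look like when submitted with the Kubernetes Python client. It is an illustrative example, not the exact job used in this tutorial: the namespace, container image, and training script path are placeholders, and the MIG resource name assumes the NVIDIA device plugin's mixed MIG strategy.

```python
# Illustrative sketch: a PyTorchJob that requests one 1g.10gb MIG slice per replica.
# Namespace, image, and script path are placeholders; the actual job in this tutorial
# is created by the example notebook in Step 1.
from kubernetes import client, config

config.load_kube_config()

replica_template = {
    "spec": {
        "containers": [{
            "name": "pytorch",  # the Training Operator expects this default container name
            "image": "pytorch/pytorch:latest",             # placeholder image
            "command": ["python", "/workspace/train.py"],  # placeholder script
            "resources": {"limits": {"nvidia.com/mig-1g.10gb": 1}},  # assumes mixed MIG strategy
        }]
    }
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "parallel-train-pytorch", "namespace": "kubeflow-user-example-com"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "restartPolicy": "OnFailure", "template": replica_template},
            "Worker": {"replicas": 2, "restartPolicy": "OnFailure", "template": replica_template},
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="kubeflow-user-example-com", plural="pytorchjobs",
    body=pytorch_job,
)
```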
Before you start
1. Prepare training data
This hands-on lab uses the Fashion MNIST dataset, a popular benchmark dataset in computer vision. Unlike the original MNIST digits, Fashion MNIST consists of grayscale images of clothing items such as sneakers, shirts, and sandals. It contains 70,000 28×28-pixel images across 10 categories, and the dataset is downloaded automatically during training.
Fashion MNIST dataset
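As a quick reference, the dataset can be pulled directly with torchvision; the minimal sketch below assumes torch and torchvision are available in your notebook image.

```python
# Minimal sketch: download and inspect Fashion MNIST with torchvision
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
train_set = datasets.FashionMNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST(root="./data", train=False, download=True, transform=transform)
print(len(train_set), len(test_set))  # 60000 training images, 10000 test images
print(train_set.classes)              # the 10 clothing categories
```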
2. Set up Kubeflow environment
Before using the Training Operator, you must prepare a GPU node pool with MIG configuration. If you haven’t set up Kubeflow yet, refer to the Deploy Jupyter Notebooks on Kubeflow guide to create a Kubeflow environment with a GPU node pool and launch a CPU-based notebook.
Minimum requirements
- Node pool: vCPUs ≥ 4, memory ≥ 8 GB
- MIG instances: at least 3 instances of the 1g.10gb profile
- A GPU node pool for pipeline workloads must be configured
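To verify that the MIG slices are actually advertised to the cluster, you can list each node's allocatable nvidia.com resources. The sketch below is a minimal check using the Kubernetes Python client; it assumes your kubeconfig points at the Kubeflow cluster.

```python
# Minimal sketch: list MIG-sliced GPU resources advertised by each node
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig access to the cluster
for node in client.CoreV1Api().list_node().items:
    gpu_resources = {
        name: qty
        for name, qty in (node.status.allocatable or {}).items()
        if name.startswith("nvidia.com/")
    }
    if gpu_resources:
        print(node.metadata.name, gpu_resources)  # e.g. {'nvidia.com/mig-1g.10gb': '3'}
```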
Getting started
The step-by-step process for implementing a parallel training model in Kubeflow is as follows:
Step 1. Create TrainingJob for Fashion MNIST classification model
1. Download the example file fashionmnist_pytorch_parallel_train_with_tj.ipynb.
2. Open your Jupyter Notebook instance and upload the downloaded file.
Upload file to Jupyter Notebook
3. Once uploaded, open the file in the notebook interface and review its contents.
Open uploaded example file
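For reference, the sketch below shows the general shape of a PyTorch DistributedDataParallel training loop of the kind such a job runs on Fashion MNIST; it is not the exact code in the example notebook. It assumes the Training Operator injects the usual MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables into each replica, and that each replica sees its MIG slice as a single CUDA device.

```python
# Illustrative sketch of a DDP training loop for Fashion MNIST (not the notebook's exact code)
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


def main():
    # Rank and world size come from env vars set by the Training Operator (env:// init)
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    train_set = datasets.FashionMNIST(
        "./data", train=True, download=True, transform=transforms.ToTensor()
    )
    sampler = DistributedSampler(train_set)  # shards the data across replicas
    loader = DataLoader(train_set, batch_size=64, sampler=sampler)

    model = nn.Sequential(
        nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
    ).to(device)
    model = DDP(model, device_ids=[0] if torch.cuda.is_available() else None)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} finished, last batch loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```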
Step 2. Check TrainingJob status on dashboard
1. Go to the TrainingJob tab on the Kubeflow dashboard.
2. Select parallel-train-pytorch from the list.
3. Check the training status, logs, and event details on the detail page.
TrainingJob status
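If you want to check the same status outside the dashboard, the sketch below queries the job as a PyTorchJob custom resource with the Kubernetes Python client. The namespace is a placeholder for your Kubeflow profile namespace, and it assumes the dashboard's TrainingJob entry is backed by a PyTorchJob created by the notebook.

```python
# Sketch: read the PyTorchJob status conditions directly from the cluster
from kubernetes import client, config

config.load_kube_config()
job = client.CustomObjectsApi().get_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="kubeflow-user-example-com",  # placeholder: use your profile namespace
    plural="pytorchjobs", name="parallel-train-pytorch",
)
for condition in job.get("status", {}).get("conditions", []):
    print(condition["type"], condition["status"], condition.get("message", ""))
```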
Step 3. Monitor MIG instance usage
1. Log in to the KakaoCloud console and go to the Kubeflow menu.
2. Select the Kubeflow instance you created.
3. In the GPU Status tab, check the MIG instance usage during the training task.
GPU usage status
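As a cross-check from inside a running worker pod or notebook, you can also print the CUDA device the container actually sees; with MIG, the reported device name typically includes the slice profile. A minimal sketch assuming torch is installed in the container:

```python
# Minimal sketch: confirm which MIG slice is visible inside a training container
import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count(), "CUDA device(s) visible")
    # The device name typically includes the MIG profile, e.g. "... MIG 1g.10gb"
    print(torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible in this container")
```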