
Implementing a Parallel Training Model with Kubeflow MIG Instances

This guide shows how to implement a parallel training model using KakaoCloud's Kubeflow MIG (Multi-Instance GPU) instances and the Training Operator.

Basic information

Before starting

This tutorial shows how to implement a parallel training model in the Kubeflow notebook and pipeline environment, using MIG (Multi-Instance GPU) settings to make multiple GPU resources available to a single job. By following it, you can see how parallel processing affects resource management and training time, and learn how model training is distributed across GPU instances.

About this scenario

This scenario explains step-by-step how to implement a parallel training model using the Fashion MNIST dataset in the Kubeflow environment, utilizing MIG features and the Training Operator. Key topics covered in this scenario include:

  • Optimizing GPU resources through MIG settings
  • Configuring a distributed training environment using the Training Operator in Kubeflow
  • Training a prediction model using the Fashion MNIST dataset
  • Enhancing training efficiency and resource management

Supported tools

Tool: Training Operator
Version: v1-e1434f6
Description:
  • Supports distributed training for various deep learning frameworks
  • Provides fast model training across multiple GPU resources
Supported frameworks:
  • TensorFlow
  • PyTorch
  • Apache MXNet
  • XGBoost
  • Message Passing Interface (MPI)
Info: For details on the Training Operator, refer to the official Kubeflow > Training Operators documentation.

Prework

1. Prepare training data

This hands-on exercise uses the Fashion MNIST dataset, a benchmark dataset commonly used in computer vision to test new algorithms. Similar to MNIST, Fashion MNIST consists of small images that fall into clothing categories (sneakers, shirts, sandals, etc.). The dataset contains 10 categories and 70,000 grayscale images of 28x28 pixels, which will be automatically downloaded during the exercise.

Image. Fashion MNIST dataset
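To make the dataset description concrete, here is a minimal sketch of the figures above in Python. The 60,000/10,000 train/test split and the class names are the dataset's standard published values rather than something this guide states, and `samples_per_worker` is a hypothetical helper that only illustrates how a training set is typically divided among parallel workers.

```python
import math

# Standard Fashion MNIST label names, indexed by class id (0-9).
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

# Each image is a single-channel (grayscale) 28x28 array.
IMAGE_SHAPE = (1, 28, 28)

# Standard split of the 70,000 images (assumption: not stated in this guide).
TRAIN_IMAGES = 60_000
TEST_IMAGES = 10_000


def samples_per_worker(num_samples: int, world_size: int) -> int:
    """Hypothetical helper: samples each worker sees per epoch when the
    dataset is sharded evenly (rounded up) across `world_size` workers."""
    return math.ceil(num_samples / world_size)


if __name__ == "__main__":
    # With three parallel workers, each one handles a third of the training set.
    print(samples_per_worker(TRAIN_IMAGES, 3))
```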

2. Set up the Kubeflow environment

Before using the Training Operator in Kubeflow, verify the MIG settings and node pool specifications suitable for the hands-on exercise. If the Kubeflow environment is not set up, refer to the Setting up a Jupyter Notebook environment using Kubeflow document to create a Kubeflow environment with a GPU node pool and launch a notebook based on a CPU image.

Minimum required specifications

  • Node pool minimum specs: At least 4 vCPUs, 8GB of memory
  • MIG minimum specs: At least three 1g.10gb instances
  • GPU pipeline node pool: must be set up
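The MIG requirement can be read as a simple capacity check. The sketch below assumes that the job runs one master and two workers, each on one 1g.10gb instance, and that the cluster exposes MIG instances under the resource name `nvidia.com/mig-1g.10gb` (the naming used by the NVIDIA device plugin's mixed strategy). Both are assumptions; check your node pool for the actual resource name and the notebook for the actual replica layout.

```python
# Assumed Kubernetes resource name for a 1g.10gb MIG instance.
MIG_RESOURCE = "nvidia.com/mig-1g.10gb"

# Minimum number of 1g.10gb instances listed in this guide.
AVAILABLE_MIG_INSTANCES = 3


def mig_limits(instances_per_pod: int = 1) -> dict:
    """Build the `resources` section a training container would declare."""
    return {"limits": {MIG_RESOURCE: instances_per_pod}}


def required_instances(masters: int, workers: int, instances_per_pod: int = 1) -> int:
    """Total MIG instances needed by all replicas of a job."""
    return (masters + workers) * instances_per_pod


if __name__ == "__main__":
    needed = required_instances(masters=1, workers=2)
    print(f"needs {needed} of {AVAILABLE_MIG_INSTANCES} MIG instances")
    print(mig_limits())
```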

Step-by-step process

The detailed hands-on steps for implementing a parallel training model using Kubeflow are as follows.

Step 1. Create TrainingJob for the Fashion MNIST classification model

Use Kubeflow's Training Operator to create a TrainingJob that trains a classification model on the Fashion MNIST dataset.

  1. Download the fashionmnist_pytorch_parallel_train_with_tj.ipynb notebook file for the hands-on exercise.

  2. After downloading, open the notebook instance you created and upload the file in the Jupyter notebook console.

    Image. Upload file to Jupyter notebook console

  3. Once the upload is complete, you can check the exercise contents on the right-hand side of the screen.

    Image. Example file upload complete

  4. Review the example contents and proceed with the model training exercise.
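The notebook contains the actual job definition. For orientation only, the following sketch shows what a Training Operator job for PyTorch generally looks like when built as a Python dictionary. It assumes the TrainingJob is a `PyTorchJob` from the `kubeflow.org/v1` API. The namespace, container image, training script path, and MIG resource name are hypothetical placeholders, not values taken from the notebook.

```python
def build_pytorchjob(name: str, namespace: str, image: str, workers: int,
                     mig_resource: str = "nvidia.com/mig-1g.10gb") -> dict:
    """Build a PyTorchJob manifest with one master and `workers` workers.

    Every replica requests one MIG instance. The resource name is an
    assumption and depends on how the node pool exposes MIG devices.
    """
    def replica_spec(replicas: int) -> dict:
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [{
                # The Training Operator expects the container to be named "pytorch".
                "name": "pytorch",
                "image": image,
                # Hypothetical entry point; the notebook defines the real one.
                "command": ["python", "/workspace/train.py"],
                "resources": {"limits": {mig_resource: 1}},
            }]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica_spec(1),
            "Worker": replica_spec(workers),
        }},
    }


if __name__ == "__main__":
    job = build_pytorchjob(
        name="parallel-train-pytorch",        # job name shown in Step 2
        namespace="my-namespace",             # hypothetical namespace
        image="example.com/fmnist-train:latest",  # hypothetical image
        workers=2,
    )
    print(job["spec"]["pytorchReplicaSpecs"]["Worker"]["replicas"])
```

A manifest like this could be submitted with the Kubernetes Python client through `CustomObjectsApi.create_namespaced_custom_object` (group `kubeflow.org`, version `v1`, plural `pytorchjobs`). The notebook may use a different client or SDK.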

Step 2. Check the TrainingJob status on the dashboard

The Kubeflow dashboard provides a screen to check the specifications, logs, and events of the TrainingJob.

  1. Access the Kubeflow dashboard, then select the TrainingJob tab.

    Image. GPU status tab

  2. Click parallel-train-pytorch from the list.

    Image. GPU status tab

  3. Check the detailed status of the TrainingJob created in Step 1.

    Image. GPU status tab
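The status shown on the dashboard is also recorded on the job object itself. As a rough sketch, assuming the job is a Training Operator resource whose `status.conditions` list holds entries such as `Created`, `Running`, `Succeeded`, or `Failed`, the current state can be derived as below. The sample object is made up for illustration and is not output from this exercise.

```python
def current_condition(job: dict) -> str:
    """Return the type of the most recent condition whose status is "True".

    Conditions are appended in order, so the last active one reflects the
    job's current state. Returns "Unknown" if no condition is active yet.
    """
    conditions = job.get("status", {}).get("conditions", [])
    active = [c for c in conditions if c.get("status") == "True"]
    return active[-1]["type"] if active else "Unknown"


if __name__ == "__main__":
    # Made-up example of a finished job, shaped like a job fetched as JSON.
    sample_job = {
        "metadata": {"name": "parallel-train-pytorch"},
        "status": {"conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]},
    }
    print(current_condition(sample_job))
```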

Step 3. Check the MIG instance status

Use the KakaoCloud console to monitor the usage status of MIG instances. Review how GPU resources are distributed and used by the parallel training tasks, and assess how efficiently they are utilized.

  1. In the KakaoCloud Console, select the Kubeflow menu.

  2. In the Kubeflow menu, select the Kubeflow environment to review the details.

  3. In the GPU status tab, check the usage status of the MIG instance used for the exercise.

    Image. GPU status tab
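The console is the documented way to check MIG usage. As a complementary sketch, the same information can be approximated from Kubernetes node objects, assuming MIG instances are exposed as allocatable resources whose names start with `nvidia.com/mig-` (an assumption about the cluster's device plugin configuration). The node data below is made up and shaped like the items returned by `kubectl get nodes -o json`.

```python
def mig_allocatable(nodes: list) -> dict:
    """Sum allocatable MIG resources across a list of node objects."""
    totals = {}
    for node in nodes:
        allocatable = node.get("status", {}).get("allocatable", {})
        for resource, quantity in allocatable.items():
            # Assumed naming convention for MIG resources.
            if resource.startswith("nvidia.com/mig-"):
                totals[resource] = totals.get(resource, 0) + int(quantity)
    return totals


if __name__ == "__main__":
    # Made-up nodes for illustration; quantities are strings in the Kubernetes API.
    sample_nodes = [
        {"status": {"allocatable": {"cpu": "4", "nvidia.com/mig-1g.10gb": "2"}}},
        {"status": {"allocatable": {"cpu": "4", "nvidia.com/mig-1g.10gb": "1"}}},
    ]
    print(mig_allocatable(sample_nodes))
```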