Building MLOps workflows with Kubeflow

January 31, 2024 · 7 min read

Developer

Hello. In this post, we introduce Kubeflow, a core platform for machine learning operations.

Kubeflow is an open-source project designed to reduce the complexity of machine learning and to help data scientists and developers build and deploy machine learning models more easily and quickly. The official Kubeflow site introduces it as a project that helps you manage and operate a range of open-source machine learning tools on Kubernetes in one place.

Kubeflow grew out of TensorFlow Extended (TFX), which Google had used internally, and has since expanded into one of the most widely known end-to-end solutions for running machine learning workflows across Kubernetes-based environments.

One of Kubeflow's most innovative approaches is the integration of AutoML and Kubeflow Pipelines. This allows users to automate and optimize the training, evaluation, and deployment stages of models, reducing repetitive work in machine learning projects. In addition, multi-tenant support has been strengthened so that multiple teams can effectively share the same Kubeflow instance while isolating resources. The Kubeflow service provided by KakaoCloud is also designed to maximize the efficiency of machine learning work and make it easy for users to access.
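To make the idea of automated training, evaluation, and deployment stages more concrete, here is a minimal sketch of a pipeline as a dependency graph. It uses only the Python standard library and is not the Kubeflow Pipelines SDK; the step functions, the `run_pipeline` helper, and the 0.9 deployment threshold are all hypothetical names and values chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each one receives the outputs of the steps it depends on.
def train(upstream):
    return {"weights": [0.5, 1.5]}

def evaluate(upstream):
    weights = upstream["train"]["weights"]
    return {"score": sum(weights) / len(weights)}

def deploy(upstream):
    # Deploy only if the evaluation score clears an assumed threshold.
    return {"deployed": upstream["evaluate"]["score"] >= 0.9}

STEPS = {"train": train, "evaluate": evaluate, "deploy": deploy}

# Each step maps to the set of steps that must finish before it starts.
DEPENDENCIES = {"train": set(), "evaluate": {"train"}, "deploy": {"evaluate"}}

def run_pipeline(steps, dependencies):
    """Run every step in dependency order and collect the outputs."""
    order = list(TopologicalSorter(dependencies).static_order())
    outputs = {}
    for name in order:
        upstream = {dep: outputs[dep] for dep in dependencies[name]}
        outputs[name] = steps[name](upstream)
    return order, outputs

order, outputs = run_pipeline(STEPS, DEPENDENCIES)
print(order)
print(outputs["deploy"])
```

In a real pipeline, each step would run as its own container on the cluster, and the pipeline engine would handle scheduling, retries, and passing artifacts between steps.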

In this post, we introduce Kubeflow's major components, latest features, and various tutorial scenarios for using Kubeflow on KakaoCloud.

Kubeflow features

With the goal of flexible scaling and simple production deployment of machine learning models, Kubeflow supports the following in Kubernetes environments.

Easy, repeatable, and portable deployment: Pipelines created through Kubeflow make deployment easier across multiple environments, including cloud and on-premises environments.
Independent microservice deployment and management system: Based on a microservices architecture, Kubeflow enables independent management of each component.
Responsive scaling based on user requirements: Resources are automatically scaled according to user requirements to ensure optimal performance.

Key Kubeflow components

Kubeflow consists of multiple open-source components such as Central Dashboard, Jupyter Notebooks, Tensorboard, and Pipelines, each supporting a specific stage of the machine learning workflow. These components are designed to help users manage machine learning projects more efficiently.

Source: Kubeflow Ecosystem

Using these key components on Kubernetes, Kubeflow efficiently supports the entire process from machine learning model development and deployment to resource management.

Key Kubeflow component	Description
Central Dashboard	Provides a dashboard web console for accessing and monitoring multiple components.
Notebooks	Provides a Jupyter Notebook environment where data scientists can code directly within a cluster.
Tensorboard	Creates and manages TensorBoard servers, which visualize the training process and training data of models built with frameworks such as TensorFlow and PyTorch.
Pipelines	Simplifies complex machine learning workflows through scalable Docker-based pipelines.
Katib	Automates hyperparameter tuning for model training as Kubeflow's AutoML component.
Training Operator	Supports various machine learning frameworks and enables flexible training jobs.
KServe	A model-serving add-on that deploys and serves models efficiently and exposes them as real-time APIs for internal and external use.
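
As an illustration of what a real-time API call to a served model can look like, the sketch below builds a request in the shape of the KServe V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` payload). The host name, the model name `taxi-fare`, and the input values are placeholders we made up; an actual endpoint typically also needs authentication or routing headers that depend on how the cluster is configured.

```python
import json
from urllib.request import Request

def build_predict_request(base_url, model_name, instances):
    """Build (but do not send) a V1-protocol predict request."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint, model name, and feature values for illustration only.
request = build_predict_request(
    "http://kserve.example.com/", "taxi-fare", [[1.0, 2.5, 3.0]]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the resulting object to `urllib.request.urlopen` would send the request; we stop short of that here because the endpoint is fictional.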

KakaoCloud Kubeflow

KakaoCloud supports the latest features, including Kubeflow 1.6, and provides an optimized cloud environment that enables users to perform machine learning tasks easily and quickly. In particular, KakaoCloud Kubeflow has the following features.

Support for all Kubeflow 1.6 features

KakaoCloud Kubeflow lets you use all of the major Kubeflow components and add-ons introduced above. You can also install and use frameworks and libraries such as TensorFlow, PyTorch, Apache MXNet, MPI, XGBoost, Chainer, HuggingFace, and OpenAI SDK.

Granular access management

KakaoCloud Kubeflow provides role-based access control (RBAC): users are assigned namespaces according to their tasks and roles, and permissions can be managed efficiently per user or per group. Administrators can also set quotas per namespace to allocate CPU, memory, GPU memory, and storage resources to match each namespace's intended usage.
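
For readers who want to picture a per-namespace quota, the sketch below builds a manifest in the shape of the upstream Kubeflow `Profile` resource, which pairs a namespace owner with a Kubernetes resource quota. This reflects upstream Kubeflow rather than the KakaoCloud console, which may expose quotas differently. The namespace, the owner email, and the quota values are invented, and the upstream quota counts whole GPUs rather than GPU memory.

```python
import json

def build_profile(namespace, owner_email, cpu, memory, gpus, storage):
    """Return an upstream-style Kubeflow Profile manifest as a dict."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": namespace},
        "spec": {
            # The owner gets admin rights on the namespace.
            "owner": {"kind": "User", "name": owner_email},
            # Standard Kubernetes ResourceQuota keys applied to the namespace.
            "resourceQuotaSpec": {
                "hard": {
                    "cpu": cpu,
                    "memory": memory,
                    "requests.nvidia.com/gpu": gpus,
                    "requests.storage": storage,
                }
            },
        },
    }

# Hypothetical team namespace and limits, for illustration only.
profile = build_profile("team-a", "alice@example.com", "8", "32Gi", "1", "100Gi")
print(json.dumps(profile, indent=2))
```

Because JSON is valid YAML, the printed manifest could be saved to a file and applied with `kubectl apply -f` on a cluster where the Profile resource is installed.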

Flexible storage options

In addition to a standalone MinIO store, KakaoCloud supports Object Storage as a storage backend, which gives you more flexibility in serving model result files.

Optimized for NVIDIA MIG instances

KakaoCloud Kubeflow provides optimized MIG (Multi-Instance GPU) instances based on the NVIDIA A100. MIG settings partition a GPU's resources so that users can run multiple workloads efficiently on the same GPU.
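
To show how a workload might ask for one slice of a partitioned GPU, the sketch below builds the `limits` entry of a container spec. It assumes an A100 with 40 GB of memory and the resource naming used by the NVIDIA Kubernetes device plugin in its mixed strategy (`nvidia.com/mig-<profile>`); the post does not say which A100 variant or naming KakaoCloud uses, so treat both as assumptions. `mig_limits` is a hypothetical helper.

```python
# Maximum number of instances of each MIG profile on one A100 40GB GPU
# (assumed hardware; the 80GB variant uses different profile names).
A100_40GB_MIG_PROFILES = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "4g.20gb": 1,
    "7g.40gb": 1,
}

def mig_limits(profile, count=1):
    """Return a container `limits` dict requesting `count` MIG instances."""
    max_instances = A100_40GB_MIG_PROFILES.get(profile)
    if max_instances is None:
        raise ValueError(f"unknown MIG profile: {profile}")
    if not 1 <= count <= max_instances:
        raise ValueError(f"{profile} allows 1 to {max_instances} instances per GPU")
    return {f"nvidia.com/mig-{profile}": count}

limits = mig_limits("1g.5gb", 2)
print(limits)
```

The returned dict would go under `resources.limits` in a pod or training job spec, so that the scheduler places the container on a node exposing that slice size.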

Multi File Storage support

Each user or group can dynamically use as many independent File Storage instances as needed, which makes it easier to share files between pipelines and notebooks.

Usage examples with Kubeflow

KakaoCloud technical documentation provides rich Kubeflow tutorials that cover various stages of machine learning projects, from Jupyter Notebook setup to building parallel training models and creating model-serving APIs. By referring to these tutorials, you can learn about efficient model development, training, optimization, and deployment using KakaoCloud Kubeflow.

The Kubeflow-related tutorials currently available in KakaoCloud technical documentation are as follows.

Configure a Jupyter Notebook environment using Kubeflow
Introduces the process of configuring Jupyter Notebook using the Kubeflow service in a Kubernetes environment.
Implement a predictive model with Kubeflow Notebook
A hands-on example that implements a taxi fare prediction model using TLC Trip Record Data.
Train a predictive model using Kubeflow Pipelines
Introduces how to automate the training process of a machine learning model using Kubeflow Pipelines.
Manage machine learning experiments using Kubeflow Tensorboard
A hands-on example that uses the TensorBoard component to manage and visualize log data generated during machine learning experiments.
Tune hyperparameters with Kubeflow
A scenario that tunes the hyperparameters of a model trained on the MNIST dataset using Kubeflow and Katib.
Implement a parallel training model with a Kubeflow MIG instance
A scenario that implements a parallel training model using Kubeflow MIG (Multi-Instance GPU) instances and Training Operator.
Create a Kubeflow model serving API
A scenario that builds a machine learning pipeline using a dataset and provides the generated model as a web API.

Closing

Kubeflow is currently one of the most widely used open-source MLOps platforms in Korea and abroad. As a result, tutorials, case studies, and example source code are relatively abundant, which helps data scientists and analysts who are new to it get up to speed quickly.

KakaoCloud Kubeflow provides GPU optimization and powerful resource management features, with easy provisioning that takes advantage of the cloud environment. We will continue to improve the Kubeflow service so that KakaoCloud users can fully benefit from an efficient and secure MLOps platform. If you are considering a Kubeflow service for machine learning, be sure to try KakaoCloud's.

Thank you.
