3. Build RAG with LLM model
📘 Implement a document-based Q&A system (RAG) using the updated LLM model endpoint.
- Estimated duration: 60 minutes
- Recommended operating system: Ubuntu
About this scenario
In this scenario, you will use the latest LLM models—Kakao’s Kanana and Meta Llama 3.2—within a Kubeflow Jupyter Notebook environment to split and vectorize documents and build a Retrieval-Augmented Generation (RAG) pipeline. Through this exercise, you will implement a local document-based Q&A system using LangChain.
Key tasks include:
- Load, split, embed, and store documents as vectors
- Build a LangChain-based Retriever & Generator
- Invoke the LLM and test the RAG query pipeline
RAG implementation process
The full RAG flow consists of preprocessing (steps 1–4) and query handling (steps 5–8):
Step | Task Name | Description |
---|---|---|
1 | Load | Load documents from PDFs, text files, webpages, etc. |
2 | Split | Split documents into chunks based on chunk_size and overlap suitable for LLM input. |
3 | Embedding | Convert split text into semantic vectors. |
4 | VectorStore | Store embedded vectors in a vector DB (e.g., FAISS). |
5 | Retrieval | Use a retriever object to fetch documents relevant to the query. |
6 | Prompting | Construct prompts with retrieved documents and the query. |
7 | LLM | Generate responses using the LLM. |
8 | Output | Return the final answer to the user after the prompt → model call → response flow. |
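At a glance, the flow above maps onto LangChain components roughly as follows. This is only a schematic sketch to orient you; the exact classes, prompts, and model names used in this exercise are introduced step by step in the sections below.

# Schematic sketch of the RAG flow (illustrative names, not the exact code used below)
# Preprocessing (steps 1-4)
#   docs    = load_documents(...)                         # 1. Load
#   chunks  = splitter.split_documents(docs)              # 2. Split
#   store   = FAISS.from_documents(chunks, embeddings)    # 3-4. Embedding + VectorStore
# Query handling (steps 5-8)
#   context  = store.similarity_search(question)          # 5. Retrieval
#   messages = prompt.invoke({"question": question, "context": context})  # 6. Prompting
#   answer   = llm.invoke(messages)                       # 7-8. LLM + Output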
Supported tools
Tool | Version | Description |
---|---|---|
Jupyter Notebook | 4.2.1 | Web-based development environment supporting ML frameworks and Kubeflow SDK. |
KServe | 0.15.0 | Model serving tool offering fast deployment, updates, high availability, and automatic handling of common serving issues. |
Before you start
1. Prepare Kubeflow environment
To reliably build the RAG pipeline in Kubeflow, you need a node pool environment with the following specifications. Refer to the prerequisites and prepare an environment with a CPU or GPU-enabled node pool in advance.
2. Prepare training dataset
Use KakaoCloud’s technical documentation, including the Kubeflow service guide and tutorial texts, as the training dataset for this exercise.
Download the sample dataset below to get started:
- Download sample data: sample_rag_docs_dataset.csv
Getting started
In this exercise, you will build a document-based Q&A system (RAG: Retrieval-Augmented Generation) using the latest LLM models—Kakao’s Kanana-Nano-2.1B and Meta’s Llama 3.2.
Step 1. Create Jupyter Notebook instance
Create a GPU-based Jupyter Notebook instance for hands-on practice in the Kubeflow dashboard.
- In the Kubeflow dashboard, go to the Notebooks tab.
- Click the [New Notebook] button in the upper-right corner to create a new instance.
- In the New notebook configuration screen, enter the following settings:
  - Notebook Image: Select kc-kubeflow/jupyter-pytorch-cuda-full:v1.8.0.py311.1a
  - Minimum specs: At least 3 vCPUs, 6GB memory
  - GPU settings: Select 1 GPU with GPU Vendor NVIDIA MIG - 1g.10gb
- After entering the settings, click the [LAUNCH] button to start the notebook instance.
Step 2. Install packages
Install the required Python packages to implement the RAG pipeline.
! pip install langchain langchain-community langchain-openai transformers sentence-transformers faiss-cpu langgraph datasets accelerate langchain_huggingface
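To confirm the installation succeeded, you can optionally import the core packages and print a couple of version numbers; this is purely a convenience check.

# Optional: confirm the core packages import cleanly after installation.
import faiss, langchain, langgraph, transformers, sentence_transformers
print(langchain.__version__, transformers.__version__)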
Step 3. Load and convert documents
Load the documents provided in CSV format and convert them into LangChain Document objects.
from datasets import Dataset
dataset = Dataset.from_csv('sample_rag_docs_dataset.csv')
dataset
Generating train split:
17/0 [00:00<00:00, 1236.38 examples/s]
Dataset({
features: ['source', 'page_content'],
num_rows: 17
})
from langchain_core.documents import Document

docs = []
for _ea_doc_data in dataset:
    docs.append(Document(
        metadata={'source': _ea_doc_data['source']},
        page_content=_ea_doc_data['page_content']
    ))

len(docs)
17
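To verify the conversion, you can optionally inspect one of the resulting Document objects:

# Optional: check the metadata and a short preview of the first document.
print(docs[0].metadata['source'])
print(docs[0].page_content[:200])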
Step 4. Split documents
Split the converted documents into chunks suitable for LLM input.
These chunks will be used in the next steps for embedding and vector store construction.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
split_documents = text_splitter.split_documents(docs)
len(split_documents)
108
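Before embedding, you can optionally check that the chunk lengths stay near the configured chunk_size:

# Optional: summarize chunk lengths (in characters) produced by the splitter.
lengths = [len(d.page_content) for d in split_documents]
print(f"min={min(lengths)}, max={max(lengths)}, avg={sum(lengths) / len(lengths):.0f}")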
Step 5. Create embeddings and vector store
Convert the documents into high-dimensional vectors and store them in a searchable structure.
- Generate embeddings. Document embeddings represent each document chunk as a vector. Pick the option that matches your model: the Kanana model requires a custom wrapper, while the Llama 3.2 path can use the BAAI/bge-m3 model from Hugging Face with the HuggingFaceEmbeddings class (shown after the Kanana example below).

The Kanana-Nano-2.1B model is not natively supported by LangChain's embedding interface, so you need to implement a custom wrapper class as shown below.
Kanana embedding model setup

import numpy as np
import torch
from collections.abc import Mapping
from typing import List
from transformers import AutoModel

embeddings = AutoModel.from_pretrained(
    "kakaocorp/kanana-nano-2.1b-embedding",
    trust_remote_code=True,
)

Move to device and define embedding wrapper

# Recursively move tensors (and containers of tensors) to the target device
def _move_to_device(maybe_tensor, device: torch.device):
    if torch.is_tensor(maybe_tensor):
        return maybe_tensor.to(device, non_blocking=device.type == "cuda")
    elif isinstance(maybe_tensor, dict):
        return {key: _move_to_device(value, device) for key, value in maybe_tensor.items()}
    elif isinstance(maybe_tensor, list):
        return [_move_to_device(x, device) for x in maybe_tensor]
    elif isinstance(maybe_tensor, tuple):
        return tuple([_move_to_device(x, device) for x in maybe_tensor])
    elif isinstance(maybe_tensor, Mapping):
        return type(maybe_tensor)({k: _move_to_device(v, device) for k, v in maybe_tensor.items()})
    else:
        return maybe_tensor

def move_to_device(sample, device: torch.device):
    if device.type == "cpu":
        return sample
    if len(sample) == 0:
        return {}
    return _move_to_device(sample, device)
class LangChainEmbeddingWrapper:
    def __init__(self, model):
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[np.ndarray]:
        # Convert input text into the model's input format
        batch_dict = self.model.tokenizer(
            texts,
            max_length=512,
            padding=True,
            return_tensors="pt",
            truncation=True
        )
        # Create pool_mask (same as attention_mask)
        pool_mask = batch_dict['attention_mask'].clone()
        # Move data to the model's device
        batch_dict = move_to_device(batch_dict, self.model.device)
        pool_mask = pool_mask.to(self.model.device)
        # Run the model
        with torch.no_grad():
            embeddings = self.model(
                input_ids=batch_dict['input_ids'],
                attention_mask=batch_dict['attention_mask'],
                pool_mask=pool_mask
            ).embedding
        # Convert result to NumPy array and return
        return embeddings.cpu().numpy().tolist()

    def embed_query(self, text: str) -> np.ndarray:
        return self.embed_documents([text])[0]

# Set wrapper
embeddings = LangChainEmbeddingWrapper(embeddings)

If you're using the BAAI bge-m3 model (the Llama 3.2 path), you can directly use LangChain's HuggingFaceEmbeddings class:

Create embeddings

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3"
)
- Store the embedded vectors in a vector database. This database will serve as the foundation for similarity-based document retrieval during queries.

Save vectors

from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
vectorstore.save_local('./db/kcdocs_faiss')
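To reuse the index in a later session, the saved files can be loaded back and queried directly. The following is a minimal optional sketch, assuming the same embeddings object is available; note that recent langchain_community versions require allow_dangerous_deserialization=True when loading a locally saved FAISS index.

# Optional: reload the saved FAISS index and run a quick similarity search.
loaded_store = FAISS.load_local(
    './db/kcdocs_faiss',
    embeddings,
    allow_dangerous_deserialization=True,  # required for locally pickled index metadata
)
for doc in loaded_store.similarity_search("Kubeflow supported frameworks", k=2):
    print(doc.metadata['source'])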
Step 6. Define LLM model
Access the LLM endpoint served via KServe using a DNS-based URL and use it in LangChain.
import os
from langchain_openai import ChatOpenAI

isvc_name = "<YOUR_KSERVE_INFERENCE_SERVICE_NAME>"  # Inference service name
namespace = "<YOUR_NAMESPACE>"                      # Your namespace
model_name = "<YOUR_MODEL_NAME>"                    # Model name

llm_svc_url = f"http://{isvc_name}.{namespace}.svc.cluster.local/"

# Set up an OpenAI-compatible chat model pointing at the KServe endpoint
llm = ChatOpenAI(
    model_name=model_name,
    base_url=os.path.join(llm_svc_url, 'openai', 'v1'),
    openai_api_key="empty",
)
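Before wiring the model into the RAG graph, you can optionally send a single prompt to confirm the KServe endpoint responds (the prompt text here is just an example):

# Optional: smoke-test the served LLM with a single prompt.
print(llm.invoke("Say hello in one short sentence.").content)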
Step 7. Build RAG graph
Construct a single chain (graph) that flows from document retrieval to response generation.
- Define the data structure exchanged between each node (retrieve, generate) in the graph.

Define state

from typing import List, TypedDict

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
- Create the Retriever function to search documents and the Generate function to generate answers.

Retriever and generator functions

from langchain import hub

# Load the prompt used in RAG; this will be used in the generate function.
prompt = hub.pull("rlm/rag-prompt")

# `retrieve` function: retrieves the most relevant documents from the vector store based on the question
def retrieve(state: State):
    retrieved_docs = vectorstore.similarity_search(state["question"])
    return {"context": retrieved_docs}

# `generate` function: generates an answer based on the retrieved documents
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
- Connect the defined state and functions to build the graph.

Build graph

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
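Before the full test in the next step, you can optionally stream per-node updates to watch retrieve and generate run in sequence. This is a minimal sketch using LangGraph's stream API; the example question is arbitrary.

# Optional: stream intermediate updates from each node as the graph runs.
for update in graph.stream(
    {"question": "What frameworks does Kubeflow support?"},
    stream_mode="updates",
):
    print(list(update.keys()))  # node name(s) that produced this update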
Step 8. Test RAG
Use the previously constructed graph (retrieve → generate) to search for relevant documents based on a user's question and generate a response using the LLM.
response = graph.invoke({"question": "Tell me about the supported framework versions in Kubeflow versions 1.6 and 1.8."})
print(response)
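The returned state contains the question, the retrieved context documents, and the generated answer, matching the State definition from Step 7. To print only the answer together with the source of each retrieved chunk:

# Print the generated answer and the source of each retrieved document chunk.
print(response["answer"])
for doc in response["context"]:
    print("-", doc.metadata["source"])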