Tutorial series | Kubeflow LLM workflows

3. Build RAG with LLM model

📘 Implement a document-based Q&A system (RAG) using the updated LLM model endpoint.

Basic information
  • Estimated duration: 60 minutes
  • Recommended operating system: Ubuntu

About this scenario

In this scenario, you will use the latest LLM models—Kakao’s Kanana and Meta Llama 3.2—within a Kubeflow Jupyter Notebook environment to split and vectorize documents and build a Retrieval-Augmented Generation (RAG) pipeline. Through this exercise, you will implement a local document-based Q&A system using LangChain.

Key tasks include:

  • Load, split, embed, and store documents as vectors
  • Build a LangChain-based Retriever & Generator
  • Invoke the LLM and test the RAG query pipeline

RAG implementation process

The full RAG flow consists of preprocessing (steps 1–4) and query handling (steps 5–8):

  1. Load: Load documents from PDFs, text files, webpages, etc.
  2. Split: Split documents into chunks based on chunk_size and chunk_overlap suitable for LLM input.
  3. Embedding: Convert the split text into semantic vectors.
  4. VectorStore: Store the embedded vectors in a vector DB (e.g., FAISS).
  5. Retrieval: Use a retriever object to fetch documents relevant to the query.
  6. Prompting: Construct prompts from the retrieved documents and the query.
  7. LLM: Generate responses using the LLM.
  8. Output: Return the final response (prompt → model call → response).

Supported tools

  • Jupyter Notebook 4.2.1: Web-based development environment supporting ML frameworks and the Kubeflow SDK.
  • KServe 0.15.0: Model serving tool offering fast deployment, updates, high availability, and automatic handling of common serving issues.

Before you start

1. Prepare Kubeflow environment

To reliably build the RAG pipeline in Kubeflow, you need a node pool that meets the specifications listed in the prerequisites. Prepare an environment with a CPU or GPU-enabled node pool in advance.

2. Prepare training dataset

Use KakaoCloud’s technical documentation, including the Kubeflow service guide and tutorial texts, as the training dataset for this exercise.
Download the sample dataset below to get started:
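
If you want to inspect the file before starting, the optional snippet below previews it with pandas; it assumes the dataset has been downloaded as sample_rag_docs_dataset.csv (the file name used in the steps that follow).

Preview dataset (optional)
import pandas as pd

# Preview the downloaded CSV; it should contain 'source' and 'page_content' columns
df = pd.read_csv("sample_rag_docs_dataset.csv")
print(df.columns.tolist())
print(df.head(3))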

Getting started

In this exercise, you will build a document-based Q&A system (RAG: Retrieval-Augmented Generation) using the latest LLM models—Kakao’s Kanana-Nano-2.1B and Meta’s Llama 3.2.

Step 1. Create Jupyter Notebook instance

In the Kubeflow dashboard, create a GPU-based Jupyter Notebook instance for this hands-on exercise.

  1. In the Kubeflow dashboard, go to the Notebooks tab.

  2. Click the [New Notebook] button in the upper-right corner to create a new instance.

  3. In the New notebook configuration screen, enter the following settings:

    • Notebook Image: Select kc-kubeflow/jupyter-pytorch-cuda-full:v1.8.0.py311.1a
    • Minimum specs: At least 3 vCPUs, 6GB memory
    • GPU settings: Select 1 GPU, GPU Vendor NVIDIA MIG - 1g.10gb
  4. After entering the settings, click the [LAUNCH] button to start the notebook instance.
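
After the instance starts, it is worth confirming that the MIG GPU slice is visible from inside the notebook; the optional check below assumes the PyTorch CUDA image selected above.

Check GPU availability (optional)
import torch

# Confirm that the notebook can see the assigned GPU (MIG 1g.10gb slice)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))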

Step 2. Install packages

Install the required Python packages to implement the RAG pipeline.

Install packages
! pip install langchain langchain-community langchain-openai transformers sentence-transformers faiss-cpu langgraph datasets accelerate langchain_huggingface
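
Before moving on, an optional import check confirms that the core packages installed correctly.

Verify installation (optional)
import faiss
import langchain
import transformers

# Print versions to confirm the packages are importable
print("langchain:", langchain.__version__)
print("transformers:", transformers.__version__)
print("faiss:", faiss.__version__)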

Step 3. Load and convert documents

Load the document provided in CSV format and convert it into LangChain Document objects.

Load CSV file
from datasets import Dataset

dataset = Dataset.from_csv('sample_rag_docs_dataset.csv')
dataset
Example output
Generating train split: 17/0 [00:00<00:00, 1236.38 examples/s]
Dataset({
    features: ['source', 'page_content'],
    num_rows: 17
})
Convert to LangChain Document objects
from langchain_core.documents import Document

docs = []

for _ea_doc_data in dataset:
    docs.append(Document(
        metadata={'source': _ea_doc_data['source']},
        page_content=_ea_doc_data['page_content']
    ))

len(docs)
Example output
17
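
To confirm the conversion, you can inspect one of the resulting Document objects, as in the optional check below.

Inspect a converted document (optional)
# Print the metadata and the first 200 characters of the first document
print(docs[0].metadata)
print(docs[0].page_content[:200])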

Step 4. Split documents

Split the converted documents into chunks suitable for LLM input.
These chunks will be used in the next steps for embedding and vector store construction.

Split documents
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_documents = text_splitter.split_documents(docs)
len(split_documents)
Example output
108
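
Optionally, you can inspect the resulting chunks to see how the chunk_size and chunk_overlap settings affected the split.

Inspect chunks (optional)
# Look at the distribution of chunk lengths and preview the first chunk
chunk_lengths = [len(d.page_content) for d in split_documents]
print(min(chunk_lengths), max(chunk_lengths))
print(split_documents[0].page_content[:200])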

Step 5. Create embeddings and vector store

Convert the documents into high-dimensional vectors and store them in a searchable structure.

  1. Generate embeddings. Document embeddings represent each document chunk as a vector. You can use the BAAI/bge-m3 model from HuggingFace with the HuggingFaceEmbeddings class (a sketch of that alternative follows at the end of this step), but this exercise uses Kakao's Kanana embedding model.

    The Kanana-Nano-2.1B embedding model is not natively supported by LangChain's embedding interface, so you need to implement a custom wrapper class as shown below.

    Kanana embedding model setup
    from transformers import AutoModel

    embeddings = AutoModel.from_pretrained(
        "kakaocorp/kanana-nano-2.1b-embedding",
        trust_remote_code=True,
    )
    Move to device and define embedding wrapper
    from collections.abc import Mapping
    from typing import List

    import torch

    def _move_to_device(maybe_tensor, device: torch.device):
        if torch.is_tensor(maybe_tensor):
            return maybe_tensor.to(device, non_blocking=device.type == "cuda")
        elif isinstance(maybe_tensor, dict):
            return {key: _move_to_device(value, device) for key, value in maybe_tensor.items()}
        elif isinstance(maybe_tensor, list):
            return [_move_to_device(x, device) for x in maybe_tensor]
        elif isinstance(maybe_tensor, tuple):
            return tuple([_move_to_device(x, device) for x in maybe_tensor])
        elif isinstance(maybe_tensor, Mapping):
            return type(maybe_tensor)({k: _move_to_device(v, device) for k, v in maybe_tensor.items()})
        else:
            return maybe_tensor

    def move_to_device(sample, device: torch.device):
        if device.type == "cpu":
            return sample
        if len(sample) == 0:
            return {}
        return _move_to_device(sample, device)

    class LangChainEmbeddingWrapper:
        def __init__(self, model):
            self.model = model

        def embed_documents(self, texts: List[str]) -> List[List[float]]:
            # Convert input text into the model's input format
            batch_dict = self.model.tokenizer(
                texts,
                max_length=512,
                padding=True,
                return_tensors="pt",
                truncation=True
            )
            # Create pool_mask (same as attention_mask)
            pool_mask = batch_dict['attention_mask'].clone()
            # Move data to the model's device
            batch_dict = move_to_device(batch_dict, self.model.device)
            pool_mask = pool_mask.to(self.model.device)
            # Run the model
            with torch.no_grad():
                embeddings = self.model(
                    input_ids=batch_dict['input_ids'],
                    attention_mask=batch_dict['attention_mask'],
                    pool_mask=pool_mask
                ).embedding
            # Convert the result to a list of float vectors and return
            return embeddings.cpu().numpy().tolist()

        def embed_query(self, text: str) -> List[float]:
            return self.embed_documents([text])[0]

    # Set wrapper
    embeddings = LangChainEmbeddingWrapper(embeddings)
  2. Store the embedded vectors in a vector database. This database will serve as the foundation for similarity-based document retrieval during queries.

    Save vectors
    from langchain_community.vectorstores import FAISS

    vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
    vectorstore.save_local('./db/kcdocs_faiss')
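
As noted in the embedding step above, LangChain also supports HuggingFace embedding models directly. The sketch below shows the BAAI/bge-m3 alternative via the HuggingFaceEmbeddings class from the langchain_huggingface package installed in Step 2; it can be passed to FAISS.from_documents in place of the custom Kanana wrapper.

Alternative: HuggingFace embedding model (optional)
from langchain_huggingface import HuggingFaceEmbeddings

# LangChain-native embedding model as an alternative to the custom wrapper
bge_embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},               # use "cpu" if no GPU is available
    encode_kwargs={"normalize_embeddings": True},  # normalized vectors for cosine similarity
)

# Drop-in replacement when building the vector store:
# vectorstore = FAISS.from_documents(documents=split_documents, embedding=bge_embeddings)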

Step 6. Define LLM model

Access the LLM endpoint served via KServe using a DNS-based URL and use it in LangChain.

Configure LLM endpoint
import os

from langchain_openai import ChatOpenAI

isvc_name = "<YOUR_KSERVE_INFERENCE_SERVICE_NAME>"  # Inference service name
namespace = "<YOUR_NAMESPACE>"  # Your namespace
model_name = "<YOUR_MODEL_NAME>"  # Model name

llm_svc_url = f"http://{isvc_name}.{namespace}.svc.cluster.local/"

# Set up an OpenAI-compatible chat model pointed at the KServe endpoint
llm = ChatOpenAI(
    model_name=model_name,
    base_url=os.path.join(llm_svc_url, 'openai', 'v1'),
    openai_api_key="empty",
)
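
Before wiring the model into the RAG graph, a quick call confirms that the endpoint responds; the check below assumes the inference service deployed in the previous tutorial is running and reachable from the notebook.

Test LLM endpoint (optional)
# Send a simple prompt to confirm the served model answers
test_reply = llm.invoke("Reply with a short greeting.")
print(test_reply.content)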

Step 7. Build RAG graph

Construct a single chain (graph) that flows from document retrieval to response generation.

  1. Define the data structure exchanged between each node (retrieve, generate) in the graph.

    Define state
    from typing import List, TypedDict

    from langchain_core.documents import Document

    class State(TypedDict):
        question: str
        context: List[Document]
        answer: str
  2. Create the retrieve function, which searches documents, and the generate function, which produces answers.

    Retriever and generator functions
    from langchain import hub

    # Load the prompt used in RAG; it is used inside the generate function.
    prompt = hub.pull("rlm/rag-prompt")

    # `retrieve` function: fetches the documents most relevant to the question from the vector store
    def retrieve(state: State):
        retrieved_docs = vectorstore.similarity_search(state["question"])
        return {"context": retrieved_docs}

    # `generate` function: generates an answer based on the retrieved documents
    def generate(state: State):
        docs_content = "\n\n".join(doc.page_content for doc in state["context"])
        messages = prompt.invoke({"question": state["question"], "context": docs_content})
        response = llm.invoke(messages)
        return {"answer": response.content}
  3. Connect the defined state and functions to build the graph.

    Build graph
    from langgraph.graph import START, StateGraph

    graph_builder = StateGraph(State).add_sequence([retrieve, generate])
    graph_builder.add_edge(START, "retrieve")
    graph = graph_builder.compile()
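
To double-check the wiring before running a query, you can render the compiled graph's topology; the optional snippet below assumes a langgraph version that exposes get_graph() with Mermaid rendering.

Visualize graph structure (optional)
# Should show a flow of START -> retrieve -> generate
print(graph.get_graph().draw_mermaid())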

Step 8. Test RAG

Use the previously constructed Graph (retrieve → generate) to search for relevant documents based on a user's question and generate a response using the LLM.

Test RAG
response = graph.invoke({"question": "Tell me about the supported framework versions in Kubeflow versions 1.6 and 1.8."})
print(response)
The printed response is a dictionary with the keys defined in State: the original question, the retrieved context documents, and the answer generated by the LLM.
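
If you only need the generated answer and the sources it was grounded in, you can unpack the response dictionary directly; the snippet below assumes the same graph and question as above.

Print answer and sources
# Print only the generated answer and the source of each retrieved document
print(response["answer"])
print({doc.metadata["source"] for doc in response["context"]})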
})