3. Build RAG with LLM model
📘 Implement a document-based Q&A system (RAG) using the updated LLM model endpoint.
- Estimated duration: 60 minutes
- Recommended operating system: Ubuntu
About this scenario
In this scenario, you will use the latest LLM models—Kakao’s Kanana and Meta Llama 3.2—within a Kubeflow Jupyter Notebook environment to split and vectorize documents and build a Retrieval-Augmented Generation (RAG) pipeline. Through this exercise, you will implement a local document-based Q&A system using LangChain.
Key tasks include:
- Load, split, embed, and store documents as vectors
- Build a LangChain-based Retriever & Generator
- Invoke the LLM and test the RAG query pipeline
RAG implementation process
The full RAG flow consists of preprocessing (steps 1–4) and query handling (steps 5–8):
Step | Task Name | Description |
---|---|---|
1 | Load | Load documents from PDFs, text files, webpages, etc. |
2 | Split | Split documents into chunks based on chunk_size and overlap suitable for LLM input. |
3 | Embedding | Convert split text into semantic vectors. |
4 | VectorStore | Store embedded vectors in a vector DB (e.g., FAISS). |
5 | Retrieval | Use a retriever object to fetch documents relevant to the query. |
6 | Prompting | Construct prompts with retrieved documents and the query. |
7 | LLM | Generate responses using the LLM. |
8 | Output | Return the final answer to the user after the prompt → model call → response flow. |
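At a glance, the flow above maps onto LangChain components roughly as follows. This is only a schematic sketch to orient you; the exact classes, prompts, and model names used in this exercise are introduced step by step in the sections below.

# Schematic sketch of the RAG flow (illustrative names, not the exact code used below)
# Preprocessing (steps 1-4)
#   docs    = load_documents(...)                         # 1. Load
#   chunks  = splitter.split_documents(docs)              # 2. Split
#   store   = FAISS.from_documents(chunks, embeddings)    # 3-4. Embedding + VectorStore
# Query handling (steps 5-8)
#   context  = store.similarity_search(question)          # 5. Retrieval
#   messages = prompt.invoke({"question": question, "context": context})  # 6. Prompting
#   answer   = llm.invoke(messages)                       # 7-8. LLM + Output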
Supported tools
Tool | Version | Description |
---|---|---|
Jupyter Notebook | 4.2.1 | Web-based development environment supporting ML frameworks and Kubeflow SDK. |
KServe | 0.15.0 | Model serving tool offering fast deployment, updates, high availability, and automatic handling of common serving issues. |
Before you start
1. Prepare Kubeflow environment
To reliably build the RAG pipeline in Kubeflow, you need a node pool environment with the following specifications. Refer to the prerequisites and prepare an environment with a CPU or GPU-enabled node pool in advance.
2. Prepare training dataset
Use KakaoCloud’s technical documentation, including the Kubeflow service guide and tutorial texts, as the training dataset for this exercise.
Download the sample dataset below to get started:
- Download sample data: sample_rag_docs_dataset.csv
Getting started
In this exercise, you will build a document-based Q&A system (RAG: Retrieval-Augmented Generation) using the latest LLM models—Kakao’s Kanana-Nano-2.1B and Meta’s Llama 3.2.
Step 1. Create Jupyter Notebook instance
Create a GPU-based Jupyter Notebook instance for hands-on practice in the Kubeflow dashboard.
- In the Kubeflow dashboard, go to the Notebooks tab.
- Click the [New Notebook] button in the upper-right corner to create a new instance.
- In the New notebook configuration screen, enter the following settings:
  - Notebook Image: Select kc-kubeflow/jupyter-pytorch-cuda-full:v1.8.0.py311.1a
  - Minimum specs: At least 3 vCPUs, 6GB memory
  - GPU settings: Select 1 GPU with GPU Vendor NVIDIA MIG - 1g.10gb
- After entering the settings, click the [LAUNCH] button to start the notebook instance.
Step 2. Install packages
Install the required Python packages to implement the RAG pipeline.
! pip install langchain langchain-community langchain-openai transformers sentence-transformers faiss-cpu langgraph datasets accelerate langchain_huggingface
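To confirm the installation succeeded, you can optionally import the core packages and print a couple of version numbers; this is purely a convenience check.

# Optional: confirm the core packages import cleanly after installation.
import faiss, langchain, langgraph, transformers, sentence_transformers
print(langchain.__version__, transformers.__version__)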
Step 3. Load and convert documents
Load the documents provided in CSV format and convert them into LangChain Document objects.
from datasets import Dataset
dataset = Dataset.from_csv('sample_rag_docs_dataset.csv')
dataset
Generating train split:
17/0 [00:00<00:00, 1236.38 examples/s]
Dataset({
features: ['source', 'page_content'],
num_rows: 17
})
from langchain_core.documents import Document

docs = []
for _ea_doc_data in dataset:
    docs.append(Document(
        metadata={'source': _ea_doc_data['source']},
        page_content=_ea_doc_data['page_content']
    ))

len(docs)
17
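To verify the conversion, you can optionally inspect one of the resulting Document objects:

# Optional: check the metadata and a short preview of the first document.
print(docs[0].metadata['source'])
print(docs[0].page_content[:200])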
Step 4. Split documents
Split the converted documents into chunks suitable for LLM input.
These chunks will be used in the next steps for embedding and vector store construction.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
split_documents = text_splitter.split_documents(docs)
len(split_documents)
108
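Before embedding, you can optionally check that the chunk lengths stay near the configured chunk_size:

# Optional: summarize chunk lengths (in characters) produced by the splitter.
lengths = [len(d.page_content) for d in split_documents]
print(f"min={min(lengths)}, max={max(lengths)}, avg={sum(lengths) / len(lengths):.0f}")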
Step 5. Create embeddings and vector store
Convert the documents into high-dimensional vectors and store them in a searchable structure.
- Generate embeddings. Document embeddings represent each document chunk as a vector. Pick the option that matches your model: the Kanana model requires a custom wrapper, while the Llama 3.2 path can use the BAAI/bge-m3 model from Hugging Face with the HuggingFaceEmbeddings class (shown after the Kanana example below).

The Kanana-Nano-2.1B model is not natively supported by LangChain's embedding interface, so you need to implement a custom wrapper class as shown below.
Kanana embedding model setup

import numpy as np
import torch
from collections.abc import Mapping
from typing import List
from transformers import AutoModel

embeddings = AutoModel.from_pretrained(
    "kakaocorp/kanana-nano-2.1b-embedding",
    trust_remote_code=True,
)

Move to device and define embedding wrapper

# Recursively move tensors (and containers of tensors) to the target device
def _move_to_device(maybe_tensor, device: torch.device):
    if torch.is_tensor(maybe_tensor):
        return maybe_tensor.to(device, non_blocking=device.type == "cuda")
    elif isinstance(maybe_tensor, dict):
        return {key: _move_to_device(value, device) for key, value in maybe_tensor.items()}
    elif isinstance(maybe_tensor, list):
        return [_move_to_device(x, device) for x in maybe_tensor]
    elif isinstance(maybe_tensor, tuple):
        return tuple([_move_to_device(x, device) for x in maybe_tensor])
    elif isinstance(maybe_tensor, Mapping):
        return type(maybe_tensor)({k: _move_to_device(v, device) for k, v in maybe_tensor.items()})
    else:
        return maybe_tensor

def move_to_device(sample, device: torch.device):
    if device.type == "cpu":
        return sample
    if len(sample) == 0:
        return {}
    return _move_to_device(sample, device)
class LangChainEmbeddingWrapper:
    def __init__(self, model):
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[np.ndarray]:
        # Convert input text into the model's input format
        batch_dict = self.model.tokenizer(
            texts,
            max_length=512,
            padding=True,
            return_tensors="pt",
            truncation=True
        )
        # Create pool_mask (same as attention_mask)
        pool_mask = batch_dict['attention_mask'].clone()
        # Move data to the model's device
        batch_dict = move_to_device(batch_dict, self.model.device)
        pool_mask = pool_mask.to(self.model.device)
        # Run the model
        with torch.no_grad():
            embeddings = self.model(
                input_ids=batch_dict['input_ids'],
                attention_mask=batch_dict['attention_mask'],
                pool_mask=pool_mask
            ).embedding
        # Convert result to NumPy array and return
        return embeddings.cpu().numpy().tolist()

    def embed_query(self, text: str) -> np.ndarray:
        return self.embed_documents([text])[0]

# Set wrapper
embeddings = LangChainEmbeddingWrapper(embeddings)

If you're using the BAAI bge-m3 model (the Llama 3.2 path), you can directly use LangChain's HuggingFaceEmbeddings class:

Create embeddings

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3"
)
- Store the embedded vectors in a vector database. This database will serve as the foundation for similarity-based document retrieval during queries.

Save vectors

from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
vectorstore.save_local('./db/kcdocs_faiss')
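To reuse the index in a later session, the saved files can be loaded back and queried directly. The following is a minimal optional sketch, assuming the same embeddings object is available; note that recent langchain_community versions require allow_dangerous_deserialization=True when loading a locally saved FAISS index.

# Optional: reload the saved FAISS index and run a quick similarity search.
loaded_store = FAISS.load_local(
    './db/kcdocs_faiss',
    embeddings,
    allow_dangerous_deserialization=True,  # required for locally pickled index metadata
)
for doc in loaded_store.similarity_search("Kubeflow supported frameworks", k=2):
    print(doc.metadata['source'])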
Step 6. Define LLM model
Access the LLM endpoint served via KServe using a DNS-based URL and use it in LangChain.
import os
from langchain_openai import ChatOpenAI

isvc_name = "<YOUR_KSERVE_INFERENCE_SERVICE_NAME>"  # Inference service name
namespace = "<YOUR_NAMESPACE>"                      # Your namespace
model_name = "<YOUR_MODEL_NAME>"                    # Model name

llm_svc_url = f"http://{isvc_name}.{namespace}.svc.cluster.local/"

# Set up an OpenAI-compatible chat model pointing at the KServe endpoint
llm = ChatOpenAI(
    model_name=model_name,
    base_url=os.path.join(llm_svc_url, 'openai', 'v1'),
    openai_api_key="empty",
)
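Before wiring the model into the RAG graph, you can optionally send a single prompt to confirm the KServe endpoint responds (the prompt text here is just an example):

# Optional: smoke-test the served LLM with a single prompt.
print(llm.invoke("Say hello in one short sentence.").content)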
Step 7. Build RAG graph
Construct a single chain (graph) that flows from document retrieval to response generation.
- Define the data structure exchanged between each node (retrieve, generate) in the graph.

Define state

from typing import List, TypedDict

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
- Create the Retriever function to search documents and the Generate function to generate answers.

Retriever and generator functions

from langchain import hub

# Load the prompt used in RAG; this will be used in the generate function.
prompt = hub.pull("rlm/rag-prompt")

# `retrieve` function: retrieves the most relevant documents from the vector store based on the question
def retrieve(state: State):
    retrieved_docs = vectorstore.similarity_search(state["question"])
    return {"context": retrieved_docs}

# `generate` function: generates an answer based on the retrieved documents
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
- Connect the defined state and functions to build the graph.

Build graph

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
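Before the full test in the next step, you can optionally stream per-node updates to watch retrieve and generate run in sequence. This is a minimal sketch using LangGraph's stream API; the example question is arbitrary.

# Optional: stream intermediate updates from each node as the graph runs.
for update in graph.stream(
    {"question": "What frameworks does Kubeflow support?"},
    stream_mode="updates",
):
    print(list(update.keys()))  # node name(s) that produced this update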
Step 8. Test RAG
Use the previously constructed graph (retrieve → generate) to search for relevant documents based on a user's question and generate a response using the LLM.
response = graph.invoke({"question": "Tell me about the supported framework versions in Kubeflow versions 1.6 and 1.8."})
print(response)
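The returned state contains the question, the retrieved context documents, and the generated answer, matching the State definition from Step 7. To print only the answer together with the source of each retrieved chunk:

# Print the generated answer and the source of each retrieved document chunk.
print(response["answer"])
for doc in response["context"]:
    print("-", doc.metadata["source"])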