dev_to, March 7, 2026


Build a RAG Pipeline in Python That Actually Works




Most RAG tutorials teach you to stuff documents into a vector store and call it a day. Then your users ask a question and get back completely wrong answers because the retriever pulled the wrong chunks.

Retrieval Augmented Generation is the most common pattern in production AI systems. It lets an LLM answer questions using your own data (internal docs, codebases, knowledge bases) without fine-tuning. The concept is straightforward: retrieve relevant documents, feed them to the model, get grounded answers.

The implementation is where teams struggle. Bad chunking produces fragments that lose context. Naive retrieval returns semantically similar but factually irrelevant results. And most tutorials stop before showing you how to evaluate whether your pipeline actually works.

This guide walks through four patterns that make RAG pipelines reliable. Every code example uses LangChain (v0.3+, as of March 2026), runs on Python 3.10+, and is verified against the official documentation.

Install the dependencies:

```shell
pip install langchain-openai langchain-chroma langchain-community \
    langchain-text-splitters chromadb beautifulsoup4
```

Set your OpenAI API key:

```shell
export OPENAI_API_KEY="your-key-here"
```

All examples below use OpenAI embeddings and models. You can swap in any LangChain-compatible provider (Anthropic, Ollama, Cohere) by changing the import and model name.

The first failure point in most RAG pipelines is chunking. Split too small and you lose context. Split too large and you dilute relevance. The key is overlap: every chunk shares some text with its neighbors, so the retriever can find relevant passages even when the answer spans a chunk boundary.
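Before the real LangChain splitter, here is what chunk_size and chunk_overlap mean mechanically, as a pure-Python sketch. This is illustrative only: naive_split is a made-up helper, and the real RecursiveCharacterTextSplitter also respects paragraph and sentence boundaries before falling back to fixed-size cuts.

```python
# Illustrative only: a naive fixed-size splitter showing what chunk_size and
# chunk_overlap mean mechanically.
def naive_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap          # how far the window advances
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

text = "abcdefghij" * 250                      # 2,500 characters of dummy text
chunks = naive_split(text, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                             # 3 chunks
print(chunks[0][-200:] == chunks[1][:200])     # True: neighbors share 200 chars
```

Because the window advances by chunk_size minus chunk_overlap, the last 200 characters of each chunk reappear at the start of the next one, which is exactly why boundary-spanning answers still get retrieved.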
```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import bs4

# Load a web page, extracting only the content you need
loader = WebBaseLoader(
    web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    },
)
docs = loader.load()

# RecursiveCharacterTextSplitter tries paragraph breaks first,
# then sentences, then words. This preserves natural boundaries.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,  # tracks where each chunk came from
)
splits = text_splitter.split_documents(docs)

print(f"Loaded {len(docs)} documents, split into {len(splits)} chunks")
```

Three things matter here:

- chunk_size=1000 keeps chunks large enough to contain complete thoughts. A 200-character chunk rarely contains enough context to answer a question on its own.
- chunk_overlap=200 means adjacent chunks share 200 characters. When an answer spans two chunks, both show up in retrieval results.
- add_start_index=True records the character offset where each chunk starts in the original document. This lets you trace any retrieved chunk back to its source position, which is critical for debugging retrieval quality.

RecursiveCharacterTextSplitter is the default choice for most use cases. It splits on paragraph breaks (\n\n) first, then sentence breaks (\n, .), then words. This hierarchy preserves the most natural reading boundaries.

Once your documents are chunked, you need to convert them to vectors and store them for retrieval. ChromaDB is the simplest vector store for local development: no external services, no Docker containers, just pip install.
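A quick aside before storing anything: "similarity" in similarity search means vector closeness, usually measured as cosine similarity. Here is a minimal pure-Python version, just to build intuition. The toy 3-dimensional vectors are invented; real embeddings from text-embedding-3-small have 1536 dimensions, and the embedding model plus Chroma's index do this work at scale.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values for illustration)
query = [0.9, 0.1, 0.0]
chunk_about_agents = [0.8, 0.2, 0.1]
chunk_about_cooking = [0.0, 0.1, 0.9]

print(cosine_similarity(query, chunk_about_agents))   # ~0.98: close match
print(cosine_similarity(query, chunk_about_cooking))  # ~0.01: unrelated
```

The retriever ranks chunks by exactly this kind of score, which is why it can return passages that are semantically close to the question yet factually irrelevant: closeness in embedding space is not the same as containing the answer.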
```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# OpenAI's text-embedding-3-small is fast and cheap.
# For higher accuracy, use text-embedding-3-large.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create the vector store from your document chunks
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db",  # saves to disk
)

# Turn it into a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},  # return top 4 matches
)

# Test it
results = retriever.invoke("What is task decomposition?")
for doc in results:
    print(f"[Chunk from index {doc.metadata.get('start_index', '?')}]")
    print(doc.page_content[:200])
    print("---")
```

The persist_directory parameter saves your vectors to disk. Without it, ChromaDB stores everything in memory and you re-embed on every restart. For a knowledge base with thousands of documents, re-embedding costs real money.

Choosing k: Start with k=4. Too few results and you miss relevant context. Too many and you flood the LLM's context window with noise. Measure retrieval precision (are the returned chunks actually relevant?) and adjust.

When to use a different vector store: ChromaDB works for local development and small datasets (under 1 million chunks). For production with larger datasets, consider Pinecone, Weaviate, or PostgreSQL with pgvector. The LangChain API is the same: swap the import, change the constructor, keep your retrieval code.

Here is where retrieval meets generation. You build a chain that takes a question, retrieves relevant chunks, formats them into a prompt, and passes everything to the LLM.
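One aside before wiring up generation: the advice above to "measure retrieval precision" can be made concrete with a tiny precision@k helper. This is a hand-rolled sketch; the chunk ids and relevance labels below are hypothetical, and in practice you label a sample of real queries by hand.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)

# Hypothetical labels for a single query
retrieved = ["c12", "c7", "c31", "c4"]   # ids the retriever returned, in order
relevant = {"c12", "c31"}                # ids a human marked as relevant

print(precision_at_k(retrieved, relevant, k=4))  # 0.5
```

If precision@4 is consistently low across your labeled queries, raising k just adds noise; the fix is usually better chunking or better embeddings, not a bigger k.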
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# The prompt template grounds the LLM in your retrieved context
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context.
If the context doesn't contain the answer, say
"I don't have enough information to answer that."

Context: {context}

Question: {question}

Answer:"""
)

def format_docs(docs):
    """Join retrieved documents into a single string."""
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain using LCEL (LangChain Expression Language)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run it
answer = rag_chain.invoke("What is task decomposition?")
print(answer)
```

Two design decisions in this prompt matter:

- "Based only on the following context" prevents the LLM from using its training data. Without this constraint, the model mixes retrieved facts with memorized (potentially outdated) information.
- The fallback instruction ("say I don't have enough information") stops the model from hallucinating when the retriever returns irrelevant chunks. Most RAG failures happen here: the retriever returns something vaguely related, and the model confidently generates a wrong answer from it.

The chain itself uses LangChain Expression Language (LCEL). The | pipe operator connects components: retriever feeds into format_docs, which feeds into the prompt template, which feeds into the LLM, which feeds into the output parser. RunnablePassthrough() passes the user's question through unchanged. The retriever receives the same question string to perform the similarity search.

This is the pattern most tutorials skip. You built a RAG pipeline.
How do you know it returns correct answers? You need a test set of questions with known answers, and a systematic way to check retrieval quality.

```python
# Simple evaluation: does the retriever find relevant chunks?
test_questions = [
    {
        "question": "What is task decomposition?",
        "expected_keywords": ["subgoal", "decompose", "smaller"],
    },
    {
        "question": "What are the types of agent memory?",
        "expected_keywords": ["short-term", "long-term", "sensory"],
    },
]

def evaluate_retrieval(retriever, test_cases):
    """Check if retrieved chunks contain expected keywords."""
    results = []
    for case in test_cases:
        docs = retriever.invoke(case["question"])
        retrieved_text = " ".join(d.page_content for d in docs).lower()
        found = [
            kw for kw in case["expected_keywords"]
            if kw.lower() in retrieved_text
        ]
        missing = [
            kw for kw in case["expected_keywords"]
            if kw.lower() not in retrieved_text
        ]
        score = len(found) / len(case["expected_keywords"])
        results.append({
            "question": case["question"],
            "score": score,
            "found": found,
            "missing": missing,
        })
        status = "PASS" if score >= 0.5 else "FAIL"
        print(f"[{status}] {case['question']} — {score:.0%}")
        if missing:
            print(f"  Missing: {missing}")
    avg = sum(r["score"] for r in results) / len(results)
    print(f"\nAverage retrieval score: {avg:.0%}")
    return results

evaluate_retrieval(retriever, test_questions)
```

This is a minimal evaluation. It checks whether the retriever pulls back chunks that contain the right concepts. A score below 50% means your chunking strategy is wrong: go back to Pattern 1 and adjust chunk_size and chunk_overlap.

For production evaluation, add these layers:

- Answer correctness: Compare generated answers against ground truth using an LLM-as-judge (ask a model to score the answer's factual accuracy against a reference answer).
- Faithfulness: Check whether the answer is grounded in the retrieved context. If the answer contains claims not present in any retrieved chunk, the model is hallucinating.
- Retrieval relevance: For each retrieved chunk, score whether it is actually relevant to the question. Low relevance scores mean your embeddings or chunking need work.

Frameworks like DeepEval and RAGAS automate these checks. But start with the keyword-based evaluation above. It catches the obvious failures (wrong chunks, empty retrievals, missing concepts) before you invest in a full evaluation pipeline.

Here is the complete pipeline in one script:

```python
"""Complete RAG pipeline: load, chunk, embed, retrieve, generate."""
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Load
loader = WebBaseLoader(
    web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    },
)
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = splitter.split_documents(docs)

# 3. Embed + Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 4. Generate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
    """Answer based only on this context. If unsure, say so.

Context: {context}

Question: {question}

Answer:"""
)
rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

# 5. Run
question = "What is task decomposition?"
```
```python
print(rag_chain.invoke(question))
```

38 lines from raw documents to grounded answers.

Three improvements that matter most after your first pipeline works:

- Add metadata filtering. Tag your documents with source, date, and category. Use search_kwargs={"filter": {"source": "docs"}} to restrict retrieval to specific document sets.
- Try hybrid search. Vector similarity misses exact keyword matches. ChromaDB and most vector stores support combining vector search with keyword (BM25) search. This catches queries where the user uses exact terminology from the documents.
- Monitor retrieval quality. Log every query, the chunks retrieved, and the generated answer. Review the logs weekly. The queries your pipeline answers badly tell you exactly which documents to add or how to adjust your chunking.

RAG is not a one-time setup. It is a system that improves as you add documents, adjust chunking, and measure what works.

Follow @klement_gunndu for more AI engineering content. We're building in public.
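As a footnote to the hybrid search suggestion: a common way to merge a keyword (BM25) ranking with a vector ranking is reciprocal rank fusion (RRF). This pure-Python sketch shows only the merge step, with invented chunk ids; it is not any specific vector store's API.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids; k=60 is the conventional RRF constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Items near the top of any list get a larger share of the score
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings for one query
vector_ranking = ["c3", "c7", "c1", "c9"]   # semantic similarity order
bm25_ranking = ["c7", "c2", "c3", "c8"]     # exact keyword-match order

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused[:4])  # ['c7', 'c3', 'c2', 'c1']: ids ranked high in both lists win
```

Chunks that appear high in both rankings (c7 and c3 here) float to the top, which is the behavior you want: exact-terminology queries get caught by BM25 while paraphrased queries still match on vectors.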