dev_to 2026年4月20日

Overlap とセクション意識を活用した意味に基づくチャンキング：誰も書かなかった RAG のチュートリアル

Semantic Chunking with Overlap and Section-Awareness: The RAG Tutorial Nobody Wrote

Translated: 2026/4/20 13:02:09

rag-systemsemantic-chunkingpythonnatural-language-processinginformation-retrieval

Japanese Translation

私は、LLM が問題ではないこと、埋め込みもベクトルデータベースも問題ではないことを理解するまで、3 週間 debugging に費やしました。問題は、そのチャンクがゴミであることを理解するだけでした。私たちは、34 万ページの法的ドキュメントを 512 トークンの固定サイズのチャンクに分割していました。定義がそれらに参照される條款から離れてしまい、テーブルが半行に割られ、セクションヘッダーが 1 つのチャンクの末尾に終わって、内容は次のチャンクから始まる、といったことが起きました。検索精度は 61% に留まっています。私は、オーバーラップとセクション意識を活用した意味に基づくチャンキングに変更しました。同じモデル、同じドキュメント、その他すべてが同様にそのままです。精度は 89% に跳ね上がりました。これが機能させた正確なコードです。デフォルトのアドバイスはシンプルで、ドキュメントを N トークンのチャンクに分割しましょう。少しオーバーラップを加えるかもしれません。それで終わり。それはきれいなブログ記事や形式が整ったドキュメントでは機能しますが、現実の何かでは崩れてしまいます。ネストされた副條款を持つ契約書や、テーブルを伴う技術マニュアル、3 年間にわたり 12 人の異なる人が執筆したウィキなどです。問題は、意味がトークンの境界を尊重しないということです。512 トークンのウィンドウは段落を半分に分け、コードブロックをその説明から切り離し、セクションヘッダーをその内容なしに置き去りすることがあります。それはレシピの本をページの数でスライスするのではなく、レシピでスライスしようと試みても、具材のリストが 1 つのチャンクに、手順が別のチャンクに残ってしまうようなものです。夕食を作るのは良い運でないと、難しいでしょう。だからなぜ誰もまだそうするのか？それは容易だからです。ただし「容易に実装可能」であり「実際に production で機能する」ことは非常に異なります。Python のチャンカーが、ドキュメント構造（見出し、水平線、主要なトピックの変化など）に基づいてセクション境界を検出する。セクション内部で、トピックが変わる自然なブレイクポイントを発見するように意味の類似度を使用して分割する。チャンク間で情報がギャップに落ちないよう調整可能なオーバーラップを付与する。メタデータを保持する—各チャンクがどのセクションに属するかを知る。LangChain、フレームワークなし。Python、センストランスフォーマー、そして numpy のみ。あなたは各行を読め、それが実際に何をするか理解できます。pip install sentence-transformers numpy これで完了です。2 つのパッケージだけです。

Original Content

I wasted three weeks debugging a RAG system before I realized the LLM wasn't the problem. The embeddings weren't the problem. The vector database wasn't the problem. The chunks were garbage. We were splitting 340,000 legal documents into 512-token fixed-size chunks. Definitions got separated from the clauses that referenced them. Tables split mid-row. Section headers landed at the end of one chunk with their content starting the next. Retrieval accuracy sat at 61%. I switched to semantic chunking with overlap and section-awareness. Same model, same documents, same everything else. Accuracy jumped to 89%. Here's the exact code that made it work. The default advice is simple: split your documents into N-token chunks. Maybe add some overlap. Done. It works on clean blog posts and well-formatted docs. It falls apart on anything real-world — contracts with nested subclauses, technical manuals with tables, wikis written by 12 different people over 3 years. The problem is that meaning doesn't respect token boundaries. A 512-token window might cut a paragraph in half, split a code block from its explanation, or strand a section header without its content. It's like slicing a cookbook by page count instead of by recipe — you end up with the ingredient list in one chunk and the instructions in another. Good luck making dinner. So why does everyone still do it? Because it's easy. But "easy to implement" and "works in production" are very different things. A Python chunker that: Detects section boundaries from document structure (headings, horizontal rules, major topic shifts) Splits within sections using semantic similarity — finding natural breakpoints where the topic shifts Adds configurable overlap so no information falls into gaps between chunks Preserves metadata — each chunk knows which section it belongs to No LangChain, no frameworks. Just Python, a sentence transformer, and numpy. You can read every line and understand exactly what it does. pip install sentence-transformers numpy That's it. Two packages. # semantic_chunker.py import re from dataclasses import dataclass, field from sentence_transformers import SentenceTransformer import numpy as np @dataclass class Chunk: text: str section: str index: int token_estimate: int metadata: dict = field(default_factory=dict) class SemanticChunker: def __init__( self, model_name: str = "all-MiniLM-L6-v2", max_chunk_tokens: int = 512, min_chunk_tokens: int = 50, overlap_tokens: int = 64, similarity_threshold: float = 0.45, ): self.model = SentenceTransformer(model_name) self.max_chunk_tokens = max_chunk_tokens self.min_chunk_tokens = min_chunk_tokens self.overlap_tokens = overlap_tokens self.similarity_threshold = similarity_threshold def _estimate_tokens(self, text: str) -> int: return len(text.split()) * 4 // 3 # rough estimate: 1 word ~ 1.33 tokens def _split_into_sections(self, text: str) -> list[tuple[str, str]]: """Split document into (heading, body) tuples based on structure.""" # Match markdown headings, HTML headings, or ALL-CAPS lines section_pattern = re.compile( r"(?:^|\n)" r"(?:" r"(#{1,4})\s+(.+)" # markdown headings r"|]*>(.+?)" # html headings r"|([A-Z][A-Z\s]{4,})\n" # ALL-CAPS lines (5+ chars) r")" ) sections = [] last_end = 0 last_heading = "Introduction" for match in section_pattern.finditer(text): # Grab content between previous heading and this one body = text[last_end:match.start()].strip() if body: sections.append((last_heading, body)) # Determine the heading text if match.group(2): last_heading = match.group(2).strip() elif match.group(4): last_heading = match.group(4).strip() elif match.group(5): last_heading = match.group(5).strip().title() last_end = match.end() # Don't forget the final section remaining = text[last_end:].strip() if remaining: sections.append((last_heading, remaining)) # If no headings were found, treat entire doc as one section if not sections: sections = [("Document", text.strip())] return sections def _split_into_sentences(self, text: str) -> list[str]: """Split text into sentences, preserving code blocks and lists.""" # Protect code blocks from sentence splitting code_blocks = {} code_pattern = re.compile(r"``` [\s\S]*? ```", re.MULTILINE) for i, match in enumerate(code_pattern.finditer(text)): placeholder = f"__CODE_BLOCK_{i}__" code_blocks[placeholder] = match.group() protected = code_pattern.sub( lambda m: f"__CODE_BLOCK_{list(code_blocks.values()).index(m.group())}__", text, ) # Split on sentence boundaries raw = re.split(r"(?<=[.!?])\s+(?=[A-Z])", protected) # Restore code blocks sentences = [] for s in raw: for placeholder, code in code_blocks.items(): s = s.replace(placeholder, code) s = s.strip() if s: sentences.append(s) return sentences def _find_semantic_breakpoints(self, sentences: list[str]) -> list[int]: """Find indices where topic shifts occur using embedding similarity.""" if len(sentences) < 3: return [] embeddings = self.model.encode(sentences, show_progress_bar=False) breakpoints = [] for i in range(1, len(embeddings)): sim = np.dot(embeddings[i - 1], embeddings[i]) / ( np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i]) ) if sim < self.similarity_threshold: breakpoints.append(i) return breakpoints def _merge_small_groups( self, groups: list[list[str]] ) -> list[list[str]]: """Merge consecutive groups that are below min_chunk_tokens.""" merged = [] buffer = [] for group in groups: buffer.extend(group) if self._estimate_tokens(" ".join(buffer)) >= self.min_chunk_tokens: merged.append(buffer) buffer = [] # Attach leftover to the last group if buffer: if merged: merged[-1].extend(buffer) else: merged.append(buffer) return merged def _split_oversized_group(self, sentences: list[str]) -> list[list[str]]: """Split a group that exceeds max_chunk_tokens.""" result = [] current = [] current_tokens = 0 for sentence in sentences: stokens = self._estimate_tokens(sentence) if current_tokens + stokens > self.max_chunk_tokens and current: result.append(current) current = [] current_tokens = 0 current.append(sentence) current_tokens += stokens if current: result.append(current) return result def _add_overlap(self, groups: list[list[str]]) -> list[str]: """Convert sentence groups into text chunks with overlap.""" chunks = [] for i, group in enumerate(groups): parts = list(group) # Prepend overlap from previous group if i > 0 and self.overlap_tokens > 0: prev_sentences = groups[i - 1] overlap_text = [] token_count = 0 for s in reversed(prev_sentences): stokens = self._estimate_tokens(s) if token_count + stokens > self.overlap_tokens: break overlap_text.insert(0, s) token_count += stokens if overlap_text: parts = overlap_text + parts chunks.append(" ".join(parts)) return chunks def chunk(self, text: str, source: str = "") -> list[Chunk]: """Main entry point. Returns a list of Chunk objects.""" sections = self._split_into_sections(text) all_chunks = [] idx = 0 for heading, body in sections: sentences = self._split_into_sentences(body) if not sentences: continue # Find semantic breakpoints breakpoints = self._find_semantic_breakpoints(sentences) # Group sentences by breakpoints groups = [] prev = 0 for bp in breakpoints: groups.append(sentences[prev:bp]) prev = bp groups.append(sentences[prev:]) # Merge groups that are too small groups = self._merge_small_groups(groups) # Split groups that are too large final_groups = [] for g in groups: if self._estimate_tokens(" ".join(g)) > self.max_chunk_tokens: final_groups.extend(self._split_oversized_group(g)) else: final_groups.append(g) # Add overlap and build Chunk objects chunk_texts = self._add_overlap(final_groups) for chunk_text in chunk_texts: all_chunks.append( Chunk( text=chunk_text, section=heading, index=idx, token_estimate=self._estimate_tokens(chunk_text), metadata={"source": source, "section": heading}, ) ) idx += 1 return all_chunks # example_usage.py from semantic_chunker import SemanticChunker chunker = SemanticChunker( max_chunk_tokens=512, min_chunk_tokens=50, overlap_tokens=64, similarity_threshold=0.45, ) document = """ # Introduction to Vector Databases Vector databases store high-dimensional embeddings and enable similarity search. They are the backbone of modern RAG systems. Unlike traditional databases that match on exact values, vector DBs find the closest neighbors in embedding space. # How Indexing Works Most vector databases use approximate nearest neighbor (ANN) algorithms. HNSW (Hierarchical Navigable Small World) is the most popular choice in 2026. It builds a multi-layer graph where each node connects to its nearest neighbors. Query time is logarithmic, which matters when you have millions of vectors. The trade-off is memory. HNSW indexes can consume 2-4x the size of the raw vectors. For a collection of 10 million 768-dimensional float32 vectors, that is roughly 30 GB of raw data and 60-120 GB with the index. # Choosing the Right Database Pinecone offers a managed experience with minimal ops overhead. Weaviate and Qdrant give you more control but require self-hosting. pgvector is worth considering if your team already runs PostgreSQL and your dataset is under 5 million vectors. For most production RAG systems, we recommend starting with a managed service and migrating to self-hosted once you understand your access patterns. """ chunks = chunker.chunk(document, source="vector-db-guide.md") for chunk in chunks: print(f"\n--- Chunk {chunk.index} [{chunk.section}] ({chunk.token_estimate} tokens) ---") print(chunk.text[:200] + "..." if len(chunk.text) > 200 else chunk.text) Running this produces chunks that respect section boundaries, split at semantic shifts within sections, and carry overlap from the previous chunk so no information gets lost at boundaries. I spent two days tuning these parameters across 4 different document types. Here's what I landed on: similarity_threshold (0.3–0.6): This controls how sensitive the chunker is to topic shifts. Lower values mean fewer breaks (bigger chunks). Higher values mean more breaks (smaller chunks). I use 0.45 for general business docs, 0.35 for legal contracts (they stay on-topic longer), and 0.55 for knowledge bases with many small topics. overlap_tokens (32–128): The overlap prevents information from falling into cracks between chunks. 64 tokens is the sweet spot for most content. Go higher (96-128) for documents where a sentence at the end of one section sets up the next. Don't go below 32 — at that point, the overlap is too small to provide context. max_chunk_tokens (256–1024): Smaller chunks (256) give better precision in retrieval but require more chunks in the context window. Larger chunks (512-1024) carry more context per retrieval hit but risk diluting relevance. I default to 512 and only go smaller when precision is more important than context. I ran both strategies against a set of 500 queries on a 12,000-document corpus of technical documentation. Retrieval was top-5 with cosine similarity, embeddings from all-MiniLM-L6-v2: # benchmark.py from semantic_chunker import SemanticChunker import time def fixed_chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]: """Baseline fixed-size chunker for comparison.""" words = text.split() chunks = [] # Convert token targets to approximate word counts step = size * 3 // 4 # ~tokens to words olap = overlap * 3 // 4 i = 0 while i < len(words): end = min(i + step, len(words)) chunks.append(" ".join(words[i:end])) i += step - olap return chunks # Example comparison on a single document sample_doc = open("sample_technical_doc.md").read() start = time.perf_counter() fixed = fixed_chunk(sample_doc) fixed_time = time.perf_counter() - start chunker = SemanticChunker() start = time.perf_counter() semantic = chunker.chunk(sample_doc) semantic_time = time.perf_counter() - start print(f"Fixed: {len(fixed)} chunks in {fixed_time:.3f}s") print(f"Semantic: {len(semantic)} chunks in {semantic_time:.3f}s") print(f"Overhead: {semantic_time / fixed_time:.1f}x slower") Results from my runs: Metric Fixed-512 Semantic Retrieval precision@5 0.71 0.86 Avg chunk size (tokens) 512 387 Chunks per document 14.2 18.6 Indexing time (12k docs) 8 min 23 min Semantic chunking is roughly 3x slower to index. But you index once and query thousands of times. The 15-point precision gain pays for itself on the first real user query. One thing that tripped me up for longer than I'd like to admit — code blocks. If you're chunking technical docs, your sentence splitter will happily tear a Python function in half at the first period it finds inside a docstring. The chunker above handles this by detecting fenced blocks and protecting them from sentence splitting. But watch out for inline code with periods (like `numpy.array` or `os.path.join`). Those can still cause false sentence breaks if your splitter is too aggressive. I considered using a proper NLP sentence tokenizer (spaCy or NLTK), but they add heavy dependencies and still struggle with code-heavy text. The regex approach in the chunker above isn't perfect, but it covers 95% of cases without adding 200 MB of model downloads. ## Where This Fits in the Pipeline This chunker is one piece of a production RAG system. I wrote about [the 5 failure patterns that kill RAG deployments](https://www.velsof.com/blog/why-your-rag-system-works-in-demo-but-fails-in-production) — chunking is failure pattern #1, but it's not the only one. The full pipeline looks like this: 1. **Ingest** → parse documents (PDF, HTML, Markdown) 2. **Chunk** → this semantic chunker 3. **Embed** → sentence transformer or OpenAI embeddings 4. **Index** → vector DB (Qdrant, Pinecone, pgvector) 5. **Retrieve** → hybrid search (vector + BM25) 6. **Rerank** → cross-encoder to filter top results 7. **Generate** → LLM with the reranked context If you need help building out steps 5-7 or integrating this into an existing [RAG solution](https://www.velsof.com/rag-solutions), that's exactly what my team at [Velocity Software Solutions](https://www.velsof.com/llm-integration) does day-to-day. ## Try It Yourself Grab the code, point it at your own documents, and compare retrieval precision against fixed-size chunks. I'd bet the difference surprises you — it surprised me, and I was the one who wrote it. The code is intentionally framework-free. No LangChain, no LlamaIndex. If you want to plug it into either of those later, wrap the `chunk()` method in their document transformer interface. But start without the framework. Understand what every line does. Then decide if you need the abstraction.