dev_to, April 20, 2026

Stop Benchmarking Embedding Models. 90% of Your Search Quality Lives Upstream.

pgvector, llm-pipelines, semantic-search, embedding-models, postgresql-architecture


Brief intro on context. I'm CTO at Vaultt (formerly StudentVenture), a recruitment marketplace for top 1% non-traditional talent. 10,000+ candidate profiles, semantic matching in production for over a year. We run everything on pgvector inside our main Postgres database, with LLM-generated summaries as the embedding input and hybrid filtering at query time.

Every few months a new "best embedding model in 2026" benchmark lands, and founders ask me the same question: should we be using model X instead of model Y?

Almost always, it's the wrong question. Here's why, with numbers from our own pipeline.

Same candidate corpus. Same eval set of real recruiter queries (not synthetic ones, actual text recruiters type). Same scoring rule: did the top 10 retrieved candidates contain the people the recruiter ended up interviewing?

I ran it across five embedding models spanning the cost and capability range: an open-source model, Google's gemini-embedding-001, OpenAI text-embedding-3-large, Voyage voyage-3.5-large, and a smaller on-prem option.

Best model to worst model: a 7-point spread. Within the range you'd see between two random seeds on the same model.

Same eval. Kept the worst model in place. Changed only one thing: what text I passed to the embedder.

Version 1: raw profile data, serialized field by field. Name, bio, skills array, experience array, current role.

Version 2: an LLM-generated structured summary. For each candidate, we run a one-time pipeline at ingestion that parses PDF portfolios, OCRs image-based projects, reads their CV, combines it with their self-written bio, and produces a single natural-language paragraph describing what this person actually does and is good at. That paragraph is what we embed.

Quality delta on retrieval: 40 points.

One variable. Upstream of the model call. Forty points.

Embedding models are trained to place semantically similar text close in vector space. The operative word is text.
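To make the two input versions concrete, here is a minimal sketch of how each might be built. The field names, the prompt wording, and the `llm` callable are all assumptions for illustration; the post does not show Vaultt's actual schema, prompt, or model client.

```python
import json
from typing import Callable


def serialize_raw(profile: dict) -> str:
    """Version 1: dump the stored record field by field."""
    return "\n".join(f"{key}: {json.dumps(value)}" for key, value in profile.items())


def build_summary_prompt(profile: dict, extracted_docs: list[str]) -> str:
    """Build the prompt for the one-time ingestion call (hypothetical wording)."""
    sources = "\n\n".join(extracted_docs)
    return (
        "Write one natural-language paragraph describing what this person "
        "actually does and is good at. Use only the material below.\n\n"
        f"Self-written bio:\n{profile.get('bio', '')}\n\n"
        f"Profile fields:\n{serialize_raw(profile)}\n\n"
        f"Text extracted from CV, portfolio PDFs, and OCR:\n{sources}"
    )


def summary_for_embedding(
    profile: dict,
    extracted_docs: list[str],
    llm: Callable[[str], str],  # assumed: any function mapping prompt -> completion
) -> str:
    """Version 2: the LLM's paragraph, whitespace-normalized, is what gets embedded."""
    paragraph = llm(build_summary_prompt(profile, extracted_docs))
    return " ".join(paragraph.split())
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular provider SDK, and makes it easy to swap in a stub when testing the pipeline.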
Feed them a JSON blob with redundant fields, inconsistent formatting, and no narrative coherence, and you get a vector that describes "a JSON blob from a recruitment platform." Feed them a clean description of a human's skills and work, and you get a vector that describes the human.

MTEB, BEIR, and every benchmark on the leaderboards you're staring at assume clean, purposeful text as input. That's an implicit assumption most production pipelines violate from the first day.

The code is the easy part. The architectural decisions that mattered:

Spend your expensive model at ingestion, not at query time. We call a strong LLM once per candidate to build the structured summary. We call a cheap embedding model on that summary. At query time we use cheap embeddings on the query string and a vector distance lookup. The expensive LLM work is amortized across every future search the candidate ever appears in.

Store vectors in Postgres, not in a dedicated vector DB. We already had Postgres. pgvector with HNSW indexing handles 10k+ vectors at sub-20ms query latency. We get one backup strategy, one permission model, one ORM, and hybrid queries that filter on structured columns and sort on vector distance in a single SQL statement. A dedicated vector store would buy us nothing at this scale. The "pgvector doesn't scale" cliff kicks in north of 50M vectors, and by then you can shard or migrate.

Filter structured data, embed unstructured meaning. Location, availability, role type, time zone: plain Postgres columns with B-tree indexes. "Kind of thinker this person is," "shape of their portfolio": embedding. Composing them is a five-line WHERE clause plus an ORDER BY on vector cosine distance.

Google gemini-embedding-001: $0.006 per million tokens. Voyage voyage-3.5-large: $0.18 per million tokens. 30x. On our corpus that's the difference between roughly $4 per month and $120 per month. Small money at our scale.
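The arithmetic behind those figures can be checked in a few lines. The per-token prices and the $4 monthly figure are the ones quoted in the post; the monthly token volume is not stated anywhere, so the sketch below back-derives it from the $4 figure as an assumption.

```python
# USD per million tokens, as quoted in the post.
PRICE_PER_M_TOKENS = {
    "gemini-embedding-001": 0.006,
    "voyage-3.5-large": 0.18,
}


def monthly_cost(tokens_per_month: float, price_per_m: float) -> float:
    """Embedding spend for a month at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_m


cheap_price = PRICE_PER_M_TOKENS["gemini-embedding-001"]
pricey_price = PRICE_PER_M_TOKENS["voyage-3.5-large"]

ratio = pricey_price / cheap_price

# Assumption: the volume at which the cheaper model costs about $4 per month.
implied_tokens = 4 / cheap_price * 1_000_000

cheap = monthly_cost(implied_tokens, cheap_price)
pricey = monthly_cost(implied_tokens, pricey_price)

print(f"ratio: {ratio:.0f}x, implied volume: {implied_tokens / 1e6:.0f}M tokens/month")
print(f"${cheap:.0f} vs ${pricey:.0f} per month")
```

Because pricing is linear in tokens, the 30x ratio holds at any volume; only the absolute dollar gap grows with corpus size and refresh frequency.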
At 1M profiles with weekly refreshes, it's thousands versus tens of thousands. For single-digit quality deltas in benchmarks that don't fully represent your actual retrieval task.

If you run semantic search in production and haven't done this, here's the order of operations.

1. Build an eval set from real user queries. Not synthetic ones. 50 queries with known-good results is enough to start. If you can't measure a change, you'll optimize vibes.
2. Run your current retrieval against it. Write down the number.
3. Improve the input: rework the text you pass to the embedder, then re-run the eval.
4. Only after you've exhausted input improvements, consider a different model. This is where your 7 points live.

Nine out of ten teams I've worked with do step 4 first. They spend weeks, see marginal improvement, and never loop back to the data prep that would have been 10x the ROI.

AI systems degrade in predictable places, and almost always it's upstream of the flashy model call. Data cleanliness. Input construction. Query understanding. Eval rigor. These are the unsexy parts. They're also where the quality lives.

The embedding leaderboard is a local maximum. Productive-looking work that moves the needle by single digits. The 40-point improvement is a preprocessing pass away, and most teams never take it because the preprocessing doesn't come with a blog post.

Stop comparing models. Fix what you're feeding them.
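Building an eval set and running retrieval against it can be sketched as a very small harness. This is a sketch under assumptions: the post describes the scoring rule only in words, so recall within the top 10 is one reading of it, and the `retrieve` callable, the `candidates` table, and its column names are hypothetical. The SQL string shows the kind of hybrid pgvector query such a retriever might run (`<=>` is pgvector's cosine-distance operator); it is not executed here.

```python
from typing import Callable, Iterable

# Illustrative hybrid query: filter on plain columns, order by cosine distance.
# Table and column names are assumptions, not the author's schema.
HYBRID_QUERY = """
SELECT id
FROM candidates
WHERE role_type = %(role_type)s
  AND time_zone = %(time_zone)s
ORDER BY embedding <=> %(query_vector)s::vector
LIMIT 10
"""


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of known-good candidates that appear in the top k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one known-good result")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(
    eval_set: Iterable[tuple[str, set[str]]],  # (query text, ids actually interviewed)
    retrieve: Callable[[str], list[str]],      # query text -> ranked candidate ids
    k: int = 10,
) -> float:
    """Mean recall@k over the eval set: the single number to write down."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    if not scores:
        raise ValueError("eval set is empty")
    return sum(scores) / len(scores)
```

Keeping `retrieve` as a parameter means the same eval set can score the current pipeline, a changed input format, or a different embedding model without touching the harness.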