dev_to 2026年3月7日

データクエリ16.8M SIRENE事業所を66msで実行

How We Query 16.8M SIRENE Establishments in 66ms

Translated: 2026/3/7 13:03:13

postgresqlsirene-databasefuzzy-searchgeorefer-api

Japanese Translation

フランスのSIRENEデータベースには、フランスに登録されたすべての企業情報が含まれており、その数は約3,000万件です。このうち、アクティブな約1680万件を取り込みGEOREFERに導入しました。これにより名前で検索するまでを、小数点1秒以内に行うことが期待できます。

Original Content

The French SIRENE database contains information about every registered business in France — over 30 million establishments. We imported 16.8 million active ones into GEOREFER and needed to make them searchable by name in under 100ms. Here's how we did it with PostgreSQL 16 and pg_trgm. Our establishment table has 16.8 million rows. Users need to search by: SIREN (9 digits) — exact match, trivial with a B-tree index SIRET (14 digits) — exact match, same Company name — fuzzy match, this is where it gets interesting The name search needs to handle: Partial matches: "Total" should find "TotalEnergies SE" Typos: "Miclein" should find "Michelin" Accent insensitivity: "Societe Generale" should match "Societe Generale" First attempt: SELECT * FROM georefer.establishment WHERE company_name ILIKE '%total%' LIMIT 25; EXPLAIN ANALYZE result: Seq Scan on establishment Filter: (company_name ~~* '%total%') Rows Removed by Filter: 16799975 Planning Time: 0.1ms Execution Time: 12,847ms 12.8 seconds. Full sequential scan on 16.8M rows. Unusable. PostgreSQL's pg_trgm extension breaks strings into trigrams (3-character sequences) and uses GIN indexes to find similar strings efficiently. CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX idx_establishment_name_trgm ON georefer.establishment USING GIN (company_name gin_trgm_ops); Now with trigram similarity: SELECT *, similarity(company_name, 'total') as sim FROM georefer.establishment WHERE company_name % 'total' ORDER BY sim DESC LIMIT 25; EXPLAIN ANALYZE: Bitmap Heap Scan on establishment Recheck Cond: (company_name % 'total') -> Bitmap Index Scan on idx_establishment_name_trgm Index Cond: (company_name % 'total') Planning Time: 0.3ms Execution Time: 66ms 66ms. From 12.8 seconds to 66ms — a 194x improvement. Importing 16.8M rows isn't straightforward. We use a 3-phase approach: CREATE TABLE IF NOT EXISTS georefer.establishment ( id SERIAL PRIMARY KEY, siren VARCHAR(9) NOT NULL, siret VARCHAR(14) NOT NULL UNIQUE, company_name VARCHAR(255), commercial_name VARCHAR(255), legal_form VARCHAR(10), naf_code VARCHAR(6), employee_range VARCHAR(5), postal_code VARCHAR(5), city VARCHAR(100), department_code VARCHAR(3), is_headquarters BOOLEAN DEFAULT FALSE, is_active BOOLEAN DEFAULT TRUE, created_date DATE, last_update DATE ); We use PostgreSQL COPY for maximum throughput: COPY georefer.establishment (siren, siret, company_name, ...) FROM '/tmp/sirene_active.csv' WITH (FORMAT csv, HEADER true, DELIMITER ','); COPY is 10-50x faster than batch INSERT for bulk loading. The 16.8M rows load in about 8 minutes. We create indexes AFTER the bulk import (creating them before would slow the import): -- Exact lookups CREATE INDEX idx_establishment_siren ON georefer.establishment(siren); CREATE INDEX idx_establishment_siret ON georefer.establishment(siret); -- Geographic filtering CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code); CREATE INDEX idx_establishment_dept ON georefer.establishment(department_code); CREATE INDEX idx_establishment_city ON georefer.establishment(city); -- Fuzzy name search CREATE INDEX idx_establishment_naf ON georefer.establishment(naf_code); CREATE INDEX idx_establishment_name_trgm ON georefer.establishment USING GIN (company_name gin_trgm_ops); The real power comes from combining trigram search with geographic filters: SELECT *, similarity(company_name, 'boulangerie') as sim FROM georefer.establishment WHERE company_name % 'boulangerie' AND department_code = '75' AND is_active = true ORDER BY sim DESC LIMIT 25; This returns all bakeries in Paris in ~45ms, even across 16.8M rows. The Spring Boot service exposes this via REST: # Search by SIREN curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \ -H 'X-Georefer-API-Key: YOUR_API_KEY' # Search by name + department curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \ -H 'X-Georefer-API-Key: YOUR_API_KEY' The response includes all SIRENE fields: { "success": true, "data": [ { "siren": "855200507", "siret": "85520050700046", "company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN", "naf_code": "22.11Z", "employee_range": "5000+", "postal_code": "63000", "city": "CLERMONT-FERRAND", "is_headquarters": true } ] } Query Type Before (no index) After (pg_trgm) Improvement Name search 12,847ms 66ms 194x Name + dept filter 13,102ms 45ms 291x SIREN exact 8,200ms 0.3ms 27,333x SIRET exact 8,150ms 0.2ms 40,750x Always create indexes after bulk import — creating them before can make the import 10x slower pg_trgm GIN indexes use a lot of disk — our 16.8M row trigram index is ~2.3 GB Set maintenance_work_mem high during index creation — SET maintenance_work_mem = '1GB' cuts index creation time in half COPY beats INSERT every time for bulk loading — use COPY for anything over 10K rows GEOREFER exposes 16.8M SIRENE establishments through a simple REST API: Free tier: 100 req/day, no credit card Docs: https://georefer.io/docs Sign up: https://georefer.io/#signup AZMORIS Engineering — "Software that Endures"