Back to list
データクエリ16.8M SIRENE事業所を66msで実行
How We Query 16.8M SIRENE Establishments in 66ms
Translated: 2026/3/7 13:03:13
Japanese Translation
フランスのSIRENEデータベースには、フランスに登録されたすべての企業情報が含まれており、その数は約3,000万件です。このうち、アクティブな約1680万件を取り込みGEOREFERに導入しました。これにより名前で検索するまでを、小数点1秒以内に行うことが期待できます。
Original Content
The French SIRENE database contains information about every registered business in France — over 30 million establishments. We imported 16.8 million active ones into GEOREFER and needed to make them searchable by name in under 100ms.
Here's how we did it with PostgreSQL 16 and pg_trgm.
Our establishment table has 16.8 million rows. Users need to search by:
SIREN (9 digits) — exact match, trivial with a B-tree index
SIRET (14 digits) — exact match, same
Company name — fuzzy match, this is where it gets interesting
The name search needs to handle:
Partial matches: "Total" should find "TotalEnergies SE"
Typos: "Miclein" should find "Michelin"
Accent insensitivity: "Societe Generale" should match "Societe Generale"
First attempt:
SELECT * FROM georefer.establishment
WHERE company_name ILIKE '%total%'
LIMIT 25;
EXPLAIN ANALYZE result:
Seq Scan on establishment
Filter: (company_name ~~* '%total%')
Rows Removed by Filter: 16799975
Planning Time: 0.1ms
Execution Time: 12,847ms
12.8 seconds. Full sequential scan on 16.8M rows. Unusable.
PostgreSQL's pg_trgm extension breaks strings into trigrams (3-character sequences) and uses GIN indexes to find similar strings efficiently.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_establishment_name_trgm
ON georefer.establishment
USING GIN (company_name gin_trgm_ops);
Now with trigram similarity:
SELECT *, similarity(company_name, 'total') as sim
FROM georefer.establishment
WHERE company_name % 'total'
ORDER BY sim DESC
LIMIT 25;
EXPLAIN ANALYZE:
Bitmap Heap Scan on establishment
Recheck Cond: (company_name % 'total')
-> Bitmap Index Scan on idx_establishment_name_trgm
Index Cond: (company_name % 'total')
Planning Time: 0.3ms
Execution Time: 66ms
66ms. From 12.8 seconds to 66ms — a 194x improvement.
Importing 16.8M rows isn't straightforward. We use a 3-phase approach:
CREATE TABLE IF NOT EXISTS georefer.establishment (
id SERIAL PRIMARY KEY,
siren VARCHAR(9) NOT NULL,
siret VARCHAR(14) NOT NULL UNIQUE,
company_name VARCHAR(255),
commercial_name VARCHAR(255),
legal_form VARCHAR(10),
naf_code VARCHAR(6),
employee_range VARCHAR(5),
postal_code VARCHAR(5),
city VARCHAR(100),
department_code VARCHAR(3),
is_headquarters BOOLEAN DEFAULT FALSE,
is_active BOOLEAN DEFAULT TRUE,
created_date DATE,
last_update DATE
);
We use PostgreSQL COPY for maximum throughput:
COPY georefer.establishment
(siren, siret, company_name, ...)
FROM '/tmp/sirene_active.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
COPY is 10-50x faster than batch INSERT for bulk loading. The 16.8M rows load in about 8 minutes.
We create indexes AFTER the bulk import (creating them before would slow the import):
-- Exact lookups
CREATE INDEX idx_establishment_siren ON georefer.establishment(siren);
CREATE INDEX idx_establishment_siret ON georefer.establishment(siret);
-- Geographic filtering
CREATE INDEX idx_establishment_postal ON georefer.establishment(postal_code);
CREATE INDEX idx_establishment_dept ON georefer.establishment(department_code);
CREATE INDEX idx_establishment_city ON georefer.establishment(city);
-- Fuzzy name search
CREATE INDEX idx_establishment_naf ON georefer.establishment(naf_code);
CREATE INDEX idx_establishment_name_trgm ON georefer.establishment USING GIN (company_name gin_trgm_ops);
The real power comes from combining trigram search with geographic filters:
SELECT *, similarity(company_name, 'boulangerie') as sim
FROM georefer.establishment
WHERE company_name % 'boulangerie'
AND department_code = '75'
AND is_active = true
ORDER BY sim DESC
LIMIT 25;
This returns all bakeries in Paris in ~45ms, even across 16.8M rows.
The Spring Boot service exposes this via REST:
# Search by SIREN
curl 'https://georefer.io/geographical_repository/v1/companies?siren=552120222' \
-H 'X-Georefer-API-Key: YOUR_API_KEY'
# Search by name + department
curl 'https://georefer.io/geographical_repository/v1/companies/search?name=michelin&department_code=63' \
-H 'X-Georefer-API-Key: YOUR_API_KEY'
The response includes all SIRENE fields:
{
"success": true,
"data": [
{
"siren": "855200507",
"siret": "85520050700046",
"company_name": "MANUFACTURE FRANCAISE DES PNEUMATIQUES MICHELIN",
"naf_code": "22.11Z",
"employee_range": "5000+",
"postal_code": "63000",
"city": "CLERMONT-FERRAND",
"is_headquarters": true
}
]
}
Query Type
Before (no index)
After (pg_trgm)
Improvement
Name search
12,847ms
66ms
194x
Name + dept filter
13,102ms
45ms
291x
SIREN exact
8,200ms
0.3ms
27,333x
SIRET exact
8,150ms
0.2ms
40,750x
Always create indexes after bulk import — creating them before can make the import 10x slower
pg_trgm GIN indexes use a lot of disk — our 16.8M row trigram index is ~2.3 GB
Set maintenance_work_mem high during index creation — SET maintenance_work_mem = '1GB' cuts index creation time in half
COPY beats INSERT every time for bulk loading — use COPY for anything over 10K rows
GEOREFER exposes 16.8M SIRENE establishments through a simple REST API:
Free tier: 100 req/day, no credit card
Docs: https://georefer.io/docs
Sign up: https://georefer.io/#signup
AZMORIS Engineering — "Software that Endures"