arxiv_cs_cv February 10, 2026

Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants

large-language-models, geolocation, historic-land-grants, spatial-analysis, benchmarking

Original Content

arXiv:2508.08266v2 Announce Type: replace-cross Abstract: Virginia's seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared against a GIS analyst baseline, Stanford NER geoparser, Mordecai-3 neural geoparser, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19.2 km (median 12.2 km) at minimal additional cost (~USD 0.20 per grant), outperforming the median LLM by 48.7%. A patentee-name redaction ablation slightly increased error (~7%), showing reliance on textual landmark and adjacency descriptions rather than memorization. The cost-effective gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark. External geocoding tools offer no measurable benefit in this evaluation. These findings demonstrate LLMs' potential for scalable, accurate, cost-effective historical georeferencing.
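
The error figures above are distances in kilometres between predicted and reference coordinates, and the percentage gains are relative reductions in that error. The abstract does not say how distance was computed, so the sketch below assumes great-circle (haversine) distance on a spherical Earth. The function names `haversine_km`, `summarize_errors`, and `relative_improvement` are hypothetical, and the sample coordinates are invented for illustration, not drawn from the paper's test set.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the spherical model is an assumption


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summarize_errors(predicted, truth):
    """Mean and median distance error (km) over paired lists of (lat, lon) tuples."""
    errors = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(predicted, truth)]
    return mean(errors), median(errors)


def relative_improvement(model_err_km, baseline_err_km):
    """Percent reduction in error relative to a baseline error."""
    return 100.0 * (baseline_err_km - model_err_km) / baseline_err_km


# Illustrative coordinates only (not taken from the paper's test set).
predicted = [(37.30, -77.40), (37.55, -77.10)]
truth = [(37.25, -77.50), (37.60, -77.00)]
mean_km, median_km = summarize_errors(predicted, truth)
print(f"mean error: {mean_km:.1f} km, median error: {median_km:.1f} km")

# Figures reported in the abstract: five-call ensemble (19.2 km) vs. median LLM (37.4 km).
print(f"ensemble vs. median LLM: {relative_improvement(19.2, 37.4):.1f}%")
```

With the abstract's figures, `relative_improvement(19.2, 37.4)` evaluates to about 48.7, which matches the reported ensemble gain over the median LLM.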