arxiv_cs_cv April 24, 2026

Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Translated: 2026/4/24 19:41:14

geospatial-foundation-models, pretraining, data-diversity, machine-learning, geospatial-ai

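Translation (from the Japanese)

arXiv:2604.21104v1 Announce Type: new Abstract: New geospatial foundation models each introduce a new model architecture and a new pretraining dataset, with the data often sampled under different notions of diversity. Performance differences are mostly attributed to the model architecture or the input modalities, while the role of the pretraining dataset has received little study. To address this gap, we systematically investigated how the geographic composition of pretraining data affects a model's downstream performance. We built global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. The pretraining dataset from Europe outperformed the global and continent-specific pretraining datasets on both global and local downstream evaluations. To identify what drives a pretraining dataset's downstream performance, we analysed 10 pretraining datasets in terms of their diversity across continents, biomes, landcover, and spectral values. Only spectral diversity was strongly correlated with performance; the other factors were weakly correlated. This finding establishes a new dimension of diversity to account for when building a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.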

Original Content

arXiv:2604.21104v1 Announce Type: new Abstract: New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.
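
The abstract does not say how diversity was measured or which correlation statistic was used. As a rough illustration only, the sketch below assumes Shannon entropy over binned values as a diversity score and Spearman rank correlation against a downstream score. The function names, the binning scheme, and the toy numbers are all hypothetical placeholders, not the paper's method or data.

```python
import math
from collections import Counter


def bin_values(values, low, high, n_bins):
    """Map continuous values in [low, high] to integer bin indices."""
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def shannon_entropy(labels):
    """Shannon entropy (in nats) of a sequence of categorical labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy placeholder inputs (NOT from the paper): pixel reflectances in [0, 1]
# for three hypothetical pretraining datasets, plus made-up downstream scores.
toy_pixels = {
    "dataset_a": [0.10, 0.12, 0.11, 0.13],  # narrow spectral range
    "dataset_b": [0.10, 0.35, 0.12, 0.40],  # two clusters
    "dataset_c": [0.05, 0.30, 0.60, 0.90],  # spread across the range
}
toy_scores = {"dataset_a": 0.1, "dataset_b": 0.2, "dataset_c": 0.3}

names = sorted(toy_pixels)
diversity = [shannon_entropy(bin_values(toy_pixels[n], 0.0, 1.0, 4)) for n in names]
scores = [toy_scores[n] for n in names]
rho = spearman(diversity, scores)
```

Rank correlation is used here because it makes no assumption that the relationship between diversity and performance is linear; with only a handful of datasets, any such coefficient should be read cautiously.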