arxiv_cs_cv February 10, 2026

Do Vision Foundation Models Play a Foundational Role in Electron Microscopy Image Segmentation?

Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?

Translated: 2026/3/16 14:05:34

vision-foundation-models, electron-microscopy, image-segmentation, mitochondria, low-rank-adaptation

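Translation

arXiv:2602.08505v1 Announce Type: new Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, this paper examines whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets, focusing on the task of mitochondria segmentation in electron microscopy (EM) images. The study uses two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). Two practical model adaptation regimes are evaluated: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA), in which the VFM is fine-tuned in a way targeted to a specific dataset. Across all backbones, training on a single EM dataset yields good segmentation performance (measured as foreground Intersection-over-Union), and LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets causes severe performance degradation for all models considered, with only marginal gains from PEFT. Probing the latent representation space with several techniques (PCA, Fréchet DINOv2 distance, and linear probes) shows a pronounced and persistent domain mismatch between the two EM datasets despite their visual similarity, which is consistent with the failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.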

Original Content

arXiv:2602.08505v1 Announce Type: new Abstract: Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fréchet DINOv2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
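
The abstract reports segmentation quality as foreground Intersection-over-Union (IoU). As a minimal sketch of that metric, assuming binary masks stored as nested lists of 0/1 values: the function name `foreground_iou`, the mask format, and the handling of two empty masks are illustrative assumptions, and the abstract does not say whether the authors aggregate per image or over a whole dataset.

```python
def foreground_iou(pred, target):
    """Foreground IoU between two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels that are foreground in both masks
    union = 0         # pixels that are foreground in at least one mask
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_fg, t_fg = bool(p), bool(t)
            intersection += p_fg and t_fg
            union += p_fg or t_fg

    # Assumed convention: two entirely empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [1, 1, 0]]

# 2 pixels overlap and 4 pixels are foreground in either mask, so IoU = 0.5
print(foreground_iou(pred, target))
```

Because only foreground pixels enter the ratio, a model that predicts background everywhere scores 0 on any image containing mitochondria, which makes the metric stricter than pixel accuracy on sparse targets.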