arxiv_cs_ai, April 24, 2026

BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

Tags: biominer, cheminformatics, bioactivity, ligand-structure, multimodal-ai

Original Content

arXiv:2604.21508v1 Announce Type: new Abstract: Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with the rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., from Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm: multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. On this benchmark, BioMiner demonstrates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 data entries from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
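
To make the triplet-level F1 score concrete, the following is a minimal sketch of how such a metric could be computed. The abstract does not specify the matching criteria, so this assumes a triplet is (protein, ligand, activity) and counts a prediction as correct only when all three fields match a gold entry exactly. The `Triplet` type, the `triplet_f1` function, and the toy values are hypothetical and are not taken from the paper or the BioVista benchmark.

```python
from typing import Iterable, NamedTuple, Tuple


class Triplet(NamedTuple):
    """Hypothetical bioactivity triplet; field layout is an assumption."""
    protein: str   # target name or identifier
    ligand: str    # ligand structure, assumed canonicalized upstream (e.g. SMILES)
    activity: str  # normalized activity string, e.g. "IC50=12nM"


def triplet_f1(predicted: Iterable[Triplet],
               gold: Iterable[Triplet]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match set comparison."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)  # triplets matching on all three fields
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data for illustration only; not drawn from any real publication.
gold = [
    Triplet("NLRP3", "CCO", "IC50=12nM"),
    Triplet("NLRP3", "c1ccccc1", "IC50=40nM"),
    Triplet("NLRP3", "CC(=O)O", "IC50=5nM"),
]
predicted = [
    Triplet("NLRP3", "CCO", "IC50=12nM"),       # correct on all three fields
    Triplet("NLRP3", "c1ccccc1", "IC50=400nM"), # wrong activity value
]

precision, recall, f1 = triplet_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, a single wrong field (here the activity value) costs both precision and recall, which suggests why triplet-level scores can be low even when individual fields are often extracted correctly. A real evaluation would likely need structure-aware ligand comparison and unit normalization rather than plain string equality.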