arxiv_cs_lg April 24, 2026

Graph-Theoretic Models for the Prediction of Molecular Measurements

Translated: 2026/4/24 19:54:34

graph-theory, molecule-prediction, chemical-informatics, machine-learning, qsp

Original Content

arXiv:2604.19840v1 Announce Type: new

Abstract: Graph-theoretic approaches offer simplicity, interpretability, and low computational cost for molecular property prediction. Among these, the model proposed by Mukwembi and Nyabadza, based on the external activity $D(G)$ and internal activity $\zeta(G)$ indices, achieved strong results on a small flavonoid dataset. However, its ability to generalize to larger and chemically diverse datasets has not been tested. This study evaluates the baseline $D(G)$-$\zeta(G)$ polynomial model on five benchmark datasets from MoleculeNet, covering biological activity (BACE, 1,513 molecules), lipophilicity (LogP synthetic, 14,610 molecules; LogP experimental, 753 molecules), aqueous solubility (ESOL, 1,128 molecules), and hydration free energy (SAMPL, 642 molecules). The baseline model achieves an average $R^2 = 0.24$, confirming limited transferability. To address this, a systematic enhancement framework is proposed, progressively incorporating Ridge regularization, additional graph descriptors, physicochemical properties, ensemble learning with Gradient Boosting, Lasso feature selection, and a hybrid approach combining topological indices with Morgan fingerprints. The enhanced models raise the average best $R^2$ to 0.79, with individual improvements ranging from 165% to 274%. All improvements are statistically significant ($p < 0.001$). A direct comparison with a Graph Convolutional Network under identical experimental conditions shows that the enhanced classical models match or outperform deep learning on all five datasets. Comparison with the recent GNN+PGM hybrid of Djagba et al. further confirms competitiveness, with the enhanced models achieving the best results on two datasets and tying on one. The entire framework requires no GPU, trains in under five minutes, and uses only open-source tools, making it accessible for researchers in resource-limited settings.
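To make the baseline and the reported metrics concrete, here is a minimal sketch of a degree-2 polynomial model in two descriptors, fitted with a Ridge penalty and scored with $R^2$. It is not the authors' implementation. The quadratic feature set, the penalty value, the toy descriptor grid, the synthetic target, and every function name are assumptions made for illustration. The abstract also does not define $D(G)$ or $\zeta(G)$, so the descriptors are treated here as precomputed numbers.

```python
"""Sketch: degree-2 polynomial ridge fit on two graph descriptors, scored by R^2.

Assumptions (not from the paper): the quadratic feature set, the ridge
penalty value, and the toy descriptor values below are illustrative only.
"""


def poly2_features(d, z):
    """Expand two descriptor values (d, z) into degree-2 polynomial features."""
    return [1.0, d, z, d * d, d * z, z * z]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(rows, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) w = X^T y.

    The intercept column (index 0) is left unpenalized.
    """
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    for i in range(1, p):
        xtx[i][i] += lam
    return solve_linear(xtx, xty)


def predict(w, rows):
    """Evaluate the fitted polynomial on each feature row."""
    return [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_improvement(r2_before, r2_after):
    """Percent change in R^2 relative to the starting value (assumed definition)."""
    return 100.0 * (r2_after - r2_before) / r2_before


# Toy data: hypothetical descriptor values on a 5x5 grid with a synthetic target.
pairs = [(float(d), float(z)) for d in range(1, 6) for z in range(1, 6)]
rows = [poly2_features(d, z) for d, z in pairs]
y = [0.5 + 0.3 * d - 0.2 * z + 0.05 * d * z for d, z in pairs]

w = fit_ridge(rows, y, lam=1e-6)
print(round(r2_score(y, predict(w, rows)), 4))
print(round(relative_improvement(0.24, 0.79), 1))
```

Applied to the reported averages (0.24 to 0.79), the relative-change formula gives roughly 229%, which falls inside the reported per-dataset range of 165% to 274%. Whether the paper computes its percentages this way is an assumption. In practice the hand-written solver would be replaced by a library Ridge estimator. It is written out here only to keep the sketch dependency-free.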