arxiv_cs_lg February 10, 2026

InfiCoEvalChain: A Blockchain-Based Decentralized Framework for Collaborative LLM Evaluation

llm-evaluation, blockchain, decentralized-system, machine-learning, benchmarking

Original Content

arXiv:2602.08229v1 Announce Type: cross

Abstract: The rapid advancement of large language models (LLMs) demands increasingly reliable evaluation, yet current centralized evaluation suffers from opacity, overfitting, and hardware-induced variance. Our empirical analysis reveals an alarming inconsistency in existing evaluations: the standard deviation across ten repeated runs of a single model on HumanEval (1.67) exceeds the performance gap among the top-10 models on the official leaderboard (0.91), rendering current rankings statistically precarious. To mitigate these instabilities, we propose a decentralized evaluation framework that enables hardware and parameter diversity through large-scale benchmarking across heterogeneous compute nodes. By leveraging a blockchain-based protocol, the framework incentivizes global contributors to act as independent validators, using a robust reward system to ensure evaluation integrity and discourage dishonest participation. This collective verification transforms evaluation from a "centralized black box" into a "decentralized endorsement" in which multi-party consensus and diverse inference environments yield a more stable, representative metric. Experimental results demonstrate that the decentralized evaluation framework reduces the standard deviation across ten runs on the same model to 0.28. This significant improvement over conventional frameworks yields higher statistical confidence in model rankings. We have fully implemented this platform and will soon release it to the community.
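The stability argument in the abstract compares two numbers: the run-to-run standard deviation of one model and the score gap that the leaderboard ranking depends on. The sketch below illustrates that comparison and shows why averaging scores from several independent nodes tends to shrink the spread. It is not the authors' protocol. The function names, the Gaussian noise model, the baseline score of 80.0, and the node count of 36 are all assumptions made for illustration; only the figures 1.67 and 0.91 come from the abstract.

```python
import random
import statistics


def run_std(scores):
    """Sample standard deviation of repeated evaluation scores."""
    return statistics.stdev(scores)


def is_ranking_unstable(run_scores, leaderboard_gap):
    """True when run-to-run noise exceeds the gap the ranking relies on."""
    return run_std(run_scores) > leaderboard_gap


def simulate_single_node(rng, true_score, noise_sd, runs=10):
    """Hypothetical model: each run is the true score plus Gaussian noise."""
    return [rng.gauss(true_score, noise_sd) for _ in range(runs)]


def simulate_consensus(rng, true_score, noise_sd, nodes, runs=10):
    """Each reported run is the mean over `nodes` independent node scores.

    Assumes node noise is independent and identically distributed, which a
    real heterogeneous network would only approximate.
    """
    results = []
    for _ in range(runs):
        node_scores = [rng.gauss(true_score, noise_sd) for _ in range(nodes)]
        results.append(statistics.fmean(node_scores))
    return results


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so the demo is repeatable
    true_score = 80.0        # hypothetical score, not taken from the paper
    noise_sd = 1.67          # single-model standard deviation from the abstract
    gap = 0.91               # top-10 leaderboard gap from the abstract

    single = simulate_single_node(rng, true_score, noise_sd)
    consensus = simulate_consensus(rng, true_score, noise_sd, nodes=36)

    print("single-node std:", round(run_std(single), 2))
    print("consensus std:  ", round(run_std(consensus), 2))
    print("single-node ranking unstable:", is_ranking_unstable(single, gap))
    print("consensus ranking unstable:  ", is_ranking_unstable(consensus, gap))
```

Under the independence assumption, the standard deviation of a mean over n nodes scales as 1/sqrt(n), so the reduction seen here follows from averaging alone. The abstract attributes its measured reduction to multi-party consensus across diverse inference environments, which this toy model does not capture.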