arxiv_cs_ai April 24, 2026

Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

fhir llm medication-reconciliation healthcare-ai clinical-data-serialization

Original Content

arXiv:2604.21076v1 Announce Type: cross

Abstract: Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).
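To make the four serialisation strategies and the precision/recall comparison concrete, here is a minimal Python sketch. It is not the authors' pipeline: the abstract does not give the actual templates, prompts, or matching rules, so the sample records, the function names (`to_raw_json`, `to_markdown_table`, `to_clinical_narrative`, `to_chronological_timeline`, `active_medications`, `score`), the sentence wording, and the exact-string matching used for scoring are all assumptions made for illustration. Only the FHIR R4 `MedicationRequest` field names (`status`, `medicationCodeableConcept`, `authoredOn`, `dosageInstruction`) come from the FHIR specification.

```python
import json

# Hypothetical, hand-written FHIR R4 MedicationRequest resources
# (illustrative only; not taken from the paper's synthetic benchmark).
SAMPLE_REQUESTS = [
    {"resourceType": "MedicationRequest", "status": "active",
     "medicationCodeableConcept": {"text": "Metformin 500 mg oral tablet"},
     "authoredOn": "2025-01-10",
     "dosageInstruction": [{"text": "1 tablet twice daily"}]},
    {"resourceType": "MedicationRequest", "status": "stopped",
     "medicationCodeableConcept": {"text": "Lisinopril 10 mg oral tablet"},
     "authoredOn": "2024-06-02",
     "dosageInstruction": [{"text": "1 tablet once daily"}]},
    {"resourceType": "MedicationRequest", "status": "active",
     "medicationCodeableConcept": {"text": "Atorvastatin 20 mg oral tablet"},
     "authoredOn": "2024-11-20",
     "dosageInstruction": [{"text": "1 tablet at bedtime"}]},
]


def _fields(req):
    """Extract (name, status, authored date, dosage text) from one resource."""
    name = req["medicationCodeableConcept"]["text"]
    doses = req.get("dosageInstruction") or [{}]
    dose = doses[0].get("text", "unspecified dose")
    return name, req["status"], req["authoredOn"], dose


def to_raw_json(requests):
    """Raw JSON: pass the FHIR resources through unchanged."""
    return json.dumps(requests, indent=2)


def to_markdown_table(requests):
    """Markdown Table: one row per medication order."""
    rows = ["| Medication | Status | Authored | Dosage |",
            "| --- | --- | --- | --- |"]
    rows += ["| " + " | ".join(_fields(r)) + " |" for r in requests]
    return "\n".join(rows)


def to_clinical_narrative(requests):
    """Clinical Narrative: one plain-English sentence per order (assumed wording)."""
    sentences = []
    for req in requests:
        name, status, authored, dose = _fields(req)
        sentences.append(
            f"{name} ({dose}) was ordered on {authored}; the order is {status}.")
    return " ".join(sentences)


def to_chronological_timeline(requests):
    """Chronological Timeline: orders sorted by ISO date, oldest first."""
    ordered = sorted(requests, key=lambda r: r["authoredOn"])
    lines = []
    for req in ordered:
        name, status, authored, dose = _fields(req)
        lines.append(f"{authored}: {name}, {dose} [{status}]")
    return "\n".join(lines)


def active_medications(requests):
    """Reference answer for reconciliation: names of orders with status 'active'."""
    return [_fields(r)[0] for r in requests if r["status"] == "active"]


def score(predicted, reference):
    """Set-based precision, recall, F1 using exact (case-insensitive) name matching.

    Exact matching is an assumption of this sketch, not the paper's stated method.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(to_clinical_narrative(SAMPLE_REQUESTS))
    # A hypothetical model answer that omits one active medication.
    print(score(["Metformin 500 mg oral tablet"],
                active_medications(SAMPLE_REQUESTS)))
```

In the hypothetical answer at the end, the model lists one of the two active medications and invents none, so precision is 1.0 while recall is 0.5. This is the omission-dominated pattern the abstract reports, where mean precision exceeds mean recall.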