arxiv_cs_ai April 24, 2026

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Translated: 2026/4/24 20:28:56

trace-scope, phishing-detection, url-classification, machine-learning, forensic-analysis

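Translation (from Japanese)

arXiv:2604.21840v1 Announce Type: cross Abstract: Modern phishing campaigns increasingly evade snapshot-based URL classifiers through interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification to an interactive forensics task: an analyst must actively navigate the page while isolating themselves from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that makes this workflow practical at scale. To prevent the observer effect and preserve safety, a sandboxed operator agent drives a real GUI browser, guided by visual motivation, to elicit page behavior; the session is then frozen into an immutable evidence bundle. Separately, an adjudicator agent queries the evidence on demand to verify a MITRE ATT&CK checklist and generates an audit-ready report containing extracted indicators of compromise (IOCs) and a final verdict. Evaluated on 708 reachable URLs from existing datasets, TraceScope achieves 0.94 precision and 0.78 recall, substantially improving recall over three prior visual/reference-based classifiers while producing reproducible, analyst-grade evidence suitable for review. More importantly, to evaluate the system in a practical setting, we manually curated a dataset of real-world phishing emails. The evaluation shows that TraceScope also performs strongly in realistic scenarios, detecting sophisticated phishing attempts that current state-of-the-art defenses fail to identify.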

Original Content

arXiv:2604.21840v1 Announce Type: cross Abstract: Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification toward an interactive forensics task: an analyst must actively navigate the page while isolating themselves from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that operationalizes this workflow at scale. To prevent the observer effect and ensure safety, a sandboxed operator agent drives a real GUI browser guided by visual motivation to elicit page behavior, freezing the session into an immutable evidence bundle. Separately, an adjudicator agent circumvents LLM context limitations by querying evidence on demand to verify a MITRE ATT&CK checklist, and generates an audit-ready report with extracted indicators of compromise (IOCs) and a final verdict. Evaluated on 708 reachable URLs from existing datasets (241 verified phishing from PhishTank and 467 benign from Tranco-derived crawling), TraceScope achieves 0.94 precision and 0.78 recall, substantially improving recall over three prior visual/reference-based classifiers while producing reproducible, analyst-grade evidence suitable for review. More importantly, we manually curated a dataset of real-world phishing emails to evaluate our system in a practical setting. Our evaluation reveals that TraceScope demonstrates superior performance in a real-world scenario as well, successfully detecting sophisticated phishing attempts that current state-of-the-art defenses fail to identify.
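
The decoupling described in the abstract (an operator that freezes a session into an immutable evidence bundle, and a separate adjudicator that queries that bundle on demand against a checklist) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. Every name here (`EvidenceBundle`, `freeze_session`, `adjudicate`, the artifact keys, and the two checklist items) is hypothetical. The checklist items are placeholders rather than actual MITRE ATT&CK techniques, and simple string predicates stand in for the LLM-driven adjudicator.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping

# All names below are hypothetical; none are taken from the TraceScope paper.


@dataclass(frozen=True)
class EvidenceBundle:
    """Read-only record of one browsing session."""
    url: str
    artifacts: Mapping[str, str]  # artifact name -> captured content

    def query(self, name: str) -> str:
        """Return a single artifact on demand; empty string if not captured."""
        return self.artifacts.get(name, "")


def freeze_session(url: str, captured: dict[str, str]) -> EvidenceBundle:
    """Operator side: copy the captured artifacts into a read-only bundle."""
    return EvidenceBundle(url=url, artifacts=MappingProxyType(dict(captured)))


# A checklist item is a label plus a predicate over the bundle.
ChecklistItem = tuple[str, Callable[[EvidenceBundle], bool]]

# Placeholder items (assumed for illustration, not real ATT&CK entries).
CHECKLIST: list[ChecklistItem] = [
    ("credential form present",
     lambda b: 'type="password"' in b.query("final_dom")),
    ("content revealed only after interaction",
     lambda b: b.query("interaction_log") != ""),
]


def adjudicate(bundle: EvidenceBundle, checklist: list[ChecklistItem]) -> dict:
    """Adjudicator side: check each item, fetching only the artifact it needs."""
    findings = {label: check(bundle) for label, check in checklist}
    # Toy decision rule (an assumption): flag only if every item is satisfied.
    verdict = "phishing" if all(findings.values()) else "benign"
    return {"url": bundle.url, "findings": findings, "verdict": verdict}
```

Under these assumptions, calling `adjudicate(freeze_session(url, captured), CHECKLIST)` returns a small report dictionary holding the per-item findings and a verdict. Because the adjudicator only reads from the frozen bundle, it never touches the live page, which is the point of the separation the abstract describes.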