arxiv_cs_ai February 10, 2026

CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection

Tags: phishing, quishing, machine-learning, cnn

Original Content

arXiv:2602.09015v1 Announce Type: cross Abstract: Phishing attacks represent one of the primary attack methods used by cyber attackers. In many cases, attackers pair deceptive emails with malicious attachments to trick users into giving away sensitive information or installing malware, which can compromise entire systems. The flexibility of malicious email attachments makes them a preferred vector for attackers, as they can embed harmful content such as malware or malicious URLs inside standard document formats. Although phishing email defenses have improved considerably, attackers continue to abuse attachments, enabling malicious content to bypass security measures. Moreover, researchers training advanced models face another challenge: the lack of a unified and comprehensive dataset that covers the most prevalent data types. To address this gap, we generated CIC-Trap4Phish, a multi-format dataset containing both malicious and benign samples across five categories commonly used in phishing campaigns: Microsoft Word documents, Excel spreadsheets, PDF files, HTML pages, and QR code images. For the first four file types, a set of execution-free static feature pipelines was proposed, designed to capture structural, lexical, and metadata-based indicators without the need to open or execute files. Feature selection was performed using a combination of SHAP analysis and feature importance, yielding compact, discriminative feature subsets for each file type. The selected features were evaluated using lightweight machine learning models, including Random Forest, XGBoost, and Decision Tree. All models demonstrate high detection accuracy across formats. For QR code-based phishing (quishing), two complementary methods were implemented: image-based detection using Convolutional Neural Networks (CNNs) and lexical analysis of decoded URLs using recent lightweight language models.
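
To make the idea of execution-free static feature extraction concrete, the following is a minimal sketch for the HTML case only. It is not the authors' pipeline: the abstract does not list the actual features, so the class name `StaticHtmlFeatures`, the function `extract_features`, and the five counters are hypothetical choices for illustration. The sketch uses only Python's standard `html.parser` module, which tokenizes markup without rendering the page or running any script.

```python
from html.parser import HTMLParser


class StaticHtmlFeatures(HTMLParser):
    """Counts structural indicators while parsing; nothing is rendered or executed.

    The feature names are illustrative assumptions, not the paper's feature set.
    """

    def __init__(self):
        super().__init__()
        self.features = {
            "num_forms": 0,
            "num_password_inputs": 0,
            "num_scripts": 0,
            "num_iframes": 0,
            "num_absolute_links": 0,
        }

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = dict(attrs)
        if tag == "form":
            self.features["num_forms"] += 1
        elif tag == "input":
            if (attr.get("type") or "").lower() == "password":
                self.features["num_password_inputs"] += 1
        elif tag == "script":
            self.features["num_scripts"] += 1
        elif tag == "iframe":
            self.features["num_iframes"] += 1
        elif tag == "a":
            href = (attr.get("href") or "").lower()
            if href.startswith(("http://", "https://")):
                self.features["num_absolute_links"] += 1


def extract_features(html_text: str) -> dict:
    """Return a flat dict of static counts for one HTML document."""
    parser = StaticHtmlFeatures()
    parser.feed(html_text)
    parser.close()
    return parser.features


# Small inline sample; example.test is a reserved, non-routable test domain.
sample = (
    '<form action="http://example.test/login">'
    '<input type="password" name="p"></form>'
    '<a href="https://example.test/">link</a>'
    "<script>alert(1)</script>"
)
features = extract_features(sample)
print(features)
```

A flat dictionary like this could be turned into one row of a feature matrix and passed to a tree-based classifier such as the Random Forest, XGBoost, or Decision Tree models named in the abstract. Which features survive selection would depend on the SHAP and feature-importance analysis, which this sketch does not attempt.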