dev_to · April 25, 2026


# Building a Production-Ready AI Governance Stack (Part 3/3)

ai-governance · reference-architecture · pre-execution-policy · zero-trust · governance-stack



This is Part 3 of a three-part series on AI governance architecture. In Part 1, we explored the negative proof problem: why signed receipts can't prove that unauthorized actions didn't happen. In Part 2, we examined pre-execution gates that evaluate policy before execution occurs. Today, we'll build a complete reference architecture showing exactly how these components fit together in a production system.

Note: This series explores architectural patterns for AI governance based on regulatory requirements and cryptographic best practices. The layered architecture and code examples presented are conceptual frameworks for educational purposes, adaptable across different tech stacks and deployment environments.

We've established the conceptual foundation for pre-execution governance: evaluate policy before execution rather than after, create denial proofs that demonstrate prevention rather than just detection, and maintain deterministic policy evaluation to enable replay verification. But understanding the pattern conceptually is different from implementing it in a production system where reliability, performance, and maintainability all matter.

The gap between "this makes sense architecturally" and "this works in production" is where most governance initiatives stall out. You start with good intentions, build a proof of concept that validates the core ideas, then hit the messy reality of integrating with existing systems, handling edge cases, managing policy evolution, and operating the whole stack at scale. What you need is a clear architectural blueprint that shows not just what components to build, but how they interact, what each layer is responsible for, and how to evolve the system as requirements change.

This reference architecture represents patterns that work across different tech stacks and deployment environments.
The specific implementation details will vary depending on whether you're running on AWS, Azure, GCP, or on-premises infrastructure, but the layered structure remains the same. Each layer has a specific responsibility, clear boundaries with adjacent layers, and well-defined interfaces that make testing and evolution manageable.

Every request into your AI system passes through a single entry point with no bypass paths. This is architecturally similar to how API gateways work in microservices architectures—you enforce that all traffic flows through one place so you can apply cross-cutting concerns consistently. In this case, the cross-cutting concern is governance evaluation.

The execution router's job is deceptively simple: receive requests, determine which governance pipeline applies based on tenant and folder context, and route to the appropriate evaluation flow. But that simplicity is load-bearing. If there are multiple entry points into your AI execution layer, or if developers can bypass the router by calling model APIs directly, your governance guarantees collapse. The router is only effective if it's mandatory and non-bypassable.

In practice, making the router mandatory means using your infrastructure's access control systems to enforce it. If you're running on AWS, that means IAM policies that prevent Lambda functions from calling Bedrock directly—they have to go through the router. If you're running on Azure, it means managed identities that only grant the router function permission to invoke AI services. If you're running on-premises with direct model access, it means network segmentation that prevents application servers from reaching model APIs without passing through the governance layer.

The router also handles authentication and initial context resolution. Before any governance evaluation happens, you need to know who's making the request and what organizational boundaries it belongs to.
That typically means validating JWT tokens, resolving tenant identifiers from user claims, and loading the folder context that determines which policies apply. This context becomes the foundation for all subsequent policy evaluation. Here's what that looks like structurally:

```python
class ExecutionRouter:
    """
    Single entry point for all AI requests. No bypass paths allowed.
    Infrastructure access controls enforce that all model invocations
    must flow through this router.
    """

    async def route_request(self, request):
        # Step 1: Authentication - who's making this request?
        caller = await self.authenticate(request)

        # Step 2: Context resolution - which tenant/folder?
        context = await self.resolve_context(caller, request)

        # Step 3: Route to appropriate governance pipeline
        # Different tenants or folders might have different policy engines
        pipeline = self.get_pipeline(context.tenant_id, context.folder_id)

        # Step 4: Execute governance evaluation
        # This is where we call Layer 2 (Policy Engine)
        decision = await pipeline.evaluate(request, context)

        # Step 5: Handle the decision
        if decision.verdict == 'DENY':
            return self.handle_denial(decision)
        else:
            return await self.execute_and_receipt(request, decision)
```

The router is stateless and horizontally scalable. Each request is independent, and all the state needed for governance evaluation gets loaded from durable storage systems. This means you can run multiple router instances behind a load balancer without coordination between them, which is essential for handling production-scale traffic.

The policy engine's responsibility is evaluating requests against governance rules and returning an enforcement decision. This is where the actual governance logic lives—all the rules about folder isolation, data classification restrictions, tool access controls, budget limits, and compliance requirements. The key architectural constraint for this layer is that policy evaluation must be deterministic and fast.
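The context resolution in Step 2 can be sketched as a small mapping from already-verified token claims onto an immutable context object. The `RequestContext` shape and claim names below are illustrative assumptions rather than an established API, and token signature verification (e.g., with a JWT library) is assumed to have happened before this point:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestContext:
    """Organizational context resolved before any policy evaluation."""
    user_id: str
    tenant_id: str
    folder_id: str

def resolve_context(claims: dict) -> RequestContext:
    """Map verified JWT claims onto the governance context.

    Fails closed: a request with a missing tenant claim is rejected
    rather than evaluated against some default context.
    """
    try:
        return RequestContext(
            user_id=claims["sub"],
            tenant_id=claims["tenant_id"],
            folder_id=claims.get("folder_id", "root"),
        )
    except KeyError as missing:
        raise PermissionError(f"required claim absent: {missing}")
```

The fail-closed behavior matters: an unresolvable tenant boundary should never silently fall through to a permissive default.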
As we discussed in Part 2, deterministic evaluation enables replay verification, which is how you prove to auditors that denial decisions were legitimate. Fast evaluation means you can run this synchronously on every request without adding unacceptable latency.

To achieve both determinism and speed, the policy engine operates on a snapshot of the policy that's loaded once and cached in memory. When a request comes in for evaluation, the engine doesn't query a database to find out what rules apply—it already has the rules loaded. This eliminates network latency and ensures that the evaluation is deterministic because it's using a fixed policy version rather than potentially fetching different rules on subsequent evaluations.

Policy snapshots are versioned immutably. When you update a policy, you create a new version with a new hash. The old version remains available indefinitely so that denial proofs can be replayed against the exact policy that was in effect when the original decision was made. This versioning is what enables the replay verification workflow that auditors rely on.

The engine evaluates rules in a defined sequence. Some governance frameworks call this a policy decision point, but the concept is straightforward: you have an ordered list of rules, you evaluate them one by one, and the first rule that fires determines the outcome. This sequential evaluation is important because it makes policy behavior predictable and debuggable. You can trace through exactly which rule fired and why, which is essential for both policy development and compliance documentation.

```python
class PolicyEngine:
    """
    Deterministic policy evaluation with immutable versioning.
    """

    def __init__(self, policy_snapshot):
        # Load immutable policy snapshot into memory
        self.policy = policy_snapshot
        self.version_hash = policy_snapshot.hash

    def evaluate(self, request, context):
        # Evaluate rules sequentially until one fires
        for rule in self.policy.rules:
            if rule.condition_matches(request, context):
                # First matching rule determines the decision
                return Decision(
                    verdict=rule.action,  # ALLOW or DENY
                    rule_id=rule.id,
                    policy_version=self.version_hash,
                    reason=rule.reason_template.format(**context),
                    regulatory_basis=rule.citations
                )
        # No explicit rule fired, use default policy
        return Decision(
            verdict=self.policy.default_action,
            policy_version=self.version_hash
        )
```

When you're designing policies for this engine, you need to think carefully about what belongs here versus what belongs in Layer 5 analytics. The policy engine should enforce simple, explicit rules that can be evaluated quickly: folder boundaries, data classification checks, budget gates, allowlists of permitted tools. It should not run machine learning models to detect anomalies, query external APIs that might be slow or unreliable, or implement complex heuristics that might produce different results on subsequent evaluations.

Once the policy engine returns a decision, that decision needs to be captured in a tamper-evident format with cryptographic guarantees. This is where Layer 3 comes in. Its job is to take the decision from Layer 2, add cryptographic signing via a key management service, and store the signed artifact in immutable storage.

The signing step is critical because it's what prevents someone from fabricating denial proofs after the fact. When you use AWS KMS, Azure Key Vault, or Google Cloud KMS for signing, you're leveraging a hardware security module that's designed to make forging signatures computationally infeasible. The governance system calls the signing API with the decision payload, gets back a signature, and bundles them together into the signed proof artifact.
The immutability step is equally critical because it prevents tampering with the audit trail. If you store denial proofs in a regular database where administrators can delete records, an auditor can't trust that the absence of a denial proof means no denial occurred—it could mean the proof was deleted. But if you store denial proofs in S3 with Object Lock in compliance mode, or in Azure Blob Storage with immutable blob retention policies, those proofs become undeletable even by privileged administrators. The only way to "delete" them is to wait for the retention period to expire, which might be seven years for HIPAA data or even longer for other regulatory frameworks.

Batching denial proofs into Merkle trees adds an additional layer of verification efficiency. Instead of requiring auditors to verify thousands of individual signatures, you can batch decisions into hourly or daily trees, compute a root hash, sign that root with KMS, and anchor it to immutable storage. Then auditors can verify the root signature once and use the Merkle proof structure to verify that individual decisions are included in the tree. This pattern scales much better than individual signature verification when you're dealing with high-volume AI systems.

```python
class ProofStorage:
    """
    Cryptographically sign decisions and store immutably.
    """

    async def store_denial(self, decision, request_hash):
        # Create denial proof payload
        proof = DenialProof(
            decision_id=generate_id(),
            request_hash=request_hash,
            verdict='DENY',
            rule_id=decision.rule_id,
            policy_version=decision.policy_version,
            timestamp=utcnow(),
            reason=decision.reason
        )

        # Sign with KMS to prevent forgery
        signature = await kms_client.sign(
            key_id=GOVERNANCE_SIGNING_KEY,
            message=proof.canonical_bytes(),
            algorithm='RSASSA_PKCS1_V1_5_SHA_256'
        )

        # Bundle into signed proof
        signed_proof = SignedDenialProof(
            proof=proof,
            signature=signature,
            key_id=GOVERNANCE_SIGNING_KEY
        )

        # Store in immutable WORM storage
        await s3_client.put_object(
            Bucket=WORM_BUCKET,
            Key=f'denials/{proof.decision_id}.json',
            Body=signed_proof.to_json(),
            ObjectLockMode='COMPLIANCE',
            ObjectLockRetainUntilDate=utcnow() + timedelta(days=2555)  # 7 years
        )

        # Queue for Merkle batching
        await sqs_client.send_message(
            QueueUrl=MERKLE_BATCH_QUEUE,
            MessageBody=proof.decision_id
        )

        return signed_proof
```

The combination of cryptographic signing and immutable storage creates what compliance frameworks call non-repudiation. The organization that generated the denial proof cannot later claim that the proof was fabricated or tampered with, because the KMS signature proves authenticity and the WORM storage proves the proof hasn't been modified since creation.

Having signed denial proofs in immutable storage is valuable, but only if auditors can independently verify them without needing privileged access to your production systems. That's what Layer 4 provides: a public verification endpoint that anyone with a denial proof identifier can use to validate authenticity.

The verification endpoint accepts a denial proof ID, retrieves the corresponding proof from storage, and performs several checks. First, it verifies the KMS signature to confirm the proof hasn't been tampered with. Second, it checks that the proof is actually stored in the WORM bucket with retention policy intact.
Third, if the proof is part of a Merkle batch, it verifies the Merkle inclusion proof showing that the decision is included in a sealed batch. Fourth, it offers a replay endpoint where someone can re-evaluate the decision using the archived policy snapshot to confirm the decision would still be DENY.

This verification endpoint is intentionally designed to work without requiring authentication. Any auditor, regulator, or customer who has a denial proof ID can verify it independently. This is similar to how blockchain verification works—you don't need to trust the organization that created the record, you can verify it yourself using public cryptographic proofs. For compliance purposes, this independent verifiability is what makes denial proofs compelling evidence rather than just self-reported logs.

```python
class VerificationEndpoint:
    """
    Public endpoint for independent verification of denial proofs.
    No authentication required - verification is based on cryptography.
    """

    async def verify_denial(self, proof_id):
        # Retrieve proof from WORM storage
        proof = await self.get_proof(proof_id)

        # Check 1: Verify KMS signature
        signature_valid = await kms_client.verify(
            key_id=proof.key_id,
            message=proof.canonical_bytes(),
            signature=proof.signature,
            algorithm='RSASSA_PKCS1_V1_5_SHA_256'
        )

        # Check 2: Verify WORM retention is intact
        retention_active = await self.verify_worm_retention(proof_id)

        # Check 3: Verify Merkle inclusion if batched
        merkle_valid = await self.verify_merkle_inclusion(proof)

        # Check 4: Offer replay verification
        replay_endpoint = f'/verify/{proof_id}/replay'

        return VerificationResult(
            proof_id=proof_id,
            signature_valid=signature_valid,
            worm_retention_active=retention_active,
            merkle_inclusion_valid=merkle_valid,
            replay_endpoint=replay_endpoint
        )
```

The replay endpoint deserves special attention because it's what makes deterministic policy evaluation valuable in practice.
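The Merkle inclusion check (Check 3 above) rests on a standard construction that can be sketched with `hashlib` alone. This is a simplified pairwise tree for illustration, not a hardened implementation; production trees would also domain-separate leaf and interior hashes, which is omitted here:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise-hash leaf digests up to a single root; an odd node is
    promoted unchanged to the next level."""
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = [_h(level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]

def inclusion_proof(leaves: list[bytes], index: int) -> list[tuple[str, bytes]]:
    """Sibling path ('L'/'R' plus digest) needed to recompute the root
    from a single leaf."""
    level = [_h(leaf) for leaf in leaves]
    path = []
    while len(level) > 1:
        sib = index ^ 1
        if sib < len(level):
            path.append(("L" if sib < index else "R", level[sib]))
        nxt = [_h(level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        index //= 2
        level = nxt
    return path

def verify_inclusion(leaf: bytes, path, root: bytes) -> bool:
    """Recompute the root from one leaf and its sibling path."""
    node = _h(leaf)
    for side, sibling in path:
        node = _h(sibling + node) if side == "L" else _h(node + sibling)
    return node == root
```

An auditor holding the signed root only needs the short sibling path for one decision, not the whole batch, which is what makes hourly or daily batches verifiable at scale.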
An auditor can call the replay endpoint with the original request hash and the policy version from the denial proof. The verification system retrieves the immutably stored policy snapshot, re-runs the policy evaluation, and confirms that the outcome is still DENY. If the replay produces a different result, that's a red flag that either the policy was mutated after the fact or the policy engine is non-deterministic, both of which undermine the integrity of your governance system.

The first four layers focus on enforcement and proof generation. Layer 5 is where you add the observability and analytics that make the governance system operationally manageable. This is where you aggregate decisions to build dashboards showing denial patterns, detect anomalies that might indicate policy gaps or system attacks, surface frequently denied rules that might need policy adjustment, and track compliance metrics for internal reporting.

Critically, Layer 5 is optional in the sense that the core governance enforcement works without it. You can have a fully functional pre-execution gate system with just Layers 1 through 4. Layer 5 adds operational visibility and helps you evolve policies over time, but it's not required for basic prevention and proof generation. This is an important architectural separation because it means you can start with enforcement-first and add analytics later as operational needs emerge.

The analytics layer operates on the same denial proofs and receipts that Layer 3 generates, but it processes them asynchronously after the fact rather than inline during request handling. This separation keeps the enforcement path fast and simple while allowing the analytics path to be as complex and slow as necessary. You might run machine learning models to detect unusual denial patterns, query external threat intelligence feeds to identify potentially malicious request sources, or generate compliance reports that require aggregating data across thousands of decisions.
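As a minimal illustration of this asynchronous processing, the sketch below aggregates denial proofs per rule over a recent window and flags unusually high volumes. The dict-shaped proof records, window, and threshold are assumptions for the example, not a prescribed schema:

```python
from collections import Counter
from datetime import datetime, timedelta

def denial_spikes(proofs, window=timedelta(hours=1), threshold=50):
    """Flag rules with an unusually high denial volume.

    `proofs` is an iterable of dicts with `rule_id` and `timestamp`
    keys; only proofs inside the recent window are counted, and rules
    at or above `threshold` denials are returned for triage.
    """
    cutoff = datetime.utcnow() - window
    counts = Counter(
        p["rule_id"] for p in proofs if p["timestamp"] >= cutoff
    )
    return {rule: n for rule, n in counts.items() if n >= threshold}
```

Because this runs offline against already-stored proofs, it can afford to be slow or heuristic without affecting the enforcement path's latency.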
One pattern that works well is using the analytics layer to detect when policies need updating. If you see a spike in denials for a particular rule, that might indicate a legitimate use case that your current policy doesn't account for. If you see a pattern of denials followed by successful requests with slightly modified parameters, that might indicate someone is probing your governance boundaries. The analytics layer surfaces these patterns so your security team can investigate and adjust policies as needed.

Now that we've built out the full five-layer architecture, it's worth stepping back and honestly assessing when you don't need all this complexity. Not every AI system requires pre-execution gates. If your compliance requirements focus on auditability and transparency rather than prevention, if you're operating in environments where the cost of a governance failure is low, or if you're in early-stage development where shipping velocity matters more than production hardening, receipts alone may be sufficient.

The decision tree is straightforward. If your regulatory framework uses prevention language—HIPAA's "prevent unauthorized access," PCI DSS's "prevent access beyond need-to-know," GDPR's "prevent processing beyond original purpose"—then you need pre-execution gates because receipts fundamentally cannot demonstrate prevention. But if your framework focuses on auditability and disclosure—demonstrating that you have policies, that you applied them consistently, that you can produce records on demand—then receipts provide the evidence you need without the architectural overhead of gates.

Similarly, if you're operating in regulated verticals where negative proofs matter—healthcare, financial services, government systems—pre-execution gates become table stakes because auditors will ask questions that only gates can answer.
But if you're running internal analytics tools used by trusted operators in controlled environments, the prevention requirement is less stringent and the detection that receipts provide may be adequate.

The other consideration is operational maturity. Pre-execution gates require that your policies be well-defined, deterministic, and tested before you enable enforcement mode. If you're still figuring out what your governance policies should be, starting with receipt-based observability while you iterate on policy design makes more sense than trying to enforce policies that might change dramatically as you learn more about your system's actual behavior.

If you've made it this far through the series, you understand the core architectural patterns for building prevention-first AI governance. You know why signed receipts alone can't solve the negative proof problem, how pre-execution gates create denial proofs that demonstrate prevention, what deterministic policy evaluation means and why it matters, and how to structure a complete governance stack across five architectural layers.

The hard part isn't understanding these patterns—it's implementing them in your specific environment with your specific constraints and requirements. Every organization has legacy systems to integrate with, existing security controls that need to interoperate with the governance layer, and operational teams whose workflows change when you add mandatory governance gates.

The approach that tends to work is starting with Layer 1 and Layer 2 in observer mode. Build the execution router and policy engine, but configure them to always return ALLOW while logging what the decision would have been if enforcement was enabled. This lets you validate that your policies are working correctly, that performance is acceptable, and that you're not about to break production workflows.
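Observer mode can be as small as a wrapper that defers to the real engine but rewrites DENY to ALLOW while logging the would-be verdict. A hedged sketch, where the `Decision` shape and pipeline interface are simplified stand-ins for whatever your engine actually returns:

```python
import logging
from collections import namedtuple

logger = logging.getLogger("governance.observer")

# Minimal stand-in for the policy engine's decision object.
Decision = namedtuple("Decision", "verdict rule_id policy_version")

class ObserverModePipeline:
    """Wrap a real policy engine without ever blocking traffic.

    DENY verdicts are logged as would-deny telemetry while the caller
    receives ALLOW, so policies can be validated against live traffic
    before enforcement is switched on, surface by surface.
    """

    def __init__(self, engine, enforce=False):
        self.engine = engine
        self.enforce = enforce

    def evaluate(self, request, context):
        decision = self.engine.evaluate(request, context)
        if decision.verdict == "DENY" and not self.enforce:
            # Record what enforcement would have done, then allow.
            logger.info("would_deny rule=%s policy=%s",
                        decision.rule_id, decision.policy_version)
            return decision._replace(verdict="ALLOW")
        return decision
```

Flipping `enforce=True` per tenant or folder is one way to realize the selective rollout described next, without touching the policy engine itself.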
Once you have confidence in observer mode, you can start enabling selective enforcement on high-risk surfaces where the security benefit justifies the risk of blocking something incorrectly. From there, you add Layers 3 and 4 to start generating verifiable denial proofs and providing independent verification endpoints. Finally, Layer 5 gives you the operational visibility to maintain and evolve the system over time. This incremental rollout reduces risk while letting you build the governance capabilities you need for compliance.

The AI governance landscape is maturing rapidly. What started as optional nice-to-have tooling is becoming mandatory infrastructure as AI systems move into regulated production environments. Auditors are asking harder questions, regulators are writing more specific requirements, and the organizations that solve prevention-first governance early will have a significant advantage over those still relying on detection-only approaches.

If you're building AI systems that handle sensitive data, operate in regulated industries, or face compliance requirements with prevention language, the time to start thinking about pre-execution governance architecture is now. The patterns are well-understood, the implementation approaches are proven, and the compliance benefits are clear. What's needed is the commitment to build governance as infrastructure rather than treating it as an afterthought.

Read Part 1: The Negative Proof Problem in AI Governance

Read Part 2: Pre-Execution Gates: How to Block Before You Execute