dev.to, March 7, 2026

Deploy a Production AI Platform on AWS for $100/month

Tags: aws, ai-platform, lambda-orchestration, cdk

From seven broken Lambda functions to a production AI platform in 8 articles. That's the journey we've taken together. Functions that couldn't communicate, hit timeout walls, and left users staring at loading spinners. Now you get a complete platform that orchestrates complex workflows, streams real-time updates, and won't bankrupt your startup.

This isn't a toy example. The architecture I'm about to show you serves 1,500+ requests daily, has survived 8 months in production, and handles everything from document analysis to multi-step research tasks.

Time to deploy it.

Before we dive into deployment, here's what we're building:

- API Gateway: receives requests, handles auth, enforces rate limits
- Gateway Lambda: validates requests, checks budgets, routes to the appropriate service
- ECS Agents: orchestrate multi-step workflows using Lambda tools
- Lambda Tools: perform specific AI tasks (summarize, extract, classify)
- DynamoDB: tracks usage, manages budgets, stores user data
- WebSocket: streams real-time updates back to clients

First, let's set up the deployment environment:

```bash
# Install AWS CDK
npm install -g aws-cdk

# Clone the platform
git clone https://github.com/tysoncung/ai-platform-aws.git
cd ai-platform-aws

# Install dependencies
npm install
npm run install:all  # Installs in all packages

# Bootstrap CDK (one time per account/region)
npx cdk bootstrap

# Create environment file
cp .env.example .env
```

Edit .env with your configuration:

```bash
# AWS Configuration
AWS_REGION=us-east-1
AWS_ACCOUNT_ID=123456789012

# AI Provider API Keys
OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=sk-ant-your-anthropic-key

# Platform Configuration
PLATFORM_ENVIRONMENT=production
COST_TRACKING_ENABLED=true
BUDGET_ALERTS_ENABLED=true

# Monitoring
SLACK_WEBHOOK_URL=https://hooks.slack.com/your-webhook
ALERT_EMAIL=you@company.com

# Security
JWT_SECRET_KEY=your-super-secret-jwt-key
ENCRYPTION_SALT=your-encryption-salt
```

Before deploying to AWS, let's run everything locally with Docker Compose:
```yaml
# docker-compose.yml
version: '3.8'

services:
  api-gateway:
    build:
      context: ./packages/gateway
      dockerfile: Dockerfile.dev
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=development
      - DYNAMODB_ENDPOINT=http://dynamodb:8000
      - AGENT_ENDPOINT=http://agent:3001
    depends_on:
      - dynamodb
      - agent

  agent:
    build:
      context: ./packages/agents
      dockerfile: Dockerfile.dev
    ports:
      - "3001:3001"
    environment:
      - NODE_ENV=development
      - LAMBDA_ENDPOINT=http://lambda-tools:3002
    depends_on:
      - lambda-tools

  lambda-tools:
    build:
      context: ./packages/tools
      dockerfile: Dockerfile.dev
    ports:
      - "3002:3002"
    environment:
      - NODE_ENV=development
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}

  dynamodb:
    image: amazon/dynamodb-local:latest
    ports:
      - "8000:8000"
    command: ["-jar", "DynamoDBLocal.jar", "-sharedDb", "-inMemory"]

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
```

Start the local environment:

```bash
# Start all services
docker-compose up -d

# Run database migrations
npm run db:migrate:local

# Seed with sample data
npm run db:seed:local

# Test the platform
curl http://localhost:3000/health
```

The platform is composed of multiple CDK stacks for better separation of concerns:

```typescript
// bin/deploy.ts
import * as cdk from 'aws-cdk-lib';
import { AIGatewayStack } from '../lib/gateway-stack';
import { AIAgentsStack } from '../lib/agents-stack';
import { AIToolsStack } from '../lib/tools-stack';
import { AIMonitoringStack } from '../lib/monitoring-stack';
import { AISecurityStack } from '../lib/security-stack';

const app = new cdk.App();

const env = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION
};

// Security layer (VPC, IAM, KMS)
const securityStack = new AISecurityStack(app, 'AISecurityStack', { env });

// Lambda tools layer
const toolsStack = new AIToolsStack(app, 'AIToolsStack', {
  env,
  vpc: securityStack.vpc,
  securityGroup: securityStack.lambdaSecurityGroup
});

// ECS agents layer
const agentsStack = new AIAgentsStack(app, 'AIAgentsStack', {
  env,
  vpc: securityStack.vpc,
  securityGroup: securityStack.ecsSecurityGroup,
  toolsArns: toolsStack.functionArns
});

// API Gateway layer
const gatewayStack = new AIGatewayStack(app, 'AIGatewayStack', {
  env,
  agentsCluster: agentsStack.cluster,
  agentsService: agentsStack.service,
  toolsArns: toolsStack.functionArns
});

// Monitoring and alerting
new AIMonitoringStack(app, 'AIMonitoringStack', {
  env,
  gatewayApi: gatewayStack.api,
  agentsService: agentsStack.service,
  toolsFunctions: toolsStack.functions
});
```

Here's the gateway stack implementation:

```typescript
// lib/gateway-stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as apigatewayv2 from 'aws-cdk-lib/aws-apigatewayv2';
import * as apigatewayv2integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Assumed shape of the props passed in from bin/deploy.ts;
// the exact types in the repository may differ.
export interface AIGatewayStackProps extends cdk.StackProps {
  agentsCluster: ecs.ICluster;
  agentsService: ecs.IBaseService;
  toolsArns: string[];
}

export class AIGatewayStack extends cdk.Stack {
  public readonly api: apigateway.RestApi;

  constructor(scope: Construct, id: string, props: AIGatewayStackProps) {
    super(scope, id, props);

    // DynamoDB tables (on-demand billing is BillingMode.PAY_PER_REQUEST in CDK)
    const usageTable = new dynamodb.Table(this, 'UsageTable', {
      tableName: 'ai-platform-usage',
      partitionKey: { name: 'userId', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'timestamp', type: dynamodb.AttributeType.NUMBER },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      timeToLiveAttribute: 'ttl'
    });

    const budgetTable = new dynamodb.Table(this, 'BudgetTable', {
      tableName: 'ai-platform-budgets',
      partitionKey: { name: 'userId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST
    });

    // Gateway Lambda function
    const gatewayFunction = new lambda.Function(this, 'GatewayFunction', {
      runtime: lambda.Runtime.NODEJS_20_X, // a currently supported Node.js runtime
      code: lambda.Code.fromAsset('packages/gateway/dist'),
      handler: 'index.handler',
      timeout: cdk.Duration.seconds(30),
      memorySize: 512,
      environment: {
        USAGE_TABLE_NAME: usageTable.tableName,
        BUDGET_TABLE_NAME: budgetTable.tableName,
        AGENTS_CLUSTER_ARN: props.agentsCluster.clusterArn,
        AGENTS_SERVICE_ARN: props.agentsService.serviceArn,
        TOOLS_ARNS: JSON.stringify(props.toolsArns)
      }
    });

    // Grant permissions
    usageTable.grantReadWriteData(gatewayFunction);
    budgetTable.grantReadWriteData(gatewayFunction);

    // API Gateway
    this.api = new apigateway.RestApi(this, 'AIApi', {
      restApiName: 'AI Platform API',
      description: 'AI Platform REST API',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization']
      }
    });

    // API Gateway integration
    const lambdaIntegration = new apigateway.LambdaIntegration(gatewayFunction);

    // Routes
    const v1 = this.api.root.addResource('v1');
    v1.addResource('complete').addMethod('POST', lambdaIntegration);
    v1.addResource('embed').addMethod('POST', lambdaIntegration);
    v1.addResource('stream').addMethod('POST', lambdaIntegration);

    const agents = v1.addResource('agents');
    agents.addResource('run').addMethod('POST', lambdaIntegration);
    agents.addResource('stream').addMethod('POST', lambdaIntegration);

    // Usage and budget endpoints
    const usage = v1.addResource('usage');
    usage.addMethod('GET', lambdaIntegration); // Get usage stats

    // Create the 'budget' resource once; adding it twice fails at synth time
    const budget = usage.addResource('budget');
    budget.addMethod('GET', lambdaIntegration);
    budget.addMethod('PUT', lambdaIntegration);

    // WebSocket API for streaming
    const webSocketApi = new apigatewayv2.WebSocketApi(this, 'StreamingAPI', {
      apiName: 'AI Platform Streaming',
      connectRouteOptions: {
        integration: new apigatewayv2integrations.WebSocketLambdaIntegration(
          'ConnectIntegration',
          gatewayFunction
        )
      },
      disconnectRouteOptions: {
        integration: new apigatewayv2integrations.WebSocketLambdaIntegration(
          'DisconnectIntegration',
          gatewayFunction
        )
      },
      defaultRouteOptions: {
        integration: new apigatewayv2integrations.WebSocketLambdaIntegration(
          'DefaultIntegration',
          gatewayFunction
        )
      }
    });

    new apigatewayv2.WebSocketStage(this, 'StreamingStage', {
      webSocketApi,
      stageName: 'prod',
      autoDeploy: true
    });
  }
}
```

Now let's deploy everything:

```bash
# 1. Validate CDK configuration
npx cdk doctor

# 2. Review what will be deployed
npx cdk diff

# 3. Deploy security stack first
npx cdk deploy AISecurityStack

# 4. Deploy Lambda tools
npx cdk deploy AIToolsStack

# 5. Deploy ECS agents
npx cdk deploy AIAgentsStack

# 6. Deploy API Gateway
npx cdk deploy AIGatewayStack

# 7. Deploy monitoring
npx cdk deploy AIMonitoringStack

# Or deploy everything at once
npx cdk deploy --all
```

The deployment takes about 15 minutes. You'll see output like:

```text
AIGatewayStack.APIEndpoint = https://abc123.execute-api.us-east-1.amazonaws.com/v1
AIGatewayStack.WebSocketEndpoint = wss://def456.execute-api.us-east-1.amazonaws.com/prod
AIAgentsStack.ClusterName = ai-platform-agents
AIToolsStack.SummarizeFunctionArn = arn:aws:lambda:us-east-1:123456789012:function:summarize
```

Once deployed, configure your AI provider credentials:

```bash
# Store API keys in AWS Systems Manager
aws ssm put-parameter \
  --name "/ai-platform/openai-api-key" \
  --value "sk-your-openai-key" \
  --type "SecureString"

aws ssm put-parameter \
  --name "/ai-platform/anthropic-api-key" \
  --value "sk-ant-your-anthropic-key" \
  --type "SecureString"

# Update the deployed functions with the new parameter names
npx cdk deploy AIToolsStack AIGatewayStack
```

Let's test the complete platform:

```bash
# 1. Health check
curl https://your-api-endpoint.execute-api.us-east-1.amazonaws.com/v1/health

# 2. Create an API key
curl -X POST https://your-api-endpoint/v1/auth/keys \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Test Key",
    "scopes": ["ai:complete", "ai:embed", "agent:run"],
    "monthlyBudget": 50
  }'
# Returns: {"apiKey": "sk-proj-abc123...", "keyId": "sk-proj-abc"}

# 3. Test completion
curl -X POST https://your-api-endpoint/v1/complete \
  -H "Authorization: Bearer sk-proj-abc123..." \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a haiku about TypeScript"}],
    "model": "gpt-4",
    "temperature": 0.8
  }'

# 4. Test agent workflow
curl -X POST https://your-api-endpoint/v1/agents/run \
  -H "Authorization: Bearer sk-proj-abc123..." \
  -H "Content-Type: application/json" \
  -d '{
    "type": "research",
    "input": {"topic": "renewable energy trends"},
    "tools": ["search", "summarize", "extract"]
  }'
```

The platform includes a built-in dashboard at /dashboard. Here's what you'll see:

Usage Overview:

- Requests per day/hour
- Token consumption by model
- Cost breakdown by user
- Success/error rates

Real-time Monitoring:

- Active agent sessions
- Queue depth for tools
- Response time percentiles
- Error alerts

Budget Management:

- Per-user spend tracking
- Budget utilization alerts
- Cost projections
- BYOK vs platform credit usage

System Health:

- Lambda cold start metrics
- ECS task utilization
- DynamoDB performance
- API Gateway latency

You can access it at: https://your-api-endpoint/dashboard

Here are the real metrics from 8 months running in production:

Latency (P95):

- Simple completion: 1.2s
- Streaming completion: 180ms to first token
- Agent workflow (3 tools): 12s
- API Gateway overhead: 45ms
- Lambda cold start: 850ms (mitigated with provisioned concurrency)

Throughput:

- Sustained: 50 requests/second
- Burst: 200 requests/second (before rate limiting)
- Agent concurrency: 15 parallel workflows
- Tool execution: 100 parallel Lambda invocations

Reliability:

- Uptime: 99.8%
- Error rate: 0.4%
- P99 latency SLA: 5s (met 98.9% of the time)
- Budget enforcement accuracy: 99.99%

Cost Optimization Wins:

- Response caching: 25% reduction in API calls
- Smart model selection: 40% cost reduction (Claude Haiku for summaries)
- BYOK adoption: 70% of users, eliminating platform AI costs
- Lambda right-sizing: 30% reduction in compute costs

Fixed Infrastructure (Monthly):

- API Gateway: $3.50 (1M requests)
- Lambda (Gateway): $8.20 (compute + requests)
- ECS Fargate: $15.40 (2 tasks avg)
- DynamoDB: $6.80 (usage + budgets)
- Application Load Balancer: $16.20
- NAT Gateway: $45.00 (data transfer)
- CloudWatch: $4.30 (logs + metrics)
- Route 53: $0.50 (hosted zone)
- Total Fixed: $99.90/month

Variable Costs:

- AI API costs: pass-through with 2% platform markup
- Data transfer: $0.09/GB out of AWS
- Lambda executions: $0.20 per million requests
- DynamoDB reads/writes: $0.25 per million operations

Real customer costs (excluding AI API):

- Light usage (500 req/month): $12/month
- Medium usage (5K req/month): $35/month
- Heavy usage (50K req/month): $120/month

The platform is cost-effective for most use cases. The break-even point vs building your own infrastructure is around 2,000 requests per month.

Lambda cold starts were killing our performance. Here's how we solved it:

```typescript
// Reserved concurrency for critical functions
const gatewayFunction = new lambda.Function(this, 'GatewayFunction', {
  // ... other config
  reservedConcurrentExecutions: 10
});

// Provisioned concurrency is configured on a version or alias, not on the
// function itself. Traffic must be routed to this alias to benefit from it.
new lambda.Alias(this, 'GatewayLiveAlias', {
  aliasName: 'live',
  version: gatewayFunction.currentVersion,
  provisionedConcurrentExecutions: 5
});

// Keep-warm rule that pings the Lambda every 5 minutes
// (events = aws-cdk-lib/aws-events, targets = aws-cdk-lib/aws-events-targets)
new events.Rule(this, 'KeepWarmRule', {
  schedule: events.Schedule.rate(cdk.Duration.minutes(5)),
  targets: [
    new targets.LambdaFunction(gatewayFunction, {
      event: events.RuleTargetInput.fromObject({ warmup: true })
    })
  ]
});

// In Lambda handler - respond quickly to warmup
export const handler = async (event: any) => {
  if (event.warmup) {
    return { statusCode: 200, body: 'warm' };
  }
  // Normal processing...
};
```

Result: Cold start rate dropped from 23% to 3% of requests.

This platform is completely open source. Here's what's coming next:

Q2 2026:

- [ ] Multi-region deployment support
- [ ] GraphQL API alongside REST
- [ ] Built-in vector database (Pinecone integration)
- [ ] Advanced agent memory management

Q3 2026:

- [ ] Kubernetes support (alternative to ECS)
- [ ] Multi-tenant isolation improvements
- [ ] Advanced cost optimization (spot instances)
- [ ] Plugin system for custom tools

Q4 2026:

- [ ] Edge deployment (Cloudflare Workers)
- [ ] Real-time collaboration features
- [ ] Advanced monitoring and observability
- [ ] Enterprise SSO integration

Community Requests:

- Google Cloud and Azure support
- Terraform modules (alternative to CDK)
- Python SDK alongside TypeScript
- Zapier/Make.com integrations

The entire platform is open source under the MIT license.
Everything I've built, you can use, modify, and improve.

Repositories:

- Main platform: github.com/tysoncung/ai-platform-aws
- Working examples: github.com/tysoncung/ai-platform-aws-examples

How to help:

- Star the repositories - helps others discover the project
- Try the full deployment - example 07-full-stack has everything
- Report deployment issues - especially AWS region differences
- Submit improvements - see CONTRIBUTING.md for guidelines
- Share your experience - what are you building with it?

Connect:

- Email: tyson@hivo.co
- Twitter: @tysoncung

What We Built Together

Eight articles. One complete AI platform.

We started with seven broken Lambda functions. We built:

- Agent orchestration that handles complex multi-step workflows without timeouts
- TypeScript SDK with perfect IntelliSense, streaming support, and smart error handling
- Cost control that prevents $2,847 surprises with budgets and rate limits
- Production security with authentication, encryption, and monitoring
- One-command deployment that gets you running in under an hour

The platform serves 1,500+ requests daily. It's survived 8 months in production. It's processing everything from document analysis to research workflows. And it's completely open source.

Building production AI infrastructure taught me things tutorials never mention:

Technical truths:

- Cost control is life support, not a nice-to-have feature
- Lambda excels at tools, fails at orchestration
- Streaming looks simple, implementation is brutal
- Type safety prevents expensive mistakes at 3AM

Business realities:

- Developers pay for a great experience, abandon bad APIs
- Open source builds trust better than marketing
- Production numbers matter more than perfect demos
- Failure stories teach more than success posts

Personal discoveries:

- Building in public creates accountability
- Documentation is your product's face
- Shipping beats perfecting every time
- Sharing mistakes helps everyone improve

You have everything you need. Real code, real examples, real production lessons.
The platform is MIT licensed - use it, improve it, make money with it.

Next steps:

1. Star the repos - ai-platform-aws and examples
2. Deploy example 07 - full platform in under an hour
3. Build something cool - then tell me about it
4. Share your experience - help others learn from your journey

Get stuck? Email me at tyson@hivo.co or find me on Twitter @tysoncung.

The AI revolution needs better infrastructure. You can build it. Go.

End of series: "Building an AI Platform on AWS from Scratch". Complete platform and examples at github.com/tysoncung/ai-platform-aws.
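
A quick arithmetic check on the fixed infrastructure figures listed earlier in this article: the sketch below adds up the eight monthly line items and shows how much of the bill one item accounts for. It is an illustration only, not code from the repository; the names `fixedCostsCents`, `totalCents`, `formatUsd`, and `sharePercent` are made up for this example, and amounts are held in cents so the sum is exact.

```typescript
// Fixed monthly line items from the cost breakdown, stored in US cents
// so that addition is exact (no floating-point rounding).
// All names in this sketch are illustrative, not part of the platform.
const fixedCostsCents: Record<string, number> = {
  "API Gateway": 350,
  "Lambda (Gateway)": 820,
  "ECS Fargate": 1540,
  "DynamoDB": 680,
  "Application Load Balancer": 1620,
  "NAT Gateway": 4500,
  "CloudWatch": 430,
  "Route 53": 50,
};

// Sum every line item.
function totalCents(items: Record<string, number>): number {
  return Object.values(items).reduce((sum, cents) => sum + cents, 0);
}

// Render a cent amount as a dollar string with two decimals.
function formatUsd(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

// Share of the fixed bill taken by one line item, rounded to a whole percent.
function sharePercent(items: Record<string, number>, name: string): number {
  return Math.round((items[name] / totalCents(items)) * 100);
}

console.log(formatUsd(totalCents(fixedCostsCents)));      // $99.90
console.log(sharePercent(fixedCostsCents, "NAT Gateway")); // 45
```

The line items do add up to the quoted $99.90, and the NAT Gateway alone is roughly 45% of it, which makes it the first place to look when trimming the fixed bill.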