dev_to, March 14, 2026

How I Built a Production-Grade DevOps Project From Scratch

Translated: 2026/3/14 13:00:26

devops, ci-cd, aws, terraform, containerization


Original Content

A walkthrough of building a real CI/CD pipeline, AWS infrastructure, and containerised app, the way it's actually done in production.

Most DevOps tutorials show you how to deploy a "Hello World" app to a single EC2 instance with hardcoded AWS keys. That's fine for learning the basics, but it doesn't reflect what production engineering actually looks like. I wanted to build something I could point to and say: this is how I would do it at a real company. No shortcuts, no tutorial hand-holding.

The result: a fully automated pipeline that takes code from a git push to a live HTTPS endpoint on AWS, with security scanning, infrastructure as code, observability, and zero static credentials anywhere.

Live URL: https://tasks.therealblessing.com
GitHub: github.com/nanafilbert/cicd-aws-terraform-deploy

A Node.js task manager API with a Kanban dashboard UI, deployed to AWS behind an Application Load Balancer with:

- 8-stage GitHub Actions CI/CD pipeline
- OIDC keyless AWS authentication, no static credentials
- Modular Terraform: VPC, ALB, ASG, EC2, all in reusable modules
- Multi-stage Docker build: tests run inside the build, so broken images can't be pushed
- Trivy CVE scanning: the pipeline fails on HIGH/CRITICAL vulnerabilities
- ACM SSL certificate with custom domain
- Prometheus + Grafana observability stack, run locally
- ASG instance refresh for zero-downtime deployments

    GitHub Actions (OIDC) → Docker Hub → AWS
                                          │
                                   ALB (HTTPS:443)
                                   ACM Certificate
                                   HTTP → HTTPS redirect
                                          │
                              Auto Scaling Group (EC2 t3.small)
                                          │
                                   Docker Container
                                   Node.js API :3000

The pipeline authenticates to AWS using OIDC short-lived tokens, with no AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY anywhere. Every deploy triggers an ASG instance refresh that replaces EC2 instances with fresh ones pulling the latest image.

The pipeline has eight stages, every one intentional:

1. ESLint checks code quality. Fails fast before any expensive steps.
2. Jest runs 19 integration tests with coverage enforced at 80%. If tests fail, nothing gets built or deployed.
3. Trivy scans the filesystem for known CVEs in dependencies. Fails the pipeline on any HIGH or CRITICAL unfixed vulnerability. This caught real issues during development: Alpine CVEs and npm transitive dependency vulnerabilities that needed pinning.
4. Multi-stage Docker build, in three stages:
   - deps: installs only production dependencies
   - test: runs Jest inside the build process
   - production: minimal Alpine 3.21, non-root user, only what's needed to run

   The test stage is critical. If your tests fail, the image doesn't get built. You physically cannot push a broken image.
5. terraform plan runs and saves the plan as an artifact. This is what gets applied in the next stage: no variables re-injected, no drift between plan and apply.
6. terraform apply runs using the saved plan, followed immediately by an explicit ASG instance refresh:

       aws autoscaling start-instance-refresh \
         --auto-scaling-group-name "$ASG_NAME" \
         --preferences '{"MinHealthyPercentage": 50, "InstanceWarmup": 60}'

   This is what actually gets new code onto the instances. Without triggering the refresh, the ASG would keep running the old image indefinitely.
7. The pipeline polls /health/ready for up to 6 minutes after deploy. If the app never becomes healthy, the pipeline fails and you know immediately.
8. A pass/fail table is written to the GitHub Actions job summary. Clean, visible at a glance.

The Terraform code is split into three independent modules:

- networking: VPC, public subnets across two AZs, internet gateway, route tables.
- security: security groups. The ALB accepts traffic from anywhere on 80 and 443. The app security group only accepts traffic from the ALB security group, so EC2 instances are never directly reachable from the internet.
- compute: ALB with HTTP redirect to HTTPS and an HTTPS listener with ACM certificate, launch template with IMDSv2 required, ASG with rolling instance refresh, IAM role for EC2 with SSM access.
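The post-deploy health check described above (polling /health/ready for up to 6 minutes) could look roughly like the sketch below. This is not the project's actual pipeline code. The function name `waitForHealthy`, the 10-second polling interval, and the injectable `fetchFn` and `sleepFn` options are all assumptions made for illustration. It counts attempts instead of measuring wall-clock time, which keeps the behaviour easy to test.

```javascript
// Sketch only: poll a readiness endpoint until it answers with a 2xx status
// or the attempt budget runs out. All names and defaults here are hypothetical.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForHealthy(url, options = {}) {
  const {
    timeoutMs = 6 * 60 * 1000,    // the article's 6-minute budget
    intervalMs = 10 * 1000,       // assumed polling interval
    fetchFn = globalThis.fetch,   // global fetch is available in Node 18+
    sleepFn = sleep,              // injectable so tests do not really wait
  } = options;

  // Budget expressed as a number of attempts, not elapsed wall-clock time.
  const maxAttempts = Math.max(1, Math.ceil(timeoutMs / intervalMs));

  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const res = await fetchFn(url);
      if (res.ok) return { healthy: true, attempts: attempt };
    } catch (err) {
      // Connection refused or DNS failure: the new instance is probably
      // still booting, so treat it like a not-ready response and retry.
    }
    if (attempt < maxAttempts) await sleepFn(intervalMs);
  }
  return { healthy: false, attempts: maxAttempts };
}
```

A pipeline step could call `waitForHealthy("https://tasks.therealblessing.com/health/ready")` and exit with a non-zero status when `healthy` is false, which is what makes the job fail visibly.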
A bootstrap/ folder handles the one-time setup that must exist before the pipeline can run: the OIDC provider, IAM role, S3 state bucket, and DynamoDB lock table. Remote state in S3 with DynamoDB locking means the pipeline and local Terraform commands cannot modify the state at the same time.

Authenticating with OIDC instead of stored keys was the most important decision in the project. The traditional approach is to create an IAM user, generate access keys, and store them as GitHub secrets. This works but creates long-lived credentials that can be leaked, rotated incorrectly, or forgotten.

OIDC works differently. GitHub Actions requests a short-lived token from GitHub's OIDC provider. AWS verifies that token against a trust policy and issues temporary credentials. The whole exchange happens in seconds, and the credentials expire on their own once their short session lifetime runs out.

    # Allow the job to request an OIDC token
    permissions:
      id-token: write
      contents: read

    # Step inside the job
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
        aws-region: ${{ secrets.AWS_REGION }}

The trust policy on the IAM role restricts assumption to this specific GitHub repo only:

    "Condition": {
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:nanafilbert/cicd-aws-terraform-deploy:*"
      }
    }

No static credentials. Nothing to rotate. Nothing to leak.

The Docker permission bug: the production container was crashing with Cannot find module '/app/src/app.js' even though the file clearly existed in the image. It took a while to figure out: I had set chmod -R 550 on the app directory. Mode 550 gives read and execute to the owner and group but nothing at all to other users, and a process needs the execute bit on every directory in a path to traverse it. Presumably the container's non-root user fell outside that owner and group, so Node.js could not reach the file. Changed to 755 and it worked immediately. The lesson: file permission bugs are silent and confusing, so always verify what your non-root user can actually access.

The HSTS loop: after adding HTTPS, all API calls from the browser were being upgraded to HTTPS even when I explicitly typed http://.
Helmet's default configuration sets a Strict-Transport-Security header, which tells browsers to remember to always use HTTPS for this origin. Even clearing the cache wasn't enough. I had to explicitly clear the HSTS policy in Chrome at chrome://net-internals/#hsts and disable the header in Helmet for the HTTP-only ALB endpoint.

The instance refresh gap: after every deploy the new Docker image was pushed to Docker Hub, but the EC2 instance kept running the old one. Terraform saw no infrastructure changes, so it didn't replace anything. The fix was to explicitly trigger an ASG instance refresh in the pipeline after every apply. Without that step, automation is an illusion: you're just pushing images that never get deployed.

The Terraform state lock: a failed pipeline run left a lock on the state file. Subsequent runs couldn't acquire the lock and failed immediately. I learned that terraform force-unlock -force <LOCK_ID>, run from the correct working directory, resolves this, and added auto-unlock logic to the plan job for future failures.

The app exposes Prometheus metrics via prom-client:

    const promClient = require("prom-client");

    // Registry holding the metrics (assumed; the original snippet
    // does not show where `register` is defined)
    const register = new promClient.Registry();
    promClient.collectDefaultMetrics({ register });

    // `app` is the existing Express application
    app.get("/health/metrics", async (req, res) => {
      res.set("Content-Type", register.contentType);
      res.send(await register.metrics());
    });

Locally, docker-compose up starts the full stack: app, nginx, Prometheus, and Grafana. Prometheus scrapes /health/metrics every 15 seconds. Grafana visualizes CPU usage, heap memory, event loop lag, and active handles in real time. Running a load test against the local API makes the graphs spike visibly, which is useful for demonstrating the observability story to anyone reviewing the project.

What I would add next:

- RDS PostgreSQL: tasks currently live in memory and reset on deploy. A real database would make the app itself production-ready, not just the pipeline.
- CloudWatch alarms: alert on unhealthy host count and high CPU before users notice.
- WAF: a Web Application Firewall in front of the ALB for rate limiting and bot protection at the infrastructure level.

The most valuable part of this project wasn't the technology, it was the debugging. Every bug I hit taught me something real: how file permissions work in containers, how browsers cache security policies, how Terraform state locking works, how ASG instance refresh interacts with deploy automation.

That's the difference between following a tutorial and building something yourself. The tutorial gives you the happy path. Building it yourself gives you everything else.

If you're building a DevOps portfolio, don't copy a tutorial. Pick a problem, build something real, and let it break. That's where the learning actually happens.

The full source code is at github.com/nanafilbert/cicd-aws-terraform-deploy and the live app is running at https://tasks.therealblessing.com.
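The file-permission lesson mentioned above can be made concrete with a small model of the classic Unix rule for directory traversal. This is a simplified sketch, not code from the project. It ignores root, ACLs, and capabilities. The function names, the uid/gid values, and the assumption that the app directory was owned by root are all hypothetical, since the article does not state the ownership.

```javascript
// Sketch only: which permission class applies to a user, and whether that
// class has the execute ("search") bit needed to traverse a directory.
// Unix consults exactly one class: owner first, then group, then other.

function permissionClass(entry, user) {
  if (user.uid === entry.uid) return "owner";
  if (user.gids.includes(entry.gid)) return "group";
  return "other";
}

function canTraverse(entry, user) {
  const shift = { owner: 6, group: 3, other: 0 }[permissionClass(entry, user)];
  const bits = (entry.mode >> shift) & 0o7; // rwx bits for the matching class
  return (bits & 0o1) !== 0;                // x on a directory means "may enter"
}

// Hypothetical setup: directory owned by root:root, app runs as uid 1001.
const appUser = { uid: 1001, gids: [1001] };
const lockedDown = { uid: 0, gid: 0, mode: 0o550 };
const worldReadable = { uid: 0, gid: 0, mode: 0o755 };

console.log(canTraverse(lockedDown, appUser));    // false
console.log(canTraverse(worldReadable, appUser)); // true
```

Under these assumptions, mode 550 locks the non-root user out entirely while 755 lets it in, which matches the symptom in the article: the file exists in the image, but the process cannot reach it.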