dev_to · 25 April 2026

How to Fine-Tune Claude on Amazon Bedrock for Your Domain (Complete Guide with Code)

amazon-bedrock, claude-haiku, model-finetuning, startups-ml, llm-cost-optimization

Dataset prep, Bedrock setup, training configuration, evaluation and deployment, with real cost estimates for startups without ML teams.

Let me tell you when fine-tuning is actually the right answer. Most of the time it isn't. A well-crafted system prompt with good examples handles 80% of domain adaptation problems faster, cheaper and with less operational overhead than fine-tuning. I'll come back to this at the end because it's genuinely important and most tutorials skip it.

But there's a specific category of problem where fine-tuning earns its complexity: when you need a consistent output format that prompting alone can't reliably produce, when you're running high-volume inference where per-token costs compound, when your domain has terminology or reasoning patterns so specialised that few-shot examples don't transfer well, or when latency from long system prompts is measurably affecting your product experience. If you're in that category, this guide gets you from zero to a deployed fine-tuned Claude model on Amazon Bedrock, with working code throughout.

What Amazon Bedrock Fine-Tuning Actually Is

Bedrock's fine-tuning is model customisation: you're taking Anthropic's base Claude model and continuing its training on your domain-specific data. The result is a custom model variant that lives in your AWS account, responds to the same API you're already using and handles your specific use case with more consistency than the base model on the same prompts.

The key constraint: Bedrock fine-tuning uses the Claude models Anthropic makes available for customisation, which at the time of writing is Claude Haiku. The capability is narrower than you might expect from the marketing: you're adapting behaviour and format consistency, not teaching the model fundamentally new knowledge. If you need the model to reason differently, fine-tuning helps. If you need it to know things that aren't in its training data, you need RAG, not fine-tuning.

Prerequisites

Before the code:

- AWS account with Bedrock access enabled in your target region (us-east-1 or us-west-2 for Bedrock availability)
- IAM role with Bedrock full access and S3 read/write permissions
- Python 3.9+
- An S3 bucket for your training data and model artefacts
- Training dataset (we'll build one)

```bash
pip install boto3 pandas jsonlines scikit-learn tqdm
```

```python
import boto3
import json
import pandas as pd
import jsonlines
from pathlib import Path

# Configure your AWS session
session = boto3.Session(
    region_name='us-east-1'  # Confirm Bedrock availability in your region
)

bedrock = session.client('bedrock')
bedrock_runtime = session.client('bedrock-runtime')
s3 = session.client('s3')

BUCKET_NAME = "your-fine-tuning-bucket"
MODEL_ID = "anthropic.claude-haiku-20240307-v1:0"
```
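Worth a quick sanity check at this point: confirm that your region actually exposes a base model you can customise. The snippet below is a minimal sketch using the `bedrock` client configured above; the `byCustomizationType` filter restricts the listing to models that support fine-tuning.

```python
# Optional sanity check: list base models that support fine-tuning
# in this region, using the `bedrock` client created above.
response = bedrock.list_foundation_models(byCustomizationType="FINE_TUNING")

for summary in response.get("modelSummaries", []):
    print(summary["modelId"])

# If the Haiku model you plan to customise isn't listed, check your
# region and that model access has been granted in the Bedrock console.
```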
Step 1: Dataset Preparation

This is where most fine-tuning projects succeed or fail. The model learns what you show it; garbage in, garbage out is nowhere more true than in fine-tuning.

Bedrock's fine-tuning expects data in a specific JSONL format. Each line is a complete training example with a prompt and the ideal completion.

```python
# Each training example must follow this structure
example = {
    "prompt": "Your input prompt here",
    "completion": "The ideal output you want the model to produce"
}
```

For a domain adaptation use case (let's say we're fine-tuning for a legal document summarisation task), your data preparation looks like this:

```python
class DatasetPreparator:
    def __init__(self, output_path: str):
        self.output_path = Path(output_path)
        self.output_path.mkdir(parents=True, exist_ok=True)  # Ensure the output directory exists
        self.examples = []

    def add_example(
        self,
        document_text: str,
        ideal_summary: str,
        document_type: str = None
    ):
        """Add a training example with optional metadata."""
        # Build the prompt that matches your production prompt structure
        # CRITICAL: Your fine-tuning prompt must match your inference prompt
        prompt = self._build_prompt(document_text, document_type)

        self.examples.append({
            "prompt": prompt,
            "completion": ideal_summary
        })

    def _build_prompt(self, text: str, doc_type: str = None) -> str:
        type_context = f" This is a {doc_type}." if doc_type else ""
        return (
            f"Summarise the following legal document in three sections: "
            f"Key Parties, Core Obligations and Risk Flags.{type_context}"
            f"\n\nDocument:\n{text}\n\nSummary:"
        )

    def validate_and_write(self, train_split: float = 0.9):
        """Validate examples and write train/validation splits."""
        # Validation checks
        issues = []
        for i, ex in enumerate(self.examples):
            if len(ex['prompt']) < 10:
                issues.append(f"Example {i}: prompt too short")
            if len(ex['completion']) < 20:
                issues.append(f"Example {i}: completion too short")
            if len(ex['prompt']) > 4000:
                issues.append(f"Example {i}: prompt exceeds token limit")

        if issues:
            print(f"Found {len(issues)} issues:")
            for issue in issues[:10]:  # Show first 10
                print(f"  {issue}")
            return False

        # Split into train and validation
        split_idx = int(len(self.examples) * train_split)
        train_data = self.examples[:split_idx]
        val_data = self.examples[split_idx:]

        # Write JSONL files
        for filename, data in [
            ("train.jsonl", train_data),
            ("validation.jsonl", val_data)
        ]:
            with jsonlines.open(self.output_path / filename, 'w') as writer:
                writer.write_all(data)

        print(f"Written {len(train_data)} training examples")
        print(f"Written {len(val_data)} validation examples")
        return True


# Usage
prep = DatasetPreparator("./training_data")

# Load your examples — minimum 32 for Bedrock,
# recommend 200+ for meaningful results
for _, row in your_dataframe.iterrows():
    prep.add_example(
        document_text=row['document'],
        ideal_summary=row['expert_summary'],
        document_type=row['type']
    )

prep.validate_and_write()
```

Dataset size guidance: Bedrock requires a minimum of 32 training examples. In practice, you won't see meaningful domain adaptation below 100 examples, and the sweet spot for most use cases is 300 to 1,000 high-quality examples. High quality beats high volume: 200 expert-written summaries will outperform 2,000 mediocre ones.
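Because quality matters more than volume, it's worth profiling the dataset before uploading anything. The helper below is an illustrative sketch rather than part of the pipeline above (the 30-word threshold is an arbitrary assumption): it flags exact-duplicate prompts and suspiciously short completions, two common ways weak examples slip in.

```python
import jsonlines
from collections import Counter

def profile_dataset(path: str = "./training_data/train.jsonl") -> None:
    """Rough quality profile: duplicate prompts and completion lengths."""
    with jsonlines.open(path) as reader:
        examples = list(reader)

    prompts = [ex["prompt"] for ex in examples]
    completions = [ex["completion"] for ex in examples]

    duplicates = [p for p, count in Counter(prompts).items() if count > 1]
    # 30 words is an arbitrary threshold; tune it to your domain
    short = [c for c in completions if len(c.split()) < 30]

    lengths = sorted(len(c.split()) for c in completions)
    median = lengths[len(lengths) // 2] if lengths else 0

    print(f"Examples: {len(examples)}")
    print(f"Duplicate prompts: {len(duplicates)}")
    print(f"Completions under 30 words: {len(short)}")
    print(f"Median completion length: {median} words")

profile_dataset()
```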
Step 2: Upload Training Data to S3

```python
def upload_training_data(
    local_dir: str,
    bucket: str,
    prefix: str = "fine-tuning"
) -> dict:
    """Upload training files to S3 and return URIs."""
    s3_uris = {}

    for filename in ["train.jsonl", "validation.jsonl"]:
        local_path = Path(local_dir) / filename
        s3_key = f"{prefix}/{filename}"

        print(f"Uploading {filename}...")
        s3.upload_file(
            str(local_path),
            bucket,
            s3_key
        )

        s3_uris[filename] = f"s3://{bucket}/{s3_key}"
        print(f"Uploaded to {s3_uris[filename]}")

    return s3_uris


uris = upload_training_data(
    "./training_data",
    BUCKET_NAME,
    "legal-summarisation/v1"
)
```

Step 3: Configure and Launch Fine-Tuning Job

```python
def launch_fine_tuning_job(
    job_name: str,
    training_uri: str,
    validation_uri: str,
    output_bucket: str,
    role_arn: str
) -> str:
    """Launch a Bedrock fine-tuning job and return the job ARN."""
    response = bedrock.create_model_customization_job(
        jobName=job_name,
        customModelName=f"{job_name}-model",
        roleArn=role_arn,
        baseModelIdentifier=MODEL_ID,

        # Training data configuration
        trainingDataConfig={
            "s3Uri": training_uri
        },
        validationDataConfig={
            "validators": [{
                "s3Uri": validation_uri
            }]
        },

        # Output configuration
        outputDataConfig={
            "s3Uri": f"s3://{output_bucket}/fine-tuning-output/{job_name}/"
        },

        # Hyperparameters
        # These are the defaults — adjust based on your dataset size
        hyperParameters={
            "epochCount": "3",         # Start with 3, increase if underfitting
            "batchSize": "32",         # 32 is standard for most cases
            "learningRate": "0.00001"  # Conservative default
        },
        customizationType="FINE_TUNING"
    )

    job_arn = response['jobArn']
    print(f"Fine-tuning job launched: {job_arn}")
    return job_arn


# Your IAM role ARN — must have Bedrock and S3 permissions
ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/BedrockFineTuningRole"

job_arn = launch_fine_tuning_job(
    job_name="legal-summarisation-v1",
    training_uri=uris["train.jsonl"],
    validation_uri=uris["validation.jsonl"],
    output_bucket=BUCKET_NAME,
    role_arn=ROLE_ARN
)
```

Hyperparameter guidance:

- epochCount controls how many times the model sees your training data. Start at 3. If your validation loss is still improving at epoch 3, try 5. If it plateaus at epoch 1, your dataset may have quality issues.
- learningRate at 0.00001 is conservative and safe. Going higher risks destabilising the base model's general capabilities. Lower it if you're seeing erratic validation loss.
- batchSize of 32 works for most datasets. Larger batches are more stable but require more memory.

Step 4: Monitor the Job

Fine-tuning a Claude Haiku model typically takes 30 to 90 minutes depending on dataset size. Don't just wait, monitor it.

```python
import time

def monitor_job(job_arn: str, check_interval: int = 60) -> str:
    """Poll job status until completion. Returns the custom model ARN on success."""
    print(f"Monitoring job: {job_arn}")

    while True:
        response = bedrock.get_model_customization_job(
            jobIdentifier=job_arn
        )
        status = response['status']

        print(f"[{time.strftime('%H:%M:%S')}] Status: {status}")

        if status in ['Completed', 'Failed', 'Stopped']:
            if status == 'Completed':
                custom_model_arn = response['outputModelArn']
                print(f"Success! Model ARN: {custom_model_arn}")
                return custom_model_arn
            else:
                failure_msg = response.get('failureMessage', 'Unknown error')
                raise Exception(f"Job {status}: {failure_msg}")

        # Show metrics if available
        if 'trainingMetrics' in response:
            loss = response['trainingMetrics'].get('trainingLoss')
            if loss is not None:
                print(f"  Training loss: {loss:.4f}")

        time.sleep(check_interval)


custom_model_arn = monitor_job(job_arn)
```
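One thing to know while you're watching the job: if the loss makes it obvious that something was misconfigured, you can stop the run rather than pay for the remaining epochs. A minimal sketch, reusing the `bedrock` client and `job_arn` from above:

```python
def stop_job(job_arn: str) -> None:
    """Request a stop for an in-progress customization job."""
    bedrock.stop_model_customization_job(jobIdentifier=job_arn)
    print(f"Stop requested for {job_arn}")

# Example: abort a run that was launched with the wrong dataset
# stop_job(job_arn)
```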
Step 5: Evaluate Before You Deploy

Never skip evaluation. The fine-tuned model will be different from the base model; the question is whether it's different in the ways you wanted.

```python
def evaluate_model(
    custom_model_arn: str,
    test_examples: list,
    base_model_id: str = MODEL_ID
) -> dict:
    """Compare fine-tuned model against base model on test examples."""
    results = {
        'fine_tuned': [],
        'base_model': [],
        'comparisons': []
    }

    for example in test_examples:
        prompt = example['prompt']
        reference = example['reference_output']

        # Run inference on both models
        ft_response = bedrock_runtime.invoke_model(
            modelId=custom_model_arn,
            body=json.dumps({
                "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
                "max_tokens_to_sample": 1000,
                "temperature": 0.1
            })
        )
        base_response = bedrock_runtime.invoke_model(
            modelId=base_model_id,
            body=json.dumps({
                "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
                "max_tokens_to_sample": 1000,
                "temperature": 0.1
            })
        )

        ft_output = json.loads(
            ft_response['body'].read()
        )['completion']
        base_output = json.loads(
            base_response['body'].read()
        )['completion']

        results['comparisons'].append({
            'prompt': prompt,
            'reference': reference,
            'fine_tuned': ft_output,
            'base_model': base_output
        })

    return results


# Run on 20-30 held-out examples that weren't in training
evaluation = evaluate_model(
    custom_model_arn,
    held_out_test_set
)

# Review comparisons manually — automated metrics
# miss nuance that matters in production
for comp in evaluation['comparisons'][:5]:
    print(f"Prompt: {comp['prompt'][:100]}...")
    print(f"Reference: {comp['reference'][:200]}")
    print(f"Fine-tuned: {comp['fine_tuned'][:200]}")
    print(f"Base model: {comp['base_model'][:200]}")
    print("---")
```

Read the outputs. Don't just run BLEU scores and call it done. The qualitative assessment (does the fine-tuned model actually behave the way you wanted it to?) is what tells you whether to deploy or iterate.
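If you do want an automated signal alongside the manual read-through, a structural check is cheap and maps directly onto why you fine-tuned in the first place. The sketch below is illustrative: it reuses the `evaluation` results from above and the section names from the Step 1 prompt, and reports how often each model produced all three required sections.

```python
REQUIRED_SECTIONS = ["Key Parties", "Core Obligations", "Risk Flags"]

def format_compliance(comparisons: list) -> dict:
    """Share of outputs containing every required section header."""
    hits = {"fine_tuned": 0, "base_model": 0}
    for comp in comparisons:
        for key in hits:
            if all(section in comp[key] for section in REQUIRED_SECTIONS):
                hits[key] += 1
    total = len(comparisons) or 1
    return {key: count / total for key, count in hits.items()}

print(format_compliance(evaluation['comparisons']))
```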
Step 6: Deploy via Provisioned Throughput

Custom models require provisioned throughput to serve inference. This is the ongoing cost commitment.

```python
def provision_model(
    model_arn: str,
    provisioned_name: str,
    model_units: int = 1
) -> str:
    """Provision throughput for the fine-tuned model."""
    response = bedrock.create_provisioned_model_throughput(
        modelUnits=model_units,
        provisionedModelName=provisioned_name,
        modelId=model_arn
    )

    provisioned_arn = response['provisionedModelArn']
    print(f"Provisioned model ARN: {provisioned_arn}")
    return provisioned_arn


provisioned_arn = provision_model(
    custom_model_arn,
    "legal-summarisation-prod",
    model_units=1  # Scale up based on your throughput needs
)
```

Production inference:

```python
def invoke_custom_model(
    prompt: str,
    provisioned_arn: str,
    max_tokens: int = 1000
) -> str:
    """Invoke the fine-tuned model for production inference."""
    response = bedrock_runtime.invoke_model(
        modelId=provisioned_arn,
        body=json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": max_tokens,
            "temperature": 0.1,
            "stop_sequences": ["\n\nHuman:"]
        }),
        contentType="application/json",
        accept="application/json"
    )

    result = json.loads(response['body'].read())
    return result['completion']
```

Cost Estimates

Honest numbers for a startup-scale use case.

Training costs:

- Fine-tuning job: approximately $0.004 per 1,000 tokens in your training dataset
- A 500-example dataset with an average of 800 tokens per example: ~$1.60 for training
- Training runs multiple epochs: multiply by epoch count (~$5-8 total for 3 epochs)

Provisioned throughput:

- 1 model unit: approximately $5.50 per hour
- Running 24/7: ~$3,960 per month
- Running 8 hours/day: ~$1,320 per month

The provisioned throughput cost is the real number to plan around. For most startups, a fine-tuned Claude Haiku model only makes economic sense at volume: thousands of requests per day, where the per-token efficiency gain or quality improvement justifies the fixed monthly cost.

Before You Fine-Tune: The Honest Check

I promised to come back to this. Fine-tuning is genuinely powerful for the right problems. It's also consistently reached for too early by teams who haven't fully explored what's achievable with well-engineered prompts.

Before committing to the complexity and cost of fine-tuning, spend a week on prompt engineering. A good system prompt with 5-10 examples often gets you to 90% of what fine-tuning would achieve, at zero training cost, with the ability to iterate in minutes rather than hours.

For enterprise-grade prompt engineering (the methodology, evaluation approach and common mistakes that waste weeks of iteration), we wrote the complete guide on what prompt engineering actually is and how to do it systematically. Read it before you start a fine-tuning project. If you've done the prompt work and you're still hitting the limitations, then the fine-tune Claude on Bedrock enterprise guide covers the production considerations (IAM architecture, multi-model versioning, A/B testing custom models) that go beyond what fits in a single tutorial.

Before you fine-tune, make sure you've exhausted prompt engineering. Sometimes a well-crafted system prompt does 80% of the job. Here's our enterprise prompt engineering guide if you haven't been there yet.

Published by Dextra Labs | AI Consulting & Enterprise LLM Solutions