How to Use APM Tools Effectively
TL;DR
APM = metrics + traces + logs — Use all three together.
Auto-instrument first — Agents cover HTTP, DB, queues. Add custom tags (order_id, customer_tier) for business context.
Use percentiles, not averages — p95/p99 reveal slow users. Averages hide problems.
Distributed tracing — Shows cross-service bottlenecks via waterfall views and flame graphs.
Alert on symptoms — Latency and errors (based on SLOs), not causes. Include runbooks.
Sample intelligently — 10% of traffic, but 100% of errors.
Best practices — Start with critical journeys, keep lightweight, standardize tags, review weekly, share access, integrate with CI/CD.
Application Performance Monitoring (APM) tools provide visibility into application behavior. They track response times, error rates, and resource consumption. They trace requests across services. They identify bottlenecks and anomalies. But having an APM tool and using it effectively are different things. Strategic implementation and thoughtful analysis transform APM from overhead into optimization accelerator.
APM tools collect three types of data: metrics, traces, and logs. Metrics quantify system behavior over time. Traces show request flow through systems. Logs provide detailed event records.
Metrics include response times, throughput, and error rates. Aggregate metrics show trends. Percentile metrics reveal distribution.
# Custom metric reporting
import time
from datadog import statsd

def process_order(order):
    start = time.time()
    try:
        result = do_processing(order)
        statsd.increment('orders.processed', tags=['status:success'])
        return result
    except Exception:
        statsd.increment('orders.processed', tags=['status:error'])
        raise
    finally:
        duration = time.time() - start
        statsd.histogram('orders.processing_time', duration)
Traces connect related operations across services. A single user request might touch dozens of services. Traces show the entire journey.
Profiling identifies where code spends time. CPU profiling shows hot functions. Memory profiling reveals allocation patterns.
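As a minimal, tool-agnostic sketch, Python's built-in cProfile shows the same idea for a single code path (process_order stands in for whatever entry point you are investigating; APM profilers run this kind of collection continuously in production):
# CPU profiling sketch using the standard library
import cProfile
import pstats

def profile_order_processing(order):
    profiler = cProfile.Profile()
    profiler.enable()
    process_order(order)  # the code path under investigation
    profiler.disable()
    # Sort by cumulative time and print the ten hottest functions
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)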
Real User Monitoring (RUM) captures browser experience. Server metrics miss client-side delays. RUM shows what users actually experience.
Synthetic monitoring tests from external locations. Scheduled tests verify availability and baseline performance, complementing real user data.
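For illustration, a synthetic check can be as small as a scheduled probe that records availability and latency against a baseline. The URL and 500 ms threshold below are placeholders; real synthetic monitoring runs such probes from multiple external locations:
# Minimal synthetic check: availability plus baseline latency
import time
import requests

def synthetic_check(url='https://example.com/health', timeout_s=5, baseline_ms=500):
    start = time.time()
    try:
        response = requests.get(url, timeout=timeout_s)
        latency_ms = (time.time() - start) * 1000
        return {
            'available': response.status_code == 200,
            'latency_ms': round(latency_ms, 1),
            'within_baseline': latency_ms <= baseline_ms,
        }
    except requests.RequestException:
        return {'available': False, 'latency_ms': None, 'within_baseline': False}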
| Tool | Key Features | Best For |
| --- | --- | --- |
| Datadog | Infrastructure, APM, logs, RUM in one platform; strong integration ecosystem | Broad monitoring coverage |
| New Relic | Mature APM capabilities; long history | Traditional and modern architectures |
| Dynatrace | AI-powered analysis; automatic root cause detection | Enterprise features |
| Elastic APM | Integrates with Elastic Stack; self-hosted option | Teams already using Elasticsearch |
| Jaeger + Prometheus | Open-source tracing + metrics | Teams with observability expertise, large scale |
APM Evaluation Criteria:
Agent overhead — Affects application performance
Data retention — Affects investigation capability
Cost models — Vary significantly between tools
Stack and scale — Some tools excel with specific languages or frameworks
# Example: Datadog agent configuration
logs_enabled: true
apm_config:
  enabled: true
  env: production
  service: order-service
process_config:
  enabled: true
Auto-instrumentation provides immediate value. APM agents automatically instrument common frameworks. Database calls, HTTP requests, and queue operations are tracked automatically.
# Automatic instrumentation with ddtrace
from ddtrace import tracer, patch_all
patch_all() # Instruments Django, requests, psycopg2, etc.
Custom instrumentation adds business context. Track business operations, not just technical operations. Measure what matters to the business.
from ddtrace import tracer

@tracer.wrap(service='orders', resource='process_order')
def process_order(order):
    with tracer.trace('validate_order') as span:
        span.set_tag('order_id', order.id)
        span.set_tag('order_total', order.total)
        validate(order)
    with tracer.trace('charge_payment'):
        charge_payment(order)
    with tracer.trace('fulfill_order'):
        fulfill(order)
Tag traces with useful context. User IDs, tenant IDs, and feature flags enable filtering. Custom tags power analysis.
Sample strategically at scale. Tracing everything at high volume is expensive. Sample representative transactions while keeping all error traces.
# Custom sampling rules
from ddtrace import tracer
from ddtrace.sampler import DatadogSampler, SamplingRule  # import path varies across ddtrace versions

tracer.configure(
    sampler=DatadogSampler(
        rules=[
            SamplingRule(sample_rate=1.0, name='error_traces'),
            SamplingRule(sample_rate=0.1, name='all_traces')
        ]
    )
)
Service maps visualize dependencies. See how services connect. Identify critical paths and single points of failure.
Compare time periods to find changes. "What changed since yesterday?" is a common question. Comparison views answer quickly.
Analyze by percentiles, not averages. p50 shows typical experience. p95 and p99 show worst cases. Averages hide problems.
-- Finding slow queries in APM data
SELECT
resource,
count(*) as requests,
avg(duration) as avg_duration,
percentile(duration, 0.95) as p95_duration
FROM traces
WHERE service = 'order-service'
AND start_time > now() - interval '1 hour'
GROUP BY resource
ORDER BY p95_duration DESC
LIMIT 10
Filter by tags to isolate issues. High latency affecting one customer? Filter by customer tag. Errors in one region? Filter by region.
Correlate metrics with traces. When latency spikes, what traces show the problem? Link aggregate views to detailed evidence.
Track trends over time. Gradual degradation is easy to miss. Weekly comparisons reveal slow regression.
Trace context propagates across services. Each service adds its span to the trace. The full picture emerges from connected spans.
# Propagating trace context in HTTP calls
import requests
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator

def call_downstream_service(order):
    headers = {}
    # Inject the current span's trace context into the outgoing request headers
    HTTPPropagator.inject(tracer.current_span().context, headers)
    return requests.post(
        'http://fulfillment-service/fulfill',
        json=order.to_dict(),
        headers=headers
    )
| Visualization | What It Shows | Benefit |
| --- | --- | --- |
| Waterfall views | Timing relationships between operations | Parallel ops appear side by side; sequential ops stack vertically |
| Flame graphs | Aggregate trace data across many traces | Identify common patterns and hot spots |
| Trace search | Find specific issues by tags or duration | Navigate from symptoms to evidence |
Trace context must propagate through HTTP calls, message queues, and background jobs. One missing header breaks the chain.
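The receiving side has to do the matching extract. A hedged sketch, assuming ddtrace's HTTPPropagator, where headers is whatever carrier the transport provides (HTTP headers, message attributes, job metadata) and fulfill_order is a placeholder:
# Receiving side: extract the propagated context so new spans join the caller's trace
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator

def handle_fulfillment_request(headers, payload):
    context = HTTPPropagator.extract(headers)
    if context:
        # Activate the upstream context before starting local spans
        tracer.context_provider.activate(context)
    with tracer.trace('fulfillment.process'):
        fulfill_order(payload)  # placeholder for the actual work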
We help you:
Propagate context correctly — HTTP headers, message metadata, thread-local storage
Identify cross-service bottlenecks — Which service is really the slow one?
Build service maps — Visualize dependencies and failure points
👉 Get Distributed Tracing Expertise
Alert on symptoms, not causes — Users experience latency and errors. Alert on those. Investigate causes when symptoms occur.
# Datadog alert configuration
type: metric alert
query: avg(last_5m):avg:trace.web.request.duration{service:order-service} > 500
message: |
  Order service latency exceeds 500ms.
  Check recent deployments and downstream dependencies.
  @slack-oncall
thresholds:
  critical: 500
  warning: 300
Set meaningful thresholds — Too sensitive creates noise. Too lenient misses issues. Base thresholds on SLO targets.
Include context in alerts — Link to dashboards. Show recent changes. Provide runbook links.
Use anomaly detection — ML identifies deviations from normal; catches issues static thresholds miss.
Use alerts to trigger investigation, not panic — Good monitoring means fewer surprises.
Correlate alerts with deployments — Did this start after a deployment? Integrate APM with CI/CD.
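One lightweight way to do this, sketched with the official datadog Python client (API keys come from the environment; the GIT_COMMIT variable and tag names are illustrative), is to emit a deployment event from the pipeline so alerts can be lined up against releases:
# Emit a deployment marker from a CI/CD step so alerts can be correlated with releases
import os
from datadog import initialize, api

initialize(
    api_key=os.environ['DD_API_KEY'],
    app_key=os.environ['DD_APP_KEY'],
)
api.Event.create(
    title='Deployment: order-service',
    text=f"Deployed commit {os.environ.get('GIT_COMMIT', 'unknown')} to production",
    tags=['service:order-service', 'env:production', 'event_type:deployment'],
)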
Start with the most important services. Don't instrument everything at once. Focus on critical paths first.
Keep instrumentation lightweight. Heavy agents affect the performance you're measuring. Monitor overhead.
Standardize tagging across services. Consistent tag names enable cross-service analysis. Document tagging conventions.
Retain data appropriately. High-resolution data for recent history. Aggregated data for longer periods. Balance insight against storage cost.
Review performance data regularly. Don't wait for alerts. Weekly performance reviews catch trends before they become problems.
Share APM access broadly. Developers should see their services' performance. Broad access improves ownership and awareness.
Integrate APM with development workflow. Link APM data to code changes. Make performance part of development, not just operations.
Train teams on APM usage. Tools are only useful when people use them effectively. Invest in training.
| Practice | Benefit |
| --- | --- |
| Custom instrumentation | Business context in traces |
| Percentile analysis | Visibility into worst cases |
| Trace sampling | Scale without excessive cost |
| Alert on symptoms | Actionable notifications |
| Regular review | Catch trends early |
APM tools are powerful, but power without strategy creates noise without insight. The difference between effective and ineffective APM lies not in the tool but in how you use it:
Instrument strategically — Auto first, custom for business context
Analyze by percentiles — Averages hide problems
Trace across services — Distributed tracing is non-negotiable for microservices
Alert on user-impacting symptoms — Not internal metrics
Review data regularly — Weekly performance reviews catch regressions
Effective APM reduces mean time to detection (MTTD) and mean time to resolution (MTTR) dramatically — not because the tool is magic, but because you have the data to ask the right questions when incidents occur: "What changed?", "Where is the time going?", "Which users are affected?" With proper instrumentation and analysis, these questions have answers. Without APM, you're guessing.
Invest in the tool, but invest more in the practices that make it valuable.
👉 Talk to Our Engineers | See Case Studies
Over-alerting on non-actionable metrics. Teams often set alerts for any CPU spike or any error, generating dozens of notifications that get ignored.
Fix:
Alert only on user-impacting symptoms (latency breaching SLO, error rate exceeding threshold)
Or on leading indicators you can actually act on (e.g., database connection pool exhaustion)
For everything else, build dashboards and review trends weekly
Every alert should have a clear runbook and require a human decision
If you ignore it, delete it
| Aspect | Open-Source (Prometheus + Jaeger) | Commercial APM (Datadog, New Relic, Dynatrace) |
| --- | --- | --- |
| Control | Full control | Less control |
| Licensing costs | None | Costs scale with volume |
| Operational overhead | Significant (deploy, scale, maintain) | Minimal (managed service) |
| Integration | DIY | Integrated metrics, traces, logs out-of-the-box |
| Best for | Teams with strong observability expertise, large scale | Most teams |
Recommendation: Start with commercial APM for the first 1–2 years of production. When your scale makes the bill painful, evaluate open-source alternatives with dedicated SRE resources.
Business context tags. Auto-instrumentation gives you technical metrics (HTTP method, database query). Custom instrumentation answers business questions:
user_id or customer_tier — "Is the latency only affecting free tier users?"
order_total or payment_method — "Is the slowdown only for large orders?"
feature_flag — "Is this related to a canary deployment?"
tenant_id — "Is one tenant experiencing errors?"
Add these tags in spans and set up dashboards to filter by them. Without business context, you know something is slow but not who is affected — which delays investigation.
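A sketch of what that looks like with ddtrace, assuming the authenticated user, order, and feature flags are available at the point of instrumentation (the attribute names are illustrative):
# Attach business-context tags to the active span
from ddtrace import tracer

def tag_business_context(user, order, feature_flags):
    span = tracer.current_span()
    if span is None:
        return  # tracing disabled or no active span
    span.set_tag('user_id', user.id)
    span.set_tag('customer_tier', user.tier)
    span.set_tag('tenant_id', user.tenant_id)
    span.set_tag('order_total', order.total)
    span.set_tag('payment_method', order.payment_method)
    span.set_tag('feature_flag.new_checkout', feature_flags.get('new_checkout', False))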