implementing-observability
Production Observability with OpenTelemetry
Purpose
Implement production-grade observability using OpenTelemetry as the 2025 industry standard. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and critical log-trace correlation patterns.
When to Use
Use when:
- Building production systems requiring visibility into performance and errors
- Debugging distributed systems with multiple services
- Setting up monitoring, logging, or tracing infrastructure
- Implementing structured logging with trace correlation
- Configuring alerting rules for production systems
Skip if:
- Building proof-of-concept without production deployment
- System has < 100 requests/day (console logging may suffice)
The OpenTelemetry Standard (2025)
OpenTelemetry is the CNCF graduated project unifying observability:
```
┌────────────────────────────────────────────────────────┐
│          OpenTelemetry: The Unified Standard           │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ONE SDK for ALL signals:                              │
│  ├── Metrics (Prometheus-compatible)                   │
│  ├── Logs (structured, correlated)                     │
│  ├── Traces (distributed, standardized)                │
│  └── Context (propagates across services)              │
│                                                        │
│  Language SDKs:                                        │
│  ├── Python: opentelemetry-api, opentelemetry-sdk      │
│  ├── Rust: opentelemetry, tracing-opentelemetry        │
│  ├── Go: go.opentelemetry.io/otel                      │
│  └── TypeScript: @opentelemetry/api                    │
│                                                        │
│  Export to ANY backend:                                │
│  ├── LGTM Stack (Loki, Grafana, Tempo, Mimir)          │
│  ├── Prometheus + Jaeger                               │
│  ├── Datadog, New Relic, Honeycomb (SaaS)              │
│  └── Custom backends via OTLP protocol                 │
│                                                        │
└────────────────────────────────────────────────────────┘
```
Context7 Reference: `/websites/opentelemetry_io` (Trust: High, Snippets: 5,888, Score: 85.9)
The Three Pillars of Observability
1. Metrics (What is happening?)
Track system health and performance over time.
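Metric Types: Counters (monotonically increasing), Gauges (values that go up and down), Histograms (distributions), Summaries (client-side percentiles; a legacy Prometheus type that OpenTelemetry supports for compatibility rather than as an API instrument).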
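To make the difference between these metric types concrete, here is a minimal standard-library sketch of their semantics. The `Counter`, `Gauge`, and `Histogram` classes below are hypothetical stand-ins written for illustration; they are not the OpenTelemetry SDK classes, and the bucket bounds are arbitrary.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic total: it can only grow."""
    value: float = 0.0

    def add(self, amount: float) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that may go up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""
    bounds: tuple = (0.1, 0.5, 1.0)  # hypothetical latency buckets, in seconds
    counts: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # One extra bucket catches everything above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value: float) -> None:
        self.counts[bisect_left(self.bounds, value)] += 1


requests_total = Counter()
in_flight = Gauge()
latency = Histogram()

for seconds in (0.05, 0.3, 0.3, 2.0):
    requests_total.add(1)
    latency.record(seconds)
in_flight.set(3)
in_flight.set(1)
```

The counter ends at the number of requests seen, the gauge holds only its latest value, and the histogram keeps per-bucket counts from which a backend can estimate percentiles.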
Brief Example (Python):
```python
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
```
2. Logs (What happened?)
Record discrete events with context.
CRITICAL: Always inject trace_id/span_id for log-trace correlation.
Brief Example (Python + structlog):
```python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
    "processing_request",
    trace_id=format(ctx.trace_id, "032x"),
    span_id=format(ctx.span_id, "016x"),
    user_id=user_id,  # supplied by the surrounding request handler
)
```
See: `references/structured-logging.md` for complete configuration.
3. Traces (Where did time go?)
Track request flow across distributed services.
Key Concepts: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).
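The relationship between these concepts can be sketched with plain data structures: every span in a trace shares one `trace_id`, and each child span records the `span_id` of its parent. The `Span` dataclass and `start_span` helper below are hypothetical illustrations, not OpenTelemetry SDK types; IDs are generated with `secrets` purely for the example.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str                  # shared by every span in the trace
    span_id: str                   # unique per operation
    parent_span_id: Optional[str]  # None marks the root span


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Create a root span, or a child that inherits the parent's trace_id."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


root = start_span("GET /users/42")
db_call = start_span("fetch_user_details", parent=root)
cache_call = start_span("cache_lookup", parent=root)
```

A tracing backend rebuilds the waterfall view by grouping spans on `trace_id` and following the `parent_span_id` links.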
Brief Example (Python + FastAPI):
```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-traces all HTTP requests
```
See: `references/opentelemetry-setup.md` for SDK installation by language.
The LGTM Stack (Self-Hosted Observability)
LGTM = Loki (Logs) + Grafana (Visualization) + Tempo (Traces) + Mimir (Metrics)
```
┌────────────────────────────────────────────────────────┐
│                   LGTM Architecture                    │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ┌────────────────────────────────────────┐            │
│  │     Grafana Dashboard (Port 3000)      │            │
│  │  Unified UI for Logs, Metrics, Traces  │            │
│  └────┬──────────────┬──────────────┬─────┘            │
│       │              │              │                  │
│       ▼              ▼              ▼                  │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐            │
│  │   Loki   │   │  Tempo   │   │  Mimir   │            │
│  │  (Logs)  │   │ (Traces) │   │(Metrics) │            │
│  │Port 3100 │   │Port 3200 │   │Port 9009 │            │
│  └────▲─────┘   └────▲─────┘   └────▲─────┘            │
│       │              │              │                  │
│       └──────────────┴──────────────┘                  │
│                      │                                 │
│              ┌───────▼────────┐                        │
│              │ Grafana Alloy  │                        │
│              │  (Collector)   │                        │
│              │  Port 4317/8   │ ← OTLP gRPC/HTTP       │
│              └───────▲────────┘                        │
│                      │                                 │
│       OpenTelemetry Instrumented Apps                  │
│                                                        │
└────────────────────────────────────────────────────────┘
```
Quick Start: Run `examples/lgtm-docker-compose/docker-compose.yml` for a complete LGTM stack.
See: `references/lgtm-stack.md` for production deployment guide.
Critical Pattern: Log-Trace Correlation
The Problem: Logs and traces live in separate systems. You see an error log but can't find the related trace.
The Solution: Inject `trace_id` and `span_id` into every log record.
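For correlation to work, the IDs must be written in the same form everywhere. A trace ID is a 128-bit value and a span ID a 64-bit value, conventionally rendered as 32 and 16 lowercase hex characters; the same strings appear in the W3C `traceparent` header that carries context between services. The sketch below shows that formatting with the standard library only; `format_ids` and `traceparent` are hypothetical helper names, and the sample ID values are arbitrary.

```python
def format_ids(trace_id: int, span_id: int) -> dict:
    """Render integer IDs as the fixed-width lowercase hex used in logs."""
    return {
        "trace_id": format(trace_id, "032x"),  # 128 bits -> 32 hex chars
        "span_id": format(span_id, "016x"),    # 64 bits  -> 16 hex chars
    }


def traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Build a W3C Trace Context header value: version-traceid-spanid-flags."""
    ids = format_ids(trace_id, span_id)
    flags = "01" if sampled else "00"
    return f"00-{ids['trace_id']}-{ids['span_id']}-{flags}"


trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
span_id = 0x00F067AA0BA902B7  # leading zeros must survive formatting

fields = format_ids(trace_id, span_id)
header = traceparent(trace_id, span_id)
```

Zero-padding matters: formatting with plain `hex()` would drop leading zeros and produce IDs that no longer match what the tracing backend stores.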
Python (structlog)
```python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
    "request_processed",
    trace_id=format(ctx.trace_id, "032x"),  # 32-char hex
    span_id=format(ctx.span_id, "016x"),    # 16-char hex
    user_id=user_id,  # supplied by the surrounding request handler
)
```
Rust (tracing)
```rust
use tracing::{info, instrument};

// `Response` and `handle` are placeholders for your own types and logic;
// `Result` is assumed to be an alias such as `anyhow::Result`.
#[instrument(fields(user_id = %user_id))]
async fn process_request(user_id: u64) -> Result<Response> {
    // The event is recorded inside the span created by #[instrument]; with a
    // tracing-opentelemetry layer installed it is tied to the active trace.
    info!(user_id = user_id, "processing request");
    let result = handle(user_id).await?;
    Ok(result)
}
```
See: `references/trace-context.md` for Go and TypeScript patterns.
Query in Grafana
```logql
{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
```
Quick Setup Guide
1. Choose Your Stack
Decision Tree:
- Greenfield: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
- Existing Prometheus: Add Loki (logs) + Tempo (traces)
- Kubernetes: LGTM via Helm, Alloy DaemonSet
- Zero-ops: Managed SaaS (Grafana Cloud, Datadog, New Relic)
2. Install OpenTelemetry SDK
Bootstrap Script:
```bash
python scripts/setup_otel.py --language python --framework fastapi
```
Manual (Python):
```bash
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp
```
See: `references/opentelemetry-setup.md` for Rust, Go, TypeScript installation.
3. Deploy LGTM Stack
Docker Compose (development):
```bash
cd examples/lgtm-docker-compose
docker-compose up -d
```
Grafana: http://localhost:3000 (admin/admin)
OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)
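Once the containers are up, a quick way to confirm the endpoints listed above are reachable is to attempt a TCP connection to each port. This is a minimal sketch using only the standard library; `port_is_open` and `check_stack` are hypothetical helper names, and a successful connection shows only that something is listening, not that the service is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports as listed above; adjust if your compose file maps them differently.
ENDPOINTS = {"grafana": 3000, "otlp_grpc": 4317, "otlp_http": 4318}


def check_stack(host: str = "localhost") -> dict:
    """Map each endpoint name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ENDPOINTS.items()}
```

Calling `check_stack()` after `docker-compose up -d` returns a dictionary such as `{"grafana": True, ...}`; any `False` entry points at a container that has not started or a port that is mapped differently.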
See: `references/lgtm-stack.md` for production Kubernetes deployment.
4. Configure Structured Logging
See: `references/structured-logging.md` for complete setup (Python, Rust, Go, TypeScript).
5. Set Up Alerting
See: `references/alerting-rules.md` for Prometheus and Loki alert patterns.
Auto-Instrumentation
OpenTelemetry auto-instruments popular frameworks:
```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-trace all HTTP requests
```
Supported: FastAPI, Flask, Django, Express, Gin, Echo, Nest.js
See: `references/opentelemetry-setup.md` for framework-specific setup.
Common Patterns
Custom Spans
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


async def fetch_user_details(user_id):
    # `db` is assumed to be your application's async database client.
    with tracer.start_as_current_span("fetch_user_details") as span:
        span.set_attribute("user_id", user_id)
        user = await db.fetch_user(user_id)
        span.set_attribute("user_found", user is not None)
        return user
```
Error Tracking
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_payment") as span:
    try:
        result = process_payment(amount, card_token)
        span.set_status(Status(StatusCode.OK))
    except PaymentError as e:
        # Mark the span as failed and attach the exception as a span event.
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
```
See: `references/trace-context.md` for background job tracing and context propagation.
Validation and Testing
```bash
# Test log-trace correlation
# 1. Make request to your app
# 2. Copy trace_id from logs
# 3. Query in Grafana: {job="myapp"} |= "trace_id=<TRACE_ID>"

# Validate metrics
python scripts/validate_metrics.py
```
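Step 3 of the correlation test can be scripted so that a malformed ID is caught before it reaches Grafana. The sketch below builds the same LogQL line filter from a copied trace ID; `logql_for_trace` is a hypothetical helper, and the `job` label default is taken from the query shown in the steps.

```python
import re

_TRACE_ID = re.compile(r"[0-9a-f]{32}")


def logql_for_trace(trace_id: str, job: str = "myapp") -> str:
    """Build a LogQL line-filter query for one trace, rejecting malformed IDs."""
    trace_id = trace_id.strip().lower()
    if not _TRACE_ID.fullmatch(trace_id):
        raise ValueError(f"expected 32 hex characters, got {trace_id!r}")
    return f'{{job="{job}"}} |= "trace_id={trace_id}"'


query = logql_for_trace("4BF92F3577B34DA6A3CE929D0E0E4736")
```

The filter matches the literal text `trace_id=<id>`, so it assumes key=value log output; if your logs are JSON, the matched substring would need to follow that encoding instead.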
Integration with Other Skills
- Dashboards: Embed Grafana panels, query Prometheus metrics
- Feedback: Alert routing (Slack, PagerDuty), notification UI
- Data-Viz: Time-series charts, trace waterfall, latency heatmaps
See: `examples/fastapi-otel/` for complete integration.
Progressive Disclosure
Setup Guides:
- `references/opentelemetry-setup.md` - SDK installation (Python, Rust, Go, TypeScript)
- `references/structured-logging.md` - structlog, tracing, slog, pino configuration
- `references/lgtm-stack.md` - LGTM deployment (Docker, Kubernetes)
- `references/trace-context.md` - Log-trace correlation patterns
- `references/alerting-rules.md` - Prometheus and Loki alert templates
Examples:
- `examples/fastapi-otel/` - FastAPI + OpenTelemetry + LGTM
- `examples/axum-tracing/` - Rust Axum + tracing + LGTM
- `examples/lgtm-docker-compose/` - Production-ready LGTM stack
Scripts:
- `scripts/setup_otel.py` - Bootstrap OpenTelemetry SDK
- `scripts/generate_dashboards.py` - Generate Grafana dashboards
- `scripts/validate_metrics.py` - Validate metric naming
Key Principles
- OpenTelemetry is THE standard - Use OTel SDK, not vendor-specific SDKs
- Auto-instrumentation first - Prefer auto over manual spans
- Always correlate logs and traces - Inject trace_id/span_id into every log
- Use structured logging - JSON format, consistent field names
- LGTM stack for self-hosting - Production-ready open-source stack
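Two of these principles, structured JSON logs and trace correlation on every record, can be enforced at the logging layer so that individual call sites cannot forget them. The following is a minimal sketch using only the standard library; the `current_trace` context variable is a hypothetical stand-in for the active span context, which in a real service would come from `trace.get_current_span().get_span_context()`.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the active span context: (trace_id, span_id) or None.
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id/span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_trace.get() or (None, None)
        record.trace_id = trace_id
        record.span_id = span_id
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with consistent field names."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("otel_sketch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("request_processed")
log_line = json.loads(stream.getvalue())
```

Attaching the filter to the handler rather than to individual loggers means every record that reaches the output carries the same field names, including records from third-party libraries.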
Common Pitfalls
Don't:
- Use vendor-specific SDKs (use OpenTelemetry)
- Log without trace_id/span_id context
- Manually instrument what auto-instrumentation covers
- Mix logging libraries (pick one: structlog, tracing, slog, pino)
Do:
- Start with auto-instrumentation
- Add manual spans only for business-critical operations
- Use semantic conventions for span attributes
- Export to OTLP (gRPC preferred over HTTP)
- Test locally with LGTM docker-compose before production
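The OTLP export recommendation can often be applied without code changes, because OpenTelemetry SDKs read their exporter settings from standard environment variables. The sketch below assembles those variables for a local stack; `otlp_env` is a hypothetical helper, the ports are the ones used by the local LGTM setup in this document, and how fully each variable is honoured depends on the SDK and language in use.

```python
def otlp_env(service_name: str, use_grpc: bool = True, host: str = "localhost") -> dict:
    """Return OpenTelemetry environment variables for OTLP export."""
    port = 4317 if use_grpc else 4318  # gRPC and HTTP listen on different ports
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{port}",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc" if use_grpc else "http/protobuf",
    }


env = otlp_env("api-service")
```

Merging the result into a process environment, for example `{**os.environ, **otlp_env("api-service")}`, keeps endpoint choices out of application code so the same build can target a local stack or production.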
Success Metrics
- 100% of logs include trace_id when in request context
- Mean time to resolution (MTTR) decreases by >50%
- Developers use Grafana as first debugging tool
- 80%+ of telemetry from auto-instrumentation
- Alert noise < 5% false positives
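The first of these targets is straightforward to measure from exported logs. The sketch below computes the share of JSON log lines that carry a `trace_id`; `trace_id_coverage` is a hypothetical helper and the sample lines are invented for illustration. It counts every parseable record, so to measure the target as stated you would first filter the input to logs emitted in request context.

```python
import json


def trace_id_coverage(log_lines: list) -> float:
    """Fraction of parseable JSON log lines that carry a non-empty trace_id."""
    total = with_trace = 0
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as startup banners
        if not isinstance(record, dict):
            continue
        total += 1
        if record.get("trace_id"):
            with_trace += 1
    return with_trace / total if total else 0.0


sample = [
    '{"event": "request_processed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    '{"event": "cache_miss", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    '{"event": "startup"}',
    "plain text line",
]
coverage = trace_id_coverage(sample)
```

Here two of the three JSON records include a `trace_id`, so the function reports roughly 0.67; the plain-text line is ignored rather than counted against the total.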