implementing-observability

Production Observability with OpenTelemetry

Purpose

Implement production-grade observability using OpenTelemetry as the 2025 industry standard. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and critical log-trace correlation patterns.

When to Use

Use when:
  • Building production systems requiring visibility into performance and errors
  • Debugging distributed systems with multiple services
  • Setting up monitoring, logging, or tracing infrastructure
  • Implementing structured logging with trace correlation
  • Configuring alerting rules for production systems
Skip if:
  • Building proof-of-concept without production deployment
  • System has < 100 requests/day (console logging may suffice)

The OpenTelemetry Standard (2025)

OpenTelemetry is a CNCF project that unifies observability signals under one standard:
┌────────────────────────────────────────────────────────┐
│          OpenTelemetry: The Unified Standard           │
├────────────────────────────────────────────────────────┤
│                                                         │
│  ONE SDK for ALL signals:                              │
│  ├── Metrics (Prometheus-compatible)                   │
│  ├── Logs (structured, correlated)                     │
│  ├── Traces (distributed, standardized)                │
│  └── Context (propagates across services)              │
│                                                         │
│  Language SDKs:                                         │
│  ├── Python: opentelemetry-api, opentelemetry-sdk      │
│  ├── Rust: opentelemetry, tracing-opentelemetry        │
│  ├── Go: go.opentelemetry.io/otel                      │
│  └── TypeScript: @opentelemetry/api                    │
│                                                         │
│  Export to ANY backend:                                │
│  ├── LGTM Stack (Loki, Grafana, Tempo, Mimir)          │
│  ├── Prometheus + Jaeger                               │
│  ├── Datadog, New Relic, Honeycomb (SaaS)              │
│  └── Custom backends via OTLP protocol                 │
│                                                         │
└────────────────────────────────────────────────────────┘
Context7 Reference: /websites/opentelemetry_io (Trust: High, Snippets: 5,888, Score: 85.9)

The Three Pillars of Observability

1. Metrics (What is happening?)

Track system health and performance over time.
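Metric Types: Counters (monotonically increasing), Gauges (go up and down), Histograms (distributions), Summaries (precomputed percentiles; a Prometheus metric type with no corresponding OpenTelemetry instrument).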
Brief Example (Python):
python
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
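To make the histogram type concrete, here is a minimal stdlib-only sketch of how an explicit-bucket histogram aggregates observations. The `bucket_counts` helper, the latency values, and the bucket boundaries are illustrative assumptions, not part of the OpenTelemetry API; a real SDK performs this aggregation internally.

```python
from bisect import bisect_left

def bucket_counts(values, bounds):
    """Count observations per bucket; each bucket includes its upper bound."""
    counts = [0] * (len(bounds) + 1)  # final slot is the overflow bucket
    for value in values:
        counts[bisect_left(bounds, value)] += 1
    return counts

# Hypothetical request latencies in milliseconds
latencies_ms = [3, 5, 12, 48, 250, 900]
bounds_ms = [5, 10, 25, 50, 100, 250, 500]

counts = bucket_counts(latencies_ms, bounds_ms)
labels = [f"<= {b}" for b in bounds_ms] + [f"> {bounds_ms[-1]}"]
for label, count in zip(labels, counts):
    print(f"{label:>7} ms: {count}")
```

Percentiles are then estimated from these counts at query time, which is why the choice of boundaries determines how precise a reported p95 or p99 can be.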

2. Logs (What happened?)

Record discrete events with context.
CRITICAL: Always inject trace_id/span_id for log-trace correlation.
Brief Example (Python + structlog):
python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "processing_request",
    trace_id=format(ctx.trace_id, '032x'),
    span_id=format(ctx.span_id, '016x'),
    user_id=user_id
)
See: references/structured-logging.md for complete configuration.
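One detail the snippet above glosses over: outside an active span, the OpenTelemetry API returns a non-recording span whose IDs are zero, so formatting them blindly produces all-zero strings in the logs. A minimal sketch of a guard, operating on plain integers so it runs without the SDK; `trace_fields` is a hypothetical helper name.

```python
def trace_fields(trace_id: int, span_id: int) -> dict:
    """Return hex-encoded correlation fields, or {} when the IDs are invalid (zero)."""
    if trace_id == 0 or span_id == 0:
        return {}
    return {
        "trace_id": format(trace_id, "032x"),  # 128-bit ID -> 32 hex chars
        "span_id": format(span_id, "016x"),    # 64-bit ID -> 16 hex chars
    }

fields = trace_fields(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
print(fields)
print(trace_fields(0, 0))  # no active span: nothing to inject
```

In the structlog call this could be used as `logger.info("processing_request", **trace_fields(ctx.trace_id, ctx.span_id), user_id=user_id)`.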

3. Traces (Where did time go?)

Track request flow across distributed services.
Key Concepts: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).
Brief Example (Python + FastAPI):
python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-traces all HTTP requests
See: references/opentelemetry-setup.md for SDK installation by language.
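The trace, span, and parent-child concepts can be illustrated without any SDK. The sketch below models spans as plain records and answers "where did time go?" by computing each span's self time (its duration minus its direct children). `SpanRecord`, `self_times`, and the sample data are hypothetical, and the calculation assumes children of one parent do not overlap and every parent_id refers to a span in the list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

def self_times(spans):
    """Duration of each span minus the time spent in its direct children."""
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id is not None:
            result[s.parent_id] -= s.duration_ms
    return result

# One trace: a root HTTP span with two child operations
spans = [
    SpanRecord("a", None, "GET /orders", 0, 120),
    SpanRecord("b", "a", "db.query", 10, 90),
    SpanRecord("c", "a", "render", 95, 115),
]
times = self_times(spans)
slowest = max(spans, key=lambda s: times[s.span_id])
print(slowest.name, times[slowest.span_id])
```

A trace waterfall in Grafana presents the same parent-child structure visually; the arithmetic here is only meant to show what the nesting encodes.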

The LGTM Stack (Self-Hosted Observability)

LGTM = Loki (Logs) + Grafana (Visualization) + Tempo (Traces) + Mimir (Metrics)
┌────────────────────────────────────────────────────────┐
│                  LGTM Architecture                      │
├────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────────────────────────────────────┐      │
│  │           Grafana Dashboard (Port 3000)      │      │
│  │  Unified UI for Logs, Metrics, Traces       │      │
│  └──────┬──────────────┬─────────────┬─────────┘      │
│         │              │             │                 │
│         ▼              ▼             ▼                 │
│  ┌──────────┐   ┌──────────┐  ┌──────────┐            │
│  │   Loki   │   │  Tempo   │  │  Mimir   │            │
│  │  (Logs)  │   │ (Traces) │  │(Metrics) │            │
│  │Port 3100 │   │Port 3200 │  │Port 9009 │            │
│  └────▲─────┘   └────▲─────┘  └────▲─────┘            │
│       │              │             │                   │
│       └──────────────┴─────────────┘                   │
│                      │                                 │
│              ┌───────▼────────┐                        │
│              │ Grafana Alloy  │                        │
│              │  (Collector)   │                        │
│              │  Port 4317/8   │ ← OTLP gRPC/HTTP       │
│              └───────▲────────┘                        │
│                      │                                 │
│         OpenTelemetry Instrumented Apps                │
│                                                         │
└────────────────────────────────────────────────────────┘
Quick Start: Run examples/lgtm-docker-compose/docker-compose.yml for a complete LGTM stack.
See: references/lgtm-stack.md for production deployment guide.
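As a quick reference for wiring Grafana data sources, the signal-to-backend mapping in the diagram can be written down as a small lookup table. This is only a sketch: the ports come from the diagram, while the `localhost` host, the plain-HTTP scheme, and the `backend_url` helper are assumptions for a local setup.

```python
# Ports are taken from the architecture diagram; host and scheme are assumed.
BACKENDS = {
    "logs": ("Loki", 3100),
    "traces": ("Tempo", 3200),
    "metrics": ("Mimir", 9009),
}

def backend_url(signal: str, host: str = "localhost") -> str:
    """Base URL of the backend that stores the given signal."""
    _name, port = BACKENDS[signal]
    return f"http://{host}:{port}"

for signal, (name, _port) in BACKENDS.items():
    print(f"{signal:>7} -> {name:<5} {backend_url(signal)}")
```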

Critical Pattern: Log-Trace Correlation

The Problem: Logs and traces live in separate systems. You see an error log but can't find the related trace.
The Solution: Inject trace_id and span_id into every log record.

Python (structlog)

python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "request_processed",
    trace_id=format(ctx.trace_id, '032x'),  # 32-char hex
    span_id=format(ctx.span_id, '016x'),    # 16-char hex
    user_id=user_id
)
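Passing the IDs by hand on every call is easy to forget. One way to make the injection automatic is a logging filter; the sketch below uses only the standard library so it runs anywhere. The `current_ids` context variable is a stand-in (an assumption) for the active span context, which in a real service would come from `trace.get_current_span().get_span_context()`; the class names are likewise hypothetical.

```python
import contextvars
import io
import json
import logging

# Stand-in for the active span context: (trace_id, span_id) as integers.
current_ids = contextvars.ContextVar("current_ids", default=(0, 0))

class TraceContextFilter(logging.Filter):
    """Attach hex-encoded trace_id/span_id to every record passing through."""
    def filter(self, record):
        trace_id, span_id = current_ids.get()
        record.trace_id = format(trace_id, "032x")
        record.span_id = format(span_id, "016x")
        return True  # never drop the record, only enrich it

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with consistent field names."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })

stream = io.StringIO()  # captured here for the demo; use stdout in a service
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("correlation_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_ids.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
logger.info("request_processed")
line = json.loads(stream.getvalue())
print(line)
```

With structlog, the equivalent would be a processor that adds the same two keys to the event dict, so call sites stay free of tracing details.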

Rust (tracing)

rust
use tracing::{info, instrument};

#[instrument(fields(user_id = %user_id))]
async fn process_request(user_id: u64) -> Result<Response> {
    // With a tracing-opentelemetry layer installed, this event is recorded on
    // the current span and inherits its trace_id/span_id
    info!(user_id = user_id, "processing request");
    Ok(result)
}
See: references/trace-context.md for Go and TypeScript patterns.

Query in Grafana

logql
{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
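When the query is generated by tooling (for example, a link from an alert to the matching logs), it helps to validate the trace ID before interpolating it. A minimal sketch; `trace_query` is a hypothetical helper, and the line filter assumes log lines contain the literal text `trace_id=<id>` (logfmt-style output). JSON-formatted logs would need a different match string or a JSON parser stage.

```python
import re

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")  # 128-bit ID as lowercase hex

def trace_query(job: str, trace_id: str) -> str:
    """Build a LogQL line-filter query for one trace ID."""
    if not TRACE_ID_RE.fullmatch(trace_id):
        raise ValueError(f"not a 32-char lowercase hex trace ID: {trace_id!r}")
    return f'{{job="{job}"}} |= "trace_id={trace_id}"'

query = trace_query("api-service", "4bf92f3577b34da6a3ce929d0e0e4736")
print(query)
```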

Quick Setup Guide

1. Choose Your Stack

Decision Tree:
  • Greenfield: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
  • Existing Prometheus: Add Loki (logs) + Tempo (traces)
  • Kubernetes: LGTM via Helm, Alloy DaemonSet
  • Zero-ops: Managed SaaS (Grafana Cloud, Datadog, New Relic)

2. Install OpenTelemetry SDK

Bootstrap Script:
bash
python scripts/setup_otel.py --language python --framework fastapi
Manual (Python):
bash
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp
See: references/opentelemetry-setup.md for Rust, Go, TypeScript installation.

3. Deploy LGTM Stack

Docker Compose (development):
bash
cd examples/lgtm-docker-compose
docker-compose up -d
# OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)

**See**: `references/lgtm-stack.md` for production Kubernetes deployment.
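To point an application at these endpoints, the OpenTelemetry SDKs read the standard `OTEL_*` environment variables. The sketch below only builds the variable set; `otlp_env` and the `checkout-api` service name are hypothetical, and the ports are the ones listed above.

```python
def otlp_env(service_name: str, protocol: str = "grpc", host: str = "localhost") -> dict:
    """Environment variables selecting the OTLP endpoint for a local LGTM stack."""
    ports = {"grpc": 4317, "http/protobuf": 4318}
    if protocol not in ports:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{ports[protocol]}",
    }

env = otlp_env("checkout-api")
for key, value in env.items():
    print(f"export {key}={value}")
```

Applying the result with `os.environ.update(env)` before the SDK is initialized, or exporting the printed lines in the shell, keeps endpoint configuration out of application code.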

4. Configure Structured Logging

See: references/structured-logging.md for complete setup (Python, Rust, Go, TypeScript).

5. Set Up Alerting

See: references/alerting-rules.md for Prometheus and Loki alert patterns.

Auto-Instrumentation

OpenTelemetry auto-instruments popular frameworks:
python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-trace all HTTP requests
Supported: FastAPI, Flask, Django, Express, Gin, Echo, Nest.js
See: references/opentelemetry-setup.md for framework-specific setup.

Common Patterns

Custom Spans

python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("fetch_user_details") as span:
    span.set_attribute("user_id", user_id)
    user = await db.fetch_user(user_id)
    span.set_attribute("user_found", user is not None)

Error Tracking

python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_payment") as span:
    try:
        result = process_payment(amount, card_token)
        span.set_status(Status(StatusCode.OK))
    except PaymentError as e:
        # Mark the span as failed and attach the exception as a span event
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
See: references/trace-context.md for background job tracing and context propagation.
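Context propagation between services is carried by default in the W3C `traceparent` header, whose layout is `version-trace_id-parent_id-flags` in lowercase hex. The sketch below builds and parses that header with the standard library to show what travels on the wire; in practice the OpenTelemetry propagators do this for you. The function names are hypothetical, and version handling is simplified to a format check.

```python
import re

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Serialize IDs into a version-00 traceparent header value."""
    flags = 1 if sampled else 0
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"

def parse_traceparent(header: str):
    """Return (trace_id, span_id, sampled), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id = int(match["trace_id"], 16)
    span_id = int(match["span_id"], 16)
    if trace_id == 0 or span_id == 0:  # all-zero IDs are invalid
        return None
    return trace_id, span_id, bool(int(match["flags"], 16) & 1)

header = build_traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
print(header)
print(parse_traceparent(header))
```

For background jobs, the same header value can be stored alongside the job payload and parsed by the worker so the job's spans join the originating trace.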

Validation and Testing

bash
# Test log-trace correlation
# 1. Make request to your app
# 2. Copy trace_id from logs
# 3. Query in Grafana: {job="myapp"} |= "trace_id=<TRACE_ID>"

# Validate metrics
python scripts/validate_metrics.py
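The contents of scripts/validate_metrics.py are not shown here, so the following is only a sketch of what a metric-name check might look like. The rules are assumptions: the character set follows my reading of the OpenTelemetry instrument naming syntax (starts with a letter; then letters, digits, `_`, `.`, `-`, `/`; at most 255 characters), and the lowercase, dot-namespaced style follows the semantic conventions.

```python
import re

# Assumed syntax rule; confirm against the OpenTelemetry metrics API spec.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_.\-/]{0,254}")

def check_metric_name(name: str) -> list:
    """Return a list of problems found in a metric name (empty if none)."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("invalid characters or length")
    if name != name.lower():
        problems.append("use lowercase")
    if ".." in name or name.endswith("."):
        problems.append("empty namespace segment")
    return problems

for candidate in ["http.server.requests", "HTTP Requests", "db..calls"]:
    print(candidate, "->", check_metric_name(candidate) or "ok")
```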

Integration with Other Skills

  • Dashboards: Embed Grafana panels, query Prometheus metrics
  • Feedback: Alert routing (Slack, PagerDuty), notification UI
  • Data-Viz: Time-series charts, trace waterfall, latency heatmaps
See: examples/fastapi-otel/ for complete integration.

Progressive Disclosure

Setup Guides:
  • references/opentelemetry-setup.md - SDK installation (Python, Rust, Go, TypeScript)
  • references/structured-logging.md - structlog, tracing, slog, pino configuration
  • references/lgtm-stack.md - LGTM deployment (Docker, Kubernetes)
  • references/trace-context.md - Log-trace correlation patterns
  • references/alerting-rules.md - Prometheus and Loki alert templates
Examples:
  • examples/fastapi-otel/ - FastAPI + OpenTelemetry + LGTM
  • examples/axum-tracing/ - Rust Axum + tracing + LGTM
  • examples/lgtm-docker-compose/ - Production-ready LGTM stack
Scripts:
  • scripts/setup_otel.py - Bootstrap OpenTelemetry SDK
  • scripts/generate_dashboards.py - Generate Grafana dashboards
  • scripts/validate_metrics.py - Validate metric naming

Key Principles

  1. OpenTelemetry is THE standard - Use OTel SDK, not vendor-specific SDKs
  2. Auto-instrumentation first - Prefer auto over manual spans
  3. Always correlate logs and traces - Inject trace_id/span_id into every log
  4. Use structured logging - JSON format, consistent field names
  5. LGTM stack for self-hosting - Production-ready open-source stack

Common Pitfalls

Don't:
  • Use vendor-specific SDKs (use OpenTelemetry)
  • Log without trace_id/span_id context
  • Manually instrument what auto-instrumentation covers
  • Mix logging libraries (pick one: structlog, tracing, slog, pino)
Do:
  • Start with auto-instrumentation
  • Add manual spans only for business-critical operations
  • Use semantic conventions for span attributes
  • Export to OTLP (gRPC preferred over HTTP)
  • Test locally with LGTM docker-compose before production
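As an illustration of the semantic-conventions item, ad-hoc attribute keys such as `method` and `status` can be mapped to their conventional HTTP names before they are set on a span or metric. This is a sketch: the mapping covers only three keys, `to_semconv` is a hypothetical helper, and the target names should be checked against the semantic-conventions version your SDK ships with.

```python
# Ad-hoc key -> HTTP semantic-convention attribute name (partial, illustrative)
SEMCONV_NAMES = {
    "method": "http.request.method",
    "status": "http.response.status_code",
    "route": "http.route",
}

def to_semconv(attributes: dict) -> dict:
    """Rename known ad-hoc keys; leave unknown keys untouched."""
    return {SEMCONV_NAMES.get(key, key): value for key, value in attributes.items()}

attrs = to_semconv({"method": "GET", "status": 200, "user_id": 42})
print(attrs)
```

Consistent names matter because dashboards and alert rules written against the conventions then work across services without per-service relabeling.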

Success Metrics

  1. 100% of logs include trace_id when in request context
  2. Mean time to resolution (MTTR) decreases by >50%
  3. Developers use Grafana as first debugging tool
  4. 80%+ of telemetry from auto-instrumentation
  5. Alert noise < 5% false positives
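Two of these targets (trace_id coverage and alert false-positive rate) are simple ratios that can be computed from exported records. A minimal sketch under assumed record shapes: the `in_request`, `trace_id`, and `actionable` field names and the sample data are hypothetical, not a schema defined by this skill.

```python
def trace_coverage(log_records) -> float:
    """Fraction of in-request log records carrying a non-empty trace_id."""
    in_request = [r for r in log_records if r.get("in_request")]
    if not in_request:
        return 1.0  # nothing to correlate, so nothing is missing
    with_trace = sum(1 for r in in_request if r.get("trace_id"))
    return with_trace / len(in_request)

def false_positive_rate(alerts) -> float:
    """Fraction of fired alerts that were not actionable."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if not a["actionable"]) / len(alerts)

logs = [
    {"in_request": True, "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"in_request": True, "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"in_request": True, "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"in_request": True, "trace_id": ""},        # missing correlation
    {"in_request": False, "trace_id": ""},       # startup log, not counted
]
alerts = [{"actionable": True}] * 19 + [{"actionable": False}]

coverage = trace_coverage(logs)
fp_rate = false_positive_rate(alerts)
print(f"trace_id coverage: {coverage:.0%}, alert false positives: {fp_rate:.0%}")
```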