Observability Guidelines

Apply these observability principles to ensure comprehensive visibility into distributed systems and microservices.

Core Observability Principles

Guide the development of idiomatic, maintainable, and high-performance code with built-in observability
Enforce modular design and separation of concerns through Clean Architecture
Promote test-driven development and robust observability from the start

OpenTelemetry Integration

Use OpenTelemetry for distributed tracing, metrics, and structured logging
Start and propagate tracing spans across all service boundaries
Use otel.Tracer for creating spans and otel.Meter for collecting metrics
Export telemetry through the OpenTelemetry Collector, or directly to backends such as Jaeger (traces) and Prometheus (metrics)
Configure appropriate sampling rates for production environments

Distributed Tracing

Trace all incoming requests and propagate context through internal calls
Use middleware to instrument HTTP and gRPC endpoints automatically
Include trace context in all downstream service calls
Create child spans for significant operations within a service
Add relevant attributes to spans for debugging and analysis
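
The propagation rules above can be sketched with the standard library alone. This is a minimal illustration, assuming the W3C `traceparent` header as the carrier. The names `traceContext`, `traceMiddleware`, and `injectTrace`, and the `inventory.internal` URL, are hypothetical. Validation is simplified, and a real service would normally rely on the OpenTelemetry SDK's propagators and instrumentation middleware instead of hand-rolled code.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
)

// traceContext holds the identifiers carried by a W3C traceparent header.
type traceContext struct {
	TraceID string // 32 hex characters, shared by every span in the trace
	SpanID  string // 16 hex characters, unique to one span
}

type traceKey struct{}

// randomHex returns n random bytes encoded as 2n hex characters.
func randomHex(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// isHex reports whether s is exactly n hex characters.
func isHex(s string, n int) bool {
	_, err := hex.DecodeString(s)
	return err == nil && len(s) == n
}

// parseTraceparent reads "00-<trace-id>-<parent-id>-<flags>" (simplified checks).
func parseTraceparent(h string) (traceContext, bool) {
	p := strings.Split(h, "-")
	if len(p) != 4 || p[0] != "00" || !isHex(p[1], 32) || !isHex(p[2], 16) {
		return traceContext{}, false
	}
	return traceContext{TraceID: p[1], SpanID: p[2]}, true
}

// childOf keeps the trace ID and assigns a fresh span ID for local work.
func childOf(parent traceContext) traceContext {
	return traceContext{TraceID: parent.TraceID, SpanID: randomHex(8)}
}

// header renders the context as a traceparent value with the sampled flag set.
func (tc traceContext) header() string {
	return fmt.Sprintf("00-%s-%s-01", tc.TraceID, tc.SpanID)
}

// traceMiddleware continues an incoming trace or starts a new one.
func traceMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		parent, ok := parseTraceparent(r.Header.Get("traceparent"))
		if !ok {
			parent = traceContext{TraceID: randomHex(16)}
		}
		ctx := context.WithValue(r.Context(), traceKey{}, childOf(parent))
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// injectTrace copies the current trace context onto a downstream request.
func injectTrace(ctx context.Context, req *http.Request) {
	if tc, ok := ctx.Value(traceKey{}).(traceContext); ok {
		req.Header.Set("traceparent", tc.header())
	}
}

func main() {
	incoming, _ := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	ctx := context.WithValue(context.Background(), traceKey{}, childOf(incoming))
	req, _ := http.NewRequest(http.MethodGet, "http://inventory.internal/items", nil)
	injectTrace(ctx, req)
	fmt.Println(req.Header.Get("traceparent")) // same trace ID, new span ID
}
```

Wrapping a handler as `traceMiddleware(mux)` covers every inbound endpoint, and calling `injectTrace` before each outbound request keeps the trace ID intact across the next service boundary.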

Metrics Collection

Monitor these key metrics across all services:

Request latency: Track the 50th, 90th, 95th, and 99th percentiles (p50, p90, p95, p99)
Throughput: Measure requests per second by endpoint
Error rate: Track 4xx and 5xx responses separately
Resource usage: Monitor CPU, memory, disk, and network utilization
Custom business metrics: Track domain-specific KPIs
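
To make the latency, throughput, and error-rate items concrete, here is a small sketch that derives them from raw observations. It assumes the nearest-rank definition of a percentile and keeps every sample in memory, which is only reasonable for illustration. Production systems usually record histograms through a metrics library instead. The type `endpointStats` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// endpointStats accumulates raw observations for one endpoint.
type endpointStats struct {
	latencies  []time.Duration
	clientErrs int // 4xx responses
	serverErrs int // 5xx responses
}

// observe records one finished request.
func (s *endpointStats) observe(status int, latency time.Duration) {
	s.latencies = append(s.latencies, latency)
	switch {
	case status >= 500 && status <= 599:
		s.serverErrs++
	case status >= 400 && status <= 499:
		s.clientErrs++
	}
}

// percentile returns the nearest-rank percentile p (1..100) of the latencies.
func (s *endpointStats) percentile(p int) time.Duration {
	n := len(s.latencies)
	if n == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), s.latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*n + 99) / 100 // integer form of ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// errorRates returns the 4xx and 5xx fractions separately.
func (s *endpointStats) errorRates() (client, server float64) {
	n := float64(len(s.latencies))
	if n == 0 {
		return 0, 0
	}
	return float64(s.clientErrs) / n, float64(s.serverErrs) / n
}

// throughput returns requests per second over the given window.
func (s *endpointStats) throughput(window time.Duration) float64 {
	if window <= 0 {
		return 0
	}
	return float64(len(s.latencies)) / window.Seconds()
}

func main() {
	s := &endpointStats{}
	for i := 1; i <= 10; i++ {
		status := 200
		if i == 9 {
			status = 404
		}
		if i == 10 {
			status = 503
		}
		s.observe(status, time.Duration(i)*time.Millisecond)
	}
	client, server := s.errorRates()
	fmt.Println(s.percentile(50), s.percentile(90), s.percentile(95), s.percentile(99)) // 5ms 9ms 10ms 10ms
	fmt.Println(client, server, s.throughput(2*time.Second))                           // 0.1 0.1 5
}
```

Keeping 4xx and 5xx counters apart matters because the first usually points at callers and the second at the service itself, so they tend to feed different alerts.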

Structured Logging

Include unique request IDs and trace context in all logs for correlation
Use structured logging formats (JSON) for machine parseability
Include relevant context: timestamp, service name, trace ID, span ID
Log at appropriate levels: DEBUG, INFO, WARN, ERROR
Avoid logging sensitive information (PII, credentials)
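
One way to satisfy these logging rules in Go is the standard `log/slog` package (Go 1.21 or later). The sketch below emits JSON, stamps each record with a service name, and masks values whose keys look sensitive. The key list, the `[REDACTED]` marker, and the helper names `newLogger` and `logLine` are assumptions made for the example, not part of the guidelines.

```go
package main

import (
	"io"
	"log/slog"
	"os"
	"strings"
)

// sensitiveKeys lists attribute names whose values must never be written.
// The list is illustrative; extend it to match your own data model.
var sensitiveKeys = map[string]bool{"password": true, "token": true, "email": true}

// newLogger returns a JSON logger that adds the service name to every record
// and replaces the value of any sensitive attribute.
func newLogger(w io.Writer, service string) *slog.Logger {
	h := slog.NewJSONHandler(w, &slog.HandlerOptions{
		Level: slog.LevelInfo, // DEBUG records are dropped
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if sensitiveKeys[strings.ToLower(a.Key)] {
				a.Value = slog.StringValue("[REDACTED]")
			}
			return a
		},
	})
	return slog.New(h).With("service", service)
}

// logLine writes one INFO record to memory and returns the JSON text,
// which is convenient for assertions in unit tests.
func logLine(service, msg string, args ...any) string {
	var sb strings.Builder
	newLogger(&sb, service).Info(msg, args...)
	return sb.String()
}

func main() {
	logger := newLogger(os.Stdout, "checkout")
	logger.Info("payment authorized",
		"request_id", "req-123",
		"trace_id", "4bf92f3577b34da6a3ce929d0e0e4736",
		"span_id", "00f067aa0ba902b7",
		"password", "hunter2", // masked by ReplaceAttr
	)
	logger.Debug("cache probe") // below the configured level, so not emitted
}
```

The JSON handler adds the timestamp, level, and message on its own, so the call site only has to supply the request ID and trace identifiers.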

Architecture Patterns

Apply Clean Architecture with handlers, services, repositories, and domain models
Use domain-driven design principles for clear boundaries
Prioritize interface-driven development with explicit dependency injection
Prefer composition over inheritance; favor small, purpose-specific interfaces
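
A compact sketch of how these layers might fit together in Go follows. All type names (`Order`, `OrderReader`, `OrderService`, `memoryRepo`, `countingReader`) are hypothetical. The point is that the service depends on a one-method interface injected through its constructor, so instrumentation can be added by composition without touching business logic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Order is the domain model; it knows nothing about storage or transport.
type Order struct {
	ID    string
	Total int // minor currency units
}

// ErrNotFound is returned by repositories when an order does not exist.
var ErrNotFound = errors.New("order not found")

// OrderReader is a small, purpose-specific interface owned by the service layer.
type OrderReader interface {
	FindOrder(ctx context.Context, id string) (Order, error)
}

// OrderService holds business logic and depends only on the interface.
type OrderService struct{ repo OrderReader }

// NewOrderService makes the dependency explicit at construction time.
func NewOrderService(repo OrderReader) *OrderService {
	return &OrderService{repo: repo}
}

// Describe returns a one-line summary of an order.
func (s *OrderService) Describe(ctx context.Context, id string) (string, error) {
	o, err := s.repo.FindOrder(ctx, id)
	if err != nil {
		return "", fmt.Errorf("describe order %q: %w", id, err)
	}
	return fmt.Sprintf("order %s totals %d", o.ID, o.Total), nil
}

// memoryRepo is one repository implementation; a SQL-backed one could
// replace it without changing OrderService.
type memoryRepo map[string]Order

func (m memoryRepo) FindOrder(_ context.Context, id string) (Order, error) {
	o, ok := m[id]
	if !ok {
		return Order{}, ErrNotFound
	}
	return o, nil
}

// countingReader decorates any OrderReader by composition. A tracing or
// metrics decorator would follow the same shape.
type countingReader struct {
	next  OrderReader
	calls int
}

func (c *countingReader) FindOrder(ctx context.Context, id string) (Order, error) {
	c.calls++
	return c.next.FindOrder(ctx, id)
}

func main() {
	repo := &countingReader{next: memoryRepo{"o-1": {ID: "o-1", Total: 4200}}}
	svc := NewOrderService(repo)

	summary, err := svc.Describe(context.Background(), "o-1")
	fmt.Println(summary, err)

	_, err = svc.Describe(context.Background(), "missing")
	fmt.Println(errors.Is(err, ErrNotFound), "repository calls:", repo.calls)
}
```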

Correlation and Context

Propagate context through the entire request lifecycle
Use correlation IDs for request tracking across services
Include service version and deployment information in telemetry
Tag traces with relevant business context for filtering
Enable trace-to-log and log-to-trace correlation
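
The following sketch shows one possible shape for carrying correlation data through a request with `context.Context`. The header name `X-Correlation-ID` is a common convention rather than a standard, and `requestInfo`, the field keys, the version string, and the `*.internal` URLs are all assumptions chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// correlationHeader is a widely used but non-standard header name (assumed here).
const correlationHeader = "X-Correlation-ID"

// requestInfo is the telemetry context that travels with one request.
type requestInfo struct {
	CorrelationID  string
	ServiceVersion string
	Deployment     string
}

type infoKey struct{}

// resolveCorrelationID keeps an incoming ID, or mints one at the first hop.
func resolveCorrelationID(incoming string, newID func() string) string {
	if incoming != "" {
		return incoming
	}
	return newID()
}

// withRequestInfo stores the telemetry context for the rest of the request.
func withRequestInfo(ctx context.Context, ri requestInfo) context.Context {
	return context.WithValue(ctx, infoKey{}, ri)
}

// requestInfoFrom retrieves the telemetry context, if one was stored.
func requestInfoFrom(ctx context.Context) (requestInfo, bool) {
	ri, ok := ctx.Value(infoKey{}).(requestInfo)
	return ri, ok
}

// fields returns the key/value pairs to attach to both log records and
// span attributes, so the two can be joined on the same identifiers.
func (ri requestInfo) fields() map[string]string {
	return map[string]string{
		"correlation_id":  ri.CorrelationID,
		"service_version": ri.ServiceVersion,
		"deployment":      ri.Deployment,
	}
}

// propagate forwards the correlation ID on an outgoing request.
func propagate(ctx context.Context, out *http.Request) {
	if ri, ok := requestInfoFrom(ctx); ok {
		out.Header.Set(correlationHeader, ri.CorrelationID)
	}
}

func main() {
	in, _ := http.NewRequest(http.MethodGet, "http://orders.internal/orders/42", nil)
	in.Header.Set(correlationHeader, "corr-7f3a")

	id := resolveCorrelationID(in.Header.Get(correlationHeader), func() string { return "generated-id" })
	ctx := withRequestInfo(context.Background(), requestInfo{
		CorrelationID:  id,
		ServiceVersion: "1.4.2",
		Deployment:     "prod-eu-1",
	})

	out, _ := http.NewRequest(http.MethodGet, "http://billing.internal/invoices", nil)
	propagate(ctx, out)
	fmt.Println(out.Header.Get(correlationHeader)) // corr-7f3a

	ri, _ := requestInfoFrom(ctx)
	fmt.Println(ri.fields())
}
```

Emitting the same `fields()` on logs and spans is what makes it possible to jump from a trace to its log lines and back.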

Alerting and Dashboards

Create dashboards for service health and business metrics
Set up alerts based on SLOs and error budgets
Use anomaly detection for proactive issue identification
Document runbooks for common alert scenarios
Review and tune alerts regularly to reduce noise
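
Alerting on SLOs and error budgets usually comes down to a little arithmetic: the error budget is 1 minus the SLO target, and the burn rate is the observed error rate divided by that budget. The sketch below assumes an availability SLO and a two-window check, a common way to cut noise. The `14.4` threshold and the request counts are illustrative values only.

```go
package main

import "fmt"

// slo describes an availability objective, e.g. Target 0.999 means 99.9% success.
type slo struct{ Target float64 }

// window holds request counts observed over some period.
type window struct{ Failed, Total int }

// errorBudget is the fraction of requests that may fail without breaching the SLO.
func (s slo) errorBudget() float64 { return 1 - s.Target }

// burnRate is the observed error rate divided by the error budget.
// A value of 1 means the budget is being spent exactly on schedule.
func (s slo) burnRate(w window) float64 {
	budget := s.errorBudget()
	if w.Total == 0 || budget <= 0 {
		return 0
	}
	return float64(w.Failed) / float64(w.Total) / budget
}

// shouldPage fires only when both a long and a short window burn faster than
// the threshold, so a brief spike or an already-recovered incident stays quiet.
func (s slo) shouldPage(long, short window, threshold float64) bool {
	return s.burnRate(long) >= threshold && s.burnRate(short) >= threshold
}

func main() {
	availability := slo{Target: 0.999}
	lastHour := window{Failed: 200, Total: 10000}  // 2% errors
	lastFiveMin := window{Failed: 30, Total: 1000} // 3% errors

	fmt.Printf("burn rate (1h): %.1f\n", availability.burnRate(lastHour))
	fmt.Println("page:", availability.shouldPage(lastHour, lastFiveMin, 14.4))
}
```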

Instrumentation Best Practices

Instrument at service boundaries (entry/exit points)
Add custom spans for database operations and external calls
Include relevant attributes (user ID, request type, etc.)
Avoid over-instrumentation that creates noise
Use semantic conventions for consistent attribute naming
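
As a rough illustration of wrapping database and external calls in custom spans, the sketch below uses a tiny in-memory recorder in place of a real tracer. The `span` and `recorder` types and the `errUpstream` error are hypothetical. The attribute keys are modeled on OpenTelemetry semantic conventions, and the exact names should be checked against the convention version your project pins.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUpstream is a hypothetical failure used to show error recording.
var errUpstream = errors.New("upstream timeout")

// span is a minimal stand-in for a tracing span.
type span struct {
	Name     string
	Attrs    map[string]string
	Duration time.Duration
	Err      error
}

// recorder collects finished spans; a real setup would export them instead.
type recorder struct{ spans []span }

// trace runs fn and records its name, attributes, duration, and error.
func (r *recorder) trace(name string, attrs map[string]string, fn func() error) error {
	start := time.Now()
	err := fn()
	r.spans = append(r.spans, span{
		Name:     name,
		Attrs:    attrs,
		Duration: time.Since(start),
		Err:      err,
	})
	return err
}

func main() {
	rec := &recorder{}

	// Database operation: one span per query, not one per row.
	_ = rec.trace("SELECT orders", map[string]string{
		"db.system.name":    "postgresql",
		"db.operation.name": "SELECT",
	}, func() error { return nil })

	// External call: the error is kept on the span for later analysis.
	_ = rec.trace("GET /charges", map[string]string{
		"http.request.method": "GET",
		"server.address":      "payments.internal",
	}, func() error { return errUpstream })

	for _, s := range rec.spans {
		fmt.Printf("%s attrs=%v took=%s err=%v\n", s.Name, s.Attrs, s.Duration, s.Err)
	}
}
```

Limiting spans to boundaries such as queries and outbound requests, rather than every helper function, is one practical way to keep instrumentation from turning into noise.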

Production Considerations

Configure appropriate sampling rates to balance visibility and cost
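Use head-based sampling, where the keep-or-drop decision is made when a trace starts and is propagated downstream, so that sampled traces are captured in full
Add tail-based sampling, where the decision is made after a trace completes, to retain traces that contain errors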
Set retention policies based on debugging needs
Monitor observability infrastructure health
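
The two sampling styles can be contrasted in a few lines. This is a simplified sketch, not the OpenTelemetry SDK's sampler: `headSample` derives a deterministic decision from the trace ID so every service reaches the same answer, and `tailKeep` decides after the fact from a finished trace's summary. The function names, the `traceSummary` type, the extra slow-trace rule, and the thresholds are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// headSample makes a deterministic keep/drop decision from the trace ID alone.
// ratio is the fraction of traces to keep, between 0 and 1.
func headSample(traceID string, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 || len(traceID) != 32 {
		return false
	}
	// Use the low 64 bits of the 128-bit ID as a uniformly distributed number.
	low, err := strconv.ParseUint(traceID[16:], 16, 64)
	if err != nil {
		return false
	}
	// Compare the top 63 bits against ratio * 2^63.
	return low>>1 < uint64(ratio*float64(1<<63))
}

// traceSummary is what a tail sampler knows once a trace has completed.
type traceSummary struct {
	HasError bool
	Duration time.Duration
}

// tailKeep retains traces that failed or were slower than the given bound.
func tailKeep(t traceSummary, slow time.Duration) bool {
	return t.HasError || t.Duration >= slow
}

func main() {
	id := "4bf92f3577b34da6a3ce929d0e0e4736"
	// The same ID always yields the same decision, on any service.
	fmt.Println(headSample(id, 0.25), headSample(id, 0.25))

	failed := traceSummary{HasError: true, Duration: 20 * time.Millisecond}
	healthy := traceSummary{Duration: 20 * time.Millisecond}
	fmt.Println(tailKeep(failed, time.Second), tailKeep(healthy, time.Second))
}
```

Head sampling is cheap because the decision needs no buffering, while tail sampling has to hold spans until the trace finishes. That extra cost is the trade-off to weigh against the visibility it buys.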
