observability-guidelines

Observability Guidelines

Apply these observability principles to ensure comprehensive visibility into distributed systems and microservices.

Core Observability Principles

  • Guide the development of idiomatic, maintainable, and high-performance code with built-in observability
  • Enforce modular design and separation of concerns through Clean Architecture
  • Promote test-driven development and robust observability from the start

OpenTelemetry Integration

  • Use OpenTelemetry for distributed tracing, metrics, and structured logging
  • Start and propagate tracing spans across all service boundaries
  • Use otel.Tracer for creating spans and otel.Meter for collecting metrics
  • Export data to the OpenTelemetry Collector, or directly to backends such as Jaeger (traces) or Prometheus (metrics)
  • Configure appropriate sampling rates for production environments
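
To make "propagate spans across service boundaries" concrete, the sketch below parses and renders the W3C `traceparent` header, the wire format used by OpenTelemetry's TraceContext propagator. It is a stdlib-only illustration of the header layout, not the OpenTelemetry SDK's own code. The type and function names (`TraceParent`, `ParseTraceParent`) and the child span ID are assumptions made for this example.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a version-00 W3C traceparent header.
type TraceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the trace-flags byte
}

// ParseTraceParent validates and decodes a header such as
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
func ParseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("unsupported traceparent format")
	}
	if !isHex(parts[1], 32) || !isHex(parts[2], 16) || !isHex(parts[3], 2) {
		return TraceParent{}, errors.New("malformed traceparent field")
	}
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("all-zero trace or span ID")
	}
	flags, _ := hex.DecodeString(parts[3]) // cannot fail: validated above
	return TraceParent{TraceID: parts[1], SpanID: parts[2], Sampled: flags[0]&1 == 1}, nil
}

// isHex reports whether s is exactly n lowercase hex characters.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// String renders the header value to send on a downstream call.
func (tp TraceParent) String() string {
	flags := "00"
	if tp.Sampled {
		flags = "01"
	}
	return "00-" + tp.TraceID + "-" + tp.SpanID + "-" + flags
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	// A child span keeps the trace ID and substitutes its own span ID (hypothetical value).
	tp.SpanID = "b7ad6b7169203331"
	fmt.Println(tp.String())
}
```

In a real service the OpenTelemetry propagator would handle this header; the sketch only shows what travels between services and why the trace ID must stay unchanged while the span ID changes per hop.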

Distributed Tracing

  • Trace all incoming requests and propagate context through internal calls
  • Use middleware to instrument HTTP and gRPC endpoints automatically
  • Include trace context in all downstream service calls
  • Create child spans for significant operations within a service
  • Add relevant attributes to spans for debugging and analysis
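
The middleware guideline above can be sketched with nothing but `net/http`. The wrapper below records method, path, status, and duration for every request without touching handler code. `RequestRecord`, `Instrument`, and `probe` are names invented for this sketch; in an OpenTelemetry setup an instrumentation library would normally take this role and emit spans instead of calling a plain callback.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// RequestRecord is what the middleware captures for each request.
type RequestRecord struct {
	Method   string
	Path     string
	Status   int
	Duration time.Duration
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Instrument wraps next and reports one RequestRecord per request to sink.
func Instrument(next http.Handler, sink func(RequestRecord)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Default to 200 because handlers may write a body without calling WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		sink(RequestRecord{
			Method:   r.Method,
			Path:     r.URL.Path,
			Status:   rec.status,
			Duration: time.Since(start),
		})
	})
}

// probe sends one in-memory GET request through an instrumented handler
// that answers with the given status, and returns what was recorded.
func probe(path string, status int) RequestRecord {
	var got RequestRecord
	h := Instrument(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}), func(r RequestRecord) { got = r })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return got
}

func main() {
	r := probe("/orders", http.StatusNotFound)
	fmt.Println(r.Method, r.Path, r.Status)
}
```

Because the wrapper is an ordinary `http.Handler`, it can be applied once at the router and covers every endpoint behind it.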

Metrics Collection

Monitor these key metrics across all services:
  • Request latency: Track p50, p90, p95, and p99 percentiles
  • Throughput: Measure requests per second by endpoint
  • Error rate: Track 4xx and 5xx responses separately
  • Resource usage: Monitor CPU, memory, disk, and network utilization
  • Custom business metrics: Track domain-specific KPIs
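
As an illustration of what the latency percentiles mean, here is a small nearest-rank calculation over raw samples. It is a sketch for understanding the numbers only: metrics backends usually estimate percentiles from histogram buckets rather than sorting every sample. The function name `Percentile` and the sample data are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// latency samples given in milliseconds. It returns 0 for an empty input.
func Percentile(samplesMs []float64, p float64) float64 {
	n := len(samplesMs)
	if n == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest value with at least p% of samples at or below it.
	rank := int(math.Ceil(p * float64(n) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > n {
		rank = n
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical latencies: 1 ms, 2 ms, ..., 100 ms.
	samples := make([]float64, 0, 100)
	for i := 1; i <= 100; i++ {
		samples = append(samples, float64(i))
	}
	for _, p := range []float64{50, 90, 95, 99} {
		fmt.Printf("p%.0f = %.0f ms\n", p, Percentile(samples, p))
	}
}
```

Tracking several percentiles matters because the mean hides tail behavior: a service can have a healthy p50 while p99 requests are many times slower.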

Structured Logging

  • Include unique request IDs and trace context in all logs for correlation
  • Use structured logging formats (JSON) for machine parseability
  • Include relevant context: timestamp, service name, trace ID, span ID
  • Log at appropriate levels: DEBUG, INFO, WARN, ERROR
  • Avoid logging sensitive information (PII, credentials)
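
One possible shape for these logging rules, using the standard `log/slog` package (Go 1.21 or later): JSON output, a service name on every record, trace and span IDs as fields, a minimum level, and redaction of sensitive keys. The field names (`trace_id`, `span_id`), the list of sensitive keys, and the helper `Emit` are assumptions for this sketch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// sensitiveKeys lists attribute keys whose values must never reach the log.
// The key names here are illustrative assumptions.
var sensitiveKeys = map[string]bool{"password": true, "email": true}

// NewLogger returns a JSON logger that drops DEBUG records, redacts
// sensitive attributes, and stamps every record with the service name.
func NewLogger(w io.Writer, service string) *slog.Logger {
	opts := &slog.HandlerOptions{
		Level: slog.LevelInfo,
		ReplaceAttr: func(_ []string, a slog.Attr) slog.Attr {
			if sensitiveKeys[a.Key] {
				return slog.String(a.Key, "[REDACTED]")
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(w, opts)).With("service", service)
}

// Emit writes one INFO record (plus a DEBUG record that is filtered out)
// and decodes the resulting JSON line so it can be inspected.
func Emit(service, traceID, spanID, email string) map[string]any {
	var buf bytes.Buffer
	logger := NewLogger(&buf, service)
	logger.Debug("dropped below the configured level")
	logger.Info("order placed", "trace_id", traceID, "span_id", spanID, "email", email)

	rec := map[string]any{}
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		panic(err)
	}
	return rec
}

func main() {
	rec := Emit("checkout", "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "a@example.com")
	fmt.Println(rec["level"], rec["service"], rec["trace_id"], rec["email"])
}
```

Redacting in the handler rather than at each call site means a forgotten call site cannot leak the value.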

Architecture Patterns

  • Apply Clean Architecture with handlers, services, repositories, and domain models
  • Use domain-driven design principles for clear boundaries
  • Prioritize interface-driven development with explicit dependency injection
  • Prefer composition over inheritance; favor small, purpose-specific interfaces
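
A minimal sketch of how these patterns help observability: when the service depends on a small repository interface and receives it through its constructor, instrumentation can be added by composition, as a decorator, without changing business logic. All names (`Order`, `OrderRepository`, `countingRepo`, `OrderService`) are hypothetical, and the counters are not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// Order is a minimal domain model.
type Order struct {
	ID    string
	Total int // in cents
}

// OrderRepository is a small, purpose-specific interface.
type OrderRepository interface {
	Find(id string) (Order, error)
}

// ErrNotFound is returned when no order has the requested ID.
var ErrNotFound = errors.New("order not found")

// memoryRepo is an in-memory repository used for the example.
type memoryRepo map[string]Order

func (m memoryRepo) Find(id string) (Order, error) {
	o, ok := m[id]
	if !ok {
		return Order{}, ErrNotFound
	}
	return o, nil
}

// countingRepo decorates any OrderRepository with call and error counters.
// A real decorator would start a span or record a metric here instead.
type countingRepo struct {
	next   OrderRepository
	Calls  int
	Errors int
}

func (c *countingRepo) Find(id string) (Order, error) {
	c.Calls++
	o, err := c.next.Find(id)
	if err != nil {
		c.Errors++
	}
	return o, err
}

// OrderService holds business logic and receives its repository explicitly.
type OrderService struct{ repo OrderRepository }

func NewOrderService(repo OrderRepository) *OrderService {
	return &OrderService{repo: repo}
}

// TotalFor returns the order total, wrapping repository errors with context.
func (s *OrderService) TotalFor(id string) (int, error) {
	o, err := s.repo.Find(id)
	if err != nil {
		return 0, fmt.Errorf("total for %q: %w", id, err)
	}
	return o.Total, nil
}

func main() {
	repo := &countingRepo{next: memoryRepo{"o-1": {ID: "o-1", Total: 1250}}}
	svc := NewOrderService(repo)
	fmt.Println(svc.TotalFor("o-1"))
	fmt.Println(svc.TotalFor("missing"))
	fmt.Println("calls:", repo.Calls, "errors:", repo.Errors)
}
```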

Correlation and Context

  • Propagate context through the entire request lifecycle
  • Use correlation IDs for request tracking across services
  • Include service version and deployment information in telemetry
  • Tag traces with relevant business context for filtering
  • Enable trace-to-log and log-to-trace correlation
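
The following sketch shows one way to carry a correlation ID through a request lifecycle with `context.Context`: reuse the caller's ID if present, otherwise create one, and copy it onto every outgoing request. The header name `X-Correlation-ID`, the URLs, and all function names are assumptions for this example, not a standard.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// correlationHeader is an assumed header name; use whatever your services agree on.
const correlationHeader = "X-Correlation-ID"

// ctxKey is unexported so other packages cannot collide with this context key.
type ctxKey struct{}

// WithCorrelationID stores the ID in the context.
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// CorrelationID returns the stored ID, or "" if none is present.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// FromRequest reuses the caller's ID when present, otherwise creates one.
func FromRequest(r *http.Request) context.Context {
	id := r.Header.Get(correlationHeader)
	if id == "" {
		id = newID()
	}
	return WithCorrelationID(r.Context(), id)
}

// Outgoing builds a downstream request that carries the ID from ctx.
func Outgoing(ctx context.Context, method, url string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return nil, err
	}
	if id := CorrelationID(ctx); id != "" {
		req.Header.Set(correlationHeader, id)
	}
	return req, nil
}

// newID returns 16 random bytes as 32 hex characters.
func newID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// Relay simulates one hop: an inbound request arrives (with or without an ID)
// and the ID that would be sent downstream is returned. No network I/O occurs.
func Relay(incomingID string) (string, error) {
	in, err := http.NewRequest(http.MethodGet, "http://orders.internal/checkout", nil)
	if err != nil {
		return "", err
	}
	if incomingID != "" {
		in.Header.Set(correlationHeader, incomingID)
	}
	out, err := Outgoing(FromRequest(in), http.MethodGet, "http://payments.internal/charge")
	if err != nil {
		return "", err
	}
	return out.Header.Get(correlationHeader), nil
}

func main() {
	id, err := Relay("req-12345")
	fmt.Println(id, err)
}
```

Logging `CorrelationID(ctx)` alongside the trace ID in every record is what makes log-to-trace lookups possible later.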

Alerting and Dashboards

  • Create dashboards for service health and business metrics
  • Set up alerts based on SLOs and error budgets
  • Use anomaly detection for proactive issue identification
  • Document runbooks for common alert scenarios
  • Review and tune alerts regularly to reduce noise
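
To illustrate alerting on error budgets rather than raw error counts, the sketch below computes a burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the SLO period. Requiring both a long and a short window to exceed a threshold is one common way to cut noise. The function names and the threshold of 14.4 in `main` are illustrative assumptions, and the threshold should be tuned per service.

```go
package main

import "fmt"

// BurnRate returns observed error rate / allowed error rate for one window.
// slo is the target success ratio, for example 0.999.
// It returns 0 when there is no traffic or the SLO leaves no budget.
func BurnRate(slo float64, total, failed int) float64 {
	if total <= 0 || slo >= 1 {
		return 0
	}
	allowed := 1 - slo
	observed := float64(failed) / float64(total)
	return observed / allowed
}

// ShouldAlert fires only when both windows burn at or above the threshold.
// The long window shows the problem is significant; the short window shows
// it is still happening, so the alert clears quickly after recovery.
func ShouldAlert(longWindow, shortWindow, threshold float64) bool {
	return longWindow >= threshold && shortWindow >= threshold
}

func main() {
	// Hypothetical counts for a service with a 99.9% availability SLO.
	long := BurnRate(0.999, 120000, 2400) // e.g. the last hour
	short := BurnRate(0.999, 10000, 210)  // e.g. the last five minutes
	fmt.Printf("long=%.1f short=%.1f alert=%v\n", long, short, ShouldAlert(long, short, 14.4))
}
```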

Instrumentation Best Practices

  • Instrument at service boundaries (entry/exit points)
  • Add custom spans for database operations and external calls
  • Include relevant attributes (user ID, request type, etc.)
  • Avoid over-instrumentation that creates noise
  • Use semantic conventions for consistent attribute naming

Production Considerations

  • Configure appropriate sampling rates to balance visibility and cost
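  • Use head-based sampling, where the keep/drop decision is made when a trace starts, so that every span of a trace is kept or dropped together
  • Implement tail-based sampling, where the decision is made after a trace completes, to retain traces that contain errors or high latency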
  • Set retention policies based on debugging needs
  • Monitor observability infrastructure health
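
The two sampling strategies can be sketched as follows. `HeadSample` derives its decision from the trace ID alone, so independent services reach the same answer for the same trace; the approach is loosely modeled on ratio-based trace ID sampling and is not the OpenTelemetry SDK's code. `TailKeep` looks at a finished trace and keeps it if any span failed or was slow. The type `SpanSummary`, both function names, and the thresholds are assumptions.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// HeadSample decides at trace start whether to keep a trace.
// traceID is 32 hex characters; ratio is the fraction to keep, in [0, 1].
// Invalid trace IDs are never sampled.
func HeadSample(traceID string, ratio float64) bool {
	raw, err := hex.DecodeString(traceID)
	if err != nil || len(raw) != 16 {
		return false
	}
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and keep the trace when it
	// falls below ratio * 2^63. The same ID always gives the same decision.
	x := binary.BigEndian.Uint64(raw[8:]) >> 1
	return x < uint64(ratio*float64(1<<63))
}

// SpanSummary is what a tail sampler sees once a trace has finished.
type SpanSummary struct {
	DurationMs int64
	Failed     bool
}

// TailKeep keeps a completed trace if any span failed or took at least slowMs.
func TailKeep(spans []SpanSummary, slowMs int64) bool {
	for _, s := range spans {
		if s.Failed || s.DurationMs >= slowMs {
			return true
		}
	}
	return false
}

func main() {
	id := "4bf92f3577b34da6a3ce929d0e0e4736"
	fmt.Println("head, keep 75%:", HeadSample(id, 0.75))

	trace := []SpanSummary{{DurationMs: 12}, {DurationMs: 900}}
	fmt.Println("tail, slow >= 500 ms:", TailKeep(trace, 500))
}
```

Head sampling is cheap because dropped traces are never recorded, while tail sampling must buffer whole traces before deciding; combining a low head ratio with tail rules for errors is one way to balance cost against visibility.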