observability-guidelines
Observability Guidelines
Apply these observability principles to ensure comprehensive visibility into distributed systems and microservices.
Core Observability Principles
- Guide the development of idiomatic, maintainable, and high-performance code with built-in observability
- Enforce modular design and separation of concerns through Clean Architecture
- Promote test-driven development and robust observability from the start
OpenTelemetry Integration
- Use OpenTelemetry for distributed tracing, metrics, and structured logging
- Start and propagate tracing spans across all service boundaries
- Use otel.Tracer for creating spans and otel.Meter for collecting metrics
- Export telemetry to the OpenTelemetry Collector, or directly to a backend such as Jaeger (traces) or Prometheus (metrics)
- Configure appropriate sampling rates for production environments
Distributed Tracing
- Trace all incoming requests and propagate context through internal calls
- Use middleware to instrument HTTP and gRPC endpoints automatically
- Include trace context in all downstream service calls
- Create child spans for significant operations within a service
- Add relevant attributes to spans for debugging and analysis
Metrics Collection
Monitor these key metrics across all services:
- Request latency: Track p50, p90, p95, and p99 percentiles
- Throughput: Measure requests per second by endpoint
- Error rate: Track 4xx and 5xx responses separately
- Resource usage: Monitor CPU, memory, disk, and network utilization
- Custom business metrics: Track domain-specific KPIs
Structured Logging
- Include unique request IDs and trace context in all logs for correlation
- Use structured logging formats (JSON) for machine parseability
- Include relevant context: timestamp, service name, trace ID, span ID
- Log at appropriate levels: DEBUG, INFO, WARN, ERROR
- Avoid logging sensitive information (PII, credentials)
Architecture Patterns
- Apply Clean Architecture with handlers, services, repositories, and domain models
- Use domain-driven design principles for clear boundaries
- Prioritize interface-driven development with explicit dependency injection
- Prefer composition over inheritance; favor small, purpose-specific interfaces
Correlation and Context
- Propagate context through the entire request lifecycle
- Use correlation IDs for request tracking across services
- Include service version and deployment information in telemetry
- Tag traces with relevant business context for filtering
- Enable trace-to-log and log-to-trace correlation
Alerting and Dashboards
- Create dashboards for service health and business metrics
- Set up alerts based on SLOs and error budgets
- Use anomaly detection for proactive issue identification
- Document runbooks for common alert scenarios
- Review and tune alerts regularly to reduce noise
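Alerting on SLOs and error budgets usually reduces to simple arithmetic. With an availability target of 99.9%, the error budget is 0.1% of requests. The burn rate is the observed error rate divided by that budget, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below assumes hypothetical function names, and the paging threshold of 10 is an illustrative value to tune per service.

```go
package main

import "fmt"

// BurnRate reports how fast the error budget is being consumed.
// sloTarget must be strictly between 0 and 1 (for example 0.999).
// It returns 0 when there was no traffic.
func BurnRate(failed, total int, sloTarget float64) float64 {
	if total == 0 {
		return 0
	}
	errorRate := float64(failed) / float64(total)
	errorBudget := 1 - sloTarget
	return errorRate / errorBudget
}

// ShouldPage reports whether the burn rate has reached the paging threshold.
func ShouldPage(burnRate, threshold float64) bool {
	return burnRate >= threshold
}

func main() {
	// 72 failures out of 5000 requests is a 1.44% error rate, roughly
	// 14 times the 0.1% budget allowed by a 99.9% target.
	rate := BurnRate(72, 5000, 0.999)
	fmt.Printf("burn rate %.1f, page=%v\n", rate, ShouldPage(rate, 10))
}
```

Alerting on burn rate rather than raw error counts helps reduce noise, because a brief spike that barely dents the budget does not page anyone.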
Instrumentation Best Practices
- Instrument at service boundaries (entry/exit points)
- Add custom spans for database operations and external calls
- Include relevant attributes (user ID, request type, etc.)
- Avoid over-instrumentation that creates noise
- Use semantic conventions for consistent attribute naming
Production Considerations
- Configure appropriate sampling rates to balance visibility and cost
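- Use head-based sampling, which decides when a trace starts, so every span in a trace shares the same keep-or-drop decision
- Implement tail-based sampling, which decides after a trace completes, to retain traces that contain errors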
- Set retention policies based on debugging needs
- Monitor observability infrastructure health