
Monitoring Guidelines

Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.

Core Monitoring Principles

  • Monitor the four golden signals: latency, traffic, errors, and saturation
  • Implement monitoring as code for reproducibility
  • Design monitoring around user experience and business impact
  • Use SLOs (Service Level Objectives) to guide alerting decisions
  • Balance comprehensive coverage with actionable insights
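The four golden signals can be computed from basic request records. A minimal sketch, assuming a hypothetical `Request` record whose `latency_ms` and `ok` fields are illustrative, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # illustrative field: observed request latency
    ok: bool           # illustrative field: whether the request succeeded

def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize the four golden signals over one observation window."""
    n = len(requests)
    lat = sorted(r.latency_ms for r in requests)
    return {
        "latency_p50_ms": lat[n // 2] if n else 0.0,
        "traffic_rps": n / window_s,
        "error_rate": sum(1 for r in requests if not r.ok) / n if n else 0.0,
        "saturation": in_flight / capacity,  # fraction of capacity in use
    }
```

In practice these values come from a metrics pipeline rather than in-memory records, but the definitions are the same.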

Key Metrics to Monitor

Application Metrics

  • Request rate (requests per second)
  • Error rate (percentage of failed requests)
  • Response time (p50, p90, p95, p99 latencies)
  • Active connections and concurrent users
  • Queue depths and processing times
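Percentile latencies are straightforward to compute with the nearest-rank method. This sketch assumes raw latency samples fit in memory; production systems typically use histograms or quantile sketches instead:

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    k = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[k - 1]

def latency_summary(latencies_ms):
    """Report the p50/p90/p95/p99 latencies named above."""
    xs = sorted(latencies_ms)
    return {f"p{p}": percentile(xs, p) for p in (50, 90, 95, 99)}
```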

Infrastructure Metrics

  • CPU utilization and load average
  • Memory usage and available memory
  • Disk I/O and available storage
  • Network throughput and error rates
  • Container and pod health (for Kubernetes)
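A few of these infrastructure metrics can be read with the Python standard library alone. A minimal, Unix-only sketch (memory and network stats need platform-specific sources such as /proc on Linux and are omitted here):

```python
import os
import shutil

def infra_snapshot(path="/"):
    """Collect a minimal infrastructure snapshot using only the standard library.

    os.getloadavg() is Unix-only; a full agent would add memory, network,
    and container stats from platform-specific sources.
    """
    load1, _load5, _load15 = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "load_per_cpu": load1 / (os.cpu_count() or 1),  # > 1.0 suggests CPU saturation
        "disk_used_frac": disk.used / disk.total,
    }
```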

Business Metrics

  • Transaction volumes and values
  • User signups and conversions
  • Feature usage and adoption rates
  • Revenue-impacting events
  • Customer satisfaction indicators

Alerting Strategy

Alert Design Principles

  • Alert on symptoms, not causes
  • Make alerts actionable with clear remediation steps
  • Set appropriate severity levels (critical, warning, info)
  • Avoid alert fatigue through proper threshold tuning
  • Include runbook links in alert notifications
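An actionable alert can be modeled as a small record that forces a severity level and a runbook link to be present. The field names here are illustrative assumptions, not a standard alerting schema:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """An actionable alert: symptom-focused summary, severity, and a runbook link."""
    summary: str       # the user-visible symptom, not the suspected cause
    severity: str      # one of: critical, warning, info
    runbook_url: str   # remediation steps live in the runbook, not the alert text
    labels: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything outside the agreed severity levels up front.
        if self.severity not in ("critical", "warning", "info"):
            raise ValueError(f"unknown severity: {self.severity}")
```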

SLO-Based Alerting

  • Define SLOs for critical user journeys
  • Calculate error budgets and burn rates
  • Alert when error budget consumption is high
  • Use multi-window, multi-burn-rate alerts
  • Review and adjust SLOs quarterly
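Error budgets and burn rates reduce to simple arithmetic. A sketch using the commonly cited fast-burn threshold of 14.4, the rate that consumes 2% of a 30-day budget in one hour (0.02 × 720 hours):

```python
def burn_rate(error_rate, slo_target):
    """Burn rate: observed error rate divided by the error budget the SLO allows.
    A burn rate of 1.0 exhausts the budget exactly at the end of the SLO window."""
    budget = 1.0 - slo_target
    return error_rate / budget

def multiwindow_page(short_rate, long_rate, slo_target, threshold=14.4):
    """Multi-window alert: page only when both the short and long windows are
    burning fast, which filters out brief spikes."""
    return (burn_rate(short_rate, slo_target) >= threshold
            and burn_rate(long_rate, slo_target) >= threshold)
```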

Alert Configuration

  • Set meaningful thresholds based on baseline data
  • Use hysteresis to prevent flapping alerts
  • Implement alert dependencies to reduce noise
  • Route alerts to appropriate teams
  • Configure escalation policies
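Hysteresis means firing above one threshold but clearing only below a lower one, so a metric oscillating around a single threshold cannot flap. A minimal sketch with hypothetical thresholds:

```python
class HysteresisAlert:
    """Fire above `high`, clear only below `low` (low < high)."""

    def __init__(self, high, low):
        assert low < high
        self.high, self.low = high, low
        self.firing = False

    def observe(self, value):
        """Feed one metric sample; returns the current firing state."""
        if not self.firing and value > self.high:
            self.firing = True
        elif self.firing and value < self.low:
            self.firing = False
        return self.firing
```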

Dashboard Design

Effective Dashboards

  • Create overview dashboards for service health
  • Build detailed dashboards for debugging
  • Use consistent layouts and naming conventions
  • Include time range selectors and drill-down capabilities
  • Display SLO status prominently

Dashboard Content

  • Show current state and recent trends
  • Include comparison to baseline or previous periods
  • Display deployment markers for correlation
  • Add annotations for significant events
  • Include links to related dashboards and logs

Monitoring Tools Integration

Data Collection

  • Use agents or sidecars for metric collection
  • Implement service discovery for dynamic environments
  • Configure appropriate scrape intervals
  • Choose push-based or pull-based collection to fit the use case
  • Ensure metric cardinality is manageable
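Cardinality is the number of unique (metric name, label set) combinations, since each one becomes its own time series. A small sketch for estimating it from collected samples; normalizing label key order is the important detail:

```python
def series_count(metrics):
    """Estimate time-series cardinality: one series per unique (name, labelset).

    `metrics` is a list of (name, labels_dict) pairs; sorting the label items
    ensures {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same series.
    """
    return len({(name, tuple(sorted(labels.items()))) for name, labels in metrics})
```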

Data Storage and Retention

  • Set retention periods based on use case
  • Implement downsampling for long-term storage
  • Use appropriate storage backends for scale
  • Plan for disaster recovery of monitoring data
  • Monitor your monitoring infrastructure
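Downsampling for long-term storage can keep a per-bucket average plus min/max so spikes survive aggregation. A sketch over (timestamp, value) points; the bucket size is a storage-policy assumption:

```python
def downsample(points, bucket_s):
    """Downsample (timestamp, value) points into fixed-width buckets, keeping the
    per-bucket average plus min/max so spikes are not averaged away."""
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % bucket_s, []).append(v)
    return [
        (start, sum(vs) / len(vs), min(vs), max(vs))
        for start, vs in sorted(buckets.items())
    ]
```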

Health Checks and Probes

  • Implement liveness probes for crash detection
  • Use readiness probes for traffic management
  • Create deep health checks that verify dependencies
  • Expose health endpoints in a standard format
  • Monitor health check latency as a metric
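A deep health check runs each dependency check, records its latency as a metric in its own right, and reports a machine-readable result. The payload shape and check names here are illustrative, not a standard format:

```python
import json
import time

def deep_health(dependency_checks):
    """Run deep health checks that verify dependencies.

    `dependency_checks` is a list of (name, fn) pairs where fn raises on
    failure; each check's latency is recorded alongside its status.
    """
    results, healthy = {}, True
    for name, check in dependency_checks:
        start = time.monotonic()
        try:
            check()
            status = "ok"
        except Exception as exc:
            status, healthy = f"failed: {exc}", False
        results[name] = {
            "status": status,
            "latency_ms": (time.monotonic() - start) * 1000,
        }
    return json.dumps({"healthy": healthy, "checks": results})
```

Served from a standard `/health` endpoint, the overall flag drives the probe while the per-check latencies feed the metrics pipeline.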

Incident Response

  • Use monitoring data to detect incidents early
  • Correlate metrics, logs, and traces during investigation
  • Document findings and update monitoring post-incident
  • Track MTTR (Mean Time to Recovery) metrics
  • Conduct regular monitoring reviews and improvements
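MTTR is just the mean of detection-to-recovery durations. A sketch assuming each incident is recorded as a (detected_at, recovered_at) pair:

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean Time to Recovery over (detected_at, recovered_at) datetime pairs."""
    durations = [recovered - detected for detected, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)
```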

Capacity Planning

  • Track resource utilization trends
  • Set alerts for approaching capacity limits
  • Use forecasting for proactive scaling
  • Document capacity requirements and headroom
  • Review capacity quarterly
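Forecasting for proactive scaling can start as a least-squares line through recent utilization samples. A sketch that estimates the days remaining until a hypothetical capacity limit is crossed:

```python
def days_until_limit(samples, limit):
    """Fit a least-squares line to (day, utilization) samples and return the
    days from the last sample until the trend crosses `limit`, or None if
    utilization is flat or falling."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no growth trend: the limit is never reached
    return (limit - intercept) / slope - samples[-1][0]
```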