Monitoring Guidelines
Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.
Core Monitoring Principles
- Monitor the four golden signals: latency, traffic, errors, and saturation
- Implement monitoring as code for reproducibility
- Design monitoring around user experience and business impact
- Use SLOs (Service Level Objectives) to guide alerting decisions
- Balance comprehensive coverage with actionable insights
Key Metrics to Monitor
Application Metrics
- Request rate (requests per second)
- Error rate (percentage of failed requests)
- Response time (p50, p90, p95, p99 latencies)
- Active connections and concurrent users
- Queue depths and processing times
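As a rough illustration of how the first three application metrics relate, the sketch below derives request rate, error rate, and latency percentiles from a window of raw request samples. The `summarize` and `percentile` helpers and the `(latency_ms, succeeded)` tuple layout are assumptions made for this example; in practice a metrics library or backend would usually compute these values.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of a sorted, non-empty list (p is an integer 1-100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests, window_seconds):
    """Summarize a non-empty list of (latency_ms, succeeded) samples from one window."""
    latencies = sorted(latency for latency, _ in requests)
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "request_rate_rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * failures / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


# Hypothetical sample: 100 requests in a 10-second window, every tenth one failing.
requests = [(i, i % 10 != 0) for i in range(1, 101)]
print(summarize(requests, window_seconds=10))
```

Reporting several percentiles rather than an average keeps tail latency visible, which an average tends to hide.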
Infrastructure Metrics
- CPU utilization and load average
- Memory usage and available memory
- Disk I/O and available storage
- Network throughput and error rates
- Container and pod health (for Kubernetes)
Business Metrics
- Transaction volumes and values
- User signups and conversions
- Feature usage and adoption rates
- Revenue-impacting events
- Customer satisfaction indicators
Alerting Strategy
Alert Design Principles
- Alert on symptoms, not causes
- Make alerts actionable with clear remediation steps
- Set appropriate severity levels (critical, warning, info)
- Avoid alert fatigue through proper threshold tuning
- Include runbook links in alert notifications
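One way to make these principles hard to skip is to encode them in the alert definition itself. The sketch below is a minimal illustration, not a real alerting API: the `Alert` class, its fields, and the runbook URL are all hypothetical. It rejects alerts that lack a remediation step or a runbook link and renders a notification that carries both.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    symptom: str       # user-visible symptom, not the suspected cause
    remediation: str   # first step a responder should take
    runbook_url: str   # link included in every notification

    def __post_init__(self):
        # An alert without a next step or a runbook is not actionable.
        if not self.remediation or not self.runbook_url:
            raise ValueError("alert needs a remediation step and a runbook link")


def format_notification(alert):
    """Render the text sent to the on-call responder."""
    return (
        f"[{alert.severity.value.upper()}] {alert.name}: {alert.symptom}\n"
        f"Next step: {alert.remediation}\n"
        f"Runbook: {alert.runbook_url}"
    )


# Hypothetical alert definition.
alert = Alert(
    name="checkout-errors",
    severity=Severity.CRITICAL,
    symptom="5% of checkout requests are failing",
    remediation="Roll back the most recent deployment",
    runbook_url="https://runbooks.example.com/checkout-errors",
)
print(format_notification(alert))
```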
SLO-Based Alerting
- Define SLOs for critical user journeys
- Calculate error budgets and burn rates
- Alert when the error budget is being consumed faster than the SLO can sustain
- Use multi-window, multi-burn-rate alerts
- Review and adjust SLOs quarterly
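The error budget and burn rate arithmetic can be sketched as follows. The error budget is `1 - SLO target`, and the burn rate is the observed error ratio divided by that budget, so a burn rate of 1 spends the budget exactly over the SLO period. The window pairs and thresholds in `POLICIES` are illustrative values assumed for this example, not settings taken from this document, and the `evaluate` helper is hypothetical. Each policy fires only when both its long and its short window exceed the threshold, which is the multi-window idea.

```python
# (long window, short window, burn-rate threshold, severity): assumed example values.
POLICIES = [
    ("1h", "5m", 14.4, "critical"),
    ("6h", "30m", 6.0, "critical"),
    ("3d", "6h", 1.0, "warning"),
]


def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / error_budget(slo_target)


def evaluate(error_ratios, slo_target):
    """Return the policies that fire, given error ratios keyed by window name."""
    firing = []
    for long_window, short_window, threshold, severity in POLICIES:
        long_burn = burn_rate(error_ratios[long_window], slo_target)
        short_burn = burn_rate(error_ratios[short_window], slo_target)
        # Requiring both windows avoids paging on stale or momentary spikes.
        if long_burn >= threshold and short_burn >= threshold:
            firing.append((long_window, short_window, severity))
    return firing


# Hypothetical measurements: a sharp recent spike against a 99.9% SLO.
ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.002, "3d": 0.0005}
print(evaluate(ratios, slo_target=0.999))
```

In this sample only the fastest policy fires: the one-hour and five-minute windows both burn at roughly 20 times the sustainable rate, while the longer windows remain below their thresholds.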
Alert Configuration
- Set meaningful thresholds based on baseline data
- Use hysteresis to prevent flapping alerts
- Implement alert dependencies to reduce noise
- Route alerts to appropriate teams
- Configure escalation policies
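Hysteresis means using one threshold to start an alert and a lower one to clear it, so a value hovering near a single threshold does not toggle the alert repeatedly. The sketch below is a minimal illustration; the `HysteresisAlert` class and the 90/80 thresholds are assumptions for this example.

```python
class HysteresisAlert:
    """Fires at or above trigger_above and clears only at or below clear_below."""

    def __init__(self, trigger_above, clear_below):
        if clear_below >= trigger_above:
            raise ValueError("clear_below must be lower than trigger_above")
        self.trigger_above = trigger_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value):
        """Feed one new sample and return whether the alert is firing."""
        if not self.firing and value >= self.trigger_above:
            self.firing = True
        elif self.firing and value <= self.clear_below:
            self.firing = False
        return self.firing


# Hypothetical CPU utilization samples in percent.
alert = HysteresisAlert(trigger_above=90, clear_below=80)
print([alert.update(v) for v in [85, 91, 88, 92, 85, 79, 85]])
# [False, True, True, True, True, False, False]
```

With a single threshold at 90, the same samples would change state four times. With the gap between 90 and 80, the alert fires once and clears once.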
Dashboard Design
Effective Dashboards
- Create overview dashboards for service health
- Build detailed dashboards for debugging
- Use consistent layouts and naming conventions
- Include time range selectors and drill-down capabilities
- Display SLO status prominently
Dashboard Content
- Show current state and recent trends
- Include comparison to baseline or previous periods
- Display deployment markers for correlation
- Add annotations for significant events
- Include links to related dashboards and logs
Monitoring Tools Integration
Data Collection
- Use agents or sidecars for metric collection
- Implement service discovery for dynamic environments
- Configure appropriate scrape intervals
- Choose push-based or pull-based collection according to the use case
- Ensure metric cardinality is manageable
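To see why cardinality needs attention, note that the number of time series for one metric is bounded by the product of the distinct values of each label. The sketch below computes that upper bound, assuming every label combination can occur; the `estimated_series` helper and the label names are hypothetical.

```python
from math import prod


def estimated_series(label_values):
    """Upper bound on time series for one metric, assuming all combinations occur."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels on a request counter.
labels = {
    "method": ["GET", "POST"],
    "status_class": ["2xx", "4xx", "5xx"],
    "endpoint": [f"/api/v1/resource{i}" for i in range(20)],
}
print(estimated_series(labels))  # 120

# Adding an unbounded label such as a user ID multiplies the series count.
labels_with_user = {**labels, "user_id": range(10_000)}
print(estimated_series(labels_with_user))  # 1200000
```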
Data Storage and Retention
- Set retention periods based on use case
- Implement downsampling for long-term storage
- Use appropriate storage backends for scale
- Plan for disaster recovery of monitoring data
- Monitor your monitoring infrastructure
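Downsampling replaces many raw points with one aggregate per coarser time bucket. The sketch below is a minimal version that keeps the mean and the maximum of each bucket. The `downsample` helper and the `(unix_timestamp, value)` input layout are assumptions for this example; real storage backends typically provide their own downsampling or rollup mechanisms.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Aggregate (unix_timestamp, value) points into (bucket_start, mean, max) rows."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return [
        (start, sum(values) / len(values), max(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical one-minute samples rolled up into five-minute buckets.
raw = [(0, 1.0), (60, 3.0), (300, 10.0), (360, 20.0)]
print(downsample(raw, bucket_seconds=300))
# [(0, 2.0, 3.0), (300, 15.0, 20.0)]
```

Keeping the maximum alongside the mean preserves short spikes that averaging alone would smooth away.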
Health Checks and Probes
- Implement liveness probes for crash detection
- Use readiness probes for traffic management
- Create deep health checks that verify dependencies
- Expose health endpoints in a standard format
- Monitor health check latency as a metric
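A deep health check can be sketched as a function that runs one probe per dependency, records how long each probe took, and reports an overall status. The `run_health_checks` helper, the check names, and the shape of the returned dictionary are assumptions for this example rather than a particular standard format.

```python
import time


def run_health_checks(checks):
    """Run each zero-argument check; a check signals failure by raising an exception."""
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            entry = {"status": "pass"}
        except Exception as exc:
            entry = {"status": "fail", "error": str(exc)}
        # Per-check latency can be exported as a metric of its own.
        entry["latency_ms"] = (time.perf_counter() - started) * 1000
        results[name] = entry
    healthy = all(entry["status"] == "pass" for entry in results.values())
    return {"status": "pass" if healthy else "fail", "checks": results}


# Hypothetical dependency probes.
def check_cache():
    pass  # e.g. send a ping to the cache


def check_database():
    raise ConnectionError("database unreachable")


report = run_health_checks({"cache": check_cache, "database": check_database})
print(report["status"])  # fail
```

A deep check like this fits a readiness endpoint better than a liveness endpoint. A failing dependency should usually take the instance out of rotation, not restart it.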
Incident Response
- Use monitoring data to detect incidents early
- Correlate metrics, logs, and traces during investigation
- Document findings and update monitoring post-incident
- Track MTTR (Mean Time to Recovery) metrics
- Conduct regular monitoring reviews and improvements
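MTTR is the average time from detection to resolution across incidents. A minimal sketch of the calculation follows; the `mean_time_to_recovery` helper and the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a non-empty list of incident pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 12, 22, 15), datetime(2024, 2, 12, 23, 45)),
]
print(mean_time_to_recovery(incidents))  # 1:00:00
```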
Capacity Planning
- Track resource utilization trends
- Set alerts for approaching capacity limits
- Use forecasting for proactive scaling
- Document capacity requirements and headroom
- Review capacity quarterly
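A simple form of forecasting is to fit a straight line to recent utilization samples and estimate when it crosses the capacity limit. The sketch below uses ordinary least squares and assumes roughly linear growth, which is only a first approximation; the `days_until_limit` helper and the sample data are hypothetical.

```python
def days_until_limit(samples, limit):
    """Estimate days from the last sample until usage reaches limit.

    samples: list of (day_index, used) pairs with at least two distinct days.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    variance_x = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / variance_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (limit - intercept) / slope
    return crossing_day - samples[-1][0]


# Hypothetical disk usage in percent, growing two points per day.
usage = [(0, 50), (1, 52), (2, 54), (3, 56)]
print(days_until_limit(usage, limit=80))  # 12.0
```

An alert for approaching capacity could then be based on the forecast, for example when the estimate drops below the lead time needed to add capacity.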