
Monitoring Guidelines

Apply these monitoring principles to ensure system reliability, performance visibility, and proactive issue detection.

Core Monitoring Principles

  • Monitor the four golden signals: latency, traffic, errors, and saturation
  • Implement monitoring as code for reproducibility
  • Design monitoring around user experience and business impact
  • Use SLOs (Service Level Objectives) to guide alerting decisions
  • Balance comprehensive coverage with actionable insights
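The four golden signals can be computed from basic request records. A minimal sketch, assuming a hypothetical `Request` record whose `latency_ms` and `ok` fields are illustrative, not from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float  # illustrative field: observed request latency
    ok: bool           # illustrative field: whether the request succeeded

def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize the four golden signals over one observation window."""
    n = len(requests)
    lat = sorted(r.latency_ms for r in requests)
    return {
        "latency_p50_ms": lat[n // 2] if n else 0.0,
        "traffic_rps": n / window_s,
        "error_rate": sum(1 for r in requests if not r.ok) / n if n else 0.0,
        "saturation": in_flight / capacity,  # fraction of capacity in use
    }
```

In practice these values come from a metrics pipeline rather than in-memory records, but the definitions are the same.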

Key Metrics to Monitor

Application Metrics

  • Request rate (requests per second)
  • Error rate (percentage of failed requests)
  • Response time (p50, p90, p95, p99 latencies)
  • Active connections and concurrent users
  • Queue depths and processing times
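Percentile latencies are straightforward to compute with the nearest-rank method. This sketch assumes raw latency samples fit in memory; production systems typically use histograms or quantile sketches instead:

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    k = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[k - 1]

def latency_summary(latencies_ms):
    """Report the p50/p90/p95/p99 latencies named above."""
    xs = sorted(latencies_ms)
    return {f"p{p}": percentile(xs, p) for p in (50, 90, 95, 99)}
```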

Infrastructure Metrics

  • CPU utilization and load average
  • Memory usage and available memory
  • Disk I/O and available storage
  • Network throughput and error rates
  • Container and pod health (for Kubernetes)
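A few of these infrastructure metrics can be read with the Python standard library alone. A minimal, Unix-only sketch (memory and network stats need platform-specific sources such as /proc on Linux and are omitted here):

```python
import os
import shutil

def infra_snapshot(path="/"):
    """Collect a minimal infrastructure snapshot using only the standard library.

    os.getloadavg() is Unix-only; a full agent would add memory, network,
    and container stats from platform-specific sources.
    """
    load1, _load5, _load15 = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "load_per_cpu": load1 / (os.cpu_count() or 1),  # > 1.0 suggests CPU saturation
        "disk_used_frac": disk.used / disk.total,
    }
```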

Business Metrics

  • Transaction volumes and values
  • User signups and conversions
  • Feature usage and adoption rates
  • Revenue-impacting events
  • Customer satisfaction indicators

Alerting Strategy

Alert Design Principles

  • Alert on symptoms, not causes
  • Make alerts actionable with clear remediation steps
  • Set appropriate severity levels (critical, warning, info)
  • Avoid alert fatigue through proper threshold tuning
  • Include runbook links in alert notifications
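An actionable alert can be modeled as a small record that forces a severity level and a runbook link to be present. The field names here are illustrative assumptions, not a standard alerting schema:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """An actionable alert: symptom-focused summary, severity, and a runbook link."""
    summary: str       # the user-visible symptom, not the suspected cause
    severity: str      # one of: critical, warning, info
    runbook_url: str   # remediation steps live in the runbook, not the alert text
    labels: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything outside the agreed severity levels up front.
        if self.severity not in ("critical", "warning", "info"):
            raise ValueError(f"unknown severity: {self.severity}")
```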

SLO-Based Alerting

  • Define SLOs for critical user journeys
  • Calculate error budgets and burn rates
  • Alert when error budget consumption is high
  • Use multi-window, multi-burn-rate alerts
  • Review and adjust SLOs quarterly
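Error budgets and burn rates reduce to simple arithmetic. A sketch using the commonly cited fast-burn threshold of 14.4, the rate that consumes 2% of a 30-day budget in one hour (0.02 × 720 hours):

```python
def burn_rate(error_rate, slo_target):
    """Burn rate: observed error rate divided by the error budget the SLO allows.
    A burn rate of 1.0 exhausts the budget exactly at the end of the SLO window."""
    budget = 1.0 - slo_target
    return error_rate / budget

def multiwindow_page(short_rate, long_rate, slo_target, threshold=14.4):
    """Multi-window alert: page only when both the short and long windows are
    burning fast, which filters out brief spikes."""
    return (burn_rate(short_rate, slo_target) >= threshold
            and burn_rate(long_rate, slo_target) >= threshold)
```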

Alert Configuration

  • Set meaningful thresholds based on baseline data
  • Use hysteresis to prevent flapping alerts
  • Implement alert dependencies to reduce noise
  • Route alerts to appropriate teams
  • Configure escalation policies
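Hysteresis means firing above one threshold but clearing only below a lower one, so a metric oscillating around a single threshold cannot flap. A minimal sketch with hypothetical thresholds:

```python
class HysteresisAlert:
    """Fire above `high`, clear only below `low` (low < high)."""

    def __init__(self, high, low):
        assert low < high
        self.high, self.low = high, low
        self.firing = False

    def observe(self, value):
        """Feed one metric sample; returns the current firing state."""
        if not self.firing and value > self.high:
            self.firing = True
        elif self.firing and value < self.low:
            self.firing = False
        return self.firing
```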

Dashboard Design

Effective Dashboards

  • Create overview dashboards for service health
  • Build detailed dashboards for debugging
  • Use consistent layouts and naming conventions
  • Include time range selectors and drill-down capabilities
  • Display SLO status prominently

Dashboard Content

  • Show current state and recent trends
  • Include comparison to baseline or previous periods
  • Display deployment markers for correlation
  • Add annotations for significant events
  • Include links to related dashboards and logs

Monitoring Tools Integration

Data Collection

  • Use agents or sidecars for metric collection
  • Implement service discovery for dynamic environments
  • Configure appropriate scrape intervals
  • Choose push-based or pull-based collection to fit the use case
  • Ensure metric cardinality is manageable
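Cardinality is the number of unique (metric name, label set) combinations, since each one becomes its own time series. A small sketch for estimating it from collected samples; normalizing label key order is the important detail:

```python
def series_count(metrics):
    """Estimate time-series cardinality: one series per unique (name, labelset).

    `metrics` is a list of (name, labels_dict) pairs; sorting the label items
    ensures {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same series.
    """
    return len({(name, tuple(sorted(labels.items()))) for name, labels in metrics})
```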

Data Storage and Retention

  • Set retention periods based on use case
  • Implement downsampling for long-term storage
  • Use appropriate storage backends for scale
  • Plan for disaster recovery of monitoring data
  • Monitor your monitoring infrastructure
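Downsampling for long-term storage can keep a per-bucket average plus min/max so spikes survive aggregation. A sketch over (timestamp, value) points; the bucket size is a storage-policy assumption:

```python
def downsample(points, bucket_s):
    """Downsample (timestamp, value) points into fixed-width buckets, keeping the
    per-bucket average plus min/max so spikes are not averaged away."""
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % bucket_s, []).append(v)
    return [
        (start, sum(vs) / len(vs), min(vs), max(vs))
        for start, vs in sorted(buckets.items())
    ]
```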

Health Checks and Probes

  • Implement liveness probes for crash detection
  • Use readiness probes for traffic management
  • Create deep health checks that verify dependencies
  • Expose health endpoints in a standard format
  • Monitor health check latency as a metric
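A deep health check runs each dependency check, records its latency as a metric in its own right, and reports a machine-readable result. The payload shape and check names here are illustrative, not a standard format:

```python
import json
import time

def deep_health(dependency_checks):
    """Run deep health checks that verify dependencies.

    `dependency_checks` is a list of (name, fn) pairs where fn raises on
    failure; each check's latency is recorded alongside its status.
    """
    results, healthy = {}, True
    for name, check in dependency_checks:
        start = time.monotonic()
        try:
            check()
            status = "ok"
        except Exception as exc:
            status, healthy = f"failed: {exc}", False
        results[name] = {
            "status": status,
            "latency_ms": (time.monotonic() - start) * 1000,
        }
    return json.dumps({"healthy": healthy, "checks": results})
```

Served from a standard `/health` endpoint, the overall flag drives the probe while the per-check latencies feed the metrics pipeline.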

Incident Response

  • Use monitoring data to detect incidents early
  • Correlate metrics, logs, and traces during investigation
  • Document findings and update monitoring post-incident
  • Track MTTR (Mean Time to Recovery) metrics
  • Conduct regular monitoring reviews and improvements
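MTTR is just the mean of detection-to-recovery durations. A sketch assuming each incident is recorded as a (detected_at, recovered_at) pair:

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean Time to Recovery over (detected_at, recovered_at) datetime pairs."""
    durations = [recovered - detected for detected, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)
```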

Capacity Planning

  • Track resource utilization trends
  • Set alerts for approaching capacity limits
  • Use forecasting for proactive scaling
  • Document capacity requirements and headroom
  • Review capacity quarterly
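Forecasting for proactive scaling can start as a least-squares line through recent utilization samples. A sketch that estimates the days remaining until a hypothetical capacity limit is crossed:

```python
def days_until_limit(samples, limit):
    """Fit a least-squares line to (day, utilization) samples and return the
    days from the last sample until the trend crosses `limit`, or None if
    utilization is flat or falling."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # no growth trend: the limit is never reached
    return (limit - intercept) / slope - samples[-1][0]
```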