monitoring-and-alerting
Monitoring and Alerting
Decide what to watch, what to alert on, and how to make sure the right person finds out when things break.
When to use
- Setting up monitoring on a new site or service
- Defining SLOs (service level objectives) and error budgets
- Choosing which alerts page someone vs. which go to a quiet channel
- Designing or fixing an on-call rotation
- Diagnosing alert fatigue
- Filling monitoring gaps revealed by an incident
- Migrating monitoring vendors
When NOT to use
- Responding to an active incident (use incident-response)
- Writing the post-mortem (use after-action-report)
- Designing analytics dashboards for product metrics (use analytics-strategy)
- Performance optimization itself (use performance-optimization)
Required inputs
- The system you're monitoring (URLs, services, dependencies)
- Existing monitoring tools (uptime, errors, logs, APM)
- Business hours and team timezone(s)
- Who is on-call or available for incidents
- Existing SLOs or success metrics, if any
The framework: 4 layers
Monitoring works in layers. Skip a layer and you'll miss a class of problems.
Layer 1: Availability
Is the site up? The simplest, most important layer.
- HTTP checks from multiple regions (every 1-5 minutes)
- DNS resolution checks
- Certificate expiration checks
- Status code checks (alert on 5xx, not just timeout)
Threshold: any sustained downtime (more than 2 consecutive failed checks) pages.
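The layer-1 rules above fit in a few lines. A minimal sketch in Python; the 10-second timeout is an illustrative assumption, and the 2-consecutive-failures rule is the sustained-downtime threshold from above:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def check_http(url: str, timeout: float = 10.0) -> bool:
    """One availability probe. Success = any non-5xx response in time."""
    try:
        with urlopen(url, timeout=timeout):
            return True                 # 2xx/3xx
    except HTTPError as e:
        return e.code < 500             # alert on 5xx, not client errors
    except (URLError, OSError):
        return False                    # DNS failure, timeout, refused

def should_page(history: list[bool], consecutive: int = 2) -> bool:
    """Page only on sustained downtime: the last N probes all failed."""
    if len(history) < consecutive:
        return False
    return not any(history[-consecutive:])
```

A single failed probe stays quiet; two in a row page. Running one probe loop per region keeps a network blip near a single prober from waking anyone up.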
Layer 2: Correctness
The site is up, but is it serving the right thing?
- Synthetic checks (a script that loads the homepage, clicks a button, validates expected text)
- Critical user journeys (signup, checkout, search)
- Content presence checks (homepage hasn't gone blank)
- API contract checks (response shape and key fields are present)
Threshold: failures of critical-path synthetics page. Non-critical, page-level synthetics alert during business hours only.
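An API contract check from this layer reduces to a shape assertion. A sketch; the `/user` endpoint contract here is hypothetical:

```python
def contract_ok(payload: dict, required: dict) -> bool:
    """True if every required key is present with the expected type."""
    return all(isinstance(payload.get(key), typ)
               for key, typ in required.items())

# Hypothetical contract for a /user endpoint response
USER_CONTRACT = {"id": int, "email": str, "created_at": str}
```

A response with a missing field, or an `id` that quietly became a string, fails the check even though the endpoint still returns 200.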
Layer 3: Performance
The site is up and correct, but is it fast enough?
- Core Web Vitals (LCP, INP, CLS) from real users (RUM)
- Synthetic performance (Lighthouse, WebPageTest, custom)
- API response times (p50, p95, p99)
- Database query times for slow queries
- Dependency response times (third-party APIs)
Threshold: regressions from baseline (e.g., p95 doubled in 5 minutes). Don't alert on absolute thresholds without baselines.
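The "p95 doubled" rule can be evaluated with a nearest-rank percentile over a recent window. A sketch; the 2x factor is the example regression threshold from above:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for alert thresholds."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def p95_regressed(window_ms: list[float], baseline_p95_ms: float,
                  factor: float = 2.0) -> bool:
    """Alert when the recent window's p95 exceeds the baseline by `factor`."""
    return percentile(window_ms, 95) > factor * baseline_p95_ms
```

The baseline itself should come from history (same hour last week, or a rolling average), not from a number picked once and forgotten.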
Layer 4: Errors and anomalies
The site is up, correct, and fast for most users, but errors are still happening.
- Error rate (% of requests returning 5xx)
- Client-side error rate (uncaught JS exceptions)
- Log error volume (unexpected spikes)
- Anomaly detection (traffic falling off a cliff)
- Background job failures
- Queue depth
Threshold: rate-based, not count-based. "Error rate above 1% for 5 minutes" beats "more than 100 errors per minute."
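The rate-based threshold can be evaluated over per-minute buckets so it only fires when the breach is sustained. A sketch, assuming five one-minute buckets and the 1% example rate:

```python
def sustained_error_rate(buckets: list[tuple[int, int]],
                         threshold: float = 0.01) -> bool:
    """buckets: (errors, total) per minute, oldest first. Fires only when
    every bucket in the window is over the rate -- 'above 1% for 5 minutes',
    not 'more than 100 errors per minute'."""
    return all(total > 0 and errors / total > threshold
               for errors, total in buckets)
```

Note how a busy day with 150 errors per minute out of 100,000 requests stays quiet (0.15%), while a quiet night with 2 errors out of 100 requests fires (2%). That is the point of rates.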
SLOs and error budgets
A Service Level Objective is the target for reliability. Common form: "99.9% of homepage requests succeed in under 2 seconds, measured over 30 days."
The components:
- The thing you're measuring (homepage requests)
- The success criterion (returns 2xx in under 2 seconds)
- The target (99.9% of them)
- The window (over 30 days)
The error budget is the inverse: 0.1% of requests can fail. If you've used the whole budget, slow down on risky changes.
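The budget arithmetic is worth making concrete. A sketch (it assumes the SLO is strictly below 100%, so the budget is nonzero):

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Requests allowed to fail in the window. 99.9% of 1M => 1,000."""
    return (1 - slo) * total_requests

def budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the budget still unspent; negative means blown."""
    return 1 - failed / error_budget(slo, total)
```

With a 99.9% SLO over a million requests, 250 failures so far leaves 75% of the budget.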
Picking SLOs
Don't aim for 100%. Don't aim for "five nines" (99.999%) unless you really need it. Each nine costs an order of magnitude more.
| SLO | Allowed downtime per month |
|---|---|
| 99% | 7 hours, 18 minutes |
| 99.9% | 43 minutes |
| 99.95% | 21 minutes |
| 99.99% | 4 minutes, 22 seconds |
| 99.999% | 26 seconds |
For most marketing sites, 99.9% is plenty. For SaaS, 99.95% is reasonable. Anything higher needs significant infrastructure investment.
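The table values fall out of one formula. A sketch, using a 30.44-day average month:

```python
def allowed_downtime_minutes(slo: float, days_in_month: float = 30.44) -> float:
    """Monthly downtime budget implied by an availability SLO.
    30.44 days is the average month (365.25 / 12)."""
    return (1 - slo) * days_in_month * 24 * 60
```

For example, `allowed_downtime_minutes(0.999)` is about 43.8 minutes, matching the 99.9% row above.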
Using error budgets
When the budget is healthy, ship aggressively. When the budget is half-spent, slow down. When the budget is exhausted, freeze risky changes until reliability recovers.
This is what makes SLOs useful: they create a feedback loop between reliability and velocity.
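Writing the policy down as a tiny function makes it unambiguous during a release argument. A sketch; the 50% cut-off is an illustrative assumption, not a standard:

```python
def release_policy(budget_remaining: float) -> str:
    """Map error-budget health to a shipping posture."""
    if budget_remaining <= 0:
        return "freeze"   # reliability work only until the budget recovers
    if budget_remaining < 0.5:
        return "slow"     # extra review, no risky launches
    return "ship"         # budget is healthy, ship aggressively
```

Whatever thresholds you pick, agree on them before the budget is spent, not during the incident that spends it.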
Workflow
Step 1: Inventory what's already monitored
What tools are in place? What checks exist? What dashboards? What alerts?
Many teams have a tangle of half-configured tools. The first job is the inventory.
Step 2: Map the system
Draw the architecture. Front-end, back-end, database, third-party APIs, queues, workers. Each box is a candidate for monitoring.
For each box, ask:
- What does "up" mean?
- What does "correct" mean?
- What does "fast" mean?
- What's the most common failure mode?
Step 3: Define the SLOs
Pick 3-5 SLOs. They should be:
- Tied to user-visible behavior (not internal metrics)
- Achievable with current infrastructure
- Measured automatically
- Reviewed at least quarterly
Step 4: Set up checks across the 4 layers
For each box, configure checks at each layer. Some boxes won't have all four; that's fine.
| Box | Availability | Correctness | Performance | Errors |
|---|---|---|---|---|
| Homepage | HTTP check | Synthetic | LCP/INP | JS errors |
| Login API | HTTP check | Synthetic flow | p95 latency | 5xx rate |
Step 5: Decide what pages and what doesn't
Three tiers:
- Page (wakes someone up): site down, critical flow broken, error rate spike, security incident.
- Notify (during business hours): non-critical synthetic failure, performance regression, slow query, dependency degradation.
- Log (no notification): anomalies for later review, low-priority warnings, info-level events.
Anything in tier 1 must be:
- Actionable (the on-call can do something about it)
- Important (it represents real impact)
- Rare (less than 1-2 per week is the goal)
If tier 1 alerts fire frequently, alert fatigue sets in. People stop responding.
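One illustrative way to encode the tiering: an alert pages only if it is actionable and represents real, critical impact; actionable alerts with lesser impact notify; everything else logs. A sketch (the rules are an assumption, not a standard taxonomy):

```python
def alert_tier(actionable: bool, real_impact: bool, critical: bool) -> str:
    """Classify an alert into page / notify / log."""
    if actionable and real_impact and critical:
        return "page"     # wakes someone up
    if actionable and real_impact:
        return "notify"   # business-hours channel
    return "log"          # dashboard or log only
```

Note that a non-actionable alert never pages, no matter how scary it sounds: if the on-call can't do anything, waking them accomplishes nothing.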
Step 6: Configure routing
Where do alerts go?
- Tier 1: paging system (e.g., PagerDuty, Opsgenie). Direct to on-call.
- Tier 2: chat channel (Slack, Teams). Tagged with the area.
- Tier 3: dashboard or log only.
Each tier should have a documented escalation path. If the on-call doesn't ack within 5-15 minutes, escalate.
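The escalation path can be written down as a function of time-since-page. A sketch assuming a 10-minute ack deadline (within the 5-15 minute range above) and hypothetical role names:

```python
def escalation_target(minutes_unacked: float,
                      ack_deadline: float = 10.0) -> str:
    """Who a tier-1 page goes to as time passes without an ack."""
    if minutes_unacked < ack_deadline:
        return "primary on-call"
    if minutes_unacked < 2 * ack_deadline:
        return "secondary on-call"
    return "engineering manager"
```

Paging tools implement this natively; the value of sketching it is forcing the team to agree on the deadlines and the order.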
Step 7: Build dashboards
One dashboard per audience:
- Real-time ops dashboard: current health, recent alerts, error rates, throughput
- SLO dashboard: SLO status and error budget consumption
- Per-service dashboards: detail for individual services or pages
- Executive dashboard: uptime over weeks/months, key business metrics
Dashboards are different from alerts. Alerts say "look now." Dashboards say "here's what's happening."
Step 8: Run an alert audit
Every quarter, audit:
- Which alerts fired? Were they actionable?
- Which alerts didn't fire when they should have?
- Are any alerts noisy (more than once a week, low actionability)?
- Are runbooks up to date?
- Have SLOs been met? Any consistently breached?
Tune the system. Monitoring drifts without active maintenance.
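The "more than once a week" noise test is easy to automate from the paging system's export. A sketch; the input format (one alert name per firing in the audit window) is an assumption:

```python
from collections import Counter

def noisy_alerts(firings: list[str], weeks: int,
                 per_week: float = 1.0) -> list[str]:
    """Return alert names averaging more than `per_week` firings per week."""
    counts = Counter(firings)
    return sorted(name for name, n in counts.items() if n / weeks > per_week)
```

Each name this returns is a candidate for tuning, demotion to notify-tier, or fixing the underlying issue.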
Failure patterns
Alerting on causes, not symptoms. "CPU is high" is a cause. "Users are slow" is a symptom. Alert on symptoms; investigate causes.
Alert without a runbook. If the on-call doesn't know what to do, the alert is useless. Every paging alert needs a runbook (even a one-line one).
No baselines for "normal." Alerting on "more than 100 errors per minute" sounds reasonable but a busy day might exceed that without anything being wrong. Use rate-based and anomaly-based alerts.
Single-region monitoring. Your monitoring service in the same region as your site means you'll miss regional outages and you'll get woken up when monitoring itself has issues.
Monitoring the monitoring. Or rather, not. If your alerting platform is down, who tells you? Most paging services offer their own status feeds. Subscribe.
Too many tiers of severity. P0/P1/P2/P3/P4 with different SLAs becomes a sorting exercise. Three tiers (page, notify, log) is plenty.
Synthetics that don't match reality. A synthetic that hits the homepage every minute tests "is the homepage up." It doesn't test "is the actual user flow working." Build synthetics for the journeys that matter.
Static thresholds that never get tuned. Traffic grows, behavior changes, thresholds set last year are wrong. Review thresholds quarterly.
On-call rotation with no handoffs. Each new on-call has to figure out the system. Document. Run weekly handoff meetings or async updates.
Pager fatigue. If on-call is paged more than once or twice a week, something is wrong. Audit the alerts. Reduce, tune, or fix the underlying issues.
Output format
A monitoring plan includes:
- System map: what's being monitored
- SLOs: the 3-5 reliability targets
- Checks per layer: availability, correctness, performance, errors
- Alert tiering: what pages, what notifies, what logs
- Routing: where alerts go, escalation paths
- Dashboards: what audiences see
- Runbooks: linked from each paging alert
- Audit cadence: when this gets reviewed
Reference files
- references/slo-design-guide.md: Detailed walkthrough of writing SLOs, error budget policies, and common SLO mistakes for web services.