managing-incidents
Incident Management
Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.
When to Use This Skill
Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics
Core Principles
Incident Management Philosophy
Declare Early and Often: Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.
Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.
Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
Clear Command Structure: Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.
Severity Classification
Standard severity levels with response times:
SEV0 (P0) - Critical Outage:
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected
SEV1 (P1) - Major Degradation:
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable
SEV2 (P2) - Minor Issues:
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow
SEV3 (P3) - Low Impact:
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error
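The severity table above lends itself to a data-driven lookup rather than scattered if/else logic; a minimal sketch, where the paging-channel and response labels are illustrative values taken from the descriptions above, not a standard:

```python
# Hypothetical severity policy table; channel and response labels are
# illustrative, condensed from the SEV0-SEV3 descriptions above.
SEVERITY_POLICY = {
    "SEV0": {"page": "immediate_24x7", "notify_execs": True,  "response": "all hands on deck"},
    "SEV1": {"page": "business_hours", "notify_execs": False, "response": "IC + SMEs"},
    "SEV2": {"page": "email_slack",    "notify_execs": False, "response": "next business day"},
    "SEV3": {"page": "ticket_queue",   "notify_execs": False, "response": "planned sprint"},
}

def response_plan(severity: str) -> dict:
    # Unknown labels fall back to the lowest severity rather than paging anyone.
    return SEVERITY_POLICY.get(severity.upper(), SEVERITY_POLICY["SEV3"])
```

Keeping the policy in data makes a quarterly audit a table review instead of a code change.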
For detailed severity decision framework and interactive classifier, see references/severity-classification.md.
Incident Roles
Incident Commander (IC):
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed
Communications Lead:
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15-30 minutes for SEV0/SEV1
Subject Matter Experts (SMEs):
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC
Scribe:
- Documents timeline, actions, decisions in real-time
- Records incident notes for post-mortem reconstruction
Assign roles based on severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe
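The role-by-severity staffing rules above can be sketched as a small helper (purely illustrative):

```python
def roles_for(severity: str) -> list:
    # Staffing rules from the list above: full staff for SEV0,
    # IC plus subject matter experts for SEV1, a single responder otherwise.
    if severity == "SEV0":
        return ["IC", "Communications Lead", "SME", "Scribe"]
    if severity == "SEV1":
        return ["IC", "SME"]
    return ["Responder"]  # SEV2/SEV3
```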
For detailed role responsibilities, see references/incident-roles.md.
On-Call Management
Rotation Patterns
Primary + Secondary:
- Primary: First responder
- Secondary: Backup if primary doesn't ack within 5 minutes
- Rotation length: 1 week (optimal balance)
Follow-the-Sun (24/7):
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams
Tiered Escalation:
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)
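A sketch of how the primary/secondary timing above might drive paging, assuming a hypothetical chain of (minutes-unacknowledged, target) pairs; the 5-minute threshold mirrors the guidance above, the 15-minute tier is an illustrative addition:

```python
# Illustrative escalation chain: primary first, secondary after 5 silent
# minutes, team lead after 15 (the last threshold is an assumption).
ESCALATION_CHAIN = [
    (0, "primary-oncall"),
    (5, "secondary-oncall"),
    (15, "team-lead"),
]

def who_to_page(minutes_unacked: int) -> str:
    # Walk the chain and keep the deepest threshold already crossed.
    target = ESCALATION_CHAIN[0][1]
    for threshold, name in ESCALATION_CHAIN:
        if minutes_unacked >= threshold:
            target = name
    return target
```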
Best Practices
- Rotation length: 1 week per rotation
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded
Incident Response Workflow
Standard incident lifecycle:

Detection → Triage → Declaration → Investigation
    ↓
Mitigation → Resolution → Monitoring → Closure
    ↓
Post-Mortem (within 48 hours)

Key Decision Points
When to Declare: When in doubt, declare (can always downgrade severity)
When to Escalate:
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed
When to Close:
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues
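The three closure criteria can be checked mechanically before the IC declares resolution; a minimal sketch with hypothetical parameter names:

```python
def can_close(stable_minutes: int, metrics_at_baseline: bool,
              open_customer_reports: int) -> bool:
    # All three closure criteria listed above must hold simultaneously.
    return (stable_minutes >= 30
            and metrics_at_baseline
            and open_customer_reports == 0)
```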
For complete workflow details, see references/incident-workflow.md.
Communication Protocols
Internal Communication
Incident Slack Channel:
- Format: #incident-YYYY-MM-DD-topic-description
- Pin: Severity, IC name, status update template, runbook links
War Room: Video call for SEV0/SEV1 requiring real-time voice coordination
Status Update Cadence:
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones
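The channel-naming convention and cadence table above can be sketched as code (the helper name is hypothetical; SEV2 uses the lower bound of its range):

```python
from datetime import date

def incident_channel(topic: str, day: date) -> str:
    # Build a name in the #incident-YYYY-MM-DD-topic-description format.
    slug = topic.lower().strip().replace(" ", "-")
    return f"#incident-{day:%Y-%m-%d}-{slug}"

# Status update cadence in minutes, per the list above.
UPDATE_CADENCE_MIN = {"SEV0": 15, "SEV1": 30, "SEV2": 60}
```

For example, `incident_channel("API outage", date(2024, 3, 1))` yields `#incident-2024-03-01-api-outage`.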
External Communication
Status Page:
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible
Customer Email:
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented
Regulatory Notifications:
- Data Breach: GDPR requires notification within 72 hours
- Financial Services: Immediate notification to regulators
- Healthcare: HIPAA breach notification rules
For communication templates, see examples/communication-templates.md.
Runbooks and Playbooks
Runbook Structure
Every runbook should include:
- Trigger: Alert conditions that activate this runbook
- Severity: Expected severity level
- Prerequisites: System state requirements
- Steps: Numbered, executable commands (copy-pasteable)
- Verification: How to confirm fix worked
- Rollback: How to undo if steps fail
- Owner: Team/person responsible
- Last Updated: Date of last revision
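Those required sections can be linted before a runbook is merged; a hypothetical check, assuming the runbook has already been parsed into a dict keyed by section:

```python
# Required sections from the structure above, as machine-checkable keys.
REQUIRED_SECTIONS = {"trigger", "severity", "prerequisites", "steps",
                     "verification", "rollback", "owner", "last_updated"}

def missing_sections(runbook: dict) -> set:
    # Return every required section the runbook lacks.
    return REQUIRED_SECTIONS - set(runbook)
```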
Best Practices
- Executable: Commands copy-pasteable, not just descriptions
- Tested: Run during disaster recovery drills
- Versioned: Track changes in Git
- Linked: Reference from alert definitions
- Automated: Convert manual steps to scripts over time
For runbook templates, see the examples/runbooks/ directory.
Blameless Post-Mortems
Blameless Culture Tenets
Assume Good Intentions: Everyone made the best decision with information available.
Focus on Systems: Investigate how processes failed, not who failed.
Psychological Safety: Create environment where honesty is rewarded.
Learning Opportunity: Incidents are gifts of organizational knowledge.
Post-Mortem Process
1. Schedule Review (Within 48 Hours): While memory is fresh
2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document
3. Meeting Facilitation:
- Timeline walkthrough
- 5 Whys Analysis to identify systemic root causes
- What Went Well / What Went Wrong
- Define action items with owners and due dates
4. Post-Mortem Document:
- Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
- Distribution: Engineering, product, support, leadership
- Storage: Archive in searchable knowledge base
5. Follow-Up: Track action items in sprint planning
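The document structure in step 4 can be generated as an empty skeleton for the scribe's notes; a minimal sketch (this is not the postmortem-generator.py script listed later, just an illustration):

```python
# Section list from step 4 above.
SECTIONS = ["Summary", "Timeline", "Root Cause", "Impact",
            "What Went Well", "What Went Wrong", "Action Items"]

def postmortem_skeleton(title: str) -> str:
    # Emit a markdown skeleton the incident notes can be pasted into.
    lines = [f"# Post-Mortem: {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```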
For detailed facilitation guide and template, see references/blameless-postmortems.md and examples/postmortem-template.md.
Alert Design Principles
Actionable Alerts Only:
- Every alert requires human action
- Include graphs, runbook links, recent changes
- Deduplicate related alerts
- Route to appropriate team based on service ownership
Preventing Alert Fatigue:
- Audit alerts quarterly: Remove non-actionable alerts
- Increase thresholds for noisy metrics
- Use anomaly detection instead of static thresholds
- Limit: Max 2-3 pages per night
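Deduplication of related alerts can be as simple as grouping by a fingerprint; a hedged sketch where the (service, symptom) fields are illustrative:

```python
def deduplicate(alerts: list) -> list:
    # Collapse alerts sharing a (service, symptom) fingerprint:
    # keep the first occurrence and count the suppressed duplicates.
    groups = {}
    for alert in alerts:
        key = (alert["service"], alert["symptom"])
        if key in groups:
            groups[key]["count"] += 1
        else:
            groups[key] = {**alert, "count": 1}
    return list(groups.values())
```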
Tool Selection
Incident Management Platforms
PagerDuty:
- Best for: Established enterprises, complex escalation policies
- Cost: $19-41/user/month
- When: Team size 10+, budget $500+/month
Opsgenie:
- Best for: Atlassian ecosystem users, flexible routing
- Cost: $9-29/user/month
- When: Using Atlassian products, budget $200-500/month
incident.io:
- Best for: Modern teams, AI-powered response, Slack-native
- When: Team size 5-50, Slack-centric culture
For detailed tool comparison, see references/tool-comparison.md.
Status Page Solutions
Statuspage.io: Most trusted, easy setup ($29-399/month)
Instatus: Budget-friendly, modern design ($19-99/month)
Metrics and Continuous Improvement
Key Incident Metrics
MTTA (Mean Time To Acknowledge):
- Target: < 5 minutes for SEV1
- Improvement: Better on-call coverage
MTTR (Mean Time To Recovery):
- Target: < 1 hour for SEV1
- Improvement: Runbooks, automation
MTBF (Mean Time Between Failures):
- Target: > 30 days for critical services
- Improvement: Root cause fixes
Incident Frequency:
- Track: SEV0, SEV1, SEV2 counts per month
- Target: Downward trend
Action Item Completion Rate:
- Target: > 90%
- Improvement: Sprint integration, ownership clarity
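Given incident records with acknowledge and recovery timestamps, MTTA and MTTR are straightforward averages; a sketch with hypothetical field names, timestamps expressed in minutes:

```python
from statistics import mean

def mtta_mttr(incidents: list) -> tuple:
    # MTTA: average of (ack - start); MTTR: average of (resolved - start).
    mtta = mean(i["ack"] - i["start"] for i in incidents)
    mttr = mean(i["resolved"] - i["start"] for i in incidents)
    return mtta, mttr
```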
Continuous Improvement Loop
Incident → Post-Mortem → Action Items → Prevention
    ↑                                        ↓
    └──────────── Fewer Incidents ───────────┘

Decision Frameworks
Severity Classification Decision Tree
Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO → Is major functionality degraded?
    ├─ YES → Is there a workaround?
    │   ├─ YES → SEV1
    │   └─ NO → SEV0
    └─ NO → Are customers impacted?
        ├─ YES → SEV2
        └─ NO → SEV3

Use interactive classifier: python scripts/classify-severity.py
Escalation Matrix
For detailed escalation guidance, see references/escalation-matrix.md.
Mitigation vs. Root Cause
Prioritize Mitigation When:
- Active customer impact ongoing
- Quick fix available (rollback, disable feature)
Prioritize Root Cause When:
- Customer impact already mitigated
- Fix requires careful analysis
Default: Mitigation first (99% of cases)
Anti-Patterns to Avoid
- Delayed Declaration: Waiting for certainty before declaring incident
- Skipping Post-Mortems: "Small" incidents still provide learning
- Blame Culture: Punishing individuals prevents systemic learning
- Ignoring Action Items: Post-mortems without follow-through waste time
- No Clear IC: Multiple people leading creates confusion
- Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
- Hands-On IC: IC should delegate debugging, not do it themselves
Implementation Checklist
Phase 1: Foundation (Week 1)
- Define severity levels (SEV0-SEV3)
- Choose incident management platform
- Set up basic on-call rotation
- Create incident Slack channel template
Phase 2: Processes (Weeks 2-3)
- Create first 5 runbooks for common incidents
- Set up status page
- Train team on incident response
- Conduct tabletop exercise
Phase 3: Culture (Weeks 4+)
- Conduct first blameless post-mortem
- Establish post-mortem cadence
- Implement MTTA/MTTR dashboards
- Track action items in sprint planning
Phase 4: Optimization (Months 3-6)
- Automate incident declaration
- Implement runbook automation
- Monthly disaster recovery drills
- Quarterly incident trend reviews
Integration with Other Skills
Observability: Monitoring alerts trigger incidents → Use incident-management for response
Disaster Recovery: DR provides recovery procedures → Incident-management provides operational response
Security Incident Response: Similar process with added compliance/forensics
Infrastructure-as-Code: IaC enables fast recovery via automated rebuild
Performance Engineering: Performance incidents trigger response → Performance team investigates post-mitigation
Examples and Templates
Runbook Templates:
- examples/runbooks/database-failover.md
- examples/runbooks/cache-invalidation.md
- examples/runbooks/ddos-mitigation.md
Post-Mortem Template:
- examples/postmortem-template.md - Complete blameless post-mortem structure
Communication Templates:
- examples/communication-templates.md - Status updates, customer emails
On-Call Handoff:
- examples/oncall-handoff-template.md - Weekly handoff format
Integration Scripts:
- examples/integrations/pagerduty-slack.py
- examples/integrations/statuspage-auto-update.py
- examples/integrations/postmortem-generator.py
Scripts
Interactive Severity Classifier:

```bash
python scripts/classify-severity.py
```

Asks questions to determine the appropriate severity level based on impact and urgency.
Further Reading
Books:
- Google SRE Book: "Postmortem Culture" (Chapter 15)
- "The Phoenix Project" by Gene Kim
- "Site Reliability Engineering" (Full book)
Online Resources:
- Atlassian: "How to Run a Blameless Postmortem"
- PagerDuty: "Incident Response Guide"
- Google SRE: "Postmortem Culture: Learning from Failure"
Standards:
- Incident Command System (ICS) - FEMA standard adapted for tech
- ITIL Incident Management - Traditional IT service management