managing-incidents
Incident Management
Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.
When to Use This Skill
Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics
Core Principles
Incident Management Philosophy
Declare Early and Often: Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.
Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.
Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
Clear Command Structure: Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.
Severity Classification
Standard severity levels with response times:
SEV0 (P0) - Critical Outage:
- Impact: Complete service outage, critical data loss, payment processing down
- Response: Page immediately 24/7, all hands on deck, executive notification
- Example: API completely down, entire customer base affected
SEV1 (P1) - Major Degradation:
- Impact: Major functionality degraded, significant customer subset affected
- Response: Page during business hours, escalate off-hours, IC assigned
- Example: 15% error rate, critical feature unavailable
SEV2 (P2) - Minor Issues:
- Impact: Minor functionality impaired, edge case bug, small user subset
- Response: Email/Slack alert, next business day response
- Example: UI glitch, non-critical feature slow
SEV3 (P3) - Low Impact:
- Impact: Cosmetic issues, no customer functionality affected
- Response: Ticket queue, planned sprint
- Example: Visual inconsistency, documentation error
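The severity table above lends itself to a data-driven lookup rather than scattered if/else logic; a minimal sketch, where the paging-channel and response labels are illustrative values taken from the descriptions above, not a standard:

```python
# Hypothetical severity policy table; channel and response labels are
# illustrative, condensed from the SEV0-SEV3 descriptions above.
SEVERITY_POLICY = {
    "SEV0": {"page": "immediate_24x7", "notify_execs": True,  "response": "all hands on deck"},
    "SEV1": {"page": "business_hours", "notify_execs": False, "response": "IC + SMEs"},
    "SEV2": {"page": "email_slack",    "notify_execs": False, "response": "next business day"},
    "SEV3": {"page": "ticket_queue",   "notify_execs": False, "response": "planned sprint"},
}

def response_plan(severity: str) -> dict:
    # Unknown labels fall back to the lowest severity rather than paging anyone.
    return SEVERITY_POLICY.get(severity.upper(), SEVERITY_POLICY["SEV3"])
```

Keeping the policy in data makes a quarterly audit a table review instead of a code change.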
For detailed severity decision framework and interactive classifier, see references/severity-classification.md.
Incident Roles
Incident Commander (IC):
- Owns overall incident response and coordination
- Makes strategic decisions (rollback vs. debug, when to escalate)
- Delegates tasks to responders (does NOT do hands-on debugging)
- Declares incident resolved when stability confirmed
Communications Lead:
- Posts status updates to internal and external channels
- Coordinates with stakeholders (executives, product, support)
- Drafts post-incident customer communication
- Cadence: Every 15-30 minutes for SEV0/SEV1
Subject Matter Experts (SMEs):
- Hands-on debugging and mitigation
- Execute runbooks and implement fixes
- Provide technical context to IC
Scribe:
- Documents timeline, actions, decisions in real-time
- Records incident notes for post-mortem reconstruction
Assign roles based on severity:
- SEV2/SEV3: Single responder
- SEV1: IC + SME(s)
- SEV0: IC + Communications Lead + SME(s) + Scribe
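The role-by-severity staffing rules above can be sketched as a small helper (purely illustrative):

```python
def roles_for(severity: str) -> list:
    # Staffing rules from the list above: full staff for SEV0,
    # IC plus subject matter experts for SEV1, a single responder otherwise.
    if severity == "SEV0":
        return ["IC", "Communications Lead", "SME", "Scribe"]
    if severity == "SEV1":
        return ["IC", "SME"]
    return ["Responder"]  # SEV2/SEV3
```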
For detailed role responsibilities, see references/incident-roles.md.
On-Call Management
Rotation Patterns
Primary + Secondary:
- Primary: First responder
- Secondary: Backup if primary doesn't ack within 5 minutes
- Rotation length: 1 week (optimal balance)
Follow-the-Sun (24/7):
- Team A: US hours, Team B: Europe hours, Team C: Asia hours
- Benefit: No night shifts, improved work-life balance
- Requires: Multiple global teams
Tiered Escalation:
- Tier 1: Junior on-call (common issues, runbook-driven)
- Tier 2: Senior on-call (complex troubleshooting)
- Tier 3: Team lead/architect (critical decisions)
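A sketch of how the primary/secondary timing above might drive paging, assuming a hypothetical chain of (minutes-unacknowledged, target) pairs; the 5-minute threshold mirrors the guidance above, the 15-minute tier is an illustrative addition:

```python
# Illustrative escalation chain: primary first, secondary after 5 silent
# minutes, team lead after 15 (the last threshold is an assumption).
ESCALATION_CHAIN = [
    (0, "primary-oncall"),
    (5, "secondary-oncall"),
    (15, "team-lead"),
]

def who_to_page(minutes_unacked: int) -> str:
    # Walk the chain and keep the deepest threshold already crossed.
    target = ESCALATION_CHAIN[0][1]
    for threshold, name in ESCALATION_CHAIN:
        if minutes_unacked >= threshold:
            target = name
    return target
```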
Best Practices
- Rotation length: 1 week per rotation
- Handoff ceremony: 30-minute call to discuss active issues
- Compensation: On-call stipend + time off after major incidents
- Tooling: PagerDuty, Opsgenie, or incident.io
- Limits: Max 2-3 pages per night; escalate if exceeded
Incident Response Workflow
Standard incident lifecycle:

Detection → Triage → Declaration → Investigation
    ↓
Mitigation → Resolution → Monitoring → Closure
    ↓
Post-Mortem (within 48 hours)

Key Decision Points
When to Declare: When in doubt, declare (can always downgrade severity)
When to Escalate:
- No progress after 30 minutes
- Severity increases (SEV2 → SEV1)
- Specialized expertise needed
When to Close:
- Issue resolved and stable for 30+ minutes
- Monitoring shows all metrics at baseline
- No customer-reported issues
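The three closure criteria can be checked mechanically before the IC declares resolution; a minimal sketch with hypothetical parameter names:

```python
def can_close(stable_minutes: int, metrics_at_baseline: bool,
              open_customer_reports: int) -> bool:
    # All three closure criteria listed above must hold simultaneously.
    return (stable_minutes >= 30
            and metrics_at_baseline
            and open_customer_reports == 0)
```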
For complete workflow details, see references/incident-workflow.md.
Communication Protocols
Internal Communication
Incident Slack Channel:
- Format: #incident-YYYY-MM-DD-topic-description
- Pin: Severity, IC name, status update template, runbook links
War Room: Video call for SEV0/SEV1 requiring real-time voice coordination
Status Update Cadence:
- SEV0: Every 15 minutes
- SEV1: Every 30 minutes
- SEV2: Every 1-2 hours or at major milestones
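The channel-naming convention and cadence table above can be sketched as code (the helper name is hypothetical; SEV2 uses the lower bound of its range):

```python
from datetime import date

def incident_channel(topic: str, day: date) -> str:
    # Build a name in the #incident-YYYY-MM-DD-topic-description format.
    slug = topic.lower().strip().replace(" ", "-")
    return f"#incident-{day:%Y-%m-%d}-{slug}"

# Status update cadence in minutes, per the list above.
UPDATE_CADENCE_MIN = {"SEV0": 15, "SEV1": 30, "SEV2": 60}
```

For example, `incident_channel("API outage", date(2024, 3, 1))` yields `#incident-2024-03-01-api-outage`.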
External Communication
Status Page:
- Tools: Statuspage.io, Instatus, custom
- Stages: Investigating → Identified → Monitoring → Resolved
- Transparency: Acknowledge issue publicly, provide ETAs when possible
Customer Email:
- When: SEV0/SEV1 affecting customers
- Timing: Within 1 hour (acknowledge), post-resolution (full details)
- Tone: Apologetic, transparent, action-oriented
Regulatory Notifications:
- Data Breach: GDPR requires notification within 72 hours
- Financial Services: Immediate notification to regulators
- Healthcare: HIPAA breach notification rules
For communication templates, see examples/communication-templates.md.
Runbooks and Playbooks
Runbook Structure
Every runbook should include:
- Trigger: Alert conditions that activate this runbook
- Severity: Expected severity level
- Prerequisites: System state requirements
- Steps: Numbered, executable commands (copy-pasteable)
- Verification: How to confirm fix worked
- Rollback: How to undo if steps fail
- Owner: Team/person responsible
- Last Updated: Date of last revision
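Those required sections can be linted before a runbook is merged; a hypothetical check, assuming the runbook has already been parsed into a dict keyed by section:

```python
# Required sections from the structure above, as machine-checkable keys.
REQUIRED_SECTIONS = {"trigger", "severity", "prerequisites", "steps",
                     "verification", "rollback", "owner", "last_updated"}

def missing_sections(runbook: dict) -> set:
    # Return every required section the runbook lacks.
    return REQUIRED_SECTIONS - set(runbook)
```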
Best Practices
- Executable: Commands copy-pasteable, not just descriptions
- Tested: Run during disaster recovery drills
- Versioned: Track changes in Git
- Linked: Reference from alert definitions
- Automated: Convert manual steps to scripts over time
For runbook templates, see the examples/runbooks/ directory.
Blameless Post-Mortems
Blameless Culture Tenets
Assume Good Intentions: Everyone made the best decision with information available.
Focus on Systems: Investigate how processes failed, not who failed.
Psychological Safety: Create environment where honesty is rewarded.
Learning Opportunity: Incidents are gifts of organizational knowledge.
Post-Mortem Process
1. Schedule Review (Within 48 Hours): While memory is fresh
2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document
3. Meeting Facilitation:
- Timeline walkthrough
- 5 Whys Analysis to identify systemic root causes
- What Went Well / What Went Wrong
- Define action items with owners and due dates
4. Post-Mortem Document:
- Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
- Distribution: Engineering, product, support, leadership
- Storage: Archive in searchable knowledge base
5. Follow-Up: Track action items in sprint planning
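The document structure in step 4 can be generated as an empty skeleton for the scribe's notes; a minimal sketch (this is not the postmortem-generator.py script listed later, just an illustration):

```python
# Section list from step 4 above.
SECTIONS = ["Summary", "Timeline", "Root Cause", "Impact",
            "What Went Well", "What Went Wrong", "Action Items"]

def postmortem_skeleton(title: str) -> str:
    # Emit a markdown skeleton the incident notes can be pasted into.
    lines = [f"# Post-Mortem: {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```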
For detailed facilitation guide and template, see references/blameless-postmortems.md and examples/postmortem-template.md.
Alert Design Principles
Actionable Alerts Only:
- Every alert requires human action
- Include graphs, runbook links, recent changes
- Deduplicate related alerts
- Route to appropriate team based on service ownership
Preventing Alert Fatigue:
- Audit alerts quarterly: Remove non-actionable alerts
- Increase thresholds for noisy metrics
- Use anomaly detection instead of static thresholds
- Limit: Max 2-3 pages per night
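Deduplication of related alerts can be as simple as grouping by a fingerprint; a hedged sketch where the (service, symptom) fields are illustrative:

```python
def deduplicate(alerts: list) -> list:
    # Collapse alerts sharing a (service, symptom) fingerprint:
    # keep the first occurrence and count the suppressed duplicates.
    groups = {}
    for alert in alerts:
        key = (alert["service"], alert["symptom"])
        if key in groups:
            groups[key]["count"] += 1
        else:
            groups[key] = {**alert, "count": 1}
    return list(groups.values())
```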
Tool Selection
Incident Management Platforms
PagerDuty:
- Best for: Established enterprises, complex escalation policies
- Cost: $19-41/user/month
- When: Team size 10+, budget $500+/month
Opsgenie:
- Best for: Atlassian ecosystem users, flexible routing
- Cost: $9-29/user/month
- When: Using Atlassian products, budget $200-500/month
incident.io:
- Best for: Modern teams, AI-powered response, Slack-native
- When: Team size 5-50, Slack-centric culture
For detailed tool comparison, see references/tool-comparison.md.
Status Page Solutions
Statuspage.io: Most trusted, easy setup ($29-399/month)
Instatus: Budget-friendly, modern design ($19-99/month)
Metrics and Continuous Improvement
Key Incident Metrics
MTTA (Mean Time To Acknowledge):
- Target: < 5 minutes for SEV1
- Improvement: Better on-call coverage
MTTR (Mean Time To Recovery):
- Target: < 1 hour for SEV1
- Improvement: Runbooks, automation
MTBF (Mean Time Between Failures):
- Target: > 30 days for critical services
- Improvement: Root cause fixes
Incident Frequency:
- Track: SEV0, SEV1, SEV2 counts per month
- Target: Downward trend
Action Item Completion Rate:
- Target: > 90%
- Improvement: Sprint integration, ownership clarity
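Given incident records with acknowledge and recovery timestamps, MTTA and MTTR are straightforward averages; a sketch with hypothetical field names, timestamps expressed in minutes:

```python
from statistics import mean

def mtta_mttr(incidents: list) -> tuple:
    # MTTA: average of (ack - start); MTTR: average of (resolved - start).
    mtta = mean(i["ack"] - i["start"] for i in incidents)
    mttr = mean(i["resolved"] - i["start"] for i in incidents)
    return mtta, mttr
```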
Continuous Improvement Loop
Incident → Post-Mortem → Action Items → Prevention
    ↑                                        ↓
    └──────────── Fewer Incidents ───────────┘

Decision Frameworks
Severity Classification Decision Tree
Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO → Is major functionality degraded?
    ├─ YES → Is there a workaround?
    │   ├─ YES → SEV1
    │   └─ NO → SEV0
    └─ NO → Are customers impacted?
        ├─ YES → SEV2
        └─ NO → SEV3

Use interactive classifier: python scripts/classify-severity.py
Escalation Matrix
For detailed escalation guidance, see references/escalation-matrix.md.
Mitigation vs. Root Cause
Prioritize Mitigation When:
- Active customer impact ongoing
- Quick fix available (rollback, disable feature)
Prioritize Root Cause When:
- Customer impact already mitigated
- Fix requires careful analysis
Default: Mitigation first (99% of cases)
Anti-Patterns to Avoid
- Delayed Declaration: Waiting for certainty before declaring incident
- Skipping Post-Mortems: "Small" incidents still provide learning
- Blame Culture: Punishing individuals prevents systemic learning
- Ignoring Action Items: Post-mortems without follow-through waste time
- No Clear IC: Multiple people leading creates confusion
- Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
- Hands-On IC: IC should delegate debugging, not do it themselves
Implementation Checklist
Phase 1: Foundation (Week 1)
- Define severity levels (SEV0-SEV3)
- Choose incident management platform
- Set up basic on-call rotation
- Create incident Slack channel template
Phase 2: Processes (Weeks 2-3)
- Create first 5 runbooks for common incidents
- Set up status page
- Train team on incident response
- Conduct tabletop exercise
Phase 3: Culture (Weeks 4+)
- Conduct first blameless post-mortem
- Establish post-mortem cadence
- Implement MTTA/MTTR dashboards
- Track action items in sprint planning
Phase 4: Optimization (Months 3-6)
- Automate incident declaration
- Implement runbook automation
- Monthly disaster recovery drills
- Quarterly incident trend reviews
Integration with Other Skills
Observability: Monitoring alerts trigger incidents → Use incident-management for response
Disaster Recovery: DR provides recovery procedures → Incident-management provides operational response
Security Incident Response: Similar process with added compliance/forensics
Infrastructure-as-Code: IaC enables fast recovery via automated rebuild
Performance Engineering: Performance incidents trigger response → Performance team investigates post-mitigation
Examples and Templates
Runbook Templates:
- examples/runbooks/database-failover.md
- examples/runbooks/cache-invalidation.md
- examples/runbooks/ddos-mitigation.md
Post-Mortem Template:
- examples/postmortem-template.md - Complete blameless post-mortem structure
Communication Templates:
- examples/communication-templates.md - Status updates, customer emails
On-Call Handoff:
- examples/oncall-handoff-template.md - Weekly handoff format
Integration Scripts:
- examples/integrations/pagerduty-slack.py
- examples/integrations/statuspage-auto-update.py
- examples/integrations/postmortem-generator.py
Scripts
Interactive Severity Classifier:

```bash
python scripts/classify-severity.py
```

Asks questions to determine the appropriate severity level based on impact and urgency.
Further Reading
Books:
- Google SRE Book: "Postmortem Culture" (Chapter 15)
- "The Phoenix Project" by Gene Kim
- "Site Reliability Engineering" (Full book)
Online Resources:
- Atlassian: "How to Run a Blameless Postmortem"
- PagerDuty: "Incident Response Guide"
- Google SRE: "Postmortem Culture: Learning from Failure"
Standards:
- Incident Command System (ICS) - FEMA standard adapted for tech
- ITIL Incident Management - Traditional IT service management