managing-incidents

Incident Management

Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.

When to Use This Skill

Apply this skill when:
  • Setting up incident response processes for a team
  • Designing on-call rotations and escalation policies
  • Creating runbooks for common failure scenarios
  • Conducting blameless post-mortems after incidents
  • Implementing incident communication protocols (internal and external)
  • Choosing incident management tooling and platforms
  • Improving MTTR and incident frequency metrics

Core Principles

Incident Management Philosophy

Declare Early and Often: Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.
Mitigation First, Root Cause Later: Stop customer impact immediately (rollback, disable feature, failover). Debug and fix the root cause after stability is restored.
Blameless Culture: Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
Clear Command Structure: Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
Communication is Critical: Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical incidents.

Severity Classification

Standard severity levels with response times:
SEV0 (P0) - Critical Outage:
  • Impact: Complete service outage, critical data loss, payment processing down
  • Response: Page immediately 24/7, all hands on deck, executive notification
  • Example: API completely down, entire customer base affected
SEV1 (P1) - Major Degradation:
  • Impact: Major functionality degraded, significant customer subset affected
  • Response: Page during business hours, escalate off-hours, IC assigned
  • Example: 15% error rate, critical feature unavailable
SEV2 (P2) - Minor Issues:
  • Impact: Minor functionality impaired, edge case bug, small user subset
  • Response: Email/Slack alert, next business day response
  • Example: UI glitch, non-critical feature slow
SEV3 (P3) - Low Impact:
  • Impact: Cosmetic issues, no customer functionality affected
  • Response: Ticket queue, planned sprint
  • Example: Visual inconsistency, documentation error
For a detailed severity decision framework and an interactive classifier, see references/severity-classification.md.

Incident Roles

Incident Commander (IC):
  • Owns overall incident response and coordination
  • Makes strategic decisions (rollback vs. debug, when to escalate)
  • Delegates tasks to responders (does NOT do hands-on debugging)
  • Declares incident resolved when stability confirmed
Communications Lead:
  • Posts status updates to internal and external channels
  • Coordinates with stakeholders (executives, product, support)
  • Drafts post-incident customer communication
  • Cadence: Every 15-30 minutes for SEV0/SEV1
Subject Matter Experts (SMEs):
  • Hands-on debugging and mitigation
  • Execute runbooks and implement fixes
  • Provide technical context to IC
Scribe:
  • Documents timeline, actions, decisions in real-time
  • Records incident notes for post-mortem reconstruction
Assign roles based on severity:
  • SEV2/SEV3: Single responder
  • SEV1: IC + SME(s)
  • SEV0: IC + Communications Lead + SME(s) + Scribe
For detailed role responsibilities, see references/incident-roles.md.
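The staffing rules above can be sketched as a simple lookup. This is an illustrative sketch only; the function and the "Responder"/"SME" role strings are assumptions, not part of any tooling named in this guide.

```python
def roles_for_severity(sev: str) -> list[str]:
    """Return the incident roles to staff for a given severity level."""
    staffing = {
        "SEV0": ["Incident Commander", "Communications Lead", "SME", "Scribe"],
        "SEV1": ["Incident Commander", "SME"],
        "SEV2": ["Responder"],  # single responder for minor issues
        "SEV3": ["Responder"],
    }
    return staffing[sev.upper()]
```

In practice this mapping would live in the incident platform's configuration rather than code, but the shape is the same.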

On-Call Management

Rotation Patterns

Primary + Secondary:
  • Primary: First responder
  • Secondary: Backup if primary doesn't ack within 5 minutes
  • Rotation length: 1 week (optimal balance)
Follow-the-Sun (24/7):
  • Team A: US hours, Team B: Europe hours, Team C: Asia hours
  • Benefit: No night shifts, improved work-life balance
  • Requires: Multiple global teams
Tiered Escalation:
  • Tier 1: Junior on-call (common issues, runbook-driven)
  • Tier 2: Senior on-call (complex troubleshooting)
  • Tier 3: Team lead/architect (critical decisions)
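The Primary + Secondary ack rule lends itself to a small sketch, assuming the 5-minute window above. `ACK_TIMEOUT` and `current_target` are hypothetical names; real paging platforms express this as an escalation policy, not code.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=5)  # secondary is paged after this window

def current_target(paged_at: datetime, acked: bool, now: datetime) -> str:
    """Return which on-call the page should currently target."""
    if acked or now - paged_at < ACK_TIMEOUT:
        return "primary"   # acknowledged, or still inside the ack window
    return "secondary"     # no ack within 5 minutes: escalate
```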

Best Practices

  • Rotation length: 1 week per rotation
  • Handoff ceremony: 30-minute call to discuss active issues
  • Compensation: On-call stipend + time off after major incidents
  • Tooling: PagerDuty, Opsgenie, or incident.io
  • Limits: Max 2-3 pages per night; escalate if exceeded

Incident Response Workflow

Standard incident lifecycle:
Detection → Triage → Declaration → Investigation
Mitigation → Resolution → Monitoring → Closure
Post-Mortem (within 48 hours)
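The lifecycle above can be enforced as a transition table, so a tracker rejects skipped stages. This is a minimal sketch under assumptions: the `LIFECYCLE` dict and `advance` helper are hypothetical, and the Monitoring → Investigation reopen path is an added assumption for regressions.

```python
LIFECYCLE = {
    "Detection":     ["Triage"],
    "Triage":        ["Declaration"],
    "Declaration":   ["Investigation"],
    "Investigation": ["Mitigation"],
    "Mitigation":    ["Resolution"],
    "Resolution":    ["Monitoring"],
    "Monitoring":    ["Closure", "Investigation"],  # reopen if metrics regress (assumption)
    "Closure":       ["Post-Mortem"],
    "Post-Mortem":   [],  # terminal stage
}

def advance(current: str, nxt: str) -> str:
    """Move an incident to the next stage, rejecting invalid transitions."""
    if nxt not in LIFECYCLE[current]:
        raise ValueError(f"cannot move from {current} to {nxt}")
    return nxt
```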

Key Decision Points

When to Declare: When in doubt, declare (can always downgrade severity)
When to Escalate:
  • No progress after 30 minutes
  • Severity increases (SEV2 → SEV1)
  • Specialized expertise needed
When to Close:
  • Issue resolved and stable for 30+ minutes
  • Monitoring shows all metrics at baseline
  • No customer-reported issues
For complete workflow details, see references/incident-workflow.md.

Communication Protocols

Internal Communication

Incident Slack Channel:
  • Format: #incident-YYYY-MM-DD-topic-description
  • Pin: Severity, IC name, status update template, runbook links
War Room: Video call for SEV0/SEV1 requiring real-time voice coordination
Status Update Cadence:
  • SEV0: Every 15 minutes
  • SEV1: Every 30 minutes
  • SEV2: Every 1-2 hours or at major milestones
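A minimal sketch of the #incident-YYYY-MM-DD-topic-description naming convention and the cadence table above; `incident_channel` and `CADENCE_MINUTES` are hypothetical helpers, not part of any listed integration script.

```python
import re
from datetime import date
from typing import Optional

# Status update cadence (minutes) from the section above.
CADENCE_MINUTES = {"SEV0": 15, "SEV1": 30}

def incident_channel(topic: str, on: Optional[date] = None) -> str:
    """Build a Slack channel name in the #incident-YYYY-MM-DD-topic format."""
    on = on or date.today()
    # Lowercase and collapse non-alphanumerics into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"#incident-{on:%Y-%m-%d}-{slug}"
```

Generating the name programmatically keeps channels consistent and searchable after the fact.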

External Communication

Status Page:
  • Tools: Statuspage.io, Instatus, custom
  • Stages: Investigating → Identified → Monitoring → Resolved
  • Transparency: Acknowledge issue publicly, provide ETAs when possible
Customer Email:
  • When: SEV0/SEV1 affecting customers
  • Timing: Within 1 hour (acknowledge), post-resolution (full details)
  • Tone: Apologetic, transparent, action-oriented
Regulatory Notifications:
  • Data Breach: GDPR requires notification within 72 hours
  • Financial Services: Immediate notification to regulators
  • Healthcare: HIPAA breach notification rules
For communication templates, see examples/communication-templates.md.

Runbooks and Playbooks

Runbook Structure

Every runbook should include:
  1. Trigger: Alert conditions that activate this runbook
  2. Severity: Expected severity level
  3. Prerequisites: System state requirements
  4. Steps: Numbered, executable commands (copy-pasteable)
  5. Verification: How to confirm fix worked
  6. Rollback: How to undo if steps fail
  7. Owner: Team/person responsible
  8. Last Updated: Date of last revision
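The eight fields above can be captured as a record that a docs linter checks, for example flagging runbooks that have not been revised recently. A hypothetical sketch: the field names, the `is_stale` check, and the 90-day threshold are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Runbook:
    trigger: str          # alert conditions that activate this runbook
    severity: str         # expected severity level, e.g. "SEV1"
    prerequisites: str    # system state requirements
    steps: list[str]      # numbered, copy-pasteable commands
    verification: str     # how to confirm the fix worked
    rollback: str         # how to undo if steps fail
    owner: str            # team/person responsible
    last_updated: date    # date of last revision

    def is_stale(self, today: date, max_age_days: int = 90) -> bool:
        """Flag runbooks not revised within the threshold (threshold assumed)."""
        return today - self.last_updated > timedelta(days=max_age_days)
```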

Best Practices

  • Executable: Commands copy-pasteable, not just descriptions
  • Tested: Run during disaster recovery drills
  • Versioned: Track changes in Git
  • Linked: Reference from alert definitions
  • Automated: Convert manual steps to scripts over time
For runbook templates, see the examples/runbooks/ directory.

Blameless Post-Mortems

Blameless Culture Tenets

Assume Good Intentions: Everyone made the best decision with information available.
Focus on Systems: Investigate how processes failed, not who failed.
Psychological Safety: Create environment where honesty is rewarded.
Learning Opportunity: Incidents are gifts of organizational knowledge.

Post-Mortem Process

1. Schedule Review (Within 48 Hours): While memory is fresh
2. Pre-Work: Reconstruct timeline, gather metrics/logs, draft document
3. Meeting Facilitation:
  • Timeline walkthrough
  • 5 Whys Analysis to identify systemic root causes
  • What Went Well / What Went Wrong
  • Define action items with owners and due dates
4. Post-Mortem Document:
  • Sections: Summary, Timeline, Root Cause, Impact, What Went Well/Wrong, Action Items
  • Distribution: Engineering, product, support, leadership
  • Storage: Archive in searchable knowledge base
5. Follow-Up: Track action items in sprint planning
For a detailed facilitation guide and template, see references/blameless-postmortems.md and examples/postmortem-template.md.

Alert Design Principles

Actionable Alerts Only:
  • Every alert requires human action
  • Include graphs, runbook links, recent changes
  • Deduplicate related alerts
  • Route to appropriate team based on service ownership
Preventing Alert Fatigue:
  • Audit alerts quarterly: Remove non-actionable alerts
  • Increase thresholds for noisy metrics
  • Use anomaly detection instead of static thresholds
  • Limit: Max 2-3 pages per night
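"Deduplicate related alerts" usually means suppressing repeats of the same alert fingerprint inside a rolling window. A sketch under assumptions: `AlertDeduper`, the fingerprint strings, and the 10-minute window are illustrative, not any specific platform's API.

```python
from datetime import datetime, timedelta

class AlertDeduper:
    def __init__(self, window: timedelta = timedelta(minutes=10)):
        self.window = window
        self._last_seen: dict[str, datetime] = {}

    def should_page(self, fingerprint: str, now: datetime) -> bool:
        """Page only if this fingerprint has not fired inside the window."""
        last = self._last_seen.get(fingerprint)
        self._last_seen[fingerprint] = now  # record this firing either way
        return last is None or now - last >= self.window
```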

Tool Selection

Incident Management Platforms

PagerDuty:
  • Best for: Established enterprises, complex escalation policies
  • Cost: $19-41/user/month
  • When: Team size 10+, budget $500+/month
Opsgenie:
  • Best for: Atlassian ecosystem users, flexible routing
  • Cost: $9-29/user/month
  • When: Using Atlassian products, budget $200-500/month
incident.io:
  • Best for: Modern teams, AI-powered response, Slack-native
  • When: Team size 5-50, Slack-centric culture
For a detailed tool comparison, see references/tool-comparison.md.

Status Page Solutions

Statuspage.io: Most trusted, easy setup ($29-399/month)
Instatus: Budget-friendly, modern design ($19-99/month)

Metrics and Continuous Improvement

Key Incident Metrics

MTTA (Mean Time To Acknowledge):
  • Target: < 5 minutes for SEV1
  • Improvement: Better on-call coverage
MTTR (Mean Time To Recovery):
  • Target: < 1 hour for SEV1
  • Improvement: Runbooks, automation
MTBF (Mean Time Between Failures):
  • Target: > 30 days for critical services
  • Improvement: Root cause fixes
Incident Frequency:
  • Track: SEV0, SEV1, SEV2 counts per month
  • Target: Downward trend
Action Item Completion Rate:
  • Target: > 90%
  • Improvement: Sprint integration, ownership clarity
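MTTA and MTTR fall out of the same timestamps every incident record already carries. A minimal sketch, assuming incidents are dicts with `paged_at`, `acked_at`, and `resolved_at` fields (field names are an assumption).

```python
from datetime import datetime, timedelta

def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

def mtta(incidents: list[dict]) -> float:
    """Mean Time To Acknowledge: page → ack."""
    return mean_minutes([i["acked_at"] - i["paged_at"] for i in incidents])

def mttr(incidents: list[dict]) -> float:
    """Mean Time To Recovery: page → resolution."""
    return mean_minutes([i["resolved_at"] - i["paged_at"] for i in incidents])
```

Computed this way per severity and per month, these feed the dashboards described in the implementation checklist.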

Continuous Improvement Loop

Incident → Post-Mortem → Action Items → Prevention
   ↑                                          ↓
   └──────────── Fewer Incidents ─────────────┘

Decision Frameworks

Severity Classification Decision Tree

Is production completely down or critical data at risk?
├─ YES → SEV0
└─ NO  → Is major functionality degraded?
          ├─ YES → Is there a workaround?
          │        ├─ YES → SEV1
          │        └─ NO  → SEV0
          └─ NO  → Are customers impacted?
                   ├─ YES → SEV2
                   └─ NO  → SEV3
Use the interactive classifier:
python scripts/classify-severity.py
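The decision tree maps directly to code. This is a minimal non-interactive sketch of the same logic, not the scripts/classify-severity.py implementation; the parameter names are assumptions.

```python
def classify(down_or_data_at_risk: bool, major_degradation: bool,
             workaround_exists: bool, customers_impacted: bool) -> str:
    """Mirror the severity decision tree above."""
    if down_or_data_at_risk:
        return "SEV0"
    if major_degradation:
        # Degradation with no workaround is treated as a critical outage.
        return "SEV1" if workaround_exists else "SEV0"
    return "SEV2" if customers_impacted else "SEV3"
```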

Escalation Matrix

For detailed escalation guidance, see references/escalation-matrix.md.

Mitigation vs. Root Cause

Prioritize Mitigation When:
  • Active customer impact ongoing
  • Quick fix available (rollback, disable feature)
Prioritize Root Cause When:
  • Customer impact already mitigated
  • Fix requires careful analysis
Default: Mitigation first (99% of cases)

Anti-Patterns to Avoid

  • Delayed Declaration: Waiting for certainty before declaring incident
  • Skipping Post-Mortems: "Small" incidents still provide learning
  • Blame Culture: Punishing individuals prevents systemic learning
  • Ignoring Action Items: Post-mortems without follow-through waste time
  • No Clear IC: Multiple people leading creates confusion
  • Alert Fatigue: Noisy, non-actionable alerts cause on-call to ignore pages
  • Hands-On IC: IC should delegate debugging, not do it themselves

Implementation Checklist

Phase 1: Foundation (Week 1)

  • Define severity levels (SEV0-SEV3)
  • Choose incident management platform
  • Set up basic on-call rotation
  • Create incident Slack channel template

Phase 2: Processes (Weeks 2-3)

  • Create first 5 runbooks for common incidents
  • Set up status page
  • Train team on incident response
  • Conduct tabletop exercise

Phase 3: Culture (Weeks 4+)

  • Conduct first blameless post-mortem
  • Establish post-mortem cadence
  • Implement MTTA/MTTR dashboards
  • Track action items in sprint planning

Phase 4: Optimization (Months 3-6)

  • Automate incident declaration
  • Implement runbook automation
  • Monthly disaster recovery drills
  • Quarterly incident trend reviews

Integration with Other Skills

Observability: Monitoring alerts trigger incidents → Use incident-management for response
Disaster Recovery: DR provides recovery procedures → Incident-management provides operational response
Security Incident Response: Similar process with added compliance/forensics
Infrastructure-as-Code: IaC enables fast recovery via automated rebuild
Performance Engineering: Performance incidents trigger response → Performance team investigates post-mitigation

Examples and Templates

Runbook Templates:
  • examples/runbooks/database-failover.md
  • examples/runbooks/cache-invalidation.md
  • examples/runbooks/ddos-mitigation.md
Post-Mortem Template:
  • examples/postmortem-template.md
    - Complete blameless post-mortem structure
Communication Templates:
  • examples/communication-templates.md
    - Status updates, customer emails
On-Call Handoff:
  • examples/oncall-handoff-template.md
    - Weekly handoff format
Integration Scripts:
  • examples/integrations/pagerduty-slack.py
  • examples/integrations/statuspage-auto-update.py
  • examples/integrations/postmortem-generator.py

Scripts

Interactive Severity Classifier:
python scripts/classify-severity.py
Asks questions to determine the appropriate severity level based on impact and urgency.

Further Reading

Books:
  • Google SRE Book: "Postmortem Culture" (Chapter 15)
  • "The Phoenix Project" by Gene Kim
  • "Site Reliability Engineering" (Full book)
Online Resources:
  • Atlassian: "How to Run a Blameless Postmortem"
  • PagerDuty: "Incident Response Guide"
  • Google SRE: "Postmortem Culture: Learning from Failure"
Standards:
  • Incident Command System (ICS) - FEMA standard adapted for tech
  • ITIL Incident Management - Traditional IT service management