incident-response


Incident Response

When to Use

Activate this skill when:
  • Production service is down or returning errors to users
  • Error rate has spiked beyond normal thresholds
  • Performance has degraded significantly (latency increase, timeouts)
  • An alert has fired from the monitoring system
  • Users are reporting issues that indicate a systemic problem
  • A failed deployment needs investigation and remediation
  • Conducting a post-mortem or root cause analysis after an incident
Output: Write runbooks to `docs/runbooks/<service>-runbook.md` and post-mortems to `postmortem-YYYY-MM-DD.md`.
Do NOT use this skill for:
  • Setting up monitoring or alerting rules (use monitoring-setup)
  • Performing routine deployments (use deployment-pipeline)
  • Docker image or infrastructure issues (use docker-best-practices)
  • Feature development or code changes (use python-backend-expert or react-frontend-expert)

Instructions

Severity Classification

Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Impact | Examples | Response Time | Update Cadence |
|---|---|---|---|---|
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |
Escalation rules:
  • SEV1: Page on-call engineer + engineering manager immediately
  • SEV2: Page on-call engineer, notify engineering manager
  • SEV3: Notify on-call engineer via Slack
  • SEV4: Create ticket, address during normal working hours
See references/escalation-contacts.md for the contact matrix.
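The severity table and escalation rules above are easy to encode as a lookup, for example in an alerting bot. A minimal sketch; the field and function names are illustrative, not part of this skill's scripts:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    response_time: str   # how fast the first responder must engage
    update_cadence: str  # how often the IC posts updates
    page_manager: bool   # SEV1/SEV2 escalate beyond the on-call engineer

# Encodes the severity table above.
POLICIES = {
    "SEV1": SeverityPolicy("< 5 min", "every 15 min", page_manager=True),
    "SEV2": SeverityPolicy("< 15 min", "every 30 min", page_manager=True),
    "SEV3": SeverityPolicy("< 1 hour", "every 2 hours", page_manager=False),
    "SEV4": SeverityPolicy("< 4 hours", "daily", page_manager=False),
}

def escalation_targets(severity: str) -> list[str]:
    """Who gets paged/notified, per the escalation rules above."""
    targets = ["on-call engineer"]
    if POLICIES[severity].page_manager:
        targets.append("engineering manager")
    return targets
```

Keeping the policy in one data structure means the alerting path and the documentation cannot silently drift apart.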

5-Minute Triage Workflow

When an incident is detected, follow this triage workflow within the first 5 minutes.
┌─────────────────────────────────────────────────────────┐
│  MINUTE 0-1: Acknowledge and Classify                   │
│  • Acknowledge the alert or report                      │
│  • Assign severity (SEV1-SEV4)                          │
│  • Designate incident commander                         │
├─────────────────────────────────────────────────────────┤
│  MINUTE 1-2: Assess Scope                               │
│  • Check health endpoints for all services              │
│  • Check error rate and latency dashboards              │
│  • Determine: which services are affected?              │
├─────────────────────────────────────────────────────────┤
│  MINUTE 2-3: Identify Recent Changes                    │
│  • Check: was there a recent deployment?                │
│  • Check: any infrastructure changes?                   │
│  • Check: any external dependency issues?               │
├─────────────────────────────────────────────────────────┤
│  MINUTE 3-4: Initial Communication                      │
│  • Post in #incidents channel                           │
│  • Update status page if SEV1/SEV2                      │
│  • Page additional responders if needed                 │
├─────────────────────────────────────────────────────────┤
│  MINUTE 4-5: Begin Investigation or Mitigate            │
│  • If recent deploy: consider immediate rollback        │
│  • If not deploy-related: begin diagnostic commands     │
│  • Start incident timeline log                          │
└─────────────────────────────────────────────────────────┘
Quick health check command:

```bash
./skills/incident-response/scripts/health-check-all-services.sh \
  --output-dir ./incident-triage/
```
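The "start incident timeline log" step in minute 4-5 can be as simple as an append-only file of UTC-stamped entries. A minimal sketch; the default file name is an assumption:

```python
from datetime import datetime, timezone
from pathlib import Path

def log_timeline(entry: str, path: str = "incident-timeline.md") -> str:
    """Append a UTC-timestamped line to the incident timeline file."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S UTC")
    line = f"- {stamp} {entry}"
    with Path(path).open("a") as f:
        f.write(line + "\n")
    return line

# log_timeline("SEV2 declared, IC assigned")
# log_timeline("Rollback of latest deploy started")
```

Starting the log in the first five minutes makes the post-mortem timeline a copy-paste rather than a reconstruction.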

Incident Commander Role

The incident commander (IC) coordinates the response. They do NOT investigate directly.
IC responsibilities:
  1. Coordinate -- Assign tasks to responders, prevent duplicate work
  2. Communicate -- Post regular updates to stakeholders
  3. Decide -- Make go/no-go decisions on rollback, escalation, communication
  4. Track -- Maintain the incident timeline
  5. Close -- Declare the incident resolved and schedule the post-mortem
IC communication template (initial):

```
INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]
```

IC communication template (update):

```
INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
```
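These templates can be filled programmatically, for example by a chat-ops bot posting to #incidents. A minimal sketch of the initial announcement; the function name and sample values are illustrative:

```python
def incident_declared(title: str, severity: str, commander: str,
                      start_time: str, impact: str, next_update: str) -> str:
    """Render the initial IC announcement from the template above."""
    return (
        f"INCIDENT DECLARED: {title}\n"
        f"Severity: {severity}\n"
        f"Commander: {commander}\n"
        f"Start time: {start_time}\n"
        f"Impact: {impact}\n"
        f"Status: Investigating\n"
        f"Next update: {next_update}"
    )

msg = incident_declared("API 500s on checkout", "SEV2", "alice",
                        "2024-01-15T14:30:00Z",
                        "Checkout requests intermittently failing",
                        "15:00 UTC")
```

Generating the message rather than typing it keeps the required fields (severity, commander, next update) from being forgotten under pressure.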

Investigation Steps

Follow these diagnostic steps based on the type of issue.

Application Errors (FastAPI)


1. Check application logs for errors:

```bash
./skills/incident-response/scripts/fetch-logs.sh \
  --service backend \
  --since "15 minutes ago" \
  --output-dir ./incident-logs/
```

2. Check the error rate from logs:

```bash
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"
```

3. Check active connections and request patterns.

4. Check whether the issue is confined to a specific endpoint:

```bash
docker logs app-backend --since 15m 2>&1 \
  | grep "ERROR" \
  | grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn
```

5. Check Python process status:

```bash
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
```

Database Issues (PostgreSQL)


1. Check database connectivity:

```bash
docker exec app-db pg_isready -U postgres
```

2. Check active connections (connection pool exhaustion?):

```bash
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT count(*), state
  FROM pg_stat_activity
  GROUP BY state
  ORDER BY count DESC;
"
```

3. Check for long-running queries (locks, deadlocks?):

```bash
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT pid,
         now() - pg_stat_activity.query_start AS duration,
         query, state
  FROM pg_stat_activity
  WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
    AND state != 'idle'
  ORDER BY duration DESC;
"
```

4. Check for lock contention:

```bash
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT blocked_locks.pid AS blocked_pid,
         blocking_locks.pid AS blocking_pid,
         blocked_activity.query AS blocked_query
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   AND blocking_locks.pid != blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.granted;
"
```

5. Check disk space:

```bash
docker exec app-db df -h /var/lib/postgresql/data
```

Redis Issues


1. Check Redis connectivity:

```bash
docker exec app-redis redis-cli ping
```

2. Check memory usage:

```bash
docker exec app-redis redis-cli info memory | grep used_memory_human
```

3. Check connected clients:

```bash
docker exec app-redis redis-cli info clients | grep connected_clients
```

4. Check the slow log:

```bash
docker exec app-redis redis-cli slowlog get 10
```

5. Check the keyspace:

```bash
docker exec app-redis redis-cli info keyspace
```

Network and Infrastructure


1. Check DNS resolution:

```bash
nslookup api.example.com
```

2. Check SSL certificate expiry:

```bash
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates
```

3. Check container resource usage:

```bash
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
```

4. Check disk space on the host:

```bash
df -h /
```

5. Check that dependent services are reachable:

```bash
curl -sf https://external-api.example.com/health || echo "External API unreachable"
```
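When several dependencies must be checked at once, the curl loop above can be generalized to a small script that reports only the failures. A minimal stdlib sketch; the endpoint URLs are assumptions:

```python
import urllib.request
import urllib.error

def check_dependency(url: str, timeout: float = 3.0) -> tuple[str, bool]:
    """Return (url, healthy) for one dependency's health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Connection refused, DNS failure, non-2xx, or timeout all count as down.
        return url, False

def triage_dependencies(urls: list[str]) -> list[str]:
    """Return the subset of dependency URLs that failed their health check."""
    return [url for url, ok in map(check_dependency, urls) if not ok]

# triage_dependencies([
#     "https://external-api.example.com/health",
#     "https://payments.example.com/health",
# ])
```

A short timeout matters here: during an incident you want a fast, possibly pessimistic answer, not a 60-second hang per dependency.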

Remediation Actions

Immediate Mitigations (apply within minutes)

| Issue | Mitigation | Command |
|---|---|---|
| Bad deployment | Rollback | `./scripts/deploy.sh --rollback --env production --version $PREV_SHA --output-dir ./results/` |
| Connection pool exhausted | Restart backend | `docker restart app-backend` |
| Long-running query | Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Memory leak | Restart service | `docker restart app-backend` |
| Redis full | Flush non-critical keys | `redis-cli --scan --pattern "cache:*" \| xargs redis-cli del` |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Clean logs/temp files | `docker system prune -f` |

Longer-Term Fixes (apply after stabilization)

  1. Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
  2. Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
  3. Add tests -- Write regression tests for the failure scenario
  4. Update runbooks -- Document the new failure mode and remediation steps

Communication Protocol

Internal Communication

Channels:
  • #incidents -- Active incident coordination (SEV1/SEV2)
  • #incidents-low -- SEV3/SEV4 tracking
  • #engineering -- Post-incident summaries
Rules:
  1. All communication happens in the designated incident channel
  2. Use threads for investigation details, keep main channel for status updates
  3. IC posts updates at the defined cadence (see severity table)
  4. Tag relevant people explicitly, do not assume they are watching
  5. Timestamp all significant findings and actions

External Communication (SEV1/SEV2)

Status page update template:

```
[Investigating] We are investigating reports of [issue description].
Users may experience [user-visible impact].
We will provide an update within [time].

[Identified] The issue has been identified as [brief description].
We are working on a fix. Estimated resolution: [time estimate].

[Resolved] The issue affecting [service] has been resolved.
The root cause was [brief description].
We apologize for the disruption and will publish a detailed post-mortem.
```

Post-Mortem / RCA Framework

Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.
See references/post-mortem-template.md for the full template.
Post-mortem principles:
  1. Blameless -- Focus on systems and processes, not individuals
  2. Thorough -- Identify all contributing factors, not just the trigger
  3. Actionable -- Every finding must produce a concrete action item with an owner
  4. Timely -- Conduct within 48 hours while details are fresh
  5. Shared -- Publish to the entire engineering team
Post-mortem structure:
  1. Summary -- What happened, when, and what was the impact
  2. Timeline -- Minute-by-minute account of detection, investigation, mitigation
  3. Root cause -- The fundamental reason the incident occurred
  4. Contributing factors -- Other conditions that made the incident worse
  5. What went well -- Effective parts of the response
  6. What could be improved -- Gaps in detection, response, or tooling
  7. Action items -- Specific tasks with owners and due dates
Five Whys technique for root cause analysis:
Why did users see 500 errors?
  -> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
  -> Because database connections timed out.
Why did database connections time out?
  -> Because the connection pool was exhausted.
Why was the connection pool exhausted?
  -> Because a new endpoint opened connections without releasing them.
Why were connections not released?
  -> Because the endpoint was missing the async context manager for sessions.

Root cause: Missing async context manager for database sessions in new endpoint.
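The root cause in this example reflects a common pattern: acquiring a pooled resource without a scope that guarantees release. A minimal sketch of both the leak and the context-manager fix, using a toy pool rather than any real database library (all names are illustrative):

```python
import asyncio
from contextlib import asynccontextmanager

class TinyPool:
    """Toy connection pool to illustrate the leak from the Five Whys above."""
    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self.in_use = 0

    async def acquire(self):
        await self._sem.acquire()
        self.in_use += 1

    def release(self):
        self.in_use -= 1
        self._sem.release()

    @asynccontextmanager
    async def session(self):
        # The fix: acquire/release tied to scope, so every exit path releases.
        await self.acquire()
        try:
            yield
        finally:
            self.release()

async def leaky_handler(pool: TinyPool):
    await pool.acquire()   # BUG: never released -> pool exhaustion over time

async def fixed_handler(pool: TinyPool):
    async with pool.session():
        pass               # connection returned on exit, even on error

async def main():
    pool = TinyPool(size=5)
    for _ in range(3):
        await leaky_handler(pool)
    leaked = pool.in_use              # 3 connections never returned
    for _ in range(3):
        await fixed_handler(pool)
    return leaked, pool.in_use        # fixed handler leaks nothing further

leaked, still_leaked = asyncio.run(main())
```

Under load, each leaky request permanently removes a connection from the pool; the `async with` version cannot, which is why the post-mortem's action item targets the missing context manager rather than the symptom.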
Generate a structured incident report:

```bash
python skills/incident-response/scripts/generate-incident-report.py \
  --title "Database connection pool exhaustion" \
  --severity SEV2 \
  --start-time "2024-01-15T14:30:00Z" \
  --end-time "2024-01-15T15:15:00Z" \
  --output-dir ./post-mortems/
```

Incident Response Scripts

| Script | Purpose | Usage |
|---|---|---|
| `scripts/fetch-logs.sh` | Fetch recent logs from services | `./scripts/fetch-logs.sh --service backend --since "30m" --output-dir ./logs/` |
| `scripts/health-check-all-services.sh` | Check health of all services | `./scripts/health-check-all-services.sh --output-dir ./health/` |
| `scripts/generate-incident-report.py` | Generate structured incident report | `python scripts/generate-incident-report.py --title "..." --severity SEV1 --output-dir ./reports/` |

Quick Reference: Common Incident Patterns

| Pattern | Symptom | Likely Cause | First Action |
|---|---|---|---|
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check `docker ps`, restart if needed |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check `docker stats`, restart |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |

Output Files

**Runbooks:** Write to `docs/runbooks/<service>-runbook.md`:

```markdown
# Runbook: [Service Name]

## Service Overview
- Purpose, dependencies, critical paths

## Common Issues

### Issue 1: [Description]
- Symptoms: [What you see]
- Diagnosis: [Commands to run]
- Resolution: [Steps to fix]

## Escalation
- On-call: #ops-oncall
- Service owner: @team-name
```

**Post-mortems:** Write to `postmortem-YYYY-MM-DD.md`:

```markdown
# Post-Mortem: [Incident Title]

## Summary
- Date: YYYY-MM-DD
- Severity: SEV1-4
- Duration: X hours
- Impact: [Users/revenue affected]

## Timeline
- HH:MM - [Event]

## Root Cause
[Technical explanation]

## Action Items
- [Preventive measure] - Owner: @name - Due: YYYY-MM-DD
```