incident-response


Incident Response

When to Use

Activate this skill when:
  • Production service is down or returning errors to users
  • Error rate has spiked beyond normal thresholds
  • Performance has degraded significantly (latency increase, timeouts)
  • An alert has fired from the monitoring system
  • Users are reporting issues that indicate a systemic problem
  • A failed deployment needs investigation and remediation
  • Conducting a post-mortem or root cause analysis after an incident
Output: Write runbooks to `docs/runbooks/<service>-runbook.md` and post-mortems to `postmortem-YYYY-MM-DD.md`.
Do NOT use this skill for:
  • Setting up monitoring or alerting rules (use monitoring-setup)
  • Performing routine deployments (use deployment-pipeline)
  • Docker image or infrastructure issues (use docker-best-practices)
  • Feature development or code changes (use python-backend-expert or react-frontend-expert)

Instructions

Severity Classification

Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Impact | Examples | Response Time | Update Cadence |
|---|---|---|---|---|
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |
Escalation rules:
  • SEV1: Page on-call engineer + engineering manager immediately
  • SEV2: Page on-call engineer, notify engineering manager
  • SEV3: Notify on-call engineer via Slack
  • SEV4: Create ticket, address during normal working hours
See references/escalation-contacts.md for the contact matrix.
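The severity table and escalation rules above are easy to encode as a lookup, for example in an alerting bot. A minimal sketch; the field and function names are illustrative, not part of this skill's scripts:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    response_time: str   # how fast the first responder must engage
    update_cadence: str  # how often the IC posts updates
    page_manager: bool   # SEV1/SEV2 escalate beyond the on-call engineer

# Encodes the severity table above.
POLICIES = {
    "SEV1": SeverityPolicy("< 5 min", "every 15 min", page_manager=True),
    "SEV2": SeverityPolicy("< 15 min", "every 30 min", page_manager=True),
    "SEV3": SeverityPolicy("< 1 hour", "every 2 hours", page_manager=False),
    "SEV4": SeverityPolicy("< 4 hours", "daily", page_manager=False),
}

def escalation_targets(severity: str) -> list[str]:
    """Who gets paged/notified, per the escalation rules above."""
    targets = ["on-call engineer"]
    if POLICIES[severity].page_manager:
        targets.append("engineering manager")
    return targets
```

Keeping the policy in one data structure means the alerting path and the documentation cannot silently drift apart.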

5-Minute Triage Workflow

When an incident is detected, follow this triage workflow within the first 5 minutes.
┌─────────────────────────────────────────────────────────┐
│  MINUTE 0-1: Acknowledge and Classify                   │
│  • Acknowledge the alert or report                      │
│  • Assign severity (SEV1-SEV4)                          │
│  • Designate incident commander                         │
├─────────────────────────────────────────────────────────┤
│  MINUTE 1-2: Assess Scope                               │
│  • Check health endpoints for all services              │
│  • Check error rate and latency dashboards              │
│  • Determine: which services are affected?              │
├─────────────────────────────────────────────────────────┤
│  MINUTE 2-3: Identify Recent Changes                    │
│  • Check: was there a recent deployment?                │
│  • Check: any infrastructure changes?                   │
│  • Check: any external dependency issues?               │
├─────────────────────────────────────────────────────────┤
│  MINUTE 3-4: Initial Communication                      │
│  • Post in #incidents channel                           │
│  • Update status page if SEV1/SEV2                      │
│  • Page additional responders if needed                 │
├─────────────────────────────────────────────────────────┤
│  MINUTE 4-5: Begin Investigation or Mitigate            │
│  • If recent deploy: consider immediate rollback        │
│  • If not deploy-related: begin diagnostic commands     │
│  • Start incident timeline log                          │
└─────────────────────────────────────────────────────────┘
Quick health check command:

```bash
./skills/incident-response/scripts/health-check-all-services.sh \
  --output-dir ./incident-triage/
```
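The "start incident timeline log" step in minute 4-5 can be as simple as an append-only file of UTC-stamped entries. A minimal sketch; the default file name is an assumption:

```python
from datetime import datetime, timezone
from pathlib import Path

def log_timeline(entry: str, path: str = "incident-timeline.md") -> str:
    """Append a UTC-timestamped line to the incident timeline file."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S UTC")
    line = f"- {stamp} {entry}"
    with Path(path).open("a") as f:
        f.write(line + "\n")
    return line

# log_timeline("SEV2 declared, IC assigned")
# log_timeline("Rollback of latest deploy started")
```

Starting the log in the first five minutes makes the post-mortem timeline a copy-paste rather than a reconstruction.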

Incident Commander Role

The incident commander (IC) coordinates the response. They do NOT investigate directly.
IC responsibilities:
  1. Coordinate -- Assign tasks to responders, prevent duplicate work
  2. Communicate -- Post regular updates to stakeholders
  3. Decide -- Make go/no-go decisions on rollback, escalation, communication
  4. Track -- Maintain the incident timeline
  5. Close -- Declare the incident resolved and schedule the post-mortem
IC communication template (initial):

```
INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]
```

IC communication template (update):

```
INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
```
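These templates can be filled programmatically, for example by a chat-ops bot posting to #incidents. A minimal sketch of the initial announcement; the function name and sample values are illustrative:

```python
def incident_declared(title: str, severity: str, commander: str,
                      start_time: str, impact: str, next_update: str) -> str:
    """Render the initial IC announcement from the template above."""
    return (
        f"INCIDENT DECLARED: {title}\n"
        f"Severity: {severity}\n"
        f"Commander: {commander}\n"
        f"Start time: {start_time}\n"
        f"Impact: {impact}\n"
        f"Status: Investigating\n"
        f"Next update: {next_update}"
    )

msg = incident_declared("API 500s on checkout", "SEV2", "alice",
                        "2024-01-15T14:30:00Z",
                        "Checkout requests intermittently failing",
                        "15:00 UTC")
```

Generating the message rather than typing it keeps the required fields (severity, commander, next update) from being forgotten under pressure.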

Investigation Steps

Follow these diagnostic steps based on the type of issue.

Application Errors (FastAPI)


1. Check application logs for errors:

```bash
./skills/incident-response/scripts/fetch-logs.sh \
  --service backend \
  --since "15 minutes ago" \
  --output-dir ./incident-logs/
```

2. Check the error rate from logs:

```bash
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"
```

3. Check active connections and request patterns.

4. Check whether the issue is confined to a specific endpoint:

```bash
docker logs app-backend --since 15m 2>&1 \
  | grep "ERROR" \
  | grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn
```

5. Check Python process status:

```bash
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
```

Database Issues (PostgreSQL)


1. Check database connectivity:

```bash
docker exec app-db pg_isready -U postgres
```

2. Check active connections (connection pool exhaustion?):

```bash
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT count(*), state
  FROM pg_stat_activity
  GROUP BY state
  ORDER BY count DESC;
"
```

3. Check for long-running queries (locks, deadlocks?):

```bash
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT pid,
         now() - pg_stat_activity.query_start AS duration,
         query, state
  FROM pg_stat_activity
  WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
    AND state != 'idle'
  ORDER BY duration DESC;
"
```

4. Check for lock contention:

```bash
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT blocked_locks.pid AS blocked_pid,
         blocking_locks.pid AS blocking_pid,
         blocked_activity.query AS blocked_query
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.relation = blocked_locks.relation
   AND blocking_locks.pid != blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.granted;
"
```

5. Check disk space:

```bash
docker exec app-db df -h /var/lib/postgresql/data
```

Redis Issues


1. Check Redis connectivity:

```bash
docker exec app-redis redis-cli ping
```

2. Check memory usage:

```bash
docker exec app-redis redis-cli info memory | grep used_memory_human
```

3. Check connected clients:

```bash
docker exec app-redis redis-cli info clients | grep connected_clients
```

4. Check the slow log:

```bash
docker exec app-redis redis-cli slowlog get 10
```

5. Check the keyspace:

```bash
docker exec app-redis redis-cli info keyspace
```

Network and Infrastructure


1. Check DNS resolution:

```bash
nslookup api.example.com
```

2. Check SSL certificate expiry:

```bash
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates
```

3. Check container resource usage:

```bash
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
```

4. Check disk space on the host:

```bash
df -h /
```

5. Check that dependent services are reachable:

```bash
curl -sf https://external-api.example.com/health || echo "External API unreachable"
```
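When several dependencies must be checked at once, the curl loop above can be generalized to a small script that reports only the failures. A minimal stdlib sketch; the endpoint URLs are assumptions:

```python
import urllib.request
import urllib.error

def check_dependency(url: str, timeout: float = 3.0) -> tuple[str, bool]:
    """Return (url, healthy) for one dependency's health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Connection refused, DNS failure, non-2xx, or timeout all count as down.
        return url, False

def triage_dependencies(urls: list[str]) -> list[str]:
    """Return the subset of dependency URLs that failed their health check."""
    return [url for url, ok in map(check_dependency, urls) if not ok]

# triage_dependencies([
#     "https://external-api.example.com/health",
#     "https://payments.example.com/health",
# ])
```

A short timeout matters here: during an incident you want a fast, possibly pessimistic answer, not a 60-second hang per dependency.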

Remediation Actions

Immediate Mitigations (apply within minutes)

| Issue | Mitigation | Command |
|---|---|---|
| Bad deployment | Rollback | `./scripts/deploy.sh --rollback --env production --version $PREV_SHA --output-dir ./results/` |
| Connection pool exhausted | Restart backend | `docker restart app-backend` |
| Long-running query | Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Memory leak | Restart service | `docker restart app-backend` |
| Redis full | Flush non-critical keys | `redis-cli --scan --pattern "cache:*" \| xargs redis-cli del` |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Clean logs/temp files | `docker system prune -f` |

Longer-Term Fixes (apply after stabilization)

  1. Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
  2. Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
  3. Add tests -- Write regression tests for the failure scenario
  4. Update runbooks -- Document the new failure mode and remediation steps

Communication Protocol

Internal Communication

Channels:
  • #incidents -- Active incident coordination (SEV1/SEV2)
  • #incidents-low -- SEV3/SEV4 tracking
  • #engineering -- Post-incident summaries
Rules:
  1. All communication happens in the designated incident channel
  2. Use threads for investigation details, keep main channel for status updates
  3. IC posts updates at the defined cadence (see severity table)
  4. Tag relevant people explicitly, do not assume they are watching
  5. Timestamp all significant findings and actions

External Communication (SEV1/SEV2)

Status page update template:

```
[Investigating] We are investigating reports of [issue description].
Users may experience [user-visible impact].
We will provide an update within [time].

[Identified] The issue has been identified as [brief description].
We are working on a fix. Estimated resolution: [time estimate].

[Resolved] The issue affecting [service] has been resolved.
The root cause was [brief description].
We apologize for the disruption and will publish a detailed post-mortem.
```

Post-Mortem / RCA Framework

Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.
See references/post-mortem-template.md for the full template.
Post-mortem principles:
  1. Blameless -- Focus on systems and processes, not individuals
  2. Thorough -- Identify all contributing factors, not just the trigger
  3. Actionable -- Every finding must produce a concrete action item with an owner
  4. Timely -- Conduct within 48 hours while details are fresh
  5. Shared -- Publish to the entire engineering team
Post-mortem structure:
  1. Summary -- What happened, when, and what was the impact
  2. Timeline -- Minute-by-minute account of detection, investigation, mitigation
  3. Root cause -- The fundamental reason the incident occurred
  4. Contributing factors -- Other conditions that made the incident worse
  5. What went well -- Effective parts of the response
  6. What could be improved -- Gaps in detection, response, or tooling
  7. Action items -- Specific tasks with owners and due dates
Five Whys technique for root cause analysis:
Why did users see 500 errors?
  -> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
  -> Because database connections timed out.
Why did database connections time out?
  -> Because the connection pool was exhausted.
Why was the connection pool exhausted?
  -> Because a new endpoint opened connections without releasing them.
Why were connections not released?
  -> Because the endpoint was missing the async context manager for sessions.

Root cause: Missing async context manager for database sessions in new endpoint.
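The root cause in this example reflects a common pattern: acquiring a pooled resource without a scope that guarantees release. A minimal sketch of both the leak and the context-manager fix, using a toy pool rather than any real database library (all names are illustrative):

```python
import asyncio
from contextlib import asynccontextmanager

class TinyPool:
    """Toy connection pool to illustrate the leak from the Five Whys above."""
    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self.in_use = 0

    async def acquire(self):
        await self._sem.acquire()
        self.in_use += 1

    def release(self):
        self.in_use -= 1
        self._sem.release()

    @asynccontextmanager
    async def session(self):
        # The fix: acquire/release tied to scope, so every exit path releases.
        await self.acquire()
        try:
            yield
        finally:
            self.release()

async def leaky_handler(pool: TinyPool):
    await pool.acquire()   # BUG: never released -> pool exhaustion over time

async def fixed_handler(pool: TinyPool):
    async with pool.session():
        pass               # connection returned on exit, even on error

async def main():
    pool = TinyPool(size=5)
    for _ in range(3):
        await leaky_handler(pool)
    leaked = pool.in_use              # 3 connections never returned
    for _ in range(3):
        await fixed_handler(pool)
    return leaked, pool.in_use        # fixed handler leaks nothing further

leaked, still_leaked = asyncio.run(main())
```

Under load, each leaky request permanently removes a connection from the pool; the `async with` version cannot, which is why the post-mortem's action item targets the missing context manager rather than the symptom.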
Generate a structured incident report:

```bash
python skills/incident-response/scripts/generate-incident-report.py \
  --title "Database connection pool exhaustion" \
  --severity SEV2 \
  --start-time "2024-01-15T14:30:00Z" \
  --end-time "2024-01-15T15:15:00Z" \
  --output-dir ./post-mortems/
```

Incident Response Scripts

| Script | Purpose | Usage |
|---|---|---|
| `scripts/fetch-logs.sh` | Fetch recent logs from services | `./scripts/fetch-logs.sh --service backend --since "30m" --output-dir ./logs/` |
| `scripts/health-check-all-services.sh` | Check health of all services | `./scripts/health-check-all-services.sh --output-dir ./health/` |
| `scripts/generate-incident-report.py` | Generate structured incident report | `python scripts/generate-incident-report.py --title "..." --severity SEV1 --output-dir ./reports/` |

Quick Reference: Common Incident Patterns

| Pattern | Symptom | Likely Cause | First Action |
|---|---|---|---|
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check `docker ps`, restart if needed |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check `docker stats`, restart |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |

Output Files

**Runbooks:** Write to `docs/runbooks/<service>-runbook.md`:

```markdown
# Runbook: [Service Name]

## Service Overview
- Purpose, dependencies, critical paths

## Common Issues

### Issue 1: [Description]
- Symptoms: [What you see]
- Diagnosis: [Commands to run]
- Resolution: [Steps to fix]

## Escalation
- On-call: #ops-oncall
- Service owner: @team-name
```

**Post-mortems:** Write to `postmortem-YYYY-MM-DD.md`:

```markdown
# Post-Mortem: [Incident Title]

## Summary
- Date: YYYY-MM-DD
- Severity: SEV1-4
- Duration: X hours
- Impact: [Users/revenue affected]

## Timeline
- HH:MM - [Event]

## Root Cause
[Technical explanation]

## Action Items
- [Preventive measure] - Owner: @name - Due: YYYY-MM-DD
```