Postmortem Writing


Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

When to Use This Skill


  • Conducting post-incident reviews
  • Writing postmortem documents
  • Facilitating blameless postmortem meetings
  • Identifying root causes and contributing factors
  • Creating actionable follow-up items
  • Building organizational learning culture

Core Concepts


1. Blameless Culture


| Blame-Focused | Blameless |
| --- | --- |
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |

2. Postmortem Triggers


  • SEV1 or SEV2 incidents
  • Customer-facing outages > 15 minutes
  • Data loss or security incidents
  • Near-misses that could have been severe
  • Novel failure modes
  • Incidents requiring unusual intervention

Quick Start


Postmortem Timeline


Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents

Templates


Template 1: Standard Postmortem



Postmortem: [Incident Title]


Date: 2024-01-15
Authors: @alice, @bob
Status: Draft | In Review | Final
Incident Severity: SEV2
Incident Duration: 47 minutes

Executive Summary


On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
Impact:
  • 12,000 customers unable to complete purchases
  • Estimated revenue loss: $45,000
  • 847 support tickets created
  • No data loss or security implications

Timeline (All times UTC)


| Time | Event |
| --- | --- |
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to roll back deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |

Root Cause Analysis


What Happened


The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
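The failure mode can be sketched without a real database. The following minimal Java model (class and method names are invented for illustration, not taken from the incident's codebase) represents the database's fixed connection capacity as a semaphore, and contrasts the pooled borrow-and-return pattern with the per-request open-and-leak pattern:

```java
import java.util.concurrent.Semaphore;

public class PoolExhaustionDemo {
    static final int MAX_CONNECTIONS = 5; // the database's hard connection cap

    // Broken pattern: each request grabs a fresh connection slot and never
    // returns it for reuse, roughly what per-request
    // DriverManager.getConnection() amounted to. Returns requests served.
    static int serveDirect(int n) {
        Semaphore slots = new Semaphore(MAX_CONNECTIONS);
        int served = 0;
        for (int i = 0; i < n; i++) {
            if (slots.tryAcquire()) {
                served++; // slot is leaked: never released back
            }
        }
        return served;
    }

    // Pooled pattern: borrow a slot, do the work, return it to the pool.
    static int servePooled(int n) {
        Semaphore slots = new Semaphore(MAX_CONNECTIONS);
        int served = 0;
        for (int i = 0; i < n; i++) {
            if (slots.tryAcquire()) {
                try {
                    served++; // query would run here
                } finally {
                    slots.release(); // connection goes back to the pool
                }
            }
        }
        return served;
    }

    public static void main(String[] args) {
        System.out.println("direct: " + serveDirect(100) + "/100 served"); // 5/100
        System.out.println("pooled: " + servePooled(100) + "/100 served"); // 100/100
    }
}
```

Under sustained traffic, the direct pattern serves only the first `MAX_CONNECTIONS` requests and then fails everything, which matches the 100/100 connection saturation seen in the incident metrics.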

Why It Happened


  1. Proximate Cause: Code change in `PaymentRepository.java` replaced the pooled `DataSource` with direct `DriverManager.getConnection()` calls.
  2. Contributing Factors:
    • Code review did not catch the connection handling change
    • No integration tests specifically for connection pool behavior
    • Staging environment has lower traffic, masking the issue
    • Database connection metrics alert threshold was too high (90%)
  3. 5 Whys Analysis:
    • Why did the service fail? → Database connections exhausted
    • Why were connections exhausted? → Each request opened a new connection
    • Why did each request open a new connection? → Code bypassed the connection pool
    • Why did the code bypass the connection pool? → Developer was unfamiliar with codebase patterns
    • Why was the developer unfamiliar? → No documentation on connection management patterns

System Diagram



[Client] → [Load Balancer] → [Payment Service] → [Database]
    Payment Service → Database link: connection pool (broken); direct connections opened instead (cause)


Detection


What Worked


  • Error rate alert fired within 8 minutes of deployment
  • Grafana dashboard clearly showed connection spike
  • On-call response was swift (2 minute acknowledgment)

What Didn't Work


  • Database connection metric alert threshold too high
  • No deployment-correlated alerting
  • Canary deployment would have caught this earlier

Detection Gap


The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
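One way such a deployment-aware alert could work is to tag each firing alert with any deployment that completed shortly before it, so responders immediately see "v2.3.4 deployed 8 minutes ago" alongside the page. A hedged Java sketch (the class name, method, and 30-minute window are illustrative assumptions, not part of the incident tooling):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;

public class DeployCorrelator {
    // How far back to look for a deployment when an alert fires (assumed).
    static final Duration WINDOW = Duration.ofMinutes(30);

    // Returns the most recent deployment that completed within WINDOW
    // before the alert, if any, so it can be attached to the page.
    static Optional<String> correlate(Map<String, Instant> deployTimes, Instant alertAt) {
        return deployTimes.entrySet().stream()
            .filter(e -> !e.getValue().isAfter(alertAt))
            .filter(e -> Duration.between(e.getValue(), alertAt).compareTo(WINDOW) <= 0)
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        Map<String, Instant> deploys = Map.of(
            "v2.3.3", Instant.parse("2024-01-15T09:02:00Z"),
            "v2.3.4", Instant.parse("2024-01-15T14:23:00Z"));
        Instant alert = Instant.parse("2024-01-15T14:31:00Z");
        // Prints the suspect release for the 14:31 alert: v2.3.4
        System.out.println(correlate(deploys, alert).orElse("none"));
    }
}
```

With this incident's timestamps, the 14:31 alert correlates to the 14:23 v2.3.4 deploy, removing the 10 minutes responders spent manually checking deployment history.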

Response


What Worked


  • On-call engineer quickly identified database as the issue
  • Rollback decision was made decisively
  • Clear communication in incident channel

What Could Be Improved


  • Took 10 minutes to correlate issue with recent deployment
  • Had to manually check deployment history
  • Rollback took 12 minutes (could be faster)

Impact


Customer Impact


  • 12,000 unique customers affected
  • Average impact duration: 35 minutes
  • 847 support tickets (23% of affected users)
  • Customer satisfaction score dropped 12 points

Business Impact


  • Estimated revenue loss: $45,000
  • Support cost: ~$2,500 (agent time)
  • Engineering time: ~8 person-hours

Technical Impact


  • Database primary experienced elevated load
  • Some replica lag during incident
  • No permanent damage to systems

Lessons Learned


What Went Well


  1. Alerting detected the issue before customer reports
  2. Team collaborated effectively under pressure
  3. Rollback procedure worked smoothly
  4. Communication was clear and timely

What Went Wrong


  1. Code review missed critical change
  2. Test coverage gap for connection pooling
  3. Staging environment doesn't reflect production traffic
  4. Alert thresholds were not tuned properly

Where We Got Lucky


  1. Incident occurred during business hours with full team available
  2. Database handled the load without failing completely
  3. No other incidents occurred simultaneously

Action Items


| Priority | Action | Owner | Due Date | Ticket |
| --- | --- | --- | --- | --- |
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |

Appendix


Supporting Data


Error Rate Graph


[Link to Grafana dashboard snapshot]

Database Connection Graph


[Link to metrics]

Related Incidents


  • 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)

References


  • Connection Pool Best Practices
  • Deployment Runbook

Template 2: 5 Whys Analysis



5 Whys Analysis: [Incident]


Problem Statement


Payment service experienced 47-minute outage due to database connection exhaustion.

Analysis


Why #1: Why did the service fail?


Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed connection count at 100/100 (max), with 500+ pending requests.


Why #2: Why were database connections exhausted?


Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.

Why #3: Why did the code bypass the connection pool?


Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.


Why #4: Why wasn't this caught in code review?


Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.


Why #5: Why isn't there a safety net for this type of change?


Answer: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
Evidence: Test suite has no tests for connection handling; wiki has no article on database connections.

Root Causes Identified


  1. Primary: Missing automated tests for infrastructure behavior
  2. Secondary: Insufficient documentation of architectural patterns
  3. Tertiary: Code review checklist doesn't include infrastructure considerations

Systemic Improvements


| Root Cause | Improvement | Type |
| --- | --- | --- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |

Template 3: Quick Postmortem (Minor Incidents)



Quick Postmortem: [Brief Title]

快速事后复盘: [简短标题]

Date: 2024-01-15 | Duration: 12 min | Severity: SEV3

What Happened


API latency spiked to 5s due to cache miss storm after cache flush.

Timeline


  • 10:00 - Cache flush initiated for config update
  • 10:02 - Latency alerts fire
  • 10:05 - Identified as cache miss storm
  • 10:08 - Enabled cache warming
  • 10:12 - Latency normalized

Root Cause


A full cache flush for a minor config update caused a thundering herd of simultaneous backend reads.
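A common complementary mitigation for this thundering-herd pattern (beyond cache warming, and not necessarily what ENG-999 implements) is request coalescing, or "single flight": concurrent misses for the same key share one backend load instead of each hitting the origin. A minimal Java sketch with invented names:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleFlightCache {
    // In-flight (and completed) loads, one shared future per key.
    // This sketch never evicts; real code would expire entries.
    private final Map<String, CompletableFuture<String>> loads = new ConcurrentHashMap<>();

    // Counts expensive origin/DB hits, to show coalescing works.
    public final AtomicInteger backendLoads = new AtomicInteger();

    // All concurrent callers for the same key share one backend load:
    // computeIfAbsent installs the future atomically per key.
    public String get(String key) throws Exception {
        return loads.computeIfAbsent(key,
            k -> CompletableFuture.supplyAsync(() -> loadFromBackend(k))).get();
    }

    private String loadFromBackend(String key) {
        backendLoads.incrementAndGet(); // the expensive miss path
        return "value:" + key;
    }

    public static void main(String[] args) throws Exception {
        SingleFlightCache cache = new SingleFlightCache();
        for (int i = 0; i < 1000; i++) {
            cache.get("config"); // 1000 lookups...
        }
        System.out.println("backend loads: " + cache.backendLoads.get()); // ...one load
    }
}
```

After a flush, the first miss per key triggers one backend read and every other request waits on that result, so the backend sees one load per key rather than one per request.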

Fix


  • Immediate: Enabled cache warming
  • Long-term: Implement partial cache invalidation (ENG-999)

Lessons


Don't full-flush cache in production; use targeted invalidation.

Facilitation Guide


Running a Postmortem Meeting



Meeting Structure (60 minutes)


1. Opening (5 min)


  • Remind everyone of blameless culture
  • "We're here to learn, not to blame"
  • Review meeting norms

2. Timeline Review (15 min)


  • Walk through events chronologically
  • Ask clarifying questions
  • Identify gaps in timeline

3. Analysis Discussion (20 min)


  • What failed?
  • Why did it fail?
  • What conditions allowed this?
  • What would have prevented it?

4. Action Items (15 min)


  • Brainstorm improvements
  • Prioritize by impact and effort
  • Assign owners and due dates

5. Closing (5 min)


  • Summarize key learnings
  • Confirm action item owners
  • Schedule follow-up if needed

Facilitation Tips


  • Keep discussion on track
  • Redirect blame to systems
  • Encourage quiet participants
  • Document dissenting views
  • Time-box tangents

Anti-Patterns to Avoid


| Anti-Pattern | Problem | Better Approach |
| --- | --- | --- |
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |

Best Practices


Do's


  • Start immediately - Memory fades fast
  • Be specific - Exact times, exact errors
  • Include graphs - Visual evidence
  • Assign owners - No orphan action items
  • Share widely - Organizational learning

Don'ts


  • Don't name and shame - Ever
  • Don't skip small incidents - They reveal patterns
  • Don't make it a blame doc - That kills learning
  • Don't create busywork - Actions should be meaningful
  • Don't skip follow-up - Verify actions completed

Resources
