Postmortem Writing


Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

When to Use This Skill


  • Conducting post-incident reviews
  • Writing postmortem documents
  • Facilitating blameless postmortem meetings
  • Identifying root causes and contributing factors
  • Creating actionable follow-up items
  • Building organizational learning culture

Core Concepts


1. Blameless Culture


| Blame-Focused | Blameless |
| --- | --- |
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |

2. Postmortem Triggers


  • SEV1 or SEV2 incidents
  • Customer-facing outages > 15 minutes
  • Data loss or security incidents
  • Near-misses that could have been severe
  • Novel failure modes
  • Incidents requiring unusual intervention

Quick Start


Postmortem Timeline


Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents

Templates


Template 1: Standard Postmortem



Postmortem: [Incident Title]


Date: 2024-01-15
Authors: @alice, @bob
Status: Draft | In Review | Final
Incident Severity: SEV2
Incident Duration: 47 minutes

Executive Summary


On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
Impact:
  • 12,000 customers unable to complete purchases
  • Estimated revenue loss: $45,000
  • 847 support tickets created
  • No data loss or security implications

Timeline (All times UTC)


| Time | Event |
| --- | --- |
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert: `payment_error_rate > 5%` |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to roll back deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |

Root Cause Analysis


What Happened


The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
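The failure mode can be sketched without a real database. The following minimal Java model (class and method names are invented for illustration, not taken from the incident's codebase) represents the database's fixed connection capacity as a semaphore, and contrasts the pooled borrow-and-return pattern with the per-request open-and-leak pattern:

```java
import java.util.concurrent.Semaphore;

public class PoolExhaustionDemo {
    static final int MAX_CONNECTIONS = 5; // the database's hard connection cap

    // Broken pattern: each request grabs a fresh connection slot and never
    // returns it for reuse, roughly what per-request
    // DriverManager.getConnection() amounted to. Returns requests served.
    static int serveDirect(int n) {
        Semaphore slots = new Semaphore(MAX_CONNECTIONS);
        int served = 0;
        for (int i = 0; i < n; i++) {
            if (slots.tryAcquire()) {
                served++; // slot is leaked: never released back
            }
        }
        return served;
    }

    // Pooled pattern: borrow a slot, do the work, return it to the pool.
    static int servePooled(int n) {
        Semaphore slots = new Semaphore(MAX_CONNECTIONS);
        int served = 0;
        for (int i = 0; i < n; i++) {
            if (slots.tryAcquire()) {
                try {
                    served++; // query would run here
                } finally {
                    slots.release(); // connection goes back to the pool
                }
            }
        }
        return served;
    }

    public static void main(String[] args) {
        System.out.println("direct: " + serveDirect(100) + "/100 served"); // 5/100
        System.out.println("pooled: " + servePooled(100) + "/100 served"); // 100/100
    }
}
```

Under sustained traffic, the direct pattern serves only the first `MAX_CONNECTIONS` requests and then fails everything, which matches the 100/100 connection saturation seen in the incident metrics.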

Why It Happened


  1. Proximate Cause: Code change in `PaymentRepository.java` replaced the pooled `DataSource` with direct `DriverManager.getConnection()` calls.
  2. Contributing Factors:
    • Code review did not catch the connection handling change
    • No integration tests specifically for connection pool behavior
    • Staging environment has lower traffic, masking the issue
    • Database connection metrics alert threshold was too high (90%)
  3. 5 Whys Analysis:
    • Why did the service fail? → Database connections exhausted
    • Why were connections exhausted? → Each request opened a new connection
    • Why did each request open a new connection? → Code bypassed the connection pool
    • Why did the code bypass the connection pool? → Developer was unfamiliar with codebase patterns
    • Why was the developer unfamiliar? → No documentation on connection management patterns

System Diagram



[Client] → [Load Balancer] → [Payment Service] → [Database]
    Payment Service → Database link: connection pool (broken); direct connections opened instead (cause)


Detection


What Worked


  • Error rate alert fired within 8 minutes of deployment
  • Grafana dashboard clearly showed connection spike
  • On-call response was swift (2 minute acknowledgment)

What Didn't Work


  • Database connection metric alert threshold too high
  • No deployment-correlated alerting
  • Canary deployment would have caught this earlier

Detection Gap


The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
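One way such a deployment-aware alert could work is to tag each firing alert with any deployment that completed shortly before it, so responders immediately see "v2.3.4 deployed 8 minutes ago" alongside the page. A hedged Java sketch (the class name, method, and 30-minute window are illustrative assumptions, not part of the incident tooling):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;

public class DeployCorrelator {
    // How far back to look for a deployment when an alert fires (assumed).
    static final Duration WINDOW = Duration.ofMinutes(30);

    // Returns the most recent deployment that completed within WINDOW
    // before the alert, if any, so it can be attached to the page.
    static Optional<String> correlate(Map<String, Instant> deployTimes, Instant alertAt) {
        return deployTimes.entrySet().stream()
            .filter(e -> !e.getValue().isAfter(alertAt))
            .filter(e -> Duration.between(e.getValue(), alertAt).compareTo(WINDOW) <= 0)
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        Map<String, Instant> deploys = Map.of(
            "v2.3.3", Instant.parse("2024-01-15T09:02:00Z"),
            "v2.3.4", Instant.parse("2024-01-15T14:23:00Z"));
        Instant alert = Instant.parse("2024-01-15T14:31:00Z");
        // Prints the suspect release for the 14:31 alert: v2.3.4
        System.out.println(correlate(deploys, alert).orElse("none"));
    }
}
```

With this incident's timestamps, the 14:31 alert correlates to the 14:23 v2.3.4 deploy, removing the 10 minutes responders spent manually checking deployment history.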

Response


What Worked


  • On-call engineer quickly identified database as the issue
  • Rollback decision was made decisively
  • Clear communication in incident channel

What Could Be Improved


  • Took 10 minutes to correlate issue with recent deployment
  • Had to manually check deployment history
  • Rollback took 12 minutes (could be faster)

Impact


Customer Impact


  • 12,000 unique customers affected
  • Average impact duration: 35 minutes
  • 847 support tickets (23% of affected users)
  • Customer satisfaction score dropped 12 points

Business Impact


  • Estimated revenue loss: $45,000
  • Support cost: ~$2,500 (agent time)
  • Engineering time: ~8 person-hours

Technical Impact


  • Database primary experienced elevated load
  • Some replica lag during incident
  • No permanent damage to systems

Lessons Learned


What Went Well


  1. Alerting detected the issue before customer reports
  2. Team collaborated effectively under pressure
  3. Rollback procedure worked smoothly
  4. Communication was clear and timely

What Went Wrong


  1. Code review missed critical change
  2. Test coverage gap for connection pooling
  3. Staging environment doesn't reflect production traffic
  4. Alert thresholds were not tuned properly

Where We Got Lucky


  1. Incident occurred during business hours with full team available
  2. Database handled the load without failing completely
  3. No other incidents occurred simultaneously

Action Items


| Priority | Action | Owner | Due Date | Ticket |
| --- | --- | --- | --- | --- |
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |

Appendix


Supporting Data


Error Rate Graph


[Link to Grafana dashboard snapshot]

Database Connection Graph


[Link to metrics]

Related Incidents


  • 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)

References


  • Connection Pool Best Practices
  • Deployment Runbook

Template 2: 5 Whys Analysis



5 Whys Analysis: [Incident]


Problem Statement


Payment service experienced 47-minute outage due to database connection exhaustion.

Analysis


Why #1: Why did the service fail?


Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed connection count at 100/100 (max), with 500+ pending requests.


Why #2: Why were database connections exhausted?


Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.

Why #3: Why did the code bypass the connection pool?


Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.


Why #4: Why wasn't this caught in code review?


Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.


Why #5: Why isn't there a safety net for this type of change?


Answer: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
Evidence: Test suite has no tests for connection handling; wiki has no article on database connections.

Root Causes Identified


  1. Primary: Missing automated tests for infrastructure behavior
  2. Secondary: Insufficient documentation of architectural patterns
  3. Tertiary: Code review checklist doesn't include infrastructure considerations

Systemic Improvements


| Root Cause | Improvement | Type |
| --- | --- | --- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |

Template 3: Quick Postmortem (Minor Incidents)



Quick Postmortem: [Brief Title]

快速事后复盘: [简短标题]

Date: 2024-01-15 | Duration: 12 min | Severity: SEV3

What Happened


API latency spiked to 5s due to cache miss storm after cache flush.

Timeline


  • 10:00 - Cache flush initiated for config update
  • 10:02 - Latency alerts fire
  • 10:05 - Identified as cache miss storm
  • 10:08 - Enabled cache warming
  • 10:12 - Latency normalized

Root Cause


A full cache flush for a minor config update caused a thundering herd of simultaneous backend reads.
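A common complementary mitigation for this thundering-herd pattern (beyond cache warming, and not necessarily what ENG-999 implements) is request coalescing, or "single flight": concurrent misses for the same key share one backend load instead of each hitting the origin. A minimal Java sketch with invented names:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleFlightCache {
    // In-flight (and completed) loads, one shared future per key.
    // This sketch never evicts; real code would expire entries.
    private final Map<String, CompletableFuture<String>> loads = new ConcurrentHashMap<>();

    // Counts expensive origin/DB hits, to show coalescing works.
    public final AtomicInteger backendLoads = new AtomicInteger();

    // All concurrent callers for the same key share one backend load:
    // computeIfAbsent installs the future atomically per key.
    public String get(String key) throws Exception {
        return loads.computeIfAbsent(key,
            k -> CompletableFuture.supplyAsync(() -> loadFromBackend(k))).get();
    }

    private String loadFromBackend(String key) {
        backendLoads.incrementAndGet(); // the expensive miss path
        return "value:" + key;
    }

    public static void main(String[] args) throws Exception {
        SingleFlightCache cache = new SingleFlightCache();
        for (int i = 0; i < 1000; i++) {
            cache.get("config"); // 1000 lookups...
        }
        System.out.println("backend loads: " + cache.backendLoads.get()); // ...one load
    }
}
```

After a flush, the first miss per key triggers one backend read and every other request waits on that result, so the backend sees one load per key rather than one per request.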

Fix


  • Immediate: Enabled cache warming
  • Long-term: Implement partial cache invalidation (ENG-999)

Lessons


Don't full-flush cache in production; use targeted invalidation.

Facilitation Guide


Running a Postmortem Meeting



Meeting Structure (60 minutes)


1. Opening (5 min)


  • Remind everyone of blameless culture
  • "We're here to learn, not to blame"
  • Review meeting norms

2. Timeline Review (15 min)


  • Walk through events chronologically
  • Ask clarifying questions
  • Identify gaps in timeline

3. Analysis Discussion (20 min)


  • What failed?
  • Why did it fail?
  • What conditions allowed this?
  • What would have prevented it?

4. Action Items (15 min)


  • Brainstorm improvements
  • Prioritize by impact and effort
  • Assign owners and due dates

5. Closing (5 min)


  • Summarize key learnings
  • Confirm action item owners
  • Schedule follow-up if needed

Facilitation Tips


  • Keep discussion on track
  • Redirect blame to systems
  • Encourage quiet participants
  • Document dissenting views
  • Time-box tangents

Anti-Patterns to Avoid


| Anti-Pattern | Problem | Better Approach |
| --- | --- | --- |
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |

Best Practices


Do's


  • Start immediately - Memory fades fast
  • Be specific - Exact times, exact errors
  • Include graphs - Visual evidence
  • Assign owners - No orphan action items
  • Share widely - Organizational learning

Don'ts


  • Don't name and shame - Ever
  • Don't skip small incidents - They reveal patterns
  • Don't make it a blame doc - That kills learning
  • Don't create busywork - Actions should be meaningful
  • Don't skip follow-up - Verify actions completed

Resources
