Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
When to Use This Skill
- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture
Core Concepts
1. Blameless Culture
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
2. Postmortem Triggers
- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
Quick Start
Postmortem Timeline
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
Templates
Template 1: Standard Postmortem
Postmortem: [Incident Title]
Date: 2024-01-15
Authors: @alice, @bob
Status: Draft | In Review | Final
Incident Severity: SEV2
Incident Duration: 47 minutes
Executive Summary
On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
Impact:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
Timeline (All times UTC)
| Time | Event |
|---|---|
| 14:23 | Deployment v2.3.4 completed to production |
| 14:31 | First alert fires (error rate) |
| 14:33 | On-call engineer @alice acknowledges alert |
| 14:35 | Initial investigation begins, error rate at 23% |
| 14:41 | Incident declared SEV2, @bob joins |
| 14:45 | Database connection exhaustion identified |
| 14:52 | Decision to rollback deployment |
| 14:58 | Rollback to v2.3.3 initiated |
| 15:10 | Rollback complete, error rate dropping |
| 15:18 | Service fully recovered, incident resolved |
Root Cause Analysis
What Happened
The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
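The failure mode above can be reproduced in miniature: requests that borrow and return pooled connections never exhaust the pool, while requests that each open a fresh connection and never return it do. This is a simplified Python simulation of the dynamic, not the incident's actual Java code; the pool size and request counts are illustrative.

```python
# Simplified simulation of the incident's failure mode: a bounded
# connection pool versus code that opens one connection per request.
# Names and sizes are illustrative, not taken from the real service.

class ConnectionPool:
    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.max_connections:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1

    def release(self):
        self.in_use -= 1

def handle_request_pooled(pool):
    """Correct pattern: borrow a connection, do work, return it."""
    pool.acquire()
    try:
        return "ok"
    finally:
        pool.release()

def handle_request_leaky(pool):
    """Broken pattern from the incident: a fresh connection per
    request that is never returned to the pool."""
    pool.acquire()  # models a direct DriverManager.getConnection() call
    return "ok"

pool = ConnectionPool(max_connections=100)

# The pooled handler survives any number of requests.
results = [handle_request_pooled(pool) for _ in range(500)]
assert all(r == "ok" for r in results)

# The leaky handler exhausts the pool after max_connections requests.
failures = 0
for _ in range(500):
    try:
        handle_request_leaky(pool)
    except RuntimeError:
        failures += 1
print(f"failed requests: {failures}")  # 400 of 500 fail once the pool is full
```

The point of the exercise is that the leaky version works perfectly at low traffic (under 100 requests here), which is exactly why staging did not catch it.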
Why It Happened
- Proximate Cause: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
- Contributing Factors:
- Code review did not catch the connection handling change
- No integration tests specifically for connection pool behavior
- Staging environment has lower traffic, masking the issue
- Database connection metrics alert threshold was too high (90%)
- 5 Whys Analysis:
- Why did the service fail? → Database connections exhausted
- Why were connections exhausted? → Each request opened new connection
- Why did each request open new connection? → Code bypassed connection pool
- Why did code bypass connection pool? → Developer unfamiliar with codebase patterns
- Why was developer unfamiliar? → No documentation on connection management patterns
System Diagram
[Client] → [Load Balancer] → [Payment Service] → [Database]
                                    ↓
                        Connection Pool (broken)
                                    ↓
                       Direct connections (cause)
Detection
What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2 minute acknowledgment)
What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- Canary deployment would have caught this earlier
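The first gap, the overly high threshold, maps directly to the P0 action item that lowers it to 70%. In a Prometheus-style setup that change might look like the rule below; the metric and label names are hypothetical, so adapt them to whatever your exporter actually emits.

```yaml
# Hypothetical Prometheus alerting rule; metric and label names are
# illustrative, not the real service's.
groups:
  - name: database-connections
    rules:
      - alert: DBConnectionPoolNearExhaustion
        # Fire at 70% utilization (was 90%, which alerted too late).
        expr: db_connections_in_use / db_connections_max > 0.70
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "DB connection pool above 70% utilization for 2 minutes"
```

The `for: 2m` clause is a trade-off: it suppresses flapping on brief spikes at the cost of two extra minutes of detection latency.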
Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
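A deployment-aware alert is conceptually simple: when an alert fires, check whether a deployment landed shortly before it and surface that deployment to the responder. A minimal Python sketch, assuming deploy and alert timestamps are already available; the 15-minute window is a judgment call, not a recommendation from the incident.

```python
# Minimal sketch of deployment-correlated alerting: flag alerts that
# fire within a window after a deployment. Timestamps are minutes
# since midnight UTC for simplicity; a real system would use proper
# datetimes and pull deploy events from the CD pipeline.

DEPLOY_WINDOW_MIN = 15  # how long after a deploy an alert is "suspicious"

def correlate(alert_time, deployments, window=DEPLOY_WINDOW_MIN):
    """Return the most recent deployment within `window` minutes
    before the alert, or None if no deploy is close enough."""
    candidates = [d for d in deployments
                  if 0 <= alert_time - d["time"] <= window]
    return max(candidates, key=lambda d: d["time"]) if candidates else None

# This incident's timeline: deploy at 14:23, first alert at 14:31.
deployments = [{"time": 14 * 60 + 23, "version": "v2.3.4"}]
suspect = correlate(14 * 60 + 31, deployments)
print(suspect["version"] if suspect else "no recent deploy")  # v2.3.4
```

Had this check run at 14:31, the 10 minutes spent manually correlating the failure with the deployment (noted under Response below) could have been avoided.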
Response
What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
Impact
Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (23% of affected users)
- Customer satisfaction score dropped 12 points
Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
Lessons Learned
What Went Well
- Alerting detected the issue before customer reports
- Team collaborated effectively under pressure
- Rollback procedure worked smoothly
- Communication was clear and timely
What Went Wrong
- Code review missed critical change
- Test coverage gap for connection pooling
- Staging environment doesn't reflect production traffic
- Alert thresholds were not tuned properly
Where We Got Lucky
- Incident occurred during business hours with full team available
- Database handled the load without failing completely
- No other incidents occurred simultaneously
Action Items
| Priority | Action | Owner | Due Date | Ticket |
|---|---|---|---|---|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
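Action items only work if they are tracked; a lightweight audit that every item has an owner, a due date, and a ticket (and that nothing is overdue) can be scripted against exported ticket data. A minimal sketch in Python; the fields mirror the table above, and the sample data, including the deliberately orphaned item, is illustrative.

```python
from datetime import date

# Lightweight audit of postmortem action items: every item must have
# an owner, a due date, and a ticket, and overdue items get flagged.
# Fields mirror the action-item table; the data here is illustrative.

action_items = [
    {"priority": "P0", "action": "Add connection pool integration test",
     "owner": "@alice", "due": date(2024, 1, 22), "ticket": "ENG-1234"},
    {"priority": "P0", "action": "Lower DB connection alert threshold to 70%",
     "owner": "@bob", "due": date(2024, 1, 17), "ticket": "OPS-567"},
    {"priority": "P1", "action": "Document connection management patterns",
     "owner": None, "due": date(2024, 1, 29), "ticket": None},  # orphan!
]

def audit(items, today):
    """Return (orphaned, overdue): items missing an owner or ticket,
    and items whose due date has passed."""
    orphans = [i for i in items if not i["owner"] or not i["ticket"]]
    overdue = [i for i in items if i["due"] < today]
    return orphans, overdue

orphans, overdue = audit(action_items, today=date(2024, 1, 20))
print(len(orphans), "orphaned;", len(overdue), "overdue")  # 1 orphaned; 1 overdue
```

Running a check like this weekly (or wiring the equivalent query into the ticketing system) is one way to honor the "no orphan action items" and "don't skip follow-up" practices later in this guide.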
Appendix
Supporting Data
Error Rate Graph
[Link to Grafana dashboard snapshot]
Database Connection Graph
[Link to metrics]
Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
References
参考资料
- Connection Pool Best Practices
- Deployment Runbook
Template 2: 5 Whys Analysis
5 Whys Analysis: [Incident]
Problem Statement
Payment service experienced 47-minute outage due to database connection exhaustion.
Analysis
Why #1: Why did the service fail?
Answer: Database connections were exhausted, causing all new requests to fail.
Evidence: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
Why #2: Why were database connections exhausted?
Answer: Each incoming request opened a new database connection instead of using the connection pool.
Evidence: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
Why #3: Why did the code bypass the connection pool?
Answer: A developer refactored the repository class and inadvertently changed the connection acquisition method.
Evidence: PR #1234 shows the change, made while fixing a different bug.
Why #4: Why wasn't this caught in code review?
Answer: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
Evidence: Review comments only discuss business logic.
Why #5: Why isn't there a safety net for this type of change?
Answer: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
Evidence: Test suite has no tests for connection handling; wiki has no article on database connections.
Root Causes Identified
- Primary: Missing automated tests for infrastructure behavior
- Secondary: Insufficient documentation of architectural patterns
- Tertiary: Code review checklist doesn't include infrastructure considerations
Systemic Improvements
| Root Cause | Improvement | Type |
|---|---|---|
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs | Document connection patterns | Prevention |
| Review gaps | Update review checklist | Detection |
| No canary | Implement canary deployments | Mitigation |
Template 3: Quick Postmortem (Minor Incidents)
Quick Postmortem: [Brief Title]
Date: 2024-01-15 | Duration: 12 min | Severity: SEV3
What Happened
API latency spiked to 5s due to cache miss storm after cache flush.
Timeline
- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized
Root Cause
Full cache flush for minor config update caused thundering herd.
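The herd effect is easy to model: after a full flush every hot key misses and goes to the database, while a targeted invalidation evicts only the changed key. A simplified Python model of that difference; the key counts are illustrative, and it counts cold misses rather than simulating true concurrent stampedes.

```python
# Simplified model of the cache-flush thundering herd: count how many
# requests fall through to the database after a full flush versus a
# partial (targeted) invalidation. Key counts are illustrative.

def serve(requests, cache):
    """Serve each request key; misses hit the database and warm the cache."""
    db_hits = 0
    for key in requests:
        if key not in cache:
            db_hits += 1          # expensive database round-trip
            cache.add(key)        # cache is warmed on the way back
    return db_hits

hot_keys = [f"config:{i}" for i in range(100)]
requests = hot_keys * 5           # 500 requests over 100 hot keys

# Full flush: the cache starts empty, so every hot key misses once.
full_flush_hits = serve(requests, cache=set())

# Partial invalidation: only the one changed key is evicted.
warm_cache = set(hot_keys) - {"config:42"}
partial_hits = serve(requests, cache=warm_cache)

print(full_flush_hits, partial_hits)  # 100 vs 1
```

In production the gap is worse than this model shows, because concurrent requests for the same cold key all reach the database before the first response warms the cache.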
Fix
- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)
Lessons
Don't full-flush cache in production; use targeted invalidation.
Facilitation Guide
Running a Postmortem Meeting
Meeting Structure (60 minutes)
1. Opening (5 min)
- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms
2. Timeline Review (15 min)
- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline
3. Analysis Discussion (20 min)
- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?
4. Action Items (15 min)
- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates
5. Closing (5 min)
- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed
Facilitation Tips
- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |
Best Practices
Do's
- Start immediately - Memory fades fast
- Be specific - Exact times, exact errors
- Include graphs - Visual evidence
- Assign owners - No orphan action items
- Share widely - Organizational learning
Don'ts
- Don't name and shame - Ever
- Don't skip small incidents - They reveal patterns
- Don't make it a blame doc - That kills learning
- Don't create busywork - Actions should be meaningful
- Don't skip follow-up - Verify actions completed