after-action-report
After-Action Report
Run a structured retrospective on a launch, incident, or completed project. Produce actionable lessons, not just a document.
This skill is for after-the-fact analysis. For active incident response, use `incident-response`. For planning launches, use `launch-runbook`.
When to use
- After any incident (any severity)
- After every major launch
- At the end of a project (sprint retro, quarterly retro, project closeout)
- When a recurring issue has happened enough times to demand investigation
- When a decision didn't work out and the team wants to learn
When NOT to use
- During an active incident (use `incident-response`)
- For pre-launch planning (use `launch-runbook`)
- For one-off bug fixes that don't merit broad analysis
Required inputs
- The event being analyzed (incident, launch, project)
- A timeline reconstructed from logs, chat, tickets
- Participant accounts of what they observed and did
- Outcomes and impact (what actually happened to users, the business)
The framework: blameless analysis
The most important principle: blamelessness. Without it, people hide information, and retrospectives yield theatrical lessons rather than real ones.
What blameless means
- Focus on systems, not individuals
- Assume everyone made reasonable decisions given what they knew at the time
- The question is "why was this decision reasonable to make?" not "who screwed up?"
- Fixing the system means the next person in that situation succeeds where this person didn't
What blameless does not mean
- No accountability (action items still have owners)
- No hard truths (sometimes the system is broken in obvious ways)
- No standards (some patterns of failure are individual, not systemic)
- No discomfort (real reflection is uncomfortable)
The framework: 6 sections
A complete AAR covers six sections.
1. Summary
A 2 to 3 paragraph overview. Captures:
- What happened
- Impact (users, business, time)
- Root cause (in plain language)
- Top action items
This is what executives read. Anyone who reads only this section should leave with the most important information.
2. Timeline
A reconstructed timeline of events.
For incidents:
- T-0: Detection
- T+X: Acknowledgment
- T+Y: Severity assessed, IC assigned
- T+Z: Investigation began
- ... mitigation, communication, resolution events
- T+N: Resolution declared
For launches:
- Pre-launch decisions and milestones
- Launch day events
- Post-launch monitoring observations
For projects:
- Major milestones, decisions, pivots
- Both planned and emergent
The timeline is the source of truth. Disagreements about what happened get resolved here.
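A minimal sketch of how raw timestamps from logs, chat, and tickets could be turned into the T+offset form shown above. The event data and the function names `format_offset` and `build_timeline` are hypothetical, and the sketch assumes the earliest event is the detection point.

```python
from datetime import datetime, timedelta


def format_offset(delta: timedelta) -> str:
    """Render a non-negative timedelta as T+HH:MM."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"T+{hours:02d}:{minutes:02d}"


def build_timeline(events: list[tuple[datetime, str]]) -> list[str]:
    """Sort events by time and label each with its offset from the first one."""
    ordered = sorted(events, key=lambda e: e[0])
    if not ordered:
        return []
    start = ordered[0][0]  # assumption: the earliest event is detection
    return [f"{format_offset(ts - start)}  {desc}" for ts, desc in ordered]


# Hypothetical incident events, deliberately out of order.
events = [
    (datetime(2024, 1, 15, 14, 12), "Severity assessed, IC assigned"),
    (datetime(2024, 1, 15, 14, 2), "Detection: alert fired"),
    (datetime(2024, 1, 15, 14, 5), "Acknowledgment by on-call"),
    (datetime(2024, 1, 15, 15, 40), "Resolution declared"),
]

for line in build_timeline(events):
    print(line)
```

Sorting before labeling matters: sources such as chat and tickets rarely arrive in order, and a sorted list makes gaps between steps easy to spot.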
3. Root cause analysis
What caused this, in plain language.
Use one or both of:
Five whys. Start with the surface symptom. Ask "why?" Repeat 5 times (or until you reach a true root). Each "why" should yield a substantive answer, not a tautology.
Example:
- Why did the site go down? Database connection pool exhausted.
- Why was the pool exhausted? Background job opened too many connections.
- Why did the background job open too many connections? Connection cleanup code didn't run on errors.
- Why didn't cleanup run on errors? Original code review didn't cover error paths.
- Why didn't the review cover error paths? No checklist for error handling in our review process.
The fifth why often reveals the system fix. In this case: improve the review process.
Causal chain. Multiple contributing factors that combined.
- Factor 1: Background job opened too many connections (technical)
- Factor 2: Connection limit was set too low for actual traffic (configuration)
- Factor 3: No alert on connection pool saturation (monitoring)
- Factor 4: Recent traffic doubled without infra capacity review (process)
No single fix addresses the incident. Multiple gaps need attention.
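To make the third "why" in the example concrete (cleanup code that didn't run on errors), here is a small sketch of the kind of fix a code-level action item might describe. `FakePool`, `pooled_connection`, and `run_job` are hypothetical stand-ins, not any real database library's API; the point is only that `try`/`finally` releases the connection on the error path too.

```python
from contextlib import contextmanager


class FakePool:
    """Hypothetical stand-in for a database connection pool."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()

    def release(self, conn: object) -> None:
        self.in_use -= 1


@contextmanager
def pooled_connection(pool: FakePool):
    """Acquire a connection and always release it, even if the body raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)  # runs on success and on error


def run_job(pool: FakePool, fail: bool) -> None:
    """Simulated background job that may fail while holding a connection."""
    with pooled_connection(pool):
        if fail:
            raise ValueError("simulated job error")


pool = FakePool(size=2)
for _ in range(10):
    try:
        run_job(pool, fail=True)
    except ValueError:
        pass  # the job failed, but its connection was still returned

print(pool.in_use)  # no connections leaked
```

Without the `finally` clause, the third failing job in this loop would hit "connection pool exhausted", which mirrors the first two whys in the example.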
4. Contributing factors
Factors that didn't cause the event but made it worse, or removed safety nets that would have caught it.
- Monitoring gaps
- Documentation gaps
- Process gaps
- Tooling gaps
- Knowledge gaps
Each one is a "would have been caught earlier if..." factor.
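As an illustration of closing a monitoring gap, using this document's connection pool example: a sketch of a saturation check. The 80% threshold matches the action item example in this document; the function names and the idea of passing in raw counts are assumptions, and a real system would wire this into its own alerting tool.

```python
def pool_saturation(in_use: int, max_connections: int) -> float:
    """Fraction of the pool currently in use, between 0.0 and 1.0."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return in_use / max_connections


def should_page(in_use: int, max_connections: int, threshold: float = 0.80) -> bool:
    """True when saturation is at or above the paging threshold."""
    return pool_saturation(in_use, max_connections) >= threshold


# Hypothetical readings from a pool of 100 connections.
for reading in (50, 85, 100):
    print(reading, should_page(reading, 100))
```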
5. What went well
Real lessons require capturing successes, not just failures.
- What detection worked?
- What response worked?
- What decisions were good?
- What tools or processes performed as expected?
This is not consolation. It's calibration. Things that worked here should be reinforced and replicated.
6. Action items
Specific, owned, dated.
| Action | Owner | Due | Type |
|---|---|---|---|
| Add alert on connection pool saturation | [name] | [date] | Monitoring |
| Add error handling checklist to PR template | [name] | [date] | Process |
| Audit other background jobs for similar issue | [name] | [date] | Code |
Action item criteria:
- Specific. "Improve monitoring" is not actionable. "Add alert on connection pool saturation, threshold 80%, page on-call" is.
- Owned. A name. Not "the team."
- Dated. A real date. Not "soon."
- Sized. Roughly hours, days, or weeks of effort.
- Closeable. Definition of done is clear.
Action items that don't close in their committed timeframe should re-surface in the next AAR. Patterns of unclosed actions point to deeper organizational issues.
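The five criteria above can be sketched as a simple check that runs before an action item is accepted. This is one possible encoding, not a prescribed one: the `ActionItem` fields, the list of vague owners, and the word-count heuristic for "specific" are all assumptions, and the heuristic in particular is crude.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Owner values that name nobody in particular (illustrative list).
VAGUE_OWNERS = {"", "team", "the team", "everyone", "tbd"}
VALID_EFFORT = {"hours", "days", "weeks"}


@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]
    effort: str     # rough size: "hours", "days", or "weeks"
    done_when: str  # definition of done


def check(item: ActionItem) -> list[str]:
    """Return the criteria this item fails; an empty list means it passes."""
    failed = []
    # Crude proxy for "specific": very short actions tend to be generic.
    if len(item.action.split()) < 4:
        failed.append("specific")
    if item.owner.strip().lower() in VAGUE_OWNERS:
        failed.append("owned")
    if item.due is None:
        failed.append("dated")
    if item.effort not in VALID_EFFORT:
        failed.append("sized")
    if not item.done_when.strip():
        failed.append("closeable")
    return failed


vague = ActionItem("Improve monitoring", "the team", None, "", "")
concrete = ActionItem(
    action="Add alert on connection pool saturation, threshold 80%, page on-call",
    owner="A. Example",    # placeholder name
    due=date(2024, 2, 1),  # placeholder date
    effort="days",
    done_when="Alert fires in a staging test and pages the on-call rotation",
)

print(check(vague))     # fails all five criteria
print(check(concrete))  # passes
```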
Workflow
1. Schedule the AAR
Within 1 to 2 weeks of the event. Long enough that emotions have cooled and facts have been gathered. Short enough that memories are still fresh.
For incidents: pre-decided in the response procedure.
For launches: scheduled in the runbook.
For projects: schedule at project closeout.
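A small sketch of the scheduling window, reading "within 1 to 2 weeks" as no earlier than 7 days and no later than 14 days after the event. That reading, and the function name `aar_window`, are assumptions.

```python
from datetime import date, timedelta


def aar_window(event_date: date) -> tuple[date, date]:
    """Earliest and latest suggested dates for holding the AAR."""
    return event_date + timedelta(weeks=1), event_date + timedelta(weeks=2)


earliest, latest = aar_window(date(2024, 1, 15))  # hypothetical event date
print(earliest.isoformat(), latest.isoformat())
```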
2. Gather inputs
Before the meeting:
- Reconstructed timeline (often the scribe's notes if there was one)
- Logs, chat transcripts, tickets, incident updates
- Individual accounts from each participant (written, before the meeting)
- Impact data (users affected, duration, revenue impact, etc.)
3. Run the meeting
Typical agenda (60 to 90 minutes):
- Read the summary as drafted (5 min)
- Walk the timeline together. Add corrections. Resolve disagreements. (20 to 30 min)
- Discuss root cause. Use five whys or causal chain. (15 to 20 min)
- Discuss contributing factors. (10 min)
- Discuss what went well. (10 min)
- Identify action items. Owners and dates. (10 min)
A facilitator runs the meeting. Often the IC for an incident, or a project lead for a project. The facilitator is not the scribe.
4. Write the document
Within a few days of the meeting. The full AAR includes all 6 sections.
5. Distribute
Internal: post in a known location. Make searchable. Reference in onboarding.
For high-severity incidents: external summary may be appropriate (status page, customer email, public blog).
6. Track action items
Every action item should be tracked to closure. The next AAR re-surfaces unclosed ones.
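One way to re-surface unclosed items is to filter the previous AAR's action items and render the survivors as a table with a Status column. The sketch below assumes items are stored as dictionaries with a `status` field and a `due` date; the storage format, the "Overdue" label, and the sample data are hypothetical.

```python
from datetime import date


def unclosed(items: list[dict], today: date) -> list[dict]:
    """Return items that are not closed, marking past-due ones as Overdue."""
    carried = []
    for item in items:
        if item["status"].lower() == "closed":
            continue
        status = "Overdue" if item["due"] < today else item["status"]
        carried.append({**item, "status": status})
    return carried


def to_table(items: list[dict]) -> str:
    """Render items as a markdown table with a Status column."""
    rows = ["| Action | Owner | Due | Type | Status |", "|---|---|---|---|---|"]
    for i in items:
        rows.append(
            f"| {i['action']} | {i['owner']} | {i['due'].isoformat()} "
            f"| {i['type']} | {i['status']} |"
        )
    return "\n".join(rows)


# Hypothetical action items from a previous AAR; owners are placeholders.
previous = [
    {"action": "Add alert on connection pool saturation", "owner": "[name]",
     "due": date(2024, 2, 1), "type": "Monitoring", "status": "Closed"},
    {"action": "Add error handling checklist to PR template", "owner": "[name]",
     "due": date(2024, 2, 15), "type": "Process", "status": "Open"},
    {"action": "Audit other background jobs for similar issue", "owner": "[name]",
     "due": date(2024, 3, 30), "type": "Code", "status": "Open"},
]

carried = unclosed(previous, today=date(2024, 3, 1))
print(to_table(carried))
```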
Failure patterns
- Skipping the AAR for "small" incidents. Patterns get missed.
- Naming and shaming. Real lessons get hidden when people fear blame.
- Generic action items. "Improve testing" instead of specific testing change.
- Action items that never close. Filed, forgotten. Same incident recurs.
- Theater retrospectives. Going through the motions without genuine reflection.
- Skipping "what went well." Misses calibration on what's working.
- Blame externalized. "Our vendor failed." OK, what's our system for vendor risk?
- Single-person AAR. One person writes the whole thing. Misses other perspectives.
- AAR only for failures. Successful launches deserve AARs too. Lessons from success are valuable.
- Long delays. Memories fade. Conversations cool. Get it done within 2 weeks.
Output format
A markdown document at `aar-[date]-[event-name].md`. Structure:

```markdown
# AAR: [Event name]

Date of event: [YYYY-MM-DD]
AAR date: [YYYY-MM-DD]
Severity / scope: [SEV-1 / Major launch / Project closeout]
Facilitator: [Name]
Participants: [Names]

## Summary

[2 to 3 paragraphs]

## Impact

- Users affected: [number, segment]
- Duration: [time]
- Revenue / business impact: [if applicable]

## Timeline

[Timestamped events]

## Root cause analysis

[Five whys or causal chain]

## Contributing factors

[List]

## What went well

[List]

## Action items

| Action | Owner | Due | Type | Status |
|---|---|---|---|---|

## Lessons

[Reflections that don't fit elsewhere. Often the most quotable section.]
```
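A sketch of generating the output filename from the `aar-[date]-[event-name].md` pattern. The source does not say how the date or event name should be formatted, so the ISO date (matching the YYYY-MM-DD fields in the template), the lowercase hyphenated slug, and the function name `aar_filename` are assumptions.

```python
import re
from datetime import date


def aar_filename(event_date: date, event_name: str) -> str:
    """Build aar-[date]-[event-name].md with an ISO date and a hyphenated slug."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", event_name.lower()).strip("-")
    return f"aar-{event_date.isoformat()}-{slug}.md"


# Hypothetical event, for illustration only.
print(aar_filename(date(2024, 1, 15), "Checkout API outage"))
```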
---
Reference files
- `references/aar-template.md` - Fillable AAR template covering incidents, launches, and projects.