after-action-report

After-Action Report

Run a structured retrospective on a launch, incident, or completed project. Produce actionable lessons, not just a document.
This skill is for after-the-fact analysis. For active incident response, use incident-response. For planning launches, use launch-runbook.

When to use

  • After any incident (any severity)
  • After every major launch
  • At the end of a project (sprint retro, quarterly retro, project closeout)
  • When a recurring issue has happened enough times to demand investigation
  • When a decision didn't work out and the team wants to learn

When NOT to use

  • During an active incident (use incident-response)
  • For pre-launch planning (use launch-runbook)
  • For one-off bug fixes that don't merit broad analysis

Required inputs

  • The event being analyzed (incident, launch, project)
  • A timeline reconstructed from logs, chat, tickets
  • Participant accounts of what they observed and did
  • Outcomes and impact (what actually happened to users, the business)

The framework: blameless analysis

The most important principle is blamelessness. Without it, people hide information, and retrospectives produce theatrical lessons rather than real ones.

What blameless means

  • Focus on systems, not individuals
  • Assume everyone made reasonable decisions given what they knew at the time
  • The question is "why was this decision reasonable to make?" not "who screwed up?"
  • Fixing the system means the next person in that situation succeeds where this person didn't

What blameless does not mean

  • No accountability (action items still have owners)
  • No hard truths (sometimes the system is broken in obvious ways)
  • No standards (some patterns of failure are individual, not systemic)
  • No discomfort (real reflection is uncomfortable)

The framework: 6 sections

A complete AAR covers six sections.

1. Summary

A 2 to 3 paragraph overview. Captures:
  • What happened
  • Impact (users, business, time)
  • Root cause (in plain language)
  • Top action items
This is what executives read. Anyone who reads only this section should leave with the most important information.

2. Timeline

A reconstructed timeline of events.
For incidents:
  • T-0: Detection
  • T+X: Acknowledgment
  • T+Y: Severity assessed, IC assigned
  • T+Z: Investigation began
  • ... mitigation, communication, resolution events
  • T+N: Resolution declared
For launches:
  • Pre-launch decisions and milestones
  • Launch day events
  • Post-launch monitoring observations
For projects:
  • Major milestones, decisions, pivots
  • Both planned and emergent
The timeline is the source of truth. Disagreements about what happened get resolved here.
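The T+ notation above can be derived mechanically from raw timestamps. The following is a minimal sketch, assuming events arrive as (timestamp, label) pairs gathered from logs or chat; the function names and the sample incident data are hypothetical and used only for illustration.

```python
from datetime import datetime


def format_offset(delta):
    """Render a timedelta as T+HH:MM (or T-HH:MM for events before detection)."""
    seconds = int(delta.total_seconds())
    sign = "-" if seconds < 0 else "+"
    hours, remainder = divmod(abs(seconds), 3600)
    return f"T{sign}{hours:02d}:{remainder // 60:02d}"


def relative_timeline(events, detection):
    """Sort (timestamp, label) pairs and express each one relative to detection."""
    ordered = sorted(events, key=lambda event: event[0])
    return [(format_offset(ts - detection), label) for ts, label in ordered]


# Hypothetical incident data, not taken from a real event.
DETECTION = datetime(2024, 3, 5, 14, 2)
EVENTS = [
    (datetime(2024, 3, 5, 14, 9), "Acknowledgment"),
    (datetime(2024, 3, 5, 14, 2), "Detection"),
    (datetime(2024, 3, 5, 15, 47), "Resolution declared"),
    (datetime(2024, 3, 5, 14, 15), "Severity assessed, IC assigned"),
]

TIMELINE = relative_timeline(EVENTS, DETECTION)
for offset, label in TIMELINE:
    print(f"{offset}  {label}")
```

Sorting before formatting keeps the timeline in order even when inputs come from several sources with different ordering.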

3. Root cause analysis

What caused this, in plain language.
Use one or both of:
Five whys. Start with the surface symptom. Ask "why?" Repeat 5 times (or until you reach a true root). Each "why" should yield a substantive answer, not a tautology.
Example:
  • Why did the site go down? Database connection pool exhausted.
  • Why was the pool exhausted? Background job opened too many connections.
  • Why did the background job open too many connections? Connection cleanup code didn't run on errors.
  • Why didn't cleanup run on errors? Original code review didn't cover error paths.
  • Why didn't the review cover error paths? No checklist for error handling in our review process.
The fifth why often reveals the system fix. In this case: improve the review process.
Causal chain. Map the multiple contributing factors that combined to produce the event.
  • Factor 1: Background job opened too many connections (technical)
  • Factor 2: Connection limit was set too low for actual traffic (configuration)
  • Factor 3: No alert on connection pool saturation (monitoring)
  • Factor 4: Recent traffic doubled without infra capacity review (process)
No single fix addresses the incident. Multiple gaps need attention.
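Both techniques can be captured as simple data so the analysis is easy to review. This is a minimal sketch using the example above; the Why class, the weak_steps check, and the category labels are illustrative assumptions, and the repeated-answer test is only a crude proxy for spotting a tautology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    question: str
    answer: str


def weak_steps(chain):
    """Return indexes of steps whose answer is empty or repeats an earlier answer."""
    seen = set()
    weak = []
    for index, step in enumerate(chain):
        normalized = step.answer.strip().lower().rstrip(".")
        if not normalized or normalized in seen:
            weak.append(index)
        seen.add(normalized)
    return weak


def group_factors(factors):
    """Group (description, category) pairs of a causal chain by category."""
    grouped = defaultdict(list)
    for description, category in factors:
        grouped[category].append(description)
    return dict(grouped)


FIVE_WHYS = [
    Why("Why did the site go down?", "Database connection pool exhausted."),
    Why("Why was the pool exhausted?", "Background job opened too many connections."),
    Why("Why did the background job open too many connections?",
        "Connection cleanup code didn't run on errors."),
    Why("Why didn't cleanup run on errors?",
        "Original code review didn't cover error paths."),
    Why("Why didn't the review cover error paths?",
        "No checklist for error handling in our review process."),
]

FACTORS = [
    ("Background job opened too many connections", "technical"),
    ("Connection limit was set too low for actual traffic", "configuration"),
    ("No alert on connection pool saturation", "monitoring"),
    ("Recent traffic doubled without infra capacity review", "process"),
]

print(weak_steps(FIVE_WHYS))           # indexes that need a more substantive answer
print(sorted(group_factors(FACTORS)))  # categories that need a fix
```

An empty result from weak_steps does not prove the chain is sound; it only rules out the most obvious non-answers.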

4. Contributing factors

Factors that didn't cause the event but made it worse, or removed safety nets that would have caught it.
  • Monitoring gaps
  • Documentation gaps
  • Process gaps
  • Tooling gaps
  • Knowledge gaps
Each is a "would have been caught earlier if..." factor.

5. What went well

Real lessons require capturing successes, not just failures.
  • What detection worked?
  • What response worked?
  • What decisions were good?
  • What tools or processes performed as expected?
This is not consolation. It's calibration. Things that worked here should be reinforced and replicated.

6. Action items

Specific, owned, dated.
| Action | Owner | Due | Type |
|---|---|---|---|
| Add alert on connection pool saturation | [name] | [date] | Monitoring |
| Add error handling checklist to PR template | [name] | [date] | Process |
| Audit other background jobs for similar issue | [name] | [date] | Code |
Action item criteria:
  • Specific. "Improve monitoring" is not actionable. "Add alert on connection pool saturation, threshold 80%, page on-call" is.
  • Owned. A name. Not "the team."
  • Dated. A real date. Not "soon."
  • Sized. Roughly hours, days, or weeks of effort.
  • Closeable. Definition of done is clear.
Action items that don't close in their committed timeframe should re-surface in the next AAR. Patterns of unclosed actions point to deeper organizational issues.
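The criteria above lend themselves to a lightweight check before the AAR is published. This is a minimal sketch, assuming action items are stored as records; the ActionItem fields, the owner name, the list of vague owners, and the word-count heuristic for "specific" are all illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Owners that do not name a person (assumed list, extend as needed).
VAGUE_OWNERS = {"", "the team", "team", "everyone", "tbd"}
EFFORT_SIZES = {"hours", "days", "weeks"}


@dataclass
class ActionItem:
    action: str
    owner: str
    due: Optional[date]
    type: str
    effort: str = ""     # rough size: "hours", "days", or "weeks"
    done_when: str = ""  # definition of done


def criteria_gaps(item, min_words=5):
    """Return the criteria an action item fails, in the order listed above."""
    gaps = []
    if len(item.action.split()) < min_words:  # crude proxy for "specific"
        gaps.append("specific")
    if item.owner.strip().lower() in VAGUE_OWNERS:
        gaps.append("owned")
    if item.due is None:
        gaps.append("dated")
    if item.effort not in EFFORT_SIZES:
        gaps.append("sized")
    if not item.done_when.strip():
        gaps.append("closeable")
    return gaps


VAGUE = ActionItem("Improve monitoring", "the team", None, "Monitoring")
SPECIFIC = ActionItem(
    action="Add alert on connection pool saturation, threshold 80%, page on-call",
    owner="A. Rivera",  # hypothetical name
    due=date(2024, 3, 19),  # hypothetical date
    type="Monitoring",
    effort="days",
    done_when="Alert fires in a staging test and pages the on-call rotation",
)

print(criteria_gaps(VAGUE))
print(criteria_gaps(SPECIFIC))
```

A check like this catches missing fields, not weak thinking; a facilitator still needs to read each item.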

Workflow

1. Schedule the AAR

Within 1 to 2 weeks of the event. Long enough that emotions have cooled and facts have been gathered. Short enough that memories are still fresh.
For incidents: pre-decided in the response procedure. For launches: scheduled in the runbook. For projects: scheduled at project closeout.

2. Gather inputs

Before the meeting:
  • Reconstructed timeline (often the scribe's notes if there was one)
  • Logs, chat transcripts, tickets, incident updates
  • Individual accounts from each participant (written, before the meeting)
  • Impact data (users affected, duration, revenue impact, etc.)

3. Run the meeting

Typical agenda (60 to 90 minutes):
  • Read the summary as drafted (5 min)
  • Walk the timeline together. Add corrections. Resolve disagreements. (20 to 30 min)
  • Discuss root cause. Use five whys or causal chain. (15 to 20 min)
  • Discuss contributing factors. (10 min)
  • Discuss what went well. (10 min)
  • Identify action items. Owners and dates. (10 min)
A facilitator runs the meeting. Often the IC for an incident, or a project lead for a project. The facilitator is not the scribe.
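As a quick sanity check on the agenda above, the per-item time ranges can be summed to confirm they fit the 60 to 90 minute window. A minimal sketch, with the agenda transcribed as (item, minimum minutes, maximum minutes); the names are illustrative.

```python
AGENDA = [
    ("Read the summary as drafted", 5, 5),
    ("Walk the timeline together", 20, 30),
    ("Discuss root cause", 15, 20),
    ("Discuss contributing factors", 10, 10),
    ("Discuss what went well", 10, 10),
    ("Identify action items", 10, 10),
]


def agenda_bounds(agenda):
    """Return the (shortest, longest) total meeting length in minutes."""
    shortest = sum(low for _, low, _ in agenda)
    longest = sum(high for _, _, high in agenda)
    return shortest, longest


SHORTEST, LONGEST = agenda_bounds(AGENDA)
print(SHORTEST, LONGEST)  # 70 85
```

The items total 70 to 85 minutes, which leaves a small buffer inside a 90 minute slot for overruns on the timeline walk.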

4. Write the document

Within a few days of the meeting. The full AAR includes all 6 sections.

5. Distribute

Internal: post in a known location. Make searchable. Reference in onboarding.
For high-severity incidents: an external summary may be appropriate (status page, customer email, public blog).

6. Track action items

Every action item should be tracked to closure. The next AAR re-surfaces unclosed ones.
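One way to make the re-surfacing step routine is to compute the carry-over list when preparing the next AAR. This is a minimal sketch, assuming "unclosed" means open and past its due date; the TrackedItem record and the sample data are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class TrackedItem(NamedTuple):
    action: str
    owner: str
    due: date
    closed: bool


def carry_over(items, today):
    """Return open items whose due date has passed, oldest first."""
    overdue = [item for item in items if not item.closed and item.due < today]
    return sorted(overdue, key=lambda item: item.due)


# Hypothetical tracker contents.
ITEMS = [
    TrackedItem("Add alert on connection pool saturation", "A. Rivera", date(2024, 3, 19), True),
    TrackedItem("Add error handling checklist to PR template", "B. Chen", date(2024, 3, 26), False),
    TrackedItem("Audit other background jobs for similar issue", "C. Okafor", date(2024, 5, 1), False),
]

RESURFACED = carry_over(ITEMS, today=date(2024, 4, 15))
for item in RESURFACED:
    print(f"{item.due}  {item.owner}: {item.action}")
```

Items that are open but not yet due are left out on purpose, so the list shows only commitments that slipped.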

Failure patterns

  • Skipping the AAR for "small" incidents. Patterns get missed.
  • Naming and shaming. Real lessons get hidden when people fear blame.
  • Generic action items. "Improve testing" instead of a specific testing change.
  • Action items that never close. Filed, forgotten. Same incident recurs.
  • Theater retrospectives. Going through the motions without genuine reflection.
  • Skipping "what went well." Misses calibration on what's working.
  • Blame externalized. "Our vendor failed." OK, what's our system for vendor risk?
  • Single-person AAR. One person writes the whole thing. Misses other perspectives.
  • AAR only for failures. Successful launches deserve AARs too. Lessons from success are valuable.
  • Long delays. Memories fade. Conversations cool. Get it done within 2 weeks.

Output format

A markdown document at aar-[date]-[event-name].md.
Structure:

AAR: [Event name]

Date of event: [YYYY-MM-DD]
AAR date: [YYYY-MM-DD]
Severity / scope: [SEV-1 / Major launch / Project closeout]
Facilitator: [Name]
Participants: [Names]

Summary

[2 to 3 paragraphs]

Impact

  • Users affected: [number, segment]
  • Duration: [time]
  • Revenue / business impact: [if applicable]

Timeline

[Timestamped events]

Root cause analysis

[Five whys or causal chain]

Contributing factors

[List]

What went well

[List]

Action items

| Action | Owner | Due | Type | Status |
|---|---|---|---|---|

Lessons

[Reflections that don't fit elsewhere. Often the most quotable section.]

---
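The file name pattern and section list above can be turned into a starter file with a few lines of code. This is a minimal sketch: the slug rule, the YYYY-MM-DD date in the file name, the heading levels, and the [TODO] placeholder are assumptions, since the pattern aar-[date]-[event-name].md does not specify them.

```python
import re
from datetime import date

SECTIONS = [
    "Summary",
    "Impact",
    "Timeline",
    "Root cause analysis",
    "Contributing factors",
    "What went well",
    "Action items",
    "Lessons",
]


def aar_filename(event_date, event_name):
    """Build aar-[date]-[event-name].md with a lowercase, hyphenated event name."""
    slug = re.sub(r"[^a-z0-9]+", "-", event_name.lower()).strip("-")
    return f"aar-{event_date.isoformat()}-{slug}.md"


def aar_skeleton(event_name):
    """Return an empty markdown AAR with one heading per section."""
    lines = [f"# AAR: {event_name}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "[TODO]", ""]
    return "\n".join(lines)


# Hypothetical event used for illustration.
FILENAME = aar_filename(date(2024, 3, 5), "Checkout outage")
SKELETON = aar_skeleton("Checkout outage")
print(FILENAME)
print(SKELETON)
```

Generating the skeleton from one section list keeps every AAR in the same order, which makes past reports easier to search and compare.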

Reference files

  • references/aar-template.md - Fillable AAR template covering incidents, launches, and projects.