-
Read incident context — Scan
for incident reports matching the described incident. If an incident report exists, use its timeline, impact data, and preliminary root cause as input. Also check for previous postmortems to identify recurring patterns.
-
Parse the incident — Extract from
and any linked incident report: what happened, when, what was affected, and how it was resolved. If details are insufficient, ask one round of clarifying questions covering the timeline, impact scope, and resolution steps.
-
Verify blamelessness — Before writing, review all input for named individuals. Replace names with roles (e.g., "the on-call engineer" instead of "Alice"). The postmortem describes systems, processes, and decisions — not people.
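As a rough illustration of the name-to-role replacement, the sketch below swaps known names for roles using whole-word matching. The `ROLE_BY_NAME` mapping and the `anonymize` helper are hypothetical; in practice the mapping would come from reviewing the incident input, and a manual read-through is still needed to catch names the mapping misses.

```python
import re

# Hypothetical mapping built while reviewing the input (not part of the source workflow).
ROLE_BY_NAME = {
    "Alice": "the on-call engineer",
    "Bob": "the release manager",
}


def anonymize(text: str, role_by_name: dict[str, str]) -> str:
    """Replace each known name with its role, matching whole words only."""
    if not role_by_name:
        return text
    # Try longer names first so a full name wins over a first name alone.
    names = sorted(role_by_name, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: role_by_name[m.group(1)], text)
```

Calling `anonymize("Alice paged Bob.", ROLE_BY_NAME)` yields text that refers only to roles.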
-
Identify contributing factors — Avoid naming a single "root cause." Most incidents result from multiple contributing factors that align (Swiss cheese model). Identify each layer that failed: process, tooling, monitoring, testing, communication, architecture.
-
Run 5 Whys per contributing factor — For each contributing factor, ask "Why?" repeatedly (five rounds is the usual guide) to trace from the symptom to the systemic issue. Stop when you reach a factor that is both actionable and systemic, not when you reach a person's decision.
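One way to keep each why-chain attached to its contributing factor is a small record type. This is a minimal sketch; the `ContributingFactor` class and its field names are assumptions made for illustration, not a format the workflow prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class ContributingFactor:
    layer: str    # e.g. "process", "tooling", "monitoring", "testing"
    symptom: str  # what was observed during the incident
    whys: list[str] = field(default_factory=list)  # successive answers to "Why?"

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next "Why?" in the chain."""
        self.whys.append(answer)

    def systemic_issue(self) -> str:
        """Return the deepest answer so far, or the symptom if none was recorded."""
        return self.whys[-1] if self.whys else self.symptom

    def render(self) -> str:
        """Format the chain as indented text for the postmortem."""
        lines = [f"Factor ({self.layer}): {self.symptom}"]
        lines += [f"  Why {i}? {answer}" for i, answer in enumerate(self.whys, 1)]
        return "\n".join(lines)
```

Each answer should describe a system or process condition; if an answer names a person's decision, the next "Why?" asks what made that decision reasonable at the time.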
-
Analyze safety barriers — Using the Swiss cheese model, document which safety barriers existed and which failed. Categories: Detection (monitoring, alerting), Prevention (code review, testing, feature flags), Mitigation (rollback, circuit breakers, graceful degradation), Communication (status pages, incident channels, escalation paths).
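The barrier analysis can be sketched as a grouping of barriers by category and outcome. The four category names come from the step above; the `summarize_barriers` function and the tuple layout are hypothetical.

```python
# Categories taken from the workflow step; everything else here is illustrative.
CATEGORIES = ("Detection", "Prevention", "Mitigation", "Communication")


def summarize_barriers(barriers):
    """Group (category, name, held) tuples into held and failed lists per category."""
    summary = {c: {"held": [], "failed": []} for c in CATEGORIES}
    for category, name, held in barriers:
        if category not in summary:
            raise ValueError(f"unknown barrier category: {category}")
        summary[category]["held" if held else "failed"].append(name)
    return summary
```

Categories with no entries stay in the summary as empty lists, which makes a missing barrier (for example, no Prevention layer at all) visible rather than silently omitted.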
-
Check for recurrence — Search
for previous postmortems or incident reports with similar contributing factors. If this is a recurring theme, flag it explicitly and reference previous documents.
-
Write action items — Every action item must be categorized as Detect, Prevent, or Mitigate. Each must have an owner (role or team, not individual name) and a due date. Prioritize actions that address the deepest systemic issues, not just the immediate trigger.
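The rules for action items lend themselves to a simple check before the file is written. The sketch below assumes a hypothetical `ActionItem` record; it verifies the category, owner, and due date, but it cannot tell whether an owner string is a role or an individual's name, so that still needs a manual check.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

VALID_CATEGORIES = {"Detect", "Prevent", "Mitigate"}


@dataclass
class ActionItem:
    description: str
    category: str          # must be one of VALID_CATEGORIES
    owner: str             # a role or team, not an individual's name
    due: Optional[date]    # None means no due date was set


def validate(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if item.category not in VALID_CATEGORIES:
        problems.append(f"category must be one of {sorted(VALID_CATEGORIES)}")
    if not item.owner.strip():
        problems.append("owner is missing")
    if item.due is None:
        problems.append("due date is missing")
    return problems
```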
-
Document what went well — Identify aspects of the response that worked: fast detection, effective communication, successful rollback, team coordination. This section is mandatory.
-
Determine the next file number — List files in
to find the highest-numbered file. Add 1 to that number to get <n>.
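A minimal sketch of the numbering step, assuming files are named with a leading integer followed by an underscore (as in the `<n>_postmortem_<incident>.md` pattern). The `next_file_number` helper is hypothetical, and starting at 1 for an empty or missing directory is an assumption.

```python
import re
from pathlib import Path


def next_file_number(directory: Path) -> int:
    """Return the highest leading file number in `directory` plus 1."""
    if not directory.is_dir():
        return 1  # assumption: numbering starts at 1 when nothing exists yet
    numbers = []
    for path in directory.iterdir():
        match = re.match(r"(\d+)_", path.name)
        if path.is_file() and match:
            numbers.append(int(match.group(1)))
    # Files without a numeric prefix are ignored.
    return max(numbers, default=0) + 1
```

Comparing the prefixes as integers rather than strings avoids ranking `9_` above `12_`.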
-
Write the file — Save to
.chalk/docs/engineering/<n>_postmortem_<incident>.md.
-
Confirm — Present the postmortem with a summary of contributing factors, the number of action items, and any recurrence warnings.