How to Write a Post-Mortem / Retrospective
A practical guide to writing blameless post-mortems and retrospectives — timeline construction, root cause analysis, action items that actually get done, and a Markdown template.
A post-mortem is a structured review of something that didn't go as planned — an outage, a failed launch, a missed deadline, a shipped bug that reached production. A retrospective is the same structure applied to planned reviews — end of sprint, end of quarter, end of project.
Both exist for one reason: so the organization learns faster than its people leave. Without them, every new engineer rediscovers every old failure mode.
This guide covers how to run a post-mortem that produces action items that actually get done, and how to write it up so that people six months from now (including you) can learn from it.
The cultural prerequisite: blameless
Before anyone writes a template, the team has to agree on one thing: post-mortems are blameless. That means:
- The focus is on systems, processes, and decisions — not individuals.
- We assume people did the best they could with the information they had at the time.
- Hindsight is not evidence. "You should have known" is not a valid criticism.
- Nobody gets fired, demoted, or put on a PIP because of a post-mortem finding.
Blameless doesn't mean unaccountable. If someone repeatedly ignores process or acts negligently, that's a management conversation — it happens outside the post-mortem, not in it. The post-mortem stays focused on "what can we change in the system to prevent this class of incident."
If your organization can't commit to blamelessness, post-mortems will be theater. People will cover themselves, withhold information, and assign vague action items. Nothing will improve. Fix the culture first.
When to write one
Write a post-mortem for:
- Any customer-facing incident (outage, data loss, major bug in production)
- Any security incident (breach, exposure, unauthorized access)
- Any missed launch or significantly missed deadline
- Any "near miss" — something that almost went wrong and shouldn't happen again
- Any project, at its conclusion (this is the retrospective flavor)
You do not need one for:
- Individual bugs caught in code review or QA
- Minor delays inside a sprint
- Expected failures in a controlled environment (load testing, chaos engineering)
A good heuristic: if you'd want someone joining the team next year to know about it, write one.
The structure
A post-mortem has six sections:
- Summary — what happened, at a glance
- Timeline — what happened, with times
- Root causes — why it happened
- Impact — what it cost
- What went well, what didn't — the reflection
- Action items — what we'll do
The Project Post-Mortem template is a filled-in starting point. Here's a walkthrough of each section.
1. Summary
3–5 sentences at the top. Someone reading only the summary should know:
- What happened (the user-visible symptom)
- When and for how long
- Who was affected
- What the root cause was
- What we're doing about it
Example:
Summary: On 2026-03-15 from 14:22 to 15:47 UTC (1h 25m), the checkout API returned 500 errors for approximately 8% of requests. Root cause was an expired TLS certificate on our payment provider's staging environment that was inadvertently used in production due to a feature flag misconfiguration. Most affected transactions were retried successfully; support is contacting customers whose checkouts were abandoned. We are adding certificate expiration monitoring and tightening feature flag environment checks.
2. Timeline
A chronological list with timestamps. Times in UTC for shared context.
```markdown
## Timeline

All times UTC.

- **14:22** — Payment processor starts returning TLS errors. Not yet detected.
- **14:24** — First checkout 500 errors appear in logs.
- **14:31** — PagerDuty alert fires (checkout error rate > 5%).
- **14:33** — On-call engineer @alice acknowledges; opens incident channel.
- **14:41** — @alice identifies payment provider as upstream cause.
- **14:49** — @bob joins; notices staging endpoint in config.
- **14:52** — Feature flag rollback initiated.
- **15:14** — Rollback completes; error rate drops.
- **15:47** — Error rate back to baseline; incident closed.
```
Rules for the timeline:
- Fact, not judgment. "Error rate dropped" — not "we quickly fixed the issue."
- Times in UTC. If the team is distributed, local times are confusing.
- Attribute responders by initial or handle, not full names. Keeps the focus on roles.
- Include detection gaps. The time between the problem starting (14:22) and the alert firing (14:31) is data. Nine minutes of undetected errors is an action item candidate.
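The detection-gap arithmetic is trivial, but it's worth computing explicitly when you assemble the timeline rather than eyeballing it. A minimal sketch, using the example timestamps above:

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Minutes between two HH:MM timeline entries (same day, UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Detection gap: problem start (14:22) to alert firing (14:31)
print(minutes_between("14:22", "14:31"))  # → 9

# Total incident duration: 14:22 to 15:47
print(minutes_between("14:22", "15:47"))  # → 85
```

Both numbers belong in the write-up: the first feeds a "detect sooner" action item, the second feeds the impact section.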
3. Root causes
Root causes are why the incident happened, separated from the symptoms. Use the "5 Whys" technique to dig past the first layer.
Surface cause: "Feature flag pointed to staging." Why? Config file was copied from a test environment. Why? No CI check validates environment-specific values. Why? The config system doesn't have environment awareness.
A good root cause section distinguishes between:
- Trigger — the specific change or event that started the incident
- Contributing factors — conditions that made the incident possible or worse
- Detection gaps — why it wasn't caught sooner
- Response gaps — why it wasn't fixed faster
Incidents almost always have multiple root causes. "The cert expired" is a cause. "Our monitoring didn't catch cert expiration" is also a cause. Both get listed.
4. Impact
Quantified, specific:
- Customer impact: "~2,400 checkout attempts failed. ~1,900 retried successfully. ~500 abandoned and have been contacted by support."
- Revenue impact: "Estimated $18,000 in lost or delayed transactions."
- Internal impact: "On-call team paged for 1h 25m. Engineering time spent: ~12 person-hours across 4 engineers."
- Data integrity: "No data loss. 3 partial orders are being manually reconciled."
Impact is not "it was bad." Impact is numbers, identities, and time.
5. What went well, what didn't
Usually two bullet lists. This is the retrospective heart of the document.
What went well:
- Alert fired within 9 minutes of the first error.
- On-call engineer identified the upstream cause quickly.
- Rollback mechanism worked as designed.
- Support was looped in within 15 minutes.
What didn't:
- Cert expiration wasn't monitored — a preventable root cause.
- Config file didn't fail a CI check despite pointing to staging.
- The status page wasn't updated until 20 minutes after the incident started.
- No runbook for payment provider incidents; engineer worked from first principles.
Be specific. Every bullet should point to a concrete moment or artifact, not a general feeling.
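The first "what didn't" bullet (unmonitored cert expiration) is the kind of gap a small scheduled check can close. A minimal sketch using Python's standard `ssl` and `socket` modules; the 14-day threshold is an assumption, tune it to your renewal process:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_remaining(not_after: str, now: datetime) -> float:
    """Days until a cert expires, given its notAfter field,
    e.g. 'Mar 15 12:00:00 2027 GMT'."""
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds() / 86400

def fetch_not_after(host: str, port: int = 443) -> str:
    """Fetch the live certificate's notAfter field from a TLS endpoint."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

# In a monitoring job: alert when days_remaining(...) < 14 (assumed threshold).
```

Run it from cron or your monitoring system against every production endpoint, including upstream providers you depend on.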
6. Action items
Where most post-mortems fail. Action items are commitments, not wishes.
Each action item gets:
- An owner (one person)
- A priority (P0 / P1 / P2)
- A due date
- A ticket number (not a bullet in a Google doc — a tracked item in Jira, Linear, GitHub Issues, whatever you use)
Template:
```markdown
## Action items

| Owner  | Priority | Due        | Item                                                   | Ticket    |
| ------ | :------: | ---------- | ------------------------------------------------------ | --------- |
| @alice |    P0    | 2026-03-22 | Add cert expiration monitoring for all prod endpoints  | INFRA-312 |
| @bob   |    P1    | 2026-04-01 | CI check for environment-specific config values        | PLAT-204  |
| @carol |    P1    | 2026-03-25 | Write payment provider incident runbook                | OPS-189   |
| @dave  |    P2    | 2026-04-15 | Reduce detection time — error rate alert threshold     | OBS-77    |
```
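An action item like "CI check for environment-specific config values" can be small. A minimal sketch, assuming a flat JSON config and a hypothetical list of patterns that must never appear in production values:

```python
import json
import re
import sys

# Hypothetical patterns that should never appear in production config values.
FORBIDDEN = [re.compile(p) for p in (r"staging\.", r"\.test\b", r"localhost")]

def check_config(text: str) -> list[str]:
    """Return 'key = value' strings for forbidden values in a flat JSON config."""
    hits = []
    for key, value in json.loads(text).items():
        if isinstance(value, str) and any(p.search(value) for p in FORBIDDEN):
            hits.append(f"{key} = {value}")
    return hits

if __name__ == "__main__":
    # Fail the CI job (non-zero exit) if any forbidden value is found.
    bad = check_config(open(sys.argv[1]).read())
    if bad:
        print("Environment check failed:", *bad, sep="\n  ")
        sys.exit(1)
```

Wire it into CI so the job fails before a staging endpoint can reach a production deploy, which is exactly the path the incident above took.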
Categories of action items:
- Prevent the same failure (primary goal)
- Detect sooner (reduce time-to-detection)
- Mitigate faster (reduce time-to-resolution)
- Communicate better (internal and external)
- Document (runbook, onboarding, decision record)
Track every action item in a dashboard. Review open items at every subsequent retrospective. An action item that ages out without being completed is a red flag — either it wasn't important, or the ownership is broken.
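The "review open items" step can be automated too. A minimal sketch of an overdue-item report, assuming action items exported from your tracker as dicts with `ticket`, `due`, and `status` fields (field names are illustrative):

```python
from datetime import date

def overdue_items(items: list[dict], today: date) -> list[dict]:
    """Return open action items past their due date, oldest first."""
    late = [i for i in items if i["status"] == "open" and i["due"] < today]
    return sorted(late, key=lambda i: i["due"])

items = [
    {"ticket": "INFRA-312", "due": date(2026, 3, 22), "status": "done"},
    {"ticket": "PLAT-204", "due": date(2026, 4, 1), "status": "open"},
    {"ticket": "OPS-189", "due": date(2026, 3, 25), "status": "open"},
]
print([i["ticket"] for i in overdue_items(items, date(2026, 4, 10))])
# → ['OPS-189', 'PLAT-204']
```

Running this at the start of every retrospective makes aging action items visible before anyone has to remember to ask about them.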
Facilitation
The document structure matters, but so does the meeting.
Before the meeting:
- The facilitator (often a service owner or team lead) drafts the timeline and a first-pass root cause analysis.
- Share the draft 24–48 hours before the meeting so people arrive prepared.
During the meeting (60–90 minutes):
- Review the timeline (15 min). Anyone who was there fills gaps, corrects errors.
- Root cause discussion (20 min). Apply the 5 Whys. Challenge surface causes.
- What went well / didn't (15 min). Round-robin so everyone speaks.
- Action items (15 min). For each candidate, assign owner, priority, and due date in the meeting — don't defer.
- Wrap (5 min). Agree who publishes the final doc and by when.
Rules of engagement:
- No laptops except the scribe. Eye contact > multitasking.
- "I" statements about your own role. "We" statements about systems.
- Facilitator gently interrupts blame-shaped language: "Let's focus on what the system allowed, not who pushed the button."
- If emotions run hot, pause. Reconvene the next day if needed.
Publishing
Internal post-mortems go in a shared folder, linked from your incident management tool. Searchable, discoverable, part of onboarding.
Customer-facing post-mortems are a subset of the internal one. Include:
- What happened (in customer terms, not internal jargon)
- When and for how long
- What you're doing about it
- An apology, if warranted
Don't include: internal team names, individual handles, raw logs, speculation about specific engineers' actions. Customers want to know they're in good hands, not rubberneck your process.
Templates vs. reality
No template fits every incident. A network partition post-mortem and a missed marketing launch post-mortem have the same structure but very different content. Use the template as a skeleton, not a cage. If a section doesn't apply, omit it. If an incident needs a section the template doesn't have (e.g., "regulatory disclosure timeline" for a security incident), add it.
The template serves the goal. The goal is learning.
Running retrospectives (not incident-driven)
Retrospectives follow the same structure with lighter weighting:
- What we did (the project or sprint, briefly)
- What went well
- What didn't
- What we'd do differently
- Action items
Common formats:
- Start, Stop, Continue — what to start doing, stop doing, keep doing
- Mad, Sad, Glad — what frustrated people, disappointed them, delighted them
- Sailboat — what's moving us forward, what's holding us back
For ongoing teams, do retros every 2–4 weeks. Skipping them is tempting when things are going well — don't. The point is to notice problems early, not clean up after disasters.
Anti-patterns
Watch for these failure modes:
- The "always on time next time" action item. Vague promises without mechanisms don't change anything. "Add CI check for X" changes things.
- The meeting that produces no document. If it's not written down, it didn't happen.
- The document nobody reads. Indexed, searchable, linked from onboarding — or it's wasted effort.
- Private finger-pointing after the meeting. Symptom of insufficient blamelessness in the room.
- The "we'll fix it later" that never happens. Every action item without an owner and a due date becomes one.
Closing
The test of a good post-mortem isn't the meeting or the document. It's whether the same class of incident happens again in six months. If it does, your action items didn't stick — either they were wrong, they weren't owned, or they weren't executed.
Treat every post-mortem as an investment in the team's future speed. Incidents are going to happen. What separates mature teams from immature ones is whether they compound the lessons.