Post-Mortem Template for Indie SaaS
Let's be honest: when an incident hits your indie SaaS, your first priority is getting things back online. Your second might be catching your breath, and your third is probably wondering if you lost any customers. The thought of sitting down to write a detailed post-mortem can feel like an unnecessary burden, a luxury for bigger teams with dedicated SREs.
But here's the thing: for indie SaaS, a well-executed post-mortem isn't a luxury; it's a critical tool for survival and growth. You don't have an army of engineers to learn from every mistake. Every outage is a precious, albeit painful, learning opportunity. Skipping the post-mortem means you're almost guaranteed to repeat the same mistakes, eroding trust and wasting valuable development time.
This article outlines a practical, no-nonsense post-mortem template designed for indie SaaS founders and small engineering teams. It's about learning, not blaming, and making your system more resilient with each bump in the road.
When to Write a Post-Mortem?
Not every hiccup warrants a full post-mortem. Your time is too valuable. So, what's the threshold?
Consider writing a post-mortem if any of the following occurred:
- Customer-facing impact: Any incident that caused noticeable degradation or downtime for your users.
- Data loss or corruption: Even if quickly recovered, understanding the cause is paramount.
- Significant manual intervention: If an engineer had to wake up at 3 AM or spend hours debugging a critical system.
- Repeat incidents: If you're seeing the same type of problem crop up again, it's time to dig deeper.
For an indie SaaS, a simple "impact-based" rule is often more practical than a complex severity scale. If customers noticed, or if it cost you significant operational time, write it down.
The Core Elements of an Effective Post-Mortem
This template provides a structured approach. Adapt it to your specific needs, but ensure you cover these key areas.
1. Title & Metadata
Start with the essentials. This helps with organization and quick referencing.
- Title: Clear and concise (e.g., "API Latency Spike & Data Ingestion Failure on 2023-10-26")
- Incident Date & Time (UTC): When the incident started.
- Incident Duration: How long the service was degraded/down.
- Services Affected: List all components impacted (e.g., "User API", "Background Jobs", "Database").
- Reporting Engineer(s): Who discovered/reported it internally.
- Lead Investigator(s): Who took point on resolution and post-mortem.
2. Summary
A high-level overview for anyone who needs to quickly grasp the situation without diving into the technical details.
- What happened?
- What was the immediate impact?
- How was it resolved?
- What's the key takeaway?
3. Impact
Detail who was affected and how. Be specific.
- Customer Impact: How many users? What functionality was unavailable? Was there data loss?
- Internal Impact: Was the team blocked? Did it affect internal tools?
- Financial Impact (Optional): If directly measurable (e.g., lost sales during downtime).
4. Root Cause
This is where you dig deep. The goal is to identify the underlying systemic issue, not just the symptom. Use the "5 Whys" technique to avoid stopping at the surface.
- What was the specific technical failure?
- Why did that failure occur?
- What were the contributing factors? (e.g., outdated dependency, race condition, insufficient resource allocation, poor error handling, missing validation).
Pitfall: Don't stop at "the server crashed." Why did it crash? Was it OOM? If so, why did it run out of memory? Was it a memory leak, or an unexpected load spike? If a load spike, why wasn't it handled?
5. Detection
How was the incident first identified? This is crucial for improving your monitoring.
- Method of Detection: (e.g., Automated alert, customer report, internal observation).
- Time to Detect: How long from the incident start until it was identified?
- Alerting System: Which system triggered the alert? (e.g., Tickr, Prometheus, Sentry).
This is where a tool like Tickr shines. If your uptime monitoring detected a probe failure and alerted you via email or Telegram before customers started complaining, that's a win, even amidst an outage. If a customer reported it first, it's a clear signal to improve your monitoring coverage.
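If you don't yet have a dedicated health endpoint for a probe to hit, it doesn't need to be elaborate. Here's a minimal sketch in Python (Flask, the route name, and the dependency check are illustrative assumptions, not a prescribed implementation) of an endpoint that returns a 200 with an "OK" body only when its dependencies look healthy:

```python
# Minimal health endpoint sketch. Flask, the route, and the checks are
# illustrative assumptions; adapt them to your own stack.
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # Hypothetical dependency check, e.g. a cheap "SELECT 1" against the DB.
    return True

@app.route("/health")
def health():
    checks = {"database": check_database()}
    if all(checks.values()):
        # An uptime probe can assert on both the 200 status and the "OK" substring.
        return "OK", 200
    # A failing dependency surfaces as a 503 so the probe alerts.
    return jsonify(checks), 503

if __name__ == "__main__":
    app.run(port=8000)
```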
6. Resolution
Describe the steps taken to mitigate the incident and restore service.
- What was the immediate action? (e.g., "Restarted service X", "Rolled back deployment Y").
- What were the subsequent steps to fully restore functionality?
- Who performed these actions?
7. Timeline
A detailed, chronological sequence of events is invaluable for understanding the incident flow and identifying delays. Include timestamps for every significant action.
- 09:00 UTC: Tickr alert received: "Probe failed for https://api.yourproduct.com/health with body substring 'OK' not found."
- 09:02 UTC: Engineer A acknowledges alert, starts investigation. Observes high CPU on database server.
- 09:05 UTC: Engineer A identifies a runaway query via `pg_stat_activity`.
- 09:07 UTC: Engineer A terminates the rogue query. CPU returns to normal.
- 09:08 UTC: Tickr probe reports healthy again.
- 09:10 UTC: Service fully restored.
- 09:15 UTC: Engineer A sends internal update to the team.
This timeline helps pinpoint where improvements can be made, whether it's faster detection, quicker diagnosis, or more efficient resolution steps.
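For reference, the diagnosis step at 09:05 boils down to a single query against `pg_stat_activity`. A minimal sketch in Python (psycopg2 and the connection string are assumptions; adapt to your own setup) might look like this:

```python
# Sketch of the diagnosis step: find long-running queries via pg_stat_activity.
# psycopg2 and the DSN are assumptions, not part of the original incident.
import psycopg2

conn = psycopg2.connect("dbname=yourproduct")  # hypothetical DSN

with conn, conn.cursor() as cur:
    # Active statements that have been running for more than 60 seconds.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '60 seconds'
        ORDER BY runtime DESC;
    """)
    for pid, runtime, query in cur.fetchall():
        print(f"pid={pid} runtime={runtime} query={query[:80]}")

    # Once you are confident which pid is the runaway query, cancel it:
    # cur.execute("SELECT pg_cancel_backend(%s);", (pid,))
```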
8. Lessons Learned & Action Items
This is the most critical section. What will you do differently to prevent recurrence or minimize impact? These must be concrete, actionable, and assigned to a specific person.
- Preventative Actions:
  - Example 1 (Code/DB): "Add a `LIMIT 100` clause to the `GET /api/v1/reports` endpoint to prevent excessively large result sets. Assign to John, due EOD Friday." (See the sketch after this list.)
  - Example 2 (Monitoring): "Set up a new Tickr probe for the `POST /api/v2/jobs` endpoint, configured to check for a successful 200 OK response and a specific body substring confirming job queue health. Assign to Sarah, due next Tuesday."
- Detection Improvements:
- Add more granular metrics/alerts (e.g., "Set up a Prometheus alert for DB connection pool exhaustion").
- Improve existing monitoring (e.g., "Lower Tickr probe failure threshold from 3 consecutive failures to 2").
- Resolution Improvements:
- Document runbooks for common incident types.
- Improve access to diagnostic tools.
- Process Improvements:
- Review deployment process for potential risks.
- Conduct a load test for specific scenarios.
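To show how small these preventative changes usually are, here is a sketch of the first action item above. Flask, psycopg2, and the table and column names are all illustrative assumptions; the point is the hard `LIMIT` on the query:

```python
# Sketch of the first preventative action: hard-cap the result set size.
# Flask, psycopg2, and the table/column names are illustrative assumptions.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route("/api/v1/reports")
def list_reports():
    conn = psycopg2.connect("dbname=yourproduct")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # LIMIT 100 ensures a single request can no longer pull an unbounded
        # result set and starve the database.
        cur.execute(
            "SELECT id, title, created_at FROM reports "
            "ORDER BY created_at DESC LIMIT 100;"
        )
        rows = cur.fetchall()
    conn.close()
    return jsonify(
        [{"id": r[0], "title": r[1], "created_at": r[2].isoformat()} for r in rows]
    )
```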
Pitfall: Avoid vague action items like "be more careful." Instead, aim for "Implement automated unit tests for critical data transformation logic," or "Refactor the authentication service to use a more resilient token validation library."
9. Communication Plan (Optional but Recommended)
Decide how you'll communicate with customers, if necessary.
- Will you send an email? Post on your status page?
- What will the message convey? (Transparency builds trust).
Pitfalls to Avoid
- Blame Game: Post-mortems are about systems and processes, not individuals. Focus on "how" and "what," not "who."
- Shallow Root Cause: Don't stop at the first "why." Keep digging until you find the true underlying issue.
- Vague Action Items: If an action item isn't specific, measurable, assignable, relevant, and time-bound (SMART), it's likely to be ignored.
- Not Following Up: Action items that are never completed make the whole exercise pointless. Track them like any other piece of work.