Post-mortem: {Incident title}

Template — copy this page into operations/postmortems/YYYY-MM-DD-slug.md after a customer-impacting incident. Delete this callout when you're done.

Date: YYYY-MM-DD
Severity: sev-1 / sev-2 / sev-3
Owner: @incident-commander
Status: draft / final

TL;DR

An executive summary in two or three sentences: what broke, who was affected, how it was fixed, and what we'll do about it.

Impact

  • Users affected: number / percentage.
  • Calls affected: number, by status.
  • Duration: start → end (UTC).
  • What worked / didn't work for users.
  • Money impact (if any): lost revenue, refunds, provider costs.

Timeline

All times UTC. Be ruthless about including the times of detection, escalation, mitigation, and verification.

| Time  | Event |
| ----- | ----- |
| 14:02 | Sentry alert fires: OpenAiTimeoutException spike. |
| 14:04 | On-call ack. |
| 14:08 | Identified that OpenAI was returning 503s. |
| 14:12 | Decided to fail-fast and surface "analysis temporarily unavailable" to clients. |
| 14:20 | OpenAI returns to baseline. |
| 14:25 | Backlog of failed calls re-queued. |
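
The intervals between timeline milestones are worth stating explicitly. A minimal sketch of the arithmetic, using the example times above: the milestone names are illustrative, and treating the 14:12 fail-fast decision as the mitigation point is an assumption.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes from start to end, both 'HH:MM' on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times taken from the example timeline; the milestone names are illustrative.
timeline = {
    "detected": "14:02",
    "acknowledged": "14:04",
    "mitigated": "14:12",   # assumption: the fail-fast decision is the mitigation
    "recovered": "14:20",
}

time_to_ack = minutes_between(timeline["detected"], timeline["acknowledged"])
time_to_mitigate = minutes_between(timeline["detected"], timeline["mitigated"])
time_to_recover = minutes_between(timeline["detected"], timeline["recovered"])
```

The helper assumes the incident does not cross midnight UTC; an incident that does would need full timestamps rather than `HH:MM` strings.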

What went well

  • ...
  • ...

What went poorly

  • ...
  • ...

Where we got lucky

  • ...

Root cause

The mechanical cause and the contributing factors. Avoid blame. Look for systemic issues.

The pipeline didn't degrade gracefully when OpenAI returned 503s; we retried tightly and exhausted our worker pool, which made existing transcript reads slow.
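
For the tight-retry failure mode in this example, a capped exponential backoff with jitter is one common remedy. The following is a minimal sketch, not the actual client code: `TransientUpstreamError`, `call_with_backoff`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class TransientUpstreamError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 503."""


def backoff_delays(base=0.5, cap=8.0, attempts=5):
    """Return the capped exponential delay (seconds) before each retry."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def call_with_backoff(fn, base=0.5, cap=8.0, attempts=5,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying transient failures with capped, jittered backoff.

    Makes one initial call plus at most `attempts` retries, then re-raises
    so the caller can fail fast instead of holding a worker.
    """
    delays = backoff_delays(base, cap, attempts)
    for retry in range(attempts + 1):
        try:
            return fn()
        except TransientUpstreamError:
            if retry == attempts:
                raise
            # Full jitter: sleep a random fraction of the capped delay so
            # that many workers do not retry in lockstep.
            sleep(delays[retry] * rng())
```

The `sleep` and `rng` parameters are injectable only to keep the sketch testable without real waiting. Bounding the retry count matters as much as the delay: a worker that retries forever is still a worker that cannot serve other requests.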

Action items

| Action | Owner | Priority | Due |
| ------ | ----- | -------- | --- |
| Add exponential backoff with cap to OpenAiAnalysisClient | @owner | high | YYYY-MM-DD |
| Cap worker thread pool size; add metric | @owner | medium | YYYY-MM-DD |
| Update analysis doc with retry behaviour | @owner | low | YYYY-MM-DD |
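
As an illustration of the "cap worker thread pool size; add metric" item, here is a hedged sketch of a pool that bounds both its threads and its in-flight work, and exposes two counters a metrics exporter could read. `BoundedPool`, `try_submit`, and the limits are hypothetical; the real pipeline may be structured quite differently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedPool:
    """Thread pool with a fixed size and a hard limit on in-flight work."""

    def __init__(self, max_workers=4, max_in_flight=8):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.in_flight = 0   # gauge: tasks queued or running
        self.rejected = 0    # counter: tasks refused because the pool was full

    def try_submit(self, fn, *args):
        """Submit fn, or return None immediately if the pool is saturated."""
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            return None
        with self._lock:
            self.in_flight += 1
        try:
            future = self._executor.submit(fn, *args)
        except BaseException:
            self._release()
            raise
        # Free the slot when the task finishes, whether it succeeded or not.
        future.add_done_callback(lambda _f: self._release())
        return future

    def _release(self):
        with self._lock:
            self.in_flight -= 1
        self._slots.release()

    def shutdown(self):
        self._executor.shutdown(wait=True)
```

Rejecting work when saturated, rather than queueing it without limit, keeps a slow upstream from starving unrelated requests; a rising `rejected` count is then the signal to alert on.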

Lessons learned

A short, opinionated takeaway worth remembering on future on-call shifts.