How to Write a Good Incident Update

Most incident updates are terrible

Here's a real example of a bad incident update:

"We are currently investigating increased error rates on our primary API cluster. The issue appears to be related to elevated latency on our database layer. Our engineering team is actively investigating."

What did the user learn? Nothing. It's "We're looking into it" with extra words. The user still doesn't know: Is my data safe? When will it be fixed? Should I tell my users?

Now here's a better version:

"Our API is returning errors for about 30% of requests. Your data is safe. We've identified the cause and a fix is being deployed now. We expect full recovery within 20 minutes."

Same incident, but this time the user knows what's broken, what's safe, what's happening, and when it'll be fixed.

The four questions every update should answer

When you write an incident update, remember that your users have four questions:

1. What's broken?

Be specific. "Some users are experiencing issues" tells nobody anything. Instead:

"The dashboard is loading slowly (10-15 second load times instead of the usual 2 seconds)"
"Login is failing for users who sign in with Google"
"Webhook deliveries are delayed by approximately 30 minutes"

Use language your users understand. They think in features, not infrastructure.

2. What's the impact?

Tell users what this means for them practically:

"You can still read data but writes will fail"
"Emails are queued and will be delivered once resolved - no messages will be lost"
"New signups are temporarily unavailable, existing users are unaffected"

If their data is safe, say so. If payments aren't affected, say so. Address the scariest possibility even if it's not happening.

3. What are you doing about it?

Be honest about where you are in the process:

Just started investigating: "We're aware of the issue and are investigating the cause"
Know the cause: "This is caused by [simple explanation]. We're working on a fix"
Fix deployed: "A fix has been deployed and we're monitoring recovery"
Waiting for propagation: "The fix is live but some users may experience issues for the next 15 minutes as caches update"

4. When will it be fixed?

This is the hardest one. Give an estimate if you have one you trust. If you don't, it's okay to say you don't know yet and commit to a time for the next update:

"We expect resolution within the next hour"
"We'll provide an update in 30 minutes with an estimated timeline"
"Recovery is in progress and we're seeing improvement"

Never promise a time you can't deliver. "We'll update you in 30 minutes" is better than "this will be fixed in 10 minutes" when you're not sure.
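
If you draft updates with a script, the rule above can be written as a small decision. This is only a sketch: the function name, the `confident` flag, and the 30-minute default are illustrative assumptions, and the wording is adapted from the examples above.

```python
from typing import Optional


def timeline_sentence(
    eta_minutes: Optional[int],
    confident: bool,
    next_update_minutes: int = 30,  # assumed default, taken from the example above
) -> str:
    """Return the 'when will it be fixed?' sentence for an update.

    Only promise a fix time when there is an estimate AND the team
    trusts it. Otherwise, promise the next update instead.
    """
    if eta_minutes is not None and confident:
        return f"We expect resolution within the next {eta_minutes} minutes."
    return (
        f"We'll provide an update in {next_update_minutes} minutes "
        "with an estimated timeline."
    )


print(timeline_sentence(eta_minutes=10, confident=False))
print(timeline_sentence(eta_minutes=20, confident=True))
```

The first call has an estimate but no confidence in it, so it falls back to promising the next update rather than a fix time.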

A simple template

Here's a template that works for any incident update:

[What's happening] in plain language.

[Impact] - what users can and can't do.

[What we're doing] - current status of the fix.

[Next update] - when users will hear from us again.

Example:

Our API is returning 500 errors for approximately 20% of requests. Dashboard reads are working normally but saving changes may fail. We've identified a database connection issue and are scaling up our connection pool. We expect full recovery within 15 minutes and will update this page when resolved.

That's 4 sentences. Takes 30 seconds to read. Answers all four questions.
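
If your team posts updates through a script or bot, the template can be enforced in code so that no section gets skipped under pressure. Here's a minimal sketch; the `IncidentUpdate` class and its field names are made up for illustration and are not part of any particular status page tool.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentUpdate:
    """One update; each field maps to one section of the template."""
    whats_happening: str   # What's happening, in plain language
    impact: str            # What users can and can't do
    what_were_doing: str   # Current status of the fix
    next_update: str       # When users will hear from us again

    def render(self) -> str:
        sentences = []
        for f in fields(self):
            text = getattr(self, f.name).strip()
            if not text:
                # Refuse to publish an update that skips a section.
                raise ValueError(f"missing section: {f.name}")
            # Make sure each section ends like a sentence.
            if not text.endswith((".", "!", "?")):
                text += "."
            sentences.append(text)
        return " ".join(sentences)


update = IncidentUpdate(
    whats_happening="Our API is returning 500 errors for approximately 20% of requests",
    impact="Dashboard reads are working normally but saving changes may fail",
    what_were_doing="We've identified a database connection issue and are scaling up our connection pool",
    next_update="We expect full recovery within 15 minutes and will update this page when resolved",
)
print(update.render())
```

Raising an error on an empty section is a deliberate choice here: a missing "impact" line is exactly the kind of gap that is easy to overlook mid-incident.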

Tone matters

Skip the corporate speak. "We sincerely apologize for any inconvenience this may cause" sounds like it was generated by a legal team. Just say "Sorry about this." But don't swing too far casual either - "Oops, stuff's broken lol" undermines confidence when your users' businesses depend on your service.

The right tone is direct and calm. You're the person in the room who knows what's happening. Act like it.

One more thing: don't blame your vendors. "Our cloud provider is experiencing issues" might be technically true, but your users chose your product, not your cloud provider. Own the incident.

Timing matters more than perfection

The worst incident update is the one that comes 2 hours late. A quick "We're aware of issues affecting [feature] and are investigating. More details in 15 minutes." is better than a perfect update that arrives after users have already found out from Twitter.

First update: within 5 minutes of detection. Can be brief.
Follow-up updates: every 30 minutes during active incidents.
Resolution update: when the issue is fully resolved, summarize what happened.
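
That cadence is simple enough to put into a reminder script. The sketch below assumes the 5-minute and 30-minute intervals above; the function names are hypothetical, and a real setup would hook this into whatever paging or chat tool you already use.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=5)   # first update after detection
FOLLOW_UP_EVERY = timedelta(minutes=30)      # cadence during an active incident


def next_update_due(detected_at: datetime,
                    last_update_at: Optional[datetime] = None) -> datetime:
    """Return the deadline for the next public update."""
    if last_update_at is None:
        # Nothing posted yet: the clock starts at detection.
        return detected_at + FIRST_UPDATE_WITHIN
    return last_update_at + FOLLOW_UP_EVERY


def is_overdue(now: datetime, detected_at: datetime,
               last_update_at: Optional[datetime] = None) -> bool:
    """True if the next update should already have been posted."""
    return now > next_update_due(detected_at, last_update_at)


detected = datetime(2024, 1, 1, 12, 0)
first_post = datetime(2024, 1, 1, 12, 4)
print(next_update_due(detected))              # 2024-01-01 12:05:00
print(next_update_due(detected, first_post))  # 2024-01-01 12:34:00
```

The follow-up deadline is measured from the last update actually posted, not from detection, so a late update doesn't shorten the next interval.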

Let AI help

If writing clear, calm incident updates under pressure isn't your strength (it isn't for most people), let AI draft them for you.

Chirp generates incident updates automatically when monitors detect failures. The AI writes the initial report, suggests updates as you make progress, and writes the post-incident summary when it's resolved. You review and publish instead of drafting from scratch while your site is on fire.

There are other tools that do this too. The point is: incident communication is stressful enough without also being a writing exercise.