
Slack on Slack: Turning channels into our central command for incident response

How we provide essential context to responders and stakeholders, fix problems quickly, and make sure they don’t happen again

Illustration: a team comes together to quickly put out a very small fire (image credit: Giacomo Bagnara)

When fire departments respond to a five-alarm blaze, their ability to save lives and property depends on how quickly they can establish roles, communicate about the nature and extent of the fire, and agree on the best path forward.

As a former search-and-rescue pilot and commander, and an emergency services leader at Burning Man, I’ve addressed my fair share of real-world crises. The service glitches and bugs that we deal with at Slack aren’t life-or-death, but the way we remedy them, using our own tool, really isn’t all that different from the best practices used by first responders around the world.

At Slack, we don’t really have a dedicated incident response team. Instead, we train all engineers to be incident responders (through a three-hour class that new hires are expected to complete within their first 90 days), and about 25% also train as incident commanders (ICs). Anyone in the company can “pull the fire alarm,” which automatically alerts on-call engineers and incident commanders, and get a response any time of day, seven days a week.

By cleverly using automation and creating a central location where all stakeholders can quickly view relevant context around the issue at hand, we can not only shorten the incident but reduce its impact on our customers and ourselves.

Rally responders in a central channel

The incident response process at many companies revolves around getting everybody on a phone call or, when possible, into the same physical meeting room. But if you’ve ever been part of an incident response conference call, you know how disruptive it can be when somebody new enters and the meeting pauses so they can catch up. It’s a comedy of errors worthy of an SNL skit.

As soon as an incident is spotted by one of our monitoring integrations, like PagerDuty or Grafana, we use a slash command (/incident-pde create) that we developed to spin up a dedicated channel in Slack and post a message inside with relevant documentation links. Having a separate channel for each incident establishes a single source of truth for the response team.

Each incident channel we create follows a common naming convention: #incd-YYMMDD-description. This way, a channel like #incd-200602-elevated-api-errors is easy to find both during and after an incident (and the fastest way to do so is accessing Slack’s Quick Switcher with command+K or control+K).

Anyone who joins the user group called @incident-next-followers is automatically added to the incident channel at its creation.
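
As a rough sketch of what that kind of automation can look like, here's a minimal example against Slack's Web API using the Python slack_sdk. It's not our actual /incident-pde code: the runbook URL and user-group ID are placeholders, and a real command would add deduplication, error handling and the right token scopes (channels:manage, chat:write, usergroups:read and so on).

```python
import os
from datetime import datetime, timezone

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

# Placeholders -- a real command would look these up from config.
RUNBOOK_URL = "https://example.com/incident-response-runbook"
FOLLOWERS_USERGROUP_ID = "S0123456789"  # the @incident-next-followers user group


def create_incident_channel(description: str) -> str:
    """Spin up #incd-YYMMDD-<description> and seed it with context."""
    date = datetime.now(timezone.utc).strftime("%y%m%d")
    channel = client.conversations_create(name=f"incd-{date}-{description}")
    channel_id = channel["channel"]["id"]

    # Kick off the channel with relevant documentation links.
    client.chat_postMessage(
        channel=channel_id,
        text=f"New incident: *{description}*\nRunbook: {RUNBOOK_URL}",
    )

    # Automatically add everyone in @incident-next-followers.
    followers = client.usergroups_users_list(usergroup=FOLLOWERS_USERGROUP_ID)["users"]
    if followers:
        client.conversations_invite(channel=channel_id, users=followers)

    return channel_id


create_incident_channel("elevated-api-errors")
```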

Each incident is led by an incident commander (IC). Like snapping together Lego blocks, the IC builds whatever team structure the particular incident calls for, ensuring that the following people are added to the channel:

  1. Subject-matter experts (who could be pulled in with a user group, e.g., @frontend-eng, as in the sketch after this list), who will create, test and deploy technical fixes
  2. A customer experience team liaison to bridge our engineers and any impacted customers
  3. An executive-team liaison for any high-profile or urgent incidents
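
Pulling in those subject-matter experts, as referenced in the first item above, can be as simple as mentioning their user group in the incident channel; in the Web API, user-group mentions are written with the <!subteam^ID> syntax. Both IDs below are placeholders.

```python
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

INCIDENT_CHANNEL_ID = "C0123456789"       # placeholder: the #incd-... channel
FRONTEND_ENG_USERGROUP_ID = "S0ABCDEF12"  # placeholder: the @frontend-eng user group

# <!subteam^ID> renders as an @-mention of the user group, notifying its
# members and pulling the right subject-matter experts into the channel.
client.chat_postMessage(
    channel=INCIDENT_CHANNEL_ID,
    text=f"<!subteam^{FRONTEND_ENG_USERGROUP_ID}> we need eyes on the elevated API error rate.",
)
```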

Faster communication = faster resolutions

No matter the incident, it’s rare to have all your people together at the very beginning of a problem. A few people start the investigation, a couple of experts are tagged in, and likely a few more sometime after that. Waiting to start until you have everyone assembled? That’s simply not an option. This is where having a dedicated Slack channel for the incident can speed up your resolution time.

Anyone who joins the incident in progress can quickly scroll up to see the full history of what’s already been explored, ruled out and accomplished, without interrupting ongoing investigations. In addition to helping responders get up to speed as they join the incident response, this means that other stakeholders (such as an executive, or an account rep whose customer is impacted by the incident) can follow along in the channel and see what fixes are being implemented and when, without jogging the elbow of the responders.

We centralize that information in a few different ways:

1. We rely on the channel topic—visible at the top of each incident channel—to help new responders get up to speed. The topic clearly indicates the IC, severity (on a descending scale of Sev-1 to Sev-4) and status (active, under control, paused, all clear, etc.). The IC periodically updates a pinned message with the latest status report for a quick snapshot of the most essential information and any immediate needs. (See the sketch after this list.)

2. We use threads for deep dives into particular subjects to create focus and keep from cluttering the main conversation. Threads are incredibly powerful, and they work especially well because we have a social convention of sharing our discoveries or decisions back to the main channel.

3. Emoji reactions make it easy to scan for the status of each request. We convey quick responses using:

  • 👀 to mean “I’m looking at this”
  • ✅ to mean “This is done”
  • 📮 for post-incident-response follow-up

4. We also use emoji in our user status. For example, the IC will mark themselves with a ⛑ symbol, both in their status and in the channel description, so others can see at a glance who’s running the incident response. Because we may have multiple incidents running simultaneously, an IC will also name which one they’re responsible for in their user status, and anyone can view that by simply hovering over their status emoji.

5. Thanks to the Reacji Channeler, we can even use emoji reactions to instantly route a message from one public channel to another. For example, when someone shares the post-incident review doc in the incident channel, they’ll tag it with a particular reacji. This shares that message in an #announce-incident-reviews channel, which anyone can follow. It’s an app you can install and use out of the box—no code required.

6. For large-scale or complex incidents, we use additional channels that follow the same naming convention as the main channel (and for those where we’re handling sensitive data or policy matters, additional private channels). We use the “share message” feature in Slack to copy key messages from the main incident channel to these auxiliary channels.
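
To make items 1 and 4 a little more concrete, here's the sketch referenced above, again using the Python slack_sdk. The channel ID, names and status text are placeholders, and because users.profile.set only changes the calling user's own status, the last call assumes the IC's user token rather than the bot's.

```python
import os
from slack_sdk import WebClient

bot = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
ic = WebClient(token=os.environ["IC_USER_TOKEN"])  # the IC's own user token

INCIDENT_CHANNEL_ID = "C0123456789"  # placeholder: the #incd-... channel

# Item 1: keep the topic pointing at the essentials -- IC, severity, status.
bot.conversations_setTopic(
    channel=INCIDENT_CHANNEL_ID,
    topic="IC: @jane | Sev-2 | Status: active",
)

# Post and pin the latest status report so newcomers see it immediately.
# (Later updates would edit this message via chat.update on report["ts"].)
report = bot.chat_postMessage(
    channel=INCIDENT_CHANNEL_ID,
    text="Status: active. Elevated API errors; rollback of the suspect deploy in progress.",
)
bot.pins_add(channel=INCIDENT_CHANNEL_ID, timestamp=report["ts"])

# Item 4: the IC marks themselves with the ⛑ emoji and names the incident
# they're running. users.profile.set only changes the caller's own status,
# hence the user token; swap the shortcode for whatever your workspace uses.
ic.users_profile_set(
    profile={
        "status_emoji": ":helmet_with_white_cross:",
        "status_text": "IC for incd-200602-elevated-api-errors",
    }
)
```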

All of your incident data, all in one place

If we resolve individual issues without digging into their root causes, we’ll create an infinite game of Whac-A-Mole. At Slack, our entire engineering org has three priorities:

  1. Fix Sev-1 and Sev-2 incidents, fast
  2. Make sure those incidents don’t happen again by running an effective post-incident review and agreeing on next steps
  3. Work on objectives and key results by building innovative products and features

Only when priorities one and two are complete can we focus on number three, which means we take post-incident reviews—and access to any supporting documentation—seriously.

We have the time-stamped discussions, screenshots, graphs and resulting decisions all collected in the same place, and much of that links out to other systems and dashboards. Plus, the ability to archive channels means that the historical record is preserved and can be referenced to identify patterns and onboard new engineers more effectively.

Remember the 📮 reacji mentioned earlier? A quick search for has::postbox: in:#incd-YYMMDD-description will show a list of all the messages the response team flagged for discussion during the review. And when the incident review is complete, we’ll share the final report back to the dedicated incident channel, where it’s neatly stored alongside all the relevant context. (Extra suggestion: Use a custom emoji reaction and the aforementioned Reacji Channeler to copy all of your incident review docs to a dedicated #announce-incident-reviews channel.)
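
If you'd rather pull that list programmatically than type the query into the search box, a sketch along these lines works; note that search.messages requires a user token with the search:read scope, and the channel name below is just an example.

```python
import os
from slack_sdk import WebClient

# search.messages only works with a user token that has the search:read scope.
client = WebClient(token=os.environ["SLACK_USER_TOKEN"])

# The same query you'd type into the search box.
resp = client.search_messages(query="has::postbox: in:#incd-200602-elevated-api-errors")

for match in resp["messages"]["matches"]:
    # Each match includes the flagged message's text and a permalink,
    # handy for dropping straight into the post-incident review doc.
    print(match["permalink"], "-", match["text"][:80])
```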

When a fire is allowed to burn unchecked, it gets hotter and larger. IT incidents work much the same way: When incidents take longer to resolve, the disruption grows exponentially, not only for your customers and your bottom line but also for your hardworking engineering team. Supplement your emergency plan with these best practices, and you can quickly resume service and let more of your on-call engineers sleep peacefully through the night.

