It’s been a rough couple of days at Slack HQ. We’ve had two separate incidents in which too many users were unable to connect for too long. First, and most importantly, we want to apologize. While we’re glad it didn’t affect more users, any downtime is too much downtime.
In the first event on Tuesday (October 14th), all users were locked out of Slack for 14 minutes (users who had already established connections could continue to work) and, following that, 13% of users had poor or no availability for periods of up to two hours. Today, at 11:27am, there was a separate incident with a similar effect for a subset of users. That one was resolved an hour later, at 12:28pm (all times and dates are San Francisco local time).
While working hard on restoring the service to what you expect it to be, we were also reading the feedback coming in. We know that people worry when they can’t get into a service they rely on. It’s hard to know whether it’s a blip or the start of an unpleasant new trend. We’re working very hard to make sure it’s the former, but we wanted to give some detail as to what happened and what we’re doing to address it. At the bottom of this post, you’ll find a day-by-day breakdown of the events.
Slack has grown extremely quickly. With the exception of national holidays, we’ve broken a usage record every single day for the last year (every Monday has been our best-ever Monday, every Tuesday our best-ever Tuesday, and so on). Throughout that year, up to this week, we’ve had a grand total of 89 minutes where Slack was functioning at any level lower than “perfect” (and the majority of those minutes were things like “slow posting from integrations”, which were minimally disruptive).
You can see the quarterly history summarized in the chart above and view every little detail on our status site’s calendar. Incidents, which we mark in yellow and red on the calendar, typically only affect certain integrations or users. Almost all of our scheduled maintenance is conducted in a way that goes totally unnoticed by users, even when their team is migrated to another server cluster mid-conversation.
This week’s performance, however, was far worse and it was far below the standard to which we hold ourselves. Nevertheless, we feel confident — and want you to feel confident — that this is neither typical for Slack nor indicative of future performance.
We’ll be continuing to monitor and improve things, and in the interests of transparency, wanted to share with Slack users the details we shared internally in our initial post-mortem.
Monday, October 13th

- A portion of the internal network of our hosting provider suddenly became unavailable. Most of the affected machines were in our search clusters.
- This caused a large backlog of message and file data which needed to be re-fed into our search index (about 10 hours’ worth of data for 2% of teams).
- This event did not cause the following day’s problems (and went almost completely unnoticed by users), but it exacerbated what followed because our work queues already had a steady stream of events to process.
Tuesday, October 14th

- We undertook some routine maintenance, during which, as a result of an automation malfunction, corrupted code was deployed to our web servers and “job queue workers” (machines that process asynchronous tasks, such as link unfurling).
- This was noticed immediately and fixed within 14 minutes. However, 13% of Slack’s users were disconnected during this window.
- Those users all attempted to reconnect simultaneously.
- The massive number of simultaneous reconnections demanded more database capacity than we had readily available, which caused cascading connection failures.
- We immediately began to add additional database capacity to cover this kind of situation. We anticipate completing this provisioning in the next two days.
- Most users were able to reconnect within 30 minutes, but 5% of users remained disconnected for up to 2 hours while we recovered their database cluster.
- We also immediately optimized and changed the methods by which clients reconnect, which should reduce the total capacity needed to recover from a mass disconnect-and-reconnect event.
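Client-side reconnection changes of this kind commonly take the form of exponential backoff with jitter, which spreads a “thundering herd” of simultaneous reconnections out over time instead of letting every client retry at once. Below is a minimal sketch; the function name and constants are illustrative assumptions, not Slack’s actual client code.

```python
import random

BASE_DELAY = 1.0   # seconds before the first retry (illustrative value)
MAX_DELAY = 60.0   # cap so no client ever waits unboundedly long (illustrative)


def reconnect_delay(attempt: int) -> float:
    """Delay in seconds before reconnection attempt `attempt` (0-based).

    "Full jitter": pick a uniformly random delay between 0 and an
    exponentially growing cap, so clients that were disconnected
    together do not all retry together.
    """
    cap = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0, cap)
```

Because each client draws a random delay, the load from a mass disconnect arrives at the servers as a spread-out trickle rather than a single spike, which reduces the peak database capacity needed during recovery.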
Wednesday, October 15th

- Late on Tuesday night, Google released a statement about the POODLE vulnerability, which necessitated disabling SSLv3, as reflected in our tweet at the time.
- We reconfigured internal services and then our API to disable SSLv3 support without incident.
- We then updated the message servers that power Slack’s real-time features and, in the build process, introduced a bug that would cause each message server to crash once at some point in the future.
- We began gracefully restarting the message servers, but one particularly heavily used server crashed before we got to it.
- The simultaneous reconnections from all of those users overwhelmed the databases backing their teams and prolonged the outage.
- The situation was exacerbated by a change to the web and desktop clients made late on Tuesday. To ensure that historical messages appeared correctly after Tuesday’s incident, each client cleared its internal cache and reloaded all history from scratch, which increased the strain on the database servers when clients reconnected en masse.
We know you depend on Slack, and that any downtime in a service you’ve come to love and trust is worrisome. It will take time to rebuild the trust we’ve lost. We can’t change history to make the little red squares from this week’s status calendar back into little green ones: but we want you to know that we’re invested and working around the clock to keep you consistently, confidently connected on Slack.