In a blog post published a few hours ago, Nick Rockwell, Fastly's Senior Vice President of Engineering and Infrastructure, wrote: "We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change. We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.
"This outage was broad and severe, and we're truly sorry for the impact to our customers and everyone who relies on them."
Rockwell continues, "On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
"Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors."
Unfortunately, when you're one of the Internet's leading CDN providers, 49 minutes is a VERY long time.
Rockwell's blog post offers a simplified timeline.
The only part of this timeline that bothers us is in the final few lines: over seven hours passed between recovery and the deployment of a bug fix.
Rockwell concludes the company's mea culpa with this:
"Where do we go from here?
"We're deploying the bug fix across our network as quickly and safely as possible.
"We are conducting a complete post mortem of the processes and practices we followed during this incident.
"We'll figure out why we didn't detect the bug during our software quality assurance and testing processes.
"We'll evaluate ways to improve our remediation time."
Of course, the Internet is a complex beast, but outages like this should never happen. With luck, everyone will learn from this. And a note to CDN customers: don't put all your delivery eggs in one basket.
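For readers wondering what spreading those eggs might look like in practice, here is a minimal sketch of client-side failover between two CDNs. It is not anything Fastly or Rockwell describes. The hostnames, the http_probe helper, and the pick_cdn function are all hypothetical, and a real multi-CDN setup would more likely do this at the DNS or load-balancer layer.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Hypothetical health check: True if a HEAD request gets a non-5xx reply."""
    request = urllib.request.Request(base_url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; only 5xx counts as "down" here.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return False


def pick_cdn(
    base_urls: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first base URL whose probe succeeds, or None if all fail."""
    for url in base_urls:
        if probe(url):
            return url
    return None


if __name__ == "__main__":
    # Placeholder hostnames, listed in order of preference.
    cdns = ["https://cdn-a.example.com", "https://cdn-b.example.com"]
    print(pick_cdn(cdns))
```

The probe is passed in as a parameter so the selection logic can be exercised without touching the network. The order of the list expresses which provider is preferred, and the second entry is only used when the first one fails its check.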