In a tweet issued around 9:30pm last night (AEST) Fastly said, "We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online."
The Fastly status page showed this:
For readers unfamiliar with a Content Distribution Network (CDN), this is a service that takes copies of a website and distributes them to servers scattered around the world.
For instance, you may be reading iTWire.com in London or perhaps Anchorage in Alaska. If we wanted to improve the service you receive, we would engage a CDN to store our content and whenever you access iTWire.com, the content would be delivered to you from a local server.
This has two broad positives. Firstly, for us, we don't have to serve every request from our own computing resources; instead we only respond to a small number of requests from the CDN servers. Secondly, you will get a much snappier response to your page requests - the CDN servers are very big and fast and they're also closer to you.
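To make the first of those points concrete, here is a minimal Python sketch of the idea. It is an illustration only, not how any particular CDN is built: the `Origin` and `EdgeServer` classes and the page content are hypothetical, and a real CDN would also expire and refresh its copies.

```python
class Origin:
    """Stands in for the publisher's own web server (hypothetical)."""

    def __init__(self, pages):
        self.pages = pages
        self.requests_served = 0

    def fetch(self, path):
        self.requests_served += 1
        return self.pages[path]


class EdgeServer:
    """Stands in for one CDN server close to a group of readers (hypothetical)."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path):
        # Serve the local copy when there is one; otherwise ask the
        # origin once and keep the result for later readers.
        if path not in self.cache:
            self.cache[path] = self.origin.fetch(path)
        return self.cache[path]


origin = Origin({"/": "<html>front page</html>"})
london = EdgeServer(origin)
anchorage = EdgeServer(origin)

# 1,000 readers near each edge server request the front page.
for _ in range(1000):
    london.get("/")
    anchorage.get("/")

print(origin.requests_served)  # 2: one fetch per edge server
```

In this toy setup, 2,000 page views cost the origin just two requests; everything else is answered by the server nearest the reader.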
In order to gain some context within the local IT industry, we asked a number of vendors for their thoughts on the outage itself and also how organisations should protect their presence on the internet.
Lotem Finkelstein, Head of Threat Intelligence at Check Point, offered this: "While we don't yet know the reason for the widespread outage at cloud service company Fastly, it's important to understand why the impact is so extensive. Fastly is a CDN - a content delivery network. CDNs generate replicas of original websites for the website owners to allow load balancing."
"When a CDN fails, it means that all the replicas are unavailable and no one is able to see the content from the original server. So it seems like Amazon, Reddit, Twitch and all these big sites have been attacked in unison, but they were not attacked. There is no outage for these companies. The only outage was at Fastly, the CDN that serves them."
Leo Lynch, Director, Asia Pacific, StorageCraft, an Arcserve Company, reminds us that, "If the last year has taught us anything, it's that we never know what's around the corner. The latest Fastly mass internet outage, which caused many Australians to see the 'HTTP Error 503' on Tuesday night when accessing their favourite websites, is only one of many severe disruptions that have plagued businesses in the past year.
"While Mercer reports that only about half of businesses have a business continuity plan, it's often this type of thorough, proactive planning that helps companies successfully tackle the biggest challenges that come along."
"According to what we've learned, today's outage across several news outlets was a result of a misconfiguration," said Andy Champagne, Akamai Technologies' SVP and chief technology officer of Akamai Labs. "This means that there could have been an error in a file or something as simple as a typo made by someone managing the system.
"It is also our understanding that people were getting errors returned very quickly, which is an indicator of a service being unavailable, versus a cyberattack. In an attack it usually takes some time for the consumer to see an error.
"What people experienced today is just another reminder of how the internet is a lifeline for consumers and for businesses, and we have come to count on it being reliable and available to us when we need it."
Marcus Thompson, AM, PhD, retired Army officer and former Head of Information Warfare for the Australian Defence Force, wanted to localise the impact of this outage, noting that, "The Fastly outage demonstrates, yet again, the importance of digital sovereignty in Australia. This was a technical outage, rather than a cyber-attack, yet the effect on Australian businesses and people was the same. It calls into question our dependence on foreign service providers.
"We need to look closer to home for how we connect to the digital world around us. Australia has some of the greatest data and security skills in the world - the cost of not utilising that in terms of security and economic value is staggering.
"The Government's Security of Critical Infrastructure (SOCI) legislation couldn't be more timely and important to drive this - and bring our data - home."
In a similar vein, Adam Cassar, Co-founder of Peakhour.io, says, "A global issue shows some shared component failed resulting in Fastly not being able to process requests, likely affecting their ability to connect to client Origin servers. Fastly tweeted that it was a 'configuration issue' (as we noted above). After the issue was resolved Fastly say clients may experience a 'lower CHR' [cache hit ratio].
"What does this imply? Varnish achieves its performance through in-memory caching. A lower CHR would mean that Varnish was restarted, losing that cache. A configuration error means that a configuration change was enacted that resulted in a global outage. We can surmise from this that there are shared components in the Fastly caching network and that the configuration change was enacted globally without sufficient testing."
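For readers who want to see why a restart would lower the cache hit ratio, the following Python sketch replays the same traffic against a warm cache and an empty one. It is a simplified illustration under our own assumptions: a plain Python set stands in for an in-memory cache, the page list is invented, and none of it reflects Varnish's or Fastly's actual internals.

```python
def cache_hit_ratio(requests, cache):
    """Replay requests against an in-memory cache and return hits / total."""
    hits = 0
    for path in requests:
        if path in cache:
            hits += 1
        else:
            cache.add(path)  # a miss: fetch from the origin and remember it
    return hits / len(requests)


pages = ["/", "/news", "/sport", "/tech"]  # hypothetical site
traffic = pages * 25                       # 100 requests over four pages

warm_cache = set(pages)  # long-running server: everything already in memory
cold_cache = set()       # freshly restarted server: memory is empty

warm = cache_hit_ratio(traffic, warm_cache)
cold = cache_hit_ratio(traffic, cold_cache)
print(warm, cold)
```

The restarted server has to go back to the origin for every page it has not yet seen again, so its ratio sits below the warm server's until the cache refills.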
Finally, according to Associate Professor Carsten Rudolph, Department of Software Systems & Cybersecurity, Faculty of Information Technology, Monash University, "During last night's outage, which impacted websites like The Age, Sydney Morning Herald, New York Times, Amazon and Gov.uk, Fastly claimed that the 'network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime'. While automatic failover is not easy, if there is a major issue, the remaining nodes might receive a very high load and either become very slow or completely fail.
"Today we learnt that the outage was due to a misconfiguration of Fastly's 'points of presence' (POPs). These are servers distributed all over the world and once the issue was identified, it was relatively easy to fix.
"Moving from centralised solutions to distributed architectures that use a world-wide network of POPs can improve speed of delivery and potentially its reliability. However, the example of the Fastly outage shows that small errors can not only disrupt centralised services, but also these distributed solutions.
"These types of reliability issues can potentially result in financial losses and point to the need for a proper risk analysis. Businesses need to understand exactly what services and infrastructures they rely on. Even if these services promise high stability and redundancy, it is always possible that one or even several could fail and businesses need to plan for these outages and have contingency actions in place, if the risk becomes too high."
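One form such a contingency could take is an ordered list of delivery paths, trying the next one when the first fails. The Python sketch below shows the idea only; the provider functions are hypothetical stand-ins, and in practice this kind of switch is more often made through DNS or a load balancer than in application code.

```python
def fetch_with_fallback(path, providers):
    """Try each delivery path in order and return the first that succeeds.

    `providers` is a list of (name, fetch_function) pairs; each function
    either returns the page body or raises an exception.
    """
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(path)
        except Exception as exc:  # broad on purpose: any failure moves us on
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all delivery paths failed: " + "; ".join(errors))


def primary_cdn(path):
    # Simulates the kind of error readers saw during the outage.
    raise ConnectionError("503 Service Unavailable")


def secondary_cdn(path):
    return f"content for {path} via secondary CDN"


def origin_server(path):
    return f"content for {path} direct from origin"


used, body = fetch_with_fallback(
    "/",
    [("primary", primary_cdn), ("secondary", secondary_cdn), ("origin", origin_server)],
)
print(used)  # secondary
```

The order of the list is itself a business decision: falling back to the origin protects availability, but only if the origin can carry the load the CDN normally absorbs.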
As usual, the messages are plain for all to see. For providers, make sure you test all changes before they're rolled out, and for customers, make sure you have multiple ways to deliver your internet content.
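For the provider side of that advice, a staged rollout is one common way to test a change before it reaches every location. The Python sketch below is our own illustration, not a description of how Fastly deploys configuration: the function names, the stage sizes and the health check are all assumptions.

```python
def staged_rollout(pops, apply_config, is_healthy, stages=(0.01, 0.1, 1.0)):
    """Apply a change to a growing share of POPs, stopping at the first failure.

    Returns (succeeded, updated_pops). On failure the caller is expected
    to roll back the POPs listed in updated_pops.
    """
    updated = []
    for share in stages:
        target = max(1, int(len(pops) * share))
        for pop in pops[len(updated):target]:
            apply_config(pop)
            updated.append(pop)
            if not is_healthy(pop):
                return False, updated
    return True, updated


pops = [f"pop-{i}" for i in range(100)]  # hypothetical POP names
applied = []

# A faulty change: every POP that receives it reports unhealthy.
ok, touched = staged_rollout(pops, applied.append, lambda pop: False)
print(ok, len(touched))  # False 1
```

With a faulty change, the sketch stops after a single POP and leaves the other 99 untouched, which is the difference between a local blip and a global outage.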