Wednesday, 05 June 2019 08:28

Google says wrongly applied configuration change caused outage

By Sam Varghese

Image by 200 Degrees from Pixabay

The outage that Google experienced on Monday AEST was caused by a configuration change that was pushed out to more servers than intended, the company says in a blog post.

Engineering vice-president Benjamin Sloss wrote on Tuesday US time that the configuration change was meant to be pushed out to a small number of servers in a single region, but was incorrectly sent to a larger number across several neighbouring regions.
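Google has not said how the change escaped its intended scope or how it plans to guard against a repeat. Purely as an illustration of this class of error, a pre-push check might compare the target list with the intended scope before anything is sent. The function name, server names and region labels below are all hypothetical.

```python
def scope_violations(targets, intended_region, max_servers):
    """List the ways a rollout's target list exceeds its intended scope.

    targets is a list of (server, region) pairs. An empty result means
    the push stays within the stated region and server limit.
    """
    problems = []
    # Regions that appear in the target list but were never intended.
    stray = sorted({region for _, region in targets if region != intended_region})
    if stray:
        problems.append(f"targets outside {intended_region}: {', '.join(stray)}")
    # More servers than the change was meant to touch.
    if len(targets) > max_servers:
        problems.append(f"{len(targets)} targets exceeds limit of {max_servers}")
    return problems


# Hypothetical data: a change meant for at most two servers in region-a.
intended = [("srv-1", "region-a"), ("srv-2", "region-a")]
too_wide = intended + [("srv-3", "region-b"), ("srv-4", "region-c")]

print(scope_violations(intended, "region-a", 2))  # []
print(scope_violations(too_wide, "region-a", 2))
```

A push would go ahead only when the returned list is empty.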

This caused those regions to stop using more than half of their available network capacity.

Sunday's incident affected multiple services in Google Cloud, G Suite and YouTube.

A little more than 10 years ago, Google had a major hiccup due to a configuration snafu, when it reported every search result as being from a site that was infested with malware.

As iTWire's David Williams explained at the time, "It seems one poor Google tech – who I suspect now has to bring in donuts for the rest of the office – accidentally added a single '/' to the list of bad sites, on a line all by its lonesome self.

"When Google's search results check for matches the solitary slash actually rings true for everything – every URL with a slash in it registered as malware according to this search term. The solution wasn't to reboot anything, it was to take out that little one liner from the bad sites register."
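The effect Williams describes can be reproduced in a few lines. This is a minimal sketch that assumes plain substring matching, which may not be how Google's checker actually worked; the function name, blocklist entries and URLs are hypothetical.

```python
def is_flagged(url, bad_patterns):
    """Return True if any pattern occurs anywhere in the URL (naive substring match)."""
    return any(pattern in url for pattern in bad_patterns)


# Hypothetical blocklist entries, for illustration only.
blocklist = ["badsite.example/malware"]
broken_blocklist = blocklist + ["/"]  # the stray one-character line

urls = ["https://www.example.com/", "https://news.example.org/story"]

print([is_flagged(u, blocklist) for u in urls])         # [False, False]
print([is_flagged(u, broken_blocklist) for u in urls])  # [True, True]
```

Because every full URL contains at least one slash, the single extra entry flags everything, and removing that line restores normal behaviour.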

Of Sunday's screw-up, Sloss wrote: "The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not.

"The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam."
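The triage Sloss describes can be pictured with a toy model. The sketch below is not Google's networking code: it simply assumes each flow has a size and a latency-sensitivity flag, and admits latency-sensitive and smaller flows first until the reduced capacity runs out. The flow names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    size: int  # hypothetical bandwidth units
    latency_sensitive: bool


def triage(flows, capacity):
    """Admit latency-sensitive and smaller flows first; drop whatever no longer fits."""
    admitted, dropped = [], []
    remaining = capacity
    # Latency-sensitive flows sort ahead of the rest; within each group, smaller first.
    for flow in sorted(flows, key=lambda f: (not f.latency_sensitive, f.size)):
        if flow.size <= remaining:
            admitted.append(flow.name)
            remaining -= flow.size
        else:
            dropped.append(flow.name)
    return admitted, dropped


flows = [
    Flow("bulk-backup", 60, False),
    Flow("video-call", 5, True),
    Flow("search-query", 1, True),
    Flow("batch-export", 40, False),
]

# Capacity cut from 110 units to 50, mirroring a loss of more than half.
print(triage(flows, 50))
# (['search-query', 'video-call', 'batch-export'], ['bulk-backup'])
```

In this model the large, non-urgent flow is the first casualty, which matches the behaviour Sloss outlines.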

He said the issue was detected "within seconds" but took much longer to fix than the target of a few minutes.

"Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage," Sloss said.

"The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelise restoration efforts."

As to the impact, Sloss said YouTube had seen a 2.5% drop in views for an hour, while Google Cloud Storage showed a 30% reduction in traffic. About 1% of Gmail users had experienced issues.

"With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration," he said.

"We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event."

But his statement that the issue is over may be a little premature. The Google Cloud status page carries the following notice at the end: "We're investigating an issue with Google Compute Engine Persistent Disk in us-east4-b and us-east4-c. Affected customers may observe IO errors on Persistent Disks attached to instances and/or may fail to create PD snapshots in us-east4-b and us-east4-c.

"The issue should be resolved for majority of users and we expect a full resolution in the near future. We're waiting on our final changes to propagate. We will provide another status update by Tuesday, 2019-06-04 17:10 US/Pacific with current details." That update should be out at 10.10am AEST if it is on time.

Sam Varghese

Sam Varghese has been writing for iTWire since 2006, a year after the site came into existence. For nearly a decade thereafter, he wrote mostly about free and open source software, based on his own use of this genre of software. Since May 2016, he has been writing across many areas of technology. He has been a journalist for nearly 40 years in India (Indian Express and Deccan Herald), the UAE (Khaleej Times) and Australia (Daily Commercial News (now defunct) and The Age). His personal blog is titled Irregular Expression.
