Monday, 11 January 2021 13:16

How GitHub revamped its on-call strategy for over 50 engineering teams

By David M Williams

Global open-source software host GitHub serves 56 million users and is the custodian of billions of lines of open-source code. With such a product, on-call is part of life, but it doesn’t have to come at the expense of work-life balance, nor does it mean accepting technical debt.

Whether we're performing routine maintenance out-of-hours, or we're in the midst of a global pandemic, robust on-call is essential for any technical organisation.

Like Google, Netflix and many other companies - small organisations as well as planet-size ones - GitHub found that as its product offerings expanded in number and complexity, it became critical to evolve its on-call strategy to keep pace with its 56 million, and growing, users.

GitHub previously had a monolithic on-call structure but transformed it so that each of its 50+ engineering teams is responsible for the code it maintains. This included assigning ownership of 16,000 files from the monolithic Ruby on Rails codebase to those teams, while overcoming educational hurdles and addressing work-life balance.

This was not purely a logistical effort; Mary Moore Simmons, Director of Engineering at GitHub, also had to address various cultural and educational hurdles, including the arrival of COVID-19, cultivating a blameless culture, creating training, dealing with differing levels of criticality, and focusing on long-term success.

She detailed the adventure in a recent blog post, distilling her learnings and challenges for the benefit of IT professionals globally. In fact, putting aside GitHub’s size for a moment, the problems GitHub experienced are not so dissimilar to those of any other organisation.

Previously, GitHub's monolith spanned a huge number of products and features. Most engineers did not have enough familiarity with great swathes of the codebase to feel confident when responding to on-call incidents. This meant frequent escalations to another team, leaving the on-call engineer feeling more like a switchboard operator than a member of a team.

Additionally, the on-call rotation was large, with each shift lasting 24 hours. Consequently, engineers were only actually on-call about four times per year, and many never gained the context they needed to build that confidence.

Compounding the situation, the monitoring and documentation were not well-maintained because the on-call rotation was spread so thinly and engineers only had to deal with it for 24 hours at a time. Without determined effort, the result was noisy alerts and poor runbooks.

Then, because most engineers weren't confident with the monolithic on-call shift, the same small group of people who knew the platform best were involved in every production incident, causing an imbalance in on-call responsibilities and taking their time away from any other project they were involved in.

Simmons knew something had to be done. The first step was logistical: assigning ownership of specific code to specific engineering teams. With a 16,000+ file codebase in one monolith, this was no small effort. To achieve it, GitHub rolled out a new system to associate files with services, and services with teams. For example, components of the API belong to the apps team while the permissions model belongs to the authorisation team.

This detail is now pulled into an internal Service Catalog, so any GitHub staff member can identify unambiguously which engineering team owns which service.
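To make the files-to-services-to-teams idea concrete, here is a minimal sketch in Python. It is not GitHub's actual system: the path patterns, the service names and the `owning_team` helper are all hypothetical, and only the two team assignments (API to the apps team, permissions to the authorisation team) come from the example above.

```python
from fnmatch import fnmatch

# Hypothetical path patterns mapped to services (not GitHub's real layout).
FILE_TO_SERVICE = {
    "app/api/*": "api",
    "app/models/permissions/*": "permissions",
}

# Services mapped to the teams named in the article's example.
SERVICE_TO_TEAM = {
    "api": "apps",
    "permissions": "authorisation",
}


def owning_team(path):
    """Return the team that owns `path`, or None if no pattern matches."""
    for pattern, service in FILE_TO_SERVICE.items():
        if fnmatch(path, pattern):
            return SERVICE_TO_TEAM.get(service)
    return None
```

Keeping the two mappings separate means a service can move between teams without anyone touching the per-file patterns, which is one plausible reason to route ownership through services rather than assigning files to teams directly.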

To ensure compliance, a new lint rule was added that prevented any code being updated in, or added to, the monolith without the ownership information being supplied.
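A lint rule of this kind could look roughly like the following Python sketch. The article does not describe how GitHub implemented its rule, so the ownership patterns, the function names and the exit-code convention here are assumptions made purely for illustration.

```python
from fnmatch import fnmatch

# Hypothetical patterns that already carry ownership information.
OWNED_PATTERNS = ["app/api/*", "app/models/permissions/*"]


def unowned_files(changed_paths, owned_patterns=OWNED_PATTERNS):
    """Return the changed paths that match no ownership pattern."""
    return [
        path
        for path in changed_paths
        if not any(fnmatch(path, pattern) for pattern in owned_patterns)
    ]


def lint(changed_paths):
    """Report unowned files; return 1 (fail) if any exist, else 0 (pass)."""
    missing = unowned_files(changed_paths)
    for path in missing:
        print(f"ownership missing: {path}")
    return 1 if missing else 0
```

In a continuous integration job, the return value of `lint` would typically become the process exit code, so a change touching an unowned file fails the build until ownership is declared.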

Monitoring and alerting were split up so that each team set up monitoring for its own area only, and alerts that were no longer needed were ultimately decommissioned entirely.

Nicely, to help the many diverse teams know what they needed to do, GitHub “ate its own dog food,” creating GitHub issues for every team with clear checklists.

It wasn't all smooth sailing. In fact, GitHub began this journey back in 2019, some seven months before COVID-19 was announced as a global pandemic. The added stress of a pandemic magnified anxieties and necessitated changing the project management to a higher-touch, empathy-first approach.

Many engineers had never been on-call in the past and lacked experience with operational best practice. To cater for this, Simmons and her team designed and delivered training, created significant tooling and documentation, and opened Slack channels where anyone could ask for help.

Reasonably, some engineers were anxious about the impact on-call would have on their lives. How would they respond to a page within minutes while attempting to do everyday tasks like grocery shopping? Simmons and her team worked with engineers to understand their concerns, documented tips and tricks from experienced on-call engineers, and worked with people one-on-one where needed. They also reinforced that team members are there to support each other; someone could take over on-call for a couple of hours if a colleague needed to go for a run or handle childcare, for example. In fact, GitHub’s global presence meant teams could lean on colleagues in other parts of the world to cover on-call during their own ordinary working hours.

An essential message from Simmons and her team was that of a blameless culture. They found that another source of anxiety was engineers' concern about letting their team down while working on-call. As an organisation, GitHub reinforced that mistakes are OK and outages happen, and that people who bravely work on something they’re not familiar with when on-call ought to be celebrated.

Each engineering team has a different level of criticality; some need resolution within minutes, while others can wait until the next business day. Some engineers were concerned this created an unfair balance, but GitHub sees this as a self-resolving problem. Some engineers want to work on more business-critical and technically complex systems, while others have different interests, and thus each engineer will naturally select teams with the operational rigour they identify most strongly with.

Importantly - and a lesson for any organisation - the whole on-call experience needed to feed back into itself to make the overall experience better. When not responding to pages, the person on-call ought to be actively updating runbooks, tuning noisy alerts, scripting or automating on-call tasks, and fixing the underlying technical debt.

By the end of this journey, Simmons notes, GitHub's incident resolution time had improved, but the journey is not over. Nor will it ever be; organisations need to constantly improve their best practices, and the cultural changes Simmons and her team identified need continual promotion.

They also need continual feedback, and to that end Simmons states they will regularly survey engineers about their on-call experience to always be learning and improving in their drive for excellence, to continue to be the trusted home for all developers.

 


David M Williams

David has been computing since 1984 where he instantly gravitated to the family Commodore 64. He completed a Bachelor of Computer Science degree from 1990 to 1992, commencing full-time employment as a systems analyst at the end of that year. David subsequently worked as a UNIX Systems Manager, Asia-Pacific technical specialist for an international software company, Business Analyst, IT Manager, and other roles. David has been the Chief Information Officer for national public companies since 2007, delivering IT knowledge and business acumen, seeking to transform the industries within which he works. David is also involved in the user group community, the Australian Computer Society technical advisory boards, and education.
