Netflix’s way: innovating resilience

From Scott Baldwin – Head of Enterprise Resilience at Netflix

A common statement I hear from new Netflix employees is ‘The culture is real!’ By this, they refer to Netflix’s famous culture of Freedom & Responsibility, which enables employees to do whatever they feel is the right thing to do, and to take creative risks with the company’s full support behind them. The other side of this, however, is the intentional lack of any prescriptive direction, which means a traditional approach to resilience planning (mandates, requirements, etc.) will not work.

While this attitude is still very much the rule, COVID-19 did highlight the need for some centralized capabilities from a resilience perspective. In August 2020, I received the opportunity to create Netflix’s first enterprise-wide resilience program. My only requirements were to:

  • make sure we increased resilience.
  • make sure it fit within the Netflix culture.

Everything else was up to me.

The program I designed has three primary capabilities:

  • Resilience Risk Scoring – Create standardized, intuitive, and useful metrics that allow our risk owners to understand their current resilience risk within the context of the entire organization.
  • Dependency mapping and predictive impact modeling – Using this score, develop a ‘virtual ecosystem’ where we can clearly measure the interdependencies of our assets, and proactively and predictively identify weak links in our dependency chains. Think Chaos Engineering in a virtual world, but not exclusive to technology.
  • Resilience as a Service (RaaS) – While many risk owners will simply need context to improve their resilience capabilities, some will need assistance and expertise to reduce their resilience risk. For this, RaaS will act as a consultancy and provide any risk mitigation activities the risk owners request.

The Resilience Risk Score

For this article, I will only focus on the first capability, the Resilience Risk Scoring. Another maxim we use at Netflix is ‘Context, not Control’, or in other words, simply provide the information needed for employees to make informed decisions. My solution was to provide standardized, quantified resilience risk scores that each department could use to determine their ongoing investment in resilience.

The first part of this is to inventory each department’s portfolio of ‘Risk Assets’. Risk Assets are anything that is:

  • Critical to the ongoing operations of Netflix
  • Owned by the department from a resilience perspective

Risk Assets can include applications, processes, vendor relationships, studio productions, physical locations, etc. Once we identify Risk Assets, we assign each one a Criticality Score. The Criticality Score ranges from 0 to 10 and relates inversely to the asset’s recovery time objective (RTO), so that a Criticality Score of ‘0’ means the Risk Asset can suffer a disruption for weeks without any impact on the company, and a score of ‘10’ means it must be available 24/7. This is our way of conducting the business impact analysis (BIA).

The next step is to identify the current level of resilience for each asset, which we call the Resilience Capability Score. Again, this score ranges from 0 to 10, where ‘0’ represents no current resilience capability and ‘10’ represents the maximum, and it is made up of ‘Availability’, ‘Response’, and ‘Recovery’. Because ‘Risk Asset’ is a generic term for many different types of activities and entities, each Risk Asset type has its own set of metrics. The questions we ask to determine the resilience of a cloud-based application are very different from the ones we ask to determine a building’s readiness.

However, the standardized approach allows us to compare the resilience of these different asset types on equal terms. For instance, from a risk perspective, a score of ‘5’ means the same thing whether we’re talking about a vendor or a business process.

The final piece is the Resilience Risk Score, which is simply the difference between the Criticality and Capability scores (Figure 1). In other words, it is the gap between how resilient the asset should be and how resilient it actually is. Because this gap is expressed as a single number* (-10 to +10), our business partners can quickly and intuitively understand their resilience status and take the actions they feel are appropriate.
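To make the arithmetic concrete, here is a minimal sketch in Python. It is an illustration only, not the program's actual tooling: the `RiskAsset` class, the `describe` helper, and the example asset are hypothetical. It also assumes the score is computed as Capability minus Criticality, which is the sign convention implied by the footnote (a score above 0 means over-resilient).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskAsset:
    """Hypothetical record for one Risk Asset; both scores run from 0 to 10."""
    name: str
    criticality: float  # how resilient the asset should be
    capability: float   # how resilient the asset currently is

    def __post_init__(self):
        # Reject scores outside the 0-10 range described in the article.
        for label, value in (("criticality", self.criticality),
                             ("capability", self.capability)):
            if not 0 <= value <= 10:
                raise ValueError(f"{label} must be between 0 and 10, got {value}")

    @property
    def risk_score(self) -> float:
        # Assumed sign convention: capability minus criticality, so a
        # negative score is a shortfall and a positive score is a surplus.
        return self.capability - self.criticality


def describe(asset: RiskAsset) -> str:
    """Turn the numeric gap into a one-line, plain-language status."""
    score = asset.risk_score
    if score < 0:
        return f"{asset.name}: under-resilient by {-score:g}"
    if score > 0:
        return f"{asset.name}: over-resilient by {score:g}"
    return f"{asset.name}: capability matches criticality"


# Illustrative asset, not real data.
print(describe(RiskAsset("payment vendor", criticality=8, capability=5)))
```

Because both inputs lie between 0 and 10, the result always falls between -10 and +10, which matches the range quoted above.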

At the end of the day, these scores not only help to inform the risk owners of how they should be investing in resilience; they are also the foundation for the rest of the program, i.e., dependency mapping and Resilience as a Service. Having unified scores for each asset allows easy and intuitive identification of weak links within a dependency chain, as well as a methodology to assign inherited risks, predict likely failures, and understand aggregate impacts.
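As a rough illustration of how unified scores make weak links easy to spot, the sketch below walks a small dependency map and returns the asset with the lowest score in the chain. Everything here is an assumption made for the example: the asset names, the scores, the `depends_on` structure, and the rule that an asset inherits the worst score among everything it depends on. Lower scores are treated as higher risk, following the footnote's convention that a score above 0 is over-resilient.

```python
def transitive_dependencies(asset, depends_on):
    """Collect every asset reachable from `asset` by following dependencies."""
    seen = set()
    stack = list(depends_on.get(asset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    return seen


def weakest_link(asset, depends_on, risk_scores):
    """Return the asset in the chain (including `asset`) with the lowest score."""
    chain = {asset} | transitive_dependencies(asset, depends_on)
    # Sort first so that ties are broken deterministically by name.
    return min(sorted(chain), key=lambda name: risk_scores[name])


def inherited_risk(asset, depends_on, risk_scores):
    """Assumed rule: an asset is only as resilient as its weakest dependency."""
    return risk_scores[weakest_link(asset, depends_on, risk_scores)]


# Hypothetical assets and scores, for illustration only.
risk_scores = {
    "streaming-app": 1,
    "billing-process": -1,
    "payment-vendor": -4,
    "office-site": 0,
}
depends_on = {
    "streaming-app": ["billing-process"],
    "billing-process": ["payment-vendor"],
}

print(weakest_link("streaming-app", depends_on, risk_scores))
print(inherited_risk("streaming-app", depends_on, risk_scores))
```

In this example the application looks healthy on its own score, but the chain it sits on is only as strong as the vendor two steps away, which is the kind of hidden exposure dependency mapping is meant to surface.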

Additionally, where departments request it, we can carry out specific business continuity management (BCM) activities that result in a demonstrable improvement of their risk profile.

While this is a new program, and will doubtless change over time, the initial response to it has been very gratifying. We have been able to communicate resilience risk in a way that is meaningful and gives business owners practical steps to take, all while respecting Netflix’s culture of ‘Freedom & Responsibility’.

*It is possible that an asset is over-resilient, designated by a score of >0.