AMA (ASK ME ANYTHING) LIVE

Chaos Engineering + DiRT

Breaking systems to build resiliency

Ask the leaders of Netflix’s Chaos Engineering and Google’s Disaster Recovery Testing (DiRT) exercises anything about cloud, distributed systems, and controlled chaos.

View On-Demand Video


Who's on the panel?

Bruce Wong

Senior Engineering Manager

Bruce is a technology leader at Twilio, a Cloud Communications Platform that enables developers to integrate rich communication into products. Prior to Twilio, Bruce founded Chaos Engineering at Netflix, a team that set out to change the way engineering thought about how to own and operate large-scale, resilient systems.

Casey Rosenthal

Engineering Manager (Traffic & Chaos Team)

Casey is currently the Engineering Manager for the Traffic Team and the Chaos Team at Netflix. He finds opportunities to leverage his experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike.

John Welsh

Cloud Infrastructure and SRE Disaster Recovery

John is a program manager for Cloud Infrastructure and Site Reliability Engineering Disaster Recovery at Google. John’s work includes building the world’s most reliable Cloud everywhere quickly and causing controlled disasters (and recovering from them safely) during Google’s DiRT (Disaster Recovery Testing) week.

Robert Castley

Senior Performance Engineer (Moderator)

Robert is a Web Performance Specialist, Web Developer, and all-round good-looking geek! He's been optimizing web page performance for years and specializes in CSS, JavaScript, HTML5, XML, Web Design/UX, and Performance.

AMA Transcript

Robert:

Hello, everyone, and welcome to Catchpoint's fourth live Ask Me Anything on Chaos Engineering and DiRT, presented in partnership with O'Reilly Media. My name is Robert Castley, and I will be your moderator today.

A brief bit about me: I'm about to celebrate my third year here at Catchpoint, and my little claim to fame in life is that I developed a content management system called Mambo, which now lives on as Joomla.

Let's start by introducing today's panelists. First, we have Bruce Wong, who is a senior engineering manager at Twilio, a Cloud Communications Platform that enables developers to integrate rich communication into their products. Prior to Twilio, Bruce was instrumental in the development of Chaos Engineering at Netflix, a team that set out to change the way engineering thought about how to own and operate large-scale, resilient systems.

A fun fact about Bruce: he applies engineering principles to his approach to cooking barbecue and chocolate chip cookies. Honored to have you on the panel today, Bruce.


Bruce:

Thanks, Robert. Yeah, I'm up to my revision 27 on chocolate chip cookies.


Robert:

Next we have Casey Rosenthal, who's currently the engineering manager of the Traffic and Chaos Teams at Netflix. Casey finds opportunities to leverage his experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike.

A fun "alternative" fact about Casey is, in 1883, he was a helium miner and crashed out before the insulation and Lighter-Than-Air Industries moved to vacuum pumps. You don't look that old, Casey. Thrilled to have you onboard, anyway.


Casey:

Thanks for saying that with a straight face!


Robert:

It's my job. Last, but certainly not least, we have John Welsh, Program Manager for Cloud Infrastructure and Site Reliability Engineering Disaster Recovery at Google. John's work includes building the world's most reliable Cloud everywhere quickly and causing controlled disasters during Google's Disaster Recovery Testing, or DiRT, week.

A fun fact about John, he's driven across the country several times and visited 48 out of the 50 states, only missing Florida and Louisiana. How have you never been to Florida?


John:

Hi, everybody. I don't know, we just missed it!


Robert:

Thanks for joining, we appreciate it. We do have some fantastic user-submitted questions, and we only have an hour, so let's dive right in.

First question, how did chaos engineering come to be, and where is it now? Bruce, let's kick off with you, and pass off to Casey to share how it's being practiced today at Netflix.

Bruce:

Thanks, Robert. The origins of chaos engineering really date back further than even the term “chaos engineering.” Chaos Monkey was developed in about 2009 by a guy named Greg Orzell. Greg was one of the reasons why I ended up at Netflix; he recruited me. I got to Netflix around 2010 and quickly got acquainted with the likes of Chaos Monkey, seeing all my servers go away. I wore multiple hats and played multiple roles at Netflix, but Chaos Monkey was always there, being spread throughout engineering internally. Years later, as Netflix grew and the complexity scaled, we found that while Chaos Monkey was very successful and very key to a lot of our early success, it wasn't enough.

That's where, multiple teams and different roles later, I set out to put a team and a charter around this notion of failure injection and created a team around what we now refer to as “chaos engineering.” We set out to level up the abilities of failure injection and take this on as a discipline and a practice across engineering at Netflix, and definitely outside of Netflix as well, delivering things like Chaos Kong and a lot of the work that Casey, who took over the team after I left, now talks about and does.

That being said, I can hand it off to Casey, and he can tell you a little bit about where things have gone since then.


Casey:

Sure. Over the last two years or so, chaos engineering at Netflix ... When I started working with it, if you asked, "What is chaos engineering?" the response you'd get a lot was, "Oh, that's when you break things in production." I can find a lot of people who can break things in production who aren't adding any value to Netflix's bottom line. We formalized a discipline around it. We published principlesofchaos.org, and then we found like-minded people within the industry, and now outside of the industry, to build the community.

Chaos Community Day is now a small conference around chaos engineering. There is a Chaos Maturity Model that sets out a methodology for organizations to adopt chaos. There are chaos engineering tracks at conferences: the upcoming Velocity in San Jose and QCon in New York City. It's definitely gone from something that Netflix originally found value in to something, specifically the term "chaos engineering," that's hired for across the tech industry, and now also outside of the tech industry.

Of course, DiRT is very much an ideological sibling of what chaos engineering aims to achieve.


Robert:
Awesome, thank you. John, what is Google DiRT? Where did it start, and how is it different from chaos engineering?

John:

Sure, yeah. As Casey was alluding to, there are a lot of similarities between the DiRT disaster recovery testing at Google and chaos engineering over at Netflix. In fact, some of the conferences that Casey mentioned, we have attended and enjoyed participating in. I really like that community aspect of it.

Where DiRT got started: Ben Treynor Sloss, who's a VP of engineering, helped coin the phrase site reliability engineering, SRE, within Google, and now it's becoming an industry term. You can hire SREs in the same way that Casey's advocating hiring chaos engineers. He coined that. We started with reliability. What are the first tenets of reliability engineering?

Similar to the maturity matrix that Casey mentioned, we have a site reliability engineering maturity matrix, where you take a product from its inception through to a very well-honed service where you've got metrics for everything and you can forecast your capacity needs. It's a very stable service, for example. Then, from that came this idea: what if we start to inject some faults? What if we inject some disruption into those very stable systems? Can we learn things that we might not learn otherwise? Can we learn things about failure modes before they actually occur in production?

The whole premise here is, if we break it in a controlled and careful manner and we make sure we can recover from it, then, when it happens unexpectedly, we're in a better position: the humans are trained and have good incident response, and systems can self-heal, auto-recover, continue to operate in a degraded state, and still maintain their service levels, and things like that.

DiRT's been around somewhere between 9 and 10 years now. I've been at Google for 10 years, and I joined the DiRT team about five years ago in 2012. I've been a part of it for a while. We've been working to grow it. It started within the engineering groups, and then, we've actually expanded it company-wide. We have exercises that involve everything from a theoretical paper-type test, like a tabletop exercise, to very practical testing where we're causing issues within the production environment, within staging and test environments. We've even expanded to different areas, like legal and finance and facilities and things like that, physical security, info security, etc. What is in scope for DiRT is actually everything within the company.

Then, we've even started more recently to move more into Cloud because it's obviously a big focus for the company. We've even done some limited testing with Cloud customers. That's the scope, and that's a little bit of the evolution. Along the way, we've developed better metrics, we've grown the program, we've tried to automate, where we can, and things like that. It's, I think, a mature program in the sense it's been around for a while, but it still has a ways to go.


Robert:
That's brilliant. Thank you very much, John. Casey, we have integration tests, so why do we need chaos engineering?

Casey:

That's one of my favorite questions. Let me think about that for a second. We view chaos engineering as a form of experimentation, so we draw a distinction between testing and experimentation. In some circles, QA has a bad connotation, but if you leave that aside, testing and chaos engineering both live in that space of QA, quality assurance, where we're building confidence in a system or product. That's where they live. Testing, strictly speaking, doesn't create new knowledge. It just determines the valence of a known property. In classical testing, given some conditions and a function, you should have this output. Usually, you're determining the binary truthiness of that assertion. Right? Either it's true or false.

Whereas experimentation is a formal method to generate new knowledge. You have a hypothesis, you try doing something, and you see what comes out of it. If your hypothesis is not supported by the data, then that kicks off a form of exploration. The difference there is that, in generating new knowledge, we're looking at things that are much more complicated than you can reasonably expect an engineer to test, because when you're testing something, you have to know what to test. Right? That's a given. You have to make an assertion on a known property.

In chaos engineering, we're saying, "Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has." This form of experimentation helps us uncover new properties that the system has. Definitely, testing is very important. Chaos engineering is not testing; it's complementary.
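
To make that distinction concrete, here is a minimal sketch in Python (hypothetical service and metric names, not Netflix's actual tooling): the test asserts a known property, while the chaos experiment states a hypothesis about steady state and compares an experiment group against a control group.

    # A minimal sketch contrasting a classical test with a chaos-style
    # experiment. All names and numbers are hypothetical illustrations.
    import random

    def add_to_cart(user_id: int, item_id: int) -> str:
        # Stand-in for a real service call.
        return "OK"

    def test_known_property() -> None:
        # Testing: assert a known, expected output for a known input.
        assert add_to_cart(user_id=1, item_id=42) == "OK"

    def playback_success_rate(dependency_degraded: bool) -> float:
        # Stand-in for a steady-state business metric, e.g. the rate of
        # successful playback starts, sampled from control or experiment.
        penalty = 0.02 if dependency_degraded else 0.0
        return 0.995 - penalty + random.uniform(-0.002, 0.002)

    def chaos_experiment() -> None:
        # Experimentation: hypothesize that steady state is unchanged while
        # a mid-tier dependency is degraded, then compare the two groups.
        control = playback_success_rate(dependency_degraded=False)
        experiment = playback_success_rate(dependency_degraded=True)
        if abs(control - experiment) > 0.01:
            print(f"Hypothesis refuted ({control:.3f} vs {experiment:.3f}): new knowledge")
        else:
            print("Hypothesis supported: confidence in the system increased")

    if __name__ == "__main__":
        test_known_property()
        chaos_experiment()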


Bruce:

To go off of what you're saying about knowledge, I started thinking about the difference between knowledge and skills as well. Knowledge is a good start, but for leveling up your teams and training individuals, I've found chaos engineering to be very helpful. For example, every craft has its own journey. A doctor who graduated from med school has a lot of knowledge, but he or she goes through years of training, years of fellowship. We call it a practice. I think in our industry we also have that notion of a craft and a practice. There's knowledge that you have as a base, but you need to build skills upon that base of knowledge.

Outages are really what make engineers a lot better and a lot more honed. Chaos engineering is a good way to accelerate that learning, to not just build knowledge about a system, but also to build skills in responding to such a system.


John:

To comment on that, Bruce, at Google we have this concept of a blameless, post-mortem culture. Every time you have an outage, a real outage for example, rather than have fear around that or distrust around that, we celebrate those situations because we want to document what happened, we want to share that broadly, and we want people to own that so that we can say, "Look, this is what happened to me. Here's how I responded. Here's how I fixed it. Here's how I'm preventing it from happening again," and the next team can take advantage of those lessons.

Through that knowledge sharing and that openness - you're right, they do get more experience because they have these outages, but with that culture they also get to benefit from other people's experience.


Bruce:

Yeah, and I think that sharing knowledge with blameless post-mortems is a great start, and that's a practice we definitely do at Twilio as well. But chaos gives you the ability to actually give other people the experience of that outage in a controlled manner, whether it's loading the right dashboards and understanding what the metrics actually mean, or seeing the different parts of the system and how they respond to that outage situation. Getting to practice when it's 3:00 p.m. and not 3:00 a.m. is pretty advantageous.


Robert:
Cool, thank you guys. John, this is one of my favorite questions. Do you let your on-call or affected engineers know when something is planned before it happens? Or is it a total surprise that they must react to?

John:

Yeah, it's a good question. There's definitely a nuance to this. We've tried to provide some rules. We set some ground rules for our DiRT test, but then, we also have some guidelines around when to communicate, to what degree do you communicate, those kinds of things. It's come up more recently where, as the company matures or the company becomes more tolerant of this disruption, you may be able to communicate less. Or you may find you actually need to communicate more. It's a tricky thing to answer. I think it comes with practice.

A clear example is, I could announce that I'm going to be doing some kind of testing. I could give a window of time. "Sometime next week, you should expect to be disrupted somewhere on the order of two hours. If you see something, say something. Escalate normally." Those kinds of things.

Or you could be very specific. If I'm going to inject faults or cause an issue in a specific technical system, and I'm not actually evaluating human response to that, it doesn't really matter that the humans know that it's happening or when it's happening. You're still going to cause issues with it. Actually, it could be to your advantage, because you have more eyes on the problem. You might actually get more feedback about what's breaking and how it's breaking. If you didn't announce it, you might not get that feedback. Having a strong feedback cycle is really important, and it helps shape and adjust your tests.

All that being said, there's definitely times where an element of surprise is important to the goal of the test. If you're trying to test incident management, not telling your on-call ahead of time that they're going to receive a page as part of an exercise, you'll get a more accurate response from that individual. They won't know it's coming, just like a real page. They'll have to respond, just like a real incident. Even in those cases, one of our rules is everything must be labeled as DiRT. We still want someone to very quickly be able to distinguish, "Is this a real incident or is it a test scenario?" That's so we can prioritize real incidents.

It's also really important as someone moderating these tests. One of the important things that I have to do is distinguish at any moment in time, "Is this a test or is this a real incident?" That lets me make judgment calls. I can make a call: "Okay, we're going to continue with our test as is. I don't think this real incident affects the test." I can pause the test and see if the real incident can resolve itself and we can continue. In some extreme cases, I can revert the test. You have to be careful, because even in those cases, if your first instinct is to revert, you might actually pile onto the real problem and cause more issues than if you were to continue with your test or just pause it.

Communication's really important in that respect, because if people don't know that it's a test ... At the beginning, maybe, they don't know, but once it's underway, if it's not labeled as a test, it's going to create a lot of confusion and it can actually taint your results. It's a judgment call. We do it based on the content of the test, the goal of the test, the risk, and the blast radius of the test.


Robert:
Excellent. Thank you very much. Bruce, when you put a new application live, do you actually plan chaos testing around that app before it's taking any traffic? Or is it expected that reliability standards were followed, and then it will be randomly tested later?

Bruce:

It's a great question. That was one of the things I had to get acquainted with joining Twilio. We were actually building brand new products that had zero customers. If there's ever a great time to break production, it's when you have zero customers. Our staging environment in those cases actually has more traffic than production, because the product isn't launched yet. When we have a private alpha or an open beta, the expectations and tolerance for the service being unavailable are a lot higher, so we can be a lot more aggressive with our chaos testing.

That said, I think there are definitely engineering standards that we have. We call it our Operational Maturity Model, similar to the Chaos Maturity Model Casey mentioned. We have that across many, many different dimensions, and we actually make sure that every team follows those things before they make products generally available. Chaos has an entire portion of that maturity model: not only are we resilient, but are we actually validating that our designs and implementation of that resilience actually work?

Yeah, I think there are a lot of standards, and we're finding that chaos is actually giving us the ability to iterate on other aspects, not just resilience. Telemetry or insight or monitoring is a good example. I read a lot of RCAs or RFOs, the root cause analysis documents put out by different companies when there's an outage. One of the themes that I've noticed is that, oftentimes, there are either missing alerts or missing telemetry and you need to add more.

Chaos engineering has actually allowed us to iterate on our telemetry before we even launch: we cause a failure that we know can and will happen. Did our monitors and alerts catch it? Do we have enough telemetry around that particular scenario to respond appropriately? We can iterate on that in isolation before we even get to a launch.
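
As a rough illustration of that loop (hypothetical names, not Twilio's actual tooling), a pre-launch drill can be reduced to: inject a failure you know can happen, then verify that an alert actually fired for it.

    # Sketch of a pre-launch telemetry drill in a staging environment:
    # inject a known failure, then check that monitoring caught it.
    # Service and alert names are hypothetical illustrations.
    import time

    ALERTS_FIRED: list[str] = []  # stand-in for the alerting system

    def inject_failure(target: str, mode: str) -> None:
        print(f"Injecting {mode} into {target} (staging)")
        # A real drill would block a dependency, kill a host, etc.; here we
        # simply pretend monitoring noticed and raised an alert.
        ALERTS_FIRED.append(f"{target}:{mode}:error_rate_high")

    def alert_fired(expected: str, timeout_s: float = 5.0) -> bool:
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            if expected in ALERTS_FIRED:
                return True
            time.sleep(0.1)
        return False

    if __name__ == "__main__":
        inject_failure(target="sms-gateway", mode="dependency_timeout")
        if alert_fired("sms-gateway:dependency_timeout:error_rate_high"):
            print("Drill passed: telemetry caught the failure before launch")
        else:
            print("Gap found: add the missing alert or telemetry and re-run")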


Robert:
Understood, thank you. Casey, this is also another really good question. How often do your exercises negatively affect your users and subscribers? Does it ever become an issue for your team?

Casey:

Oh, never. Yeah, so, occasionally. I think the key to having a good practice is that we never run ... I mentioned this in the webinar chat. We never run a chaos experiment if we expect that it's going to negatively impact our customers. The point of the whole discipline is to build confidence in the system, so if we know that there's going to be a negative impact, then of course we engineer a solution to remove that impact. There's no sense in running an experiment just to verify that we think something's going to go poorly. As long as we have a good expectation that things are going to go well, a chaos experiment, hopefully, most of the time confirms that.

In the cases where it doesn't ... Those are rare, but they do happen. That is a very important signal for us. That signal gives us new knowledge and tells us where to focus our engineering effort to make the system more resilient. We do have automated controls in place to stop some of our more sophisticated chaos experiments if we can automatically get a signal that our customers are being impacted. Aside from that, numerically, we know that if we are causing some small inconvenience to some of our customers, the ROI to the entire customer base is much greater, because of the resiliency that we're adding and the outages we've prevented that could have affected a much larger portion of our customers.

Does that ever cause trouble for our team? I assume that's a reference to perception within the company or management. At Netflix, no, although the culture here is probably pretty unusual for most tech companies. We have very strong support for the chaos practice and a very strong focus on that ultimate ROI.

It's interesting talking to other industries, like banking, where they're like, "Well, we can't do chaos experiments because there's real money on the line." Turns out that banking is one of the industries, now, that is quickly picking up chaos engineering as a practice. I know ING speaks about it. We have anecdotes from other large banks: DB and the large bank in Australia, the name slips my mind. The financial industry is actually looking at this as a very useful, solid practice.

I find that the fear of affecting customers and consumers of a service is well-warranted, but that should not prevent the adoption of the practice. Because if it's well thought out, the ROI of the practice outweighs the inconvenience or the minimal harm that it can do along the way.


Bruce:

I think there's a lot of validation in the investment into a team like Casey's, in the controls and tooling around failure injection and the value of having tooling around that. In some of the earlier iterations of failure injection at Netflix, there was a large amount of customer impact caused by ... I think it was called Latency Monkey. It was the right strategy, but the tooling and our understanding of how to control failure injection weren't as mature as they are today. A lot of the work that Kolton Andrus did around the failure injection testing framework allowed chaos to be a lot more surgical in its precision. That actually allowed the team to build a lot of confidence across the organization.

Like Casey's saying, I think it's about ROI and figuring out where you are: where your organization is at, where your tooling is at, and how resilient your application is. That's how you have to think about where to get started.


John:

If I can build on that… At Google, we think of the equation in terms of minimizing the cost and maximizing the value of what you're trying to learn. Cost can come in the form of financial cost, reputational cost to the company brand, or cost in terms of developer or employee productivity. That's a big one. If you disrupt 30,000 people for an hour, that's a pretty expensive test. What you're trying to get out of that should be highly valued. It's kind of a gut check that we do, where we say, "What is the cost of the test? What are we hoping to learn?" We actually ask the test proctors to document: what is the goal of the test, what is the risk, what is the impact, and what are the mitigations, so that we can help evaluate that equation of minimum cost and maximum value.

Then, just to comment about what Bruce was talking about, culture. At Google, one of the reasons that DiRT is popular and successful is because we have a lot of executive buy-in, and we make it fun. We have a storyline, we inject a bit of play into it, we have a blog story, and we have memes and things like that that go along with the theme of the testing. We have both this, "Okay, I know I'm disrupting you," but I'm trying to take a little bit of the edge off by having a theme or some fun around it. Then, if things really do go crazy, we have air cover from executives who are saying, "No, this is okay. We've reviewed these tests. It's okay. We're going to move forward with these." That's the model that we've built at Google.


Casey:

Yeah, if I could come back to that, too … One of our big current focuses right now in our chaos tooling is what we call “Precision Chaos.” Aaron Blohowiak will be talking about this at Velocity in a couple of months. It's about engineering very small blast radiuses into our experimentation platform. Again, in a lot of cases you can engineer a stronger, statistically relevant signal on a smaller audience. Let's not let engineering stop us from minimizing the impact on customers.


Robert:
Cool, thank you guys. Do you have concrete examples of chaos, i.e., how do you decide/prioritize what to experimentally test? If you know something will fail a priori, do you still run it on production or notify instead?

Casey:

If we know something's going to fail, we notify instead. There's no sense for us to test that in production if we already know it's going to ... That's just going to create headaches for people, and we're not in the business of making people miserable, despite what some people might think. I'll leave that there.


Robert:
I think that's perfect. Thank you. John, you've already alluded to this earlier with your blameless post-mortems. How do you grade the response to a chaos scenario? Are post-mortems performed and shared every time?

John:

Over the years we've developed metrics. It's important that you're measuring, you have a control, and you're adding variables to it. You want to measure, create a baseline, and then you're going to inject a fault into it and see what happens. You may not have an accurate theory of what the outcome will be, but you want to at least have a baseline that you're measuring against.

A really clear example of this would be service levels. Maybe you're going to do a latency injection test, and you've got your normal traffic patterns for a service that you measure. You put service level indicators into place: the specific alerts and measurements that you would see in a graph. You would show your traffic pattern as it goes into peak traffic and trough traffic. What you want to see is: if I reduce the capacity, how does that affect my traffic patterns?

The injection might be, without much notice, to dramatically reduce the available capacity, and you would watch those metrics. You would watch the graph, or you would add latency if you're talking about QPS and things like that. You could inject a bunch of latency, and you would see those things slow down. You might see a spike or a drop in traffic patterns. Measuring that is really important. Having indicators and alerts around your services helps us determine whether or not these tests are successful and what the behaviors are. Those all go into the results.
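
A minimal sketch of that measurement loop (hypothetical numbers and thresholds, not Google's internal tooling): capture a baseline latency SLI, inject a fixed delay, and compare both against the service-level objective.

    # Sketch: baseline vs. fault-injected p99 latency compared to an SLO.
    # All numbers, names, and thresholds are hypothetical illustrations.
    import random
    import statistics

    SLO_P99_MS = 300.0  # assumed service-level objective for p99 latency

    def sample_latency_ms(injected_ms: float = 0.0, n: int = 1000) -> list[float]:
        # Stand-in for real request latencies; injection adds a fixed delay.
        return [max(1.0, random.gauss(120, 30)) + injected_ms for _ in range(n)]

    def p99(samples: list[float]) -> float:
        return statistics.quantiles(samples, n=100)[98]

    if __name__ == "__main__":
        baseline = p99(sample_latency_ms())
        injected = p99(sample_latency_ms(injected_ms=150.0))
        print(f"baseline p99={baseline:.0f}ms  injected p99={injected:.0f}ms  SLO={SLO_P99_MS:.0f}ms")
        if injected > SLO_P99_MS:
            print("Service breaches its SLO under injected latency: file a bug")
        else:
            print("Service absorbed the injected latency within its SLO")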

We have a couple of different types of tests at Google. Some are paper tests where you fill out a form and we ask you a bunch of questions. Those questions have very specific fields, things like: What is your goal? What are the risks? What is the impact? How are you going to do the test? How are you going to communicate during the test? How are you going to monitor the test and roll it back? Things like that. Those all point to specific metrics by which we will help you grade the results of the test.

We've created this flag system, and so, as we review your test, we're flagging on keywords. Then we can categorize tests. As you write up your results, we can aggregate all of the tests for a cycle, maybe Q1 or something like that, into dashboards. Then you can search and filter and say, "Show me all of the tests that had to do with latency. Show me all the tests that had to do with latency for the Spanner team."

Then they can take those results and we can compare them against real outages, because we also track all of our real outages. That helps people make decisions. If we're seeing a bunch of real outages that seem to be caused by a bad release process, but we're doing all of our testing in the area of latency, then that's a strong indication that we should have a conversation about changing where you're performing your tests.

Or alternatively, maybe you're having a bunch of outages related to your release process, and you're also testing your release process, but you're still having outages. That's a strong indication that we need to go deeper into that topic. We start with the test plan, the goal or the intent of the test, and we've got these metrics around it and how you're measuring what's happening during the test. Then we roll that into results so you can compare it and do some analysis and things like that.

That helps a team over time understand, "Am I making progress towards reducing the number of actual outages in a given category?" and improve their recovery time objectives and things like that.


Robert:
Brilliant. Casey, this question coming up is conjuring up an Austin Powers Dr. Evil moment. What processes/mechanisms are in place to safely restore a service when the controlled disaster exercise gets out of hand? Stated slightly differently, does Chaos Monkey have an off button, and if so, who gets to push it, and how is that decision made?

Casey:

Chaos Monkey can't really have an off button because what it does is discrete. It turns a server off and it's done. It does have customizations in our continuous delivery system so that a team can set termination policies that make the most sense for their service. But once a server's off, it's off. For our more sophisticated chaos experimentation, we're looking at things like, "Does a failure in one of our mid-tier services affect the customer experience overall?" To do that, we keep a very close eye on the customer experience overall for both our experimental group and a control group.

Having an automated experimentation system allows us to get a very strong signal from a very small amount of our customer traffic. Just by comparing the customer experience for these two very small groups, we can see that if there's a big deviation in the customer experience for the experimental group, then that's bad. The experiment will automatically stop at that point and just provide the context back to the stakeholders that, "Hey, we just discovered that we are vulnerable to this resiliency issue."

Those are the signals that are wins for chaos engineering, because it tells us we had this vulnerability we didn't know about. Now we can fix it before it affects the entire service. Yeah, definitely having that automated kill switch in there gives us more confidence to run the experiments in the first place.
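
A rough sketch of that kill-switch idea (hypothetical metric and threshold, not Netflix's actual experimentation platform): route a small control group and a small experiment group, and abort automatically as soon as the experiment group's customer-facing metric deviates too far.

    # Sketch of an automated kill switch: compare a small experiment group
    # against a control group and stop the experiment on a large deviation.
    # The metric, threshold, and failure mode are hypothetical illustrations.
    import random

    ABORT_THRESHOLD = 0.02  # allowed drop in success rate vs. control

    def success_rate(group: str) -> float:
        # Stand-in for a real-time customer-facing metric, e.g. successful
        # playback starts; pretend the experiment exposed a weakness.
        penalty = 0.03 if group == "experiment" else 0.0
        return 0.99 - penalty + random.uniform(-0.002, 0.002)

    def run_experiment(steps: int = 10) -> None:
        for step in range(steps):
            control = success_rate("control")
            experiment = success_rate("experiment")
            if control - experiment > ABORT_THRESHOLD:
                print(f"step {step}: deviation {control - experiment:.3f} "
                      "exceeds threshold -> stop injection, notify owners")
                return
        print("Experiment completed: steady state held for the experiment group")

    if __name__ == "__main__":
        run_experiment()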


Robert:
John, that leads quite nicely into, "Have you explored automation within DiRT exercises?"

John:

I mentioned we had different flavors of tests. We have theoretical tests, we have practical tests, tests that are written down on a virtual piece of paper and then executed. We also have automated testing. We built a coding framework we call Catzilla. You might think of it as our Chaos Monkey. It's internal to Google currently, and it's a way for us to programmatically cause issues and do it automatically.

For example, to go back to the latency example: where you might have HTTP requests, we use Stubby RPCs. Maybe you want to inject latency into a Stubby RPC between a front-end service and a back-end service. We have a pre-written, automated piece of code in Catzilla where you can go in and enter your user group name, and we will auto-generate the test code. You just hit ‘run.’ You set up a couple of protocol buffer variables: how much latency, over what period of time, what are your targets. Then we will just run that for you in the background. You'll get a result and a log, and it'll pop up any errors and even file bugs for you. If you want to run that on a schedule, like a cron job, we have that option as well.
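
Catzilla itself is internal to Google, so the shape below is purely illustrative: a guess at what a declarative latency-injection job with those kinds of parameters might look like, not Catzilla's actual interface.

    # Purely illustrative: a guessed-at declarative latency-injection job.
    # This is NOT Catzilla's actual interface; all names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class LatencyInjectionJob:
        caller_service: str      # front end whose outgoing RPCs get delayed
        target_service: str      # back end receiving the delayed RPCs
        added_latency_ms: int    # how much latency to inject
        duration_minutes: int    # how long the injection runs
        schedule_cron: str = ""  # optional recurring schedule, cron-style
        file_bugs_on_error: bool = True

    def run(job: LatencyInjectionJob) -> None:
        # A real framework would install the fault, watch monitoring, collect
        # logs, and file bugs automatically; here we only print the plan.
        print(f"Delay {job.caller_service} -> {job.target_service} RPCs by "
              f"{job.added_latency_ms}ms for {job.duration_minutes}min "
              f"(schedule: {job.schedule_cron or 'run once'})")

    if __name__ == "__main__":
        run(LatencyInjectionJob(caller_service="frontend-web",
                                target_service="backend-search",
                                added_latency_ms=200,
                                duration_minutes=30))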

We feel pretty good about knowing what are good tests or what are the right kinds of things to test or how do we limit the blast radius of tests. We said, "How can we make it much easier for more testing to take place and do it in an automated fashion?" We have this idea of an off-the-shelf test. If I just walk into a store, do I have to figure out what I want or can I just grab something off the shelf and it's going to work for me? By reducing the barrier of entry and making it super easy, almost completely free, for an engineer to run a test, we're helping remove the excuse as to, "Why should I run this test?" or "It's hard," or "I don't want to run it."

With our automated tests in Catzilla, we focus on running tests safely, securely, and making sure that they scale. You can run it across a very narrow section of a service or a whole fleet. This is great for fuzz testing, for example, or capacity draining, a lot of RPC stuff, Stubby-type fault injection, pretty much anything. If there's an API, we can mess with that API. If there's monitoring and alerting, we can mess with that monitoring and alerting. We can fire automatic pages. We can insert an issue automatically into your paging system. It'll cause a real page, and someone has to respond, without actually breaking your production system. We can mock that scenario.

Catzilla is our way to do automated fault-injection testing, and it's a framework, so engineers can write their own tests or they can use our free ones. I think we're experimenting. Eventually, we would like to make this open-source or available for other companies to run, in the vein of Chaos Monkey. Right now, we still have some work to do internally.


Casey:

It'd be great to open-source it, given that gRPC is basically the open-source version of Stubby, so now you actually have the foundation you need to open-source that product.


John:

That's right, yeah. That's exactly right. I think some of our challenges are: how do we open-source a framework like that and make sure that anyone who uses it is able to do so safely and securely, without affecting other projects or companies or users? For example, if you were to use it in Google's Cloud environment, or AWS, something like that, how do we make sure that running a Catzilla test doesn't affect other customers' virtual machines in the same zone? How do we have that level of isolation? Those are some of the technical challenges we're trying to work through.


Robert:
Awesome. Thank you. Bruce, is there an incremental path to introducing chaos engineering practices? For example, how would you ease into an environment where the operational maturity of a company is relatively low, and taking out a production system could or would probably result in unhappy executives?

Bruce:

Yeah, and I think there's a notion of, "Is chaos testing ever done?" I would say that it's never done. Unless your product is going to be discontinued and you're not going to support it anymore, it's never done, because you're constantly fixing bugs and constantly shipping new features. Even if you're not, the underlying infrastructure is probably changing underneath you, whether it's the public Cloud or kernel updates or just things that are getting updated that are out of your span of control. That said, this notion of experimentation is never really done at that point.

The other thing I wanted to add is, Robert, you'd mentioned that if you do it in production, that might cause an outage. A lot of times outages happen. Outages will happen; they continue to happen. I actually think outages are opportunities. In the case where you're not doing chaos engineering, when you have that outage, have that blameless post-mortem and ask, "How do we know we fixed this problem? Did we fix the problem? How much confidence do we want that we actually fixed it or not? Or do we want to wait until the next time this random set of occurrences happens?" The choice is not whether or not your system is resilient. The choice is when you're going to find out whether it's resilient or not. Are you going to wait six months or a year for that condition to happen again? Or are you going to do it when it's top of mind, when you've just put the fix in? If it doesn't work, you iterate and fix it again.


Robert:
Brilliant, thank you. Casey, how do you incentivize your engineers to care about destructive testing when obviously the primary focus is shipping good product, velocity, etc?

Casey:

I think the brilliance of Chaos Monkey ... The point of that was to make engineers care about, not destructive testing, but the outcome of it. Outside the people on this panel and people who are obviously working in chaos engineering, the rest of the company shouldn't care about the fact that we're surfacing this information through the methods that we're using, but rather about the outcome. At Netflix, there's really no mechanism to mandate a best practice, so we couldn't just go out and force our engineers to accept ... We could probably list on this call a dozen best practices for resilient engineering: fallbacks, data redundancy, stuff like that. We know what those best practices are. We couldn't force engineers to write their code with those by telling them, "Hey, you have to write your code this way." That just wouldn't work at Netflix.

Instead, what Chaos Monkey does and did was it took the pain of a condition that we actually felt when we moved to the Cloud, of instances just disappearing on a regular basis, and it brought that pain to the forefront of their attention. If Chaos Monkey is running in their pre-production environments, and then in their production environments during business hours, then that's what makes that important to them. They can't get their jobs done unless they solve this problem first. Now once they solve it, if they solved it well, then they don't have to think about it again. That's a great place to be from our point of view as well as theirs.

Really, chaos engineering is ... We're not trying to get them to care about what we're doing. We're just trying to get them to care about, and even acknowledge and be aware of, the implications of the systemic behaviors that arise from working on these complex systems at scale.


Bruce:

To add on to that, there's a component of a DevOps culture that's critical here: your engineers need to feel the pain of running their code. If they're getting paged, then the choice is really 3:00 p.m. or 3:00 a.m. to them. If someone else is getting paged because of my bugs or my code, then I'm not feeling those consequences, and incentives are not aligned in that case. Give an engineer the choice between getting woken up at 3:00 a.m. or 3:00 p.m., and they'll get woken up once. The very next time, they're going to ask, "Look, okay, how do I not get woken up and lose sleep over this?"

I actually find sleep is a very big motivator for engineers, well, in general, but you have to make sure that your organization is aligning the incentives correctly there.


Robert:
Brilliant. Thank you both. John, what qualities make a great chaos engineer? What do you look for when hiring? Is it somebody with good looks and charm like me?

John:

Of course! No, I think a really important quality is to be calm under pressure. There are definitely scenarios where the cost in the cost-to-value proposition is high. The risks are high, the impact is high, and so there's a lot of pressure to make sure that you're testing carefully and that you can put things back together efficiently.

If things do go awry, and they do go awry, you need to be able to remain calm, invoke incident management protocols, pull in the right experts, and try to mitigate the issues. Someone who's going to panic, that would be bad, obviously. You could be surprised: someone who panics in the sense of "Let's quickly revert" can be just as damaging and cause just as many problems as someone running around the room screaming, "My hair's on fire."

Collecting data; someone who's comfortable looking at the whole picture and asking, "What are all of the issues at play?" We have this saying at Google: "Look left and right before you proceed." Someone who's a technologist, somebody who's a generalist, is really important, and someone who has a sense of, or has trained around, incident management is really important.

At Google, we have this very flat hierarchical structure in your day-to-day work. Ideas can come from anywhere. You don't have to go through a bunch of senior levels to make any kind of decision throughout your day. You have a lot of autonomy. But when you have an incident, that needs to change. A very structured hierarchy is actually more efficient and helps out in a real disaster, so we look at what FEMA does, and the Red Cross, and try to adopt similar protocols.

Someone who's had some experience under pressure, someone who's had some experience in chaotic situations, someone who can learn quickly, those are the kind of qualities that we look for. Obviously, site reliability engineering experience is a plus, being on-call, those kinds of things. Having a background in tech is also helpful. The content of what we're dealing with varies dramatically. One day we may be dealing with a genome infrastructure system, and the next day we're dealing with our fleet of airplanes and our flight controllers and things like that. It's not quite as narrow. Hopefully, that helps.


Robert:
Yeah, sure. Remove all the technical aspects and that sounds like an ideal fit for my wife, because that sounds like her every day to be honest. Just closing out because we're getting to the top of the hour. Just quickly from each of you a short snippet on how you see chaos engineering or DiRT evolving over the next 10 years. Casey?

Casey:

I see this becoming increasingly relevant and adopted by more and more companies outside of tech, particularly as the systems that we work with become more and more sophisticated. I think classical testing will have diminishing returns on the kinds of systems that we test, everything from neural networks to the systems that run autonomous driving and things like that. You can't really introspect those types of systems, at least a human can't, in a meaningful way. Chaos engineering, looking at system effects and systemic behavior at the edge and designing these forms of experiments, is, I think, one of the only reasonable tools that we have on hand to build more confidence in systems like that. I do see a lot of increased adoption in chaos engineering's future.


Bruce:

You know, 10 years is a long time. To answer that question, I think about what tech was like 10 years ago. The public Cloud was not a thing back then; it was just emerging. The iPhone didn't exist yet, so the odds of us predicting 10 years out successfully are very low. That said, the early days of what we thought was crazy have now become the norm. The public Cloud is the norm now. That's no longer crazy. Mobile is definitely ... It's everywhere.

I think about the things that are going on today and what's moving in technology: things around Cloud services. Google has machine learning as a service now. Devices are everywhere, in everything. IoT wearables are becoming a reality. Then you also have serverless, which is this emerging thing. I think serverless today is what the Cloud was 10 years ago, the thing we thought was crazy. 10 years from now, I do think serverless will take off.

But in that world, there's an increased amount of complexity. I think chaos engineering becomes extremely important when you don't own all of the code that is required to run your application. Because you don't own every component, you need to validate resilience when you have these systems that probably don't fail very often, but fail often enough that you need to care.

Serverless and the new technology that we think of as emerging and crazy today will probably be reality 10 years from now, and the need for chaos engineering is actually going to grow over time as more companies and more developers do more with less, thanks to Cloud providers and Cloud services.


John:

Yeah, sure. Casey mentioned it, but I think machine learning is going to continue to come to the forefront and be prevalent in all different aspects of computing and business. Systems where fault injection, where chaos, is just built in: it's part of the norm, it's expected, and that's okay. Systems can become intelligent about what's breaking in them; that they can self-heal is definitely a possibility. Things like Bruce was alluding to around Cloud are going to continue to develop. Personal or customizable or private Cloud, I think, is going to be a thing; there's more of it, it's everywhere, and you have more customization around it. You're going to want to have the ability to test it and feel good about it and reduce that anxiety, which chaos engineering can help you do.


Robert:

Brilliant. Thank you. That concludes this AMA, and I'd like to thank you all for joining in on the discussion today. We've had so many fantastic user-generated questions roll in; unfortunately, we couldn't get to all of them on the broadcast. With that, a special thanks to O'Reilly Media for co-sponsoring this event and helping us bring together this amazing panel, and of course, a huge thanks to Bruce, Casey, and John for sharing their time and expertise with us today.