Time to clear the air. Ask the top DevOps and SRE minds how to scale your systems and organization for peak performance.
My name is Andrew Smirnov, I am a performance engineer at Catchpoint, performance advocate, work with a lot of our clients to set up their monitoring strategy. With that, what we're going to do is we're going to go through and introduce the actual panelists that you guys are all excited to hear from and I'll just be leading them to the questions you posted on the site.
First and foremost, we're going to go through and introduce Adam Jacob. He is the co-founder and chief technology officer at Chef. He's the creator of the Chef IT Automation Platform. He's been building new infrastructures for 15 different startups at this point. 13 years of experience as a system administrator, system architect, and tools developer. You might have seen Adam give his fantastic DevOps Kung Fu presentation at various conferences. If you haven't seen it, definitely check it out.
Next, we have Liz Fong-Jones. Liz is a site reliability engineer at Google and she manages a team of SREs responsible for Google's storage systems. She lives with her wife, metamour, and two Samoyeds in Brooklyn, and in her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights. Liz was most recently on an engaging panel with other SRE managers at USENIX’s SREcon16. More information on Site Reliability Engineering at Google can be found here.
Last but not least, we've got Charity Majors, she is the co-founder and CTO of Honeycomb.io, a new startup focused on mining machine data. Previously, Charity ran infrastructure at Parse and was an engineering manager at Facebook. She also worked on RocksDB Team to build and deploy the world's first Mongo + Rocks in production. Just a fun fact about Charity, she likes single malt scotch. If you ever want to thank her for her time or just get together with her, you have a scotch drinking buddy in Charity.
I do like to think of myself as a whiskey generalist. Single malts, first and foremost, but I really don't discriminate. Good. Cool! Did I write that? I was trying to remember if I had written a bio, but that was very well done. Thank you, I'm so happy to be here. Adam and Liz are two of my favorite people in the world, and the questions you guys sent in were amazing, sorry that the people sent in. Yeah, happy to be here.
Echoing Charity, the questions you all asked have been great. I really liked all of them. So thank you. Hopefully we'll address most of them.
Perfect, perfect. Well, let's get started, then. This is going to be a fairly interactive session. I'll usually call on someone to answer the questions and some of the panelists might chime in. We'll start with Charity. The first one is really going to be talking about defining SRE and DevOps.
Charity, are the distinctions between SRE and DevOps and operations just media-made? Are there valid differences between the two? What are they and why do they exist at all?
I thought it was hilarious that a solid quarter, two-thirds of the questions that anyone asked were a version of "What the hell is this?" "Define this." "Is there a difference?" "Am I doing it right?" "Am I doing it wrong?" "What does the job need?" "Is this thing a part of DevOps?" "Is this thing a part of SRE?" The answer is yes to every option that you listed. There are specific heritages. Can you make that a plural? I'm not sure. There are different lines of ... SRE was very much, Google started having this conversation ten plus years ago about how do we scale the human side of our software. Around the same time, Adam from Chef and Etsy and Jez Humble and a bunch of people were having a similar, but slightly different, conversation. Google has a very specific set of problems in some ways. DevOps is more open source and crowdsourced and “bottom-up” is the way Kellan from Etsy was describing it to me.
We can't be operators anymore. We have to scale ourselves technically. We have to learn new skill sets. Practice in this field was taking a very heavy toll on the people, on the humans who are trying to run these systems. We're all familiar with the signs. It's a classic thing right now, the burnt-out people who are caring for systems. People started reaching the age of thirty and realized this just wasn't going to work. Yes, I could ramble about this for a long time. I think that yes, the people who spend a lot of time talking about how these things are different or defined would say that DevOps is about empathy, it's about practicing, like, agile methodology. The people like Ben Treynor and all the amazing people at Google would say that SRE is about software engineering for humans. Liz, maybe you want to give a better definition of how Google would define this.
Sure. Basically, our perspective is that DevOps is, in class-inheritance phrasing, an interface that says, "Here are some principles, here are some things that you should consider doing." There are many different concrete implementations of the DevOps philosophy. Some of them were independently invented, but we also agree on very similar philosophies. Within the DevOps organization at Etsy, within the SRE organization at Google, the production engineering organization at Facebook, we kind of all agree on these principles and then it's a matter of what we've chosen to do to implement those principles and what additional things we've tacked on. If you're deciding to spin up a DevOps or an SRE organization, look and see, "What traits do I want my organization to have? Who can I borrow from?" Then you go and create your own implementation of it. It's very much a thing where we all inherit ideas from each other.
You also see this thing happening which drives me batshit, where people who are trying to hire, they just use either of these as shorthand for ops engineers who can code, which should be basically everybody now. They think that by naming their team a thing, they get those qualities and it does not work that way.
You're right, they don't get those qualities. I flipped on this a couple...
I remember the keynote where you publicly flipped on this!
Maybe six months ago. I'm a systems administrator. I feel like I've been one my whole career, I feel like I still am. There are people who would argue with that. I don't know that I do systems administrator work all the time. I don't know that I needed to be told that I was a DevOps engineer or an SRE in order to feel like I was great in what I do. I'm a systems administrator, I'm great at it. That developed over time and I feel good about it.
The thing about DevOps that happened that lost all control, and maybe to a lesser extent SRE, was that because it was this set of principles that we loosely agreed on and then everybody was willing to run off and build their own concrete implementation, there's a marker that you're saying when you use that word. If you're using it with good intentions, there's a marker that you're saying that what you want are people with that cultural background or that you want people with that set of principles, and calling it that is as good as anything else.
I don't know that it'd be better if you said, "I want you to join a DevOps organization as a systems administrator," or just saying, "I want you to be a DevOps engineer." Whatever, man, just be a DevOps engineer, call it a day. I wasn't a DevOps engineer, so when you call me one, I'm like, "I'm not sure that's what I am." If that's what you need to call yourself, call yourself that.
It's really interesting, the idea of whether DevOps or SRE are names of organizational styles or whether they're people's job titles. That's something that people don't really make a good distinction between. Another one that Tom Limoncelli gave a really great talk on last night at the New York Tech Talks I organize is basically, "Is DevOps or SRE about the technologies you're using?" Is it about Git and Docker or is it about the philosophies behind it? [People] fixate on the technologies rather than on the principles. If we spend some time unpacking the principles, we realize that we're very much on the same page regardless of what tools we're using or what job titles we have. It's the organizational model and the way we think about the work that we do.
Absolutely. I have a friend, actually, who is hiring for a startup and he has just posted the same exact job with both the DevOps engineer and SRE titles and just watched over time to see. DevOps is winning by ten or twenty percent, but the title just doesn't matter, and you also don't make your team magically have these qualities just by naming it that thing. Focusing on the name is wrong.
I think this transitions to our next question perfectly for Liz. We touched on something there. It's a culture. It's the idea of how the company is run just as much as it is actually the tools and the infrastructures in place. Liz, based on what you've seen, what kind of organizational layouts or reliability and performance goals, how does SRE and DevOps actually fit in?
Sure. There are a large number of ways to do it correctly. There's not one true way model. It depends a lot on your own organization, but I think one of the key elements that's essential to most structures is that the engineers you have working in SRE or DevOps need to be empowered equally with people that are doing product development software engineering. The problem that we were trying to solve with the founding of SRE and the founding of DevOps was that you have this model where developers throw stuff over the wall and tell Ops, "go make it work." That was just a really shitty situation for all the people involved.
The idea is that people need to be equally valued, people need to have many of the same skill sets. Your product development software engineers need to understand how you do operations. Your operations engineers of whatever form you choose need to be able to write software and write automation. It's a matter of making sure that you have the right mixture of people and you have a respectful culture, where people can have disagreements and can resolve them as equals, rather than saying "I am hiring you. You are going to do this."
In terms of answering the concrete question of how you set up one of these organizations, there are many different models. The one that is most frequently used at Google is the model where you have a product engineering organization and you have a number of SRE teams, each of which is 12 people split between two geographic locations. With that being said, Google is a large company, we have multiple international sites, this is something that we can do that not necessarily every other organization can do.
I know that, for instance, in Facebook and many other places they will embed one or two production engineers or DevOps engineers in each product engineering team with the idea that they will let their ideas assimilate into and pervade the entire culture of the team. The downside of that of course is that those engineers feel a little bit more isolated and not necessarily a part of the "AY" organization, but there's also less friction. You have the identity of "I work on this team" rather than "I'm the SRE team for this product."
Totally. I just spent a couple years at Facebook and I'm going to bring up the C-word. I can't believe it hasn't been said yet. Context.
I thought it would be Catchpoint?
Catchpoint. Context. Sure. Every single thing that we're saying is, like, we'll give advice, like from the mountain, but every single thing that we're saying depends completely on the context in which it is implemented.
For them, the hard part about the transition is less about work structure. The two things you want to talk about first are work structure and technical choices. They're concrete and you can do something about them. You can change your technology choices, you can change your work structure, you can do those things. I think in those organizations, especially in enterprise, more than anything, what you tend not to know is how it feels to run high context, highly integrated operations and product development teams.
Your very first step isn't change my organizational structure or adopt a bunch of technology, it's run a very small project where your real goal and focus is to just experience what it's like to run a well-integrated operations and development process where they work together to build those platforms. That's where you start. Then later on all of the context, do you wind up having global SRE teams that serve everybody? Do you have just DevOps that's isolated into individual products? Do you run traditional Ops?
There are other products where you are like infrastructure as a service. Where, f%$# product, honestly. It comes down to this: you really need a robust Ops team and they need to have final say. You need to understand what your product is and what your mission is and fit the human teams in service of that.
I think one of the things that we see out in the enterprise, one of the things I get to do, I think probably a little more than Liz, I don't know about Charity. Charity will have to do this soon enough because she's running her own company, so 20 minutes from now Charity is going to be like, "All the enterprise companies that I've seen in the last five years."
At Facebook, the model is largely you have the product software engineering teams and a couple of embedded engineers who are experts in doing productionizing. They also have teams that function much more like SREs. They also have teams that are like old school ops. It's really about ... anytime you try to just define rules and follow them, without paying more attention to context and guidelines, you're taking your focus off the prize.
I totally derailed myself and my own thinking there, but context is key. Principles are good, and making sure that you're communicating and getting people on board with what you want to do is way more important than following a rule book. There are plenty of products where reliability isn't that important. Let's not act like this is the most important thing for everybody. Some products can get away with two nines or less and it's fine. Ops should not be calling the plays there.
You actually have the contextual basis, the c-word, to be able to make that decision in a way that makes sense in the context of your business.
Perfect. I think that maybe, Adam, I think this is a perfect segue into what you were mentioning there. The question comes up is, obviously embedding with the product teams and making sure that DevOps or SRE or whatever you call it, is embedded with the product teams. How do you prioritize and coordinate both performance and reliability work across many product teams?
I think the first question is you have to, in a lot of organizations you actually haven't really thought through, or you don't have an organizational perspective on what it means to be a product team. Right?
You have engineers, you might have a product manager, you might have operations people but you haven't thought about "What's our orientation toward product, how do we think about product development? Where does it begin, how does it end?"
I think that when you think about how you coordinate things like operations, things like release management, things like monitoring, bug fixing, performance and reliability testing, the idea is that the product doesn't stop when the software is written, but instead it stops when the customer receives its value.
There's a loop and it goes all the way to the customer and it goes all the way back. When you think about designing your product teams you have to design them to cover the whole loop. What that means is the product owner is responsible for the availability of the product that they produce. They're responsible for whether or not that system meets the requirements it needs. If it doesn't you have a requirements problem.
It starts at the top of the loop and goes all the way through. Ideally that loop's not super crazy long. Everybody makes fun of waterfall. You read the original waterfall paper, the one [Winston Royce] wrote in the seventies, and the diagram we've all been hating on, he hates on it too in his own paper. He's like "don't do this though." Then there's another diagram that no one ever shows you with all these tight little feedback loops, where he's like "this should be really fast." That's what you're trying to get to.
The product goes across. When you think about how do you scale it out, that's how you scale it out. You scale out this idea of product ownership.
One thing we do at Chef is we say that products are owned by a product owner, an engineering owner, and a UX owner. Every funded engineering project has all three of those things and they make those decisions together. That's one of the ways that you can avoid ... you can get the context back in of "Oh I have a deep engineering problem. We really need to fix this bug right now." You don't have to negotiate all the way around because that context is much closer.
I think that what Adam is saying is totally true, and if you think about it, as interesting and challenging and hard as software engineering is, or can be, the cost of building a product actually rounds down to zero when you compare it to the amortized cost of maintaining it and operating it over the life cycle of the product. Which is why looping operational stuff in earlier is one of the best ways to build a sustainable product and a sustainable team.
DevOps has really focused this conversation so much around operations people and teaching system administrators how to write code. Good job! Yes! Message totally received. We so leveled up as a discipline over the last ten years. We've not completely, but largely targeted the message at systems people and we're starting to catch up and be like "No, software engineers, this is equally about you learning to own your shit." Learning Ops skills, you don't have to be a specialist, but this is an equally shared burden. It's not just about lecturing Ops people about learning to write unit tests.
Can you discuss a little bit more about product requirements? When you design a product, one of the features you should be designing for is what is the desired reliability level, what's your error budget? Even before you necessarily are working with SREs. Then when you say, "This is my error budget, this is the reliability level I want to hit," then you can have a kind of discussion about the resources it's going to take to do this. This is what the effect is going to be, do we want to negotiate that, what are users going to accept? These are useful things to consider.
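To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an availability target expressed as a percentage over a rolling window. The function names, the 30-day default window, and the downtime figure in the usage lines are illustrative assumptions, not something the panelists specified.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the window can absorb while still meeting the target.

    slo_percent is the availability target, e.g. 99.0 for "two nines".
    The 30-day default window is an assumption for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_so_far_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left in the window; a negative value means it is overspent."""
    budget = allowed_downtime_minutes(slo_percent, window_days)
    return budget - downtime_so_far_minutes


if __name__ == "__main__":
    # Compare how much room different targets leave over the same window.
    for target in (99.0, 99.9, 99.99):
        print(target, round(allowed_downtime_minutes(target), 1))
    # Hypothetical: 100 minutes of downtime already spent against a 99% target.
    print(budget_remaining(99.0, downtime_so_far_minutes=100.0))
```

Having the number in hand is what turns "how reliable should this be?" into the kind of resourcing negotiation described above: each extra nine shrinks the budget by a factor of ten.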
What, to both of your points ... one of the things we see a lot of now, the religion of DevOps, the cultural piece of it is super easy to adopt. It's pretty simple to be like "I want to believe those principles. Those things sound. Yes!" Product is actually where we see the next frontier for making it all work. The missing leg in a stool there is product thinking about reliability as a requirement. Thinking about that as a first class piece of what you build into the system and thinking about it holistically: as we build our products, are we building them horizontally, to be able to see the entire product end to end and see what that's like for the customer?
Those are things that Google and Facebook are better at than everybody else on the planet is. When you look at how large enterprises build their products, they don't build them that way. They build project plans that build up a bunch of value over six or eight months and then ship it, and then build more value and then ship it and the cycle's really tough.
You can really look at the Googles and the Facebooks as like where everyone else is going to be in a couple years because the forcing function of size and scale just forces them to think about these things sooner. Pretty consistent. And be like "Yeah, we're all going to be doing that in a couple years to some degree."
Perfect. Adam, just to wrap up on that subject, there's two other questions. You touched on products and a little bit on scales, so do the SRE/DevOps teams provide the tools that the product and project teams hook into? Is it their responsibility or is it what you were talking about, a shared responsibility, figure out your tooling and monitoring strategy?
I'm sure there's more than one way to do it. I'm just going to give you ... I'm going to stop using that caveat because eventually it'll get boring. You'll just get tired of me being like "Well, it super depends."
I think it's the responsibility of the entire team to build tooling that works for them and for their problem. I think rather than thinking so much, I'm a believer that the tools you use reinforce the culture you want to have. If you want to have a culture that works a certain way, that releases often, but your release process sucks, un-shockingly you won't release often because it sucks to do.
The people who know best what that tooling is happen to already be the people who are building the product and doing the work to operationalize it. Rather than saying "this is our platform and thou shalt use that platform," my preference is the version that says "here are the requirements of software to get into production." For example, you need to have an acceptance environment where you can deploy that code and see it running.
Now that may be a private acceptance environment, it might be a slice of production. Those huge amounts of work where we deployed to 1% of our user base and saw what happened and that's a completely acceptable acceptance environment. Which one you should have, depends on the product, depends on how it goes but the idea you should have one is there. Who decides what the shape of that acceptance environment is and who decides what technology goes into it, to me it's that product team. That product team is engineering and operations and all the other people who are involved.
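One common way to get a "slice of production" like the 1% rollout mentioned above is to bucket users deterministically by hashing their ID. The sketch below assumes string user IDs and a per-release salt; `in_canary`, the salt value, and the bucket count are hypothetical names and choices for illustration, not a description of how any of the panelists' companies do it.

```python
import hashlib


def in_canary(user_id: str, percent: float = 1.0, salt: str = "release-42") -> bool:
    """Return True if this user falls inside the canary slice.

    Hashing salt + user_id gives each user a stable bucket in [0, 10000),
    so the same user always sees the same version for a given release.
    The salt is a hypothetical per-release value; changing it reshuffles
    which users land in the slice.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(20_000)]
    exposed = sum(in_canary(u) for u in users)
    print(f"{exposed} of {len(users)} users would get the new build")
```

Because the assignment is a pure function of the user ID and the salt, widening the rollout is just a matter of raising `percent`; users already in the slice stay in it.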
You can look at the rest of that process, that pipeline, that gets you to production, and you have one, even if it's messy or ad hoc, and there are steps there that no matter where you come from everyone does. Acceptance is a good example. Code review is another one. If you're going to do this and be big and be fast, everyone does code review. You could be like "I hate code reviews, I don't want to do them," and that's fine, but then your process will suck and eventually you'll do code review just like everybody else.
How you do it, whatever, you have to do it. How it stitches together, it's more up to you, it's more up to the individual teams, but it's definitely owned by the holistic team, not by operations people providing a platform to engineers and going "if you write all your code to this platform then everything will work out."
The second side effect there is the bad engineering one. Which is, if you're building a product you want to be able to build product that is delightful to your customers, that has some kind of thing in it that makes it special. Often, that special thing will come because engineers are willing to take risks on what's possible to build. If they have to work against this constrictive platform, they're not free to build the thing that they think will be amazing because they have to constrain their choices based on your operational platform. Which is not worth it.
I think a lot of what you're talking about comes down to what the community calls or talks about in terms of whether you're optimizing locally or globally, because there are tradeoffs every single time you make one of these decisions. If you choose the perfect local solution, in terms of your language or storage system, for every individual local thing, you could have perfectly constructed solutions in each of your tiny little things and an absolutely unmaintainable mess. Nobody can share stuff, they're not using ... you don't have the ability to move engineers between teams. You don't have the ability to reuse each other’s work.
This is something that every organization has to negotiate for itself basically in every single situation. The larger you get the more you need to choose to optimize globally at the expense of some of the local optimizations. Whereas when you're small, and you're a tiny little startup and you could very well, you'll probably fail in the next year or so, optimizing for all of your teams using the same tools and building code everyone can ... forcing everyone to use the same storage systems, this is not actually a priority mostly because you are focused on empowering people to move really fast.
Startups don't die because they move too quickly.
Move too quickly. Well on the flip of that is in the enterprise where [inaudible] tooling sucks right now. For every single large enterprise you have a bunch of organization wide things you tried to propose and you're like "thou shall only use blah blah blah," and at the same time you want to change dramatically how you operate but you want to keep that same platform choice [crosstalk] centralized [crosstalk].
These choices are almost irrelevant from a technical perspective and they're relevant mostly from a human perspective because your humans will get so burned out if you are making the wrong tooling choices. Sorry Liz, go ahead.
I feel like it's a matter of ownership. We talk about the notion of having one platform and having it be someone else's problem. You can have one platform and have everyone share it and contribute to it. Or even three platforms. If you have a large organization, say the size of Google, you have three platforms it still is better than having two hundred different platforms.
It's still everybody's problem.
Yes. That is the key point. You are building one thing that everyone shares rather than saying "Oh I'm going to blame those people for constructing such a shitty platform," it's our platform.
Mic drop. Perfect. We're going to shift the subject a little bit but I think the next topic in a few of the questions we saw is the discussion around build versus buy. I think we're going to look for your help on this one Charity. What are some of the lessons learned from building versus buying versus assembling solutions?
I super selfishly seized on this, that there were a couple questions around it, because my thinking has changed very radically over the last year or two. I wouldn't have done the startup if it hadn't changed. I came from Parse, where we were a backend as a service for mobile apps. A million mobile apps got built on Parse. I was like, "Holy shit. If I was building a mobile app, I would never use someone else's. My God, the lack of control." I had so much fun building it.
I saw how it empowered so many people to move and experiment and try so many things so much faster. Eventually a lot of them needing to build their own stuff, but they were able to just rapidly iterate and stamp things out by building on the expertise of this group of world-class backend engineers who had built a platform.
I loved working on Parse. A year ago I did ... VividCortex, Baron Schwartz has this monitoring as a service, and I've always been, like, outsourcing your monitoring, your metrics? This is key, this is core to what I do as a systems engineer. I just couldn't even envision giving up that kind of control.
Baron asked me to do a 100% honest, like I'm capable of not being blunt, but he's like "just do a blunt review of our service. If you hate it, say so." I spent a week playing with VividCortex and monitoring different databases and I came out of that experience going "I could never build this in house." The reason is engineering cycles are scarce, it is the scarcest resource you have. Almost always. It's so much more expensive than money. Money is so much cheaper than engineering time, if you're doing it right. That's actually the right constraint.
It reminded me of, I was a system administrator in the early 2000s, remember there was a year when everyone decided to outsource their email? We all used to run Postfix, and [inaudible] and ClamAV, right? And our own anti-virus and IMAP. I loved it! I loved being a mail administrator. Then Gmail came along and we all just collectively went, "Wait, we're spending an engineer's worth of time on this? We're inventing the same fucking wheel over and over. We should not do that." It was a very, very rapid shift in a short period of time.
I feel like the entire world of metrics is not as quick and as decisive of a shift but the technical expertise has gotten so deep. I think that we're moving from a monitoring system that we run in house, we all have the same Ganglia and Nagios and whatever, to having sort of an observability platform. Where we select a couple of the things that are going to empower us to succeed on our core differentiators.
When you're asking about how do you make these choices, I think it's about being absolutely ruthless about what is core to your success, what are the things that you're doing that nobody else is doing and that's why you're in business, and to the extent that you can, spend your resources on those things. Try to get people who are better than you, or who are focused on doing the things that are more ancillary, that are helping you get there.
Don't spend engineering time on things that are not your competency.
Yeah. Exactly. Exactly. Building loosely coupled systems? This is the other thing. We're all platforms now, whether we think of ourselves as platforms or not. A platform isn't just an API, it's a product that gives people a lot of flexibility and a lot of choice and a lot of creativity. These are the kinds of things that people are increasingly expecting out of every product.
This brings chaos, because humans are the greatest drivers of chaos. Every [inaudible] so hard, do not give people the ability to run regular expressions on your databases because they will fucking do that and you will hate them and yourself.
Giving people the freedom and flexibility to be really creative with your product means that you're behaving like a platform. And you just have to embrace that complexity. Trade off by making ... Ops problems are your problem. They're everybody's responsibility like everybody was just saying, but make as many of those problems as possible not your problem. Pay somebody else. It's cheaper.
I'm going to be the wet blanket, just for like a smidge of a wet blanket. One of the things that happens, especially in the enterprise, for us, and by us I mean like me and Charity and probably Liz and others, we come from organizations who are almost never afraid to build whatever they need to build.
If we see a problem, and the way through that problem is to build software, we will build software. That's actually our default position. In the story Charity just told, her flip was "Oh my gosh! I could buy software that solves this problem for me better than the software I would engineer." Her default position was "I could engineer this" and then she flipped. Often in the enterprise it's the opposite. The default position is "I could never engineer it. I have to buy something to solve my problem."
That's a really interesting point, thank you.
And that's actually very perniciously bad when you think about how you do DevOps and SRE, because with the question of when to buy something, you have to know the value and the qualities you need in a platform in order to build on top of it.
Absolutely. You should never buy something you don't know how to think about.
What it is. Often in order to do that you need to build something. Building a prototype ... I've done this a lot, I've built a prototype which worked for what I needed and then bought a product that was better than my prototype. I knew enough about the problem to understand that the product I was buying was going to fit into the overall, I love the idea of observability platform, the overall platform we were building into what we wanted.
Start from the position that says, "I could build this." I build generic software, I build configuration management and automation platforms to be used in the broad by everyone in the world. That's way harder than building an automation platform for one company, even if the company is Google. The part where it's not generic is a godsend, in terms of how much you can decrease ... it's still really, really hard. Don't get me wrong, I'm not saying that it's like trivial, anyone can do it, it's not, it's super hard, but it's harder to be generic. It takes longer, it's more expensive, the outcomes are [inaudible], there’s a difficulty curve. The way they feel isn't as good.
In the enterprise if everything you touch is generic, if you're unwilling to build anything in order to make the experience right, what you wind up with is an awful user experience. Whereas you go to Facebook and you use Phabricator and you use Arcanist and you use all this tooling that's super well integrated that drives the whole company, it feels great and that's because they were willing to build that software. They're still using lots of commercial or open source software that they didn't write behind the scenes. So there you go.
I would like to echo in and absolutely amplify your observation that building a platform is exponentially harder. You have to be better at thinking through all the nooks and crannies than someone who's building a one-off. It's not just like linearly harder it's much much, yeah that way. That's how it looks.
The important part is that doesn't mean that the non-generic one is worse. It's probably better for the person that's using just that one thing that does exactly what they want.
Often very true. At Parse, absolutely. If I was trying to build ... we were supporting what, every single kind of app you could think of. Write-heavy, read-heavy, social apps, B2B, geographical apps. I could have built a better backend for any one of them than for all million of them, but there's only so many of me and people who are on my team who are amazing.
This is just kind of part of scaling as an industry: specialists go and build something that is hopefully 80% good for way more people, rather than 100% good for one or two people.
What about places like banks with closed IT infrastructures where new tools or new solutions are very heavily scrutinized and the existing ones are dated. What can a particular DevOps or SRE do about that? How do we influence the change to where some of those closed systems become more open systems, while maintaining some of the security requirements?
Step one is you focus on the platform and step two is you reacquaint yourself with the software you already have. Let's use banks as a good example. I was in South Africa doing some work with a bank; they use SAP core banking to run the bank. They didn't write software to run the bank, they bought software from SAP that runs the bank. That software has security updates and has a life cycle and does all the things that you'd expect it to do.
Is there value in destroying the core banking platform so that it is more DevOps friendly? Probably not. Maybe. I don't know. How often does the core banking platform have to change? What you can do is you can think about how does change get introduced into the environment? How does change come into core banking? How do we roll out those new releases and can we put it through a similar flow as we put the faster pieces of what we do.
As we think about how people work, can we get them to work in a similar way even though the technology they control is far less? The kind of change that goes into something like core banking is more often like a black box software update than it is a really fast iteration loop on a [inaudible 00:43:28]. Does that mean that we don't need to have an acceptance environment? That we don't need to have code review? That we don't need to have observability into how that environment behaved? We don't need to know who approved those changes and why? All those things are true.
Rather than thinking about "in order to do this you have to replace all the software," instead you start looking at what's the work flow that we want people to engage in? What's the way we want people to work together? Build tooling that supports that and then use that to sort of strangle the legacy platforms so that yes they're still legacy in that they're still the same software, but you manage them in a completely different way than you were before.
It's difficult, but that's the pattern we've seen.
What happens when there are problems? We're at the stage where it's not if something will happen, it's a matter of when and so we instrument systems for resiliency and high reliability and disaster recovery, but when something does go wrong should SRE and DevOps have a say if a deployment goes out or not? More importantly, are they responsible for enforcing some of the monitoring strategies in place to make sure that as things happen, we're actually learning from it.
The way that we start first is we start by having the error budget. If we say you know this problem consumes 10% of our error budget, we're 90% of the way through the time period of the error budget, we're going to say "Okay, that's probably within spec." Yes, maybe we can do some things to make it less likely to happen again, am I going to spend inordinate resources to [inaudible] that again, no, because again we've figured out what the product requirements are.
However if we have a situation in which something is going to cause us to exceed our error budget if it's left unchecked then we go and rigorously say "What are the things that we learned?" Then we figure out who's going to work on them. Because of the fact that reliability is a feature of the product, the product is everyone's responsibility.
Some of these things get assigned to people that are doing product development, software engineering, some of these things get assigned to SREs and it really depends not necessarily ... It's not "Oh this is SRE's fault or problem," it's instead "SRE's may happen to have the right skill set to help with a given post-mortem action item or they may not." Then we kind of use that as a guide to figure out who to assign things to.
In general though, in terms of whether or not a release goes out, if we are doing fine on our error budget and doing a release doesn't consume error budget, or consumes only a tiny fraction of our error budget as we reach hard instances, well that's okay. The friction comes when you're exhausting your error budget and someone has to make the call: is this release more or less important than preserving the reliability feature? It's one feature versus another and we can prioritize that.
It's very much a discussion and we're all equal partners, we all care about the product succeeding.
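The reasoning above (compare how much of the error budget is spent with how far through the budget period you are, and only escalate when the budget is on track to be exceeded) can be sketched in a few lines. This is a minimal illustration, not Google's actual policy: the names `BudgetStatus`, `projected_consumption`, and `release_decision`, the linear extrapolation, and the three outcome labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    budget_consumed: float  # fraction of the error budget spent so far (1.0 = all of it)
    window_elapsed: float   # fraction of the budget period that has passed (0.0 to 1.0)


def projected_consumption(status: BudgetStatus) -> float:
    """Linearly extrapolate budget spend to the end of the period (an assumed model)."""
    if status.window_elapsed <= 0:
        return status.budget_consumed
    return status.budget_consumed / status.window_elapsed


def release_decision(status: BudgetStatus) -> str:
    """Return a hypothetical label for how a release conversation might start."""
    if status.budget_consumed >= 1.0:
        # Budget exhausted: release versus reliability becomes a prioritization call.
        return "discuss"
    if projected_consumption(status) > 1.0:
        # On track to exceed the budget if left unchecked: review what was learned.
        return "review"
    # Within spec: ship as usual.
    return "ship"


# The example from the discussion: 10% of the budget spent, 90% of the period elapsed.
print(release_decision(BudgetStatus(budget_consumed=0.10, window_elapsed=0.90)))
```

The labels only route the conversation; as described above, the actual call is still made jointly by the people who care about the product.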
What is the unit of measurement of an error budget, Liz? Is it dollars is it time, what is it? What's the unit of measurement you use when you discuss error budgets at Google?
It depends on the product. In the case of most user facing services at Google it's basically of the form the number of queries we received as the denominator, the number of queries we successfully responded to within the time limit is the numerator. For other services the error budget may be that the billing pipeline runs 99.9% of the time on 99.9% of the days and if you fail ... Sorry that's a bad example.
The billing pipeline will complete in three hours on 99% of the days. You basically have to figure out what are the metrics that your customers care about and then go from there. It's not "we're going to serve no errors" or "we're going to serve every query in under twenty milliseconds." Great. What are your queries? Are all of them equal?
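The two kinds of measurement described here, a ratio of good queries to all queries and a share of days on which a pipeline finished in time, can be written down as a small sketch. The function names, the sample numbers, and the treatment of an empty window as fully healthy are assumptions for illustration; they are not taken from any Google tooling.

```python
from typing import Sequence


def request_sli(good: int, total: int) -> float:
    """Queries answered successfully within the time limit, divided by queries received."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as no failures
    return good / total


def pipeline_sli(durations_hours: Sequence[float], limit_hours: float = 3.0) -> float:
    """Fraction of days on which the pipeline completed within the limit."""
    if not durations_hours:
        return 1.0  # assumption: no runs counts as no failures
    on_time = sum(1 for d in durations_hours if d <= limit_hours)
    return on_time / len(durations_hours)


def budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left; negative means the budget is overspent."""
    allowed = 1.0 - slo  # the error budget itself, e.g. 0.001 for a 99.9% target
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / allowed


# Hypothetical numbers: 999,500 good queries out of 1,000,000 against a 99.9% target.
sli = request_sli(good=999_500, total=1_000_000)
print(f"SLI={sli:.4%}, budget remaining={budget_remaining(sli, slo=0.999):.0%}")
```

Which events count as "good" is the product decision discussed above; the arithmetic stays the same once that is settled.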
Yeah. Context, right? Even at Google there's a lot of difference between what meeting your error budget is for search quality versus Dev tools, internal stuff. This is often a personality thing.
I've gone in for interviews where I could tell it wasn't going to be a great fit for me, when they're like "every time that we have a 500, we post-mortem that." And I'm like "whoa, this is not my..."
I require a larger tolerance for risk than that in order to be happy. For some people that is what makes them happy in life. This is a thing that not enough companies think about in really explicit terms and talk through the implications of; they just have this shared cultural context that evolves from who they have, what the product is. I like having these conversations out loud. It exposes so many of the implicit assumptions that you're making.
If you're interviewing some place and they haven't thought about these things that's a bad sign for me. I want to work some place that is really conscious about all of these factors that are cultural as well as technical.
What's been the biggest outage that you've seen and been a part of? And more importantly, what was the big takeaway, either at the organizational level or you've learned over time?
I'll use my own experience so as not to out someone else's outage. I was working as a systems administrator, this was in the era of when people who weren't Google could have ad platforms that were successful. Back when ad platforms were a thing, we had this big advertising platform. I'm sure there's someone on this thing who's like "I run a very successful ad platform," I'm sorry, I'm sure it's beautiful.
We were running an ad service and they did a rebuild, so they had built it one way using MySQL. Then they decided they were going to get greater availability by buying Oracle RAC. So we ported the whole system to a big Oracle RAC cluster. If you've never run a well-built RAC cluster, it's amazing. Queries in flight on a box, the box can die and the query will finish off of another machine. It's so good. It works like magic.
We put it together, it took forever to build and it was really complex but once it was running, if you didn't touch it, it just worked. It was very expensive. It was not cheap to buy. The software was expensive, the gear was expensive. When it came time to sign the invoice for this thing, they cut the $70,000 or $80,000 backup software that was required to back up RAC from the thing. I was like, "Guys you can't cut the backup software. We have to back up the database." And they were like "No we don't, it's highly available. It's this magic database that never goes down and everything will work." I'm like, "Well, okay but you really need to buy the backup software. Just seriously, just do it." They were like "Nope, we're not going to do it. That's an order, we accept the risk." I was like "Okay, boss!"
We implemented this platform, got it all built, put it in production. Every couple months I'd come back and was like "Hey, we're still not backing up the database." They're like "Still don't care." I was like "Still not backing it up, still don't care." Then the machines started to reboot, so every once in a while one of the clusters, one of the servers would just "boop" let go and just reboot the box. It was no big deal because it was RAC, everything was magic. Then one day, one of them reboots and the whole thing hard stops. The entire cluster just dies and the database is corrupt on disk.
We go in to try to debug why this has happened and what turned out over a long period of time was that the machines were rebooting because memory was getting fragmented to the point where they couldn't allocate a task_struct in the kernel anymore. One of these things happened in the moment when there was a write to the database and it literally just corrupted the on-disk format. And Oracle has a raw disk format where they just write directly to the drive; there's no file system, Oracle is the file system.
We paid this dude, who just showed up at random, came out on an airplane with a suitcase and a briefcase. He'd sit at a desk and he'd open up a hex editor, like he was reading the database through hex. This guy is a con artist of the fifth dimensions, maybe the best. He literally read it visually, found the corruption, fixed it by hand and then it booted again, and the lesson they learned was buy backup software even if your cluster is highly available. Which is maybe boring, but it wasn't boring at the time, it was awful.
My worst war story and I see Lex there, so he knows this story. We were at Linden Lab together. This was years ago. Still matters. How many of you remember upgrading from MySQL 4.1 to 5.0? Which was a breaking change where you couldn't replicate backwards?
How they published all these metrics saying that it was faster. 5.0 was faster than 4.1. I'm sure that's true for some people, definitely not true for us. We had upgraded everything, everything except the master primary, and we upgraded the primary. To cut a very long story short, we ended up having to roll back but couldn't roll back, so we lost like two days of data. We were down for a day or two while we tried to repair it. Worst outage in my life. It left us so scarred that I went off, took an entirely different job within Second Life, Lex knows this story, and built software for sniffing all of the queries over tcpdump and replaying them against new hardware. Lex took it over, there's the link, it's amazing. It's great for testing databases, especially if you have to be scary. The end. It was amazing, so much fun. I learned so much.
When it comes to hiring the right team members, what are some of the hallmarks of a great SRE? How would you test them conversationally in a few minutes or in other detailed manner, perhaps with an exercise? How do you make sure you get the right people to be a part of your organization that can influence this kind of change?
Absolutely. Unfortunately it's difficult to test for this in a fair way, which is why people currently do it. I would say the number one most important characteristic about someone who wants to do SRE or DevOps is you have to be curious to really understand how things work. You have to be a good human being. You have to be willing to give people the benefit of the doubt. You have to be willing to solve a problem rather than pointing fingers at everyone else.
If you have those two things, you can learn any of the other technical skills. You can learn to software engineer, you can learn to systems engineer but you have to start from the basis of being curious about things and from the basis of being someone who's great to work with that you would be fighting a fire with and have no qualms about working with.
My favorite coworkers are the ones who have a sense of humor at 2 or 3am when you're up and the entire world is on fire and you don't know if the company's going out of business because you fucked up. You guys can still crack jokes at each other. Those are just the best.
Great communicators and great influencers. People who don't need to wield explicit power to get shit done.
We're winding down on five minutes so I think the last one was kind of an interesting, it's a cultural respect for DevOps right? There was a question about essentially DevOps culture in East Asia, we believe there's some relevance to all organizations today and this one's for you Charity. How can you overcome cultural bias in which Ops is viewed as second class or vice versa of the product development team or business department?
Yes, yes. I love talking and thinking about this, and again we can spend all hour on this, but super briefly. Super briefly, pay them. One of the best things ... People who are like "I can't get any great Ops people, oh, I pay them 75% of what I pay my software engineers," okay, that's just dumb. You signal what you ... this is a capitalist system, you signal what you value by compensation. That's stupid but it needs to be said.
More importantly, cultural respect for operations starts with respecting yourself. Respecting yourself when you have software engineers who are like "No, my time is too valuable to get paged or interrupted." And you're like "Okay, mine isn't?" If you don't treat yourself and your team with respect, if you don't respect your time, if you let yourselves get paged all the time and interrupted and burned out. If you don't let people ... If they don't take vacations. If you respect your team, if you value the work that you do and you think it's critical and important and if you value it ... I've never worked at a place where Ops wasn't valued.
I think that's because I've always been fortunate enough to work with people who are like "Yeah, the work that we do is absolutely fucking important. It's just as important as writing the code." This has always been mysterious to me, going to other places where operations engineers have this stigma, or they don't feel as good as ... And I'm just like "No you are, but own it. Can't expect anyone else to treat you as an equal if you don't feel that way yourself." And practice that.
It's kind of interesting to think about because, at least what I've seen from Google is that most product development software engineers would love to have more SRE's and tend to adore the SRE's that they work with. Part of the reason is they've realized that people who specialize in SRE have a skill set that they want, that they aspire to have more of on their teams and that they can't do on their own.
That's something that ... If you are willing to walk away and say look "If you're going to diss us and diss us and diss us, we're going to walk away from your product and you will wind up having to do the stuff that you hate," then maybe you'll earn a little bit of respect.
They shouldn't hate it. The best software engineers that I've ever worked with ... Sorry Adam I said it, super quick. The best software engineers I've ever worked with have also been the best at Ops. They know it, they value it, it makes them super heroes when it comes to actually getting shit done. If you signal that these are skill sets that are to be sought after the rest comes.
I think when you think of it like the very high levels of cultural aspects of like "We run this company and the company doesn't, the organization the company itself doesn't value operations," I've seen two things work.
If you are inside that organization and you don't have management authority, so if you can't just use fiat, you should consider finding a new job. Basically your boss, your leverage over your management structure and your corporate culture is so low that you should bail and hopefully the brain drain will cause that organization to realize they have a problem and then maybe they can fix it.
If you're outside the organization, so if you have leverage, one of the things I have helped people do, another example was there was a place that had a very strong operations team that was very not DevOpsy, but Ops ruled the day. Operations ruled everything with an iron fist. Took a year to get a change into production because they were so perfect and they were so clenched. Going to people and simply asking them to be amazing is actually a very valuable skill. Just showing up and saying to people "Hey we could be great if we did this. If we took this one thing and you owned it."
I had that conversation with the guy who ran this Ops team and I was like "Man you are the problem. You have to change if this is going to work because you're the only one who can save these people. If you keep being who you are you will win and they will lose and the whole company will lose and eventually you'll die." There you go.
Perfect. I guess on that note, we've come to our conclusion of the hour. We are going to be wrapping up the call. First of all, thank you for joining the discussion today. We had so many fantastic user-generated questions roll in today. We can't get through them all today, but the ones we got through we had a great discussion on. Special thanks to O'Reilly Media for co-sponsoring this event and helping bring together this amazing panel. Of course huge thanks to Adam, Liz, and Charity for sharing their time and their experience with us.
Ask me questions on Twitter. As Liz just said, ask questions, happy to do follow-ups anytime.
Absolutely. As mentioned before we'll be sending out the recording and transcripts shortly so you'll be able to see that. Look for that in your emails. Hope everyone enjoyed the "Ask Me Anything" session. Have a great rest of the day, rest of the week, rest of the month, rest of the year. Thank you so much.