Some companies that offer services take a their-way-or-the-highway approach. Google, however, expects people to adapt the tech company’s suggestions and best practices for their specific context. This is how things are done at Google, but it may not work in your environment.
Today, we’re talking to Liz Fong-Jones, a Senior Staff Site Reliability Engineer (SRE) at Google. Liz works on the Google Cloud Customer Reliability Engineering (CRE) team and enjoys helping people adapt reliability practices in a way that makes sense for their companies.
Some of the highlights of the show include:
Liz figures out an appropriate level of reliability for a service and how a service is engineered to meet that target
Staff SRE involves implementation, and then identifying and solving problems
Google’s CRE team makes sure Google Cloud customers can build seamless services on the Google Cloud Platform (GCP)
Service Level Objectives (SLOs), error budgets, service level indicators, and key metrics for resolving issues when technology fails
Learn from failures through incident reports and shared post-mortems; be transparent with customers and yourself
GCP: Is it part of Google or not? It’s not a division between old and new.
Perceptions and misunderstandings of how Google does things and how it’s a different environment
Google’s efforts toward customer service and responsiveness to needs
Migrating between different Cloud providers vs. higher level services
How to use Cloud machine learning-based products
GCP needs to focus on usability to maintain a phase of growth
Offer sensible APIs; turn up, turn down, and update in a programmatic fashion
Promotion vs. Different Job: When you’ve learned as much as you can, look for another team to teach something new
What is Cloud and what isn’t? Cloud deployments require SRE to be successful but SREs can work on systems that do not necessarily run in the Cloud.
Full Episode Transcript:
Corey: This week’s episode of Screaming In The Cloud is sponsored by ReactiveOps—solving the world’s problems by pouring Kubernetes on them. If you’re interested in working for a company that’s fully remote and is staffed by clued-in people, or you have challenges handling Kubernetes in your environment because it’s new, different, and frankly, not in your company’s core competency, then reach out to ReactiveOps at reactiveops.com.
Welcome to Screaming In The Cloud, I am Corey Quinn. Joining me today is Liz Fong-Jones who’s a Senior Staff Site Reliability Engineer at Google who works on the Google Cloud Customer Reliability Engineering team. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights. Welcome to the show, Liz.
Liz: Hi, Corey. It’s great to be here.
Corey: Thank you for joining me. Let’s start at the beginning: what is a Staff Site Reliability Engineer?
Liz: Corey, I think you have to break that down into two pieces; let’s talk about the SRE part first and then we’ll talk about what the staff part means. I’m a Site Reliability Engineer; we are the specialists in figuring out what’s an appropriate level of reliability for a service and how to make sure it is engineered to meet that target. We don’t target 100% availability; we target a reasonable level of availability that meets our customers’ requirements.
This means we write software, mostly, because it turns out that you can’t really run a highly available service by doing a bunch of manual work. We also make sure that when things go bump in the night, we learn from them; we are the people who, to some degree, coordinate incident response and then figure out what we need to do proactively next time. That’s the SRE role in a nutshell.
In terms of what it means to be a Staff SRE, our career progression roughly goes: you start off and we hand you a project and say, “Here’s the design doc, please go implement this.” Eventually we say, “Here’s the problem, we know you can figure this out. Please write the design doc and solve it.” Eventually we ask you to figure out which problems are useful to solve over the next year, what the team should be focusing on over the next year.
It’s the place in my career where I’m at right now: as a Staff Site Reliability Engineer, I work with many different teams to coordinate roadmaps and figure out what we need to work on together. The thing that makes me a Staff Engineer rather than a Senior Engineer is that aspect of looking outside of my team.
Corey: I would take it a step further. Having seen some of the conference talks you’ve given, something that has always been very distinctive about how you frame things is the way that you set context: this is how we do things at Google, this may not work in your environment. That’s a theme I don’t see emerging very often among speakers from large, well-respected tech companies.
Liz: That’s totally a response, I think, to a lot of criticism that I’ve seen around companies saying, “Do it our way or the highway.” I think [...] really well that it’s a matter of context, that you have to make sure that people are adapting your suggestions and your best practices for their specific context. That’s a thing that makes me really excited about my current role: being able to focus on helping people adapt reliability practices in a way that makes sense for their companies.
Corey: Taking it a step further, you mentioned that you were on the Google Cloud Customer Reliability Engineering team, what is that?
Liz: The CRE team, or Customer Reliability Engineering team, focuses on making sure that Google Cloud customers are able to build services that run seamlessly on top of the GCP platform. You can do a lift and shift onto the Cloud Platform, but that’s not really the thing that’s going to give you the full benefits. You have to look beyond that and figure out: how am I designing this? How am I architecting this? Am I integrating my operations with my cloud platform provider’s operations?
That’s the thing that we think about a lot: how do we make sure that all of our major customers have SLOs, service level objectives? That they have error budgets? That their service level indicators and key metrics are available to people in GCP, and that our key metrics are exposed to them for their usage of the platform? That improves everyone’s time to resolve issues, so that when there is an outage, it maybe stings less, and we’re setting an expectation: this is what our service level objective is, you can expect us to deliver that, but you cannot expect us to deliver 100% reliability because it would be expensive.
That’s what our team does, is we act as a conduit between customers and Google to make sure that we’re integrating and operating efficiently and that we are using best practices.
Corey: One thing that I guess is common to SREs across the entire spectrum is that this stuff is very complex and outages invariably happen: technology fails, human beings fail, software has bugs in it. It’s one of those areas where no one remembers all of the times that you were up, but that one time a thing fell over, depending on what it was, you’ll hear about years later; it almost becomes the narrative that defines it.
There were a couple of notable outages in the last couple of years for AWS and for GCP, and you see that people still tend to bring those up. How do you transition away from the narrative of “we keep things up and running, except that one time we didn’t” and turn it into a story that culturally drives the narrative: outages happen, we don’t yell at people for them, we improve so that those outages don’t recur and the systems become more robust with time?
Liz: I think there are a couple of angles that you can take on that, one of which is incident reports and shared post-mortems. The idea is that when you have a large failure, or even a small failure, you talk to the people who were affected and you go over, “This is what went wrong on our side, this is what we’re doing to make sure that this can’t happen again.” Furthermore, we talk more concretely about what the impact is on your error budget.
If your error budget says you can be down for two hours per year and we eat 30 minutes of that error budget, we talk about it: what are we going to do to be more conservative with the other hour and a half remaining?
I think that the other piece of doing shared post-mortems is that it enables you to say, “Yes, there was an outage, but these are some strategies that would’ve mitigated the impact for you earlier,” and then work together on figuring out how to implement them. It’s a mechanism of turning incidents from “oh my God, everything exploded” into “this is a learning opportunity, this is what we’re going to learn from it, this is what we did learn from it.”
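The error-budget arithmetic in this exchange can be sketched in a few lines of Python (an illustrative calculation only, not a Google tool; the 99.977% SLO figure is an assumption chosen so the yearly budget comes out near two hours):

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for a given SLO over a window."""
    return window * (1 - slo)

def remaining_budget(budget: timedelta, spent: timedelta) -> timedelta:
    """How much downtime you can still afford in the window."""
    return budget - spent

# An SLO of 99.977% over a year allows roughly two hours of downtime.
yearly = error_budget(0.99977, timedelta(days=365))

# Eating 30 minutes of a two-hour budget leaves an hour and a half.
left = remaining_budget(timedelta(hours=2), timedelta(minutes=30))
print(left)  # 1:30:00
```

Real error-budget policies also specify the measurement window (rolling versus calendar) and what happens when the budget is exhausted, which this sketch omits.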
Corey: One thing that tends to stand out about Google is that they’re very open with the learnings that come out of outages and various crises. As they solve these global, world-spanning problems, they publish white papers; they talk more about how they’re thinking about these problems and what they’re doing to address them than most other companies do.
In many cases, you’ll see an outage that makes headlines and the company will release a very thin root cause analysis, or post-mortem, or whatever term we’re using this week, that turns into a narrative of “There was a problem, we fixed it,” and it doesn’t go deeper than that. Has that been part of Google’s culture forever, or was there something that drove that, that led to an awakening?
Liz: I think that there are two aspects to that, one of which is that we, as well as many other companies, do produce very robust internal post-mortems. I think that’s a prerequisite to having that level of openness with your customers, is to be open with yourself first.
But as far as explaining what’s going on under the hood to customers: if you have a system that is really scary for customers to understand, for instance when GCP came out with Google App Engine, even before it was called GCP, or when we came out with Cloud Spanner, it’s really hard for people to get a real sense of what the risks involved are. How am I going to manage this? How is this engineered? How can I be confident in how it’s going to work? Why aren’t the failure patterns I might expect there, or how have the failure patterns I have seen been remediated?
I think having so much technology that didn’t necessarily have a large number of parallels at the time it was released tends to motivate us to be a lot more transparent than we otherwise would be.
Corey: To that end, this, I guess, could break down in either a technical direction or a culture direction, and I’m thrilled to explore both. But GCP is perceived as being part of Google but not part of Google, at least from those of us outside trying to read the tea leaves. For example, search, to my knowledge, does not run on top of GCP, from the technical side of it. From the culture side, I’m wondering how embedded the GCP teams are compared to the rest of “Google proper.”
Liz: I think that in terms of the technical details, Google is a very large user of GCP, with a large number of very mission-critical corporate applications running on top of it. Things like our company directory: you can say they’re not very serious, except that they are, in the sense that these are applications of the form that many customers outside of Google want to bring to GCP. Things like our financial systems, things like our internal meme generator, all the way up to eventually being able to run engineers’ workstations on GCP.
Most customers aren’t coming to GCP and saying, “I want to build a world-spanning web search engine on top of GCP.” I think the set of applications that we’ve chosen to run on top of GCP that are developed by Google represent user workloads fairly well. Some applications don’t make financial sense to run on top of GCP because virtualization would impose extra overhead; the security requirements and the performance requirements that we have mean it makes sense not to impose that extra overhead. It’s a new-development-versus-old-development type of thing, as well as thinking about the requirements of the application.
I think on the culture front, GCP has been built by the very same teams who developed the underlying original Google infrastructure and it’s run by the same SRE teams. One SRE team will be responsible both for running the BLOB storage system and the Google Cloud storage system, that’s one team, that’s not multiple teams. It’s not really a division between what you’re describing as old Google and new Google.
We use the lessons that we learned from having operated almost a legacy service with an enormously complicated API and we say, “How much of that do customers really need? Why don’t we simplify it, because we’re not constrained by having to haul around a 15-year-old API.” I think that is really the crux of GCP development: it’s not a division between old and new, it’s instead the same people who developed the old developing the new and bringing all the lessons that they’ve learned from it.
There definitely is a distinction between building technical infrastructure at Google, whether it be GCP or not GCP, and building products like Google Web Search or ads; that’s definitely true. There’s a difference between being someone who develops technical infrastructure and someone who uses technical infrastructure.
But even so, there are definitely blurry lines. For instance, a lot of the work that the ads SRE teams have done has been building platforms that make it possible for individual ads development teams to implement their business logic on top of a framework that’s going to work reliably for them. I think that’s an area in which someone can come to a GCP team and feel very comfortable: this idea that you’re building infrastructure.
Corey: One theme that tends to emerge is that you’ll see people in relatively small companies that are getting off the ground talking about how Google does things and how it’s a very different environment there, often in a disparaging way: “I was talking to my friend at Google, they spin things up with one command line and you have an entire environment, why can’t we do that?” Without understanding that two decades of very intelligent engineers working to build out infrastructure tooling, to the point where it is push-button-receive-cluster, is a [...] investment for a company to make.
Most companies are not going to make that leap. That’s been something that has eluded people’s understanding for a long time. That said, it does feel like GCP is aiming at solving for that problem. You effectively get Google Cloud’s infrastructure billed by the hour or second, depending on how you wanna slice that. Is that a fair assessment?
Liz: I think that’s a totally reasonable assessment to make, is that having a lot of these developer productivity tools available for the first time in GCP means that companies don’t have to reinvent the wheel every time, that they can instead make use of our investment in that technology.
Corey: To that end, what do you wish that people understood better about GCP out here in the wilds that are not Google?
Liz: I think that the main thing I wish people understood better about GCP is the notion that we want to have not just a vendor-customer relationship but instead a partnership with large customers. I think that’s a situation where people say, “I just want to compare on price, I just want to compare on features.”
I think that there is a difference between buying interchangeable widgets and actually working together on building a shared system that incorporates the best of Google’s technology and lets you innovate on top of that. I think that’s one misunderstanding that I see people having when they’re looking at their cloud migration strategy.
Corey: Getting there, I guess even from a customer service perspective, has been a somewhat interesting route. Historically, Google was very focused on not having a customer service department; the system should just work. Staffing a call center back in the early days of a search engine wasn’t an area in which the company was prepared to invest.
Now that you’re running companies’ production infrastructure at a very large scale for a wide variety of clients, that requires a level of engagement with those enterprise customers that looks a lot like the traditional model you’ve seen with Microsoft, Oracle, etc. for the past many decades. What has that transition been like as, I guess, Google has woken up to the idea that “a frequently asked questions list probably isn’t going to cut it when people are dropping tens of millions of dollars a year on this”?
Liz: I think that that’s something that Diane Greene has been super sharp about, that she has recognized that challenge and better positioned Google to be responsive to the needs of large customers.
Corey: In the past year, even from my perspective dealing with customers that are both in GCP and AWS, I’ve seen a marked improvement in that respect. It’s definitely something that is evolving rapidly. The challenge in any corporate reputational story is that it takes time to make a change, but far longer for the reputation of the way things used to be to fade; it’s sort of the curse of success. When you’re a household name, people form opinions and don’t change them even in the light of new information.
Liz: That’s definitely a mindset and [...] issue that we are hoping to address by talking about what we are doing and how it impacts developers. That’s why I really like working so much with our developer advocacy team in terms of getting those kinds of messages out there: “Here’s what’s going on, here are some reasons why you should look and see whether GCP makes sense for you. If it doesn’t make sense for you, we’ll be the first people to tell you that as well.”
Corey: To that end, something that a lot of companies like to talk about is remaining provider agnostic, where they could, in theory, pick up their thing, whatever it looks like, from AWS and move it to GCP, or from GCP into this rickety ancient data center that’s falling to pieces, or wherever they wanna move things. I understand wanting that security blanket.
As a counterpoint, you’re offering some very differentiated higher-level things such as Google’s Cloud Spanner, a world-spanning, ACID-compliant database that effectively lets you treat it like any other SQL database except it’s multi-region; you can write to it and read from it from anywhere on the planet. Technically, this is amazing. From a business perspective, rolling out an application built around something like this is, in some cases, considered a non-starter because it doesn’t seem like there’s another option.
What if Google decides they wanna turn all of GCP off, and/or burn themselves to the ground, and/or just go out of the business and sell hats or something? I don’t see those things happening, but people at least want a theoretical exodus story. How do you find that the fear of lock-in, even if unrealized, competes with the ability to, at least in theory, be cloud agnostic?
Liz: In some ways, it’s a matter of choice. It’s a business decision that companies can make: do you want to deal with the operability headache of keeping all of your services on raw VMs and being able to migrate those VM-based workloads between different cloud providers that all offer VMs, or do you want to use higher-level services?
I think that there is a tremendous amount of interest in even making some of those higher level services available across cloud. If you look at what’s going on with Kubernetes right now, it’s a huge situation where GCP offers Google Kubernetes engine obviously but there are many other cloud providers that also offer Kubernetes based services and that’s an opportunity to do something that is differentiated but is also something that people can choose to migrate if they choose to.
Another example that I’d offer there is Cloud Bigtable; I used to work on Cloud Bigtable before my current team. One of the selling points of Cloud Bigtable is that you can operate your service against a regular [...] backend that you maintain yourself, or you can choose to run that workload against Cloud Bigtable, and it all uses the same API. You basically just have to compile in the stub for [...] Cloud Bigtable and you’re done.
I think that’s definitely the best of both worlds: everyone is using the common standard, but they may have different implementations on the backend. In a way, given that Cloud Spanner is very SQL-like, if you are willing to forego some of the technical benefits, you could go and use a different SQL-like database if you really wanted to. That might be less reliable or less performant.
I think there’s also the angle that if you really do care about sticking to the common denominators, you can choose to use Cloud SQL instead, or you can choose to run MySQL databases on raw VMs if you happen to have that [...] of masochism. There’s a variety of different options, and there are engineering tradeoffs that people have to choose between.
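The “same API, swappable backend” pattern described here can be sketched roughly as follows. All class and method names are hypothetical stand-ins for illustration, not the actual Bigtable client API:

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

class WideColumnStore(ABC):
    """Hypothetical minimal wide-column interface both backends implement."""
    @abstractmethod
    def put(self, row: str, column: str, value: bytes) -> None: ...
    @abstractmethod
    def get(self, row: str, column: str) -> Optional[bytes]: ...

class SelfManagedStore(WideColumnStore):
    """Stand-in for a backend you run and maintain yourself."""
    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], bytes] = {}
    def put(self, row: str, column: str, value: bytes) -> None:
        self._data[(row, column)] = value
    def get(self, row: str, column: str) -> Optional[bytes]:
        return self._data.get((row, column))

class ManagedCloudStore(SelfManagedStore):
    """Stand-in for the managed service; a real application would compile in
    the provider's client stub here instead of this in-memory fake."""

def record_last_login(store: WideColumnStore, user: str, when: bytes) -> Optional[bytes]:
    # Application code depends only on the interface, so switching backends
    # is a compile/configuration change rather than a rewrite.
    store.put(user, "events:last_login", when)
    return store.get(user, "events:last_login")
```

Either backend can be passed to `record_last_login`; the calling code is identical, which is the portability property being described.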
Corey: In a similar vein, if you are to take a look at all of the different offerings that GCP has, what’s one that you think is underappreciated in the larger community that you wish more people knew about?
Liz: I think that one of the biggest opportunities people have, and that they don’t really fully understand how to use, is the various cloud machine learning based products. Machine learning is a giant buzzword, but I really think that over the coming couple of years we’re going to see more people able to use cloud machine learning in a way that makes business sense for them, and in a much easier way, rather than feeling like, “Oh my God, I have to go through all of this training in order to learn how to use it.” I think that’s one of the places that’s going to grow fairly rapidly.
Corey: Something that I’m seeing in the machine learning space is that people are less concerned with the how, and looking less for technical enhancements in machine learning; they’re still stuck on the why. I struggled with this for a little while myself: I love the idea of machine learning, but when I look at my life and what I do, I see very few areas in which I can apply it.
Before you get me excited about enhancements in speed, capability, and what it costs to train models and run the stuff itself, I first need to understand how it applies to my life. Maybe this is a limitation of my own lack of imagination, but I struggle to identify machine learning use cases until they’re explicitly pointed out to me. Is this uncommon? Am I just dense, or is this something that tends to be more industry-wide?
Liz: I think that as people who build software and who think about reliability and cost, we have a tendency to avoid things that are new and scary, that we don’t understand. I definitely could have counted myself in that camp a year or two ago, saying, “Why should we use machine learning on alerts? That means if it breaks, we’re not going to understand how to debug it.”
If you’re not building consumer-facing products, it’s a lot harder to see the benefits of ML and a lot easier to appreciate the risks of it. Whereas for people who are trying to do consumer-facing things, like object recognition, transcribing speech to text, transcribing written words into text, or doing machine translation, these are all things that are powered by machine learning.
In fact, they’re offered as prepackaged solutions rather than you must train your own model. I think that’s an area that we overlook a lot as people that don’t necessarily think about the consumer facing products quite as much.
Corey: As with so many other things, it feels like it’s an area that is rapidly evolving, and we’re going to start seeing improvements in that space relatively soon. Speaking of which, what’s something that you see that GCP itself could stand to improve upon?
Liz: I think that it is always a challenge to onboard people. There have been a lot of improvements, but still, focusing on usability is something that GCP really needs to get better at in order to maintain its pace of growth. People who are experimenting with GCP and decide to adopt it matter just as much as the people saying, “I’m putting out a request for proposals to the top three cloud providers for $100 million.” Those are both cases that we need to pay attention to. I think the investment in doing that high-touch cloud sales and support work also has to be accompanied by asking what we are doing for the next generation of developers.
Corey: I will say, as someone who first picked up the GCP control panel for a project a couple of months back, I was very pleasantly surprised. At first I thought it was gonna go the opposite direction, where I did a quick project and then I was done, and now came the fun prospect of hunting down all of the services that I’d spun up and making sure they’re turned off so I don’t wind up with a surprise bill three months later.
In Amazon world, that takes the better part of a day. In GCP it was: click on the expansion thing next to the particular project and terminate all billing resources. It pops up a scary warning that this will turn things off, are you okay with that, which in this case I was. I clicked it, and there was no step two. That was an eye-opening moment for me.
Liz: I think the set of features that are offered are very robust and powerful. I think there’s a discoverability problem, where if I look at the GCP control panels, I view them a little bit like I’m sitting in the cockpit of a space shuttle; there are so many different options. I think that’s the area where I wish there were a little bit more effort paid.
Corey: The first time I set something up in a cloud environment, I admit it, I’m like all of the things I make fun of in some of my own talks. I click through the console, I spin a thing up, and we’re good. In Amazon land, how do I convert that into code? “Good luck, idiot,” is the effective answer they give. With GCP, it spits it out: “Here’s a curl command that does exactly what you’d wanna do.” It’s easily understood, it breaks down the API calls, and I can shove that into Terraform, I can put it in a script, I can curl-bash it if I’d like to live very dangerously.
It lends itself to rapid and effective automation, rather than spinning something up, then having to retrofit all of the code to it, and then tearing it down and hoping I got everything right, or I get to explore this whole area again. That was transformative the first time I saw it. I couldn’t believe what I was seeing. I very quickly moved on to “why isn’t everything like this, this is wonderful.”
Liz: I think that’s, in large part, influenced by how we’ve done deployments with internal Google technology for years and years: the idea that, yes, you have to be able to offer sensible APIs and do turn-up, turn-down, and updates in a programmatic fashion.
Corey: Let’s talk a little bit about you rather than GCP for a few minutes. You’ve been at Google a decent amount of time, how many years now?
Liz: It’s been about 10 years now.
Corey: That is forever in cloud space. During that time, you went from being an individual contributor to managing a team, and now you’re an individual contributor again. Let’s talk a little bit about that. In many companies, that would be considered a demotion. Google is one of the few companies that’s very explicit about having a technical ladder that is distinct from the management ladder. Going between ladders in one direction or the other is absolutely not a promotion or a demotion, it’s a different job.
Liz: Absolutely. I think one additional thing is that you can have direct reports as someone who’s on the individual contributor ladder. The difference is really where you’re focusing your time and how many reports you have. I’ve been on, I think, eight teams now in 10 years at Google. It doesn’t feel that long because I only spend a year to two years in each place.
The thing that I find that I do is when I feel like I’ve learned as much as I can out of one team, I’ll go and look for another team that’s going to stretch me in some dimension or teach me something new. That’s how I came to the decision to become a manager for a few years, was that I really wanted to get some experience with helping people’s career development rather than purely focusing on technology.
I think that even once I stop being a manager, that inner voice doesn’t turn off. It’s a skill that you acquire that you can hold onto and use in varying ways even if you’re not officially someone’s manager. I think that everyone should give being a manager a shot at least once if that’s something that you’re interested in because it teaches you a lot, it helps you better understand your company and helps you better understand how you’re going to interact with people.
For me personally, having this opportunity to try being a manager and then discovering that I didn’t actually want, in the long term, to have my career growth be tied to how big a scope of people I managed and was responsible for, but that instead I wanted to work on cross-cutting projects between multiple working groups; that was really a useful thing for me to learn and then go and pivot on.
Corey: It’s nice to see companies being supportive of that. In many environments, making the transition you just described would have entailed at least three different companies. Is it fair to say that Google is almost like a bunch of companies tied together under one umbrella, even down at the relatively granular organization level, or is this more a story of Google being very supportive of people’s needs as they grow?
Liz: As a manager, you are taught at Google to look out for the best interests of your reports, even if it means that they may wind up leaving your team or going onto another job ladder; it’s your job to support people in developing their careers. I think that mindset and perspective, as opposed to “I’m going to keep this person on my team because they’re doing productive work on my team,” is a huge difference from a lot of companies.
As far as our culture, we have a culture that is fairly uniform between different teams. We have a set of engineering tools that are fairly uniform between teams. As a result, sure, it may take you six months or even a year to become fully productive as an engineer at Google. Once you have that base set of skills, you can take that to any team and be up and running and doing useful stuff within a few weeks.
I think that’s the dichotomy of having teams that are doing a variety of different kinds of work and working on a variety of different problems, but all sharing that same cultural basis and that same technical basis. I think that’s one of the magical things about Google.
Corey: At this point, would it be fair to say that you’d recommend working at Google to someone who is on the fence about it?
Liz: I think that Google is a company that is very self-aware: it knows the things that it does well, and it also has some areas in which it faces challenges that are not unique to Google, things that we’re talking about, having public conversations within the company, and even some external to the company, about what it means to have a [...] of inclusion.
I think it can be scary sometimes on the outside looking in, saying, “Oh my goodness, I’m not sure if I want to work at Google because I’ve seen a bunch of awful stuff in the news about Google.” But I think that on balance, Google is a place where you have a lot of opportunities to do impactful things, not just technically impactful things but things that are culturally impactful for all of information technology.
That was a long way of saying, “Yes, I would recommend Google as a place to work.” But I do think those are useful things to think about: what are you looking for in an opportunity? What are your interests? Do they match with what you would be doing at Google? That’s also an area where you should talk carefully with whoever the hiring manager is and make sure this team is the right fit for you. If it’s not, you can say no, and then [...] will find some other team for you to look at.
Corey: Is Google still in a hiring place where every person they bring aboard, whether they’re there to clean whiteboards or do accounting, still gets put through a CS 101 algorithms test?
Liz: The hiring mechanisms for software engineers test a mixture of: can you write code? Can you practically apply lessons from computer science? And can you do systems design? Those three things are tested during the interview process. For Site Reliability Engineers, we don’t necessarily mandate that people have previous computer science knowledge, because it’s been advantageous to hire people who are systems engineers, people who have real-world practical experience with how systems break and how we can engineer systems better.
For those people who don’t have a computer science background, we tend to focus interviews much more on troubleshooting: in a real situation, what would you do to make sure the impact on customers was as low as possible, to root-cause the bug and bisect the problem? Or we focus much more on your systems design skills, or on whether you understand at least some area of the Linux stack or of the [...] system stack well enough to practically describe it to someone who’s interviewing you.
I think the answer is yes: if you are interviewing for a software engineering position, you will probably be asked to do whiteboard coding, and you will probably be asked questions that rely on some degree of ability to pick the right data structure and the right algorithm. But I think there is range and flexibility, at least as far as SRE is concerned.
Corey: Thank you, Liz. One more question for you before we start wrapping up and calling it a show, is there anything you’re working on that you wanna mention or tell our listeners about?
Liz: I want to point people to two resources. The first is on the Google Cloud Platform blog: there is a set of posts made by my team, the CRE (Customer Reliability Engineering) team. They’re all called CRE Life Lessons; we’ll put a link to those in the show notes. Secondly, I have a project with Seth Vargo, a developer advocate at Google, who is working with me on a set of YouTube videos that explain in five-minute chunks what SRE is, what the key principles of SRE are, and how you can apply them. I’m really excited about this project and we’ll put a link to that in the show notes as well.
Corey: A follow up question for you on that, do you see that cloud and SRE are intrinsically linked? Can you have one without the other or is it more or less two completely separate concepts smashed together in the form of one person, you, come to life?
Liz: I think that SRE doesn’t specifically require that you do it in the cloud. The key things about SRE are, number one, do you have service level objectives and error budgets? Number two, do you have limits on the amount of operational work that you’re doing, in order to preserve your ability to do project work? As long as you’re doing those things, I’d posit that what you’re doing is SRE. Neither of those two things specifically mandates any kind of cloud deployment.
However, if you are trying to run a cloud service at scale and you’re not adopting something in the SRE or DevOps methodology space, you’re going to really struggle to operate your service. It’s going to result in having to hire a bunch of people to do manual operational work, because you’re not setting targets for your reliability, you’re not setting limits on how much operational load your systems can generate, and you’re not engineering that work away. I think that cloud deployments require SRE to be successful, but SREs can, and have, worked on systems that are not necessarily running in the cloud.
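[Editor’s note: the error-budget idea Liz describes, an SLO implies a fixed allowance of failures, and you track how much of that allowance you’ve spent, can be sketched in a few lines. This is a minimal illustration with hypothetical names, not any real Google API.]

```python
# Minimal sketch of an error budget: an SLO target (e.g. 99.9% success)
# over N requests allows (1 - target) * N failures. The budget is the
# fraction of that allowance still unspent.

def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction (0.0-1.0) of the error budget still unspent."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO leaves no budget at all
    spent = failed_requests / allowed_failures
    return max(0.0, 1.0 - spent)

# With a 99.9% SLO over 1,000,000 requests, 1,000 failures are allowed;
# 250 failures spends a quarter of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of error budget remaining")
```

In practice, teams gate risky work on this number: if the budget is spent, new launches pause until reliability recovers.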
Corey: To go deeper on that one turns very much into the question of what is cloud and what isn’t. Down that path lies madness.
Corey: Thank you for joining me, Liz. I’m Corey Quinn, this is Screaming In The Cloud.