Trying to convince a company to embrace the theory and idea of Chaos Engineering is an uphill battle. When a site keeps breaking, Gremlin’s plan involves breaking things intentionally. How do you introduce chaos as a step toward making things better?
Today, we’re talking to Ho Ming Li, lead solutions architect at Gremlin. He takes a strategic approach to deliver holistic solutions, often diving into the intersection of people, process, business, and technology. His goal is to enable everyone to build more resilient software by means of Chaos Engineering practices.
Some of the highlights of the show include:
Ho Ming Li previously worked as a technical account manager (TAM) at Amazon Web Services (AWS) to offer guidance on architectural/operational best practices
Difference between the Solutions Architect and TAM roles at AWS, and the transition between them
Role of TAM as the voice and face of AWS for customers
Ultimate goal is to bring services back up and make sure customers are happy
Amazon Leadership Principles: Mutually beneficial to have the customer get what they want, be happy with the service, and achieve success with the customer
Chaos Engineering isn’t about breaking things to prove a point
Chaos Engineering takes a scientific approach
Other than during carefully staged DR exercises, DR plans usually don’t work
Availability Theater: A passive data center is not enough; exercise DR plan
Chaos Engineering is bringing it down to a level where you exercise it regularly to build resiliency
Start small when dealing with availability
Chaos Engineering is a journey of verifying, validating, and catching surprises in a safe environment
Get started with Chaos Engineering by asking: What could go wrong?
Embrace failure and prepare for it; business process resilience
Gremlin’s GameDays and Chaos Conf allow people to share experiences
Full Episode Transcript:
Corey: This week’s episode of Screaming In The Cloud is generously sponsored by DigitalOcean. I’m going to argue that every cloud platform out there biases for different things. Some bias for having every feature you could possibly want offered as an added service at varying degrees of maturity. Others bias for, “Hey, we heard there’s some money to be made in the cloud space. Can you give us some of it?”
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they’re using it for various things, and they all said more or less the same thing: other offerings have a bunch of shenanigans around root access and IP addresses; DigitalOcean makes it all simple. “In 60 seconds, you have root access to a Linux box with an IP.” That’s a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed-price offerings. You always know what you’re going to wind up paying this month, so you don’t wind up having a minor heart issue when the bill comes in. Their services are also understandable without spending three months going to cloud school. You don’t have to worry about going very deep to understand what you’re doing. It’s click a button or make an API call, and you receive a cloud resource. They also include very understandable monitoring and alerting.
Lastly, they’re not exactly what I would call small-time. Over 150,000 businesses are using them today. Go ahead and give them a try. Visit do.co/screaming and they’ll give you a free $100 credit to try it out. That’s do.co/screaming. Thanks again to DigitalOcean for their support of Screaming In The Cloud.
Hello and welcome to Screaming In The Cloud. I’m Corey Quinn. This week, I am on location at the Gremlin offices in San Francisco. I’m joined by Ho Ming Li. We’ll talk about Gremlin in a minute. First, I want to talk a little bit about what you’ve been doing historically. But before that, welcome to the show.
Ho: Sure, absolutely. Thank you for having me. Happy to be here on the show.
Corey: Nope. Always great to wind up talking with new and interesting people. Before you went to Gremlin, you were a TAM at AWS, or Technical Account Manager for people who don’t eat, sleep, and breathe cloud computing acronyms.
Corey: Without breaking any agreements or saying anything that’s going to cause AWS to come after me with knives in the night again, what was it like to work inside of AWS?
Ho: I actually want to step back a little bit further. When I joined AWS, this was probably four years ago, I joined as a Solutions Architect over there. I am one of the only, if not the only person that actually went from the Solutions Architect role into the Technical Account Manager role.
Corey: Interesting. Out of curiosity, what is the difference between a Solutions Architect and a Technical Account Manager?
Ho: I get that a lot, for sure. From a technical perspective, both roles are very technical. The expectation is that you’re technical and able to help the customers. I would say as a Solutions Architect, you work a bit more with the architecture, whereas in the Technical Account Management role, you’re a lot more on the operations side. You’re a lot more on the ground.
Corey: When you say that you transitioned from being a Solutions Architect into a Technical Account Manager, that was a rare transition. Is that because transitions themselves are rare or is that particular direction of transfer the rarity?
Ho: That’s a great question. It’s the particular direction of transition. There is a tendency for people to start in a support role, and Technical Account Manager has an enterprise support element to it. As Technical Account Managers, instead of being reactive on a lot of support cases, you’re proactively thinking about how to make customers’ operations a lot smoother and the best practices around that. I’ll give you an example. Every year, there are usually very critical events for a lot of businesses. If you think of Amazon, the natural things to think of are Black Friday and Cyber Monday, and a lot of these Prime Days as well recently.
Corey: Oh absolutely. Happy Prime Day belatedly. Continue.
Ho: Exactly. For these events, there’s actually a lot of planning, a lot of thinking ahead. As Technical Account Managers, we work with the customers to make sure that they are ready for the event. Technical Account Manager is an interesting role in that you sit between the customers and the AWS service teams. In that sense, you’re helping the service team educate the customer, to let them know how to properly use a service, because a lot of people use services in unintended ways, which makes things very interesting, to say the least.
Corey: Oh, absolutely. I found that Secrets Manager for example makes a terrific database. Every time I talk to someone who knows what the database is supposed to do, they just stare at me, shaking their heads sadly and then try to take my computer away.
Ho: I have to say there are interesting misuses, though, because there are interesting use cases that the AWS service team may not have thought of. The other part is actually gathering feedback from our customers and bringing it back to the service teams so that the service teams can enhance features. It’s a really interesting role in that you’re between both the customers and the service teams.
Corey: What’s interesting to me about the TAM role historically has been how maligned it is, in that when you speak to large enterprise clients, something I will hear fairly frequently from engineers on the ground is, “Oh, the TAMs are terrible. They have no idea what they’re doing, and it’s awful.” Okay, I’ll suspend that belief. I wind up getting engaged from time to time in those environments and speaking with the TAMs myself. I come away every time very impressed by the people that I’ve been speaking with on the AWS side.
I understand that it’s a role where you are effectively the voice and face of AWS to your customers and that means that you’re effectively the person in the room who gets blamed for any slight real or imagined that AWS winds up committing. You get blamed for all of their sins. It’s a maligned role. What do you wish that people understood more about the TAM position?
Ho: I think some part of that is really just about what people hear anecdotally and what word gets out. It’s like reviews: you generally tend to see more bad reviews than good reviews. But in my experience as a TAM, I’ve actually worked with a lot of good customers, and I’m thankful and lucky to be in that position, where a lot of our customers come to us with very reasonable requests and very reasonable incident management.
Working with them to find out the root cause, it could be something on the AWS side, and it could also be something on the customer’s side. Now, I do understand where the sentiment comes from, because there are definitely certain customers that like to initially blame AWS. Sometimes when they don’t have visibility into an issue, it’s easy for them to blame AWS. That might be where some of the sentiment comes from. But for the most part, the customers I’ve worked with have been really good, in that it is usually a joint investigation to find out what’s wrong, as the ultimate goal is really to bring services back up and make sure our customers are healthy and happy going forward.
Corey: It always amuses me to talk to larger enterprise customers who are grousing about enterprise support and why they don’t want to pay for it, debating turning it off. The few times I’ve seen companies actually do that, it lasts at most a month, and then they turn it right back on, and they’re like, “Oops. That was a mistake. We’re really, really sorry about that.” Not because you need enterprise support in order to get things done, but rather that you need it in order to get visibility into problems that only really crop up at significant scale. That’s not, incidentally, a condemnation of AWS in the least. That’s the nature of dealing with large, complex platforms.
The one thing that has always surprised me about speaking with TAMs, even off the record, after I’ve poured an embarrassing number of drinks into some of them, is that I don’t ever get any of them to break character and say, “Oh, that customer is the worst!” There is a genuine bias for caring about the customer, and a level of empathy, that I don’t think I’ve ever encountered in another support person for any company. Is that just because there are electric shock collars hidden on people, or implants? Or is it something they bias for in the hiring process?
Ho: Yeah, totally. Dialing back a little bit to what you mentioned about how enterprise customers get value from enterprise support, I think there’s an element of embracing enterprise support. You have to really embrace it and work with the AWS staff to really get value out of it. The more you embrace it, the more value you get out of it.
As for why all the staff in AWS, not just TAMs, are wired to really help customers, it’s part of the Amazon Leadership Principles. I really think that’s something Amazon has gotten right in its culture: the hiring process requires people to read up on the Leadership Principles, embrace them, and really present them as you go through your hiring journey, as well as once you’re an Amazonian, which is what they call Amazon staff.
Corey: Yay! Cutesy names for employees. Every tech company has them.
Ho: Exactly. But the Leadership Principles speak really well, and one of them is customer obsession. Everyone within Amazon is customer-obsessed. At the end of the day, it should be mutually beneficial to have the customer get what they want and be very happy with the service, as well as the TAM achieving success with the customer, to make sure that everybody in the equation is happy.
Corey: Absolutely. I think that’s a very fair point. I think that’s probably enough time dwelling on the past. Let’s talk about what you’re doing now. You left AWS around the beginning of this year, and then you came here to Gremlin, which is an awesomely named company, especially once you delve a little bit into what they do, which is chaos engineering. From a naive perspective on my part, where I haven’t ever participated in chaos engineering exercises, it looks from the outside like what you do is you’ve productized, or servicized, if that’s a thing, turning off things randomly in other people’s applications.
How do you expect to find a market for this when Amazon already does this for free in US-East-1 all the time?
Ho: There’s a lot going on in US-East-1, and a lot of new shiny toys happen to be there too. US-East-1 is definitely an interesting region. That said, there are definitely a lot of misconceptions about chaos engineering. We have heard of some people who would go and break other people’s things, just break things to prove a point. As a company, we’re definitely not advocating for that. That’s not the direction that we expect our customers to take.
Really, chaos engineering is about planned, thoughtful experiments to reveal weaknesses in your systems. You look at the companies out there doing similar things, like Netflix, like Amazon themselves, and the intent really is to build resilience. With microservices, a word you hear a ton in the industry for sure, it’s very difficult to understand all the interactions and to really have a good grasp, even as an architect. I have been an architect, and I’m here as a solutions architect as well. Even as an architect, it’s impossible to know all the interactions and ensure that your systems are resilient. One good way is to be thoughtful and plan out these experiments to reveal weaknesses and then to build resilience over time.
Corey: One of the things that I appreciate when I speak to chaos engineers is that they always seem to take a much more scientific approach to this. It’s a mindset shift. Other departments call themselves site reliability engineering, but there’s very little engineering, very little scientific rigor, that goes into that; at many shops, it’s more or less an upscale version of a system administrator. Whereas every time I go in depth with chaos engineers, there’s always a discussion about the process, about the methodology that ties into it.
Let’s dig into this a little bit. What is chaos engineering? It’s easy to interpret it, I think, as just having a DR plan that’s better implemented and imagined than a dusty old binder that assumes one data center completely died but everything else is okay and just fine. What is chaos engineering?
Ho: It’s interesting you bring up DR plans. Definitely most companies have a business continuity plan. They have a DR plan so that if their primary data center fails, they would fail over to a secondary data center.
Corey: Spoiler. Other than during carefully staged DR exercises, they almost never work.
Ho: That’s exactly what I wanted to bring up. There’s actually a term recently coined by Adrian Cockcroft for this: Availability Theater. Just having it in the binder, just having a passive data center, is not enough. The first thing is to exercise it. You have to actually exercise your DR plan. If you go out and ask people, I honestly don’t know how many properly exercise their DR plan.
I can understand why so many people are reluctant to exercise their DR plan: it has a huge blast radius, it’s very dangerous to do, and it just takes a lot of planning, a lot of effort. But if you take that effort and shrink it down to a very small issue in your architecture, and ask the question, what could go wrong in this environment? It could be as simple as a network link going down. That’s much easier to do, and that’s much easier to practice.
The idea behind chaos engineering really is just bringing it down to a level where you exercise it regularly so that you can build resiliency, but it has to be thoughtful, it still has to be planned. We like to think of it as an experiment where you ask the question: in this environment, what could go wrong? And then once you start understanding what parts can go wrong, you want to experiment against them. You have a hypothesis: if I disconnect my application from the database, I’m able to pull data from cache, and I’m able to still show this information to my end users. That’s the hypothesis. As much as you think that’s going to happen, you don’t really know for sure that it will.
You want to actually test it and experience it. The key here is actually injecting the fault, actually disconnecting the link to your database, and seeing whether it is actually showing you cached data, or maybe in some cases it fails horribly, which is still a learning. Ultimately, it comes down to learning about your systems and building resiliency over time.
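The hypothesis-then-inject loop Ho describes can be sketched in a few lines. This is a minimal, self-contained illustration, not Gremlin’s tooling; the cache, the database stand-in, and all the names are hypothetical, and the “fault injection” is just a flag that simulates cutting the database link.

```python
# Illustrative stand-ins; a real experiment would target a real service.
cache = {"user:42": {"name": "Ada"}}  # last-known-good data

def fetch_from_database(key, db_available):
    """Pretend database client; raises when the link is 'cut'."""
    if not db_available:
        raise ConnectionError("database unreachable")
    return {"name": "Ada", "fresh": True}

def get_user(key, db_available=True):
    """Hypothesis under test: if the database link is cut,
    we can still serve (possibly stale) data from the cache."""
    try:
        return fetch_from_database(key, db_available)
    except ConnectionError:
        return cache.get(key)  # graceful degradation path

# Steady state: the database answers normally.
print(get_user("user:42"))
# Inject the fault: "disconnect" the database and re-check the hypothesis.
print(get_user("user:42", db_available=False))
```

The point of running it, rather than just believing the hypothesis, is the second call: either the cached record comes back, confirming the degradation path, or it fails and you have learned something in a safe environment.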
Corey: One of the aspects that appeals to me the most is that it doesn’t need to be a world-changing disaster that gets modeled. A lot of times, it’s something on the order of: you add 100 milliseconds of latency to every database call. What degradation does that cause? That’s a fascinating idea to me, just because so many DR plans I’ve seen are built on ridiculous fantasies. Assume the city lies in ruins, but magically our employees care more about their jobs than they do about their families, and they’re still valiantly trying to get to work. It never works out that way.
There’s also a general, complete disregard for network effects. For example, how many DR plans have you seen that say: if all of US-East-1 goes down, we’re just going to automatically try failing over to US-West-2 in Oregon? Somehow, we’re going to magically pretend that we are the only company in the world that has thought ahead to do this. There’s no plan put in place for things like half the internet doing exactly what you’re talking about. Perhaps provisioning calls would be particularly latent, so you may want to have instances already spun up.
What if there are weird issues that wind up clogging network links? “Oh, we want to shove a bunch of data over there. Oh, but first we can’t get to the data in its original location, and oops, there was a failure in planning.” Or it winds up being way too long since the plan was tested, and there are entire services where this was never thought about. It feels to me, on some level, like chaos engineering is in some respects an end run around the crappy history of disaster recovery and business continuity planning.
Ho: It just allows people to be a lot more granular, in my opinion. Like I said, the DR efforts still need to be there. I’m not saying that they don’t serve any purpose or don’t have a place; you should think about that. But it allows you to think at a more granular level. What happens if an availability zone goes down? It’s not always about an entire region going down.
From a chaos engineering practice perspective, we actually advocate for starting small. You want to start asking questions like, “What if one link fails? What if one host fails? What if just a small component fails? Are you able to handle that?” And then you start dialing up the blast radius and think about what happens if a wider array of things fail. What happens if an entire fleet of services goes down? Eventually, gradually, you definitely do want to get to the point of what Netflix can do: they can fail over regions. But to get there on day one is an extraordinary amount of work.
You can start small. What happens is a lot of people want to do the big thing, find it too difficult, and throw their hands up in the air.
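The “start small, then dial up the blast radius” idea can be sketched as a toy simulation. This is purely illustrative, assuming an imaginary fleet of named hosts; a real experiment would impact actual instances through tooling, with an abort and rollback plan agreed in advance.

```python
import random

# Toy fleet: each "host" is just a name.
fleet = [f"host-{i}" for i in range(10)]

def run_experiment(fleet, blast_radius):
    """Take down a random sample of hosts and report who survives.
    blast_radius is the fraction of the fleet to impact (minimum one host)."""
    n_impacted = max(1, int(len(fleet) * blast_radius))
    impacted = set(random.sample(fleet, n_impacted))
    survivors = [h for h in fleet if h not in impacted]
    return impacted, survivors

# Start with one host; widen only after the system handles each stage.
for blast_radius in (0.1, 0.3, 0.5):
    impacted, survivors = run_experiment(fleet, blast_radius)
    print(f"radius={blast_radius}: {len(impacted)} down, {len(survivors)} up")
```

The discipline is in the loop, not the code: each wider radius is attempted only after the previous, smaller one produced no surprises you couldn’t handle.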
Corey: I agree. I think there’s also a challenge where you see people who tried something, it was hard, and then they back away from it and don’t want to do it again. I’m fascinated by stories of failure in an infrastructure context. Not because I enjoy pointing out the failures of others, but because this is something we can all learn from. Only a complete jerk watches a competitor’s website struggle and fail to reload and is happy about it, because this could be you tomorrow. There’s the idea in the operations world of hug ops: even if someone works at a company that is a direct competitor of yours, you still hope they get past their outages reasonably quickly.
I don’t think that, as an industry comprised of operational professionals and engineers, we are particularly competitive in that sense. We want to have the best technology, but not at the expense of our compatriots disappointing their customers.
Ho: Yeah, absolutely. We all feel the pain, that’s definitely for sure. Whether it’s AWS, other cloud providers, or Gremlin. Our approach is to really embrace failure, and really to learn from these failures to then build resilience.
Corey: Absolutely. That’s not something that ever comes out of the box. I remember a few years back, a particular monitoring company on Twitter saw that a company was having an outage and chimed in with, “If you were using our service, you’d probably be back up by now.” They were roundly condemned by most of the internet for that. It turned out that a marketing intern had gone a little too far with “how do we wind up being relevant to what’s going on on the internet right now,” without the contextual awareness to understand what was happening.
The response was something that became very heartwarming, in a sense: we’re all in this together, and we don’t use outages as an opportunity to capitalize on the misfortune of others. Maybe that runs counter to what they teach at Harvard Business School, but fundamentally, it speaks to the empathetic world in which I prefer to exist.
Ho: There’s totally a human side to this, for sure.
Corey: At this point, it seems to me that trying to convince a company to embrace the theory and idea of chaos engineering has got to be a bit of an uphill battle in some cases. “Our problem is that our site keeps breaking and falling over when things stop working. And your genius plan is to go ahead and start breaking things intentionally. Why do we pay you again?” seems like one of the natural objections to the theory.
In fact, that was not me mocking someone else; that’s what I said the first time someone floated it past me. How do you introduce chaos as a step towards making things better and not get laughed out of the room?
Ho: I think there’s another misconception here, where a lot of people might think about chaos engineering as something where you just go in and break things on purpose. We are breaking things, but an important part of chaos engineering is the thoughtful and planned nature of it. You are thinking about the experiments, you are planning ahead of time, and you’re communicating both to upper management and to other teams about what you want to achieve. The goal is resilience. The goal is not things breaking.
Corey: The hard part, of course, is getting there. Are there some companies that are frankly too immature for chaos engineering to be a viable strategy? If a company is struggling to keep its website up, it feels like introducing failure that early in the game may not be the best path. Maybe that’s wrong.
Ho: I think it’s a journey. Chaos engineering is a journey; you’re not jumping into the deep end right away. We don’t advocate for not knowing what you’re doing and just running a bunch of things that break production. You’re not going to just say, “Let’s shut down all our servers in production and see what happens.” That’s not the intent of chaos engineering.
I like to dial it back to the thoughtful and planned approach all the time, because you are trying to do things in a very controlled manner. You almost want to know what the outcome is, and you’re just verifying it, validating it, and also catching some surprises in a safe environment.
Corey: While I do understand the answer to this question may very well be “pay Gremlin,” I’d like a slightly more nuanced, honest answer to how you get started with the idea of chaos engineering. For people who are listening to this right now and saying, “That’s fantastic! I’m going to implement that right now!”: because people are listening while driving, they might ram a bridge to see what happens. Don’t try that.
How do people get started once they safely get to the office?
Ho: Just pay Gremlin. No, okay. On a serious note, as I mentioned, I think it’s a little bit of a mindset as well as using the right tools. You can simply look at your service, look at your architecture diagram, and ask the same question: what could go wrong? Are there some hard dependencies? Are there some soft dependencies? Even if you know the ins and outs of your code and your architecture, there’s always some learning from experiencing some of these failures.
To really get started, you want to ask yourself what could go wrong. In terms of resources, we actually have a Chaos Engineering Slack that is not just about Gremlin but about the overall chaos engineering practice, and you’re welcome to join.
Corey: Oh, wonderful. I’ll throw a link to that in the show notes. One question I do have, and this might be a little on the sensitive side, and if it is, I apologize: it feels like in many cases there are some companies that like to go on stage and talk about things that they are in fact running, and they talk about the cases in which it works, but they don’t talk about edge cases and places where that doesn’t wind up applying.
Classic example: a company gets on stage and talks about how everything they do is cattle instead of pets, and they’re happy and thrilled with this. Then you go and look at their environment and, well, okay, their web tier is completely comprised of cattle, there are no pets, but their payroll is running on an AS/400 somewhere, and the databases that handle transactions are absolutely bespoke unicorns. To that end, how much of chaos engineering as implemented today in the outside world is done holistically throughout the system, versus focused on one particular area and then broadened out over time?
Ho: I like that question. That’s a pretty interesting question. To put it into perspective, even a company the size of Netflix, as you mentioned...
Corey: As I sneezed, not mentioned. But go on.
Ho: ...has different teams and different organizations, pretty widespread, using different types of technologies. The important thing about chaos engineering is to embrace failure and prepare for failure. Whether it’s Chaos Monkey shutting down hosts, or just ensuring that if my tool isn’t working, I can still use a spreadsheet to track something. The important thing with chaos engineering really is about having that failure mindset and making sure that you’re prepared for failure.
Corey: You just said something that flipped a bit in my mind: the idea of using a spreadsheet when a tool isn’t available. Is that something you talk about as an aspect of chaos engineering? Most discussions I’ve seen have been purely focused on technical failover and technical resilience; you’re talking about resilience of business process.
Ho: Absolutely. Because when we talk about GameDays, which is actually an element we haven’t discussed deeply, let me very quickly talk a little bit about GameDays.
Ho: A GameDay is a time when you can bring in different people and collaboratively run experiments to reveal weaknesses, as well as just to learn about a service or a system. What was the question?
Corey: The question is whether this is a business process or not.
Ho: As you’re doing the GameDay, your experiments are not only looking at the technical aspects, whether an application is able to handle a failure by doing retries, timing out, or some graceful degradation. You’re actually also validating your observability: whether you can see what’s going on in the system, whether you’re getting the proper alerts when certain thresholds are crossed, all the way to the point where your on-call person, when they get the alert, has enough information to take action. That’s all part of the experiments and your learning as a whole.
This is actually an interesting aspect of chaos engineering. A lot of people experience on-call by just being given a pager and told, here’s a runbook.
Corey: Here, catch.
Ho: Exactly. Go ahead. And the real learning comes from your first incident, when you’re very nervous about it and you don’t know what to do. By thoughtfully planning and executing these experiments, you’re also allowing your on-call person to get ahead and know what’s coming, so that they are calmer as they execute.
Even with runbooks, on-call engineers have made mistakes before because they’re in this very nervous state. If you calm them down through training, helping them understand what the flow looks like so that they know what to expect, I’m sure they will feel better when a real incident comes in.
Corey: Which makes an awful lot of sense. There’s also a conference coming up, if I’m not mistaken, where you talk about the joy, glory, and pain that is chaos engineering, and I’ll throw a link to that in the show notes. What are you hoping comes out of a community gathering to talk about the principles of chaos?
Ho: Yup, absolutely. There is going to be a Chaos Conf happening in September in San Francisco. We have a great speaker lineup who will talk about their experiences in chaos engineering, the failure scenarios they have run into, and how they handled those situations. Overall, a good gathering of like-minded people to talk about failures, how to embrace failure, how to prepare for it, as well as how to handle certain situations.
Corey: It sounds like it’ll be a lot of fun. I know I’m looking forward to it. Thank you once again for being so generous with your time. I definitely appreciate it. I have to say, it’s nice being here in the office and recording with you and not once during this entire session did the lights go out or a wall fall over. Living with chaos engineering does not mean that every minute is a disaster.
Ho: Thank you for having me.
Corey: Always a pleasure. This has been Ho Ming Li with Gremlin and I’m Corey Quinn. This is Screaming In The Cloud.