Are you a blogger? Engineer? Web guru? What do you do? If you ask Yan Cui that question, be prepared for several different answers.
Today, we’re talking to Yan, who is a principal engineer at DAZN. Also, he writes blog posts and is a course developer. His insightful, engaging, and understandable content resonates with various audiences. And, he’s an AWS serverless hero!
Some of the highlights of the show include:
Some people get tripped up because they don't bring the microservices practices they learned into the new world of serverless, and so face many challenges
Educate others and share your knowledge; Yan does, as an AWS hero
Chaos engineering meets serverless: figuring out what types of failures to practice for depends on what services you are using
Environment predicated on specific behaviors may mean enumerating bad things that could happen, instead of building a resilient system that works as planned
API Gateway: Confusing for users because it can do so many different things; what is the right thing to do, given a particular context, is not always clear
Serverless feels like a toy to some today, but it's good enough to run production workloads; the future of serverless: it will continue to evolve and offer more flexibility
Serverless is used to build applications; DevOps/IoT teams and enterprises are adopting serverless because it makes solutions more cost-effective
Full Episode Transcript:
Corey: This week’s episode of Screaming In The Cloud is generously sponsored by DigitalOcean. I’m going to argue that every cloud platform out there biases for different things. Some bias for having every feature you could possibly want offered as an added service at varying degrees of maturity. Others bias for, “Hey, we heard there’s some money to be made in the cloud space. Can you give us some of it?”
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things, and they all said more or less the same thing. Other offerings have a bunch of shenanigans around root access and IP addresses. DigitalOcean makes it all simple: "In 60 seconds, you have root access to a Linux box with an IP." That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed-price offerings. You always know what you're going to wind up paying this month, so you don't wind up having a minor heart issue when the bill comes in. Their services are also understandable without spending three months going to cloud school. You don't have to worry about going very deep to understand what you're doing. It's click a button or make an API call, and you receive a cloud resource. They also include very understandable monitoring and alerting.
Lastly, they’re not exactly what I would call small-time. Over 150,000 businesses are using them today. Go ahead and give them a try. Visit do.co/screaming and they’ll give you a free $100 credit to try that. That’s do.co/screaming. Thanks again to DigitalOcean for their support to Screaming In The Cloud.
Corey: Welcome to Screaming in the Cloud, I’m Corey Quinn. Today, I’m joined by Serverless Hero, Yan Cui. Welcome to the show.
Yan: Hey, Corey, good to be here.
Corey: So, you do a lot of things. You are a principal engineer over at DAZN. You wind up doing your own video course at productionreadyserverless.com, and you blog at theburningmonk.com. It feels like you're something of a kindred spirit, in that when someone asks me, "So what do you do?" I have to figure out, okay, from what perspective, because there's about 15 different possible answers to that question.
Yan: Yeah, was that a question?
Corey: Only if you want it to be. Okay, so let’s start at the very beginning, I guess, since you have so many things that are, I guess, across the board here. Who are you? You sort of burst onto the scene, at least to my awareness, about a year or so ago. You were named a Serverless Hero at the beginning of this year. You were writing an awful lot of content that I found myself, I guess, tripping over, for lack of a better term, that resonated very well with various audiences, that hit on different points, and I always came away with the same perspective of, “Wow, that’s insightful!” but we never really got to have much of a conversation about this until relatively recently when we tripped over one another at a conference.
Yan: Yeah, so I guess for me, I've been writing for quite a long time now. I find writing and blogging a really useful way to remind myself, and also to force myself to really understand something beyond the basics. It's amazing: when you force yourself to get to the point where you are able to explain something to someone else in words that are a lot easier to understand, you really force yourself to reach a level of understanding that you probably didn't think you needed at first.
I've been writing about various different things around computing and functional programming. I was really active in the F# and functional programming scene for quite a while. Serverless is just a thing that I got really, really interested in around 2016, when I joined a social network, which eventually ran out of money. But while we were there, we moved a lot of the things we were working on to serverless, and we really learned a lot about what it takes to run serverless in production. One of the challenges that I ran into, and I think it's a similar problem that a lot of people are running into now, is that it's so easy to go to production with Lambda.
Sometimes you kind of forget that all the things you learned from the microservices transition, from monolith to microservices, still very much apply, and some people who are moving straight from, say, on-premises to the cloud with Lambda are missing some of the learning from that process. A lot of people are getting tripped up because they weren't ready to think about how to bring some of the good practices we learned during the microservices era into this new world of serverless, and therefore they run into the problems of, "Okay, how do I monitor things? How do I debug this bigger system with so many different Lambda functions and both synchronous and asynchronous event sources?"
And so yeah, I find there’s a huge amount of things there to, you know, to learn and to share, with regards to serverless. I’ve become really busy just learning myself, but also trying to share as much as I can.
Corey: There's something very valuable about giving back, in the context of having learned something new, and going and telling that story and sharing it with new folks onward. To some extent, I believe that's what AWS bases the Hero program on. What was it like joining it? I mean, I've looked into it from the outside, but I've never been invited to participate in it.
Apparently, actively insulting what they do every week in a sarcastic newsletter isn't the best way to get them to invite you to do things. Who knew? But it seems to me from the outside, it's based largely upon helping educate people, helping bring people along for, I guess, the knowledge journey. Can you talk a little bit about what it was like to be invited, and effectively, what a Serverless Hero is?
Yan: That is a good question. I don't know if I really have a good answer for that. I think the reason why I was invited is because I'm doing all of these articles, and also doing video courses and trying to share and, I guess, bring good practices into the serverless community. As for what it's like, it's definitely been really helpful for me personally, in terms of getting more recognition for what I'm doing, and I guess to some extent it brings some authority to what I'm writing and saying as well. More people think, okay, this is not just some crazy guy shouting from the rooftop; if AWS is happy to give him an official title, maybe he does know what he's talking about. So in that regard, it's been really useful. And also you get a free ticket to re:Invent, which I think is pretty awesome.
Corey: Sometimes, that's really all it's about; yeah, those of us sitting outside of the circle get to pay for it. So, what was interesting is, we wound up catching up at SRECon recently in Dusseldorf, Germany, of all places, where you gave a talk on the concept of chaos engineering meeting serverless, which is fascinating to me on a variety of different levels. I mean, first off, isn't running a distributed system like Lambda effectively its own form of chaos engineering experiment?
Yan: Yes and no. I forget who it was, maybe it was even you, who wrote recently about how, if you run in us-east-1, that's kind of running a chaos experiment in itself, because us-east-1 is so prone to having all kinds of problems that they probably don't see in many other regions.
Corey: Well there’s so much running there, too, that every slight hiccup winds up affecting someone. So, to some extent, that entire region has a bit of a bad rep. But it’s interesting just from a perspective, at least where I sit, of trying to understand how chaos engineering would even look in a serverless context, because to some extent, what you’re doing is you’re writing code. You’re writing your arbitrary code and handing that to a provider, and at that point almost everything is happening on the other side, I guess, of the Amazonian wall, in that context. Where, okay, it’s going to go and it’s going to do these things, and if Amazon breaks, it could break in new and exciting and interesting ways that I may not be able to accurately predict. How do you wind up seeing that? How do you, I guess, figure out what types of failures to wind up practicing for?
Yan: I guess it really depends on what it is that you are doing, or what services you are using. Also, with chaos engineering, it's not just about figuring out what happens to Amazon services; there are so many different errors that can happen between your own services and also within your own services as well. And I think even though we can probably rely on Amazon to do a certain degree of chaos engineering, to make sure their side of the infrastructure is well-tested and resilient to many forms of failure, ultimately, as someone who owns the system and is responsible for the user experience, we have to aim for a level of resilience beyond what we get out of the box with Lambda.
In that regard, we also have to test failures, and resilience to those failures, at the application level. That's something the folks at Gremlin, who offer a commercial solution for running chaos experiments, are also focusing on; recently they announced a new application-level failure injection framework that you can potentially use from Lambda as well. I think right now it's only available for Java, but they may make it available for other languages too.
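As a rough illustration of what application-level failure injection can look like in a Lambda function, here is a minimal sketch in Python. The decorator and its parameters are hypothetical, not Gremlin's actual API; it just shows the general pattern of injecting latency and errors into a handler, gated behind an environment flag so experiments stay opt-in:

```python
import functools
import os
import random
import time

def inject_failure(error_rate=0.0, latency_ms=0):
    """Hypothetical chaos decorator for a Lambda handler.

    error_rate -- probability in [0, 1] of raising an injected exception
    latency_ms -- fixed delay added before the real handler runs
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            # Only inject when explicitly enabled, never by default.
            if os.environ.get("CHAOS_ENABLED") == "true":
                if latency_ms:
                    time.sleep(latency_ms / 1000.0)
                if random.random() < error_rate:
                    raise RuntimeError("injected failure (chaos experiment)")
            return handler(event, context)
        return wrapper
    return decorator

@inject_failure(error_rate=0.1, latency_ms=200)
def handler(event, context):
    # Normal business logic lives here.
    return {"statusCode": 200, "body": "ok"}
```

In a real experiment you would typically drive the error rate and latency from external configuration rather than hard-coded values, so you can dial the blast radius up and down without redeploying.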
Corey: The challenge almost goes beyond that, from where I sit. I know they've been beaten up for this an awful lot, and I'm not trying to belabor the point, but back when they had their big S3 issue a year or so ago, it wound up not just taking down S3 but an awful lot of other things that had baked-in, under-the-hood dependencies on S3. You'll see something similar even with or without serverless, where you're going to have an environment that's predicated entirely upon certain behavior patterns. But if us-east-1, for example, drops down, and your entire strategy is to move a bunch of traffic over to us-west-2: in isolation, when you test that, everything goes well. In practice, there's going to be congestion on the control plane, a lot of people are going to want to be failing over at the same time, and you almost have a herd-of-elephants problem, where at that point there's the sort of latency trickle-down effect that is difficult to predict without a very thorough understanding of how Amazonian systems behave.
So to some extent, it almost feels like you’re in a position of having to enumerate all the bad things that could happen, that could possibly go wrong, rather than trying to build a resilient system aimed at a wide variety of problems. Am I thinking about that the wrong way?
Yan: I think with chaos engineering, it goes beyond predicting what can go wrong. Take the example you just gave, of what happens when a region goes down: in theory you may have predicted how things would go wrong and how you can shift traffic around, but you never really know for sure whether or not things will play out the way you expect them to until you actually try it.
In the same way, a lot of companies spend a lot of time coming up with all these sophisticated plans for disaster recovery, how they'd move different workloads around to different data centers and so on, but they never exercise them in reality, so chances are, when something does happen, your disaster recovery plan may not work the way you planned. So one of the practices people follow with regards to chaos engineering is to actually exercise those scenarios. For example, Netflix does this from time to time: they'll plan a game day and actually trigger a region-wide failure, and see whether or not their system is able to recover from those regional failures the way they hypothesized it should. So part of chaos engineering is about exercising those failure modes and seeing how your system actually behaves, so that you can learn from it.
Corey: I think that's really what it comes down to: learning. Not to tell Netflix they're doing it wrong, but I wish it were easy to wave a hand and see if an issue you're seeing is just an entire region breaking. It never tends to manifest that way. Things start working intermittently.
Some services start responding with strange messages; some wind up responding with increased latencies; but very rarely is it an "everything goes dark and nothing is responsive" situation. Invariably, and I still blame most modern companies for this one, you wind up in a place where, in every single environment you're in, every person who has an issue pops their head up and says, "Is it just me, or is it my infrastructure?" I mean, the best early warning sign we still have in some cases is DevOps Twitter. There's no great way to say: is it my crappy code, is it our last deploy, or is this a wider provider issue?
Yan: Yeah, it's really funny you say that, because Amazon has traditionally been really slow at updating their service health dashboards, and oftentimes I find myself asking the same questions: okay, is it my infrastructure? Is there something happening at AWS? Nothing is updating on their service health dashboards, and then you go to Twitter to see whether other people are also complaining about AWS being impacted in the region that you are in. It's funny; I kind of always think AWS is monitoring Twitter, too.
Corey: Oh yeah, there are ways to fix this. I mean, it would be interesting, for example, if PagerDuty would wake you up with a notification that says, "Hey, by the way, we're seeing more than two standard deviations of other people in this region also being paged right now." It would shave 15 minutes off of most companies' response plans, because they're immediately aware that, oh, it wasn't someone pushing bad code, or a disk filling up, or a database falling over. No, this is a provider-level problem. Just getting that first-pass triage is something most companies can't do themselves.
So, that's a whole separate thing to rant on. Instead, let's talk about something else that irritates people to no end: API Gateway. You've written a fair bit on it lately; you've been going into some depth as far as how to work with it and the various things it can do. And I have to say that whenever I work with API Gateway, I come away feeling more confused than I did when I started. Is that just me, or is it really confusing?
Yan: It is really confusing. In part, I think, it's because it can do so many different things. Given all the different options, it's not always clear what the right thing to do is in a particular context. There are also some peculiarities in how API Gateway works. For instance, it just never occurred to me the first time I did it that when you create a custom domain name, it's going to create a CloudFront distribution, but for some reason it doesn't use CloudFront caching, so if you want caching enabled, you have to do it at the API Gateway layer. Or, now, you can do it with a regional endpoint and have your own CloudFront distribution for that custom domain name. There are also the different authentication mechanisms it supports, and when you should use which one.
A lot of that is something I had to learn myself through experimentation, and through different use cases that have come up in my line of work. I wish there were better documentation, just better education out there from AWS, providing guidelines on when you should use, say, IAM authorization versus Cognito versus something custom, like a custom authorizer function. There's also just the sheer number of things included in API Gateway; it does very much feel like a Swiss Army knife for all the different things you may want to do. And it's not a very cheap service either, by comparison to, say, Lambda invocations. I think for most people in production, it is likely to cost a lot more than what they pay for Lambda.
Corey: Oh, absolutely. For those who aren't aware, API Gateway acts as an HTTP or HTTPS front end for a variety of Lambda functions. But you can also put other things behind it; it's effectively aware of the different HTTP verbs that you can wind up leveraging. It can follow all sorts of interesting and convoluted workflows. It's more or less a networking Swiss Army knife, by which I mean all of the instructions are apparently written in Swiss German and no one's really clear on how to use it for certain things. The feedback from AWS around this service has largely been of the form, "Oh, use it however you like," which is reassuring but surprisingly unhelpful, and every time I start using it I am completely convinced that I'm doing it wrong. But it seems to work for the use case that I have, so I continue to sit here, and my resentment for API Gateway continues to grow.
Yan: I don't know if you've ever had to interact with API Gateway's own REST API, but talking to its control plane is one of the most awkward APIs I've had to work with.
Corey: I haven't even gotten to the point of trying to configure API Gateway via direct calls. Everything I've done with it so far has been through the Serverless Framework, and that's really the only thing that makes sense to me. But I do suspect there's an entire sea of complexity that I'm not exposed to that could probably solve my problem in half the time. It's one of those areas where it's just a "future me" thing. I'll look at that one of these days.
Yan: Yeah, I do the same thing as well. I mostly interact with API Gateway through the Serverless Framework; it simplifies things so much. But the few times that I had to provision API Gateway with Terraform and other things that force you to understand how API Gateway resources are managed, and to figure out how they link up together, that's just a whole sea of complexity under the hood, which, frankly, the Serverless Framework just shields you from.
Corey: Absolutely. It's one of those areas that I think is still evolving. But let's get a little out of the weeds for a minute and look at the big-picture stuff. You're a modern-day thought leader as far as serverless goes, which means you've been using it for more than twenty minutes. Let's look forward at serverless, instead of at how it stands today; let's look at what it will look like a few years down the road. I've been saying for a little while now that today it feels a little bit like a toy, in the context of what it's going to look like in a few years, and I've had some people get very angry at that characterization and say, "No, it's not a toy. It's awesome. It's perfect. We're in production on it. Shut your mouth."
Other people agree with me, and invariably, I inadvertently start a war, and then I sneak out the back, take a cab to the airport, and catch an early flight home, and, well, we hope most of those people lived. Where do you fall on that particular spectrum?
Yan: I certainly think that it is good enough to run production workloads on, and there are now many people who are running very heavy production workloads on serverless, on Lambda and other similar function-as-a-service offerings. So I definitely think it is good enough for production. As for it being a toy compared to what it will look like in a couple of years' time, I think that is definitely the case. What we see today is something that's useful, that's usable in production, but it has many caveats that require knowledge to work around, which is why the video courses I've been doing, and the blog posts I'm writing, provide value to the people who are reading and watching them. But at the same time, I wish I didn't have to write those things. I wish more things just worked out of the box, and I believe that in the next couple of years, things will continue to evolve. Some of the problems people are having today, in terms of the limitations around VPCs, ENIs, and cold starts: all of those will just go away, and it will work a lot better as a platform. There will be more flexibility, so that potentially you can say, "Okay, I don't want to use Go, I don't want to use Node, I want to use Rust or some peculiar language that I have just discovered."
You should be able to just bring your own language to the platform and use Lambda, or function-as-a-service, as a more general abstraction over containers and other compute resources. I definitely think all the problems we are seeing today in terms of some of the scaling limits are going to go away as well, along with some of the complexities around how you build observability into your serverless application.
All of that should also improve dramatically compared to what we have today, which is oftentimes many home-baked solutions for shipping logs to monitoring services, and for getting correlation IDs and things like that into a serverless application, which, for many years now, we've been able to just offload to some third-party vendor to provide out of the box for us.
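As a sketch of the kind of home-baked solution being described here, this is roughly what propagating a correlation ID through structured logs can look like in a Python Lambda function. The header name and log fields are illustrative assumptions, not any particular vendor's format:

```python
import json
import uuid

_correlation_id = None

def extract_correlation_id(event):
    """Reuse the caller's correlation ID if present, else mint a new one."""
    global _correlation_id
    headers = event.get("headers") or {}
    _correlation_id = headers.get("x-correlation-id") or str(uuid.uuid4())
    return _correlation_id

def log(level, message, **fields):
    """Emit one JSON log line with the correlation ID stamped on it."""
    record = {"level": level, "message": message,
              "correlation_id": _correlation_id, **fields}
    print(json.dumps(record))
    return record

def handler(event, context):
    extract_correlation_id(event)
    log("info", "request received", path=event.get("path"))
    # Business logic would go here; any downstream HTTP or Lambda calls
    # would forward _correlation_id (e.g. in an x-correlation-id header)
    # so log lines across services can be joined on that one ID.
    return {"statusCode": 200}
```

The point is that every log line, across every function in the call chain, carries one ID you can search on when debugging a request that fanned out across many services.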
Corey: I would agree with you. There's a lot of stuff that feels like it's half-baked and isn't done yet. There is a story about how using Lambda dramatically speeds up the time it takes to write an application and get it into production, but I feel like it doesn't really kick in until the third application you write. The first one, as you learn the caveats and trip over the "wait, it does what?" moments, feels like it's going to take a lot longer to make sense of. The gains don't really come until you've repeated it a few times: once you've built up a comfortable speed, found out where the sharp edges are, and understood the model, then you're going to be more effective. But I don't get the sense, and maybe this is just me, that you can drop it in front of a team of developers and they will be immediately more productive that week. Is that naive?
Yan: No, I agree with that. In fact, I think serverless exposes the development team to a lot of things that they may not be used to thinking about on the operational side, in terms of how to set up centralized logging, monitoring, and tracing as well. A lot of these things, traditionally, development teams have been able to offload to, I guess, a platform team or DevOps team, for lack of a better name. Now they have to think about it themselves; now they have to understand how to [...] their code, how to build on top of their code in their environment when they run it in the cloud. I think that means traditional development teams who haven't been exposed to that now have to up their game and really learn the operational side of things, and how to make serverless applications production-ready.
Corey: Who are you seeing these days, as far as people who are building serverless applications? Who's using it, and I guess for what types of use cases? I mean, we do see the toy problems that wind up being shown on stage at various conferences, and I've seen it for back ends, but are you starting to see full-on applications being written start to finish using serverless technologies, or is it more of, I guess, a helper thing, in your experience?
I live in San Francisco, so I tend to see a lot of things from a different angle. "Hey, we wrote this thing last night; it's serverless blockchain machine learning, the end." And that's great and all, but it isn't exactly representative of what the rest of the world sees.
Yan: Yeah, I haven't really seen any serverless blockchain AI, or whatever the buzzword is, out there. What I do see is a lot of people building applications, building back ends, and I guess it really depends on the company they work in. A lot of the adoption of serverless is driven by the culture in a company. For instance, I see a lot of DevOps teams adopting serverless because it makes their lives a lot easier, and also makes their solutions a lot more cost-effective: by moving, say, cron jobs to run as Lambda functions, they're able to do a lot of automation for resources and monitoring as well, both on the operational side, but also from the security point of view, by hooking into all the different events they can capture with CloudTrail, and then using Lambda functions to react to them through CloudWatch Events patterns. We also see a lot of application developers freeing themselves of some of their organizational constraints and the inertia around dependencies on, say, a DevOps team, which holds the keys to the kingdom. With serverless, application developers can take ownership of more of their infrastructure, or more of their systems, without getting entangled in all the hundreds of different tools they'd otherwise use for DevOps.
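As an illustration of that security use case, here is a rough sketch of a Lambda handler reacting to a CloudTrail-sourced CloudWatch Events event. The envelope below follows the general shape CloudTrail records take for `AuthorizeSecurityGroupIngress`, but the rule itself, and the detection-only response, are assumptions for the example; a real remediation would call `revoke_security_group_ingress` via boto3:

```python
def handler(event, context):
    """Flag security group ingress rules opened to the whole internet.

    Triggered by a CloudWatch Events rule matching CloudTrail's
    AuthorizeSecurityGroupIngress API calls. Detection-only sketch.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "AuthorizeSecurityGroupIngress":
        return {"flagged": False}

    params = detail.get("requestParameters", {})
    flagged_ports = []
    # Walk the ingress rules looking for a 0.0.0.0/0 CIDR.
    for item in params.get("ipPermissions", {}).get("items", []):
        for ip_range in item.get("ipRanges", {}).get("items", []):
            if ip_range.get("cidrIp") == "0.0.0.0/0":
                flagged_ports.append(item.get("fromPort"))

    # A real function would notify (SNS/Slack) or revoke the rule here.
    return {"flagged": bool(flagged_ports), "ports": flagged_ports}
```

The same pattern generalizes: subscribe a small Lambda function to the event pattern you care about, inspect the CloudTrail detail, and either alert or auto-remediate.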
Another one I see a lot is IoT. I've spoken to a lot of people in the IoT world, and for them, serverless turned out to be a very natural fit. I guess iRobot and Ben Kehoe, who often talk about how they're using serverless, really take the whole usage to the next level, but there are a lot of other, smaller companies, many startups in the IoT world, that are making really heavy use of serverless.
In fact, not long ago, I was speaking at a local user group event in London, and one of the companies there, a very small company with their own IoT platform, was, I think, at that point, one of the biggest users of Lambda in the whole of Europe; they were easily doing several tens of thousands of concurrent executions per second. Lambda and serverless give you that flexibility and that scalability pretty much out of the box. So yes, those are some of the common use cases that I've seen companies working on: ops automation, building traditional web applications, as well as IoT. There are also a few other places, including a couple of companies that I've worked for, where we've moved a lot of our [inaudible 0:27:44.4] pipeline to run on serverless using Lambda, Kinesis Firehose, Athena, and QuickSight, the entire stack, so that we have the whole pipeline without having to manage and run any servers ourselves, and we only pay for the data that we process and query.
Corey: I think that we're starting to see this really gain steam. We haven't seen this sort of adoption from the same players historically. It does feel an awful lot like there's more of an enterprise embrace of this than there has been with previous things. With the cloud, you saw small companies doing it first, and eventually it turned into a wave of enterprises rushing in. With serverless, I feel almost, to some extent, like we're seeing enterprises embrace it before we see startups do.
Yan: Yes, I'm seeing that as well. In London, there are a lot of financial institutions, and I have been seeing, I guess, more adoption of serverless in that world than I anticipated.
Companies like [Chapter One], and I think Goldman Sachs is using serverless as well, and a few other big enterprises which I didn't expect to be on this serverless wave. Sometimes I feel like these companies are so late to the cloud that they just jump over the whole containerization step and go straight to serverless, because that was an easier entry into the cloud than going into the infrastructure of running containers.
Corey: I definitely feel, to some extent, like containers are almost a transitional step, same with orchestrating them in place, but that tends to be controversial, and that's probably a conversation best reserved for another day. So, you have a talk at re:Invent coming up. Can you tell us a little bit about it?
Yan: Yes, it will be an extended version of the chaos engineering and serverless talk I did at SRECon in Europe. Again, it's about the challenges that we face in the serverless world in terms of building greater resilience than we are able to get out of the box with AWS. It covers many of the things we talked about earlier: how do we identify failure cases, and how do we simulate them to verify that our application can actually handle those failure modes? But it's also about trying to uncover failure modes that we are not yet aware of, by running scenarios where maybe we just don't know what our systems would do, but we know it's probably going to be bad. By running those scenarios in an environment outside of production, we can learn about our system's failure modes ahead of them actually happening in production, which gives us a chance to engineer resilience into the application. Ultimately, it's about how we can take the principles of chaos engineering and bring them into the application layer, rather than just applying them at the infrastructure layer.
Corey: Which I think opens up an awful lot of opportunities. The version that I saw was fantastic; I highly recommend that people wind up catching this if you can, either at re:Invent itself or on the video after the fact. So, if people like what you have to say, where else can they find you?
Yan: They can find me on Twitter. They can also find my video courses: if you go to productionreadyserverless.com, that will take you to the video course page, where you can buy the videos or just check out the first couple of chapters. You can also find me on my blog, theburningmonk.com. I also do a lot of writing on Medium, as well as for a few other companies. I recently wrote a number of blog posts for Logz.io on serverless versus containers, from a perspective of control versus responsibility, and on vendor lock-in in terms of the risks versus the rewards, looking at the current state and adoption trends for both serverless and containers.
Corey: Thank you very much for taking the time to speak with me today. There's always, of course, the conference circuit as well, for anyone lucky enough to run into you at a conference like I did. I absolutely recommend it; you're incredibly gracious, you're an excellent speaker, and you tend to tell stories in ways that are very engaging, so thank you for that.
Yan: Thank you, that’s an amazing compliment. Thank you very much.
Corey: This has been Yan Cui, I’m Corey Quinn, and this is Screaming in the Cloud.