Are you interested in going beyond basic monitoring and visibility? Need tools to build and operate serverless applications and extract business intelligence? IOpipe provides extended visibility and metrics around AWS Lambda, including profiling, core dumps, and incoming input events.
Today, we’re talking to Erica Windisch, who is the founder and CTO of IOpipe. She brings her experience in building developer and operational tooling to serverless applications. Erica also has more than 17 years of experience designing and building cloud infrastructure management solutions. She was an early and longtime contributor to OpenStack and a maintainer of the Docker project.
Some of the highlights of the show include:
Nomenclature Battle: Serverless vs. stateless
Building a window of visibility into Lambda: Talking to users and assessing needs/pain points
Observability of the infrastructure: Necessary evil to get to automated healing
Using Lambda at significant levels of scale; some companies grow usage, others go all in right away
Current state of Lambda ecosystem
Is Lambda stable? Invocations and no formal SLA
How issues manifest and are exposed
Trends include cold starts, hours-long failures, and functions invoked multiple times
Infrastructure powering IOpipe: Lambda issues may impact performance of monitoring system, but IOpipe is not necessarily dependent on Lambda
Future of Lambda: Builds applications a specific way, but there are limitations
What would Erica change about Lambda? Run a function and define handlers for its output
Lambda functions can be difficult to understand; some developers do not have familiarity and create bottlenecks
Capacity limits around Lambda can be difficult to establish
Erica Windisch on Twitter
Erica Windisch on Twitch
Full Episode Transcript:
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I’m joined this week by Erica Windisch who’s the founder and CTO of IOpipe. Welcome to the show.
Erica: Hi. Thanks for having me.
Corey: No, thanks for taking the time to speak with me. Let’s start at the beginning. What is IOpipe?
Erica: What we do is we provide tools for developers to build and operate their serverless applications, from development to production, and increasingly also doing things like helping you extract business intelligence from applications and correlate that with operational information and operational observability, which just sounds like a lot of buzzwords, doesn’t it?
Corey: I feel like half of the space stands out that way. In fact, I first found out that you folks existed at re:Invent last year. There was a big midnight madness launch and they were going to be announcing some things. Frankly, none of us cared about that. We were there to see Shaquille O’Neal as DJ Diesel apparently “dropping sick beats” as the kids say.
But, while I was there watching your presentation, a couple of other things that came out were, in some ways, more entertaining even than watching a seven-foot tall gentleman spin discs for fun. It was neat to see.
To my understanding from back then and as it continues to evolve now as I continue to work in this space, effectively what you do is provide visibility and metrics around AWS Lambda. Is that more or less how you’re positioning yourselves these days? You can obviously pour more buzzwords onto it, but is that effectively encapsulating what you do?
Erica: I would say it’s the baseline for what we do. We have some competitors. I would say our competitors definitely fit more firmly within those parameters. We’re growing out of basic monitoring and basic visibility. We have things like profiling. We have core dumps. Now, we look at things like incoming input events.
If you are doing Alexa skills, you can filter by a specific conversation with a specific user if you want to. This works out-of-the-box. Those are things that none of our competitors for instance are able to do. I don’t know what to call this, but I think we’re doing something new and unique.
Corey: I would agree with the first part of your last sentence, which is that it’s difficult to know what to call this. Some would argue that in any sufficiently exciting technology, a battle always breaks out either about pronunciation or about what to call the thing that you’ve built. We’ve seen it with monitoring versus observability. To that end, where do you stand on the use of the word ‘serverless’?
Erica: I think the word ‘serverless’ is fine. I see the point people are making: they’ll make a big deal of the name, but nobody complains about the term ‘stateless.’ We’ve agreed that we can build stateless applications, but there’s still state.
Your TCP session has state. The physical link layer has the state of a wire physically being connected. Your application, your user provides a session cookie, and your state is stored in your database. There is state. It’s just that this part of the code doesn’t necessarily worry about it. You put the state in different layers of your application, you manage your state in certain ways, and you ignore the places where you still have state, like the fact that you connect to a database. The fact that you’re storing data in a database is taking that state and moving it somewhere. It’s like, I have this temporary state by the nature of running an application, and then I store it elsewhere. I don’t maintain the state.
I think serverless is very much the same way. Yes, there are still servers, but we don’t care about them. We move the concern for them somewhere else, the same way we move state. But because servers are a more concrete thing that you can physically see, there’s more pushback around that term. Nobody pushes back on state because state is such an abstract concept. You can’t see state, generally, but you can see servers. I think these are similar, but we complain about one and don’t complain about the other.
Corey: Very aptly put. How long has IOpipe been in business?
Erica: We’ve been in business for a little more than two years. We launched about a year ago. I started on this project maybe two and a half years ago, in terms of me leaving Docker and saying, “I’m going to go do something around serverless and next-generation applications,” and figuring out what that meant. Through customer conversations, through searching for a co-founder and finding Adam, and founding the company, we found a focus and a vision and turned that into a corporation and so forth about two years ago.
Corey: If you take a look, I think Lambda wasn’t really announced until 2015, so that’s less than a year between the announcement of a thing that no one really knew what to make of and you effectively jumping on this at a very, very early stage. How did the idea of building a window of visibility into this new thing that no one quite understood what to do with come about?
Erica: Through two trends. One was talking to users and developers on Lambda, and assessing what their needs were. We just had lots of conversations to find out where the pain points were, like, “Where did you need help? What can we fix? Is there a product here? Is there something that you need that we can serve, fix for you, and build a product?”
We were seeing a trend among users and developers of serverless looking for monitoring and observability, as well as the ability to really understand things like sessions: HTTP sessions for users of those applications, and for Alexa applications, tracking Alexa skills. These were all things that we saw, and we saw a market for them.
But more so, the original vision of IOpipe, my vision when I left Docker, was more ambitious. I saw that observability of the infrastructure was a necessary evil to get to the place I wanted to get to, which was more of automated healing and automated application construction.
I wanted machines to do all this work for us, like the idea of AWS Glue, for instance. This idea of gluing together serverless applications or doing things like Step Functions.
When we build these units really small, and we have very open and standardized channels of communication and just process events—if we standardize event processing, we have standardized input, we have standardized output, and they’re all very, very small—we could just use machine learning to construct them. That was my original vision.
It turns out we need a feedback loop for this, which was observability, and that just didn’t exist. We started building the observability tools and we started talking to users and seeing the need for observability tools. We just went straight down that path.
I think maybe in some ways we’re getting back to the original vision and ideas, but staying very firmly within where there’s a market need.
Corey: Which is a fascinating way of almost stumbling into an offering that’s definitely resonating within the market. To that end, do you see customers using Lambda at significant scale at this time, or are people still in the early days, doing it for proof of concept and not really rolling it out?
Erica: It depends. There are some very large organizations that are using Lambda for a number of projects, which may be big or small. This is actually a conversation I’ve had with people. There was some focus in the market, with developer evangelists and enthusiasts giving talks, on the idea of going straight into production, going straight into building these applications that are ideal for Lambda, and kind of starting there.
I was like, “Hold on a second.” It’s actually okay to say you can build simple applications, ad hoc applications, on Lambda to learn it, and then learn and expand. Get there with Lambda on low-risk applications and then get into big applications. I definitely see both of these.
I’ve seen companies go straight into, “We’re going to put a billion dollars of billing into Lambda,” whole Fortune 100 companies saying, “We’re going to put all our billing in Lambda, just straight off the bat.” I’ve also seen big companies that say, “We’re going to do a small project. We’re going to do some cron jobs. We’re going to become familiar with it, understand where our edge cases are, and then grow.”
It’s a mix. I would say it’s probably a lot more of the latter rather than the former, because I think it’s easier to start with small things and expand out than to have big top-down initiatives like rewriting giant stacks in Lambda.
Corey: Oh, I agree wholeheartedly. You’re probably the single company that is best positioned as a global observer of what trends people are implementing with Lambda other than Amazon themselves.
One of the early use cases, and a lot of the examples that Amazon themselves give about implementing Lambda, tend to revolve around performing certain tasks in an AWS service environment. Taking a tag and propagating it to a secondary or tertiary resource. Taking a bit of data from one service and then passing it to another, and so on and so forth.
Is that the primary use case that you see? Or are people using this for something else entirely, to run full-featured applications? Are you just seeing it done as glue code? What is the current state of the Lambda ecosystem?
Erica: There’s definitely a mix in there. I would say that I don’t agree with this notion that Lambda is just filling service gaps in AWS. Lambda as a stored procedure isn’t necessarily addressing a lack of capabilities in the database; you have custom business logic you need to implement. We’ve used Kinesis. There are some things that we do with Kinesis where we could technically just use Firehose or some of the other AWS services that do this for us. We chose to write our own code for a number of reasons.
I would say that the majority of it is a mix. I’m thinking of this Lambda@Edge function I wrote that does JWT verification for S3. Instead of doing pre-signed URLs with S3, if you have a valid JWT, a JSON Web Token, you can just access the data. Otherwise, you’d need to send your JWT to an API Gateway Lambda, have it sign a request against S3, and return a pre-signed URL back to you. With Lambda@Edge, you can just use that JWT directly against S3.
But this is the case where, wouldn’t it be cool if Amazon just supported JSON Web Tokens for S3 in the first place? I could see that perspective, but this also provides so much more than that, because Amazon can’t predict what’s going to be popular; JSON Web Tokens are a thing that came from somewhere.
The industry came around and said, “We’re going to adopt this JSON Web Token thing.” But there’s also basic authentication. There was digest authentication. There’s LDAP authentication to web services. Amazon could have gone and supported all of those, or they could just say, “We’ll give you a mechanism where you can implement it however you want, and give you the power of open source to share that code and build an ecosystem around us as a platform instead.”
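To make the pattern Erica describes concrete, here is a minimal, hypothetical sketch of a viewer-request check written in Python. The CloudFront event shape follows the Lambda@Edge format, and the signing key, header handling, and use of the PyJWT library are assumptions for illustration, not IOpipe’s actual implementation.

```python
# Hypothetical sketch: verify a JWT at the edge and let CloudFront fetch the
# S3 object directly, instead of minting pre-signed URLs.
import jwt  # PyJWT, bundled with the deployment package

SIGNING_KEY = "replace-with-your-secret-or-public-key"  # assumed key

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    auth_headers = request.get("headers", {}).get("authorization", [])
    token = auth_headers[0]["value"].replace("Bearer ", "") if auth_headers else ""

    try:
        # Raises if the signature is invalid or the token has expired.
        jwt.decode(token, SIGNING_KEY, algorithms=["HS256"])
    except Exception:
        # Short-circuit with a response instead of forwarding to the S3 origin.
        return {"status": "403", "statusDescription": "Forbidden", "body": "invalid token"}

    # Token is valid: pass the request through to the S3 origin untouched.
    return request
```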
On the other side, our users are building web APIs, web applications, microservices, and what I now call nanoservices on Lambda. I think those are real applications. As long as you can build a “Twelve-Factor App,” you can build it on Lambda.
Corey: A question that I have, though, also comes down to the basic reliability of the platform. If I take a look right now at my Lambda functions over the past day, I’ve had 30 invocations, which means that there are large swaths of time during which Lambda could have been completely down and I would have had no idea.
There is no formal SLA around it, so from my perspective, I’m looking at this and given that no one has complained about the thing that my Lambda functions power, and no one has blown up my email about this, I assume that the reliability has been perfect. How does that map to what you are seeing in the real world as people start to scale this significantly? Is Lambda fairly stable? Is it something that tends to drop out in weird ways that are difficult to diagnose?
Erica: I would say it’s been pretty stable recently. There are some outliers that are not recent. When they first launched, when they first went GA, there were a couple of issues that were resolved fairly quickly, mostly in US East 1, but it’s been pretty stable since then.
The last major, significant outage I can accurately place was the great S3 failure and that was because Lambda uses S3 for storage internally. When S3 went down, Lambda went down, too.
Corey: When you do see Lambda issues, how do those tend to manifest? I feel like there’s not enough exposure to how these things break. Is it delay in invocation? Do they fail to invoke at all? Does it hang and add latency spikes or something else entirely?
Erica: That’s really interesting because, as you said, we have maybe some of the best visibility into this outside of Amazon themselves. We definitely have internal visibility into anonymized statistics of what’s happening on Lambda that we can look at, and we’ve noticed a few things.
There’s a built-in container lifecycle. There are cold starts because containers are spun up; containers are also killed. There’s a life cycle. The typical range is 4.5 minutes to 4.5 hours for a container servicing a Lambda function. A Lambda function might be served by multiple containers, but each container, and every process in that container, is supposed to live between 4.5 minutes and 4.5 hours.
We’ve seen cases where they’ve been alive for 8 hours or 16 hours instead. Sometime around that 10-hour mark or whatever, Amazon starts announcing that there are service problems.
We’ve actually noticed some of these failures before Amazon has, or at least before they’ve acknowledged them, because we can see that those containers aren’t being reaped at the right time.
This may literally have been the bug: maybe they weren’t reaping, which meant they kept spawning more containers and had resource exhaustion in the Lambda service because they weren’t properly garbage collecting containers.
We’ve seen things where functions would consistently be invoked multiple times, where every Lambda function was invoking three or four times instead of once. But these things have mostly settled down to a very significant degree as the platform matured. These were mostly issues around the initial launch.
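A simple way to observe the container lifecycle Erica describes is to keep a little module-level state inside the function itself. This is only a rough sketch, not IOpipe’s agent; it assumes a Python runtime and relies on module-level code running once per container.

```python
# Minimal sketch: track cold starts and container age from inside a handler.
# Module-level globals survive warm invocations of the same container.
import time

CONTAINER_STARTED_AT = time.time()
INVOCATION_COUNT = 0

def handler(event, context):
    global INVOCATION_COUNT
    INVOCATION_COUNT += 1
    is_cold_start = INVOCATION_COUNT == 1
    container_age_seconds = time.time() - CONTAINER_STARTED_AT

    # Emitting these with each invocation makes unusually long-lived or
    # repeatedly invoked containers visible in your logs and metrics.
    print({
        "cold_start": is_cold_start,
        "container_age_seconds": round(container_age_seconds, 1),
        "invocation": INVOCATION_COUNT,
        "request_id": context.aws_request_id,
    })
    return {"ok": True}
```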
Corey: That makes a fair bit of sense. Are you able to talk at all about the infrastructure that powers IOpipe? In other words, when there starts to be a Lambda issue, is that something that impacts the performance of the monitoring system that watches Lambda?
Erica: We are based on Lambda. A user’s Lambda runs and sends data directly to a collector service that we run, that puts data into Kinesis. None of that touches Lambda up to that point.
We’re not dependent on Lambda or any of Amazon’s serverless products for ingesting the data and getting it into our account, which is good because it de-risks us: if there were a failure in Lambda, we wouldn’t be affected by it at that point.
At that point, it’s in Kinesis. Once it’s in Kinesis, even if there were a failure with any of the services that we built entirely on Lambda, we could just process that at a delay. The Kinesis stream feeds into several Lambdas that write things to our databases, run our alerts, and run various intelligence tasks against them.
We use Lambda very extensively internally. I think the collector service is perhaps the only service that’s not on Lambda, for specific reasons: we’ve chosen to de-risk against certain things, which are really a Lambda failure and latency. When we deployed that service, we didn’t have regional endpoints—they exist now, but at the time they didn’t—and I knew that was something we needed.
It’s something that we’ve actually reconsidered, whether we would eliminate that service, because we could implement it without EC2 and could have moved it to API Gateway instead, without any Lambda, actually.
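As a rough illustration of the Kinesis-buffered pipeline Erica describes, one downstream stage might look something like the sketch below. The stream wiring, the table name, and the use of DynamoDB are assumptions made for the sake of the example.

```python
# Hypothetical sketch of one stage of the pipeline: a Lambda triggered by the
# Kinesis stream the collector writes into. Kinesis base64-encodes record data.
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("telemetry-events")  # assumed table name

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Persist the event; a failure here raises so the batch is retried,
        # which is why a Kinesis buffer lets you "process at a delay."
        table.put_item(Item=payload)
    return {"records": len(event["Records"])}
```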
Corey: I was wondering on some level if there is going to be a dark secret of, “Surprise, we actually run this entire thing in a datacenter somewhere that’s in the middle of nowhere because we think this cloud thing’s a fad.” It’s always interesting when you start scratching to see how things like these are built under the hood.
Erica: I actually had a conversation with somebody who suggested we do that. It was a legitimate proposal.
Corey: Was this person trying to sell you ColoSpace by any chance?
Erica: I don’t think they were, actually.
Corey: As far as where things stand today, at least from my perspective, Lambda started off as a curiosity and a bit of a toy. Three years in, it’s more than that. I’m seeing it used for production-level workloads in a number of different environments. We’re seeing the platform itself become a lot broader as well.
In the context of being able to support new runtimes that weren’t there at launch, new versions, and, for example, assigning more resources—I believe at last re:Invent, the RAM limit doubled. Where do you see the platform evolving in the future? When it becomes even less of a toy than it is now, five years from now, what does that look like?
Erica: I wouldn’t say it’s a toy now. I think you can build really amazing, advanced applications on it. The limitations of Lambda, to me, are very freeing, in that they enforce some of the twelve-factor app design decisions.
Twelve-factor was a guideline, and Lambda enforces that opinionated stack design. It forces you to build applications this way. Things like the five-minute execution window make you build applications a certain way, which is a good thing. It does maybe restrict you from doing some sort of MapReduce kind of jobs, but for most applications I do think it’s very much not a toy.
Any kind of microservice or HTTP service you’re looking to build, you can do it with API Gateway and Lambda. I think there are some limitations that are kind of an issue, and they’re actually not even restricted just to Lambda. Amazon is going to get there, but they need to work on it.
For instance, there’s something we’re dealing with right now. We have a service that’s based on Elastic Beanstalk, our collector, and we expose that collector over a VPN. You can’t use CloudFront, nor can you use ELBs or ALBs for that when you’re doing it over a VPN.
Amazon just announced API Gateway over VPC, so now this actually works. But you’d need to go from API Gateway to an ALB, and how do you do TLS termination? These are problems that I really wish Amazon would solve.
What I’m saying is, for some of the services around Lambda, I wish they did a little better. Kinesis Video Streams, for instance, doesn’t integrate with Lambda. There are places where I just wish they did a thing they don’t do yet.
They actually get it, and they are getting there. They’re working on these things, but sometimes, living on that cutting edge, you definitely run into some of these services that aren’t Lambda that have limitations I wish they didn’t.
Corey: If you had a magic wand, what would you change about Lambda?
Erica: I think this is kind of maybe a selfish answer because I’ve worked on this observability platform, but it’s this thing that was actually in Azure Functions that was pretty neat.
This idea of: you run your function and then you can define handlers for the output of that function, as well as different pipes out of it. You could basically have your function run, return some value, and not just return data back to the caller, but have that output teed off, piped off, forked to other receivers directly—something that in Lambda you basically have to use Step Functions for.
If a Lambda execution itself could directly be an event trigger for another Lambda, for instance, that would be really, really neat. Whenever this Lambda invokes, take the output of it and run another Lambda function, or put the output into Kinesis. That’s a really neat thing that would actually enable me to do some things that I can’t do today.
Azure actually did that out-of-the-box. There are some things they did out-of-the-box that I don’t like, and things they didn’t do out-of-the-box that I wish they did over at Azure, but that was the one thing where I was like, “Wow, that’s really cool.”
I still wish that Amazon had something like that: some sort of queue or Kinesis stream for the output of those functions. Not just CloudWatch data, because you can already do that with the CloudWatch stream, but something that’s more of an alternative pipeline for data out of it. It’s hard to explain. It’s ambiguous. It’s maybe something you just have to explore.
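Lambda doesn’t expose output handlers like this natively, but a hand-rolled approximation of the idea Erica describes might tee a handler’s return value into a Kinesis stream with a decorator. The stream name and partition key below are assumptions; this is a sketch, not a built-in feature.

```python
# Rough approximation of "pipe the output somewhere else": a decorator that
# tees whatever the handler returns into a Kinesis stream.
import functools
import json

import boto3

kinesis = boto3.client("kinesis")
OUTPUT_STREAM = "function-output"  # assumed stream name

def tee_output(func):
    @functools.wraps(func)
    def wrapper(event, context):
        result = func(event, context)
        kinesis.put_record(
            StreamName=OUTPUT_STREAM,
            Data=json.dumps(result).encode("utf-8"),
            PartitionKey=context.aws_request_id,
        )
        return result  # the caller still gets the normal return value
    return wrapper

@tee_output
def handler(event, context):
    return {"status": "done", "items_processed": len(event.get("items", []))}
```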
Corey: Taking a bit of the opposite approach for a second: as you look at how people are implementing Lambda in various environments, what aspects of working with Lambda functions do you find that people either struggle to wrap their heads around, misunderstand, or, I guess, fundamentally have trouble with today? Because none of this stuff is easy or intuitive the first time you see it, I can assure you. I spent most of my time learning how this stuff works by getting it hilariously wrong.
Erica: For me, I personally didn’t have as much of a challenge here, but I do see others having that challenge. I think it’s a way of thinking.
I think that a lot of people implementing microservices, implementing these next-generation applications, these microservices applications, came to it with a monolithic mindset and adapted to it. They weren’t familiar with actor-based programming models. They weren’t familiar with things like Erlang or Haskell—when I’m thinking Erlang, I’m thinking OTP in particular. A lot of developers are aware of message queues, of course many are, but that kind of distributed computing, distributed computing problems, scaling applications—that’s a thing a lot of developers don’t have direct familiarity with. They’re just like, “I’m going to build a Node app, build it stateless, throw it on EC2, throw more EC2 at it, and that’s it.”
One of the challenges with Lambda that catches people by surprise is that Lambda scales so easily and so readily that its massive scale can become an issue if you don’t plan for it. You can easily find yourself with 1,000 concurrent invocations, 1,000 active containers, and overload your database.
You can just throw so much more at a database. You can throw so much more at a service. You can get so much concurrency and parallelization accidentally with Lambda that you run into bottlenecks you didn’t run into before, because you just said, “Oh, well, assuming the EC2 instance is fine, I’m just going to make a vertical stack here. I’m just going to make these giant vertical silos. I’m just going to build them taller.”
Instead, you now have a distributed systems problem, and a lot of developers just aren’t familiar with those. That’s the surprise catch: it’s so easy to build distributed systems, but if you’re not familiar with them, you find yourself creating bottlenecks in things like databases that you just didn’t expect, if you’re new to it, if you don’t know to expect that.
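One blunt guardrail against the accidental parallelism Erica describes is Lambda’s per-function reserved concurrency setting, which caps how many containers can run at once. A minimal sketch using boto3 follows; the function name and limit are placeholders, and the same setting is available in the console.

```python
# Sketch: cap a function's concurrency so it can't overwhelm a downstream database.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_concurrency(
    FunctionName="orders-writer",        # assumed function name
    ReservedConcurrentExecutions=50,     # at most 50 containers hitting the DB at once
)
```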
Corey: Scale brings up an interesting question. The entire premise of any sort of cloud computing environment is, “That’s the beautiful part. You can scale infinitely.” Which is absolutely awesome until you actually try to do it.
Come to find out, there are theoretical upper limits: you cannot provision two million containers at the same time and expect nothing to fall over. Do you see indications that there are capacity limits around Lambda that are at a point where they start to affect individual consumers? Or does the shared nature of the platform make that very hard to determine from the outside?
Erica: I would say it’s pretty hard to determine from the outside, and I wouldn’t really say Lambda is shared. There is an implementation detail of Lambda—Amazon does not guarantee this, but it is an implementation detail—that you basically get your own virtual machines to run your containers on. Amazon is managing a fleet of EC2 instances just for you, for your Lambdas, as an implementation detail. Again, that is not a guarantee from them, but that’s how they’ve chosen to implement it.
I think that the limitations of Lambda are probably closer to those of EC2. In reality, the limitations are things like the 75-gigabyte limit for all function code per account, which some of our users have run into.
Corey: Oh, I’ve run into that on a single function for myself because I write really inefficient nonsense.
Erica: You can actually do that. I think per function you have a limit of 500 megs, compressed, I think. You basically need to divide 75 gigabytes by 500 megabytes.
Corey: Yeah, I think it was something like 75 megabytes compressed, which, all seriousness and snark and witticism aside, I did brush up against with some of my early functions as I started trying to install everything into a monolithic function of pip dependencies over in Python-land.
It turns out that that’s a terrible anti-pattern and I should never do that. Putting a monolith into a serverless function does not mean you’re suddenly living in the future. You do have to break these things out architecturally.
Erica: I don’t know that you really need to do that. I think there are actually some use cases for, say, running WordPress inside of Lambda, and I think that can be fine. Pocket Studio is an example of an app that is a very large, open source, monolithic application. I think it’s something like 40,000 lines of code. It’s very big, but it’s fine. There are some advantages to it. Every Alexa skill is a monolith, for better or for worse; by design you have to build it that way. People are going to do it, and I think tools like IOpipe actually help with that. I think we got away from your actual question.
Corey: Which is absolutely fine. Is there anything else you’d like to mention that you have coming up or talk about that would be relevant or interesting? Where can people find you?
Erica: I’m going to be keynoting ServerlessDays London next month. I will be speaking at Velocity London, so that covers our London people. I have a bunch of other conferences—so many that I can’t even remember where and what they are—but I’ll be speaking at re:Invent. That’s happening, so you can find me there.
You can find me on Twitter, twitter.com/ewindisch, and at IOpipe we have a community Slack; you can find our website and reach out to us as well. I have also been doing Twitch streaming at twitch.tv/ewindisch. I haven’t been active in the last few weeks, but I’ll probably get back to streaming soon.
Corey: Perfect. Thank you so much for taking the time to speak with me today. My name is Corey Quinn and this is Screaming In The Cloud.