Have you ever been on call, as an IT person or otherwise? Woken up at 3 a.m. to solve a problem? Did you have to dig through log files or stare at a dashboard to figure out what was going on? Did you think there has got to be a better way to troubleshoot and solve problems?
Today, we’re talking to Sam Bashton, who previously ran a Premier Consulting Partner with Amazon Web Services (AWS). Recently, he started runbook.cloud, which is a tool built on top of serverless technology that helps people find and troubleshoot problems within their AWS environment.
Some of the highlights of the show include:
Runbook.cloud applies machine learning (ML) to metrics to pinpoint issues and presents users with a pre-written set of solutions
Runbook.cloud looks at all the potential problems it can detect, in context with how the infrastructure is being used, without becoming annoying or useless
ML is used to do trend analysis and understand how a specific customer is using a service for a specific Auto Scaling group or set of Lambda functions
Runbook.cloud uses aggregate data across customers to inform alerts; if there’s a problem in a specific region with a specific service, the tool is careful to caveat it
Various monitoring solutions are on the market; runbook.cloud is designed for a mass market environment; it takes metrics that AWS provides for free and makes it so you don’t need to worry about them
Will runbook.cloud compete with or sell out to AWS? Amazon wants to build the underlying infrastructure and have other people use its APIs to build interfaces for users
Runbook.cloud is sold through AWS Marketplace; it’s a subscription service where you pay by the hour and the charges are added to your AWS bill
Amazon vs. other cloud providers: Substantial work is involved in detecting problems across multiple clouds; it doesn’t yet make sense to branch out to other clouds
Runbook.cloud was built on top of serverless technology for business and financial reasons; it’s a way to align outlay with costs because you pay for exactly what you use
Analysis paralysis is real; it comes down to getting the emotional toll of making decisions down to as few decision points as possible
Save money on Lambda: instead of running several hundred Lambda functions concurrently, put everything into a single function using Go
AWS responds to customers to discover how they use its services; it comes down to what customers need
Full Episode Transcript:
Corey: This week’s episode of Screaming In The Cloud is generously sponsored by DigitalOcean. I’m going to argue that every cloud platform out there biases for different things. Some bias for having every feature you could possibly want offered as an added service at varying degrees of maturity. Others bias for, “Hey, we heard there’s some money to be made in the cloud space. Can you give us some of it?”
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they’re using it for various things, and they all said more or less the same thing. Other offerings have a bunch of shenanigans around root access and IP addresses. DigitalOcean makes it all simple: “In 60 seconds, you have root access to a Linux box with an IP.” That’s a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed-price offerings. You always know what you’re going to wind up paying this month, so you don’t wind up having a minor heart issue when the bill comes in. Their services are also understandable without spending three months going to cloud school. You don’t have to worry about going very deep to understand what you’re doing. It’s click a button or make an API call, and you receive a cloud resource. They also include very understandable monitoring and alerting.
Lastly, they’re not exactly what I would call small-time. Over 150,000 businesses are using them today. Go ahead and give them a try. Visit do.co/screaming and they’ll give you a free $100 credit to try it out. That’s do.co/screaming. Thanks again to DigitalOcean for their support of Screaming In The Cloud.
Corey: Welcome to Screaming In The Cloud, I’m Corey Quinn. I’m joined this week by Sam Bashton, who, once upon a time, ran a Premier Consulting Partner with AWS. Recently, however, he started something new, called runbook.cloud. Welcome to the show.
Sam: Thank you. Thanks for having me on.
Corey: Always a pleasure. It’s interesting to me to talk to people where there are multiple different aspects of what they do that apply directly to how I view the world. What’s interesting to me about Runbook is that, on one hand, it’s a tool that helps people find and troubleshoot problems within their AWS environment, which is fascinating and highly relevant. But what’s equally interesting to me is that you built the entire tool on top of serverless technology. It feels like we should tackle both of those angles. Which do you want to dig into first?
Sam: How about we talk a bit first about my motivations for building runbook.cloud?
Corey: Ah, the why. Absolutely.
Sam: Basically, for my entire career since leaving university I’ve been on call at some point or other. Often I would get a call in the middle of the night, about once a week, because something had gone wrong. I would have to troubleshoot what that problem was and work out what to do to fix it.
At first, that was very nerve-wracking, and it quickly became less exciting and more an incredibly large chore. I don’t think anyone enjoys doing on call. There’s a certain adrenaline rush of fixing a problem quickly, but even so, it is...
Corey: Hey Sam. Hey Sam. Hey Sam. Wake up. It’s 3:00 in the morning. You know what you want to do now? That’s right. Solve a murder mystery.
Sam: Yeah, exactly, and all you’ve got to help you solve the murder mystery is pages and pages and pages of graphs. That’s if you’re lucky, because in the early days, you probably didn’t have metrics at all and you just had to look at some log files and do your best to try and work out what was going on. Then as things got better, you built dashboards. A dashboard became like scar tissue for an organization: here are all the things that have failed previously. They’re probably going to be the things that go wrong in the future, but at least if it breaks again, we’ve got a way to check on that.
I thought there’s got to be a better way to do this, and runbook.cloud is my attempt to build that better way. What we do with runbook.cloud is look at all the metrics which people are drawing pretty graphs from, but we apply some intelligence—when I say intelligence, I don’t mean my intelligence, I mean machine learning of course—and we pinpoint where the issues are within the infrastructure. Then we have a pre-written set of, “Here are solutions to known problems that can occur,” and we present that to the user.
When you get paged at 3:00 in the morning, you still see a problem, but as well as seeing the problem, you see, “Here’s what it looks like the problem is, and here’s the suggested solution.” Quite often, it won’t be, “It’s definitely this”; it will be, “It looks like it’s probably this, but there’s a chance it could be this other thing.” You might get a list of two or three suggestions. That’s infinitely better than hunting through a ton of graphs, trying to work out, “Okay, what do any of these mean to me?” Because really, with graphs, the best you can hope for is that you have the right graph to test whatever hypothesis you might have. You can look at the graph and say, “Actually, the graph disproves my hypothesis, so I need to try and invent another potential reason for this problem,” or it proves it and then you can start trying to do something to fix it.
Corey: How do you keep a system like that from turning into the infrastructure equivalent of Microsoft Clippy? “It looks like you’re fighting an outage. Have you tried looking at DNS?” It seems like the sort of thing that could very easily become annoying and unhelpful. How do you avoid that problem? Obviously, “Wait, you mean it might annoy people?” is not going to be revelatory to you. How have you thought about steering away from that particular failure mode?
Sam: That’s where the machine learning comes in. What we do is look at all the potential problems that we can detect, and look at them in context with how the infrastructure is actually being used. A good example is CPU usage. In some scenarios, high CPU usage is an indicator of a problem. In other scenarios, for example when you’re running a batch computing load, high CPU usage is normal; it’s what you should be seeing. Actually, low CPU usage is the indicator of a problem there.
We can use machine learning to do trend analysis and to understand, in the context of how this specific customer is using the service for this specific Auto Scaling group or this set of Lambda functions, whether this looks wrong or this looks right. I would say machine learning becoming accessible to a wider base of developers is what’s allowed us to build something that isn’t the Microsoft Clippy of the DevOps world.
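To make that concrete, here is a toy sketch of the “in context” idea. The numbers and the simple standard-deviation rule are invented for illustration, and are far cruder than anything runbook.cloud actually runs:

```go
// Toy contextual check: judge a new CPU sample against this workload's
// own recent history instead of a fixed global threshold.
package main

import (
	"fmt"
	"math"
)

// anomalous reports whether sample x sits more than k standard deviations
// from the mean of the recent history, in either direction.
func anomalous(history []float64, x, k float64) bool {
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	var ss float64
	for _, v := range history {
		ss += (v - mean) * (v - mean)
	}
	std := math.Sqrt(ss / float64(len(history)))
	return math.Abs(x-mean) > k*std
}

func main() {
	batch := []float64{97, 98, 96, 99, 97} // batch fleet: high CPU is normal
	web := []float64{12, 15, 11, 14, 13}   // web tier: low CPU is normal

	fmt.Println(anomalous(batch, 40, 3)) // true: here a CPU *drop* is the anomaly
	fmt.Println(anomalous(web, 90, 3))   // true: here a CPU *spike* is the anomaly
}
```

A static “alert above 90% CPU” rule would page constantly on the healthy batch fleet and stay silent while the web tier burned.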
Corey: Do you tend to take only a particular user’s environment into consideration, or do you take the global environment as well? Historically, when I woke up in years past running infrastructures, I knew a few things were going to be true. First, something is broken. Second, the Amazon status page is going to be a sea of green telling me everything is perfect. Third, I’m not going to be able to disambiguate between whether this is a problem with my environment or a global problem until I get on the internet and check Twitter, because that’s the only sort of global real-time alert system most of us have. Are you considering suddenly seeing a flurry of activity across the board, across all of your clients’ environments, and then able to advise them on that, or is it strictly bounded by their specific environment?
Sam: We very much take all of the data we’re seeing in aggregate and use that to influence alerts. It’s actually something that was informed by my prior experience running a consulting partner. As a consulting partner doing managed services, we would look after many dozens of customers, so we saw the same scenario: we would see problems across multiple customers. If a problem shows up across multiple customers, it’s very unlikely to be specific to any one of them; it’s a wider outage, and we could relay that to AWS in terms of support tickets.
At runbook.cloud we also look at the data that we’re receiving in aggregate, and if there’s a problem in a specific region with a specific service, we’re quite careful to caveat it. The reason I think you often see a sea of green on AWS’s status page is because, at the scale AWS runs at, everything probably is working fine for most of their customers in that region. But when you’ve got millions of customers, 1% of your customers having a problem is a significant number of people.
Corey: Absolutely, and that’s part of the challenge too; I think it does vary with scale. If you have 30 customers and one of them winds up breaking, that is a significant percentage of what you’re seeing. But if you have, I don’t know, 500 queries per second hitting your website and you start seeing a 1% variance, that winds up scaling to a tremendous number of people as well. It’s one of those areas where, at scale, one-in-a-million occurrences happen five times a minute. It really does turn into one of those situationally dependent issues.
Sam: Yeah, exactly, and that’s where machines are excellent at aggregating that sort of data and working out what’s going on. I think in the past, as humans, we’ve not been as smart as we should be at using the machines to do a lot of the work for us. Or at least, outside of large organizations, the Googles and the Facebooks of this world, I don’t think we’ve been as smart as possible. You look at most people’s monitoring setups and they are pretty dumb right now. That’s not a reflection on the people setting them up; that’s a reflection on the tools available. Most things are: you set a threshold and you say, “If it crosses this threshold, then there’s a problem,” and actually, that’s not how things work in the real world.
Corey: It really does seem that this is an evolution along a very long axis. Back when I started working with technology, we started playing with the original Call of Duty video game, which is of course called Nagios. That’s the thing that woke us up in the middle of the night when everything was broken. The paradigm of setting something up, often manually, to look at individual systems and alert when they went down didn’t age very well. In a world of ephemeral infrastructure, in a world of autoscaling, and especially in a world where you have 10 web servers that are load-balanced, if one of them blows up, I probably don’t care. If three to five of them blow up, I really care.
It turns into a story where the traditional thoughts around monitoring no longer really seem to work. The next sort of evolution of this has gone towards the idea of aggregating things, looking at metrics, looking at graphs, and that’s terrific. There are beautiful dashboards you can hang up in an office, put up on a website, send to execs, and no one ever looks at them. That’s interesting, and now we’re starting to see the next generation of this stuff emerge, where you see things like outlier detection, where we start to see the systemic issues that underlie things. It feels like you’re very much in line with the zeitgeist around monitoring thought and theory right now. Is that something you’d agree with? Am I way off base in my assessment?
Sam: I’m not going to disagree with you telling me that I’ve found exactly the right solution to the problem. I think there are a number of solutions that people are finding, and actually, I think they address different parts of the market. You look at something like Honeycomb, which is very much going for tracing. That’s a key part of what needs to be done. But you need a very, very technical organization to be able to implement that functionality. If you have the right sort of organization to implement the functionality to emit all that tracing data, then you absolutely need a tool like that.
With runbook.cloud, I’m trying to go for a more mass-market environment: an organization where you probably don’t have a huge amount of metrics beyond what Amazon gives you out of the box, which is pretty enormous. I think at last count there were 30-odd metrics purely for EC2 alone. AWS gives you all of these metrics for free, and what we’re doing is looking at them and then trying to make it so you don’t need to worry about what any individual metric is. We tell you, “Look, here’s the problem and here’s what you need to do to fix it. You don’t need to worry about what the values are.” That’s essentially all abstracted away from you by runbook.cloud.
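For anyone who hasn’t poked at those free metrics, here is a minimal aws-sdk-go sketch of pulling one of them, average CPUUtilization for an Auto Scaling group. The group name is a hypothetical placeholder; this is the raw ingredient a tool like runbook.cloud consumes, not its actual code:

```go
// Fetch an hour of the free CloudWatch CPUUtilization metric for an ASG.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	cw := cloudwatch.New(session.Must(session.NewSession()))
	out, err := cw.GetMetricData(&cloudwatch.GetMetricDataInput{
		StartTime: aws.Time(time.Now().Add(-1 * time.Hour)),
		EndTime:   aws.Time(time.Now()),
		MetricDataQueries: []*cloudwatch.MetricDataQuery{{
			Id: aws.String("cpu"),
			MetricStat: &cloudwatch.MetricStat{
				Metric: &cloudwatch.Metric{
					Namespace:  aws.String("AWS/EC2"),
					MetricName: aws.String("CPUUtilization"),
					Dimensions: []*cloudwatch.Dimension{{
						Name:  aws.String("AutoScalingGroupName"),
						Value: aws.String("my-asg"), // hypothetical group name
					}},
				},
				Period: aws.Int64(300), // 5-minute resolution
				Stat:   aws.String("Average"),
			},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range out.MetricDataResults {
		fmt.Println(aws.StringValue(r.Id), aws.Float64ValueSlice(r.Values))
	}
}
```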
Corey: It seems like a very interesting direction to go in. It also further seems like exactly the sort of thing AWS should be offering but of course isn’t. Do you have the haunting fear that most people do, that Amazon is going to one day effectively try and build a native platform offering of what you do? I mean, it’s Amazon, so we know the first version is going to be pretty crappy and it’s almost guaranteed to have a stupid name. But other than that, as it iterates and starts to turn into something real, there is the chance that Amazon decides to solve this problem itself. From my perspective, from a monitoring point of view, I don’t know that I necessarily trust them to tell me when things are broken in a way that’s actionable in a reasonable period of time. There’s going to be that opportunity. But do you see them coming for you in the night someday?
Sam: If Andy Jassy is listening and he’d like to buy my company, the phone will always be answered when he calls. I think it’s possible that Amazon would come up with something like this. I have worked with Amazon for a large number of years, and I know that their strength is that they are almost not one company. There are thousands of really small units which work on their own thing, and then they bring those together.
Corey, you must have seen from the numerous billing CSVs that they can’t even agree on what they call a region: in some parts of the billing CSVs a region is called USW2, in other parts us-west-2, and in other parts they might use an airport code for the name of the region. I think it’s quite hard for someone inside Amazon to build a tool like that, or at least it’s no easier than it is for someone outside of Amazon, namely me.
I also think, if you look at some of the solutions that are out there, I get the impression that Amazon perhaps doesn’t want to be in some of these spaces. Specifically, look at AWS X-Ray. X-Ray is in theory a tracing tool. In practice, it is a tracing tool in the sense that it lets you log all the tracing data. It’s really good at logging that data, and then they give you an awful interface for searching through it. Having used X-Ray quite a bit, I believe that’s not by accident. It’s not that they didn’t know how to make a good interface; it’s that that isn’t the game they want to be in. They want to build the underlying infrastructure, and they want other people to come along, use their APIs, and build the right interface for the users.
Corey: Absolutely. It’s like this theory or philosophy that they’re operating under that, “You know, if we just provide bare primitives, maybe customers will build the things we don’t, ideally in Lambda.”
Sam: Yup, that’s absolutely true. The other thing that I would say is that I’m actually selling runbook.cloud through AWS Marketplace, which is a solution much like the Amazon retail marketplace. It’s a subscription service: you pay by the hour just like a normal AWS service, and the charges get added to your AWS bill...
Corey: Oh, please. There’s no such thing as a normal AWS service.
Sam: Well, absolutely true. But it gets added to your normal monthly AWS bill, and of course Amazon takes a cut for the privilege of doing that. So AWS is still making money from this. It almost is an AWS service; it’s just a marketplace service. One thing the AWS Marketplace team is keen to point out quite frequently is that on the Amazon retail side, 50% of transactions are done through Amazon Marketplace, and that’s where they see AWS Marketplace getting to as well.
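Mechanically, “the charges get added to your AWS bill” works by the product reporting usage to the AWS Marketplace Metering Service. Here is a hedged sketch of that reporting; the product code and dimension are invented, and the exact call (MeterUsage here, versus BatchMeterUsage for some listing types) depends on how the product is listed:

```go
// Report one hour of usage for a Marketplace-listed product; AWS then
// bills the customer on their normal AWS bill and takes its cut.
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/marketplacemetering"
)

func main() {
	mm := marketplacemetering.New(session.Must(session.NewSession()))
	_, err := mm.MeterUsage(&marketplacemetering.MeterUsageInput{
		ProductCode:    aws.String("example-product-code"), // hypothetical
		Timestamp:      aws.Time(time.Now()),
		UsageDimension: aws.String("monitored-hours"), // hypothetical dimension
		UsageQuantity:  aws.Int64(1),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```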
I think maybe it’s naivety, but I think it’s less likely that Amazon is going to try and clone something like runbook.cloud, because why would you bother if someone else is putting all the money into R&D and doing the hard work, you’re getting a cut from it anyway, and it’s fulfilling the needs of your customers?
Corey: One bit of feedback that I’ve gotten in my business for the last couple of years as I focused on Amazon bills is what about other cloud providers? And for my business, it doesn’t make a lot of sense for me to focus on providers that aren’t AWS. What about you? Do you wind up getting that feedback as far as, “Oh, what about GCP? What about Oracle Cloud? What about Azure, et cetera, et cetera, et cetera?”
Sam: That’s quite an interesting question. In my previous role, when I built a consulting company, we actually started out pre-AWS, but in the cloud world we started out with AWS, because there was no other player in the game. We did expand and build a Google Cloud practice as well, so I have quite a lot of familiarity with Google Cloud. I think for me, when I’m building a product like this, there is so much work to do to be able to accurately detect the problems that addressing multiple clouds would be extremely difficult, and Amazon has such a massive order of magnitude more customers than any other cloud platform that it doesn’t really make sense at this point in time to branch out to other clouds.
Of course, Andy Jassy excepted, I expect that at some point in time we probably will want to make a version for Azure and a version for Google Cloud, but we’re probably talking a good few years down the road here.
Corey: Absolutely. To my way of thinking, in this type of space, any of us who specialize in one particular provider are going to be able to retool to embrace a different provider far faster than some other provider is going to gain the workloads, market share, and customers to the point where the one we’re focusing on is no longer dominant. In other words, you’re not going to see these giant enterprises migrating between cloud platforms faster than the ecosystem is going to be able to understand, embrace, and work with the new provider. It’s one of those things obviously worth keeping an eye on, but it’s not one of those things where we’re going to wake up and read on the front page of the New York Times, in giant six-inch-high letters, “AWS suddenly irrelevant.” That isn’t how the world or the market works.
Sam: Yeah, exactly, and actually, if you look, the majority of computing is not on any cloud platform right now. There’s still a lot of expansion to be done. AWS isn’t going anywhere, and when you’re the market leader, you become the default choice. I think this competition is AWS’s to lose. The other cloud platforms have interesting offerings, but I’m not seeing anything that’s significantly different enough that you would want to move from AWS if that’s where you were previously.
Corey: The other aspect that I wanted to chat about with you is the fact that you built this entire service on top of serverless technology. Why did you make that decision?
Sam: I made that decision primarily for business and financial reasons rather than because of the technology. Building on serverless, I saw, was the best way to align our outlay, our costs of providing the service to our customers, with the actual amount we could charge a customer. I have a lot of experience with Kubernetes, which I know you are a massive fan of; I was using Kubernetes from about 2014 onwards, when it was very early stage, and early on I thought we’d probably be deploying runbook.cloud onto Kubernetes because it’s the platform I knew best.
I hadn’t really done anything with Lambda at any significant scale. I’d done a lot of glue code but nothing beyond that. But when I put the cost model into a spreadsheet, the upfront outlay for Kubernetes was still quite high. You need a certain critical mass of customers before it makes sense. With Lambda, I can scale exactly in line with my customer base, and when we need to decide where to optimize cost, it’s really easy: just look at which Lambda function is costing the most. That’s where we should expend our engineering effort optimizing things.
Suddenly, you have a clear way to decide what to optimize. Historically, codebases don’t get optimized further; they just get new features added on top of them, and then everyone talks about technical debt. Or people just optimize whichever bits of code they happen to work with, deciding on a whim that that should be the thing they tackle. Now, with serverless, you’ve got a clear way: you look at the bill and you say, “Okay, this is costing us the most. We can do some work optimizing this and we can save ourselves some money.” You’re paying down that technical debt, but you have a clear metric that you’re working towards, which is reducing the overall cost.
Corey: There’s a very strong economic story for serverless. I think Simon Wardley was talking about this extensively, where he was focusing on the idea of tracing capital flow throughout an organization or through an application. If you have 15 Lambda functions tied together and you know which one is costing you more, you don’t just know what it costs to serve a customer. You know what every function costs relative to the others per customer request, and it gives you a very in-depth viewpoint into where your revenue is coming from and what the economics of your business are.
Corey: I think that when you say, “I’m using serverless technology,” and people ask why, and the answer is, “Oh, for the economic story behind it,” people often hear that as, “Oh, I’m using serverless because I’m a cheap ass,” and it has nothing whatsoever to do with that. It’s very much in the realm of: you pay for exactly what you use, you don’t have to worry about provisioning, and you aren’t falling into the wonderful world of, “Oh, here are some on-demand resources I need to plan the usage of over the next three years.”
It really gets back to: you pay for exactly what you use and nothing else. It comes down to a very predictable model where you know exactly what a customer brings in, and then you really do scale with them as they bring in revenue, as opposed to having these plateaus where you buy a giant pile of things to service customers, then you’ve expanded, now it’s time to buy another big instance or something, and you go down that rat hole. It’s one of those stories that not only saves money but also lets you spend it effectively and know where it’s going.
Sam: Yeah. I think that is true, but I think it’s partly true because of the model that Amazon has chosen to use, in terms of their reserved instance model. They charge for instances per second now. If you can spin instances up quickly enough (and with things like unikernels, potentially you can start instances almost as fast as you can start a Lambda function), you could use instances in a way that means you don’t need to worry about reserving capacity upfront, except that reserving capacity is baked into the AWS economic model.
They used to talk a lot about, “Well, you’re reserving capacity because you’re literally reserving availability in a specific Availability Zone.” Obviously, they got rid of that hard link when they changed how reserved instances work. All the new reserved instance types aren’t linked to a specific allocation of capacity. You don’t have any extra allocation; there are no guarantees that you’ll be able to spin up instances like there used to be previously, when you purchased specific capacity as part of your reserved instance. I think some of the benefit of serverless is an accident, or perhaps non-accident, of how AWS has chosen to charge for their services.
Corey: Absolutely. I think they get beaten up a lot for the way the reserved instance model has historically existed. They’re making steps with things like convertible reserved instances and instance size flexibility, and that starts to make some of this better, but it’s still an analysis-paralysis-style decision. Last time I ran the numbers, there were exactly 140 different instance types you could spin up in us-east-1, and, “Okay, make sure you’re on the right one and buy a reservation for three years,” that’s daunting. They keep adding new instance families, and it becomes trickier and trickier to ensure that you’re making the right instance selection.
What I love about Lambda is that there’s a single variable you get to play with, and that is RAM allocation to the function. That’s it. You don’t have to puzzle over, “Well, what about the I/O profile? What about the CPU? What about the network capacity?” Those are all tied to how much RAM you give the function: the more RAM, the better the rest of the resourcing. That model, I think, is tremendously helpful, not merely from an economic point of view, but from the point of view of not putting decisions on people unnecessarily.
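As a small illustration of that single knob: changing a function’s memory is one API call, and the CPU and network share move with it. A minimal aws-sdk-go sketch, with a made-up function name:

```go
// Raise a function's memory; CPU and network allocation scale with it.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	svc := lambda.New(session.Must(session.NewSession()))
	_, err := svc.UpdateFunctionConfiguration(&lambda.UpdateFunctionConfigurationInput{
		FunctionName: aws.String("my-function"), // hypothetical function name
		MemorySize:   aws.Int64(1024),           // MB: the one knob to turn
	})
	if err != nil {
		log.Fatal(err)
	}
}
```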
Analysis paralysis is very real. If I’m trying to sell someone a pen, generally the right way to do that is, “Do you want blue ink or black ink?”, not, “Here’s a catalogue with 10,000 different kinds of pens.” It just comes down to getting the emotional toll of making decisions down to as few decision points as possible.
Sam: I think that’s true. Earlier on, when I was using AWS, there were definitely situations where I was lobbying various people at AWS for different types of instances, and I guess there must be lots of people who were doing the same as me, all lobbying for slightly different types of instances, which is why there are so many.
Actually, I don’t see that as so much of a negative. You look at the instances, and generally it’s relatively obvious. At least it was to me when I was doing everything on EC2: it was pretty obvious what you wanted. If you were compute-bound, you wanted a C instance. If you were doing stuff that needed GPUs, you needed a GPU instance. If you had memory-bound tasks, you wanted an R instance; if you weren’t really sure, or you had a mix, you wanted an M instance. You pick the latest generation that’s available in the region you’re running in. I agree, there are massive numbers of different types of instances, but actually, there are...
Corey: Well, sure, you’re absolutely right. But recently, they’ve also extended that to, “Okay, now with different types of disks. Some are NVMe, some are not. This one has an extra-fast CPU in it, but it’s designed for fewer threads at the same time.” You just wind up with these little variances between them as you get into the M suffixes and the D suffixes. I agree wholeheartedly, it used to make a lot of sense. Now, with just the flurry of new instance families, I have to go back to my traditional guidance and constantly re-evaluate it.
Sam: Yeah, I think that’s possibly true. I think most of the time you can just pick, “I’ll just use a C instance,” or, “I’ll use an M instance.” It will be close enough, and honestly, the few cents that you might save here and there are not going to be worthwhile.
Corey: Until you hit scale again and then suddenly you are having a very different, very vast conversation. That’s one of the nice things I appreciate as well about the whole Lambda model. Even at scale, the economics are still pretty decent.
Sam: They absolutely are. I had a blog post that I put out a couple of weeks ago that was very successful; I think you linked to it in your newsletter. It’s about how we actually saved significant amounts of money by using Lambda in a way that most of the experts told us, “That’s not how you should use Lambda.” I don’t know if we want to talk about that a bit more in-depth?
Sam: The received wisdom with Lambda is that the great thing about Lambda is that everything can be single-threaded, because if you need concurrency, that’s fine: you just run more Lambda functions. That’s true up to a point, and actually what we found is that you can do that, but the cost implications of running hundreds of Lambda functions in parallel when you are not fully utilizing those resources are pretty significant.
In our specific instance, for Runbook, we need to look at metrics from a large number of AWS services across a very large number of accounts, because each customer has at least one account, and on average a customer will be using at least half a dozen services. Some are using an order of magnitude more than that. We make calls to all these different AWS APIs, and obviously they take some time to respond. While we’re waiting for the responses, which might take a few hundred milliseconds, we’re doing nothing with the compute power that Lambda has provisioned for us, but we’re still having to pay for it, because we pay by the hundred milliseconds in a Lambda world.
What we did is we looked at this problem, and we decided that instead of firing up several hundred Lambda invocations concurrently, we would put everything into a single Lambda function. We’ve written everything in Go—Go has a really nice built-in programming model for doing concurrent operations—and we did the concurrency in the programming language.
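Here is a minimal sketch of that shape, with invented service names and a stub pollService standing in for Runbook’s real code: one Lambda invocation fans the slow, I/O-bound API calls out across goroutines and waits for them all:

```go
// One invocation, many concurrent API calls: goroutines idle cheaply
// while awaiting I/O, so the provisioned compute is paid for once.
package main

import (
	"context"
	"log"
	"sync"

	"github.com/aws/aws-lambda-go/lambda"
)

// pollService is a stand-in for calling one AWS API and analyzing metrics.
func pollService(ctx context.Context, svc string) error {
	// ... call the relevant AWS API and examine the results ...
	return nil
}

func handler(ctx context.Context) error {
	services := []string{"ec2", "lambda", "rds", "sqs"} // hypothetical list
	var wg sync.WaitGroup
	for _, svc := range services {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			if err := pollService(ctx, s); err != nil {
				log.Printf("polling %s: %v", s, err)
			}
		}(svc)
	}
	wg.Wait() // all calls overlap inside a single billed invocation
	return nil
}

func main() { lambda.Start(handler) }
```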
We now run a single Lambda instead of several hundred Lambdas, and of course the cost implications of that are pretty huge. We saved ourselves a significant amount of money, which is what makes the business viable. The business wouldn’t have been viable at the price point we had already selected if we had followed the received wisdom that you should only do concurrency by spinning up additional Lambda functions.
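To see why the savings are so large, here is some back-of-the-envelope arithmetic with invented numbers; at the time, Lambda compute cost roughly $0.0000167 per GB-second, billed in 100 ms increments, plus about $0.20 per million requests:

```
Fan-out:       200 invocations × 0.128 GB × 0.3 s idle-waiting on APIs
               ≈ 7.7 GB-seconds per polling cycle
Single Go fn:  1 invocation × 0.512 GB × 0.4 s (the waits overlap)
               ≈ 0.2 GB-seconds per polling cycle
```

That is roughly a 37x reduction in compute cost for the same work, before counting the 200x difference in per-request charges.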
Actually, although that’s the received wisdom in the community, once we looked deeper into it, I’m not sure AWS really believes it, because if you look, as you mentioned, you get more compute capacity as you allocate more RAM to your Lambda functions, and the higher-RAM Lambda functions actually give you more cores, which implies to me that AWS expects you to be doing things concurrently within a Lambda function. That’s just not how people in the community had expected things to be used.
Corey: I think you may be on to something there. Counterpoint: it’s easy to sit here and say, “Ah, they hid this thing in there with the expectation people would find it and use it this way.” I’m not so sure. I think that Amazon is very good at building things and then being surprised by how people wind up using them. It’s one of those areas where customers all have different use cases, different problems, different ways of thinking about things, and it winds up being a fun conversation in some cases.
I wound up talking to an engineer at AWS recently about how I use Secrets Manager instead of DynamoDB as a database. They were disappointed in me, as they should be, because it’s a terrible idea; never do it. But it gets into the question of how people use or misuse services. The fact that these are broad, primitive building blocks that you can put together a whole bunch of different ways is a whole lot of fun in some ways.
Sam: Yeah, and I think AWS has shown a willingness to go and meet the customer where the customer is, quite a lot. I used to give a talk at a few different meetups, AWS Summits, and what have you, called “Five AWS Services That Shouldn’t Exist.” The obvious service, whenever I told anyone the title before they’d heard anything else I’d said, was, “Well, you’re obviously going to say EFS,” and EFS, I think, is one of those services that, on the face of it, shouldn’t exist.
Everything in AWS has been designed around the idea that you don’t store your state in the filesystem, because that’s absolutely the wrong place to put it. But at the same time, AWS obviously responds to their customers and realized that, given the way people are used to using things and the way people want to use them, they need to have a service like EFS so that customers can work in that way. They can talk to customers about how to do things better and differently and use S3, which obviously was one of the original services, in place of EFS, but they need to provide something for them in the meantime.
I guess there are two ways of looking at it: either they show a lot of humility in saying, “Oh, actually, we didn’t think you were going to use it like that, but now we know, we’ll build it differently,” or they’re just very mercenary and say, “Well, who cares what we believe is the best way for you to do it? If you’re going to pay us money to do it, we’ll build it for you.”
Corey: I think you’re right. It comes down to what customers need. I’ve been making fun of EFS for a long time. It has gotten better, to the point now where my single ding against it is: in a cloud-native world, you probably shouldn’t be greenfielding anything that uses NFS primitives, regardless of how good the implementation thereof is. That said, that’s not realistic for companies that are migrating from on-premises environments. You’re not going to rewrite that app just to shove it up into the cloud. You’ve got to have something out there that speaks those languages.
If that’s your scenario, it’s easy to sit here and condescendingly shake your finger at people and tell them they should write their software differently. But that’s not how AWS speaks to customers. That’s how Google Cloud speaks to their customers.
Sam: I’ve got a lot of time for the Google Cloud people, and I know their roadmaps, so I can’t comment on that. But I’m pretty sure Cloud Filestore has actually been announced for Google Cloud now, so that jab doesn’t quite land anymore. I think the reality of the situation is, if you want to be the largest player in town, which AWS already is, and you don’t want to lose that position, then as you say, you have to meet the customer where the customer is. That means building solutions where you think, “Well, we wouldn’t build it like that internally at AWS, or for Amazon retail, or for Amazon’s video streaming, but if that’s what the customer needs to do, that’s fine. We shouldn’t be telling them, ‘No, this is the way you have to build it.’ We should be building what the customer needs, what’s right for them right now.”
Corey: Absolutely. Sam, thank you so much for taking the time to chat with me today. I appreciate it.
Sam: No problem. Thanks for having me on.
Corey: Thanks again. My name is Corey Quinn, this has been Sam Bashton, and this is Screaming In The Cloud.