Do you enjoy watching sports? Wear your favorite team or player’s jersey? Are you a fan who has shopped at Fanatics on the Cloud?
Today, we’re talking to Johnny Sheeley, director of Cloud engineering at Fanatics, a sports eCommerce business that manufactures and sells sports apparel. Fanatics’ Cloud engineering team provides a robust and reliable set of services that make it easy for engineers to build and deploy applications on top of the AWS platform.
Some of the highlights of the show include:
If you compete with Amazon, be ready for it to come after you; some companies avoid its Cloud perspective or go multi-Cloud (paranoia-based movement)
Focus on your ability to make your business function smoothly
Transition, migration, and abstraction may be painful, but should not stop work; paying for Cloud-agnostic technology may not be worth it
Challenges of governing use of Cloud resources to prevent mistakes/problems related to Fanatics’ security and budget
Collected data shows which teams are trending up or down and informs instance type selection and cost estimates; remain flexible and be aware of what you pay
Natural instinct is to blame people; mistakes are made, especially when a human factor is introduced to an automated system
Creating a mindset that covers both feature work and detail-oriented optimization is challenging
Cottage industry of code bases running in Big Data and other expensive realms
As a product continues to evolve and grow, governance should come along for the ride and the AWS bill should streamline itself
Will serverless, Lambda, and RDS change how Amazon charges in the future?
State of scale of AWS and developing a more palatable method for releases because people can’t keep up with them and stop paying attention
Two-Pizza Team: Amazon’s management philosophy that any team that works on a service should be able to be fed with two pizzas
Such small teams work quickly and have the freedom to fail, but Amazon has a reliability for the longevity of its different services
Full Episode Transcript:
Corey: This week's episode of Screaming In The Cloud is generously sponsored by DigitalOcean. I would argue that every cloud platform out there biases for different things. Some bias for having every feature you could possibly want, offered at various degrees of maturity. Others bias for, "Hey, I heard there's some money to be made in the cloud space, can you give us some of it?" DigitalOcean biases for neither.
To me, they've optimized for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things and they all said more or less the same thing. Other offerings have a bunch of shenanigans around root access and IP addresses. DigitalOcean makes it all simple. In 60 seconds, you have root access to a Linux box with an IP. That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offers. You always know what you're going to wind up paying this month so you don't wind up having a minor heart issue when the bill comes in. Their services are also understandable without spending three months going to cloud school. You don't have to worry about going very deep to understand what you're doing. You click a button or make an API call and you receive a cloud resource.
They also include very understandable monitoring and alerting. Lastly, they're not exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and give them a try. Visit do.co/screaming and they’ll give you a free $100 credit to try it out. That’s do.co/screaming. Thanks again to DigitalOcean for their support of Screaming In The Cloud.
Welcome to Screaming In The Cloud. I’m Corey Quinn. I’m joined today by Johnny Sheeley, who, in addition to being a fantastic dresser, is the Director of Cloud Engineering at Fanatics. Welcome to the show, Johnny.
Johnny: Hi. Thanks for having me. That’s a really wonderful wardrobe compliment. I don’t know if it’s founded, though.
Corey: That’s the beautiful thing. You have this very cultured voice so whenever people listen to you, they assume you’re well-dressed.
Johnny: Or that I’m dressed at all, which is phenomenal.
Corey: Don’t make the audio folks have to bleep out too much of this. Good Lord.
Johnny: I’m trying to make [...] happy.
Corey: As it works. Explain to us from the beginning. What does Fanatics do?
Johnny: Fanatics is a sports ecommerce business. We do everything from manufacturing sports apparel, to selling it on our own sites, to running major league sites, to running sites for international teams like Manchester City. If you are wearing a Titans jersey or some sort of soccer jersey or anything like that, you probably wound up interacting with us at some part of that.
Corey: That’s very reasonable. It effectively is sportswear sold through ecommerce. Got it.
Corey: And you run cloud engineering there. From an outsider’s view, what is cloud engineering at Fanatics, in whatever level of depth you’re comfortable sharing publicly?
Johnny: That’s actually a really difficult question and internally, we’ve been working on defining that. I’m, I believe, the fourth or fifth iteration of management in this area. I’ve got my own specific bent and what it means to me is that we provide a robust and reliable set of services that allow our engineers an easy experience of building and deploying applications on top of the AWS platform.
Historically, there have been efforts to provide operational support and do a bunch of architecture. Over time, we found that it’s just really difficult to scale, and the challenges that each individual team winds up having are really theirs to own and theirs to solve. In some ways, we’ve become more of a conduit between those teams and TAMs on the AWS side. Internally, we’re focusing a lot more on productivity tools and providing a solid platform for things like service discovery, secrets management and, your favorite, Kubernetes.
Corey: Absolutely. If you can address a common theme that has come up, not just in this show but in loud, heated arguments I’ve had with people at conferences usually over drinks, there’s this idea of, if you’re in a market that potentially competes with Amazon, that you don’t want to wind up using their cloud perspective, or if you do, you want to at least be able to go multi cloud, but at a moment’s notice, be able to pivot to a different provider.
You mentioned in the description what Fanatics does, that you are an ecommerce company. An awful lot of folks in that position try and actively avoid Amazon. Was that ever something that was on your radar?
Johnny: I think at the end of the day, everybody has to have some sort of perspective on what will happen when Amazon comes for me because they’re coming for you. It doesn’t seem to matter what business you’re in or what city you live in. They’ve got some sort of idea of how they’re going to take that and do something with it.
The overall thing that I think is important to us is to really focus on our ability to make our business function smoothly. If we have, in the back of our mind, some thoughts on what if Amazon were to make moves in a direction that would be harmful for us, then we’ll have a way to get out of that. That’s the sort of thinking that I believe we’ve really focused on.
Really, in other words, we’re not going multicloud right off the bat. There are specific use cases where we see a stellar set of tools where there could be something where we run a Microsoft program on-premise and they have disaster recovery for it that’s plug-and-play in Azure and cool, that’s an easy thing to adopt.
Or Google’s got Spanner and Dataflow and those are really interesting technologies to take a look at, but they’re not necessarily the sort of—I don’t know if I want to call it paranoia-based movement—but real specific use cases where we gain a significant benefit from moving in that direction rather than providing abstractions everywhere so that you don’t care about what cloud provider you’re on.
Corey: I generally tend to agree with that perspective. The other piece of it, of course, winds up being somewhere that is trying to figure out, “But what if this thing happens in 3-5 years? What if we need to be able to embrace that in a reasonably quick response time window?” I’m not convinced that’s necessarily as viable of a concern as people like to pretend it is. I’m a fan of building things that could at least theoretically be transitioned out.
For example, if you pick Google Cloud Spanner as a core tenet of your architecture for your software application, maybe that’s not the best move. There’s no equivalent anywhere else and you’re redesigning everything from scratch. If you’re running a traditional cloud app, then as long as you’re effectively building something that doesn’t require a tremendous number of tweaks architecturally to move somewhere else, then it’s still going to be painful, but it’s not going to be an “All work stops for 18 months while we do a migration” story.
Johnny: Absolutely. I think even the level of abstraction that you can find yourself getting into with a single provider can begin to open up those thousand cuts. There are a number of different service discovery tools. There are things like Kubernetes. There are all these different ways that you can be implementing your own platform. You’re the expert, so I’ll defer to you, but in my experience, I haven’t seen people really just take vanilla AWS for everything. Usually, it’s walking that line of, “Okay, we’re not necessarily happy with just using DNS service discovery so we’ll use Consul or we’ll use something that’s based on ZooKeeper,” or these other areas where you do wind up investing in a technology that is cloud-agnostic but you’re then paying rent on that. You continually have to update it and keep it running appropriately as you scale out, and ask what the impact is there.
I think that there’s a little tax that we all pay. But I agree with your assessment that trying to implement now what you’ll need in five years is a really difficult story, and you’ll probably wind up building something that doesn’t have anything to do with what you really need in that next time frame.
Corey: I would say that you’re probably right. The challenges generally don’t tend to come from vendor lock-in so much as they do, to some extent, a governance model that doesn’t map appropriately to what the company is trying to achieve. You’re sort of a case study in that, I would imagine, in that you describe a centralized cloud engineering group that can be loaned out to other product and feature teams.
How do you effectively govern the use of cloud resources to, for example, keep people from blowing the budget, from making hilariously awful security mistakes, from effectively just going off in a bunch of different governance directions and causing problems for the organization, either financial or risk-based?
Johnny: We have some really interesting challenges there. There are different models that I’ve seen out there where on one end of the spectrum, it seems like there is something along the lines of Netflix where you can go and just build whatever you need to. You can also expect that another team may come along and kill your stuff, so it needs to be resilient or you need some sort of remediation to be there if you expect your services to live. Then there are things like former employers where I was familiar with a very specific, blessed method of handing from team to team your jar, your deliverable, that moves into a new environment, gets load and performance tested. There’s a lot of manual stuff.
Some of the challenges that we have here at Fanatics are that we don’t have a homogeneous group of people that all have the same desires as far as management of their infrastructure and applications. There are people who want to be able to hand things off and have the security model and deployments, and operation of it all handled for them. Then there are people who want to get deep down into determining what sort of instance type makes more sense for them, and what level of network ops, and any sort of disk I/O, things like that where you wind up having a really nebulous problem if you’re governing that.
Because of the different levels of maturity amongst teams and their different focuses, we’ve got a pretty wide variety in how we actually engage with different teams. The primary focus for us right now is security, and the secondary is budget. As far as security, we have a really awesome team that is able to go out and actually, very proactively, find the issues with whether it’s an OS bug or some sort of software package that we’re leveraging, be able to work with each individual team so that depending on the level of exposure of their application, they can identify like, “Hey, we need to remediate this immediately." Or, "Maybe this is an internal tool that is actually locked down in a number of other ways." It’s okay that they’ve got some sort of SQL injection issue. But keep that in the back of your pocket, and at some point, you probably want to fix that.
On the other end of the spectrum, we’ve got this budget thing, and we’ve got a number of teams that are asked by our business to deliver tremendous amounts of data processing in a very narrow time window. We want, especially as we’re approaching Black Friday, Cyber Monday, and some of the different hot markets that we serve, an ability for our internal users to be able to see, “Hey, I need to go and order 10,000 more of these jerseys or another 30,000 hats because the team that’s looking like they’re going to win will wind up really causing us to sell out of what we’ve got.” Or we need to be able to process a lot of events in near real time as the World Series ends, or some other major event is ending, so we can actually have that real-time view of, “Hey, this is what sales are doing. Maybe something’s going on with this part of the system.”
It becomes a really interesting challenge because all that data is funneled through my team and winds up being essentially shared out to other teams. We give a bit of feedback on, “Hey, you’re trending up, you’re trending down. This is great. It looks like you may be adopting different things or maybe you should be looking at different instance types.” We've actually got a principal engineer here who focuses a lot on whether people are using the right instance types and whether we’ve got the right reservations. But the model that we’re aiming to get to is really being able to calculate, based on a declarative model, what sort of cost you’re going to be incurring and where your services are actually exposed, so that we can do static analysis of what our entire cloud architecture looks like and be able to predict, “Hey, this commit that you just checked in to provide more Cassandra servers, that’s actually going to cost $100,000 more a month. Maybe we should reel that in and take a look at what’s going on with your team.” Alternatively, you know what your team needs to provide, so maybe that budget is actually something that is sensible. That’s a real area where I’m very interested in seeing a continued evolution within the industry as far as how that information is shared and then governed in the way that people allocate resources, especially across teams as we move more towards the shared model.
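The commit-time cost prediction Johnny describes could be sketched roughly as follows. This is a minimal illustration, not Fanatics' system: the `Declared` model, the instance type names, the hourly prices, and the review threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical hourly prices in USD; real prices vary by region and over time.
HOURLY_PRICE_USD = {
    "small-node": 0.10,
    "cassandra-node": 1.50,
}
HOURS_PER_MONTH = 730  # average number of hours in a month


@dataclass(frozen=True)
class Declared:
    """One entry in a (hypothetical) declarative infrastructure model."""
    service: str
    instance_type: str
    count: int


def monthly_cost(model):
    """Estimated monthly cost in USD for a list of Declared resources."""
    return sum(
        HOURLY_PRICE_USD[r.instance_type] * r.count * HOURS_PER_MONTH
        for r in model
    )


def review_commit(before, after, threshold_usd):
    """Return (delta, needs_review) for a change to the declared model."""
    delta = monthly_cost(after) - monthly_cost(before)
    return delta, delta > threshold_usd


# A commit that scales a Cassandra cluster from 10 to 100 nodes.
before = [Declared("orders-db", "cassandra-node", 10)]
after = [Declared("orders-db", "cassandra-node", 100)]
delta, needs_review = review_commit(before, after, threshold_usd=10_000)
print(f"estimated change: ${delta:,.0f}/month, needs review: {needs_review}")
```

Because the estimate is computed from the declared model rather than from the bill, the conversation about whether the spend is sensible can happen before anything is provisioned.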
Corey: Which makes an awful lot of sense. The counterpoint, of course, always becomes one of, where is the right organizational balance per company, I suppose. You wind up, very quickly, walking into a world where you see certain companies try to map forward a governance model from the on-prem days, where everything was capex and planning ahead was something that you had to do.
They think nothing of, “Well, it used to be six weeks to provision a server. Now, we’re going to make instance provisioning take a week.” It feels like it’s the right move, but in practice, when people go through that, they never, ever turn things off because they very quickly turn into a scenario where, “Well, it takes a week to get this one back up, so I’m just going to leave it there.” You wind up effectively with a policy that works against itself.
Johnny: Absolutely. I think that it’s even fair to say that as humans, we don’t necessarily do a good job of prioritizing cleaning things up. I keep a mess at my desk on a regular basis, and it takes some level of jarring sensation that there’s dirtiness around for me to actually want to change that. Particularly, when something is digital and not in your tangible world, it’s really easy to spin up a gigantic instance that is very expensive or a cluster, walk away from that, and not really be aware. That’s something that totally has happened.
To your point, I don’t think that we’re an organization right now that optimizes for locking down every single thing. We have a lot of flexibility for our engineers. We enable them to go and use their own authority to say, “Hey, this may be a gigantic expenditure if it were to stay on for a year. But it will get something done today that I wouldn’t be able to accomplish in a number of weeks if I want to use this or I want to experiment.” That’s definitely a spot where I don’t want to be preventing anyone from being able to actually accomplish what they’re setting out to do.
It’s a rather concerning thing, as you’re talking about looking back towards the on-premise days where you kind of had to depend on a specific team or person to push your application live. That just doesn’t sound like fun for the person that’s the bottleneck. I don’t want to be there. I don’t want to be saying, “Well, this is going to cost too much, so don’t do it.” It’s a really interesting area for us to need to remain flexible, but also have some semblance of guardrail, so people aren’t necessarily shooting themselves in the foot if they really step into it accidentally.
Corey: Let’s also not escape the fact that a lot of times this is not due to any sort of bad actor scenario. Instead, it turns pretty rapidly into a scenario where you’re seeing people making honest mistakes. My entire life is built around my consultancy of optimizing every AWS bill that comes in front of me, which means that, yes, I spend time optimizing my currently, roughly $30 bill, and that’s a complete waste of my time. Recently, when the last bill came in, I had a $20 spike because I’d forgotten that VPC endpoints in a test account had been left running, and those incur a per-hour charge, to the tune of $20, which is nothing as far as my business goes. But as a percentage of my bill, it was something like over 50% of what my existing bill was, added on top of it. That’s terrible. That winds up just being the sort of thing that happens. While it’s frustrating, at scale, something like that leads to people getting yelled at, it leads to gatekeepers being put in, it leads to people being unable to spin up resources without going through vast walls of approval, and that model doesn’t seem to work, either.
Johnny: Oh, absolutely. I think I shared with you my new backup solution that I implemented very poorly. I think it something like quintupled my AWS bill just because of how it was querying S3; it wasn’t actually even writing any additional data to S3. It’s very easy to make a mistake with cloud APIs, and interacting with them.
Corey: Oh, absolutely. None of this stuff is intuitive, none of this stuff is one of those intrinsically obvious things. It all comes back to the fact that, this is complex, this is hard to do, and no one really has a great answer as far as how to get to sanity. I wish I did, believe me, I’d sell it to people. Unfortunately, I don’t have that luxury.
Johnny: Well, yeah. The best part is it’s often not even just a human. We’ve got a system that is built in-house that is similar to Fugue, or sort of a constantly running Terraform, where it sees a model of what the infrastructure should look like, queries AWS APIs to find the delta, and then remediates. There have been times where it’s killed things that are critical by accident. Thankfully, in dev environments. There have been times where it accidentally spun up things the way a human would, and a machine can do it much faster.
When you’ve got an automated system that is going out and interacting with cloud providers or anything that can be spinning up resources that are expensive, then adding a human factor to that, whether it’s the human implementer of that system, or the human variables saying, “Oh, we need to scale this cluster up.” You can very quickly cost yourself a lot of money accidentally.
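The model, delta, remediate loop Johnny describes can be sketched in a few lines. This is only an illustration over in-memory dictionaries, not the in-house system or any real AWS API: `compute_delta`, `reconcile`, and the `max_destroy` guardrail are assumptions added for the example, the last one prompted by the accidental kills he mentions.

```python
def compute_delta(desired, actual):
    """Compare desired vs. actual resource counts; return (to_create, to_destroy)."""
    to_create = {}
    to_destroy = {}
    for name in desired.keys() | actual.keys():
        diff = desired.get(name, 0) - actual.get(name, 0)
        if diff > 0:
            to_create[name] = diff
        elif diff < 0:
            to_destroy[name] = -diff
    return to_create, to_destroy


def reconcile(desired, actual, max_destroy=1):
    """Return the remediated state, refusing destroys above max_destroy.

    max_destroy is a hypothetical guardrail: large deletions are left
    for a human to approve instead of being applied automatically.
    """
    to_create, to_destroy = compute_delta(desired, actual)
    if sum(to_destroy.values()) > max_destroy:
        raise RuntimeError(f"refusing to destroy {to_destroy}; needs human approval")
    result = dict(actual)
    for name, n in to_create.items():
        result[name] = result.get(name, 0) + n
    for name, n in to_destroy.items():
        result[name] -= n
        if result[name] == 0:
            del result[name]
    return result


desired = {"web": 3, "worker": 2}
actual = {"web": 1, "worker": 2, "scratch": 1}
remediated = reconcile(desired, actual)
print(remediated)
```

In a real system, `actual` would come from querying the provider and the create and destroy steps would be API calls, which is exactly where a wrong model or a wrong variable becomes expensive quickly.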
Corey: Oh yeah, absolutely. I see that constantly. It’s one of those areas where the natural instinct is to blame people for what’s gone on, either the people who didn’t budget appropriately or the people who spun resources up, or to try to prevent this terrible thing from ever happening again. People have taken different technological approaches, with mixed results: the idea of mandating tags and shooting down infrastructure after it’s been live for a certain period of time, or of having a provisioning system that nags you every week that you’re running $X in your development account. But by and large, it mostly has to do with the mindset shift. I’m not convinced, for most companies, until they hit a reasonable point of scale, that training the engineers who can provision resources on the nuances of cloud costing is necessarily the right answer.
Johnny: I think you’re hitting the nail on the head there. I would actually be curious when you actually reach that point. It seems like there’s almost always a dividing line between the folks who are focusing on feature work and those who are really coming back to do some of the more detail-oriented, “Hey, what can we be doing more efficiently?”
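As a rough illustration of the "mandate tags and shoot down old infrastructure" approach Corey mentions, the audit step might look like the sketch below. The required tag names, the seven-day lifetime, and the resource records are all hypothetical; a real version would read resources from the provider's API and would need an exemption process.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"owner", "team"}  # hypothetical tagging policy
MAX_AGE = timedelta(days=7)        # hypothetical lifetime for dev resources


def audit(resources, now):
    """Split resources into (untagged, expired) lists of resource ids."""
    untagged, expired = [], []
    for r in resources:
        if not REQUIRED_TAGS.issubset(r["tags"]):
            untagged.append(r["id"])
        elif now - r["launched"] > MAX_AGE:
            expired.append(r["id"])
    return untagged, expired


now = datetime(2018, 10, 15, tzinfo=timezone.utc)
resources = [
    {"id": "i-1", "tags": {"owner": "ana", "team": "data"},
     "launched": now - timedelta(days=2)},
    {"id": "i-2", "tags": {"owner": "raj"},
     "launched": now - timedelta(days=1)},
    {"id": "i-3", "tags": {"owner": "li", "team": "web"},
     "launched": now - timedelta(days=30)},
]
untagged, expired = audit(resources, now)
print("missing tags:", untagged)
print("past max age:", expired)
```

The audit only reports; whether to nag the owner or terminate the resource is the policy question the conversation is really about.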
There are a few people throughout my career that I’ve met where it does feel like they’re able to spread across both of those realms. But it’s a really challenging mindset that I think you won’t find in a lot of people where, “Oh, I want to go out and create this great art, but then I want to leave the studio spotless when I’m done.” Is that something that you’ve encountered out there or do you typically find that, “Hey, management has reached its budget threshold and they really are concerned about what’s going on.”
Corey: What I tend to see is that there’s very few hard and fast rules that map to everything. You’re going to see some companies where coming in very early and structuring out a costing program makes sense. You see other companies that are riding a rocket ship, and while they’re spending tens of millions of dollars a year on cloud spend, that’s a tiny molehill next to the mountain of revenue that they’re seeing, or VC money that’s pouring in, or potential upside.
It’s one of those stories where when you’re all hands on deck in a hyper-growth company, optimizing to save a few bucks here and there is absolutely not material to your business. There does come a time where that changes.
Conversely, I’m a bootstrap consultancy of one where when my cloud bill starts spiraling away from me, if I wake up to a $20,000 bill tomorrow, I should probably fix that before I do almost anything else because it doesn’t take too many of those before my business starts winding up in trouble.
It comes down to a number of different levels of maturity. That’s why I’ve never been a fan of the models for cloud governance that tend to equate everyone to being similar. That's always going to be disparate based upon who you are and what your constraints look like.
Johnny: Yeah, totally. I was just reading a really interesting article on Facebook’s newly public bug remediation and automation of suggested changes in their codebases that makes me think that might be an interesting area for us to have. Are you familiar with the blog, Accidentally Quadratic?
Corey: I am not. That sounds like math. I was told there will be no math.
Johnny: It’s all these really, really wonderful code snippets where people have found that it’s just an inefficient algorithm being used. They share a little bit of the context around what the code base is, what the intention behind implementing it this way probably was, and how they went and made it better. You see companies like HashiCorp coming out with some new features to help predict costs. You see a lot of things like AWS Trusted Advisor and CloudHealth moving in different directions of helping to at least say reactively, “Hey, you spent too much. You need to solve this.”
I won’t be terribly surprised if this is a new cottage industry of ML or something where you’re actually looking at the codebases that are running, particularly, in the big data realm or the other truly expensive, as far as compute and data transfer areas go, where you’re not just saying, “Oh, let’s reserve instances but let’s actually take a look into your code and double check. Are you using the current version of the framework? Are you using the minimal amount of data that you could be?” Removing that from the responsibility of those creative types who are more responsible for going out and building new features for the business.
Corey: I don’t intend to say unfortunate things about a lot of the vendors in this space, but every time I’ve seen something like this today that purports to use machine learning to determine whether your resource usage is sensible, whether things should be turned up or turned down or not, they either tend to focus on a very small portion of the overall picture or they tend to have unfortunately naive assumptions baked in.
Quick and easy example: there’s no way programmatically, to distinguish between an instance that is oversized and sitting idle and should be downscaled or turned off, and an active DR site that’s going to have about three seconds of warning before it gets slammed with traffic. In one of those, you want to turn those things off. In the other scenario, you absolutely don’t. That’s a business process problem, that is not something that I’ve ever seen any realistic chance of solving via writing code.
The same story with, to be frank, a lot of these businesses’ pricing models, where it’s a percentage of your bill in order to sit there and do analysis. Okay, that’s fine, I guess. But no one likes the model. When I tried that in the very early days of the consultancy, I got laughed out of the room. Now, I charge a fixed fee with guarantees, and I wind up not having to fight that particular battle the same way.
Johnny: Yeah and I’ll totally admit that I am a total nerd and optimist and I believe that there are a bajillion areas that, in the next 20-40 to whatever years, we’ll see some really astonishing changes. I totally agree that right now, the industry has paid little attention and it’s, as you’re saying, not a high-value proposition to come into the next unicorn and say, “Hey, as you’re making that $1 billion, I can save you $20,000 every month.” That’s not really worth their time.
Corey: No, it’s really not. It becomes a better narrative around the idea of helping establish good practices, good governance, demonstrating they’re being responsible stewards of the money entrusted to them. It’s not the big win in this space.
For a second there, I thought you were going to say that, “Oh, compute the code is going to get better. In the future, the cloud bills will self-optimize.” At which point, I’ll be obligated to ask you, “Will we pay them in Bitcoin?” I’m sorry, I’m not one of those people with stars in their eyes. Everything is terrible up until this point. But the future is better.
We see evolutions of these things. I think, to some extent, the providers are going to have to come up with some form of simplification pass over their bill. They have to. The level of increase in complexity over time is not something that’s going to be sustained. The other side of that, though, is how do we get better than we are today? If we don’t have a perfect solution, we don’t need it to be. But how do we get better than we have now?
Johnny: Yeah, and isn’t that what you do?
Corey: From my perspective, but there’s only one of me. Also, to be very blunt with you, I shouldn’t have a business. This shouldn’t be as complex of a problem as it has become. You shouldn’t need to bring in a consultant to solve these things. Until companies are spending at least a certain baseline threshold on their cloud bill, I can’t help them because there’s no ROI for retaining me. Yes, I’ll come in and look at your bill, and you’ll hit break even on my services in only a couple of decades. That’s not a compelling sales pitch. It’s not something that’s ever going to work. You shouldn’t have to be spending a king’s ransom in order to make those numbers make sense. It should be something where, as the product that you’re building continues to evolve and grow, governance sort of comes along for the ride and your bill streamlines itself. I think that we’re a very long way away from that.
Johnny: When you look at the work that you’re doing, do you have other industries where you see similar consulting, where they’re either retailers or some people dealing in physical goods, where it’s a similar problem, where they need to optimize? I could imagine that there are industries where paying the right amount for raw goods is critical. Do you have any analogs that you’ve really used to help guide yourself as you've embarked down this road?
Corey: Not exactly in the way that you mean it. There’s nothing new about my business model. We saw this in the '70s and '80s where companies would come into large enterprises and say, “Hi, I’m a consultant. I’m just going to sit in the room quietly and tear apart your telephone bill.” Back then, telephone bills were complex, they were massive, and they would say, “We’ll find errors that the phone company made when they calculated these things out and when we save you money, we’ll take a percentage of it.”
That was a brilliant business model that I don’t think we can quite get back to but the beauty of that was first, it’s money that the company is never going to recoup. Secondly, it requires zero investment on the company's side other than, “Here’s the bills. Now, go away and tell us what you find.” It doesn’t require a team of engineers to sit there with someone explaining architecture. It doesn’t require a team of people to sit there and go back and forth with vendors and negotiation team. It became very simple and very streamlined. I don’t think that there’s quite a direct equivalent to that, but I did take inspiration from that philosophically.
Johnny: Do you think that there are similar evolutions that are coming in cloud computing? I mean, you look at our phone bills today, and I pay a flat rate every month, and when I go to Europe, it doubles. That’s fine because I know it’s also just gonna be another flat rate. Do you think that we could get somewhere like that, especially with all of the serverless, not just talking about Lambda, but moving into the RDS realm. It seems like at some point, Amazon could be charging me per cycle or per request or conversation, something that’s a little bit different than just this dollars and cents to resource reservation time.
Corey: I hesitate to try and predict the future. It always seems like that’s either one of those things that winds up leading very quickly to, “Yeah, you were right. No one cares,” or, “You were wrong. Now we’re going to laugh at you for eternity.” There’s no real upside to that.
I will say that the current pace that AWS seems to be on in several fronts is unsustainable. For example, right now the market is always talking about percentage growth. Well, if you make boats, and you sell them for $1 million apiece, and last year you sold one boat, and you were independent. Now, you've hired an assistant, and you sell two boats. This year, you’ve demonstrated 100% year over your growth.
Back when you had a $20-million cloud business and could say, “We’ve made $40 million this year on it,” the growth numbers were fantastic. They have eclipsed, I think, $25 billion a year now as a run rate according to their last published numbers. That is a much larger number to have to double while trying to onboard rapidly. People, generally, don’t tend to spend that much that quickly in a new platform except by accident. Accidentally charging people a few billion dollars is not great customer service. Counterpoint, it only has to work once.
Johnny: Where do I apply for that? That sounds pretty great.
Corey: Absolutely. You also see this now on the other fronts at re:Invent, for example. They get on stage, and they trot out their slides showing year over year, number of feature releases and enhancements. Okay, that is good to know that you’re not resting on your laurels and you’re innovating rapidly but that line cannot continue up forever.
We’re already at a point where there are services out there that solve problems that I’ve had, and I didn’t know they existed. I spent a fair bit of time tracking this down. Instead, I had to go down this entire merry-go-round of service discovery from time to time around what is being offered and what’s been released. That’s the reason my newsletter, Last Week in AWS, exists: so that I can at least have something I can refer back to when I get confused or caught up by something new and exciting that launched.
But eventually, you’re going to see a world where the official Amazon blog, that Jeff Barr writes, just doesn’t have enough space to wind up publishing these things. He’d collapse due to exhaustion from writing 85 posts a week. At some point, people working on these things, we all have jobs to do that don’t include analyzing new service releases or feature enhancements. We stop paying attention even to the things we really should be paying attention to. Things can’t go up and to the right forever.
What that leveling off or normalization starts to look like, I have no clue. There are smarter people than I am at Amazon who work on these things as full-time jobs. I’m just sitting here in the cheap seats throwing peanuts at people and sometimes rattling the cage and screaming.
Johnny: Well you hide that part very well.
Corey: Oh yes, the things we say in public and the things we scream in the middle of the night while working on articles.
Johnny: I like your approach. The time makes a lot of sense.
Corey: Oh, yeah. Nothing good ever happens after 3:00 AM. Even writing blog posts then: nothing good.
Johnny: When you’re describing these granular services and the solutions to problems that are not well-publicized, do you think that’s just the state of scale of AWS specifically, or do you think that it’s their approach, and folks like Google, Azure, AliCloud, or whoever out there might be taking different approaches that would actually be able to condense those solutions into something that’s more palatable, more meaningful, and easier to adopt?
Corey: I don’t know, but it’s a great question. But even now you wind up not just with competition from third parties. But for example, let’s say that I have a string that I want to send from me to you, and I want to do that programmatically via APIs. Within AWS, there are no fewer than 15 different services I can use to store that string and have it go to you, and that number is not getting smaller.
Incidentally, I’m not talking about terribly abusing a service, either. “Well, technically I could spin up Amazon Chime and message...” No, that’s not what I’m talking about. Or, “Well, theoretically I can spin up an EC2 instance and store that string in a tag.” No, none of that. We’re talking using services as generally intended. The varying differentiators between these services are getting harder and harder to discern.
Back when there was one queueing-style service, it was easy. You use that one and complain about it. Now that there are 15 of them, you pick one, become convinced you did the wrong thing, complain about it, switch to something else, trip over a constraint you didn’t know existed, and the cycle repeats until you eventually give up and go raise goats on a mountainside somewhere.
Johnny: I like goats.
Corey: That’s because you never raised them.
Johnny: This is true. You’ve got way more of a close relationship with Amazon than I do, for example.
Corey: Much to their everlasting chagrin.
Johnny: You don’t know that. That’s just what they say to your face.
Corey: You should see what they say when they think I’m not listening.
Johnny: As you’re talking about this evolution of growing nearly the same service over and over again, have you experienced anything you could share around, like, why that happens? I completely understand the concept of ‘not invented here.’ Is it something that they can find another two pizza team that is so dissatisfied with the service that they really just have to reinvent it?
Corey: Sort of. It’s a great question and this is sort of the Achilles heel from my perspective of the entire Amazon model. For those who are unaware, the term ‘two pizza team’ is an Amazon management philosophy. They believe that any team that works on a service should be able to be fed with two pizzas. My take on that is, “You’re not allowed on the team unless you can eat two entire pizzas yourself.” History will say which was better.
As they’re building these things out into small teams, they get ideas, they do internal style of bake-offs, to my understanding, and that’s why you wind up with services that wind up competing with one another. They move very quickly, they have the freedom to fail, which is incredibly valuable, and by the time something launches, it's generally already got customers lined up to use it. They’re not building things and hoping that people use these things one day. They have customers who are asking for these specific things that they’ve built.
The counterpoint and the pain that many of us experience is that, anything that depends upon a shared service for all of those is very difficult. Take a look at the console, for example. You have to unify all of those services and present them in the same way. That’s really hard. You take a look at other shared services like the bill. Every different service team has a different billing model and the numbers of dimensions and metrics that wind up influencing that bill.
The billing system alone is an incredible service that most people don’t understand as far as the sheer volume of data that it has to process and what it has to do to get those bills out to people on time. But people’s only interaction with that is at the end with the output where first, it’s a bill. No one’s thrilled to get one of those. Secondly, it’s super complex. No one likes that either because, "Here’s what we’re charging you. Here’s why." You look at that and you feel dumb, which is a crappy customer experience. How do they fix that? I couldn’t tell you.
Johnny: You’re going to feel dumb because you feel dumb because you’re dumb? There’s some basic expectation there.
Corey: I tend to not be a big fan of blaming people who are confused or annoyed over the bill itself and anything in the space. There’s no simple problem in anything that touches the cloud. If your answer to a problem is, “Oh, you should just..." stop speaking there because you’re already wrong.
Johnny: Yeah. That was more a comment on me feeling perpetually dumb which is something that I’m dealing with personally. One of the things you mentioned in there that I think is really interesting is what you call the freedom to fail. I’ve also seen you talk about the reliability that Amazon has as far as the longevity of the different services. What does that mean when you say that they’ve got the freedom to fail? Is that something that’s just internal? The project may not make it to production or have you seen instances where there just aren’t enough people using this thing, so we’re actually going to be sunsetting it and have some potential, significant impact on users?
Corey: I’ve never seen them sunset a product. I’ve seen them deprecate things a couple of times in strange ways. The first is Reduced Redundancy Storage. It no longer participates in price cuts, it’s an S3 storage class, and it now costs more than the good storage. It’s still there if you want to use it. But the one that I find more interesting is SimpleDB. You don’t see it in the console, it’s not advertised, and relevant to this conversation, Andy Jassy, the CEO of AWS, publicly referred to it as a failed service which is fascinating to me.
The value of being able to say something like that publicly even though it still has active users on it. There’s still a service team maintaining it—incidentally, that feels like the saddest job in the world—but it’s not something that they’re ever going to turn off completely because they made a commitment to customers that, "You can build a business on this." Until that last customer gets off of using that product or service, Amazon’s going to continue to honor that as best I can tell. Now, I’d be surprised at this point if they don’t have teams of people actively working with some customers to migrate them off so they can finally turn it off. But to date, that hasn’t happened. I’m not particularly worried about trusting Amazon with my production infrastructure.
Johnny: That’s fair.
Corey: As opposed to other cloud companies who turn things off for kicks.
Johnny: I believe that could fall under some form of chaos. That’s just branding that’s missing.
Corey: Absolutely. “We’ve decided to turn off the database that you’re building everything on top of. Have a good day.” Yeah. No one’s having a good day when that happens.
Johnny: Business chaos.
Corey: Exactly. If people want to talk to you more, where should they find you in this wide internet of ours?
Johnny: Probably the place that I’m interacting the most is on the Rands Leadership Slack. I’m on the Gophers Slack and Hangops Slack, as well. But there’s Twitter, or they can just email me at email@example.com. I’m out there.
Corey: Perfect. I will throw links to those things in the show notes.
Corey: Thank you so much for taking the time to speak with me today. It’s appreciated.
Johnny: Yeah. Thanks, Corey. This is awesome.
Corey: It really has been. Johnny Sheeley, Director of Cloud Engineering at Fanatics. I’m Corey Quinn, and this is Screaming In The Cloud.