Google Cloud Platform (GCP) turned off a customer that it thought was doing something out of bounds. This led to an Internet outrage, and GCP tried to explain itself and prevent the problem in the future.
Today, we’re talking to Daniel Compton, an independent software consultant who focuses on Clojure and large-scale systems. He’s currently building Deps, a private Maven repository service. We pick Daniel’s brain about the GCP issue as a third-party observer, especially because he wrote a post called Google Cloud Platform - The Good, Bad, and Ugly (It’s Mostly Good).
Some of the highlights of the show include:
Recommendations: Use enterprise billing - costs thousands of dollars; add phone number and extra credit card to Google account; get support contract
Google describing what happened and how it plans to prevent it in the future seemed reasonable; but why did it take this for Google to make changes?
GCP has inherited cultural issues that don’t work in the enterprise market; GCP is painfully learning that they need to change some things
Google tends to focus on writing services aimed purely at developers; it struggles to put itself in the shoes of corporate-enterprise IT shops
GCP has a few key design decisions that set it apart from AWS; focuses on global resources rather than regional resources
When picking a provider, is there a clear winner? AWS or GCP? Consider company’s values, internal capabilities, resources needed, and workload
GCP’s tendency to end service on something people are still using vs. AWS never ending a service tends to push people in one direction
GCP has built a smaller set of services that are easy to get started with, while AWS has an overwhelming number of services
Different Philosophies: Not every developer writes software as if they work at Google; AWS meets customers where they are, fixes issues, and drops prices
GCP understands where it needs to catch up and continues to iterate and release features
Full Episode Transcript:
Corey: This week’s episode of Screaming In The Cloud is generously sponsored by DigitalOcean. I’m going to argue that every cloud platform out there biases for different things. Some bias for having every feature you could possibly want offered as an added service at varying degrees of maturity. Others bias for, “Hey, we heard there’s some money to be made in the cloud space. Can you give us some of it?”
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they’re using it for various things, and they all said more or less the same thing. Other offerings have a bunch of shenanigans around root access and IP addresses. DigitalOcean makes it all simple, “In 60 seconds, you have root access to a Linux box with an IP,” that’s a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed-price offerings. You always know what you’re going to wind up paying this month, so you don’t wind up having a minor heart issue when the bill comes in. Their services are also understandable, without spending three months going to cloud school. You don’t have to worry about going very deep to understand what you’re doing. It’s click a button or make an API call, and you receive a cloud resource. They also include very understandable monitoring and alerting.
Lastly, they’re not exactly what I would call small-time. Over 150,000 businesses are using them today. Go ahead and give them a try. Visit do.co/screaming and they’ll give you a free $100 credit to try that. That’s do.co/screaming. Thanks again to DigitalOcean for their support to Screaming In The Cloud.
Corey: Hello and welcome to Screaming in the Cloud. This week, I am joined by Daniel Compton who is an independent consultant based in New Zealand. Welcome to the show, Daniel.
Daniel: Thanks for having me.
Corey: Thank you for taking the time to be here. You came to my notice a few weeks back when there was a bit of a kerfuffle with respect to GCP turning off a customer that they thought was doing something a little out of bounds, and it led to internet outrage. People are always mad. Google wound up posting a whole in-depth explanation of what happened, what they're doing to prevent this happening in the future, and that led me to a blogpost you not only have been writing but maintaining actively.
It comes from a perspective of looking at Google Cloud Platform, GCP, from the perspective of someone who is also familiar with AWS. There have been comparisons like this before but most of them tend to come from people with a particular horse in the race. You partner with neither company, you're effectively an independent, third-party observer, and I thought that you had one of the best write-ups that I've ever seen from that perspective and wanted to pick your brain.
Corey: As far as that unfortunate circumstance where Google wound up turning a customer off, what happened there for those who aren't familiar?
Daniel: There was a company but I don't think we ever found out exactly who they were. It sounded like they were doing some industrial work with windmills and big machinery, and they were running some of that on Google Cloud. They got a notice saying, "Your project has been shut down because of something. Something doesn't look right about it and we're scheduled to turn everything off within 30 days." That clearly scared them and they were trying to contact Google and there were no contact numbers. It was the same kind of story that we've seen on Google's consumer services for many years now, but it applied to Google Cloud where the stakes are considerably higher.
They posted a blogpost about it and this turned into a big thing. It got a lot of negative attention for Google Cloud and so they posted some responses at the time about it. The big one that came up–there were a few recommendations that came up. The first was you should use enterprise billing from Google Cloud, which is a feature I'd never heard of until this time. It's a way where you go through extra verification and they promise not to shut you down if they detect something bad is going on. That's something but, when I looked into that, that costs–you've got to be paying at least $2,500 a month to qualify for that, which I wasn't and I suspect many people using Google Cloud aren't. The other solutions were to add a phone number to your Google Cloud account, add another credit card as backup and, pretty correctly, get a support contract with Google Cloud.
Corey: When all of that fell out, despite the internet rage machine that likes to kick off on Hacker News or on Twitter and drag people under the bus–and I admit that I'm occasionally guilty of participating in that myself–it's a sympathetic problem in that you run a hosting platform that gives access to all kinds of different customers, which means that, effectively, anyone with a stolen credit card number can spin up large quantities of resources and begin doing terrible things with it.
Shutting down anything that has a hint of suspicion to it is obviously not a great plan, but being completely permissive, where whatever you want to do on the platform is just fine, leads to everyone blocking your network at their own border, and that doesn't work either. It's a spectrum, and where you fall on that spectrum is a very difficult problem to solve for. I do have an awful lot of sympathy for this. I thought that the mea culpa they gave in a formal blogpost about how this happened, and what they were planning on doing in order to prevent this in the future, was reasonable. It felt like they were starting to understand the level of concern this rightly causes with people who are running production infrastructure on top of their platform. My question for you as someone who's been looking at Google as an outsider for a while now is, "Do you think that that's going to stick?" first off and, secondly, "Why does it take something going this far afield to get Google to acknowledge that type of thing?"
Daniel: I definitely think it is going to stick. I think it's clearly gotten enough attention that they're making changes internally to prevent this from ever happening again. It's definitely going to cost them–I'm sure it's already cost them customers and it's going to cost them customers for the next several years. That kind of reputational damage isn't repaired quickly and so Google's got a long way to go. The number of people who saw that blogpost would be a tenth or maybe a hundredth of the people who saw Google Cloud Shuts Down Your Account. Just from that point of view, they've got a long way to go to repair that damage. I think there's some cultural issues that Google Cloud has inherited from Google, the consumer organization, where these kinds of things–what they had to do to scale or at least that's what they chose to do to be able to scale up to their current size. There's some behaviors like that which just don't fly in the enterprise market, and they're learning painfully that they need to change some of those things.
Corey: One question that always leaps to my mind is–and this might be an unfair characterization, but it's always felt to some extent, like Google focuses on writing services aimed purely at developers who are similar to developers that would be found at Google. It seems they struggle to put themselves in the shoes of, for example, corporate enterprise IT shops or companies whose entire ethos does not necessarily revolve around technology. Is that an unfair stereotype in your experience?
Daniel: I don't know if that would be an unfair stereotype. I think Google Cloud definitely has a particular philosophy and product design bent, and that's different to AWS–and we can go a little more into that if you'd like–but that definitely does mean that some things are going to work, depending on the perspective you're coming from. Some things are going to be more suited to you from AWS or, perhaps, Google Cloud.
Corey: You've now gone into a fairly deep dive on both AWS and on Google Cloud. Based upon that–and you go into this in extreme levels of depth in that blogpost but, at a high level, what is your takeaway?
Daniel: At a high level, my main takeaway is that Google Cloud has a few key design decisions that really set it apart from AWS. The big one is, from a developer's perspective, the focus on global resources rather than regional ones. What I mean by that is that, in AWS, pretty much everything you do–not entirely everything, but most of the things you do–is scoped to a region or perhaps even a zone. That means that all of your resources are stuck within that zone, and if you ever want to cross out into other regions, then that can be quite a lot of work to egress those points, whereas Google Cloud has instead architected their system to be sort of global by default. Many of the resources that you use are global: things like disk images, the view in the console where you can see across all of the regions at a single time, and the key management services that you use. That's kind of a big thing from a developer's point of view, especially if you're looking to run across multiple regions.
Corey: Do you find that the idea of a shared control plane in that context has–I found that the counterpoint to that shared control plane where everything is global is that it does open the door for outages that are world-spanning. When you have a harsh boundary at the different region level, you might wind up losing Oregon or Virginia but the rest of the world is generally going to be okay. In fact, I don't believe that, in the past 12 years, we've seen a single global service outage for virtually anything that AWS has done.
Daniel: That's definitely the trade-off and, in fact, just two weeks ago, there was about half an hour or so outage that was on the HTTP load balancers, which also affected Stackdriver, their monitoring service, and a few other things.
Corey: For clarity, that was an outage from the GCP side.
Daniel: That's the trade-off, really, that they're promising a lot there, and you definitely need to–you've got a high level of dependence on them. If those load balancers go down, there's very little you can do. At the time, I was running Deps in production on Google Cloud and so, during that outage, I was looking, "Do I just bypass the load balancers entirely and redirect stuff to the instances to work around that?" and, luckily, everything came back quickly enough. Perhaps there was a little bit of a scary moment there.
Corey: Absolutely. "Oh, by the way, everything's broken. It'll be fixed soon," even if true, for some use cases, can be absolutely terrifying. It's, "Well, we have paying customers and we're losing money by the minute so what's going on?" is the natural, immediate panic reaction for most of us.
Daniel: I'm surely going to learn from that, and there's been–before I was using Google Cloud pretty seriously, I know–in years past, they've also had some outages on the load balancers. There's definitely a risk you take and, yeah, it's one that I'm happy to take at the moment for the features and benefits that it provides but it's definitely something I keep my eye on.
Corey: Today, let's pretend that you're a new customer, you're about to build out a thing and the time has come to pick a cloud provider. You narrow it down to GCP or AWS. Is there a clear winner today?
Daniel: I don't think there's a clear winner for everybody. I don't think either strictly dominates the other, and I think that the things that you need to think about are, first, what are your values as a company, what are the principles and the things that you really value, what are your internal capabilities, and what is your workload like that you're trying to run on this, because there's some specialty things in both AWS and GCP that, if they fit your workload, can be gold. Those are really the big differences.
Certainly, from an ease of management perspective, I would say Google Cloud definitely wins there. You look at the ever-widening number of different instance types on AWS, and Google Cloud has thus far managed to keep things much, much simpler. There's just a single–basically, undifferentiated vCPUs and memory that you can choose. You can choose the processor family that you want if you really want to, although they don't, really, sort of push you down that path too much and then you just choose how many CPUs do you want, how much memory do you want, and you can pick just about anything in that configuration space that you'd like.
Corey: Yeah, but as a counterpoint, if you go down that path, how are you going to kill two and a half months doing RI calculations?
Daniel: The pricing in Google Cloud is just a lot simpler to calculate and understand, and they continue to make things simpler and easier for people all the time from that perspective. That's probably not so great for your business where you're just trying to help people understand their crazy AWS space.
Corey: Believe me, I wish there wasn't a need for my business. There are many things I would rather do instead. When it comes time to pick a provider, what factors should people really consider when they're trying to decide, let's say, between GCP and AWS? It's a big decision that's kind of hard to unwind.
Daniel: Yeah, it's definitely a big decision and you're right that it's hard to unwind. This talk of multi-cloud, I guess, for some super-large companies, that makes sense, but, for many people, the costs and limiting yourself to the lowest common denominator just really make that not possible. I would look at what your team has experience in, what kind of resources you need, where they're running–where the regions are that they are running in. It's not an easy decision, and I spent probably far more time than I would like to admit, evaluating Google Cloud and AWS and a few other cloud providers before I settled on Google Cloud.
Corey: A common criticism that some people, who may or may not include me among them, have levied against Google historically has been their propensity to end-of-life things that people are using. The other side of that coin is that AWS will launch a new service and that service, effectively, is going to be the trunkless legs of stone in the desert or, "King of Kings, look upon my works, ye mighty, and despair." That service is still running after the apocalypse, and that tends to wind up pushing people in one direction or another. It does definitely bloat and complicate the AWS service catalog, but it does feel like you can rely on anything that AWS launches to a degree that you can't potentially do with GCP. Thoughts?
Daniel: I think it's important to distinguish the consumer Google from the Google Cloud. For consumer Google, shutting down products is something they do relatively often and they pay for it every time in Hacker News comments.
Corey: Mean tweets are absolutely something that every product manager should take into deep consideration. "Will this offend someone on the internet before we do it?" Yeah, that should drive all the corporate decision-making.
Daniel: To my knowledge, I don't think Google Cloud has shut–once something's become general availability, I don't think anything's been removed or shut down from there. They have shut down other products in the past and people from the outside look at Google and they don't distinguish necessarily between Google Cloud and Google; they just say, "Google shuts down services." It's the same logo everywhere and so they think, "Well, how can I trust Google Cloud? Are they going to shut down some of the services that I rely on?" or, almost as bad, "Are they going to raise the price on me 40 times?" or whatever the recent Google Maps price increase was. These are sort of unforced errors from my perspective that are going to cost them a lot, whereas AWS just isn't making those errors and they pay for it in complexity, certainly, but, from a business perspective, I think, usually, people would prefer to be able to just rely on something, to know that it's going to be there and it's never going to get more expensive and only ever gets cheaper. That's something that AWS has done really well.
Corey: Absolutely. The challenge, too–and this sounds like a bit of a backhanded compliment in some ways but it's not intended that way–but GCP has built out a smaller set of services that are relatively easy to get started with as opposed to, "Oh, I'm going to spin up something new in AWS. I've never heard of it before. Let's see what happens. Oh, my God, I'm staring at a list of 120 services. I don't know what any of them do. I'm going to go raise goats instead." There's something to be said for being more straightforward in your offering and much more defined in messaging. Do you find that that's resonating? Right now, I look at an AWS console and I have a decent idea of what I'm looking at because I've been institutionalized for 12 years of staring at these things. For someone who's new, I don't see that that's there.
Daniel: I haven't been looking at AWS for 12 years but I can look at the console and get a reasonable understanding of what I can look at and what I can ignore but, definitely, for someone coming in cold, there's a lot of stuff there and just even getting your bearings to even understand where you should be looking or what you should be doing there is a big job. That's one of the things that I think Google Cloud is stronger at for me. It may be not for everybody. Maybe some people prefer the AWS perspective but, certainly, for smaller teams or teams who want things to be more simple, you generally get a small list of well-built flexible primitives rather than AWS's 18 different queuing services which all are slightly different and they're all relevant in a slightly different context; Google Cloud just has one. That's probably the best example I have of the different product philosophies there in simplicity.
Corey: I would also argue that there might be a company ethos discussion here with respect to how each company respectively views its customers. Google seems to me–and please feel free to correct me if I'm wrong in my assessment–that they believe that most of the world should write software the way that Google engineers tend to write software, and that's not inherently a bad thing. Google software engineers are incredible. The counterpoint is that if you take a look across the entire ecosystem, not every developer writes software like they work at Google, to the same network design principles, to the same baseline level of quality, from the same perspective. Conversely, it feels like Amazon throws a lot of ridiculous but closely-related services in an effort to meet customers where they are. Is that a fair characterization?
Daniel: I don't know if I'd say Google expects you to write software in that way but I think that Google Cloud is definitely moving in a direction where they're giving you the tools so that you can write software in the way that Google does, the global stuff, global load balancers, things that span the world. These are primitives that Google uses internally and uses them to run massive fleets of software. One of the promises or the dreams of Google Cloud is that you, too, can write software that runs like Google.
I would say that EC2 also–I would say also that AWS has a product philosophy that you have to buy into; it's just a different one and it's probably more flexible. I'd give them that but, underneath this, there's sort of a meta-point I wanted to make about the ethos, which is that, the last few weeks, I've been looking at AWS Athena, which is a hosted Presto service from Amazon where you can put a bunch of data in S3 and query it at super-fast speeds. We discovered a bug that affected the billing of their service. It's charged by the number of bytes read and if you touch a particular column, type some column definitions in your queries, that ends up costing you to read the whole partition.
I was talking with my boss about this and I came away and realized that I had complete faith that AWS was going to do the right thing eventually and fix the bug and bring the pricing down for us. There was no question in my mind that AWS is going to–my perspective of them is they're always trying to work for the customer, bring the prices down. It's not that Google Cloud doesn't have that; it's just they don't have the reputation and those many years of proving it behind them. This is kind of an empty "identity" on the Google Cloud's side. Who is the Jeff Barr of Google Cloud? I don't really see anyone, and I think that's something that they would do well to develop.
Corey: Absolutely. The counterpoint, of course, is that, credit where due, they broke the mold when they made Jeff Barr.
Daniel: Yes, they're not going to easily find another one like him.
Corey: A common observation has been that Google's feature set is, in some ways, behind AWS. That's not surprising in that, for the first five years that AWS existed, the other major players more or less ignored them for whatever reason, and they had a tremendous head start. Now, in some cases, that let them iterate and advance very quickly. In other cases, that let them go on exciting journeys and discovering exactly what didn't work. SimpleDB, I'm looking at you. How do you feel that GCP is going about catching up in that context?
Daniel: When I started writing my blogpost, I started around January of this year and I had a bunch of complaints in there. I drafted it all out, had started on all of the things I wanted to talk about but didn't flesh it out all the way through. Every couple of weeks or so, a new announcement from Google Cloud would come up and it would invalidate one of the points in my post. I would be frustrated because I'd go, "Well, there was something I was going to talk about and now it's just a non-issue."
I'm a small-scale developer. I have no need to post developer. I don't have a ton of insight on what enterprises are looking for from Google but, at least from my perspective, it definitely seems like they understand where they need to catch up and they're doing so; they're continuing to iterate and release the features. There is definitely a feature gap there and I think they are working their best to catch up. The challenge for them, though, is that AWS is not standing still; they are accelerating much faster than Google is at the moment, honestly, and re:Invent is not that far away. You can only imagine what […] Amazon's going to give us.
Corey: I have been hinted to by little birds that there should be more than one new service launching, which I'm sure is now going to take the entire world by storm. "Wait, they're going to release new things. They're not declaring victory with what they have now and moving onto selling something else?" Absolutely. They just go down to none of those. I do encourage listeners to take about 20 minutes or so and go through your blogpost. It is a fantastic point-by-point dissection of what GCP is good at, what GCP is not terrific at, and a nuanced critique of both aspects. It's really nice to see something like this. I don't see it too often, which is why I'm so glad that you could clear time in your schedule and the stars aligned to finally put both of us on a call at the same time. Where else can people go to hear your impressive thought-leading?
Daniel: I'm not sure about thought-leading, but I do write a weekly newsletter about the Clojure programming language at therepl.net and I run a private Maven repository service. That's the one that's using Google Cloud. It's called Deps and it's at deps.co.
Corey: Wonderful. I want to thank you, once again, for joining me. This has been Daniel Compton, an independent software consultant who focuses on Clojure and large-scale systems. I'm Corey Quinn, and this is Screaming in the Cloud.