More and more enterprises are moving their on-prem applications to the Cloud. In doing so, they gain flexibility, agility, faster time-to-market, and cost effectiveness, but often trade away visibility and control.
Today, we’re talking to Archana Kesavan, senior product marketing manager at ThousandEyes. The company offers a network intelligence platform that provides visibility into Internet-centric, SaaS, or Cloud-based enterprise environments. Our discussion focuses on ThousandEyes’ 2018 Public Cloud Performance Benchmark Report.
Some of the highlights of the show include:
Purpose of Report: Reveals network performance and connectivity architecture of Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure
Report gathered more than 160 million data points by leveraging ThousandEyes’ global fleet of agents that simulate users’ application traffic
Data collected during a four-week period was run through ThousandEyes’ global inference engine to identify trends and detect anomalies
The Internet is the X factor when calibrating network performance of public Cloud providers; it is a best-effort medium with no predictability that is vulnerable to attacks
AWS’ performance predictability was lower than that of GCP and Azure, which leverage their own backbones to move user traffic
Certain regions, such as Asia, were handled better by GCP and Azure than AWS
Customers should understand the value of long-distance Internet latency when selecting a Cloud provider
Determine what the report’s data means for your business; conduct customized measurements for your environment
Full Episode Transcript:
Hello and welcome to Screaming In The Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming In The Cloud.
Corey: This week’s episode of Screaming In The Cloud is generously sponsored by Digital Ocean. From where I sit, every cloud platform out there biases for something. Some bias for offering a managed service around every possible need a customer could have. Others bias for, “Hey, we heard there’s money to be made in the cloud. Maybe give some of that to us.”
Digital Ocean, from where I sit, biases for simplicity. I’ve spoken to a number of Digital Ocean customers and they all say the same thing, which distills down to: they can get up and running in less than a minute and don’t have to spend weeks going to cloud school first. Making things simple and accessible has tremendous value in speeding up your time to market.
There’s also value in Digital Ocean offering things for a fixed price. You know what this month’s bill is going to be, you’re not going to have a minor heart issue when the bill comes due, and that winds up carrying forward a number of different ways. Their services are understandable without having to spend three months of study first. You don’t really have to go stupendously deep just to understand what you’re getting into. It’s click a button or make an API call and receive a cloud resource.
They also offer very understandable monitoring and alerting. They have a managed database offering, they have an object store, and as of late last year, they offer a managed Kubernetes offering that doesn’t require a deep understanding of Greek mythology for you to wrap your head around it. For those wondering what I’m talking about, Kubernetes is of course named after the Greek god of spending money on cloud services.
Lastly, Digital Ocean isn’t what I would call small-time. There are over 150,000 businesses using them today. Go ahead and give them a try by visiting do.co/screaming and we’ll give you a free $100 credit to try it out. That’s do.co/screaming. Thanks again to Digital Ocean for their support of Screaming In The Cloud.
Corey: Welcome to Screaming In The Cloud and I’m Corey Quinn. I’m joined this week by Archana Kesavan, who’s a Senior Product Marketing Manager at ThousandEyes, although sitting across the table from you now, I count only two. Welcome to the show.
Archana: Thanks, Corey. A pleasure to be here.
Corey: Late last year, you folks released the 2018 Public Cloud Performance Benchmark Report, which is sort of the entire reason I wanted to talk to you folks. But we’ll get into that in a minute. To start, what does ThousandEyes do?
Archana: ThousandEyes is a Network Intelligence Platform that was designed to provide visibility for today’s internet-centric SaaS or cloud-based enterprise environments. We know that enterprises are moving to the cloud. That could mean using SaaS applications like Webex, Office 365, or Salesforce, or it could mean moving their on-prem applications to a public cloud like AWS or Google Cloud, for instance.
What happens in the case of moving to the cloud is that enterprises are actually trading in flexibility, agility, time to market, maybe cost, for a lack of visibility and control. That’s where ThousandEyes comes in: to provide that end-to-end visibility across environments you own and don’t own, which is a lot in today’s internet-centric world, all the way from any user to any application, any network, and any cloud.
Corey: Which is the perfect setup for the report that you folks released. In fact, late last year, you folks did a press event where you invited a lot of luminaries from the tech press and, because someone screwed up the invitation, me, where you wound up unveiling the findings of this report. It was fascinating to sit there, watch, and map it to my own understanding of things. I know I learned a lot sitting there watching that, but what was the purpose of this report?
Archana: The 2018 Public Cloud Performance Benchmark Report is the first of its kind; last year was its inaugural version. What the report and research delve into is the network performance and network connectivity architecture of the big three: AWS, Google Cloud, and Microsoft Azure.
One of the reasons we started this effort to do this research and collect actual real-time measurements is that when IT business leaders think about the cloud, there’s a lot of information out there about the global presence of these three providers: how many data centers they have, how many regions, how many availability zones.
There are a lot of competitive metrics on pricing, for instance, but when it came to performance, we saw a complete lack of understanding of who performs better. The cloud is nothing but a strong network; the network is what binds everything together. Understanding network performance, to be able to make these cloud decisions, is something we thought was important. ThousandEyes, through its infrastructure solution, was able to gather 160 million data points spanning these three providers, and that’s what led to this research and this report.
Corey: I’m going to assume that you discovered that by instrumenting people’s browsers or applications that are deployed in the field and you don’t have 160 million different server-type things running in various places around the planet.
Archana: No, we don’t have 160 million server-type things. How we actually got this data is by leveraging ThousandEyes’ global fleet of agents, located in about 170 cities around the world. These agents simulate users coming in and are capable of emulating application traffic. Apart from sitting at these global vantage points, they’re also located within the cloud providers themselves: AWS, Google Cloud, and Azure. We have agents in about 55 regions across these providers.
We were able to orchestrate these tests across all of these vantage points and cloud providers, ran them over a period of four weeks, and periodically collected data, which resulted in about 160 million data points.
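The collection loop Archana describes, agents probing on a fixed interval and rolling the samples up for analysis, can be sketched in a few lines of Python. This is only an illustration of the general technique; the paths, sample values, and statistics here are invented stand-ins, not ThousandEyes’ actual agent implementation:

```python
import socket
import statistics
import time
from typing import Optional

def probe(host: str, port: int = 443, timeout: float = 2.0) -> Optional[float]:
    """Measure TCP connect latency to a host in milliseconds (None on failure)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None

def summarize(samples: list) -> dict:
    """Roll raw latency samples up into the kind of stats a benchmark reports."""
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "jitter_ms": statistics.pstdev(samples),  # variability, i.e. "predictability"
    }

# A real agent would call probe() against in-region endpoints every few minutes
# for weeks; here we summarize a hypothetical set of recorded samples instead.
recorded = {
    "sydney->provider-a": [2.1, 2.3, 240.0, 2.2],
    "sydney->provider-b": [2.0, 2.1, 2.2, 2.1],
}
for path, samples in recorded.items():
    print(path, summarize(samples))
```

Run a loop like this every 10 minutes, from agents in 55 regions across three providers, for four weeks, and a data set in the hundred-million-point range is roughly what falls out.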
Corey: At the end of the metrics collection phase, you wind up with an enormous pile of data about the network performance characteristics of the three major cloud providers. Now what? It turns out Excel doesn’t work so well with that many fields in it. Ask me how I know.
Archana: Exactly. The advantage of ThousandEyes is that it’s a cloud platform; we are a SaaS platform ourselves. All of the data that we collect, we run through our global inference engine, which can process this data and surface trends and detect anomalies.
We were able to look at all of this data collected over a four-week period and actually decipher the trends we saw; some of the findings that we’ll be talking about later today came from that. The platform lends itself really well not just to collecting information but to analyzing it, because that’s what you need. Data by itself is not worth anything.
Corey: I feel like we’ve kept people in suspense long enough. At a high level, what was the general cut of the findings that you uncovered? “What did we learn through this experiment?” to quote Mr. Wizard.
Archana: A lot of things but the one thing that really stood out while we were looking at the results was how the internet is the X factor when it comes to calibrating network performance of public cloud providers.
As it turns out, if you’re using AWS to host your services, user traffic stays on the public internet for as long as possible. What that did to performance metrics is that AWS’ performance predictability was relatively lower than Google Cloud’s or Microsoft Azure’s, because the other two providers, GCP and Azure, actually leverage their own backbones to move user traffic.
That was a really big finding from our report, and just to quote some numbers here, we noticed that AWS demonstrated 35% less performance stability than Google Cloud and 56% less stability than Azure in certain parts of the world.
Corey: Did you find that those parts of the world where the network performance was more variable, tended to be similar across multiple providers? Or did you find that certain regions were handled extremely well by one provider and terribly by another but there was no consistency across the big three?
Archana: We found that certain regions were handled better by Google Cloud and Azure. Asia, for instance. AWS did not fare very well in Asia. That was because AWS offloads a lot of traffic to the internet, allowing the internet to carry this traffic between users and their regions. What that means is that the internet is a best-effort medium. It has no predictability, it’s vulnerable to attacks, we’ve seen that in the past, and there’s no SLA.
When it came to Asia, we know that the quality of the internet, the stability of the internet is not as good as, say, North America, for instance. AWS deployments were impacted more than Azure or Google Cloud in Asia.
Corey: If we were having this conversation 8-10 years ago, we would be contextualizing this radically differently. Back then, when I was first dipping my toes into the AWS waters, I would spin up a pair of instances in the same availability zone and I would see occasionally 800 millisecond response time between those two instances, which is just pants on head laughable at this point. You can send packets around the world in that period of time. You don’t see that as much anymore. There’s been a lot of work clearly done in all the major providers to handle in-region latency issues.
A common criticism of the cloud back then, as a result, was that long-distance network performance was irrelevant, because latency was non-deterministic even within the datacenter. That’s not a criticism that manifests itself anymore. If I have two instances talking to each other in the same region that are taking that kind of time to get through to one another, I’m opening a support ticket because something is very wrong. Things have gotten better over time.
Now we’re starting to see this in the multiple-provider world that is the internet, and the easiest thing in the world to do when you see slow connections across the WAN is to start finger-pointing at different providers, and they’ll finger-point at other providers. By the time they get to the source of the latency, assuming they ever do, which is by no means guaranteed, you’ve long since lost interest and moved on three jobs ago.
It winds up being something that we’ve always just sort of accepted. This is the first time I’d ever seen something in this space that not only does an apples-to-apples comparison between the providers but also isn’t, to my understanding, funded by any of the providers. If you have a report like this, “Proudly sponsored by Microsoft Azure,” for example, the findings, regardless of how flattering, are generally going to be met with skepticism. In this case, if this was sponsored by one of the big cloud providers, excellent work on keeping their name off of it. That was just spectacular.
Archana: It wasn’t, actually, and that was the point of the whole report, to be this unbiased party that can actually empower enterprises to have data on hand before they make these decisions. All the providers do a great job of advertising and marketing how good they are, and costs are almost competitive, coming nearly equal to each other. But when it comes to performance, that’s again the area that was completely missing. People were in the dark. When we embarked on this effort, the idea was not to have anybody sponsor it. It was meant to be a completely neutral, educational data set that we could provide to our customers and the IT industry.
Corey: Three years ago, when I started my consultancy, I would have seen a report like this and congratulated you on a very rigorous methodology, but said that unfortunately it adds no value because no one picks a cloud based on long-distance internet latency. Then I started talking to customers. It turns out that not every customer has the same requirements, not every customer has the same constraints, and not everyone is building small-scale applications like the stuff that I build. There tend to be a whole bunch of different use cases.
Every time I deal with a new company, I wind up learning new things about how people are tying various services together. Looking at this now, I would have been very dismissive about something incredibly valuable.
If you’re listening to this and you’re thinking that there is no value in understanding long-distance internet latency beyond pure curiosity, maybe that’s true for you, but it’s not true for everyone. I am aware of a number of companies who will actively move workloads based on performance results like this. To that end, are you seeing people beginning to shift workloads as a result of what you’ve done?
Archana: One of the things the report focused on is not just understanding performance from external vantage points or end-user metrics. We also took a look at inter-region, inter-AZ, and multi-cloud performance. To your point that performance might not have been a decision metric depending on your architecture, it can be now, and we have some interesting data there. For instance, take the inter-region measurements we collected across these three providers. Say you’re based out of Sydney, Australia, and you know your primary region is going to be Sydney. That’s where you’ll pick your primary datacenter. But you’re looking for redundancy, you’re looking to load balance, you want to fail over to another region. What’s the right region to pick? That’s the question we want this report to answer, and that’s what we want enterprises to be thinking about before they even move to the cloud.
What the data showed us is that if you’re picking your primary region in Sydney, Australia, then Singapore and Bombay might not be the best secondary options if you’re going with AWS or Google Cloud. Azure did really well across Singapore and Bombay, and even Tokyo, for instance. With this data you can ask, “Okay, where do I want my secondary to be?” Obviously, you need to look at pricing, and a lot of other variables come in there. But at least now you have performance data to guide that decision-making process.
To your question, have we seen people make changes already? I think it’s harder than that, because once you’re in the cloud, ripping it out and moving is not easy. This is why we recommend that you look at this data before moving to the cloud. Have it in hand so you’re making the right decision for your enterprise: picking the right cloud, and the right regions within that cloud.
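The secondary-region decision Archana describes can be framed as a small optimization over measured latencies, where stability matters as much as the raw number. This sketch and its sample figures are hypothetical, invented for illustration rather than taken from the report:

```python
import statistics

def best_secondary(primary: str, latencies: dict) -> str:
    """Pick the failover region with the best combined latency-plus-jitter score.

    Penalizing variability as well as the median means a stable 105 ms path
    can beat a jittery 94 ms one, mirroring the report's emphasis on
    performance predictability rather than raw speed.
    """
    scores = {
        dst: statistics.median(samples) + statistics.pstdev(samples)
        for (src, dst), samples in latencies.items()
        if src == primary
    }
    return min(scores, key=scores.get)

# Hypothetical inter-region round-trip samples in milliseconds.
measured = {
    ("sydney", "singapore"): [92.0, 95.0, 180.0, 93.0],    # fast but jittery
    ("sydney", "tokyo"):     [105.0, 106.0, 104.0, 107.0],  # slower but stable
    ("sydney", "mumbai"):    [140.0, 250.0, 145.0, 300.0],
}
print(best_secondary("sydney", measured))  # prints "tokyo"
```

In practice pricing, compliance, and data-residency constraints would join latency in the scoring, but the shape of the decision stays the same.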
Corey: One thing to highlight as well is that this type of latency is incredibly important for a variety of workloads but there’s just as many, if not more, workloads where it absolutely does not matter. If you’re coming from the second category, it’s easy to handwave this away as being completely irrelevant. “I have an IoT device that sits on a shelf and it periodically reports the temperature in my office. I don’t care what the latency on something like that is,” and you’re right. You probably shouldn’t. That is not going to meaningfully change your user experience one iota.
But if you have a synchronous application living in a browser or an Electron app, and a customer is actively using it, then every time they click on something, your poorly-architected application winds up making 80 sequential requests to the origin. That’s going to be a radically different experience.
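The arithmetic behind that poorly-architected application is worth making concrete: sequential round trips multiply latency, while concurrent ones amortize it. A back-of-the-envelope sketch with made-up numbers, using the 80 requests above and the 240 ms worst-case Asia latency quoted elsewhere in the conversation:

```python
def sequential_time_ms(requests: int, rtt_ms: float) -> float:
    """Each request waits for the previous one, so latency adds up linearly."""
    return requests * rtt_ms

def concurrent_time_ms(requests: int, rtt_ms: float, max_in_flight: int) -> float:
    """Requests issued in batches pay the round-trip cost once per batch."""
    batches = -(-requests // max_in_flight)  # ceiling division
    return batches * rtt_ms

print(sequential_time_ms(80, 240.0))      # 19200.0 -> roughly 19 seconds per click
print(concurrent_time_ms(80, 240.0, 10))  # 1920.0  -> under two seconds
```

The same 240 ms round trip turns into either an unusable 19-second click or a tolerable two-second one, which is why request fan-out, not just provider choice, dominates perceived latency.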
One of the problems I tend to run into myself is what I sometimes call the Bay Area bubble, where people have a different approach to what applications should look like and how things should perform from a business context. It’s also easy to forget that we generally have good internet here. We’re generally running the latest version of Google Chrome on a MacBook, and if something gets slow, we just get a faster one because it’s been six months.
That’s not how the rest of the world works, and when you’re not sitting down the street from a very fast connection to the thing that hosts your application, it’s easy to forget that. There are apps that I love, where I never understood why people complain about latency. And then I travel abroad. When I’m sitting in Australia or Europe, suddenly what had previously been a joy to use is now actively painful. Suddenly, I see it. If you don’t feel like traveling internationally or you don’t have a passport, you can replicate this experience by using in-flight wifi or by switching to Comcast.
Archana: That’s an interesting thing you mention, because consider applications such as voice and video, a lot of which are SaaS these days. And where are these SaaS applications hosted? The majority of the time, in the public cloud. Without naming names, a lot of collaboration apps sit on AWS. And we know AWS does not have the best performance predictability when it comes to regions like Asia, for instance. Your latency can vary anywhere from 0 to 240 milliseconds at any point in time.
When you’re using a video or voice application that’s hosted on AWS, and you have latency that’s not really predictable, what does it result in? Poor user experience. Those are the types of things we and enterprises need to be aware of when moving to the public cloud.
Corey: It never ceases to astonish me just seeing how different groups work differently with clouds. Every time I think I’ve seen it all, I get to learn something new and be surprised. That’s fascinating to see. This was announced in, I want to say, November of last year.
Archana: Yeah. November 2018.
Corey: Yes, and at the end of that month, AWS released Global Accelerator, which seems to speak directly to a number of criticisms that, I wouldn’t say you lobbed against AWS, but that the data you have collated and displayed for the rest of us highlights as a shortcoming in their offering. Within a number of weeks, they had a service out ready to address these things, which is a really quick turnaround on their part for not having a lot of time to work on this.
Archana: I like to believe that the report influenced that, but yes, you’re right. Later that month in November, AWS, among the million services they launched at re:Invent, made an announcement for what’s called AWS Global Accelerator. What AWS is really saying there is you can pay them more money to ride their own backbone instead of riding the internet.
Our research very clearly showed that AWS deployments ride the internet longer than Google Cloud or Azure deployments. What AWS is saying with Global Accelerator is, “Well, we give you a choice now. If you want better performance, pay us more and ride our network.” It’s what I call monetization of their backbone. In line with the performance data we collected, they did launch this offering.
Corey: It’s easy for me to [...] my instinctive response when you say that is to bristle and get upset that AWS is charging for performance. But the counter-argument, once I put two seconds of thought into it, is that for applications that frankly don’t care about that level of latency, if it helps keep costs down, if I don’t have to pay for performance I don’t need, there is a benefit there.
For better or worse, one thing we see across all cloud providers is that you pay for better numbers in a variety of categories, depending upon what categories matter to you, which is fascinating and I guess an expression of meeting customers where they tend to be.
This may be a premature question, given that there is no 2019 report of which I’m aware, but are you seeing distinct differences yet in customers choosing to use AWS’ backbone and what that does to the performance numbers?
Archana: I think it’s a pretty recent announcement, so we haven’t actually seen comparison data to say that one is better than the other. But to your point, Corey, for some applications you might not care. The point being: baseline, see what you care about, see whether the internet works for you, see what that means for your business, and then make that choice.
Don’t blindly move into, “Well, it means better performance, which means I need to do it.” Maybe your application does not need better performance and the internet works just fine. Again, that’s where ThousandEyes comes in as well. Baseline, and use the metrics to understand if you need to make that investment.
To your point about the 2019 report, that is something that’s coming later this year. What we realized is there’s no steady state in the cloud. Doing a performance benchmark in 2018 doesn’t mean the cloud, or these numbers, will stay the same for the next few years.
Actually, we’ve seen improvement in AWS’ inter-AZ measurements in just the last year. From 2017 to 2018, AWS made some significant optimizations within their European datacenters that shrank inter-AZ latency from 5-10 milliseconds to 2 milliseconds.
There are constant improvements going on, so what we want to do is measure this every year. In 2019, we plan to announce another public cloud performance benchmark report. One of the angles we’re considering is comparing the different service tiers that providers are offering now; AWS and Google both offer tiered services that basically monetize their backbones.
Corey: As long as we do apples-to-apples comparisons, I have no problem with this. There have been stories, historically, of vendors who get very angry when you benchmark anything they have done. In fact, some early license agreements specifically prohibited it. I imagine that you haven’t gotten yelled at by any of the cloud providers for the report that you released.
Archana: No, not yet.
Corey: Not until this episode goes out anyway.
Archana: To the exact thing that you mentioned, we didn’t give one provider something better. We stated facts as they were. Our measurement methodology, our scope, our research methodology, and the data collected were all exactly the same across these three providers. It’s pretty unbiased data. We haven’t gotten yelled at yet.
Corey: It’s hard to argue with raw data. If you don’t mind the question, over what timeline was this data collected? Was this a given afternoon? Was this over a period of months? How long did you spend collecting the data?
Archana: We collected the data over a full four-week period. The way our platform works is it collects this data periodically: every 10 minutes, we have a new data set. It wasn’t like we ran it for one afternoon. It was running 24 hours a day, collecting data every 10 minutes from all three providers, across 55 regions, for that full four-week period.
Corey: What would you advise people to take away from this report? What actionable next steps should someone trying to make decisions consider as a direct result of the report findings?
Archana: First, download the report and see what it means for your business. Every business moving to the cloud is different; what enterprise A is looking for from an architecture perspective might be different for enterprise B. As the first step, go to thousandeyes.com, where the report is available for free. Download it and see what it means for your deployments.
The second step is to be a little more proactive and start running measurements customized to your environment. If there are specific availability zones you’ve deployed your microservices in, measure across those availability zones. We ran measurements across all the regions we tested and the availability zones that existed in those regions, but you might be using a different combination. Make sure that wherever your services are up and running, you’re actually testing latency, loss, and network paths across regions and across AZs. The second step is really to customize it for your environment so you get continuous monitoring information.
Corey: If people want to learn more about this report and things like it, where can they find you? Where can people find out more about findings like this and other things you folks are up to?
Archana: Definitely follow us on Twitter @thousandeyes. You can also go to thousandeyes.com, where there’s a lot of information, including the report. There’s actually another report we worked on, around DNS performance, that you can look at as well. If you want to stay tuned for more detail about the state of the internet and the state of the cloud today, I would urge you to sign up for our blog at blog.thousandeyes.com. We do outage analyses there, so every time you see an AWS outage, you’ll most likely find some insight into what’s going on. To learn more about us, I would definitely sign up for our blog as well.
Corey: Perfect. Thank you so much for taking the time to speak with me today. I appreciate it.
Archana: Thanks, Corey. It’s been my pleasure.
Corey: Archana Kesavan, Senior Product Marketing Manager at ThousandEyes. I’m Corey Quinn. This is Screaming In The Cloud.