Let’s chat about the cloud and everything in between. Most people in this world are now pretty comfortable not running physical servers themselves and trusting someone else to run them. Yet many still suffer from the psychological barrier of thinking they need to build, design, and run their own monitoring system. Fortunately, more companies are turning to Datadog.
Today, we’re talking to Ilan Rabinovitch, Datadog’s vice president of product and community. He spends his days diving into container monitoring metrics, collaborating with Datadog’s open source community, and evangelizing observability best practices. Previously, Ilan led infrastructure and reliability engineering teams at various organizations, including Ooyala and Edmunds.com. He’s active in the open source and DevOps communities, where he is a co-organizer of events, such as SCALE and Texas Linux Fest.
Some of the highlights of the show include:
Datadog is well-known, especially because it is a frequent sponsor
More organizations know their core competency is not monitoring or managing servers
Monitoring/metrics is a big data problem; Datadog takes monitoring off your plate
Alternate ways, other than using Nagios, to monitor instances and regenerate configurations
Datadog is often among the first to identify patterns when there is a widespread underlying infrastructure issue
Trends of moving from on-premise to Cloud; serverless is on the horizon
How trends affect evolution of Datadog; adjusting tools to monitor customers’ environments
Datadog’s scope is enormous; the company tries to present relevant information as the scale of what it’s watching continues to grow
Datadog’s pricing is straightforward and simple to understand; what cloud providers charge for the data Datadog pulls is less clear
Single pane of glass: too much data to cram into one small area (a dashboard)
Why didn’t monitoring catch this? Alerts need to be actionable and relevant
How to use Datadog’s workflow for setting alerts and work metrics
Datadog’s first Dash user conference will be held in July in New York; addresses how to solve real business problems, how to scale/speed up your organization
Full Episode Transcript:
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I’m joined today by Ilan Rabinovitch of Datadog where he’s the VP of Product and Community. Welcome to the show, Ilan.
Ilan: Thanks for having me, Corey.
Corey: Pleasure. Before we dive in, I want to call out that Datadog is relatively well-known in the, I guess, operational space, in no small part due to the fact that you folks sponsor an awful lot of things. I want to be very clear: this is not an episode that you are sponsoring. This is a conversation with you. It is not pay-to-play. It always feels a little weird to have folks who have sponsored things that I’ve worked on on the show and not call that out. Thank you for your support, but that’s not what’s going on here.
Ilan: We’ve always enjoyed the newsletter and all the Corey talks, but yeah, this sounds like a fun time to just chat with you about the cloud and everything in between.
Corey: Absolutely. Let’s start with that. If you take a look at the history of monitoring, or observability, or whatever it is we’re calling it this hour, the world has become more or less comfortable with the idea of not running physical servers itself and trusting someone else to run them, be it one of the large cloud providers or another. But there still seems to be a bit of a psychological barrier where people will say things like, “Oh, I’m absolutely not going to run my own servers. That’s lunacy.” And then immediately follow it up with, “But we absolutely have to build, design, and run our own monitoring system.” How do you see that evolving? How do you, I guess, combat that frankly ridiculous perspective?
Ilan: Surprisingly, it’s actually not that big of a challenge these days to get folks to do that. I think we’re now in a spot in our industry where more and more organizations are realizing that their core competency is not monitoring, it’s not managing servers, it’s not necessarily installing and racking switches in a data center. You’re focused on something else. If your customers are consuming your chat platform, then you want to build the best chat platform ever. Beyond that, you want to make sure you have the best monitoring system there is, so that you can address the infrastructure or code challenges you may encounter and have data to base your decisions on.
It turns out that Datadog has an amazing monitoring product. We don’t have a lot of challenges getting customers to use us; even when they have on-prem servers, they’re happy to take advantage of it. Monitoring and metrics, it’s a big data problem. Whether it’s metrics, tracing, or logs, these are difficult problems. If you’re having to run an indexing system for your logs, and your logs are generating terabytes, or petabytes, or whatever it might be a day, that’s a really complex database that may be as difficult for you to interact with as the clickstream logs from your consumer website, for example.
Why would you want your teams focusing on that part when they could be focusing on building out the platforms that your customers actually consume? Similarly, if you’re talking about storing metrics, whether in columnar storage or a time series database or whatever else you might come up with, in some cases the size of your monitoring data is bigger than the size of the data your customers actually interact with. These are difficult problems, and we focus on them every day, so our customers are willing to let us take that burden off their plate and specialize in making monitoring great.
Corey: Which makes an awful lot of sense. I started off my career as a systems administrator in on-prem data centers. Monitoring was one of the things that fell to me. It was always either, “It’s invisible,” or, “I’m in the dog house,” because surprise, something broke and I didn’t think to monitor the thing that broke. When I started moving to the cloud it was, “Okay, we’re going to take the same model and move it forward.” And, “Alright, I’m building an AWS environment”—I wish I had been doing it at the time—and, “Alright, let’s roll out Nagios to monitor my instances.” “Oh, we’re in an auto-scaling group.” And then I’m researching, “Okay, how quickly can I regenerate the Nagios configuration when the auto-scaling group scales?” I realized midway through, “Oh, I’m stupid. Wonderful.” That’s what opened my eyes to the idea of there being alternate ways to do this.
Ilan: I was a customer of Datadog long before I was an employee. I tried my damndest to automate my Nagios configs as fast as I could, whether it was things polling the AWS API and trying to update configs as fast as possible, or Chef recipes, or whatever it might be. It turns out, regardless of how tight that loop is, Amazon, or Google, or Azure is going to destroy your server faster than you can reload that Nagios config. Sometimes those configs take forever to reload, especially at scale, so it’s interesting.
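As an illustration of the loop Ilan describes, here is a minimal sketch that polls the EC2 API and regenerates a Nagios host config with boto3. The host template, file path, and reload step are simplified placeholders, not anyone’s production setup:

```python
import boto3

# Poll the EC2 API for running instances and regenerate a Nagios hosts file.
ec2 = boto3.client("ec2")

HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {name}
    address    {address}
}}
"""

def render_config() -> str:
    blocks = []
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                blocks.append(HOST_TEMPLATE.format(
                    name=instance["InstanceId"],
                    address=instance.get("PrivateIpAddress", "0.0.0.0"),
                ))
    return "".join(blocks)

with open("/etc/nagios/conf.d/aws_hosts.cfg", "w") as cfg:
    cfg.write(render_config())
# ...then reload Nagios (e.g. `systemctl reload nagios`). By the time a large
# config finishes reloading, an auto-scaling event may already have replaced
# some of these instances, which is exactly the race Ilan describes.
```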
Corey: Speaking to that end, one of the interesting challenges we see with very large services that everyone starts to use is that there is a monitoring gap—not necessarily on our own infrastructure—but in seeing what the underlying platform is doing. We’ve all had times at two in the morning where we’ve woken up because our pagers are going off, it’s not entirely clear what’s broken, and we effectively all prairie dog onto DevOps Twitter: “Is your stuff broken? Is your stuff broken? Oh, great.” It’s effectively, “Twitter has become the original PagerDuty.” That’s a terrible pattern to fall into, because the status pages for these providers never update as responsively as we might like them to.
There needs to be confirmation; there’s process on their side. I’m not blaming them for this, but it occurs to me that monitoring companies like Datadog are almost uniquely positioned to know, nearly before anyone else on the planet, when there’s a widespread underlying infrastructure issue. I’m not necessarily talking about cloud providers alone. I’m talking about things like routing flaps, where we’re waiting for BGP to reconverge. Is there, I guess, any effort underway to start surfacing that data in a sanitized, safe way, both without exposing your customers and without irritating large providers?
Ilan: We definitely are able to see those patterns in near real time. When something breaks in one of the cloud providers, or a popular CDN goes down, or what have you, we definitely see those patterns amongst our customers. There’s a lot of work that needs to go into things like anonymizing that data, and not every customer is willing to share that type of data, but there are definitely patterns there. We’ve done a little more work on the side of the technologies folks use. Earlier this week, on Wednesday, we released our annual Docker adoption report.
One of the things we look at there every year is, “How are folks using these technologies? What are they running in containers? How long are those containers living? Which orchestrators are popular?” And so on. It’s been interesting to look at that and see the trends of our space at such a large scale, in near real time. Seeing Docker go from its pre-1.0 days in 2014 (they just released Docker Enterprise 2.0, I believe, at DockerCon this week), watching the adoption trends around it skyrocket has been interesting.
You can very clearly see it on graphs: Kubernetes hits 1.0 here, and all of a sudden containers skyrocket even further in popularity. We’ve done similar studies for other technologies and orchestrators, like ECS, Kubernetes, and Mesos, around events like those. It’s something we’re interested in diving into more, both in terms of monitoring the cloud providers we already pull metrics from (CDNs, caches, and IaaS providers) and the technologies folks run on their VMs. There are some interesting trends there.
Corey: It’s always delicate to present that data in a way that isn’t naming and shaming. “Ha! Twitter for Pets is crappy” is not a terrific narrative to wind up spinning. But to your point about being an observer of these trends, there was a giant shift as the world started moving away from on-premise into cloud. Same with taking long-running instances and replacing them with ephemeral nodes; then you saw the container revolution that we’re in the midst of, and now people are talking frantically about serverless. In the eight years that Datadog has been around, we’ve seen a number of giant shifts in the industry. How does seeing these trends emerge shape the direction and evolution of Datadog, the service and product?
Ilan: As a product manager, a lot of these questions and studies that we run actually start off as questions internally: “What are our customers doing, and what do I need to build for them to be successful in their migration from VMs to containers, or from their on-prem environment into the cloud? What types of queries are they going to run? What types of metrics should we be pulling? Which integrations need more investment from me?” Any product manager is going to be looking for that kind of data and studying it.
It just turns out that in some cases this becomes interesting for our external customers as well, as we turn these into studies, reports, or blog posts about how to best monitor a technology or how to best take advantage of it. The big thing we’ve seen is just how fast things are churning.
If we look back on our studies even from year to year on hosts and containers: just a year ago, we were seeing containers living around two days at a time, and VMs with mean lifetimes of 23 to 30 days, depending on the environment. We’re now seeing orchestrated containers with lifetimes of less than half a day in some cases. That changes a lot about how you would define normal in your environment, how you would want to monitor things, and how you would want to manage them. Making sure we’re adjusting our tools based on all of that is important, so our customers can continue to rely on us to monitor their environments.
Corey: That makes an awful lot of sense. The challenge, of course, is that you don’t want to be the first to support something new, only to find out you spent a lot of time and effort diving into what you thought would be the next big thing, and then swing and miss. But you also don’t want to be a trailing, lagging indicator. It’s interesting.
From that perspective, I signed up for a Datadog account somewhat recently. I am probably one of the smallest, crappiest customers you can possibly imagine. I have a few Lambda functions, an API Gateway, and an AWS bill that I obsessively watch, and that’s about it. When I look inside Datadog, the product, at those aspects of it, it feels like I’m just barely scratching the surface of what the product is capable of doing.
The product is great, don’t get me wrong, but do you find it challenging to both present information in a way that’s relevant to what someone is looking for and not overwhelm people coming in from the somewhat naive perspective of, “Well, I just have these two hosts that I want to monitor. What is all of this?”
Ilan: From our perspective, the goal is to make it easier to monitor what you have and what’s important to you. That may mean making it point-and-click easy to enable a bunch of integrations for the technologies you care about. It may mean using our machine-learning capabilities around forecasting and anomaly detection to help you discover problems before you realize they were problems, without having to set a bunch of thresholds yourself.
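For a concrete picture of the threshold-free alerting Ilan mentions, here is a hedged sketch using the datadog Python client to create an anomaly-detection monitor. The metric, tags, and notification handle are placeholders, not part of the conversation:

```python
from datadog import initialize, api

# API/app keys elided; these would come from your Datadog account.
initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

# Instead of hand-picking a static threshold, let the anomalies() function
# learn the metric's normal daily/weekly pattern and alert on deviations.
api.Monitor.create(
    type="query alert",
    query=(
        "avg(last_4h):anomalies("
        "avg:trace.http.request.hits{service:web}, 'agile', 2) >= 1"
    ),
    name="Web request volume deviates from its usual pattern",
    message="Request volume looks anomalous for this time of day. @slack-oncall",
    tags=["team:web"],
)
```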
With over 300 integrations out of the box right now, it’s a little hard to say that every single one is going to be relevant to every single person. What’s important to us is that when you do adopt a technology, we’re already there to support you. Last week, EKS launched, and on launch day we were there launching our EKS support. Back in November, Amazon launched Fargate at re:Invent, and we worked with them as launch partners to get that out the door and make sure there were monitoring capabilities for it.
Like you said, there’s a lot in the platform, and maybe not every single integration or metric is for everybody. But the last thing you want is to be in a spot where you’ve picked a new system and we’re not there to monitor it, or worse yet, you don’t have the data when you’re trying to resolve an incident or work on a post-mortem to figure out what went wrong. We like to say that collecting the data is cheap; not having the data when you need it is the expensive part.
Corey: I like that approach a fair bit. The challenge on the other side is not even the cost of the service itself but, in some ways, the costs the service can incur. As an example, years ago I was working with a non-Datadog monitoring system (though this is not any monitoring system’s fault) where I was hitting rate limits pulling data out of an AWS environment. “Hey, if you want your data sooner, go ahead and increase the API rate limit,” was the automated notice we got. “Terrific. Great.” We reached out to AWS support and, to their credit, they warned us: “We’re willing to do this, but at this rate, that’s going to wind up costing you a couple orders of magnitude more than the monitoring system does. Are you sure?” That’s a difficult challenge, where it’s not just the cost of Datadog, which I will point out is very straightforward and easy to understand at a glance; it’s the, “What other costs is this going to incur on the part of the cloud provider, whose pricing is generally pretty close to inscrutable?”
Ilan: It’s simply a balancing act, I think. We have knobs to help address that challenge. We have customers that want to grab every metric the second it drops into CloudWatch, at the finest granularity available, and they want it now. And we can do that. We can turn that knob all the way up to 11 and poll CloudWatch constantly. There are costs from CloudWatch for doing so, and other cloud providers have similar cost structures. We also have the ability, if there’s a particular resource or namespace you don’t need to monitor as closely, to dial that one back. These are trade-offs between the frequency at which we collect data, latency, and cost. Over time, hopefully some of the pricing models around how cloud providers expose metrics will change, but this is a choice each person has to make for themselves.
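To make that knob concrete: if you pull CloudWatch metrics yourself with boto3, the Period parameter is roughly the dial Ilan describes. A coarser period means fewer datapoints per request and fewer API calls, at the cost of granularity. The instance ID below is a placeholder:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Period is the granularity knob: 60s gives fine-grained data but many more
# datapoints (and GetMetricStatistics calls) than a 300s or 3600s period.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # dial this up to poll more coarsely and cheaply
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```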
The next thing is that a lot of the metrics we gather within Datadog duplicate metrics that are available from the cloud providers. Are you interested in what your cloud provider thinks your CPU usage is, or are you interested in the actual CPU your VM is seeing, and memory, and network traffic, broken down by process or by container? We can probably offer the visibility you’re looking for directly from within your host using our agent, and you may not need some of that cloud data if you don’t want it. It’s also nice to have it and be able to tie the two together if you can. Of course, that’s not possible with managed services, whether it be Redshift or ELBs or some other component; the only way to get at that data is CloudWatch, so we pull it from there.
Corey: Yeah, I think you’re right. There’s only so much you can do without the platform that generated the metrics working hand-in-hand with your system. If you’re looking at this from an observer’s perspective, you’re not going to be able to change everything about it. You’re inherently limited to what is given to you.
To that end, every time I talk to someone about what they want from a monitoring system, the same phrase comes up: “a single pane of glass.” Great. Awesome. But if you take a look at even a small environment in something like Datadog, where you can look at it from a lot of different axes, then in order to gather all of that data onto a single pane of glass, terrific, you’re turning an entire wall of your office into a television that had better have retina resolution, because the dashboards are going to be really small to fit it all in.
How do you find that winds up turning into something that can be reasonably answered when a customer asks about it? It sounds a bit like arguing with Hacker News: “Oh, that doesn’t sound hard. I could build that in a weekend.” And then you come to find out it’s a little more complex than that.
Ilan: I don’t think that dashboards are the answer to everything; at least, not having every metric you could possibly look at on a dashboard above your head, in the virtual NOC, or on your extra monitor. You’re not necessarily looking to have all of that there all the time. What you want above your head, or on the dashboard in your NOC or office, are the key metrics that tell you whether or not your customers are happy and whether or not you’re serving them well.
If you’re an ecommerce site, that might be, “How many checkouts have we had this hour, or this second, or what have you?” These are what we call your work metrics. They’re the things your customers are paying you for, and they’re very good indicators of whether your service is working right now.
Something may change, though. There might be an event (maybe you got a Super Bowl ad, maybe you went on Screaming in the Cloud and now everybody wants to buy some Datadog monitoring), and your usage jumps or drops, and you’re going to want to dig into that. That’s where you want additional dashboards and other things you can query and tease out of your monitoring system. You’re going to want all that data there. But when you get to, “I’m going to have every single metric I collect up on a single dashboard, and I’m going to look at it all in my NOC,” I don’t think that’s reasonable.
You want systems like Datadog to make it easy to explore that data and to raise things up for you when something changes, whether through our anomaly detection or other ML-type capabilities that quickly identify what’s changing. That’s what you want to focus on. You want your systems to raise that for you. Hopefully that answers the question.
Corey: It does, but it also opens another one. When I was running ops teams, monitoring systems always felt like a relatively thankless thing to work on, because invariably, people tend to ignore them and never look at the dashboards until there’s an issue or something breaks. The question always raised after the fact is, “Why didn’t monitoring catch this?” You’re always building new checks and new alarms that alert when particular patterns hit, and you’re persistently fighting the last war. If you follow that to its logical conclusion, “We’ll just alert on everything.” Great.
Now in a typical day you’re getting paged 4,000 times; that is not going to make anyone happy, and their cellphones are running out of power after four hours. How do you wind up scaling it back? This may not be a product question, this may be a philosophy-of-monitoring question, but I’m curious how you see it.
Ilan: I definitely think it’s a philosophy-of-monitoring question. I’ve lived through that approach in my career as well: every time something breaks, let’s create an alert for it, and now we’re alerting people on every NTP time skew on every machine because one time it caused an issue for us.
You want to make sure that your alerts are actionable. Starting with those work metrics, the ones that are actually relevant to your customers and the services you provide, and figuring out how those systems behave, is going to be your first step.
It’s also important to clean that up fairly regularly. If something’s noisy, get rid of it. If something causes issues repeatedly, don’t just create an alert for it; fix it, so it’s not happening as frequently. It’s on monitoring systems like Datadog, and others in the space, to make that a less manual, less human process. We should be looking at your metrics, identifying things as they happen, and raising them for you, so you’re not in a never-ending battle to create alerts for every single metric.
I also think that, in some cases, a lot of this data doesn’t need to be alerted on, but you do want to have it. Collecting it is one thing; alerting on it is something else. You never want to be the team getting alerts just to prove the data’s flowing. One of the things I used to do when I was more in the operations space, before I joined Datadog, was consult with the teams in my organization: “This week you had the largest number of alerts of all the teams in the organization. Let’s sit down for an hour or two. Let’s look at what you’re paging on. Let’s look at your systems and see how we can make them more resilient, or look at your monitoring and see how we can make it more actionable.”
One team had gotten 10,000 pages in a week. There is no way they were sleeping if they were responding to every one of those; more likely, they were just ignoring the pager going off under their pillow.
In this case, what they’d done was use alerts as more of a heartbeat; many of them just said the system was alive. They weren’t actionable, and that was a problem. We were able to sit down, clean it up, flip things around, and get them to a more manageable spot. A lot of this is about that alerting and monitoring philosophy. It’s not necessarily about the tooling; it’s about deciding what you care about.
Corey: Right. The counterpoint is that when you have an outage, you didn’t know you cared about a thing until right after you really could have used an alert on it. An example would be if your site slams to a halt one day and there’s an incident, and the investigation determines, “Oh, it’s because the primary database had its disk fill up.” Then you pull up the graph and, over the past number of months, you see the line getting closer and closer to the top of the graph, until it hits it and the incident is triggered. That’s not the most defensible thing to have on a screen in an after-action report, doing a post-mortem of why the site went down, with a bunch of executives and partners who are very upset about it.
Ilan: Yeah. We have a sort of rubric, a mental model, that we suggest you go through, which we’ve written about on the Datadog site. I’ll send the links over for the show notes. As I mentioned before, we tend to suggest taking a look at those work metrics and then working your way backwards.
Say you’re an ecommerce website, and the metrics you care about are things ending up in shopping carts and things actually getting checked out of those shopping carts. That’s the top-level metric you want an alert on, and probably want on your dashboard, because it determines the health of your business. Are you making money today or not? Are your customers actually happy or not? Great. Now work backwards from that and figure out what resources go into making it happen.
If you do that for each of your systems as you build them, you’ll get to the point of, “Oh, I have a database. What does that database depend on? Ah, it depends on disk.” That’s not to say you’ll never miss anything, but that workflow is pretty helpful for figuring out what data would be actionable and when.
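As a sketch of that work-backwards workflow, the example below uses the datadog Python client to define a top-level work-metric alert and then a resource-level forecast monitor for the disk-fill scenario Corey described. Metric names, thresholds, and notification handles are all hypothetical:

```python
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

# Top of the stack: a work metric. Page when checkout volume falls off a
# cliff, because that is what the business actually feels.
api.Monitor.create(
    type="metric alert",
    query="sum(last_10m):sum:shop.checkouts.count{env:prod}.as_count() < 50",
    name="Checkout volume below baseline",
    message="Checkouts are abnormally low. @pagerduty-ecommerce",
)

# Working backwards to a resource: forecast when the database disk will fill,
# so you act before the outage instead of explaining the graph afterwards.
api.Monitor.create(
    type="query alert",
    query=(
        "max(next_1w):forecast("
        "avg:system.disk.in_use{service:db} by {host}, 'linear', 1) >= 0.9"
    ),
    name="Database disk projected to reach 90% within a week",
    message="Disk on {{host.name}} is trending toward full. @slack-db-team",
)
```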
The thing is, in most cases you don’t have just one person on a team doing these things; it’s not just one person on-call. Each team has its own work metrics. For the team running the storage underlying your databases, their work metrics are going to be around IOPS and how much storage is available. If they alert on those, you probably would have avoided that outage you just talked about.
For your database team, their work metrics are how many queries per second they’re returning and how long each of those queries takes. If they alert on those, they’ll notice, “Hey, inserts are failing right now. We should catch this before it becomes an incident. We should fix it before it impacts our users.”
If you work your way down the stack that way, you’ll catch the big things that are important, and those are the areas you really want to focus on. Everything else is data you want to have around for troubleshooting purposes, but I don’t care if CPU is at 90% if my site’s still working. That’s the most useless thing to page somebody on in the middle of the night.
Corey: Absolutely. That’s right up there with “load average is high.” There are 15 different factors that weigh into that. Great. Tell me the real-world impact on one system, and I have 200 of those. Maybe I don’t care about that particular head of cattle hanging out in that environment. One other thing that’s coming up, I believe in a month or so, is your Dash conference.
Ilan: Yup. This is our first user conference for Datadog. It’s coming up July 11th and 12th in New York. If folks are there, we would love to have you join us. We have some great presentations from folks like Shopify, Google, DraftKings, and a number of other organizations, talking about how they’re scaling up and speeding up their infrastructure, their teams, and their applications.
This is not two days of “how do I monitor X”; rather, it’s an opportunity to learn how folks are solving real business problems. Whether it’s Shopify talking about how they had to scale their infrastructure 3X while also moving it to GCP and containerizing it at the same time, or the folks at Segment talking about how they’ve built a culture of shared [...] within their organization, how they’re tackling a lot of the challenges Corey mentioned earlier around what to alert on, and how to prevent problems from recurring and build that into the processes of an organization.
There are a lot of opportunities to learn about how to, again, scale up and speed up your organization. There should be some fun news from Datadog on various features as well. If you’re in the area or want to travel out and join us in New York this summer, July 11th and 12th, we’d be happy to have you. We’ll put a discount code in the show notes. I hope to see you all in New York, and thanks for having us.
Corey: No, thank you very much for taking the time to speak with me. This has been Ilan Rabinovitch of Datadog. My name is Corey Quinn, and this is Screaming In The Cloud.