Aurora, from Amazon Web Services (AWS), is a MySQL-compatible database service built to manage complex database structures. It offers real capabilities, but it also asks you to put a lot of trust in AWS to "just work" in ways a traditional Relational Database Service (RDS) deployment doesn't.
David Torgerson, Principal DevOps Engineer at Lucidchart, is a mystery wrapped in an enigma and virtually impossible to Google. He shares Lucidchart’s experience with migrating away from a traditional RDS to Aurora to free up developer time.
Some of the highlights of the show include:
Trade off of making someone else partially responsible for keeping your site up
Lucidchart’s overall database costs decreased 25% after switching to Aurora
Aurora unknowns: What is an I/Op in Aurora? When you write one piece of data, does it count as six I/Ops?
Multi-master Aurora is coming, promising faster failover and better disaster recovery
Aurora drawbacks: No dedicated DevOps, increased failover time, and misleading performance speed
Providers offer ways to simplify your business processes, but not ways to get out of using their products due to vendor and platform lock-in
Lucidchart is skeptical about Aurora Serverless; will use or not depending on performance
- Corey's architecture diagram on AWS
- Lucidchart’s Data Migration to Amazon Aurora
- Preview of Amazon Aurora Multi-master Sign Up
- This is My Architecture
Full Episode Transcript:
Corey: Hello and welcome to Screaming In The Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming In The Cloud.
This week’s episode of Screaming In The Cloud is generously sponsored by DigitalOcean. I’m going to argue that every cloud platform out there biases for different things. Some bias for having every feature you could possibly want offered as an added service at varying degrees of maturity. Others bias for, “Hey, we heard there’s some money to be made in the cloud space. Can you give us some of it?”
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things, and they all said more or less the same thing. Other offerings have a bunch of shenanigans around root access and IP addresses. DigitalOcean makes it all simple: "In 60 seconds, you have root access to a Linux box with an IP." That's a direct quote, albeit with the profanity about other providers taken out.
DigitalOcean also offers fixed-price offerings. You always know what you're going to wind up paying this month, so you don't wind up having a minor heart issue when the bill comes in. Their services are also understandable without spending three months going to cloud school. You don't have to worry about going very deep to understand what you're doing. It's click a button or make an API call, and you receive a cloud resource. They also include very understandable monitoring and alerting.
Lastly, they're not exactly what I would call small-time. Over 150,000 businesses are using them today. Go ahead and give them a try. Visit do.co/screaming and they'll give you a free $100 credit to try it out. That's do.co/screaming. Thanks again to DigitalOcean for their support of Screaming In The Cloud.
Welcome to Screaming In The Cloud. My name is Corey Quinn. I’m joined this week by David Torgerson who works for Lucidchart. David came to my notice when he was featured on the This Is My Architecture video that AWS featured relatively recently, talking entirely about Lucidchart’s migration away from traditional RDS into Aurora. Welcome to the show, David.
David: Thank you for having me. I’m excited to be here.
Corey: Glad you could make it. Let’s start at the very beginning. What is Aurora for those who haven’t encountered this in the wild before?
David: Aurora is a MySQL-compatible, built from the ground-up, AWS service. What the Aurora team did is they looked at some of the challenges that faced the traditional RDS or traditional database, tried to identify those challenges, and make them a push-button solution for managing a really complex database structure.
Corey: What was it that inspired you folks to, more or less, wake up one day and say, “Well, our database is awesome. But you know we’d really like to do? Throw the whole thing away and replace it with something else.” That’s not something that people tend to test on a lark of, “Oh, I’ll try wearing dress shoes this week.” No, it tends to be something that requires a bit more thought and concern going into it. What was it that motivated you folks to go down that path?
David: That’s a great question and I appreciate the way that you put that ‘throw away our existing awesome database solution for something else.’ Really, what it comes down to is we’re a development shop. Our engineers make us a lot of happy customers just by making awesome products. Anytime that we take away from them working on our products is opportunity lost.
At Lucidchart, our Ops and DevOps teams are actually comprised of members of our development organization. It's a secondary responsibility to their primary responsibility, which is to write code and implement awesome features in products.
Our database solution was nothing unique. We were running with a Master-Master MySQL implementation and it worked really well but it came with the challenges of having to have a skill set that typically is not something that most people just know. It’s not something that they’re typically exposed to.
There was a lot of overhead in maintaining our database solution. Then beyond that, besides the skill set, there’s just the complexities of always monitoring the capacity of the reads and of the writes, and making sure that our underlying disk architecture was performing enough. Designing and implementing solutions that would handle zone failures or disk failures, et cetera, it just took a lot of our time.
We had wanted to move away from managing databases ourselves for quite a while and we actually looked at quite a few managed services. None of them really met the core requirement that we had, which was free up our developer time.
There were services out there, even paid-for services, and different database architectures entirely, that would have met the needs, but they wouldn't have decreased the amount of management time required of our operational group, which, again, was comprised of our developers. It just wouldn't decrease the amount of time they have to spend on it. That's one of the primary things we were looking for: keeping the existing awesome implementation that we had while also freeing up our engineers to work on features.
Corey: Believe me, I am a firm believer in the idea of taking the undifferentiated heavy lifting and throwing that to a platform provider, but the double-edged sword of that is, to some extent, it’s an incredible amount of trust that you’re putting in AWS to manage a database for you.
Historically, if something breaks in a more traditional setup, you may not be able to fix it as quickly, but there’s a certain ‘everyone is frantically typing and trying to get things up’ sense, where it looks like we’re busy versus when there’s an Aurora outage, more or less all you can do is open a support ticket and frantically refresh the status page. Spoiler: it’s still going to say everything is green.
There is that trade-off in the sense of making someone else partially responsible for keeping your site up. Was that a concern?
David: Absolutely. In our implementation, we considered and we put a lot of time and effort into designing our databases so that they were resilient to any point of failure. We could lose an instance, or EBS volumes, even lose a zone or two and still have our Master-Master database implementation working.
The thought of not only giving that up, but also handing it over to another organization or another solution to manage for us, was a bit intimidating. It was something that went heavily into the evaluation: making sure that we were not going to be trading the management time for a degraded database solution, or something that was going to be far less stable than what we currently had. Those were the requirements we went in with, eyes open; we wanted to maintain that.
I remember at re:Invent several years ago, the comment was made during one of the keynotes that 80% of the Fortune 500 companies had started evaluating or were using Aurora. All statistics are made up, but that was a pretty impressive statistic and it really started making us think that Aurora may be possible to handle our production load and keep the uptime that we had grown accustomed to.
Corey: While we're on the topic of, shall we say, fear of the unknown, one of the quotes from your blog (and I'm not going to read the entire post by a landslide), the sentence that stuck out to me as someone who focuses on the economics of cloud is, "Without calculating in the engineering and opportunity costs, our overall database costs decreased roughly 25% after switching to Aurora." Can you talk me through that?
One of the challenges I've seen with uptake and adoption of Aurora is that unlike other database options, it charges, I believe, $0.20 per million I/O operations. Most people don't have statistics around what that looks like. It's a big question mark. Most people I talk to expect their cost to increase. You're showing the exact opposite.
David: Yeah, and it may partly be because of how we had implemented with the redundancy that we had and wanted to keep. In the blogpost, and I’ll just briefly go through this, we had two Master-Master nodes handling the writes of our database instances.
The read capacity came from having additional replicas that were pointing at those masters. We would have to scale the reads based off of BI, or business intelligence, requirements, for backup purposes, et cetera. Each of those instances had to have the same disk throughput capabilities as the other instances, simply because our application is very write-heavy.
In fact, that's the majority of the traffic that we're sending. The read capacity had to be, at a minimum, as big as the write capacity just to be able to handle it. All of our instances needed the same read/write IOPS, which meant that we were using provisioned IOPS, and that already carried a charge per IOP.
Corey: Bring the money is the short version there.
David: The 20%-25% savings really did not come from the underlying disk usage. That represented a small part of it. The IOPS were slightly more expensive when moving to Aurora than a traditional EBS volume, however, we were able to decrease the number of instances that we had, which means that we were able to decrease the copies of the data significantly.
In a traditional cluster that we were managing, we would have five instances as part of that cluster. That would mean that every write would end up as IOPS across all five instances.
Even though Amazon charges slightly less than double per IOP for Aurora, call it double, our overall I/O cost still decreased by roughly three times, simply because we didn't have to pay for five IOPS per cluster write. We only had to pay for one, but that one IOP cost twice as much.
Our disk cost did decrease slightly, but again, it was because we weren’t comparing a single instance to Aurora. We were comparing an entire cluster and what that cluster represented from both a read-write capacity to an Aurora cluster of equal capacity.
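Putting illustrative numbers on the cluster I/O arithmetic David describes (the write volume and both per-I/O unit costs here are invented for the comparison; only the relative shape, five billed copies versus one copy at roughly double the rate, comes from the discussion above):

```python
# Toy comparison of monthly I/O cost: self-managed five-instance MySQL
# cluster vs. a single Aurora cluster. All figures are illustrative
# assumptions, not actual AWS prices or Lucidchart data.

WRITES_PER_MONTH = 1_000_000_000          # hypothetical write volume

# Self-managed: every write lands on each of the five cluster members.
EBS_COST_PER_MILLION_IO = 0.10            # assumed per-I/O unit cost
CLUSTER_COPIES = 5

# Aurora: storage keeps six copies, but each logical write is billed
# once, at roughly double the per-I/O rate.
AURORA_COST_PER_MILLION_IO = 0.20

self_managed = (WRITES_PER_MONTH * CLUSTER_COPIES / 1_000_000
                * EBS_COST_PER_MILLION_IO)
aurora = WRITES_PER_MONTH / 1_000_000 * AURORA_COST_PER_MILLION_IO

print(f"self-managed: ${self_managed:,.0f}/mo")          # $500/mo
print(f"aurora:       ${aurora:,.0f}/mo")                # $200/mo
print(f"savings:      {1 - aurora / self_managed:.0%}")  # 60%
```

With these made-up inputs, paying double per I/O but for one-fifth as many billed I/Os still cuts the I/O line item by more than half, which is the dynamic David is describing.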
Again, the disk volumes, or the disk sizes and IOPS, aren't where we saw the big savings. The big savings came from our snapshots. The reason for that is when we managed MySQL ourselves, we would encrypt everything using LUKS, our own managed encryption using Linux LUKS.
What would happen is, as people made changes to their data, they would often write the same document over and over while only changing a small portion of it. Consider opening up a Lucidchart diagram that has a process block with the word 'the' in it. If you change that word 'the' to 'and' and then save it, that row is going to be updated in the database, but only a small part of that row is going to change. Just the word 'the' to 'and.'
From LUKS’ perspective and from a security perspective, it’s going to rewrite that data, but it’s going to look completely different on disk. When we take our snapshots, even though EBS snapshots are incremental, it would end up snapshotting a lot of the data that was duplicated. We are very aggressive on our snapshots, at least hourly. We keep those for an extended period of time.
At one point our snapshot cost represented over 50% of our bill. Now, we’re grown past that. That’s no longer the case. We have other workloads that are much larger, but the snapshot cost is where we saw the majority of the savings. Again, not to discount the Aurora cluster savings because that also went into it, but the majority of the savings came from our snapshots.
Corey: Did you know going into it what the savings were going to look like or was it one of those, “Well, we’ll do it as a trial and see what bears it out,” because what you’re saying makes perfect sense, but it sounds like the sort of thing that would be very hazy and nebulous until you’d seen it work.
David: Yeah, and that's absolutely correct. One of the challenges that we had with moving to Aurora is some of the unknowns that were difficult to quantify or to test, and specifically the one that you're asking about, which is how much it is actually going to cost us, wasn't something that we were able to fully identify until we had made the switch.
Here's a couple of reasons why. By the time cost came into play, we had already done the due diligence to make sure that the solution would meet our needs legally, contractually, and from our own desire to maintain a highly available, highly manageable solution.
But by the time that cost came around, we were trying to make guesses about how much EBS, the disk storage, would cost. We were trying to make guesses about how much IOPS would cost and snapshots would cost.
Aurora does publish that. However, some of the things that are a bit unclear is what is an IOP in Aurora? We know that Aurora is comprised of at least six disks across three availability zones. When we save one piece of data, is that one piece of data counted six times or is that one piece of data counted one time even though it’s across six disks and duplicated six times? That was not clear to us.
Going along that same logic, considering IOPS: when I write one piece of data, is that one IOP, or, because it's across six disks, does that one write count as six IOPS? Those were some of the unknowns we had moving to Aurora. While we haven't officially found any documentation, what we have seen according to our AWS bill is that we were only charged one time for the data, not six times, even though it's duplicated six times. We're only charged one time for that IOP, not six times for each of the six disks.
Corey: That’s good to know and it was going to be my next question. Sometimes, it’s interesting to understand what you’re going to see before it shows up on the bill and then it more or less presents as complete fiction. It’s difficult to tie it back to what you’re seeing in reality.
Corey: Something else that was teased at re:Invent last year is the idea of Multi-Master Aurora. There's apparently a preview you can sign up for today to get Multi-Master within a region. With Multi-Master Multi-Region, they were committed to having it out by the end of 2018, if you're going to believe the announcements at re:Invent. Unfortunately, I think Dr. Vogels is still giving his re:Invent keynote and it hasn't ended yet, so we don't know how that's going to work from a time perspective.
With what is available today for Multi-Master within a region and then later Multi-Master and Multi-Region for writes, is that of any interest to your use case or is that something that’s nice to have?
David: Absolutely. Both, it’s interesting to us and it’s nice to have. Obviously, you can design a really well-architected solution using a master replica solution. The reason that we are very excited and interested in the Master-Master solution that Aurora is touting is that, currently the way that a failover works in Aurora is that it has to take the cluster and put it in read-only mode so that it can ensure that replication has caught up and that there is no data collisions or data loss, et cetera. When that cluster is in read-only mode, any write that attempts to go against that cluster is either blocked or hung until the cluster comes out of read-only mode.
In theory, once they implement Master-Master cross region or across availability zones, that failover time can decrease significantly because you’re not going to have to put the cluster in a read-only mode in order to initiate a failover. You should just simply be able to move the connections from one master node to another master node.
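The read-only window David describes has a client-side consequence: writes issued mid-failover fail or hang. A common mitigation, sketched here as a generic pattern and not Lucidchart's actual code, is to retry with exponential backoff; the sketch assumes the driver surfaces MySQL's read-only error (errno 1290, ER_OPTION_PREVENTS_STATEMENT) as a catchable exception:

```python
import time

READ_ONLY_ERRNO = 1290  # MySQL ER_OPTION_PREVENTS_STATEMENT ("--read-only")

class ReadOnlyError(Exception):
    """Stand-in for a driver error carrying MySQL errno 1290."""
    errno = READ_ONLY_ERRNO

def execute_with_retry(run_write, attempts=6, base_delay=0.5):
    """Retry a write while the cluster is failing over (read-only).

    run_write is any zero-argument callable performing the write;
    delays back off exponentially: base_delay, 2x, 4x, and so on.
    """
    for attempt in range(attempts):
        try:
            return run_write()
        except ReadOnlyError:
            if attempt == attempts - 1:
                raise  # failover outlasted our patience; surface the error
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical usage: the first two writes hit the read-only window.
state = {"calls": 0}
def flaky_write():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ReadOnlyError()
    return "ok"

print(execute_with_retry(flaky_write, base_delay=0.01))  # ok
```

The backoff budget should be sized against the expected failover window (up to a minute and a half, per the discussion later in this episode).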
The reason that we’re really excited about Multi-Region Master-Master is for disaster recovery purposes. Today, we have replicas that are real-time replicas pointing at our production cluster, but those replicas are in a different region.
In the event of having to do an actual region failover, we would stand up an entire application stack in a second region; the data would be up to date and we'd be able to point at it.
However, as soon as we issue that first write, as soon as the first document save came into the new cluster in that second region, the master cluster in our primary region would be out of sync, meaning it would no longer be a true read replica because there would not be an easy way to get those changes back into our primary region.
With Master-Master cross-region, it opens up the possibility of being able to do a site failover, but more importantly, to be able to fail back to the previous region without losing data.
Corey: Which makes perfect sense. You tell a fantastic story about the capability and how this could wind up informing application architecture. I'm halfway to planning out an Aurora migration myself before I stop to realize, "Oh wait, I don't have a database." That does become something of a challenge from my perspective.
Let's look at the other side for a second. I understand the blog posts and videos done on behalf of AWS are inherently marketing exercises to a point, but your review is glowing. Were there any drawbacks to Aurora that you uncovered?
David: There were a few. I didn’t put these in the blog post, but I’m completely happy to talk about them.
Corey: Oh good. It’s time to dish.
David: This one's a little bit silly, but it is something that we're taking seriously. When we managed our own database solutions, we were really, I don't want to say pushing the envelope on technology, because we weren't the ones designing it; however, we were using things that other organizations dreamed of: Master-Master with skeletal slaves, single-digit-second failover, failure detection, et cetera.
After moving to Aurora, we don’t have to maintain that. We don’t have to worry about EBS volumes or pointing a replica back to a master’s replication point. We just don’t have to deal with that. Some of that knowledge and expertise is becoming a little lost among our developers who traditionally were very sharp when it came to database management.
Again, Aurora makes things so easy that we’re getting a little stagnant in our skill set. That’s one of the drawbacks. But again, that’s probably also a pro for Aurora is you don’t have to know anything to use it.
Corey: Well, on that path then, is there a viable exodus strategy if one day you decide, “Aurora’s great, but we’re going to migrate to fill-in-the-blank here.” Is there a path to get there without significantly re-architecting your entire application?
David: Yeah, and in fact, this is something we could do today. The only drawback is that people are rusty on their skill set of how to do it. But Aurora truly is MySQL-compatible, both as a replica and as a master. The reason that's really cool is I can stand up a MySQL instance or a Percona instance or any other fork of MySQL, and I can take Aurora and point it as a replica at that new MySQL master. Aurora will act just like any other replica would.
We’ve actually done this several times in testing, in moving data around. We’ve also done it with two Aurora instances with a manually stood up MySQL instance part of that cluster as well.
Because Aurora truly is MySQL-compatible, the replication portion of it is no different than what you would do if you just installed MySQL from mysql.org. It makes migration–it makes pulling data out, adding data to Aurora incredibly easy.
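For RDS and Aurora MySQL, AWS exposes stored procedures for exactly this: pointing an instance at an external master and starting replication. The sketch below just composes those CALL statements; the endpoint, credentials, and binlog coordinates are hypothetical, and the procedure signatures should be verified against current AWS documentation before use:

```python
def external_master_statements(host, port, user, password,
                               log_file, log_pos, use_ssl=0):
    """Build the RDS stored-procedure calls that point an Aurora MySQL
    instance at an external MySQL master, then start replication.

    Procedure names are per AWS's RDS MySQL docs (mysql.rds_set_external_master
    and mysql.rds_start_replication); verify the exact signature first.
    """
    return [
        "CALL mysql.rds_set_external_master("
        f"'{host}', {port}, '{user}', '{password}', "
        f"'{log_file}', {log_pos}, {use_ssl});",
        "CALL mysql.rds_start_replication;",
    ]

# Hypothetical endpoint and binlog coordinates:
for statement in external_master_statements(
        "mysql-master.example.com", 3306, "repl", "s3cret",
        "mysql-bin-changelog.000042", 154):
    print(statement)
```

These statements would be run against the Aurora cluster (via any MySQL client) rather than issued as raw CHANGE MASTER TO, which RDS does not permit.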
Corey: Okay, that’s certainly something that I think a lot of companies want to have in their back pocket. When a cloud provider comes out with an exciting new technology that you can use to make your application tremendously simpler, that’s compelling and people want to embrace it intrinsically.
The counter-argument against that, in my experience, has always been one where, “Okay, what if we don’t like it anymore?” or, “What if we deemed it doesn’t quite meet our needs, how do we get out of it?” This is not specific to one provider. This is not calling anyone out without naming them. All providers do this to some extent. Vendor lock-in is a concern, but even more so than that is platform lock-in, where there is a particular service that’s being provided, that there is no other viable alternative for.
When you start seeing some of what they're teasing with being able to go Multi-Master across multiple regions, there isn't a terrific answer to that in many scenarios other than Aurora itself. There are other cloud providers that offer similar global databases, but the semantics are often very different; the tolerances, the failure modes, and the latencies all tend to manifest as a very different application profile. It seems to me, at least on some level, there's still going to be an area of concern for companies looking at things that are sufficiently differentiated. Is that a fair assessment?
David: Absolutely, and it's something that we're taking into consideration as well. You bring up an excellent point that as soon as you go Master-Master cross-region, that is certainly something you can do with a traditional MySQL cluster, but ensuring that those writes happen across both masters across regions is something that would be very difficult to replicate, just due to the latency. You're dealing with something that can't be improved on, which is distance over a network.
That is something that’s concerning. However, Aurora does have the capability of exporting data with a MySQL dump. First off, we’re speaking about hypotheticals. Amazon hasn’t released anything yet.
Assuming that it works the way that they’ve announced, meaning that you can actually have an application that guarantees that you’re not going to have replication issues, you’re not going to have to skip replication errors, then that could be a real awesome solution for maintaining disaster recoverability.
If somebody switches to them and then they decide they want to go back to a traditional MySQL self-managed solution, that would be a big feature set that they would be giving up to go back to something that is self-hosted.
I fully believe that Aurora is going to continue to be MySQL-compatible, just for the ease of being able to move data into it and out of it. But with Master-Master, and with Master-Master cross-region especially, they're starting to enter scenarios where that is something that not only is managed by Amazon, but is something you can't do on your own.
Corey: Which makes an awful lot of sense. It always comes down to the, I feel, somewhat misguided idea of being able to take whatever you've built, however it works, and overnight deploy it to a completely different provider. I get that that is a compelling story and it feels good, but it just isn't realistic as soon as you deviate from building everything yourself, more or less out of popsicle sticks.
To that end, was there anything else you saw about Aurora that wasn’t all that it could have been or drawbacks to it that they may not advertise in giant signs at large conferences?
David: It’s an interesting question. One of the claims that they had is that there’s up to a 5x performance increase. While we did see a performance increase, it wasn’t in the sense that all of our queries were all of a sudden faster. It was more that the capacity that each cluster had increased significantly.
If you’re looking at a single serial set of operations, the performance probably is not going to change very much. Some queries are going to be faster, others are going to be slower. The big benefit that we saw was that the capacity increased significantly. We were able to get much higher throughput than what we had prior. That was one thing that was—I don’t want to say misleading—obviously, it was a quote that had some qualifications around it that weren’t printed.
Corey: Have you seen anything in other areas as well, for example failover time or scalability, anything in that sense that has been a bit of a regression from where you were before?
David: Yeah. When we were managing our Master-Master clusters, we had optimized for the idea that failures would happen. We went into that with eyes wide open; we designed the application fully expecting to lose a database instance a week.
That didn't always happen, but because we had that design up front, it meant that we were able to detect and recover from an EBS failure, a zone failure, a master instance failure, et cetera, within 15 seconds at most, and often in less than three seconds.
With Aurora, because of how failover currently works, once an error is detected or once you initiate a failover to a second region, Aurora will put the cluster in read-only mode. That read-only mode will stay that way until replication has caught up and until the Aurora cluster can confirm that it’s safe to actually move that master functionality to one of the other replicas in the cluster.
Once that's moved, Aurora updates the DNS record to point to that new master node. That DNS record is served through Route 53. Making an API call to Route 53 is near instant for getting an updated record; however, when making a DNS query against Route 53, you're bound by the TTL for that record. By default, Aurora TTLs are 60 seconds.
When moving to Aurora, we went from 3-15 seconds of downtime during a failover migration up to possibly a minute and a half if you include the read-only time as well as the database failover.
For one node, if you're doing database maintenance, say, monthly, that's a minute and a half monthly. Most uptime targets I have seen at 99.99% plus can handle a minute and a half of scheduled maintenance. However, if you have many clusters, that minute and a half starts to add up pretty significantly.
One of the things that we did to get around that was to bypass cached DNS: we point at Route 53's API directly to pull down the updated IP addresses much more frequently than the TTLs would allow. After having done that, our failover times with Aurora are typically in the 30-second range.
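Lucidchart's workaround queries Route 53 directly. An equivalent approach (an assumption here, not necessarily what they run) is to ask the RDS API which cluster member is currently the writer, since `describe_db_clusters` reports an `IsClusterWriter` flag per member; that sidesteps DNS caching entirely:

```python
def pick_writer(members):
    """Return the instance identifier flagged as the cluster writer.

    `members` is the DBClusterMembers list returned by
    describe_db_clusters.
    """
    for member in members:
        if member.get("IsClusterWriter"):
            return member["DBInstanceIdentifier"]
    raise LookupError("no writer found in cluster member list")

def current_writer(cluster_id, region="us-east-1"):
    """Ask the RDS API, rather than DNS, which instance is the writer."""
    import boto3  # deferred so pick_writer stays dependency-free
    rds = boto3.client("rds", region_name=region)
    cluster = rds.describe_db_clusters(
        DBClusterIdentifier=cluster_id)["DBClusters"][0]
    return pick_writer(cluster["DBClusterMembers"])

# Offline example with a fabricated member list:
print(pick_writer([
    {"DBInstanceIdentifier": "app-db-1", "IsClusterWriter": False},
    {"DBInstanceIdentifier": "app-db-2", "IsClusterWriter": True},
]))  # app-db-2
```

A client polling this on failure can reconnect to the new writer well inside the 60-second DNS TTL, at the cost of extra API calls.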
Now, 30 seconds is still significantly more than the 3-15 seconds that we had before, and those 30 seconds add up pretty significantly when we start talking about dozens of clusters. But it's a worthwhile trade-off: the stability that we get the rest of the time is significantly better than what we were able to get when managing it ourselves, simply because there are far fewer failure scenarios.
We don't have to worry about an EBS failure, or an EBS volume that is all of a sudden slow for some reason, or a zone failure, et cetera. All of that is built into the cluster; it automatically recovers, and we don't even know that it's happened.
Corey: That sounds pretty awesome. One other big Aurora announcement that came out of re:Invent last year was Aurora Serverless, which to me sounds like the punchline of a joke, as in, "I'm going to store my bad jokes in a database serverlessly." But it feels like a toy. It doesn't strike me as the sort of thing that companies are seriously evaluating, because when they have large-scale data, it usually tends to be needed in less time than it takes to spin up or spin down an interface to that data, as you see with serverless technologies. Is that something that is at all on your radar, or are you also viewing it as something of a toy?
David: It really comes down to performance. I have trouble imagining a scenario in which serverless architecture would actually scale both from a data and from a compute perspective without seriously degrading the performance of a cluster.
One of the reasons that MySQL and other databases are so performant is the index that's stored in memory. The database doesn't have to keep going to disk, which is significantly slower, to look up where record sets are or to retrieve records, depending on what it is that you're querying for.
In a serverless architecture, in theory you just have a bunch of worker nodes that have capacity that are receiving a query and then going and retrieving that data set. But I don’t see how that would actually be performant without having an index that’s pre-warmed up in memory across that cluster.
Now, let's say that Amazon delivers on that and there's not a performance degradation for using a serverless architecture. That becomes really interesting, simply because now I don't have to have static steps in my read and write capacity. It can simply scale up and scale down dynamically, based off of the time of day, or based off of some really terrible business intelligence query that wants to look up a lot of data.
It is interesting and I’m skeptical that they’re going to be able to keep their performance, but if they do keep their performance and they truly come out with a MySQL-compatible serverless implementation, that is going to be a huge game changer.
One other thing on that is while I’m skeptical of the performance, there are certain things that we currently persist in S3, not because of the data size just because of what they are. Those types of things become a real viability in a serverless architecture.
If you’re not having to handle joins or other things that require a warmed-up index, Aurora, especially with Master-Master cross-region, opens up a lot of possibilities that would be difficult for a traditional file share like S3 or EBS or NFS, et cetera, simply because you have the replication.
Again, in disaster recovery, you not only want to be able to failover to the secondary region, you also want to fail back to the primary region, which is likely where you have your reserve instances and where you have other things already built out that are really, really nice. The serverless architecture is very interesting, but I’m skeptical that it’s going to be performant.
Corey: That’s sort of where I land on that. Obviously, I should qualify that I’m not a database or datastore person. Usually I run away from things that I can’t just represent as code and make come back. Now, this is the sort of thing that leaves a mark in the general sense.
Corey: Thank you for taking the time to speak with me. Is there anything else you'd like to talk about, or anything people should take a look at that's associated with you in any sense?
David: Absolutely. I’d love for everybody to check out Lucidchart. Obviously, what we’ve been talking about is the solution that’s built on Aurora. But the reason that we have selected Aurora is because we have an awesome product, we’re really happy about it, really excited about it for building diagrams in the cloud.
Corey: I can say that I’m a very happy customer of Lucidchart myself, which is not incidentally how I got connected with you. But I build horrifying architecture diagrams of ridiculous things that I shove into AWS.
In fact, I'll include one in the show notes just to terrify people at home. But by and large, I have no ability to represent things graphically, and Lucidchart just makes it fantastically easy. I'm starting to use it for most of the architecture diagrams I need, rather than something in finger paint and crayon.
David: Well, thank you very much.
Corey: Of course, thank you very much for joining me. This is Screaming In The Cloud and I’m Corey Quinn.
This has been this week's episode of Screaming In The Cloud. You can also find more wherever fine snark is sold.