Data stores at scale

Scaling the data layer is often the trickiest part of scaling an application. It's important to keep things in perspective: don't over-engineer a highly scalable solution prematurely, as that is a complex and time-consuming effort, but at the same time, anticipate upcoming growth so that you don't have your back against the wall when it is time to rework that layer.

We can put things in perspective this way:


So far in this chapter, we saw that it is best to start with a well-known solution such as a relational database like AWS Aurora. In the exploratory phase of a project, it is best not to worry about scaling. Once things get more serious and you are ready to launch your application in production, it is best to avoid join statements and transactions, as they will make things harder later. Instead, if you need these features, you can implement similar concepts in your application layer. In addition, we saw that we can help scale our read queries by adding read replicas and by caching queries with ElastiCache, but with regard to scaling writes, we haven't engineered any solution yet.
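To make the application-layer idea concrete, here is a minimal sketch of doing a join in code instead of in SQL. The table names and fields (`users`, `orders`, `user_id`) are hypothetical; the assumption is that each list was fetched with its own simple primary-key query:

```python
# Sketch of an application-level join: instead of a SQL JOIN, fetch the
# two (hypothetical) tables with separate queries and merge them in code.

def app_level_join(users, orders):
    """Attach each user's orders to the user record, joined on user_id."""
    orders_by_user = {}
    for order in orders:
        orders_by_user.setdefault(order["user_id"], []).append(order)
    return [
        {**user, "orders": orders_by_user.get(user["id"], [])}
        for user in users
    ]

# Rows as they might come back from two independent queries:
users = [{"id": 1, "login": "alice"}, {"id": 2, "login": "nina"}]
orders = [{"user_id": 1, "total": 25}, {"user_id": 1, "total": 40}]
joined = app_level_join(users, orders)
```

Because each query touches a single table, this pattern keeps working unchanged when the tables later move to separate database clusters.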

When you get to the next stage, where handling the load or the amount of data becomes an issue, it is finally time to look into more complex solutions. This will typically require rewriting at least parts of the application's data layer. The two most common approaches are adding a new data store to your infrastructure (in AWS, you would probably opt for DynamoDB, a hosted and very scalable database service) or implementing a sharding strategy for your databases. The concept behind sharding a database is as follows:

At first, the entire application uses a single database cluster. Because we are constrained to writing only to our primary instance and to a total of 10 TB of data, it is likely that over time we will have to break that pattern and start using separate database clusters for some of the most important tables. The problem is that even with a single table per cluster, performance or data size can still be an issue. Once that's the case, we get to the second stage of sharding, which consists of breaking out tables across multiple database clusters using a specific key. You may decide, for example, to break out your user database into two shards and have all logins ranging from A to M present in your first database, while users whose logins range from N to Z would be present in the second database.
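The login-range scheme above can be sketched as a small routing function. The shard names and endpoints are hypothetical placeholders, not real infrastructure:

```python
# Minimal sketch of the range-based sharding described above:
# logins starting with A-M live in the first user database,
# logins starting with N-Z live in the second.

SHARDS = {
    "shard-1": "aurora-users-1.example.internal",  # logins A-M
    "shard-2": "aurora-users-2.example.internal",  # logins N-Z
}

def shard_for_login(login):
    """Return the name of the database shard holding a given user login."""
    first = login[0].upper()
    if "A" <= first <= "M":
        return "shard-1"
    return "shard-2"

# The application data layer would connect to
# SHARDS[shard_for_login(login)] before issuing the query.
```

Note that with range-based keys like this, a skewed login distribution can leave one shard much hotter than the other, which is why hash-based shard keys are also common.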

Because there isn't a perfect solution that fits everyone's needs, you need to look at your application requirements and decide which solution to adopt based on what you fear the most.

Brewer's theorem (the CAP theorem) states that a distributed system cannot be optimized at the same time for consistency (every client sees the same data), availability (you can always read or write), and partition tolerance (the system keeps working despite an arbitrary number of messages being dropped or delayed between the database partitions). You basically need to pick two and choose your data store accordingly. If you choose to optimize for availability and consistency, then you will likely want to work on a sharding strategy for Aurora; if you care most about availability and partition tolerance, then DynamoDB will likely be a better pick.

Finally, here is a small comparison between those two data stores to help you understand some of their key differences:

| | RDS Aurora | DynamoDB |
| --- | --- | --- |
| Documentation | http://amzn.to/2fjc9kj | http://amzn.to/2i3aA6v |
| Model | Relational | Non-relational |
| Data | Structured and stored in tables | Key/value pairs; can store a virtually unlimited number of items (each item can be up to 400 KB) |
| Schema | Strict schema; making schema changes in a sharded cluster is complex | Schema-less |
| Availability / Consistency | Each chunk of data is replicated six ways across three AZs | Data stored in three geographically distributed replicas of each table |
| Scalability | At scale, requires a lot of work to implement a sharding strategy | Allows you to scale your throughput up and down on demand |
| Performance | High performance is achieved by vertically scaling the instance type used in your clusters | Low latency; can easily handle very high throughput as long as you pick a good partition key to avoid hot-key issues |
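The hot-key point is worth illustrating. DynamoDB distributes items across partitions by hashing the partition key, so a high-cardinality key spreads load while a low-cardinality key concentrates it on a few hot partitions. The sketch below simulates this with MD5 and eight partitions purely for illustration; DynamoDB's real internal hashing and partition count are not public, and the item fields are made up:

```python
# Simulate how partition key choice spreads (or concentrates) load.
import hashlib

NUM_PARTITIONS = 8

def partition_for(key):
    """Map a key to a partition by hashing it (illustrative, not DynamoDB's hash)."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# 1,000 hypothetical items with a high-cardinality field (user_id)
# and a low-cardinality field (status, only two possible values).
items = [
    {"user_id": i, "status": "active" if i % 10 else "banned"}
    for i in range(1000)
]

def load_per_partition(key_name):
    counts = [0] * NUM_PARTITIONS
    for item in items:
        counts[partition_for(item[key_name])] += 1
    return counts

good = load_per_partition("user_id")  # load spread across partitions
bad = load_per_partition("status")    # only 2 key values -> at most 2 partitions hit
```

With `user_id` as the partition key, requests fan out across all partitions; with `status`, at most two partitions absorb all of the traffic, which is exactly the hot-key situation the table warns about.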

Of course, over time, when scaling your data layer becomes an even bigger issue, you will likely have to both implement a sharded Aurora cluster strategy and rely on a NoSQL database such as DynamoDB to keep up with your application's growth.

We now have a number of options to handle multiple millions of users on our service. There is one last architecture change that we haven't talked about yet: hosting the service in multiple regions.
