Sharding

A single database server, however beefy, is limited in both storage space and compute resources, and it is also a single point of failure for availability. Storage systems such as Cassandra address this by partitioning data internally, opaquely to the application. However, many systems (including most RDBMSs) don't partition data internally.

The solution is sharding: dividing the data store into a set of horizontal partitions, or shards. Each shard has the same schema but holds its own distinct subset of rows; each shard is thus a self-contained database. The application (or driver) knows how to route requests for specific data to the shard that holds it. The benefits are as follows:

  • The system can be scaled by adding additional shards/nodes
  • Load is balanced across shards, thereby reducing contention for resources
  • Intelligent placement strategies can be employed to locate data close to the compute that needs it

In the cloud, shards can be located physically close to the users that access them; this reduces latency and improves scalability when storing and accessing large volumes of data.
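The routing idea above can be sketched in a few lines. This is a minimal illustration of modulo-based hash sharding, with hypothetical shard names; real drivers use richer placement maps:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing it (simple modulo sharding)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Every request for the same key deterministically lands on the same shard.
assert shard_for("user:42") == shard_for("user:42")
```

Note that the hash, not the raw key, chooses the shard, which spreads keys evenly regardless of how skewed the key space is.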

Distributing data is not, by itself, difficult: if no specific affinity is required, a key can be mapped to a shard with a hash function. The real challenge is redistributing data when the topology changes, as described in the Cassandra deep-dive section. There are three main approaches to the lookup problem:

  • Consistent Hashing: We covered this earlier in the context of the Cassandra cluster.
  • Client-side Routing: Clients have a lookup map to figure out which shard (node) hosts a particular key (hash). Whenever there is a topology change, the clients get updated maps. Redis Cluster does sharding in this way.
  • Brokered Routing: There is a central service that takes IO requests and routes them to the appropriate shard based on a topology map. MongoDB sharding follows this approach.
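The redistribution problem that consistent hashing solves can be demonstrated directly. The sketch below is a minimal ring (no virtual nodes, hypothetical node names): when a node is added, only the keys falling into the new node's arc move, while all other keys keep their old placement.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring; omits virtual nodes for clarity."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise: first node hash >= key hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._ring[idx][1]

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (_hash(node), node))
        self._hashes = [h for h, _ in self._ring]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
moved = [k for k in keys if ring.node_for(k) != before[k]]
# Only keys now owned by node-d have moved; the rest stay where they were.
```

With hash-modulo sharding, the same topology change would reassign roughly three quarters of the keys; here only the fraction of the ring claimed by the new node moves.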