Chapter 4: Geo-Partitioning

In Chapter 3, Atomicity, Consistency, Isolation, and Durability (ACID), we learned what ACID is, why we need it, and how it is supported in CockroachDB. Here, we will learn all about geo-partitioning, which is one of the most important reasons to use a distributed SQL database such as CockroachDB.

In this chapter, you will get a basic understanding of what geo-partitioning is and why this feature is useful for you. We will also go over some cloud jargon and some of the options provided by various cloud providers to distribute your data geographically for better resiliency, performance, and availability. At the end of the chapter, we will go over different ways of geo-partitioning your data in CockroachDB.

The following topics will be covered in this chapter:

  • Introduction to geo-partitioning
  • Cloud regions and zones
  • Geo-partitioning in CockroachDB

Technical requirements

For executing the examples in this chapter, you will need to install CockroachDB. If you still haven't done so, please refer to the Technical requirements section in Chapter 2, How Does CockroachDB Work Internally? All the queries in this chapter are available at https://github.com/PacktPublishing/Getting-Started-with-CockroachDB.

Introduction to geo-partitioning  

As the name geo-partitioning suggests, the data is partitioned based on geographical location. Geo-partitioning refers to the mechanism of storing data in various geographical locations, based on where that data is being consumed.

For example, let's say you are maintaining a database for an airline company that has international and domestic travelers from every continent as its users. Since the company has a global presence, it would be beneficial to keep each user's data close to where they live. This helps in serving the data locally and quickly.

Figure 4.1 shows an example of a table whose rows are partitioned based on geo-location across three different continents. Rows are stored in specific databases based on their locality. This locality can be mapped to the user's location based on their activity:

Figure 4.1 – An example of a geo-partitioned table


Geo-partitioning is useful even within smaller regions where services are provided locally. For example, let's say you are managing a database for a company that has to serve lots of data very frequently, and that data is highly localized to the users living in a given area. This fits companies such as Instacart, DoorDash, Uber Eats, Uber, Lyft, and Ola.

For the sake of discussion, let's consider DoorDash. DoorDash users search for restaurants and grocery stores, usually within 10–15 miles of where they live, and place orders in the app. In this case, we can house the data about user purchase orders and delivery status close to those users. This not only helps serve the data faster for local users, but also makes it easy to apply local compliance rules only to the specific subset of the cluster that manages this data.

If we consider the state of California for this example, we can strategically deploy CockroachDB nodes in San Francisco and Los Angeles. For restaurants, menus, and grocery items, which don't change often, we can create duplicate indexes that will help reduce read latencies when browsing.

Also, now, if you want to incorporate the California Consumer Privacy Act of 2018 (CCPA) for all the purchase orders that happen within California, it would be much easier to restrict it to the databases in California that manage this information. The following figure illustrates this aspect:

Figure 4.2 – Geo-partitioning for cities in California State, USA


Some of the main advantages of geo-partitioning are as follows:

  • It helps serve data quickly, as we reduce the number of network hops.
  • Failures in a given geolocation only affect a small set of users.
  • It provides data compliance as per local rules and standards, for example, the California Consumer Privacy Act of 2018 (CCPA) and the European Union's General Data Protection Regulation (GDPR).
  • It helps provide better disaster recovery and resiliency if an entire geolocation experiences a natural or human-triggered calamity.

Let's now learn about cloud regions and availability zones, which are essential for understanding how geo-partitioning can be realized on various cloud providers.

Cloud, regions, and zones

In this section, we will learn some jargon related to the cloud. We will also learn about regions and zones, and how they have been realized by various cloud providers. Understanding these concepts is important for deciding how you want to distribute your data and what sort of guarantees you want to provide:

  • Cloud: A cloud is simply a collection of servers in multiple data centers that are positioned in strategic locations across the globe. These data centers provide resources such as storage, network, and compute on demand, and they belong to a specific cloud provider.
  • Cloud provider: A cloud provider is an organization that provides various services on its public or private cloud platform.
  • Public cloud: In a public cloud, resources that you consume are hosted on the cloud provider's data center. The cloud provider is responsible for maintaining, upgrading, and operating cloud resources. Since you are consuming resources maintained by a third party, there are additional security risks here.
  • Private cloud: In the case of a private cloud, the resources are usually hosted in a company's own data center, but they can also be hosted by a cloud provider. In a private cloud, all the resources are dedicated to a single organization and isolated from other organizations; hence it's more controlled and secure.
  • Multi-cloud: A platform is called multi-cloud when it consumes resources from multiple public cloud providers.
  • Hybrid cloud: In a hybrid cloud, you combine resources from a public cloud with resources from a private cloud and/or an on-premises environment.

Region

A region refers to an actual physical location where your cloud resources are housed. Each cloud provider has different notions of a region.

It's very important to understand how regions are implemented by different cloud providers, as it determines the following things:

  • Cloud cost: Resources in some regions are cheaper than in others.
  • Multi-cloud and hybrid cloud strategy: This includes disaster recovery, high availability, data replication, data migration, data sharing, failover, and so on.
  • Latency: The whole idea of geo-partitioning is to reduce latency by keeping the data close to the customer. So, you need to select regions in strategic locations that reduce overall latency.
  • Data compliance: Depending on where the region is located, you might have different data compliance requirements. Also, some countries might insist that the data of their citizens cannot leave the country, in which case you will be forced to pick some regions in that country.
  • Services and features: Not all services and features are available in all regions. So, this sometimes reduces the choice of regions.

Zone

A region consists of multiple zones. A zone refers to a more specific location within a given region.

Availability zone

An availability zone is an isolated data center that doesn't share any resources with other zones within the same region. All communication between availability zones happens over a high-speed network. A region typically has at least two availability zones, which helps in implementing redundancy, failover, and high availability.

Now, let's look at how region and zone are defined by some of the top cloud providers.

Regions and zones on various cloud providers

In this section, we will briefly go over what region and zone mean on the top four cloud providers. We will be covering the following cloud providers in this section:

  • Amazon Web Services
  • Google Cloud Platform
  • Microsoft Azure
  • Oracle Cloud

Let's get started!

Amazon Web Services

Region: A Region is a physical location that consists of multiple data centers.

Availability Zone: A group of discrete data centers that provide redundancy to cloud resources is called an Availability Zone. Availability Zones help in implementing features such as high availability, fault tolerance, reliability, and scale.

AWS Local Zone: Local Zones provide resources that are located close to your end users. They are useful for workloads such as gaming and streaming, which require low latency, high throughput, and elasticity.

Google Cloud Platform

Region: A region is a collection of zones.

Zone: A zone is a deployment area within a region. You should use multiple zones to provide high availability and fault tolerance.

Microsoft Azure

Region: A region is a set of data centers deployed within a latency-defined perimeter and connected through a dedicated, fast network.

Geography: An area of the world containing at least one Azure region. Geographies are fault-tolerant, so data can remain available within a geography even in the event of a complete regional failure.

Availability zones: Unique physical locations within a region. Each availability zone comprises one or more data centers with resources that are not shared with other zones.

Oracle Cloud

Region: A region is a localized geographic area made up of one or more availability domains.

Availability domains: Availability domains are made up of one or more data centers; they do not share any resources among themselves and are connected through a fast network.

Each availability domain has three fault domains. Fault domains ensure that your resources are placed on different physical hardware within an availability domain, which offers improved resiliency.

Next, we will learn about how to achieve geo-partitioning with CockroachDB.

Geo-partitioning in CockroachDB

CockroachDB supports two broad topology patterns, single region and multi-region, which offer different levels of data resiliency, latency, and availability.

Single region

Here, all the data resides in a single region.

CockroachDB defines two variations of single-region topology, development and production, as follows:

  • Development: This pattern is very straightforward: you just have a single node in an availability zone, with multiple clients talking to it. This pattern is useful for testing purposes. This topology can also be used on your laptop or desktop.

As part of your Continuous Integration/Continuous Deployment (CI/CD) pipeline, you can have a dedicated stage in which you provision a single-node cluster and then run a set of system tests that interact with a real database. Since the clients will be local to the data, reads and writes will be much faster, although there is no resiliency.

The following is an example of a single-region deployment:

Figure 4.3 – Single-node deployment in the US-West (Northern California) region, where all the clients are also deployed in the same region


  • Basic production: Here, you can have nodes deployed in more than one availability zone within the same region. It is ideal to have at least three nodes in three different availability zones within the same region for consensus purposes. This pattern takes advantage of many CockroachDB features, such as replication, rebalancing, and resiliency.

This topology can withstand a single-node failure. If two nodes fail, some ranges might lose their leaseholders due to a lack of consensus and will become unavailable. You would also need a load balancer to spread the traffic from clients evenly across the three nodes.

The following is a single-region deployment with three nodes:

Figure 4.4 – Single-region three-node deployment in the US-West (Northern California) region, where all the clients are also deployed in the same region


In the basic production topology, client requests hit a load balancer first and are then forwarded to one of the nodes, which acts as a gateway node. The gateway node coordinates with the relevant leaseholders, gathers the data, and serves it back to the client.
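The consensus and resiliency guarantees of this topology come from the default replication factor of three. If you want to confirm what your cluster is using, you can inspect the default replication zone. This is a minimal sketch; the exact SHOW ZONE CONFIGURATION syntax (FOR versus FROM) can vary slightly between CockroachDB versions:

    -- Inspect the default replication zone; num_replicas should be 3,
    -- which is why three nodes across three availability zones can
    -- tolerate the loss of any single node
    SHOW ZONE CONFIGURATION FOR RANGE default;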

Multi-region

Here, the data is spread across multiple regions. You can use row-level control to distribute the rows geographically. The following figure shows a multi-region deployment, which spans three different regions in the US:

Figure 4.5 – Multi-region nine-node deployment covering the US-East, US-Central, and US-West regions


You can choose among the following topologies based on your requirements:

  • Geo-partitioned replicas
  • Geo-partitioned leaseholders
  • Duplicate indexes
  • Follower reads
  • Follow-the-workload

Geo-partitioning options are only available with the enterprise license of CockroachDB. If you try to use enterprise features without the enterprise license, you will see the following message:

ERROR: use of partitions requires an enterprise license. see https://cockroachlabs.com/pricing?cluster=65244c3a-2d63-432c-a8b4-c70a53459ca1 for details on how to enable enterprise features

SQLSTATE: XXC02

You can visit the URL mentioned in the error message and get an enterprise license for a specific trial period.

Applying an enterprise license to your cluster would involve the following steps:

root@localhost:26258/defaultdb> SET CLUSTER SETTING cluster.organization = 'Self';

SET CLUSTER SETTING

Time: 166ms total (execution 165ms / network 0ms)

root@localhost:26258/defaultdb> SET CLUSTER SETTING enterprise.license = 'crl-0-EKD0mYsGGAIiBFNlbGY';

SET CLUSTER SETTING

Time: 191ms total (execution 191ms / network 0ms)
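To confirm that the settings took effect, you can read them back. This is just a sanity check and assumes the same session as above:

    -- Verify the organization name and license that were just applied
    SHOW CLUSTER SETTING cluster.organization;
    SHOW CLUSTER SETTING enterprise.license;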

Now, let's take a deeper look at each of these topologies. All of them need a multi-region deployment. The easiest way to create a multi-region cluster for experimentation is to create an account at https://cockroachlabs.cloud and request a three-region, nine-node cluster.
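Before partitioning anything, it is worth checking that every node advertises the locality (region and zone) you expect, since the zone constraints used in the following examples match against these locality values. A hedged sketch; the crdb_internal tables are internal and their columns can change between versions:

    -- Locality of the node you are currently connected to
    SHOW LOCALITY;
    -- Locality advertised by every node in the cluster
    SELECT node_id, locality FROM crdb_internal.gossip_nodes;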

Geo-partitioned replicas

In the case of geo-partitioned replicas, you need a column that holds the geolocation. It is combined with the table's unique identifier to form a compound primary key. For example, let's say you have an ID, which is a Universally Unique Identifier (UUID) – you can combine that with the city or the state to form a compound primary key. Then, you partition the table and all the secondary indexes on that column, and each partition has its own replicas. Once you have this, you can ask CockroachDB to place the data belonging to each partition in a specific region.

For example, let's say you have users from California and Ohio. All the rows belonging to users from California can be stored in the US-West region and the rows belonging to users from Ohio can be stored in the US-East region. Here, the assumption is that the data will be consumed locally, so both read and write latencies will be reduced.

For example, if you operate a food delivery service, your users will place orders from the city they live in, and the food will be delivered within the same city. So, in this case, it makes total sense to use geo-partitioned replicas for your users. However, since all the data and its replicas are housed in the same region, if an entire region goes down, that region's data will not be available.

Let's see an example in which we are going to create a geo-partitioned replicas topology. Basically, the replicas will be pinned to a particular region so that local reads and writes are faster:

  1. Create a table called users, where you have city as one of the columns. This will help in partitioning this table by city:

    CREATE TABLE users (
        id UUID NOT NULL DEFAULT gen_random_uuid(),
        first_name STRING NOT NULL,
        last_name STRING NOT NULL,
        city STRING NOT NULL,
        PRIMARY KEY (city ASC, id ASC)
    );

  2. Now, create a secondary index as follows:

    CREATE INDEX users_first_name_last_name_index ON users (city, first_name, last_name);

  3. Create partitions for the table based on city. Let's consider three different cities that are in the west, east, and central parts of the USA:

    ALTER TABLE users PARTITION BY LIST (city) (
        PARTITION sfo VALUES IN ('san francisco'),
        PARTITION aus VALUES IN ('austin'),
        PARTITION ny VALUES IN ('new york')
    );

  4. Create partitions for the secondary index based on city:

    ALTER INDEX users_first_name_last_name_index PARTITION BY LIST (city) (
        PARTITION sfo VALUES IN ('san francisco'),
        PARTITION aus VALUES IN ('austin'),
        PARTITION ny VALUES IN ('new york')
    );

  5. For the table and its secondary index, create a replication zone, which will pin the replica of a given partition to its relevant region. The <table>@* syntax lets you create zone configurations for all identically named partitions of a table, saving you multiple steps:

    ALTER PARTITION sfo OF INDEX users@* CONFIGURE ZONE USING constraints = '{"+region=us-west":1}', num_replicas=3;

      

    ALTER PARTITION aus OF INDEX users@* CONFIGURE ZONE USING constraints = '{"+region=us-central":1}', num_replicas=3;

    ALTER PARTITION ny OF INDEX users@* CONFIGURE ZONE USING constraints = '{"+region=us-east":1}', num_replicas=3;   

  6. Now, let's execute the SHOW CREATE TABLE query to see how the partitions are created for the table and the secondary index:

    SHOW CREATE TABLE users;

The sample output is as follows:

SHOW CREATE TABLE users;

  table_name |                                                create_statement

-------------+------------------------------------------------------------------------------------------------------------------

  users      | CREATE TABLE public.users (

             |     id UUID NOT NULL DEFAULT gen_random_uuid(),

             |     first_name STRING NOT NULL,

             |     last_name STRING NOT NULL,

             |     city STRING NOT NULL,

             |     CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),

             |     INDEX users_first_name_last_name_index (city ASC, first_name ASC, last_name ASC) PARTITION BY LIST (city) (

             |         PARTITION sfo VALUES IN (('san francisco')),

             |         PARTITION aus VALUES IN (('austin')),

             |         PARTITION ny VALUES IN (('new york'))

             |     ),

             |     FAMILY "primary" (id, first_name, last_name, city)

             | ) PARTITION BY LIST (city) (

             |     PARTITION sfo VALUES IN (('san francisco')),

             |     PARTITION aus VALUES IN (('austin')),

             |     PARTITION ny VALUES IN (('new york'))

             | );

             | ALTER TABLE defaultdb.public.users CONFIGURE ZONE USING

             |     num_replicas = 3;

             | ALTER PARTITION sfo OF INDEX defaultdb.public.users@primary CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-west: 1}';

             | ALTER PARTITION sfo OF INDEX defaultdb.public.users@users_first_name_last_name_index CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-west: 1}';

             | ALTER PARTITION aus OF INDEX defaultdb.public.users@primary CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-central: 1}';

             | ALTER PARTITION aus OF INDEX defaultdb.public.users@users_first_name_last_name_index CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-central: 1}';

             | ALTER PARTITION ny OF INDEX defaultdb.public.users@primary CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-east: 1}';

             | ALTER PARTITION ny OF INDEX defaultdb.public.users@users_first_name_last_name_index CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-east: 1}'

(1 row)

Time: 278ms total (execution 278ms / network 0ms)

Here, you can see that the replicas are constrained to the relevant regions. So, if an entire region goes down, the partition pinned to that region becomes unavailable.
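Besides SHOW CREATE TABLE, you can also list the partitions along with the zone constraints attached to them. This is a minimal sketch, assuming your CockroachDB version supports the SHOW PARTITIONS statement:

    -- List every partition of the users table, its parent index,
    -- and the zone configuration constraining it to a region
    SHOW PARTITIONS FROM TABLE users;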

Next, we will go through the geo-partitioned leaseholders topology.

Geo-partitioned leaseholders

As in the case of geo-partitioned replicas, you still need a column that holds the geolocation. You will also need a compound primary key, which is a combination of a unique ID and the geolocation.

Here, the main difference is that you only pin the leaseholder to a specific region, while the replicas can be stored in other regions. Since only the leaseholder is pinned, reads will always be fast, but writes take longer, because data is replicated across regions and replication involves cross-region consensus.

Let's see an example, in which we are going to create a geo-partitioned leaseholders topology. Basically, the leaseholder will be pinned to a particular region, so that local reads are faster:

  1. Create a table called users, where you have city as one of the columns. This will help in partitioning this table by city:

    CREATE TABLE users (
        id UUID NOT NULL DEFAULT gen_random_uuid(),
        first_name STRING NOT NULL,
        last_name STRING NOT NULL,
        city STRING NOT NULL,
        PRIMARY KEY (city ASC, id ASC)
    );

  2. Now, create a secondary index as follows:

    CREATE INDEX users_first_name_last_name_index ON users (city, first_name, last_name);

  3. Create partitions for the table based on city. Let's consider three different cities that are in the west, east, and central parts of the USA:

    ALTER TABLE users PARTITION BY LIST (city) (
        PARTITION sfo VALUES IN ('san francisco'),
        PARTITION aus VALUES IN ('austin'),
        PARTITION ny VALUES IN ('new york')
    );

  4. Create partitions for the secondary index based on city:

    ALTER INDEX users_first_name_last_name_index PARTITION BY LIST (city) (
        PARTITION sfo VALUES IN ('san francisco'),
        PARTITION aus VALUES IN ('austin'),
        PARTITION ny VALUES IN ('new york')
    );

  5. For the table and its secondary index, create a replication zone, which will pin the leaseholder of a given partition to its relevant region:

    ALTER PARTITION sfo OF INDEX users@*
        CONFIGURE ZONE USING
          num_replicas = 3,
          constraints = '{"+region=us-west":1}',
          lease_preferences = '[[+region=us-west]]';

    ALTER PARTITION aus OF INDEX users@*
        CONFIGURE ZONE USING
          num_replicas = 3,
          constraints = '{"+region=us-central":1}',
          lease_preferences = '[[+region=us-central]]';

    ALTER PARTITION ny OF INDEX users@*
        CONFIGURE ZONE USING
          num_replicas = 3,
          constraints = '{"+region=us-east":1}',
          lease_preferences = '[[+region=us-east]]';

  6. Now, let's execute SHOW CREATE TABLE to see how the partitions are created for the table and the secondary index:

    SHOW CREATE TABLE users;

The sample output is as follows:

SHOW CREATE TABLE users;

  table_name |                                                create_statement

-------------+------------------------------------------------------------------------------------------------------------------

  users      | CREATE TABLE public.users (

             |     id UUID NOT NULL DEFAULT gen_random_uuid(),

             |     first_name STRING NOT NULL,

             |     last_name STRING NOT NULL,

             |     city STRING NOT NULL,

             |     CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),

             |     INDEX users_first_name_last_name_index (city ASC, first_name ASC, last_name ASC) PARTITION BY LIST (city) (

             |         PARTITION sfo VALUES IN (('san francisco')),

             |         PARTITION aus VALUES IN (('austin')),

             |         PARTITION ny VALUES IN (('new york'))

             |     ),

             |     FAMILY "primary" (id, first_name, last_name, city)

             | ) PARTITION BY LIST (city) (

             |     PARTITION sfo VALUES IN (('san francisco')),

             |     PARTITION aus VALUES IN (('austin')),

             |     PARTITION ny VALUES IN (('new york'))

             | );

             | ALTER TABLE defaultdb.public.users CONFIGURE ZONE USING

             |     num_replicas = 3;

             | ALTER PARTITION sfo OF INDEX defaultdb.public.users@primary CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-west: 1}',

             |     lease_preferences = '[[+region=us-west]]';

             | ALTER PARTITION sfo OF INDEX defaultdb.public.users@users_first_name_last_name_index CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-west: 1}',

             |     lease_preferences = '[[+region=us-west]]';

             | ALTER PARTITION aus OF INDEX defaultdb.public.users@primary CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-central: 1}',

             |     lease_preferences = '[[+region=us-central]]';

             | ALTER PARTITION aus OF INDEX defaultdb.public.users@users_first_name_last_name_index CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-central: 1}',

             |     lease_preferences = '[[+region=us-central]]';

             | ALTER PARTITION ny OF INDEX defaultdb.public.users@primary CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-east: 1}',

             |     lease_preferences = '[[+region=us-east]]';

             | ALTER PARTITION ny OF INDEX defaultdb.public.users@users_first_name_last_name_index CONFIGURE ZONE USING

             |     num_replicas = 3,

             |     constraints = '{+region=us-east: 1}',

             |     lease_preferences = '[[+region=us-east]]'

(1 row)

Time: 37ms total (execution 37ms / network 0ms)

Here, you can see that the lease preference is restricted to relevant regions, which will ensure that the leaseholders are always pinned to a specific region.
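If you want to confirm where the leaseholders actually ended up, you can inspect the ranges of the table. A hedged sketch; the exact columns returned by SHOW RANGES vary between versions, but they typically include the leaseholder and its locality:

    -- Show the ranges backing the users table, including which node
    -- holds the lease for each range and that node's locality
    SHOW RANGES FROM TABLE users;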

Next, we will go through the duplicate indexes topology.

Duplicate indexes

The duplicate indexes topology is useful in cases where you write data once and read it from various locations. For example, let's say you are managing the credit cards of people who travel throughout the US very often. If you pin the data to a single region, reads will slow down whenever a user moves out of that region. Duplicate indexes come in handy to solve this issue.

Unlike the previous two cases, you don't need to partition the table by geolocation here. Instead, you pin the leaseholder of the table (that is, of its primary index) to a specific region.

In our example, that region could correspond to the user's primary address. The credit card information is still replicated across regions to cover the entire US, but since only the leaseholder coordinates reads and writes, reads from other regions are routed to the leaseholder's pinned region. This again introduces latency.

To avoid this, you can create duplicate secondary indexes for the credit card data. For example, assuming that the table's leaseholder is pinned to the west coast, you can create identical indexes, such as id_creditcard_east and id_creditcard_central, whose leaseholders are constrained to the US-East and US-Central regions respectively. This guarantees that every region has a local index leaseholder, which drastically reduces read latency, as reads are always served locally.

Since the original data already has multiple replicas and the duplicate secondary indexes are replicated as well, there are now many copies of the same data across regions. This increases write latency, as all these copies have to be updated and cross-region consensus has to be reached:

  1. Let's say you are maintaining a table of local attractions in the USA, which can be accessed by users throughout the country:

    CREATE TABLE local_attractions (
        id UUID NOT NULL DEFAULT gen_random_uuid(),
        name STRING NOT NULL,
        address STRING NOT NULL,
        city STRING NOT NULL,
        PRIMARY KEY (id ASC)
    );

  2. Create a replication zone and pin the leaseholder to a specific region:

    ALTER TABLE local_attractions
        CONFIGURE ZONE USING
          num_replicas = 3,
          constraints = '{"+region=us-central":1}',
          lease_preferences = '[[+region=us-central]]';

  3. Create secondary indexes for the other two regions. Here, storing a column in the index improves the performance of queries that retrieve its value, but the stored column is not part of the index key, so it cannot be used to search the index:

    CREATE INDEX idx_west ON local_attractions (city)
        STORING (name);
    CREATE INDEX idx_east ON local_attractions (city)
        STORING (name);

  4. For these secondary indexes, define the replication zone, once again pinning the leaseholder to the relevant region:

    ALTER INDEX local_attractions@idx_west
        CONFIGURE ZONE USING
          num_replicas = 3,
          constraints = '{"+region=us-west":1}',
          lease_preferences = '[[+region=us-west]]';
    ALTER INDEX local_attractions@idx_east
        CONFIGURE ZONE USING
          num_replicas = 3,
          constraints = '{"+region=us-east":1}',
          lease_preferences = '[[+region=us-east]]';

  5. Now, let's execute SHOW CREATE TABLE to see how the table and its duplicate secondary indexes are defined:

    SHOW CREATE TABLE local_attractions;

The sample output is as follows:

SHOW CREATE TABLE local_attractions;

     table_name     |                create_statement

--------------------+--------------------------------------------------

  local_attractions | CREATE TABLE public.local_attractions (

                    |     id UUID NOT NULL DEFAULT gen_random_uuid(),

                    |     name STRING NOT NULL,

                    |     address STRING NOT NULL,

                    |     city STRING NOT NULL,

                    |     CONSTRAINT "primary" PRIMARY KEY (id ASC),

                    |     INDEX idx_west (city ASC) STORING (name),

                    |     INDEX idx_east (city ASC) STORING (name),

                    |     FAMILY "primary" (id, name, address, city)

                    | )

(1 row)

Time: 28ms total (execution 28ms / network 0ms)

Here, you can see that there are multiple identical indexes serving multiple regions. So, whenever there are queries involving city and name, they can be served locally, reducing read latency. Since we are maintaining identical indexes in multiple regions, writes are slower. This topology is useful when the data doesn't change much but is accessed frequently in all regions.
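One way to see this topology in action is to look at the query plan from a node in a given region. The following is a sketch rather than guaranteed output; the cost-based optimizer is expected to pick whichever of the identical indexes has the leaseholder closest to the gateway node:

    -- When run from a gateway node in us-east, this plan should use idx_east,
    -- whose leaseholder is pinned to that region
    EXPLAIN SELECT name FROM local_attractions WHERE city = 'new york';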

Next, we will go through the follower reads topology.

Follower reads

If you want low read latency but don't care about reading slightly older data, you can use this topology. In this case, you add the AS OF SYSTEM TIME clause to your reads, which avoids the round trip to the leaseholder, and the data is served locally. Writes still need cross-region consensus. You should not use this topology if you need strong consistency. Please refer to the CAP theorem section in Chapter 1, CockroachDB – A Brief Introduction, if you want to understand what consistency means here.

Let's create a sample table to understand how this works:

CREATE TABLE local_attractions (
     id UUID NOT NULL DEFAULT gen_random_uuid(),
     name STRING NOT NULL,
     address STRING NOT NULL,
     city STRING NOT NULL,
     PRIMARY KEY (id ASC)
);

Here, in the SELECT query, you should use AS OF SYSTEM TIME follower_read_timestamp(). The follower_read_timestamp() function returns a TIMESTAMP roughly equal to statement_timestamp() minus 4.8 seconds, which is far enough in the past for a nearby follower replica to serve the read:

SELECT city FROM local_attractions

    AS OF SYSTEM TIME follower_read_timestamp()

            WHERE city = 'san francisco';

Since the data is always retrieved locally, without involving the leaseholder, you might get stale or older data. Once again, do not use this topology if you need strong consistency.
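If you prefer to control the staleness bound yourself instead of letting follower_read_timestamp() pick one, you can pass an explicit negative interval. A minimal sketch; whether the read is actually served by a nearby follower still depends on the cluster's closed-timestamp settings:

    -- Read the data as it was 10 seconds ago; sufficiently old reads
    -- can be served by a local follower replica instead of the leaseholder
    SELECT city FROM local_attractions
        AS OF SYSTEM TIME '-10s'
        WHERE city = 'san francisco';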

Next, we will look at the follow-the-workload topology.

Follow-the-workload

This is the default topology if you don't use any of the previous ones. It works well if a given table is mostly active in a single region, meaning that most reads and writes for that table come from clients in that region. The read latency will be low in the active region and higher in the non-active regions, as the leaseholder will move to the active region. Writes still need cross-region consensus and can be slower.
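Since follow-the-workload is simply what you get when nothing is pinned, you can return a previously partitioned table to this behavior by discarding its custom zone configurations. A minimal sketch, assuming the users table and partitions from the earlier examples:

    -- Remove the per-partition pinning first...
    ALTER PARTITION sfo OF INDEX users@* CONFIGURE ZONE DISCARD;
    ALTER PARTITION aus OF INDEX users@* CONFIGURE ZONE DISCARD;
    ALTER PARTITION ny OF INDEX users@* CONFIGURE ZONE DISCARD;
    -- ...then drop any table-level override so the cluster defaults apply again
    ALTER TABLE users CONFIGURE ZONE DISCARD;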

The following is a table that should help you to decide which topology might be relevant for your database workload:

Figure 4.6 – Topology cheat sheet


In this section, we learned about various geo-partitioning topologies and how to configure them in CockroachDB. Based on latency, data consistency, and resiliency requirements, we should select the appropriate topology.

Summary

In this chapter, we learned what geo-partitioning is and why it is useful to geo-partition your data. Then, we covered all the important jargon in the cloud world, especially how each major cloud provider has defined regions and availability zones. We later discussed how to configure various multi-region topologies based on your application requirements.

In the next chapter, we will go over fault tolerance and auto-recovery with CockroachDB.
