Chapter 4. Data

This chapter covers

  • Accepting that data isn’t sacrosanct and can be inaccurate
  • Building alternative storage solutions with microservice messages
  • Representing data operations as microservice messages
  • Exploring alternatives to traditional data-management strategies
  • Storing different types of data

Using a central relational database is common practice for most software projects. Almost all data is stored there, and the system accesses the data in the database directly. The data schema is represented in the programming language used to build the system.

The advantage of a central database is that almost all of your data is in one place under one schema. You have to manage and administer only one kind of database. Scaling issues can usually be solved by throwing money at the problem. You’re building on the knowledge and practice of generations of software developers.

The disadvantage of a central database is that it provides a fertile breeding ground for technical debt. Your schema and stored procedures will become complex, and you’ll use the database as a communication layer between different parts of your system. You’ll have to accept that all of your data, regardless of business value, gets the same treatment.

The microservice architecture makes it possible to break out of this way of thinking. The decision to use a central relational database is just that: a decision. Microservices make it possible to use a wider variety of data-persistence strategies and fine-tune them depending on the kind of data you store. We’ll use the digital newspaper case study from chapter 2 to help you understand the new possibilities.

4.1. Data doesn’t mean what you think it means

To understand how to adopt this approach at a practical level, it’s necessary to reset your understanding of what data is. The abstract models used to describe data in a classical system are only a drop in the ocean of models that can describe data. Let’s question the implicit assumptions by making some assertions about data in practice.

4.1.1. Data is heterogeneous, not homogeneous

Some data in your system is mission critical. In the context of the newspaper example, consider payment records for reader subscriptions. You need to get these right. Other data is useful, but you can live without it (for example, what the most-read article on the site was yesterday). If you impose the same constraints on all of your data by using the same database for everything, you can’t adjust your data-persistence strategies to reflect the value of the data they’re handling.

The constraints on each data entity in your system are different. Awareness of this fact is a useful step toward making the right decisions regarding data storage. Consider some examples relevant to the newspaper:

  • User subscriptions— Details of what the user has paid for and which articles they can read. This data needs to have high accuracy, because it has a high impact on customer satisfaction. Modifications to this data need high levels of consistency and should probably use transactions. All services should use the same database, to achieve these desired data behaviors. A good, old-fashioned relational database is needed here.[1]

    1

    MySQL, Postgres, SQL Server, Oracle, and so on.

  • User preferences— Font size for articles; news categories to show on the home page; whether to hide article comments. These features improve the user’s experience with the newspaper but aren’t critical. Incorrect data doesn’t prevent the user from reading articles. It’s a good business decision to keep the cost of operating on this data as low as possible; and to do that, trading off some accuracy makes sense. If the font size is different than the reader’s preferred setting, it’s annoying but not fatal. A document store that isn’t too strict about schemas is helpful in this case.[2]

    2

    MongoDB, CouchDB, and so on.

  • Article read counts— Used as input to an analytics system, to get insight into the content that appeals to readers. The aggregate data matters here, along with the relationships between data—not exact numbers. In any case, you’re only guessing at user behavior: just because a user opens an article page doesn’t mean they’ll read it. High throughput that doesn’t cause latency anywhere else is what you want in this case. A simple key-value store will do.[3]

    3

    Redis, LevelDB, and so on.

The microservice architecture makes it much easier to apply these observations about your data in practice. Different microservices can deal with data in different ways and use different data-storage solutions. Over time, if data constraints change, you can also migrate data to more-appropriate storage solutions; when you do, you’ll only need to change individual microservices.

Figure 4.1 shows the proposed persistence strategy, diagramming only the relevant services. The user-profile, subs, and tracker services are new introductions. They look after user preferences, subscription payments, and tracking article views, respectively.

Figure 4.1. Using different databases for different microservices

The advantage of the microservice architecture isn’t so much that you can decide in advance to use all of these different databases; in the early days of a project, it’s most useful to use a schema-free document store for everything, because doing so gives you maximum flexibility. The real advantage is that later, in production, you can migrate to multiple fit-for-purpose databases if you need to.

4.1.2. Data can be private

The purpose of a relational database is to make data access easier. In practice, this means application code usually has access to all database tables. As the application expands and changes, more code accesses more of the database, and technical debt increases as this swamp of dependencies grows.

A nasty anti-pattern can easily develop. The database comes to be used as a communication mechanism between software components: different components read and write data to a shared set of tables. This implicit communication channel exists in many monolithic code bases. It’s an anti-pattern because communication happens outside of the expected channels, without well-defined semantics, and often without any explicit design. There are no control mechanisms to rein in the growth of complexity, so technical debt soon manifests.

To prevent this, it’s better to restrict access to data entities so that only business logic concerned with a given kind of entity modifies the storage table for that entity. This strategy is much easier to implement with microservices. Related data entities can be associated with limited groups of microservices. These microservices mediate access to the data entities, ensuring consistent application of business rules. No other microservices can access the data directly. Data access can even be completely hidden behind business-logic operations. Other times, the microservice may offer more-traditional read/write operations for individual entity instances. But in all cases, access to the data entity is represented by messages, rather than a data schema. These messages are no different from others in the microservice system, and thus data operations can be manipulated the same way as other messages.

In the context of the newspaper example, the User data entity is private to the user service, and the UserProfile data entity is private to the user-profile service. This is appropriate, because sensitive information such as usernames, password hashes, and recovery email addresses needs to be kept separate from user-preference settings such as font sizes and favorited articles. Even if these entities are stored in the same database in the initial versions of the system, you can enforce this separation by insisting that all access be via the responsible microservice. The correspondence between microservices and data entities isn’t necessarily one to one; and simplistic rules of thumb, such as one microservice per entity, shouldn’t be adopted as general principles.
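To make this concrete, here’s a minimal sketch of such mediation, assuming a Seneca-style seneca.add message API as used elsewhere in this book. The pattern and field names are illustrative, and a real service would use a proper data store rather than an in-process object:

const seneca = require('seneca')()

// The UserProfile storage is reachable only from inside this service.
// Other microservices never see the table or its schema; they send messages.
const profiles = {}

seneca.add({ role: 'user-profile', cmd: 'save' }, function (msg, reply) {
  // Business rules for the entity live here, in one place.
  const fontSize = msg.fontSize >= 8 && msg.fontSize <= 32 ? msg.fontSize : 12
  profiles[msg.userId] = Object.assign({}, profiles[msg.userId], {
    fontSize: fontSize,
    hideComments: !!msg.hideComments
  })
  reply(null, { ok: true })
})

seneca.add({ role: 'user-profile', cmd: 'load' }, function (msg, reply) {
  reply(null, profiles[msg.userId] || {})
})

A client service interacts with the entity only by sending messages such as seneca.act({ role: 'user-profile', cmd: 'load', userId: 'u1' }, ...), so the underlying storage can change without affecting it.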

Data entities

The term data entity is used in this chapter to refer to a single coherent data type, from the perspective of a given system domain.[4] Data entities are typically represented by relational tables, document types, or a specific set of key-value pairs. They represent a set of records that have a common internal structure: a set of data fields of various types.

4

Large systems may have multiple domains. Each domain is an internally consistent view of the world. Communication between domains necessarily involves some form of entity translation. Explicitly setting things up this way is known as domain-driven design (DDD). The microservice architecture can make DDD easier to achieve but should be considered at a lower level of abstraction than DDD. For more, read Domain-Driven Design by Eric Evans (Addison-Wesley, 2003).

These fields may be subject to constraints imposed by the database; they may also be subject to additional semantic interpretation (for example, they could be country codes). The term data entity is used only in an informal manner to indicate families of data.

4.1.3. Data can be local

A traditional database architecture uses a shared database as the only persistent store. Caching may be used for transient results, but data isn’t generally stored outside of the database. In the microservice architecture, you have more freedom to store data in separate locations. You can make data local to a microservice by restricting access to a database table, assigning a database exclusively to a microservice, or storing the data locally with the microservice.

Microservice instances can have local storage capabilities, perhaps using an embedded database engine such as SQLite or LevelDB. Data can persist for the duration of the microservice’s lifetime, which could be hours, days, or weeks. Data can also persist separately on virtual data volumes that are reconnected to running microservice instances.

Consider the user profile data. To scale to millions of users, you can use a set of user profile services and distribute user profiles evenly over them. You can achieve this even distribution by sharding the key space of users and using message translation to hide the sharding from the rest of the system.[5] Each user profile service stores user profile data locally. For fault tolerance, you run multiple instances of each shard service. When you need to read user profile data, the sharding service picks one of the shard instances to get the data, as shown in figure 4.2. When you need to write user profile data, you publish a message to all instances for the relevant shard. To populate a new shard instance, let it look for missing data on older sibling instances. You can do this on demand or as a batch job. Because adding new shard instances is easy, any given shard instance is disposable; if you detect data corruption in one, you can throw it away and bring up a replacement.

5

This ability to scale is a great example of the flexibility of microservices. You can start with just one service and not worry at all about how you’ll scale. You know that you’ll be able to use message manipulation to migrate to a different data-persistence solution if you have to.

Figure 4.2. Local user profile data stored only on shard instances
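Here’s a minimal sketch of the routing logic behind figure 4.2, assuming the shard is chosen by hashing the user key; the shard-instance names are illustrative:

const crypto = require('crypto')

// Each shard has several interchangeable instances holding the same local data.
const shards = [
  ['profile-shard-0a', 'profile-shard-0b'],
  ['profile-shard-1a', 'profile-shard-1b'],
  ['profile-shard-2a', 'profile-shard-2b'],
  ['profile-shard-3a', 'profile-shard-3b']
]

// Hash the user key so that users spread evenly over the shards.
function shardOf (userId) {
  const hash = crypto.createHash('sha1').update(String(userId)).digest()
  return hash.readUInt32BE(0) % shards.length
}

// Reads go to any one instance of the relevant shard.
function readTarget (userId) {
  const instances = shards[shardOf(userId)]
  return instances[Math.floor(Math.random() * instances.length)]
}

// Writes are published to every instance of the relevant shard.
function writeTargets (userId) {
  return shards[shardOf(userId)]
}

Because the sharding lives in the message-routing layer, the rest of the system sends ordinary profile messages and never knows the shards exist.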

This persistence strategy is vulnerable to loss of data accuracy. If a given update message is published and doesn’t reach all shard instances, then the user profile data will be inconsistent. How bad is this? It depends on your business goals. It may be perfectly acceptable to have a consistency level of 99%.[6] So what if the font size is wrong! Just set it again, and it will probably be saved this time. In the early stages of a project, before you have large numbers of users, this trade-off allows you to focus effort on other, more valuable features.

6

Here, consistency level is defined as the percentage of reads that match the most recent write.

But the day will come when you need to solve this problem. One angry user per week is materially different from 100, even if both numbers represent only 1% of users.

The solution is to delocalize the data in one of the following ways:

  • Use a single traditional database that all microservices access.
  • Use a versioned caching strategy.
  • Use a distributed data store with a conflict-resolution algorithm.
  • Move to a cloud database that provides what you need.
  • Separate reads from writes.[7]

7

The fancy name for this approach is Command Query Responsibility Segregation (CQRS), a family of asynchronous data-flow patterns, described by Martin Fowler at https://martinfowler.com/bliki/CQRS.html. The pattern-matching and transport-independence principles make CQRS more a matter of message-routing configuration than of fundamental architecture, so don’t be afraid to try it with your own microservices.

All of these solutions are more complex than storing everything locally and slow you down if you use them from the start.

Microservices make the quick-and-dirty local storage approach safer because you can defer migrating away from it until you actually need to. This migration is easier than you may think—the new database is just another sibling shard instance. We’ll discuss this kind of shard migration later in this chapter. For a practical example of the local data strategy, see chapter 9.

4.1.4. Data can be disposable

A useful consequence of local data is that it’s disposable. Local databases can always be reconstructed from each other or from the system of record (SOR).[8] Although this observation may seem manifestly true for in-memory caches, it now becomes true for more persistent data. When many instances of data-exposing microservices are running, each storing a subset of the full dataset, the individual instances aren’t critical to the correct functioning of the system.

8

None of the microservice data strategies preclude you from building a SOR, which is the ultimate authority on your data. They do make it much easier to protect the SOR from load and from performance issues.

Compare this to a more traditional sharding scenario, where there are many shards. Each shard must be managed properly and treated as a full database in its own right: maintained, upgraded, patched, and backed up. Microservice instances, on the other hand, aren’t as fragile in aggregate. Any individual data store is disposable. It doesn’t need the same level of management.

For local data, backups aren’t needed, because the multiplicity of instances takes that responsibility. The operational complexity of backing up hundreds of microservice database instances isn’t necessary. To do so would be an architectural smell. Maintenance isn’t required: you upgrade the database by deploying new instances, not modifying old ones. Patching is also unnecessary; you can apply the same tactic. Configuration changes go the same way.

This approach is particularly useful in the world of containers. Containers are designed to be transient, and the abstraction leaks when you have persistent data. You have to answer the question, “How do I connect my long-lived storage to database engines running in containers that come and go?” A common solution is to not do so at all and instead retain the monolithic model of a sacred, separate, database cluster. Another solution is to punt and use a cloud database. With local data, you have another option: the data is stored in the container and dies when the container dies. Local data fits within the container abstraction.

As always with system architecture, you must be aware of the downsides. When you’re running lots of local-data microservices and refreshing from the well of the SOR, you can easily drown the SOR with too many requests. Remember the basic best practice: microservice systems should change only one instance at a time. You need to control the rate of entry of new microservices to prevent flooding the SOR. Microservices aren’t practical without automation.

4.1.5. Data doesn’t have to be accurate

There’s an implicit assumption in most enterprise software development that data must be 100% accurate at all times. The unquestioning acceptance of this assumption drives the feature sets and cost implications of enterprise database solutions. One of the most useful consequences of using the microservice architecture is that it forces you to question this assumption. As an architect, and as a developer, you must take a closer look at your data. You don’t get to avoid making a decision about the cost implications of your chosen storage solution. Microservices are distributed, which allows them to scale. The trade-off is that you can’t assume a simplistic relational model that looks after accuracy for you.

How do you handle inaccurate data? Start by asking the business what level of accuracy is actually needed. What are the business goals? How are they impacted by data accuracy? Are you comparing the cost of high accuracy with the benefits? The answers to these questions are requirements that drive your technical approach. The answers don’t need to cover the entire dataset—different data has different constraints. Nor do the answers have to remain constant—they can change over time. The fine-grained nature of the microservice architecture makes this possible.

4.2. Data strategies for microservices

The microservice architecture gives you a greater degree of choice when it comes to data persistence. That presents a strategic problem: you need to choose the most effective strategies for handling data in the microservice context. This section provides a framework for decision making.

4.2.1. Using messages to expose data

A fundamental principle of the microservice architecture is that you limit communication between services to messages. This limitation preserves a desirable characteristic of the architecture: that services are independent of each other. This principle breaks down if you use the database for communication. It’s easy to end up doing this: if one service writes data to a table and another service reads data from the same table, the table has become a communication channel.

The problem is that you’re creating coupling between your services—exactly the coupling that generates technical debt. To avoid this, you need to keep data entities behind microservices. This doesn’t mean that there’s a one-to-one mapping between microservices and data entities, or that each table has its own microservice, or that one microservice handles all operations for a given database. These may all be true, but they’re not fundamental attributes of the architecture.

Keeping data entities behind microservices means exposing them only through messages. This follows the messages-first principle discussed in chapter 3. To perform a data operation, you send a message. If it’s synchronous, you expect a result. If it’s asynchronous, you don’t. The data operations can be traditional create/read/update/delete operations, or they may be higher-level operations defined by your business logic.

The data-operation messages give a unified interface to your data but don’t impose a requirement that your data be stored or managed in only one place. This gives you considerable freedom regarding the approach you take to data persistence and allows you to match persistence solutions to your performance and accuracy needs.

You have the freedom to use different databases for different kinds of data entities. In the online newspaper example, articles are suitable for a document-oriented database, user profiles are suitable for local storage, and payment records are suitable for a relational database with transaction support. In all cases, you can expose the data via messages, and the business-logic microservices have no knowledge of the underlying data stores.
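As a sketch of what this looks like in practice (again assuming a Seneca-style message API; the store interface is illustrative), the article data entity can be exposed purely through messages, with the backing store hidden behind a small interface:

const seneca = require('seneca')()

// Any store implementing get/add/remove/list can sit behind these messages:
// a document database in early iterations, a relational database later.
function makeMemoryStore () {
  const articles = {}
  return {
    get: (id, done) => done(null, articles[id]),
    add: (article, done) => { articles[article.id] = article; done(null, article) },
    remove: (id, done) => { delete articles[id]; done(null) },
    list: (done) => done(null, Object.keys(articles).map(id => articles[id]))
  }
}

const store = makeMemoryStore()

seneca.add({ role: 'article', cmd: 'get' }, (msg, reply) => store.get(msg.id, reply))
seneca.add({ role: 'article', cmd: 'add' }, (msg, reply) => store.add(msg.article, reply))
seneca.add({ role: 'article', cmd: 'remove' }, (msg, reply) => store.remove(msg.id, reply))
seneca.add({ role: 'article', cmd: 'list' }, (msg, reply) => store.list(reply))

The business-logic services only ever see the role:article messages; the store behind them is an implementation detail.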

You’re also free to switch databases. All you need is a microservice that translates between a given database solution and your data-operation messages. This isn’t the weak flexibility of limiting yourself to a common subset of SQL. Theoretically, that gives you the ability to at least change relational databases;[9] it certainly doesn’t let you move to NoSQL databases, because you’ll inevitably develop a reliance on the capabilities of the SQL language. The situation is different with messages. Because they’re the only way to talk to the data store, you automatically define a simple, small set of interactions with the database. Implementing these operations for a new database requires a feasible, self-contained amount of work.

9

In practice, variations in the syntax and semantics of each database’s SQL language dialect can trip you up.

Allowing database schemas to emerge

The ability to easily change databases opens up a highly effective rapid-development strategy. Even in a project scenario where you must ultimately use a relational database due to external constraints (such as a policy decision above your pay grade), you can still use a NoSQL database in the earlier stages of the project, as a “development tool.” The advantages of doing so are that you don’t have to decide about the schema design too early, you avoid the need for schema migrations, and you won’t be tempted to create dependencies between data entities (aka JOIN queries).

Once you’ve completed a significant number of iterations, discovered and unraveled the client’s deeper requirements, and gotten a clearer understanding of the business problem you’re modeling, you can move to a relational database. You take the implicit schema that has emerged within the NoSQL document set and use that to build an explicit schema in the SQL database.

4.2.2. Using composition to manipulate data

By representing data operations as messages, you allow them to be componentized. That is, after all, one of the core benefits you’re seeking from the microservice architecture. Consider the scenario shown in figure 4.3, using the article-page and article services from chapter 2. The article service exposes article data using the get/add/remove/list-article set of messages.[10] These are sent by the article-page service. The article-page service performs business-logic operations (formatting and displaying articles), and the article service exposes article data.

10

These names aren’t message types. Strictly, they refer to the messages that match certain patterns. These names are abbreviations of the patterns. The patterns will vary depending on the implementation, and thus the specific patterns aren’t relevant to this discussion.

Figure 4.3. Interaction between the article-page and article services

Chapter 2 briefly discussed the use of componentization to extend this structure. You can introduce new microservices that intercept the data-operation messages. For example, you can introduce a caching microservice that first checks a cache before querying the article service (see figure 4.4).

Figure 4.4. Composing the article-page and article services together
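As a concrete sketch of the interception in figure 4.4, here is a caching plugin written for a Seneca-style API in which a newly added action can call the prior action registered for the same pattern. The cache is just an illustrative in-process object; in the deployed system the same interception happens through message routing between separate services:

const articleCache = {}

module.exports = function article_cache () {
  const seneca = this

  // Override the existing role:article,cmd:get pattern. The article and
  // article-page services are unaware that this interception is happening.
  seneca.add({ role: 'article', cmd: 'get' }, function (msg, reply) {
    const cached = articleCache[msg.id]
    if (cached) return reply(null, cached)

    // Cache miss: fall through to the original article-service action.
    this.prior(msg, function (err, article) {
      if (err) return reply(err)
      articleCache[msg.id] = article
      reply(null, article)
    })
  })
}

Loading this plugin with seneca.use(article_cache), after the article service’s patterns are registered, is enough to put the cache into the message path.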

The key principle here is that the article-page and article services have no knowledge of this interception and behave as they always have. This message-interception strategy isn’t limited to caching. You can perform validation, access control, throttling, modification, or any other manipulation of that data. Interception is a general capability that a messages-first approach gives you: by confining data operations to messages, you can extend those operations with a well-defined component model.

This approach also lets you use alternative data-flow models. For example, the pathway that data takes when written doesn’t have to be the same as the pathway it takes when read. You can make your write operations asynchronous and your read operations synchronous. This is a common architecture when you need performance and are willing to accept eventual consistency to get it. As mentioned earlier, this family of data-flow models is known as Command Query Responsibility Segregation (CQRS); it adds complexity to your system and is usually considered worthwhile only when you need the performance benefits. In the message-oriented microservice world, this consideration doesn’t apply. Asynchronous writes and synchronous reads correspond to a message-routing configuration and are a natural capability of the system. There’s nothing special about the data-flow model, as shown in figure 4.5. From the perspective of the business-logic services, the model isn’t apparent and has no effect on implementation. Just as you can change databases more easily mid-project, you can change data flows if you have unanticipated performance needs.

Figure 4.5. CRUD message flow compared to CQRS message flow
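A sketch of how that routing choice can look from the article-page side, assuming a Seneca-style API in which client() routes matching patterns to remote services; the ports are illustrative, and the command-side and query-side services are assumed to be listening on them:

const seneca = require('seneca')()

// Reads are synchronous: route role:article,cmd:get to the query-side
// service and wait for a result.
seneca.client({ port: 9010, pin: 'role:article,cmd:get' })

// Writes are asynchronous: route role:article,cmd:add to the command side;
// the caller does not wait for the data to be persisted.
seneca.client({ port: 9020, pin: 'role:article,cmd:add' })

// Business logic is the same in both cases: it just sends messages.
seneca.act({ role: 'article', cmd: 'add', article: { id: 'a1', title: 'Hi' } })

seneca.act({ role: 'article', cmd: 'get', id: 'a1' }, function (err, article) {
  if (err) throw err
  console.log(article)
})

Whether the add message is handled by the same service as the get message, synchronously or asynchronously, is purely routing configuration.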

Exposing data operations as messages also gives you extra abilities. You can easily adopt a reactive approach: more than one microservice can listen to data-operation messages. You can also introduce data-modification announcement messages: when data changes in some way, you announce this to the world, and the world can react as it needs to.

In the online newspaper case, you’ll eventually need to provide a search feature so readers can search for articles. You can use a specialist search engine solution for this feature[11] and expose it via a microservice. When an article is saved, you need to make sure the search engine also indexes it. A simplistic approach is to define an index-article message that the search engine service acts on; the article service sends this message when it saves an article, as shown in figure 4.6.

11

Elasticsearch is a great choice: www.elastic.co.

Figure 4.6. Direct synchronous interaction with the search-engine service

This architecture isn’t reactive. It’s highly coupled for a microservice architecture, even though it uses patterns to avoid identifying individual services. The article service knows that it needs to send messages to the search-engine service and includes these messages in its workflow.

What happens if other services also need to know about changes to articles? Perhaps there’s an encoding service that creates downloadable PDFs for each article. A better approach is to announce that the article has changed and let interested parties subscribe to the changes. Figure 4.7 shows this asynchronous interaction.

Figure 4.7. Reactive asynchronous interaction with the search-engine service

In this approach, the article service sends an asynchronous message, article-info, which is observed by other microservices. You can easily extend your system to perform other actions when an article changes, without affecting the microservices dealing with article business logic.
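A minimal sketch of this reactive flow, using Node’s built-in EventEmitter as a stand-in for whatever asynchronous message transport you use:

const EventEmitter = require('events')
const bus = new EventEmitter()

// The article service announces the change and moves on; it neither knows
// nor cares who is listening.
function saveArticle (article) {
  // ...persist the article to its own store...
  bus.emit('article-info', { id: article.id, title: article.title })
}

// The search-engine service reacts by indexing the article.
bus.on('article-info', function (info) {
  console.log('indexing article', info.id)
})

// The PDF-encoding service reacts independently; adding it required no
// change to the article service.
bus.on('article-info', function (info) {
  console.log('queueing PDF generation for article', info.id)
})

saveArticle({ id: 'a1', title: 'Microservices ate my newspaper' })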

Dealing with missing updates

In the reactive data-update scenario, you can reasonably ask what happens if messages get lost. The answer is that your data will slowly become stale. Some articles will never be indexed, and some PDFs will never be generated. How do you deal with this?

You must first accept that your data will be inaccurate. You can’t avoid that. The question then becomes one of reconciling conflicts in the data. In some cases, you can generate the correct data on demand. For example, if a PDF is missing when a request comes in for it, you can generate the PDF. You’ll have to accept a performance hit for doing so, but this happens in only a small number of cases, because most messages won’t be lost.

What about stale PDFs, where the article has changed? In that case, use a versioning strategy to ensure that you serve the correct version. Something as simple as appending the version number to the filename of the PDF will work. Or use a hash of the article contents as part of the filename, if you’re paranoid.
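A sketch of the content-hash variant: the PDF filename is derived from the article content, so a stale file can never be served for newer content, and a missing file simply triggers regeneration on demand:

const crypto = require('crypto')

function pdfFilename (article) {
  const hash = crypto.createHash('sha256')
    .update(article.title + '\n' + article.body)
    .digest('hex')
    .slice(0, 12)
  return 'article-' + article.id + '-' + hash + '.pdf'
}

// Any change to the article changes the filename.
console.log(pdfFilename({ id: 'a1', title: 'Hello', body: 'World' }))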

In other cases, you’ll need to actively correct errors by continuously validating your dataset. In the case of the search engine, you won’t detect missing articles just by searching, but you can run a batch process to systematically verify that each article is in the search engine. The exact mechanism will vary with the search engine, although they generally expose some form of key-value view of the data that you can use. A more general solution is to use a distributed event log such as Kafka (https://kafka.apache.org) or DL (http://distributedlog.incubator.apache.org). These maintain a history of all messages, which you can replay against your data stores to ensure that all data-update operations have been effected. Unsurprisingly, you’ll need conflict resolution policies to make this work in practice. Making data-update operations idempotent will also greatly simplify things.

In addition, you gain the ability to test microservices easily. Business-logic microservices interact with your data infrastructure only via messages. You don’t need to install the data stores and their microservices; you can mock up the data interactions. This allows you to define unit tests without any overhead or dependencies, which makes unit testing extremely fast and easy. You’ll see a concrete example of this in chapter 9.

A common technique is to use a microservice that provides a simple in-memory database as your primary data store for local development. Each test then operates against a clean dataset. It also means your development and debug cycle is fast, because you have no dependency on a real database. You perform more-complete integration testing with a real database on your build and staging systems or even, occasionally, on your local machine.[12]

12

Using a container engine, such as Docker, to run transient instances of databases is convenient for testing.
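A sketch of such a test, assuming a Seneca-style API: the data layer is replaced by a stubbed message pattern, so the business logic can be exercised with no database at all. The patterns and the render logic are illustrative:

const seneca = require('seneca')()

// Stub out the data layer: the message pattern is the contract, not the database.
seneca.add({ role: 'article', cmd: 'get' }, function (msg, reply) {
  reply(null, { id: msg.id, title: 'Test article', body: 'Lorem ipsum' })
})

// The business-logic action under test only ever sees messages.
seneca.add({ role: 'article-page', cmd: 'render' }, function (msg, reply) {
  this.act({ role: 'article', cmd: 'get', id: msg.id }, function (err, article) {
    if (err) return reply(err)
    reply(null, { html: '<h1>' + article.title + '</h1>' })
  })
})

seneca.act({ role: 'article-page', cmd: 'render', id: 'a1' }, function (err, out) {
  if (err) throw err
  console.log(out.html) // expect: <h1>Test article</h1>
})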

You gain the ability to easily debug and measure data modifications. All data operations are represented as messages and so are subject to the same debugging and measurement tooling as the rest of your microservice system.

The message-oriented approach doesn’t apply to all data. It doesn’t make sense to embed large binary data in messages; instead, you can provide references to the binary data, which can then be streamed directly from specialist data stores.

4.2.3. Using the system configuration to control data

By representing data operations as messages, you gain control over data purely through the configuration of the system: which microservices handle which messages. This allows you to treat eventual consistency not as an absolute choice, but as something subject to fine-grained tuning. Consider again the user profile data for the newspaper. Users need to be able to view their profile data and update it; you’ve already seen one configuration for the profile-load and profile-save messages that serve these purposes. The user-profile service responds to these messages, but whether the messages are synchronous or asynchronous depends on your consistency configuration.

The user profile data isn’t critical to the basic functioning of the system, so it can be eventually consistent. This lets you, the architect, reduce costs and improve performance by using eventually consistent storage. On the other hand, a business requirement may emerge that requires strict consistency of the user profile data: perhaps the newspaper has a financial focus, and you’re adding a “track your share portfolio” feature.

In the eventually consistent case, the system can use an underlying database engine that provides eventual consistency.[13] You front this with multiple instances of the user-profile service that connect to one or more database nodes. In this architecture, you deliberately choose inconsistent data in order to improve performance and reduce costs. In practice, this means not every profile-load message will return the most recent version of the profile data. Sometimes, a profile-save will change a user’s profile and a subsequent profile-load, within a short period of time, will return old data. This interaction is shown in figure 4.8. It’s a business decision to accept this system behavior.

13

The separate nodes implementing the database typically communicate among themselves using a gossip protocol to converge on consistent data.

Figure 4.8. Eventually consistent data interactions

Alternatively, for the financial newspaper, you may decide to use a centralized relational database for the user profile data, because it contains share-portfolio details. In this case, the multiple instances of the user-profile service connect to the centralized relational database in the same manner as a traditional monolithic system, as shown in figure 4.9. This ensures consistency at the price of introducing the same scaling issues you have with traditional systems. But you isolate technical debt, so this situation is acceptable.

Figure 4.9. Immediately consistent data interactions

How does a message-oriented approach help you here? It allows you to move between these models without committing to them as fundamental architectural decisions. For the financial newspaper, before the portfolio requirement arose, the first version of the site worked fine with the eventually consistent model. But when a business requirement demanded stricter consistency, you could easily move to this model by changing your message-routing configuration. From the perspective of client services, the profile-load and profile-save messages appear to behave the same way.

You can also use this approach to implement flexible scaling, independent of the underlying database. Let’s return to the sharded approach to user profiles. As an architect, you need to decide whether this approach is a good choice for your problem context. It has the advantage of easy scaling. You can increase throughput (the number of requests you can handle per second) linearly; adding a new user-profile-shard service always adds the same amount of extra capacity. This happens because shards have no dependencies on each other. In systems that do have dependencies, increasing the number of elements tends to have diminishing marginal returns—each new element gives less capacity.

Although you can increase throughput arbitrarily by adding shards, you can’t decrease latency the same way. The path that each message follows is always the same and takes about the same time, so there’s a lower limit to latency.[14] Latency is higher than in a non-sharded system, because you have extra message hops. You can prevent increases in latency by adding new services, if the volume of data is causing performance issues in each shard. This architecture deliberately prefers throughput over latency.

14

You can’t decrease latency below the speed of light—the speed of light in optical fiber, that is, which is only about 66% of the speed of light in a vacuum.

Characteristic performance

It’s important to understand the fundamental relationship between system load, throughput, and latency. Typically, both throughput and latency deteriorate in a mostly linear fashion until a critical point is reached, after which an exponential collapse occurs: throughput drops dramatically, and latency climbs steeply. At this point, rescuing the system may involve considerable effort, because you have a backlog of failed operations to deal with as well as potentially inconsistent data.

In a production system, you won’t be able to detect load-induced failures in a predictable way, because the system may collapse too quickly. The workaround is to build systems that have linear performance characteristics and measure the failure point of individual elements. This gives you an upper bound to system capacity. To keep your system healthy, make sure that, as a rule of thumb, you preemptively have enough elements deployed to keep load below 50% of that measured capacity.

Sharding provides stronger fault tolerance. Shards are independent, and issues in a given shard only affect the data that it owns. Thus, it’s essential to ensure that you have an evenly distributed key space for shards (that’s why you use hashing to create shard keys). If a shard goes down, you lose access to only a subset of your data. A shard doesn’t have to be a single database or instance. Each shard can run in fault-tolerant configurations such as single-writer, multiple-reader. The advantage of sharding here is that when the writer fails, failover is limited to the data on the shard; you don’t have to reconcile across your entire dataset, so recovery is faster.

Sharding does have complications. Your data needs to be key-value oriented, because the entity key will be the most efficient retrieval mechanism. General queries must execute across all shards, and then you need to combine the results. You’ll have to write the code to do this. You’ll also have to run and maintain many databases, and automation will be essential. This has an up-front impact on the project-delivery timeline.
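As a sketch of that work, a general query becomes a scatter-gather operation: run it against every shard and merge the partial results. Here, shardClients is an illustrative array of per-shard query functions, each returning a promise of matching records:

function queryAllShards (shardClients, criteria) {
  const perShard = shardClients.map(function (queryShard) {
    return queryShard(criteria) // each shard evaluates the query on its own data
  })
  return Promise.all(perShard).then(function (resultSets) {
    // Combining and ordering the partial results is now your code to own;
    // the sort field here is illustrative.
    return resultSets
      .reduce(function (all, rows) { return all.concat(rows) }, [])
      .sort(function (a, b) { return a.created - b.created })
  })
}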

You also have to manage the complexity of shard transitions. Consider the case where you need to deploy each shard onto a larger machine. From a microservice messaging perspective, you want to be able to add a new version of the user-profile-shard service for a given shard and then migrate all relevant message traffic (profile-load/save messages, and any other profile messages—abbreviate them all with profile-*) and all existing data to the new shard, while having no downtime. Figure 4.10 shows a possible deployment sequence for doing so.

Figure 4.10. Deployment sequence to replace a shard

This deployment sequence demonstrates the standard approach to message reconfiguration in microservice systems. You systematically deploy updated versions of each service, modifying the message flows one step at a time. You use multiple staged versions of each service to do this so that you can avoid the need to change service behavior while a service is running (this is known as preserving immutability, which we’ll discuss in the next chapter). You also follow this staged-version strategy so you can verify, at each step, that the system still works. Zero downtime becomes a (solvable!) message-flow mini puzzle, rather than a sophisticated engineering challenge.

To use this deployment sequence in production, you’ll need to automate it and verify it at each step. It’s reversible and measurable, and thus safe. More important, it’s possible, and it takes less effort and creates far less risk than a traditional database migration.

The other sequences that are important are adding and removing shards. You add a shard to scale up and remove a shard to reduce complexity by using instances with more individual capacity. This change follows the same basic set of steps as a shard upgrade. Figure 4.11 shows the important steps in splitting shard/0 into shard/2 and shard/3.

Figure 4.11. Deployment sequence to split a shard

The microservice architecture allows you to route messages as desired. You have a flexible tool for controlling the performance and accuracy characteristics of your data, by routing data-access messages to different services. The importance of measurement when doing so can’t be stressed enough: you’re running a production system, and you need to verify that each transition leaves the system in a good state. There are only two transitions: adding a service instance and removing a service instance. By applying these in a controlled manner, you can keep your data safe under major structural changes, and do so while staying live.

4.2.4. Using weaker constraints to distribute data

It’s costly to keep distributed data accurate. Often, the cost of accuracy is greater than the benefit. You need to raise this trade-off explicitly with business decision makers so that you can design an appropriate system. When you make explicit the cost of complete data accuracy, or even high data accuracy, it enables a rational discussion with the business.[15] If you can define quantitatively the expected accuracy of the system, you can use weaker constraints on your data to achieve acceptable levels of cost by sacrificing accuracy.

15

We’ll discuss these issues in chapter 8.

One of the easiest, least damaging, and most widespread forms of constraint weakening is denormalization. Traditional best practice in data schema design recommends that you keep data in normal form, roughly meaning you shouldn’t duplicate data between tables unnecessarily. Instead, use separate tables and foreign keys. The problem with duplicated data is that it risks becoming inconsistent with the authoritative version of the data. This can happen for any number of reasons in a production system, because data is under constant modification.

But denormalization means data is local to where it’s needed. By keeping copies of dependent data with primary entities, you significantly improve performance and avoid additional database lookups. For example, in the newspaper system, each article has an author. You can store this fact as an author identifier in the article entity and then look up the author’s details from the author entity. Or you can duplicate the author’s then-current details into the article entity when you first create an article. You then have the author’s details available where you need them, saving a database query.

The trade-off is that subsequent changes to the author’s details will need to be propagated to article entities (if this is a requirement). Such propagation might be implemented as a batch process (with subsequent inaccuracy between batch runs) or as change-announcement messages that the article service can listen for (and some of which might be missed). With this update mechanism, article and author data are unlikely to remain entirely consistent. In the case of the newspaper system, you’d define an acceptable level of accuracy—say, 99%—and occasionally sample your dataset[16] to verify that this is the case (if doing so is deemed necessary—you might take the simpler approach of correcting any author inaccuracies on a case-by-case basis as they come to light).

16

Random sampling is a powerful, often-overlooked algorithmic approach. As long as you accept the principle that less than 100% accuracy is acceptable, you can use sampling to measure and solve many data-accuracy issues.
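A sketch of this denormalization, with change-announcement messages providing best-effort propagation; Node’s EventEmitter again stands in for the message transport, and the entity shapes are illustrative:

const EventEmitter = require('events')
const bus = new EventEmitter()

const articles = {}

function createArticle (id, title, author) {
  // Copy the author's then-current details into the article entity,
  // so reads never need a second lookup.
  articles[id] = { id: id, title: title, author: { id: author.id, name: author.name } }
}

// Best-effort propagation: if this message is missed, the copy goes stale,
// which is acceptable up to the agreed accuracy level.
bus.on('author-info', function (author) {
  Object.keys(articles).forEach(function (id) {
    if (articles[id].author.id === author.id) {
      articles[id].author.name = author.name
    }
  })
})

createArticle('a1', 'Cats in hats', { id: 'u7', name: 'A. Writer' })
bus.emit('author-info', { id: 'u7', name: 'A. Writer-Jones' })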

Denormalization is just one example of a more general approach: using conflict resolution after the fact (for example, fixing bad data) rather than conflict resolution before the fact (for example, strict transactions).[17] If you can define conflict resolution rules, you can weaken many of the constraints on your data processing. The conflict resolution doesn’t have to be entirely accurate, just sufficiently so. For example, you can use the last modification date to define a rule that the most recent version of a data entity “wins” when there’s a conflict. This makes the assumption that your separate machines have reasonably synchronized clocks; clocks are never entirely synchronized, but you can put in place synchronization protocols that make this a workable assumption. Nonetheless, sometimes old versions will win. How often that’s acceptable depends on your acceptable-error rate.

17

This isn’t a book on the theory of distributed data, which is a wide field. As a software architect, you need to be aware that after-the-fact conflict resolution is a workable strategy and that you may need to do some research into the best approaches for your project.

4.3. Rethinking traditional data patterns

The traditional structures for data are heavily derived from the relational model. Once you’re freed from that model, you can’t rely on the standard approaches to data design. In this section we’ll look at the alternatives.

4.3.1. Primary keys

How do you choose the correct primary keys for your data entities in a microservices context? With traditional enterprise data, you can fall back on the established choices for primary-key selection. You can use natural keys, such as the user’s email address, or you can use synthetic keys, such as an incrementing integer. Relational databases use indexing and data-storage algorithms[18] that are optimized for these kinds of keys. In a microservices context, don’t forget the distributed nature of the system: you can’t rely on the existence of a centralized database to choose keys for you. You need to generate keys in a decentralized manner. The challenge is to maintain the uniqueness of the keys.

18

The B+ tree structure is often used in database implementations. It has many entries per node, and this reduces the number of I/O operations needed to maintain the tree.

You could rely on the guaranteed uniqueness of a natural key like the user’s email address. This might seem like a great shortcut at first, but the key space isn’t evenly distributed—the distribution of characters in email addresses isn’t uniform. So, you can’t use the email address directly to determine the network location of the corresponding data. Other natural keys, such as usernames and phone numbers, tend to have the same issue. To counteract this, you can hash the natural key to get an even distribution over the key space. You then have a mechanism for scaling evenly as your data volume grows.

But there’s still a problem. If the natural key changes—a user changes their email address or username—then the hashed key will be different, and you may end up with orphaned data. You’ll need to perform a migration to resolve this issue.

As an alternative, you can use synthetic keys for data entities. In this approach, although you can query and search for the data entity via natural keys, the data entity itself has a synthetically generated unique key that isn’t dependent on anything, is evenly distributable, and is permanent. GUIDs, both standard and custom, are often used to achieve this goal.

GUIDs have two problems, though. First, they’re long strings of random characters, making them difficult for human eyes and brains to recognize quickly. This makes development, debugging, and log analysis more difficult. Second, GUIDs don’t index well, particularly in traditional databases. Their strength—their even distribution over the space of possible values—is a weakness for many indexing algorithms. Each new GUID can occur anywhere in the key space, so you lose data locality; every new entry could end up anywhere in the index data structure, necessitating movement of index subtrees. The advantage of an incrementing integer is that inserting new records is relatively efficient, because the key preserves index tree locality.

It’s possible to generate a distributed, unique, incrementing integer key, although you’ll probably need to build your own solution. This will give you the best of both worlds, but not without trade-offs. For acceptable performance, you can’t guarantee that the key increases monotonically: keys won’t be generated in strictly increasing order. Sorting by key only approximates the creation time of the entity, but this is a fact of life in distributed systems.
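One well-known shape for such a generator packs a timestamp, a node identifier, and a local counter into a single integer. The bit widths and the node-id source in this sketch are illustrative:

const NODE_ID = process.pid % 1024 // illustrative; use a configured node id in practice
let counter = 0

function nextKey () {
  counter = (counter + 1) % 4096
  const millis = Date.now()
  // 41 bits of time, 10 bits of node, 12 bits of counter, packed as a BigInt.
  // Keys are unique without coordination and only roughly ordered by creation time.
  return (BigInt(millis) << 22n) | (BigInt(NODE_ID) << 12n) | BigInt(counter)
}

console.log(nextKey().toString())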

There’s no single correct approach to key generation in a microservice architecture, but you can apply the following decision principles:

  • Synthetic keys are preferred, because they can be permanent and avoid the need to modify references.
  • GUIDs aren’t an automatic choice. Although eminently suited to distributed systems, they have negative effects on performance, especially in relation to traditional database indexes.
  • You can use integer keys. They can still be unique, but their ordering guarantees are weaker than those of traditional autoincrementing database keys.

4.3.2. Foreign keys

The JOIN operation is one of the most useful features of the relational model, allowing you to declaratively extract different views of your data based on how that data is related. In a distributed microservice world, the JOIN operation is something you have to learn to mostly live without. There’s no guarantee that any given data entities are stored in the same database, and thus no mechanism for joining them. JOIN operations in general, if performed at all, must be performed manually.

Although this may seem to be a crushing blow against the microservice model, it’s more of a theoretical problem than a practical one. The following alternatives to JOINs are often better choices:

  • Denormalization— Embed sub-entities within primary entities. This is particularly easy when you’re using a document-oriented data store. The sub-entities can either live entirely within the primaries, which makes life awkward when you need to treat the sub-entities separately; or they can live in a separate data store but be duplicated, which leads to consistency issues. You need to decide based on the trade-offs.
  • Document-oriented design— Design your data schemas so there are fewer references between entities. Encode small entities using literal values, rather than foreign-key references. Use an internal structure in your entities (for example, store the list of items in the shopping-cart entity).
  • Using update events to generate reports— Publish changes to data so that other parts of the system can act on them appropriately. In particular, reporting systems shouldn’t access primary stores, but should be populated by update events. This decouples the structure of the data for reporting purposes from the structure used for production processing.
  • Writing custom views— Use microservices to provide custom views of your data. The microservice can pull the various entities required from their separate stores and merge the data together. This approach is practical only when used sparingly, but it has the advantage of flexibility. If you’re using message-level sharding, you’ve already done most of the work.
  • Aggressive caching— With this approach, your entities will still contain references to other entities. These are foreign keys, but you must perform separate lookup operations for each referenced entity. The performance of these lookups can be dramatically improved when you use caching.
  • Using in-memory data structures— Keep your data in-memory if possible. Many datasets can be stored fully in-memory. With sharding, you can extend this across multiple machines.
  • Adopting a key-value stance— Think of your data in terms of the key-value model. This gives you maximum flexibility when choosing databases and reduces interrelationships between entities.

4.3.3. Transactions

Traditional relational databases provide transactions, so you can be sure your data will remain consistent. When business operations modify multiple data entities, transactions can keep everything consistent according to your business rules. Transactions ensure that your operations are linearly serializable: all transactions can be definitively ordered in time, one after the other. This makes your life as a developer much easier, because each operation begins and ends with the database in a consistent state. Let’s examine this promise in more detail.

Returning to the digital newspaper, suppose management has decided that micropayments are the wave of the future. Forget about subscriptions—selling articles for a dime each is the new business strategy.

The micropayments feature works like this: The user purchases article credits in bulk—say, 10 articles for $1, with each article costing 10¢. Each time the user reads an article, their balance is reduced, and the count of the articles they’ve read increases. This feels like something that would require traditional transactions. Figure 4.12 shows the workflow to read an article.

Figure 4.12. Operations performed to allow a user to read an article

To ensure data consistency, you wrap this in a transaction. If there’s a failure at any point, the transaction reverts any changes made so far, moving back to a known good state. More important, the database ensures that transactions are isolated. That means the database creates a fictional timeline, where transactions happen one after the other. If the user has only 10¢ left, then they should only be able to read one more article; without a transaction, the user would be able to read extra articles for free.

Table 4.1 shows a possible timeline for how this undesirable situation could happen. In this timeline, the user is reading articles on both their laptop and mobile phone, they have 10¢ remaining, and they’ve read nine articles. They should be able to read only one more article on one device. The timeline illustrates how data inconsistency can arise if you take a simplistic direct approach to data access.

Table 4.1. Timeline showing data inconsistency

Time | Balance | Articles | Laptop                  | Mobile
0    | 10¢     | 9        | Check balance is >= 10¢ |
1    | 10¢     | 9        |                         | Check balance is >= 10¢
2    | 0¢      | 9        | Reduce balance by 10¢   |
3    | -10¢    | 9        |                         | Reduce balance by 10¢ (<0!)
4    | -10¢    | 10       | Increment articles read |
5    | -10¢    | 11       |                         | Increment articles read

If you allow the data operations to interleave, you end up with inconsistent data. At the end of the example timeline, the user’s balance is -10¢, and they performed an action they weren’t supposed to. At least the article count is correct (11 read).

Database transactions ensure that these sequences of operations occur in isolation from each other; see table 4.2. The entire laptop operation flow completes first. Then the mobile flow starts; it fails because the user’s balance is insufficient. Transactions isolate operations from each other by imposing a linear flow of operations through time.[19]

19

The concept of linear serialization is a powerful simplification of the real world. It’s the useful fiction that there’s a universal clock and that everything that happens can be synchronized against this clock. This model of the world makes coding business logic much simpler. It’s a pity it fails so badly for distributed systems.

Table 4.2. Timeline with transactions

Time | Balance | Articles | Laptop                  | Mobile
0    | 10¢     | 9        | Check balance is >= 10¢ |
1    | 0¢      | 9        | Reduce balance by 10¢   |
2    | 0¢      | 10       | Increment articles read |
3    | 0¢      | 10       |                         | Check balance is >= 10¢ STOP

But transactions aren’t free. You need to keep all relevant data in the same database, which makes you beholden to the way that database wants to work. You’re almost certainly using relational schemas and SQL, and you’re prioritizing data consistency over user responsiveness. Under heavy load, such as a major breaking news story, these transactions will slow down your article delivery API.[20] You also need to code the transaction contexts correctly. It isn’t a trivial thing to define transactions and make sure they don’t deadlock.[21]

20

If you increase the number of database instances to handle more load, you’ll notice that doing so doesn’t give you more performance after a certain number of instances. The instances have to communicate with each other to implement distributed transactions, and this coordination overhead grows rapidly as the number of instances increases. This is exactly the wrong scaling characteristic.

21

A deadlock between transactions occurs when each transaction waits for the other to complete before proceeding. Because no transaction can complete, the system locks up.

Let’s consider an alternative. You can achieve many of your goals by using reservations. Reservations are exactly what they sound like: you reserve a resource for possible future use (just like a restaurant table or an airline ticket), releasing it later if you don’t use it.[22] Instead of checking the user’s balance, you reserve 10¢. If the user doesn’t have sufficient credit, the reservation will fail. You don’t need to wait for other transactions; you know immediately not to proceed. You can safely interleave operations, as shown in table 4.3.

22

The sample code for chapter 1 contains an example of a reservation system. See https://github.com/senecajs/ramanujan.

Table 4.3. Timeline with reservations

Time | Balance | Articles | Laptop                  | Mobile
0    | 10¢     | 9        | Reserve 10¢? YES        |
1    | 0¢      | 9        |                         | Reserve 10¢? NO
2    | 0¢      | 10       | Increment articles read |
3    | 0¢      | 10       | Confirm reservation     |

Reservations are faster because less coordination is required. But there’s a trade-off: you sacrifice data consistency. If you fail to deliver the article, it’s probably due to an error in the system, so you’re unlikely to be able to restore the user’s balance correctly.[23] In addition, the read-article counter may miss some articles if there are failures during the operation flow. How much do these inaccuracies matter? That’s a question for the business. Your task, as a systems architect, is to determine the acceptable error rates. These dictate the tools you’ll use. Transactions may be necessary, but most businesses like to keep the door open even when some shelves are empty.

23

You’ll have to handle orphan reservations; otherwise, all of your resources will end up reserved. Each reservation should have a time limit, and you’ll need to periodically review the list of outstanding reservations to see whether any are stale.

Implementing reservations

Reservations aren’t difficult to implement, but you need to be careful. A simplistic approach is to use a single service that maintains the reservations in-memory. This isn’t particularly fail-safe. A more suitable approach in production is to use atomic operations provided by in-memory stores such as memcached (https://memcached.org) and Redis (http://redis.io). You can even use a relational database: use an UPDATE ... WHERE ... query. In all cases, you’ll need a periodic cleanup batch process to review the list of open reservations and cancel those that are too old. For a deeper discussion of the reservation pattern, read the excellent book SOA Patterns (Manning, 2012, https://www.manning.com/books/soa-patterns), by Arnon Rotem-Gal-Oz.
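To make the sidebar's relational variant concrete, here's a minimal sketch of a reservation built on a single UPDATE ... WHERE statement as the atomic operation, plus the cleanup batch it needs. It again assumes Postgres via the node-postgres client; the accounts and reservations tables are illustrative.

const { Pool } = require('pg')
const pool = new Pool()

async function reserve(userId, cents, reservationId) {
  // Debit only if the balance covers the amount. The single statement is
  // atomic, so no explicit transaction or row locking is needed.
  const res = await pool.query(
    'UPDATE accounts SET balance_cents = balance_cents - $2 ' +
    'WHERE user_id = $1 AND balance_cents >= $2',
    [userId, cents]
  )
  if (res.rowCount === 0) return false   // insufficient balance: reservation refused

  // Record the reservation so it can be confirmed or released later. A crash
  // between the two statements loses this record; that's the kind of
  // inaccuracy the reservations approach consciously accepts.
  await pool.query(
    'INSERT INTO reservations (id, user_id, cents, created_at) VALUES ($1, $2, $3, now())',
    [reservationId, userId, cents]
  )
  return true
}

// Periodic cleanup batch: refund reservations that were never confirmed.
async function releaseStale() {
  const stale = await pool.query(
    "DELETE FROM reservations WHERE created_at < now() - interval '10 minutes' " +
    'RETURNING user_id, cents'
  )
  for (const row of stale.rows) {
    await pool.query(
      'UPDATE accounts SET balance_cents = balance_cents + $2 WHERE user_id = $1',
      [row.user_id, row.cents]
    )
  }
}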

The key observation here is that there are degrees of data accuracy, and you can make acceptable trade-offs to meet the demands of the business. A simple mental model for comparing these approaches is to chart performance versus accuracy, as shown in figure 4.13. Moving an approach from its position in this trade-off chart may be logically impossible or expensive.[24]

24

For example, if you want faster transactions, then you’ll need bigger machines to keep the number of instances low. Figure 4.13 shows the nominal trade-offs, given reasonable resources. The point of engineering isn’t that you can build a bridge that won’t fall—anyone can do that, given enough money. It’s that you can do so in a reasonable time for a reasonable cost.

Figure 4.13. Trade-offs when choosing conflict resolution strategies

There’s another benefit to the reservations approach that’s more important than performance: the flexibility to respond to changing requirements. You can place the reservation logic in a reservations microservice and the article-read counts in an analytics service. Now, you’re free to modify these separately and choose appropriate data stores for the different kinds of data.

Reservations are relatively easy to implement, because most data stores provide for atomic updates. The reservations strategy is just one example of an alternative to transactions that’s widely applicable in many scenarios.

4.3.4. Transactions aren’t as good as you think they are

Using database transactions doesn’t magically wash away the problem of dirty data. It’s a mistake to assume that transactions will give you consistent data without any further thinking required. Transactions come in different strengths, and you need to read the fine print to know what you’re getting. The promise that traditional databases make is that their transactions will satisfy the ACID properties:[25]

25

The material in this section is covered in glorious detail in most undergraduate computer science courses. We’ll skim the details for now, just as we did in our original studies!

  • Atomicity— Transactions either completely succeed or completely fail with no changes to data.
  • Consistency— Completed transactions leave the data compliant with all constraints, such as uniqueness.
  • Isolation— Transactions are linearly serializable.
  • Durability— Committed transactions preserve changes to data, even under failure of the database system.

Although these properties are important, you shouldn’t treat them as absolute requirements. By relaxing these properties, you open the floodgates to far more data-persistence strategies and trade-offs.[26] Traditional databases don’t deliver the full set of ACID properties unless you ask for them. In particular, the isolation property demands that transactions be linearly serializable. This demand can have severe performance penalties, so databases provide different isolation levels to mitigate the impact of full isolation.

26

For example, the Redis database positions itself as an in-memory data-structure server and synchronizes to disk only on a periodic basis.

The concept of isolation levels has been standardized.[27] There are four, listed here from strongest to weakest:

27

For more, see the SQL standard: ISO/IEC 9075-1, www.iso.org.

  • Serializable— This is full isolation. All transactions can be ordered linearly in time, and each transaction sees only a frozen snapshot of the data as it was when the transaction began.
  • Repeatable-read— Individual data entities (rows) remain fixed when read multiple times within a transaction, but queries within the transaction may return different result sets if other concurrent transactions insert or remove entities.
  • Read-committed— Individual data entities may change when read multiple times within the transaction, but only committed changes made by other concurrent transactions will be seen.
  • Read-uncommitted— Uncommitted data from other concurrent transactions may be seen.

In the context of the micropayments example, the direct approach corresponds to the read-uncommitted level. The transaction approach is satisfied by the repeatable-read level. The reservations approach doesn’t use transactions but could be considered analogous to the read-committed level, if you’re prepared to stretch the analogy.

The SQL standard allows database vendors to have different default isolation levels.[28] You might assume the serializable level without getting it. The standard also allows the database to use a higher level when necessary, so you might pay a performance penalty you don’t want, or need, to pay.

28

For example, Oracle and Postgres are read-committed by default, whereas MySQL is repeatable-read.
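If a particular level matters to you, ask for it explicitly rather than trusting the default. Here's a minimal sketch for Postgres, again via the node-postgres client; the helper name is illustrative.

const { Pool } = require('pg')
const pool = new Pool()

async function withSerializable(work) {
  const client = await pool.connect()
  try {
    // Request full isolation explicitly instead of relying on the default level.
    await client.query('BEGIN ISOLATION LEVEL SERIALIZABLE')
    const result = await work(client)
    await client.query('COMMIT')
    return result
  } catch (err) {
    await client.query('ROLLBACK')
    throw err   // serialization failures surface here and can be retried
  } finally {
    client.release()
  }
}

The stronger the level you request, the more contention and retrying you should expect under load, which is exactly the trade-off this chapter keeps returning to.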

Traditional databases don’t deliver on data consistency when you look closely. As a software architect, you can’t avoid your responsibilities by defaulting to using a traditional relational database. The database will make those decisions for you, and the results may not be the ones your business needs. It’s acceptable to choose a single-database solution when you’re building a monolith, because that approach makes your code less complex and provides a unified data-access layer. This forcing function on your decision making is washed away by the microservice approach. Microservices make it easier to match a data-storage solution to your data-storage needs, as well as to change that solution later.

4.3.5. Schemas draw down technical debt

The worst effect of relational databases is that they encourage the use of canonical data models. This is an open invitation to incur technical debt. The advantages of strict schemas that define a canonical data model are often outweighed by the disadvantages. It’s one of the primary positions of this book that this kind of trade-off should be made explicitly and consciously. Too often, the decision to use strict schemas is driven by convention, rather than analysis of the business context and needs of the system.[29]

29

A preference for strict schemas and contracts, versus “anything goes,” is predictive of a preference for strong versus weak typing in programming languages. There’s much debate about the subject and the relative merits of both positions. Jonathan Swift, author of Gulliver’s Travels, has the best perspective on this question: “Two mighty Powers have been engaged in a most obstinate War for six and thirty Moons past. It’s allowed on all Hands, that the primitive way of breaking Eggs, before we eat them, was upon the larger End: But his present Majesty’s Grand-father, while he was a Boy, going to eat an Egg, and breaking it according to the ancient Practice, happened to cut one of his Fingers. Whereupon the Emperor his Father published an Edict, commanding all his Subjects, upon great Penaltys, to break the smaller End of their Eggs. The People so highly resented this Law, that our Histories tell us there have been six Rebellions raised on that account; wherein one Emperor lost his Life, and another his Crown.”

A canonical data model, enforced via strict schemas, ensures that all parts of the system have the same understanding of a given data entity and its relationships. This means business rules can be consistently applied. The downside is that changes to the schema necessarily affect all parts of the system. Changes tend to come in big bangs and are hard to roll back. The details of the schema are hardcoded, literally, into the code base.

The model, and the schema describing it, must be determined at the start of the project. Although you can provide some scope for future changes, this is difficult to achieve in all cases. The initial high-level structure of the entities is almost impossible to change later, because too many parts of the system are frozen by implicit assumptions about the schema.[30]

30

Many years ago, I built a system for a stockbroking firm. The system could create historical share-price charts using pricing data from the mainframe. At the start of the project, I was given a list of indicators, price/earning ratios, and so forth. These numerical data fields were given for each stock, and they were all the same. I duly designed and built a relational model with an exact schema matching the indicators. Four weeks before go-live, we finally got access to the full dataset and imported it into the system—but there was a high rate of failure. In particular, the data for financial institutions couldn’t be imported. Upon investigation, it appeared that the data from the mainframe was corrupt. It certainly didn’t match the schema. I was told, “Of course the indicators are different for financial stocks—everybody knows that!” This was a catastrophe—we were close to the deadline, and there was no time for refactoring. I did the only thing I could think of: I shoved the data into incorrectly named columns and added a comment in the code.

And yet new business requirements must be met. In practice, this means the original schema is extended only where possible so it doesn’t break other parts of the system. The schema suffers, because it diverges from an ideal structure for new and emerging requirements. Other parts of the schema become legacy structures that must still be handled in the code base. Current business logic must retain knowledge of the old system; this is how technical debt builds.

To achieve rapid development, it’s better to be lenient. New code, delivering new features, shouldn’t be constrained by old code. How you deal with forward and backward compatibility shouldn’t affect your code. The microservice architecture approaches this problem as a matter of configuration, directing messages that make old assumptions to legacy microservices, and messages that make new assumptions to new microservices. The code in these microservices doesn’t need to know about other versions, and thus you avoid building technical debt.
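Here's a minimal sketch of that kind of routing, using the Seneca-style pattern matching used elsewhere in this book; the message fields and the schema marker are illustrative assumptions, not a prescribed convention.

const seneca = require('seneca')()

// Messages that make the old assumptions fall through to the legacy handler.
seneca.add({ role: 'article', cmd: 'save' }, function (msg, respond) {
  // ...legacy-schema logic lives here, untouched...
  respond(null, { ok: true, handled: 'legacy' })
})

// Messages carrying the new marker match the more specific pattern and are
// handled by new code that knows nothing about the old schema.
seneca.add({ role: 'article', cmd: 'save', schema: 'v2' }, function (msg, respond) {
  respond(null, { ok: true, handled: 'v2' })
})

// Senders are unaffected; which version handles a message is decided by
// message content and configuration, not by code that knows both versions.
seneca.act({ role: 'article', cmd: 'save', title: 'old shape' }, console.log)
seneca.act({ role: 'article', cmd: 'save', schema: 'v2', title: 'new shape' }, console.log)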

4.4. A practical decision guide for microservice data

The principal benefit of microservices is that they enable rapid development without incurring large amounts of technical debt. In that context, data-storage choices are skewed toward flexibility and away from accuracy. The scaling benefits of microservices are important, but they should be a secondary consideration when you’re deciding how to proceed. The following decision tree is a rough guide to choosing appropriate data-storage solutions at key points in the lifecycle of your project.

The databases recommended for the different scenarios are broadly classified as follows:

  • Memory— The entire volume of data is stored in memory in convenient data structures. Serialization to disk is optional and often occasional, although some solutions can offer strong safety guarantees. This storage category includes not only specific memory-oriented solutions, such as Redis and memcached, but also custom microservices.
  • Document— Data is organized as documents. In the simplest case, this reduces to a key-value store, such as LevelDB or Berkeley DB. More-complex storage solutions, such as MongoDB, CouchDB, and CockroachDB, offer many of the querying features of the relational model.
  • Relational— The traditional database model, as offered by Postgres, MySQL, Oracle, and others.
  • Specialist— The solution is geared toward data with a specific shape, such as time series (for example, InfluxDB) or search (Elasticsearch), or offers significant scaling (Cassandra, DynamoDB).

This classification isn’t mutually exclusive, and many solutions can operate in multiple scenarios.

Cutting across this classification is the further decision to make the data local or remote to the service. This selection can’t be made freely, because some storage solutions are much easier to run locally. All options are also subject to caching as a mechanism to improve performance and reduce the impact of database choice.

Regardless of the database you choose, it’s important to prevent the database from becoming a mechanism for communication. At all times, keep strict boundaries around microservices that can access data, even when multiple microservices are accessing the same database.

Let’s look at the data-persistence decision from the perspective of greenfield and legacy scenarios.

4.4.1. Greenfield

You’re mostly free to determine the structure, technologies, and deployment of your project, and most code is new.

Initial development

During this early phase of the project, substantive requirements are discovered. You need to be able to rapidly change design and data structures:

  • Local environment— Your development machine. Data access is performed entirely via messages, so you can use in-memory data storage, which gives you complete flexibility, fast unit tests, and clean data on every run (see the sketch after this list). Local, small installations of document or relational databases are necessary only when you need database-specific features. You should always be able to run small subsets of the system and mock data-operation messages.
  • Build and staging— These are shared environments that typically need some test and validation data to persist, even if the main body of data is wiped on each run. But you need to retain the flexibility to handle schema changes emerging from requirements uncertainty. A document-oriented store is most useful here.
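Here's the kind of in-memory stand-in the local-environment bullet has in mind, again sketched with Seneca-style messages; the patterns are illustrative.

const seneca = require('seneca')()

// The whole "database" is a plain object; every test run starts clean.
const store = {}

seneca.add({ role: 'store', cmd: 'save' }, function (msg, respond) {
  store[msg.id] = msg.data
  respond(null, { ok: true, id: msg.id })
})

seneca.add({ role: 'store', cmd: 'load' }, function (msg, respond) {
  respond(null, { ok: true, data: store[msg.id] || null })
})

// Calling code never knows it's talking to memory rather than a database,
// so swapping in a document or relational store later is a deployment
// decision, not a code change.
seneca.act({ role: 'store', cmd: 'save', id: 'u1', data: { font: 'large' } },
  function () {
    seneca.act({ role: 'store', cmd: 'load', id: 'u1' }, console.log)
  })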
Production phase

The scope of the system is now well defined. The system has either gone live or is on track to do so, and you must maintain it. Maintenance means not only keeping the system up and within acceptable error rates, but also incorporating new requirements:

  • Rapid development— The business focus of the system is on rapid development. In this scenario, you gain development speed by retaining the use of document-oriented storage, even when this means lower accuracy. You can also consider keeping certain data local to the service, because it’s independent of other services. Specialist databases also have a place in this scenario, because they allow you to take advantage of data-specific features.
  • Data integrity— If the business focus is on the accuracy of the data, you can fall back on the proven capabilities of relational databases. You represent all data operations as messages, so migrating to a relational model is relatively easy.

4.4.2. Legacy

You must work with an existing system in production, with slow deployment cycles and significant technical debt. The data schema is large and has many internal dependencies.

Initial development

You must determine what strategies you can adopt to mitigate complexity in the context of the project’s business goals, as well as develop an understanding of the legacy system’s constraints:

  • Local environment— If you’re lucky, you may be able to run some elements of the legacy system locally. Otherwise, you may be dependent on network access to staging systems or even production. In the worst cases, you have no access—just system documentation. You can still represent the legacy system using messages, and doing so enables you to develop your own microservices against those messages. If you’re developing new features, you can create greenfield bubbles within which to make better data-storage choices.
  • Build and staging— Document-oriented stores offer the most flexibility in these environments. You may also have the opportunity to validate your message layer against a staging version of the production database.
Production phase

You’re running live with both your new code and the legacy system. Your new data-storage elements are also running live. Using messages to wrap the legacy system gives you the benefits of microservices in the new code:

  • Rapid development— Your data-storage solution uses the legacy system of record (SOR) only as a last resort, and you’re rapidly migrating data over to your new database choice. Even if the legacy database remains in production as the SOR, you can make a strategic decision to reduce data accuracy by developing against more-flexible databases. A common architecture is to duplicate legacy relational data into a document store.
  • Data integrity— The legacy system remains in place as the SOR. Paradoxically, this gives you greater freedom to choose data-storage solutions, because you can always correct the data they contain using the SOR. Nonetheless, you’ll pay the price of querying the SOR at a higher volume. A mitigating strategy is to generate data-update events from the SOR, if possible; a sketch of this approach follows this list.
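Here's a minimal sketch of that mitigation: a change-capture process in front of the SOR is assumed to emit data-update messages, and a new microservice duplicates them into the more flexible store that new features read from. The patterns and the in-memory stand-in are illustrative, Seneca-style assumptions.

const seneca = require('seneca')()

// Stand-in for the flexible document store the new code reads from.
const docs = {}

// Assumed to be emitted by change capture in front of the legacy SOR
// whenever a record changes.
seneca.add({ role: 'sor', event: 'updated' }, function (msg, respond) {
  docs[msg.entity + '/' + msg.id] = msg.data   // duplicate; don't join back to the SOR
  respond(null, { ok: true })
})

// New features read the duplicate and never query the SOR directly.
seneca.add({ role: 'subs', cmd: 'get' }, function (msg, respond) {
  respond(null, { ok: true, data: docs['subscription/' + msg.id] || null })
})

seneca.act({ role: 'sor', event: 'updated', entity: 'subscription', id: 's1',
  data: { plan: 'monthly', active: true } }, function () {
  seneca.act({ role: 'subs', cmd: 'get', id: 's1' }, console.log)
})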

4.5. Summary

  • The relational model and the use of transactions aren’t an automatic fit for every enterprise application. These traditional solutions have hidden costs and may not work as well as you expect.
  • Data doesn’t have to be completely accurate. Accuracy can be delayed, with data being eventually consistent. You can choose error rates explicitly, with the business accepting a given error rate on an ongoing basis. The payoffs are faster development cycles and a better user experience.
  • The key principles of microservice messages, pattern matching, and transport independence can be applied to data operations. This frees your business logic from complexity introduced by scaling and performance techniques such as sharding and caching. These are hidden, and you can deploy them at your discretion.
  • Traditional techniques such as data normalization and table joins are far less useful in a microservice context. Instead, use data duplication.