Travel agents started around the end of the 19th century. The idea was that they would organize your holiday for you and, in return, take a fee for doing so. A modern travel agent will organize flights (some even have their own airlines), book hotels or accommodation, arrange transport to and from the airport, and offer excursions and trips.
I’m writing this in a unique time. I live in the UK, and the country, and indeed the world, has been thrown into turmoil by a pandemic, the like of which hasn’t been seen for 100 years! This is a relevant point because the subject of this particular chapter has been especially hard hit by the measures taken to control the virus. I intend, in no way, to make light of this situation; however, a travel agency is an excellent example of a distributed transaction, and so I decided to keep the case study as is.
When you ask a travel agent to book a holiday, you would expect them to inform you that they had successfully booked the holiday on your behalf or that the dates that you requested were not available. You would not expect to be informed that they had booked the flight and transfer from the airport, yet the hotel was full that week; you would expect the system to check all the systems for availability, and to either book all the requested services, or none of them.
In this chapter, we will investigate a system such as may be found at a travel agent – that is, where multiple third-party external systems must work together. We will investigate how to create a distributed transaction across multiple services.
Background
Lunar Polly Travel has submitted a brief for us to design a system for their brand-new travel agency. They wish to sell trips to the moon, and they are partnering with a well-known space travel company to do so.
The system is currently expected to be low volume; however, they hope that demand will rapidly increase, as people get used to the idea of space travel.
The company has recruited a well-respected and well-known figure within the travel industry as the managing director of this new firm. You have spoken to the managing director, and he has given you a list of high-level requirements. Let’s see, in detail, what the requirements are.
Requirements
Unlike in previous chapters, in this situation, we are dealing with a company that does not exist yet. While the managing director may be an expert in his field, there is no existing system to model the new one on – not even a manual system. This is a factor that should be carefully considered – while the person describing the requirements may know the industry (in this case, that’s not even likely, as it’s a new industry), he may not know the specific challenges faced by the staff, the customers, or the vendors.
Just to be clear, what I’m saying here is not that you should only ever accept jobs in well-known industries, with well-trodden paths; what I’m saying is that you should ensure that the software design can cope with whatever scenarios you can consider. A good method of doing that is to design the customer journey and then play it through with someone else.
I am personally not an expert in any of the industries mentioned in this book. That statement, obviously, goes double for the industry of space tourism! The problems that I’ve created are for the purpose of demonstrating potential solutions, which can be applicable cross-industry.
Must be able to book a space flight.
Book a hospital check before the launch.
Book a hotel on the night before the launch.
If any of the bookings can’t be made, then none can be made (there’s little point in a hotel room if the flight is cancelled).
The system must be able to scale massively, as the MD expects the demand to grow exponentially.
All of the systems have a public API that we can query, and use to book the relevant resource. Each one of the providers has offered to work with us to get this running and so are willing to change their APIs or systems within reason.
Let’s consider how we could achieve this solution.
Options
The central issue here is that of a transaction. Actually building a system to call these APIs is trivial; however, we have an issue with ensuring that we either book all or none of the selected services.
We also have an issue with scaling – while it may be trivial to call the APIs, we should consider what would happen if we were to get two conflicting bookings: if we book a flight for person A and a hotel for person B and then try to book the flight for person B, we’ll find that we’ve already used that booking up for someone else.
Manual Process
As with previous chapters, let’s investigate how we could achieve this through a manual process to help us better understand the business domain.
1. Contact the space flight provider to book the flight itself.
2. Phone the hospital to book the pre- and postcheck.
3. Ring the hotel to book a room for the night before the launch.
That seems a very easy process; however, what would we do if at stage 2, the sales assistant found out that the hospital was too busy on that day? Well, in that case, we would try to cancel the space flight.
Should we have a similar problem with the hotel, then we would need to cancel the two bookings.
An alternative process would be to ask each provider to reserve first, rather than to book:
1. Phone the space shuttle company and ask them to reserve the day that we wish to book.
2. Contact the hospital and do the same.
3. Ask the hotel to reserve a room for that night.
Then, if any of these are unavailable, we could get back to the others and ask them to cancel the reservation; otherwise, we could book each service.
This looks like a much better system, but let’s put ourselves in the position of, say, the hospital. They are busy, and they are essentially tying up one of their appointments until we get back to them; as a result, they may need to turn away other bookings for that day. The same applies to the space shuttle company, although the costs involved there are much larger.
The process that we’re discussing here in various forms is a transaction.
Transactions
Consider a transfer of £300.00 between two bank accounts:
1. Account 1: Debit £300.00
2. Account 2: Credit £300.00
If I contact my bank and ask them to transfer this money, the preceding process is a simplified version of what happens: they take the money out of one account, and they put it in another – a two-stage process. Now, what would happen if in between 1 and 2, the system crashed? Without a transaction to wrap around these two activities, Account 1 would still be debited, but the money would simply disappear – it would never make it to Account 2.
A transaction around the two activities ensures that either all of these activities happen, or none do. The valid result states are either the transaction failed and Account 1 still has the £300, or the transaction succeeded and Account 2 now has that money, but Account 1 does not. In order to ensure the validity of this data, most database systems implement a concept of an ACID transaction.
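The all-or-nothing behaviour described above can be sketched with an in-memory SQLite database. This is a minimal illustration rather than anything from the book's repository: the `accounts` table, the `transfer` function, the `crash_midway` flag, and the £500 starting balances are all assumptions made for the example, and Python is used purely for brevity.

```python
import sqlite3


def transfer(conn, source, target, amount, crash_midway=False):
    """Debit one account and credit another inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, source),
            )
            if crash_midway:
                # Hypothetical failure between the debit and the credit.
                raise RuntimeError("simulated crash")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, target),
            )
    except RuntimeError:
        pass  # the debit has already been rolled back


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 500)])
conn.commit()

transfer(conn, 1, 2, 300, crash_midway=True)
after_crash = balances(conn)      # both accounts unchanged

transfer(conn, 1, 2, 300)
after_success = balances(conn)    # money has moved, none has vanished
```

In both outcomes the two balances still add up to £1,000, which is the property the transaction is there to protect.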
ACID
ACID is, in fact, an acronym; it stands for Atomic, Consistent, Isolated, and Durable.
This may seem like a divergence from the subject in hand, but as with many such things in computer science, new problems have typically already been solved years ago, but in a different environment.
I won’t delve into every tenet in detail, although the concepts are very straightforward.
Atomic
We’ve already seen what atomicity means: it guarantees that either all parts of a transaction succeed or none of them do.
Consistent
Consistency is concerned with the integrity of your data – this means different things depending on the database, but let’s imagine that we have a sales order and sales order lines; assuming the database was set up such that they were linked in a primary / foreign key relationship, it would not adhere to the rule of Consistency if the sales order record were allowed to be removed, but the lines left orphaned in the database.
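The sales order example can be tried out with SQLite, which will reject the delete that would orphan the lines. This is only a sketch: the table and column names are invented for the illustration, and SQLite needs foreign key enforcement switched on explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when asked to.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE sales_order (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE sales_order_line ("
    " id INTEGER PRIMARY KEY,"
    " order_id INTEGER NOT NULL REFERENCES sales_order(id))"
)
conn.execute("INSERT INTO sales_order VALUES (1)")
conn.execute("INSERT INTO sales_order_line VALUES (10, 1)")
conn.commit()

try:
    # Removing the order would leave line 10 pointing at nothing.
    conn.execute("DELETE FROM sales_order WHERE id = 1")
    delete_rejected = False
except sqlite3.IntegrityError:
    delete_rejected = True

remaining_orders = conn.execute("SELECT COUNT(*) FROM sales_order").fetchone()[0]
```

The database refuses the delete, so the order and its line both remain and the data stays consistent.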
Isolation
In fact, the principle of isolation is the one that we are primarily concerned with here: this relates to concurrency, that is, to what happens when transactions or operations are executed at the same time. The rule here is deceptively simple: where operations occur concurrently, the database should be left in the same end state as if the transactions were executed sequentially.
Transaction 1 | Transaction 2
---|---
Account 1: Debit £300 | Print Account 1
Account 2: Credit £300 | Print Account 2
Transaction 2 does not make any changes to the database. This is still a valid transaction (in SQL databases, this would be a SELECT statement).
We’ll assume a starting balance of £500 in account 1 and account 2.
T1, T2: Transaction 2 prints £200 for account 1 and £800 for account 2
T2, T1: Transaction 2 prints £500 for account 1 and £500 for account 2
There are no other valid states here; that is, there is no other sequence in which we could execute these two transactions.
This is clearly an oversimplified example; in a real database system, you may find dozens or even hundreds of transactions executing concurrently; however, the same rule applies for isolation.
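The two valid outcomes can be reproduced by running the transactions one after the other in each possible order. The sketch below assumes the £500 starting balances from the text; the function names are invented for the illustration.

```python
def t1_transfer(accounts):
    """Transaction 1: move £300 from account 1 to account 2."""
    accounts[1] -= 300
    accounts[2] += 300


def t2_print(accounts):
    """Transaction 2: read both balances without changing anything."""
    return (accounts[1], accounts[2])


def run_serially(order):
    """Run whole transactions in the given order; return what T2 printed."""
    accounts = {1: 500, 2: 500}
    printed = None
    for transaction in order:
        result = transaction(accounts)
        if result is not None:
            printed = result
    return printed


valid_outputs = {
    run_serially([t1_transfer, t2_print]),  # T1, T2
    run_serially([t2_print, t1_transfer]),  # T2, T1
}
```

Any concurrent execution that makes Transaction 2 print something outside `valid_outputs` has broken isolation.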
Step | Transaction 1 | Transaction 2
---|---|---
1 | Account 1: Debit £300 | Print Account 1
2 | |
3 | Account 2: Credit £300 | Print Account 2
4 | |
T1: 1, 2
T2: 1, 2, 3, 4
T1: 3, 4
Account 1: £200
Account 2: £500
If we look back at our list of valid states – this is neither, and so this path of execution would not be considered isolated.
Some databases actually provide a facility for you to breach isolation in just this way; it’s generally termed a dirty read and essentially allows you to look at uncommitted data.
T2: 1, 2
T1: 1, 2, 3, 4
T2: 3, 4
Account 1: £500
Account 2: £800
Again, this is not one of the valid states that we listed earlier. Finally, we should consider what would happen if T1 were to crash at point 2: this could leave the database in a situation where not only are the displayed values incorrect but, as we stated earlier, the money has disappeared.
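Both problem interleavings can be replayed in a few lines. This is a sketch only: each transaction is split into two named steps (an assumption made for the illustration), and the schedule decides the order in which they run.

```python
# What Transaction 2 may legitimately print if the two transactions run serially.
VALID_SERIAL_OUTPUTS = {(200, 800), (500, 500)}


def run(schedule):
    """Execute the steps in the order given; return what Transaction 2 printed."""
    accounts = {1: 500, 2: 500}
    seen = {}
    for step in schedule:
        if step == "T1 debit":
            accounts[1] -= 300
        elif step == "T1 credit":
            accounts[2] += 300
        elif step == "T2 print 1":
            seen[1] = accounts[1]
        elif step == "T2 print 2":
            seen[2] = accounts[2]
        else:
            raise ValueError(f"unknown step: {step}")
    return (seen[1], seen[2])


# T2 runs in the middle of T1 and reads the uncommitted debit.
dirty_read = run(["T1 debit", "T2 print 1", "T2 print 2", "T1 credit"])

# T1 runs in the middle of T2, so T2 sees one old and one new balance.
split_read = run(["T2 print 1", "T1 debit", "T1 credit", "T2 print 2"])
```

Neither result appears in `VALID_SERIAL_OUTPUTS`: the first shows £300 missing, and the second shows £300 that has appeared from nowhere.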
Durable
Finally, we have durability; arguably, without this, there’s little purpose in a database. Essentially, it means that once you’ve committed some data to a database, it remains there.
In Chapter 2, we discussed the principle of event sourcing. If you consider that this method never actually changes data at all, and only ever appends to it, then each transaction is effectively dealing with its own immutable data set. This means that an event-sourced system avoids many of the isolation problems described here, provided that each append is itself atomic and durable.
Now that we understand what a transaction is, we should see if we can apply this to our scenario; because we’re dealing with disparate systems as part of our transaction, what we’re actually talking about is a distributed transaction.
Distributed Transactions
A distributed transaction is a method to provide the functionality that we expect from a standard transaction across multiple systems. Let’s see how that works and whether it meets our requirements.
Figure 3-1 illustrates the sequence flow for a distributed transaction. The transaction coordinator essentially polls each participant in the transaction and asks if they are ready to begin the transaction. If the coordinator receives a confirmation from every participant, then it issues a commit instruction and, again, expects each participant to confirm that they have committed.
Each participant in this transaction is required to keep a persisted state of the transaction following the lock, until the data is committed, or an abort message is sent. The coordinator would need to persist the state of the transaction after the first commit is sent; it would then repeatedly resend the commit to the participants until it receives an acknowledgment from each of them.
The transaction in the diagram would have a unique reference, which would be distributed during the initial message. From this point on, any communication would refer to this reference.
This seems an eminently usable system; admittedly, it’s a little chatty and likely slow, but we have a guarantee that all the bookings are made, or none are made. However, we should consider how this system deals with failure.
If a participant receives a message for a transaction that it wasn’t aware of, it should abort, unless it’s a commit message – in which case, it should acknowledge (without committing).
Participants must persist the transaction state between locking to unlocking.
Once a participant has confirmed in response to a prepare, it must wait for a commit or an abort.
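The prepare and commit flow, together with the rules above, can be sketched in memory as follows. This is an illustration under stated assumptions and not a production protocol: the `Participant` class and `run_transaction` function are invented names, a dictionary stands in for each participant's persisted log, and there is no network, crash, or retry handling.

```python
class Participant:
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        # Transaction reference -> state; stands in for a persisted log.
        self.state = {}

    def prepare(self, txn_id):
        if not self.available:
            self.state[txn_id] = "aborted"
            return False
        self.state[txn_id] = "prepared"  # the resource is now locked
        return True

    def commit(self, txn_id):
        # A commit for an unknown transaction is acknowledged without committing.
        if self.state.get(txn_id) == "prepared":
            self.state[txn_id] = "committed"
        return True  # acknowledgment

    def abort(self, txn_id):
        if self.state.get(txn_id) != "committed":
            self.state[txn_id] = "aborted"  # the lock is released
        return True


def run_transaction(txn_id, participants):
    """Phase 1: ask everyone to prepare. Phase 2: commit only if all agreed."""
    prepared = []
    for participant in participants:
        if participant.prepare(txn_id):
            prepared.append(participant)
        else:
            for other in prepared:
                other.abort(txn_id)
            return "aborted"
    for participant in participants:
        participant.commit(txn_id)
    return "committed"


trip = [Participant("space flight"), Participant("hospital"), Participant("hotel")]
happy_path = run_transaction("txn-1", trip)

busy = [Participant("space flight"), Participant("hospital", available=False)]
hospital_full = run_transaction("txn-2", busy)
```

In the second run the space flight provider has already prepared when the hospital refuses, so the coordinator sends it an abort and its lock is released.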
Let’s consider some possible scenarios.
Possible Scenarios
The hospital has no available appointments on that day
1. The coordinator sends a message to the hospital but receives an error back, indicating the booking isn’t available.
2. Locks are released.
The space flight provider’s system crashes after receiving a prepare message but before responding
1. Book messages are sent to all participants, and confirmations received.
2. Prepare is sent to all participants.
3. Confirmations are received from the hospital and the hotel.
4. After a period of time, the transaction coordinator times out the transaction, and abort messages are issued to all participants.
5. When the space flight provider’s system comes back online, it reads the persisted log of the transaction and replies to the prepare with a confirm.
6. The transaction coordinator now knows nothing of the transaction, as the persisted log is deleted, and so it sends an abort.
The hotel’s system crashes after receiving a commit message and committing the transaction but before acknowledging
1. All the initial messages are sent and confirmed.
2. Prepare messages are sent and confirmed.
3. Commit messages are sent to all participants, and acknowledgments are received from both the space flight and hospital systems.
4. The transaction coordinator continues to send commit messages to the hotel system.
5. When the hotel system comes back online, it has already committed and removed the log of the transaction, so when it receives a commit message, it simply replies with an acknowledgment.
The transaction coordinator crashes after sending all the prepare messages but before receiving any confirmations
1. All the initial messages are sent and confirmed.
2. All the prepare messages are sent.
3. The participants all reply to the prepare message with a confirmation.
4. Since the coordinator is now down, it cannot issue a commit message; however, the participants have confirmed the prepare, and so they must await a commit or abort message; no timeout is possible in this case, as no single participant can know the state of the rest.
As we can see from the preceding examples, the final one presents a real problem. It is unlikely that the hotel, for example, would be willing to commit to a system that would indefinitely lock their internal system, and so because of this, a distributed two-phase transaction is not practical for this problem.
Although a distributed transaction is not possible, we can certainly use some elements from that system to inform our choice.
Distributed Transaction with Timeout
We could adapt our distributed transaction to avoid this indefinite lock by changing the rules slightly. We could allow the participants in the transaction to unilaterally time out – for example, when a prepare message has been issued, but no commit received. We could then have an error state where the transaction coordinator polled all of the participants at the end to determine the status of the booking. Where one or more of the participants have not secured the booking, we simply attempt to book; if we can’t, then cancel the others.
In this case, clean-up effectively means attempting to cancel the remaining parts of the transaction. Where a confirm has not yet been issued, we simply abandon the transaction; where it has, we instigate a cancellation.
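A participant that is allowed to time out on its own might look something like the sketch below. Everything here is an assumption made for illustration: the `TimedParticipant` class, the 60-second hold, and the way the current time is passed in as a plain number so that the behaviour is easy to follow.

```python
class TimedParticipant:
    """Holds a resource after prepare, but only for a limited time."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.prepared_at = {}   # transaction reference -> time of the prepare
        self.committed = set()

    def prepare(self, txn_id, now):
        self.prepared_at[txn_id] = now
        return True

    def expire_stale(self, now):
        """Unilaterally release any hold that has waited too long for a commit."""
        stale = [t for t, at in self.prepared_at.items()
                 if now - at > self.hold_seconds]
        for txn_id in stale:
            del self.prepared_at[txn_id]
        return stale

    def commit(self, txn_id, now):
        self.expire_stale(now)
        if txn_id in self.committed:
            return True
        if txn_id not in self.prepared_at:
            return False  # the hold has gone; the coordinator must recover
        del self.prepared_at[txn_id]
        self.committed.add(txn_id)
        return True

    def status(self, txn_id):
        """What the coordinator would see when it polls at the end."""
        if txn_id in self.committed:
            return "booked"
        if txn_id in self.prepared_at:
            return "held"
        return "not booked"


hotel = TimedParticipant(hold_seconds=60)
hospital = TimedParticipant(hold_seconds=60)
hotel.prepare("txn-1", now=0)
hospital.prepare("txn-1", now=0)

hotel_committed = hotel.commit("txn-1", now=30)         # within the hold
hospital_committed = hospital.commit("txn-1", now=120)  # commit arrived too late
final_status = {"hotel": hotel.status("txn-1"),
                "hospital": hospital.status("txn-1")}
```

The coordinator is left with a booked hotel and no hospital appointment, which is exactly the error state it then has to resolve by re-booking or cancelling.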
Book and Cancel
Our next option is a little less from a computer science background and more from a business background. We could simply book the resources and, where a single element of the booking is not available, attempt to cancel the other parts. The obvious risk here is that we would not be able to cancel a particular thing; for example, the hotel may refuse the cancellation.
People I have spoken to who have faced this issue in the travel industry tell me that this is not only common practice; it is also not unheard of for a failed transaction to be referred to a call center, so that they can correct the booking manually.
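As a rough sketch of Book and Cancel, including the case where a cancellation is refused and has to be passed on for manual correction: the `Provider` class and `book_trip` function are invented for the illustration and do not represent any real provider's API.

```python
class Provider:
    """Emulated third-party system; the flags control how it behaves."""

    def __init__(self, name, can_book=True, can_cancel=True):
        self.name = name
        self.can_book = can_book
        self.can_cancel = can_cancel
        self.booked = False

    def book(self):
        self.booked = self.can_book
        return self.booked

    def cancel(self):
        if self.can_cancel:
            self.booked = False
            return True
        return False  # for example, the hotel refuses the cancellation


def book_trip(providers):
    """Book each provider in turn; on the first failure, cancel what was booked."""
    booked = []
    for provider in providers:
        if provider.book():
            booked.append(provider)
            continue
        # Undo in reverse order, noting anything that could not be cancelled.
        needs_manual = [p.name for p in reversed(booked) if not p.cancel()]
        return {"booked": False, "manual_follow_up": needs_manual}
    return {"booked": True, "manual_follow_up": []}


hospital = Provider("hospital")
hotel = Provider("hotel", can_cancel=False)
flight = Provider("space flight", can_book=False)

result = book_trip([hospital, hotel, flight])
```

Here the flight cannot be booked, the hospital appointment is cancelled cleanly, and the hotel booking is left for someone to sort out by hand.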
Hold a Booking
This possibility appears at first glance to be the most sensible. Here, we are effectively saying to the service provider that we wish them to reserve a place for us, but not to actually make the booking. We have discussed this kind of scenario previously in this chapter. The issue here is that you are asking a provider to potentially refuse a firm booking in exchange for your potential one. In fact, this could be said to fit into our transaction diagram shown before – we need simply to accept the prepare as a hold instruction and confirm as a firm booking.
Advanced Purchase
This appears to be the approach taken by the bigger players in the travel industry today. Again, it is also not a technical solution to the problem, but a business one. What you do is project how many trips you are likely to sell and then buy that many from the supplier. In our case, we may decide to buy 20 nights in the hotel and book and pre-pay for 20 hospital appointments on the same day. However, in our case, it would not be practical for a simple reason: the space shuttle flight would be prohibitively expensive.
At the time of writing, NASA was paying SpaceX around $55 million per astronaut for a seat on its Crew Dragon spacecraft. The average price for an international flight on an airplane was around $1,300.
Consequently, this is unlikely to be the best decision (given that only one flight not being re-sold could potentially bankrupt the company).
Business Decision
In fact, as with many such cases, the right answer here is not an architectural one, but a business one. The person making such a decision is going to need to balance risk against potential profit; such decisions are beyond the scope of this book. Suffice it to say that it is very unlikely that you will face any architectural problem that doesn’t contain an element of a business decision.
Having spoken to a representative from the business, we learn that they have decided to adopt the Book and Cancel option.
Target Architecture
Now that we have established how we will interface with the various providers, we should also consider how we can make this scalable, and what the system will look like.
We have already indicated that space travel is so expensive that it is probably currently limited to (at most) a few thousand people in the world; however, that doesn’t mean that it will always be so – when cars first came out, they were exclusively for the wealthy enthusiast, but today (at least, at the time of writing), they are so prevalent that car crash injuries are the eighth leading cause of death in the world.
In addition to using a (form of) distributed transaction, we need to be able to execute many transactions concurrently. As with previous chapters, in this situation, we come back to a message broker. Within a single transaction, however, each individual booking must be made in sequence. Given that we intend to cancel if the booking is not successful, the order in which we do these things is important.
After establishing what the flow of the booking will look like, we should consider what will execute this. We’ve discussed that this would need to be scalable, and we’ve also considered that it would need to occur in order. There are a number of ways to achieve this.
Based on our earlier diagrams, we need to establish how we will represent the different parts of the transaction. Only the transaction coordinator falls within our domain (although as with previous chapters, the third-party systems will be emulated by us).
Stateful Service
One possibility here is to start to make the bookings and to persist the state of the booking. That is, each time a part of the transaction executes, a flag would be updated in some form of persistence layer. If the service were to crash then, when it came back up, it would check the persisted state and resume where it left off – as described in the earlier section on distributed transactions.
This service itself could then execute based on a message broker, but the responsibility for persistence is within the service. This has the advantage of meaning that the service could be taken and deployed anywhere (as it’s self-contained). However, it does mean that each individual transaction must be managed by a single instance of the coordinator.
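The idea of persisting a flag after each step and resuming after a crash can be sketched as follows. The names are invented for the illustration, the booking order is an assumption, and a plain dictionary stands in for whatever persistence layer the service would really use.

```python
STEPS = ["hospital", "hotel", "space flight"]  # assumed booking order


class SimulatedCrash(Exception):
    """Stands in for the coordinator process dying part-way through."""


def run_booking(booking_id, store, book, crash_before=None):
    """Book each step once, recording progress in `store` after every step."""
    done = store.setdefault(booking_id, [])
    for step in STEPS:
        if step in done:
            continue  # this step was completed before an earlier crash
        if step == crash_before:
            raise SimulatedCrash(step)
        book(step)
        done.append(step)  # the persisted flag for this part of the transaction
    return list(done)


store = {}       # stands in for a database or file that survives a restart
api_calls = []   # records every call made to the (emulated) providers

try:
    run_booking("trip-1", store, api_calls.append, crash_before="hotel")
except SimulatedCrash:
    pass  # the service restarts and is handed the same booking again

completed = run_booking("trip-1", store, api_calls.append)
```

After the restart the hospital is not booked a second time. A crash between making a booking and recording it would still cause a repeat call, so this approach works best when the provider APIs tolerate duplicate requests.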
Distributed Service
This option would use the message broker itself as a persistence layer. The way that this works is that the service behaves very much as before but simply writes a message back to a queue, and then another instance of the same service picks that up. The advantage here is that the service is more scalable, as a new instance is used for each section of the task. This means that we can increase the number of workers and the new workers will pick up the next part of the transaction.
Target Architecture Diagram
We’ve now covered the principles of a distributed transaction and discussed how we might implement this.
Examples
In the example here, we’re going to create three APIs. Since the Tech Appendix for Chapter 1 covers creating APIs in .NET, I will not re-cover that here; suffice it to say that we will have three APIs.
As you will be familiar with by now, the code for this will be in the following GitHub repository:
https://github.com/Apress/software-architecture-by-example
Project Structure
Our sample solution will consist of three parts: third-party APIs, a client, and a coordinator.
Before we go about creating the rest of the project, we will need to set up the Azure Service Bus. I’ve covered setting this up via the Azure Portal (https://portal.azure.com); in order to follow along, you’ll need to have an Azure account.
At the time of writing, Microsoft was offering a 12-month free Azure subscription.
Service Bus Configuration
Although we are using Azure in this example, you could easily substitute any other cloud provider message broker or any message broker at all – for example, RabbitMQ would work fine for this situation.
As we’ve discussed in previous chapters, the architectural decision to make here is whether to use a cloud provider or to try to manage the scale yourself. This particular example – low traffic now, but an expectation to ramp up rapidly – lends itself very well to the cloud model.
The first step in Azure Service Bus is to create a namespace.
Within the namespace, we can create Queues and Topics.
The next step is to generate an access policy.
Now that we have successfully configured our service bus, we can move onto the main part of the project; this is our coordinator.
Coordinator
The coordinator here will simply listen to the Service Bus queue that we’ve created (BookingQueue) and will process each message by its type. The coordinator holds absolutely no state, which means that we can run several of these processes should we encounter a rush in orders.
BookingRequestHandler.cs
In Listing 3-1, what we are doing is leveraging the C# LinkedList in order to chain together these endpoints.
Although this structure is specific to C#, the concept of a Linked List is certainly not. If you are using a different language, there is almost certainly an equivalent; and if there is not, creating a linked list is a trivial exercise.
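To show how little is involved in another language, here is a hand-rolled doubly linked chain in Python. It is not the book's Listing 3-1: the `Step` class, the `chain` function, and the order of the steps are assumptions made for the illustration.

```python
class Step:
    """One node in the chain: a booking step that knows its neighbours."""

    def __init__(self, name):
        self.name = name
        self.previous = None  # the step to cancel if this one fails
        self.next = None      # the step to book if this one succeeds


def chain(names):
    """Link the named steps together in order and return the first one."""
    head = None
    tail = None
    for name in names:
        node = Step(name)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.previous = tail
        tail = node
    return head


first = chain(["hospital", "hotel", "space flight"])

forwards = []
node = first
while node is not None:
    forwards.append(node.name)
    last = node
    node = node.next

backwards = []
node = last
while node is not None:
    backwards.append(node.name)
    node = node.previous
```

Walking forwards gives the booking order, and walking backwards from any node gives the order in which earlier bookings would be cancelled.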
BookingRequestHandler.cs
In Listing 3-2, we simply detect the type of request; there are four types: the initial booking request initiated by the client, and then the hospital booking request, hotel booking request, and space flight booking request – which are all initiated by the coordinator itself.
You’ll also see from this code that we have a concept of a function; this allows us to traverse through the stack of bookings but, where there has been an error, back again to cancel each in turn. Listing 3-2 shows the top level of this; that is, either the request has just started or it has tried and failed to make the booking.
As stated earlier, we won’t go into the specific code of the APIs, but each has a random chance of failing to successfully make the booking and also a random delay before responding, thereby simulating a slightly more real-world environment.
BookingRequestHandler.cs
From Listing 3-3, we can see that there are two basic logical flows within the method: either we are trying to book or we are trying to cancel. Within the logic to book, should the call fail, we initiate a cancel flow by simply sending a message back to the queue with the cancel function, addressed to the previous entity in the linked list. This will then traverse back up the list until it reaches the top.
The cancel branch of the code attempts to cancel; and if it can’t, it logs the error and continues on through the list.
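The book and cancel branches can be sketched with an in-memory queue standing in for the message broker. This is not the book's Listing 3-3: the message shape, the `FakeApi` class, and the step order are assumptions, and a list index replaces the linked list to keep the example short.

```python
from collections import deque

STEPS = ["hospital", "hotel", "space flight"]  # assumed order, flight last


class FakeApi:
    """Emulates the third-party systems."""

    def __init__(self, unavailable=(), uncancellable=()):
        self.unavailable = set(unavailable)
        self.uncancellable = set(uncancellable)
        self.booked = []

    def book(self, name):
        if name in self.unavailable:
            return False
        self.booked.append(name)
        return True

    def cancel(self, name):
        if name in self.uncancellable:
            return False
        self.booked.remove(name)
        return True


def handle(message, queue, api, log):
    """Process one message; any follow-up work goes back onto the queue."""
    index, function = message["step"], message["function"]
    name = STEPS[index]
    if function == "book":
        if api.book(name):
            if index + 1 < len(STEPS):
                queue.append({"step": index + 1, "function": "book"})
            else:
                log.append("trip booked")
        elif index > 0:
            # Booking failed: start cancelling from the previous step.
            queue.append({"step": index - 1, "function": "cancel"})
    else:
        if not api.cancel(name):
            log.append(f"could not cancel {name}")  # log it and carry on
        if index > 0:
            queue.append({"step": index - 1, "function": "cancel"})


def process(api):
    queue = deque([{"step": 0, "function": "book"}])
    log = []
    while queue:
        handle(queue.popleft(), queue, api, log)
    return log


failing = FakeApi(unavailable={"space flight"}, uncancellable={"hotel"})
failure_log = process(failing)

working = FakeApi()
success_log = process(working)
```

Because `handle` keeps nothing between messages, any number of workers could take turns reading from the same queue, which is the property the coordinator relies on.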
In Listing 3-3, you’ll notice that we are instantiating the proxy dependency inside the method – which does, somewhat, negate the purpose of having a proxy. I felt that structuring the code this way would better illustrate the intent of the code, although I would strongly advise against this practice.
BookingRequestHandler.cs
The method that books the space flight, shown in Listing 3-4, does not handle the cancel branch. The reason is that this is the top of the list, so it can never be called to cancel. We also established that cancelling this would be prohibitively expensive, and so it is only ever called once we have successfully booked the rest of the trip.
Summary
We’ve been to the moon and back in this chapter! Transactions, especially distributed transactions, are a complex and nuanced topic. Transactions are always a good thing for data integrity, but even for local transactions, there is a price to pay: a long-running transaction will cause locking problems. Extend that to a distributed transaction, and you make the same locking problem far worse!
We’ve discussed how you can use a tool such as a message broker in order to coordinate a transaction – which avoids the need for a separate, stateful transaction coordinator.
Business decisions are an inescapable factor of software architecture. There’s little point in designing an architecture for a company that they simply don’t have the budget for, nor should you discount requirements such as time to market in designing a system. It’s worth remembering that what you’re building needs to be used – otherwise you’re wasting time and money in building it. Often, more than one technical solution will present itself to a given problem – when this happens, it is incumbent on the architect to explain the options to the business and to abide by the business decision.
Further to this, once a decision is made, it should ideally be documented. There are several ways to document a decision – for example, you could simply keep the email that you receive; however, I would encourage you to research the use of Architectural Decision Records (ADRs), which are a way of recording your decisions alongside the code itself.