Spotify is one of the companies of reference when building microservices-oriented applications. They are extremely creative and talented when it boils down to coming up with new ideas.
One of my favourite ideas among them is what they call the Simian Army. I like to call it active monitoring.
In this book, I have talked a lot times about how humans fail at performing different tasks. No matter how much effort you put in to creating your software, there are going to be bugs that will compromise the stability of the system.
This is a big problem, but it becomes a huge deal when, with the modern cloud providers, your infrastructure is automated with a script.
Think about it, what happens if in a pool of thousand servers, three of them have the time zone out of sync with the rest of the servers? Well, depending on the nature of your system, it could be fine or it could be a big deal. Can you imagine your bank giving you a statement with disordered transactions?
Spotify has solved (or mitigated) the preceding problem by creating a number of software agents (a program that moves within the different machines of our system), naming them after different species of monkeys with different purposes to ensure the robustness of their infrastructure. Let's explain it a bit further.
As you are probably aware, if you have worked before with Amazon Web Services, the machines and computational elements are divided in to regions (EU, Australia, USA, and so on) and inside every region, there are availability zones, as shown in the following diagram:
This enables us, the engineers, to create software and infrastructure without hitting what we call a single point of failure.
This configuration raised a number of questions to the engineers in Spotify, as follows:
To answer all these questions, Netflix has created various agents. An agent is a software that runs on a system (in this case, our microservices system) and carries on different operations such as checking the hardware, measuring the network latency, and so on. The idea of agent is not new, but until now, its application was nearly a futuristic subject. Let's take a look at the following agents created by Netflix:
There are a few other agents, but I just want to explain active monitoring to you. Of course, this type of monitoring is out of reach of small organizations, but it is good to know about their existence so that we can get inspired to set up our monitoring procedures.
In general, this active monitoring follows the philosophy of fail early, of which, I am a big adept. No matter how big the flaw in your system is or how critical it is, you want to find it sooner than later, and ideally, without impacting any customer.
Throughput is to our application what the monthly production is to a factory. It is a unit of measurement that gives us an indication about how our application is performing and answers the how many question of a system.
Very close to throughput, there is another unit that we can measure: latency.
Latency is the performance unit that answers the question of how long.
Let's consider the following example:
Our application is a microservices-based architecture that is responsible for calculating credit ratings of applicants to withdraw a mortgage. As we have a large volume of customers (a nice problem to have), we have decided to process the applications in batches. Let's draw a small algorithm around it:
var seneca = require('seneca')(); var senecaPendingApplications = require('seneca').client({type: 'tcp', port: 8002, host: "192.168.1.2"}); var senecaCreditRatingCalculator = require('seneca').client({type: 'tcp', port: 8002, host: "192.168.1.3"}); seneca.add({cmd: 'mortgages', action: 'calculate'}, function(args, callback) { senecaPendingApplications.act({ cmd: 'applications', section: 'mortgages', custmers: args.customers}, function(err, responseApplications) { senecaCreditRatingCalculator.act({cmd: 'rating', customers: args.customers}, function(err, response) { processApplications(response.ratings, responseApplications.applications, args.customers ); }); }); });
This is a small and simple Seneca application (this is only theoretical, don't try to run it as there is a lot of code missing!) that acts as a client for two other microservices, as follows:
This could be a real situation for processing mortgage applications. In all fairness, I worked on a very similar system in the past, and even though it was a lot more complex, the workflow was very similar.
Let's talk about throughput and latency. Imagine that we have a fairly big batch of mortgages to process and the system is misbehaving: the network is not being as fast as it should and is experiencing some dropouts.
Some of these applications will be lost and will need to be retried. In ideal conditions, our system is producing a throughput of 500 applications per hour and takes an average of 7.2 seconds on latency to process every single application. However, today, as we stated before, the system is not at its best; we are processing only 270 applications per hour and takes on average 13.3 seconds to process a single mortgage application.
As you can see, with latency and throughput, we can measure how our business transactions are behaving with respect to the previous experiences; we are operating at 54% of our normal capacity.
This could be a serious issue. In all fairness, a drop off like this should ring all the alarms in our systems as something really serious is going on in our infrastructure; however, if we have been smart enough while building our system, the performance will be degraded, but our system won't stop working. This can be easily achieved by the usage of circuit breakers and queueing technologies such as RabbitMQ.
Queueing is one of the best examples to show how to apply human behavior to an IT system. Seriously, the fact that we can easily decouple our software components having a simple message as a joint point, which our services either produce or consume, gives us a big advantage when writing complex software.
Other advantage of queuing over HTTP is that an HTTP message is lost if there is a network drop out.
We need to build our application around the fact that it is either full success or error. With queueing technologies such as RabbitMQ, our messaging delivery is asynchronous so that we don't need to worry about intermittent failures: as soon as we can deliver the message to the appropriate queue, it will get persisted until the client is able to consume it (or the message timeout occurs).
This enables us to account for intermittent errors in the infrastructure and build even more robust applications based on the communication around queues.
Again, Seneca makes our life very easy: the plugin system on which the Seneca framework is built makes writing a transport plugin a fairly simple task.
RabbitMQ transport plugin can be found in the following GitHub repository:
https://github.com/senecajs/seneca-rabbitmq-transport
There are quite few transport plugins and we can also create our own ones (or modify the existing ones!) to satisfy our needs.
If you take a quick look at the RabbitMQ plugin (just as an example), the only thing that we need to do to write a transport plugin is overriding the following two Seneca actions:
seneca.add({role: 'transport', hook: 'listen', type: 'rabbitmq'}, ...)
seneca.add({role: 'transport', hook: 'client', type: 'rabbitmq'}, ...)
Using queueing technologies, our system will be more resilient against intermittent failures and we would be able to degrade the performance instead of heading into a catastrophic failure due to missing messages.