Chapter 8. Availability

One of the key requirements in reactive programming is that your system must provide availability. But what does it mean for a system to be available? We can consider a system available when it is able to respond to requests in a timely manner. If the system takes too long to respond, or in fact doesn’t respond at all, it is considered unavailable. A system is unavailable when it is down or overloaded.

Availability is distinct from the ability of the system to scale (that is, scalability).

As detailed in Chapter 7, there are many things that can contribute to the system becoming unavailable. And no matter how much we try, there will be times when pieces of the system will become unavailable. The key, then, is to find ways to mitigate that, so that even in the face of failures, the system can remain available. We have already discussed how you can address specific types of failures. This chapter provides a higher-level look at how you can build systems in a way that supports availability, and what features Akka provides to get you there.

Microservices Versus Monoliths

Today, there are two often-discussed approaches to building applications. Applications can either be monolithic or they can be built using microservices. These approaches sit at opposite ends of a spectrum, and most software projects fall somewhere in between. Very few applications are completely monolithic or perfect microservices.

Monolithic applications are built such that all of the components of the application are deployed as a single unit. Complexity in monolithic applications is isolated by creating libraries that are then knitted together into the larger whole.

Microservices, on the other hand, are built by dividing the larger application into smaller, specialized services. These microservices perform very small tasks in an isolated fashion. Complexity is isolated within the individual microservices, which are then composed with other microservices to form the larger application.

Availability is not really a question of monoliths or microservices. You can create highly available applications using either approach. However, the manner in which you do that, and the effects on the scalability of your application, are dependent on which route you choose.

In a monolithic application there is only a single deployable unit. In this case, the way we provide availability—and scalability—is to simply deploy more units. If any one unit is lost, others will pick up the slack. Akka doesn’t really help us here. The deployed units don’t need to communicate with one another, so there is no need to introduce features like Akka Cluster. Within the monolith, we can use Akka to help us make better use of resources. We can also use it to help isolate faults by using bulkheading and other techniques. Akka does have a place within a monolithic system, but when we are speaking of availability using Akka, we are more interested in techniques that don’t fit well in a monolithic application.

Providing availability with microservices is similar to monoliths in that the way to provide it is by deploying multiple instances of each microservice. However, in this case, because each microservice is separate, it can be deployed independently, scaled independently, and our availability needs can be tuned independently. Some areas of a system might be critical and need to be highly available. Other areas might be less important, perhaps only needing to run for a few hours a day or even less. In these cases, availability might be much less of a concern.

When we have multiple microservices that need to communicate with one another, Akka provides facilities that can help. But, before we go into detail about what facilities Akka provides, let’s first take a moment to talk about how we divide up an application into microservices.

Bounded Contexts as Microservices

Domain-driven design (DDD) gives us a very natural way to divide an application into smaller pieces. Bounded contexts are excellent boundaries on which to divide. By definition, bounded contexts isolate areas of the domain from one another. Therefore, following this idea, we can isolate areas of our application along the same dividing lines. This means separating those areas out into separate microservices.

Sometimes, those microservices might have some shared elements, and you might need some common libraries in certain cases to share code. In DDD, bounded contexts can share domain elements through what is known as a shared kernel. This shared kernel is basically a set of common domain elements that can be used by multiple bounded contexts. In Akka, the message protocol that is passed between clustered applications is commonly found in the shared kernel.

This message protocol, as defined in the shared kernel, represents the API for the bounded context or microservice. It defines the inputs and outputs for that microservice. Depending on the mechanism of communication being used between the microservices, it might consist of a series of simple case classes that are directly sent to other actors, or it might represent case classes that are first serialized to JSON or some other format and then sent via an HTTP call to an Akka HTTP endpoint.

In our scheduling domain example, we have three bounded contexts defined: the User Management context, the Scheduling context, and the Skills context. We can separate each of these into microservices. We can then define either an HTTP endpoint to interact with those services, or we can directly send messages to the actors using Akka Cluster or Akka Remoting.
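To make this concrete, the shared kernel for one of these contexts might contain nothing more than a handful of case classes. The names here (SkillsProtocol, AddSkill, and so on) are hypothetical, not part of the domain as defined so far, but they illustrate the shape such a message protocol typically takes:

// A hypothetical shared-kernel protocol for the Skills context. These case
// classes form the public API of the microservice; both the Skills service
// and its clients depend on this small shared library.
object SkillsProtocol {
  case class Skill(name: String)

  case class AddSkill(personId: String, skill: Skill)
  case class SkillAdded(personId: String, skill: Skill)

  case class GetSkills(personId: String)
  case class CurrentSkills(personId: String, skills: Set[Skill])
}

Whether these messages are sent directly to remote actors or serialized to JSON for an HTTP endpoint, the protocol itself stays the same; only the transport changes.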

After we have successfully partitioned our application into microservices, we can then begin thinking about how we want to scale and deploy the application. Certain microservices might need many copies in order to provide scalability, whereas others might, by necessity, be limited to a single copy. Depending on the criteria, this can have an effect on how we make our application available.

Fine-Grained Microservices

Although bounded contexts are a good starting point for dividing up an application, sometimes that’s not enough. Sometimes, we need to break things into chunks smaller than a bounded context. A common pattern in Akka is to separate the interface and the domain into separate microservices. In the case of a REST API, that might mean that the REST API exists as a single microservice, whereas the actual logic behind that API could be another. This is advantageous because it makes it possible to scale the two sides independently.

The benefit of Akka is that you can make those types of decisions without affecting the architecture of the application. Location transparency means that we can decide whether we want our REST API in the same deployable unit, or in a separate deployable unit without affecting the basic structure of the application.

In the following section, we will explore several techniques that you can use to break apart an application, all toward the goal of providing the availability and scalability that you want. Which technique is best suited for you depends on the specifics of your system’s design.

Cluster-Aware Routers

When building with a single, local actor system, routers are a scalability feature. With them, you can process more messages in parallel, which makes it possible to scale up an application within a single machine. However, when dealing with a distributed system, you can also use routers as an availability feature. Cluster-aware routers are like normal routers, except that their routees might reside on other nodes in the cluster. This allows those routers to provide availability on top of scalability.

With a cluster-aware router, you can spin up multiple nodes in your cluster. These nodes can act as hosts to the routees of the router. Then, within your application, when you want to send a message to the actors, you can send it through the cluster-aware router. The router will then ensure that one of the nodes will receive the message, depending on which routing strategy you adopted.

If a node fails, the router can simply direct the message to a different, active node. From the client perspective, you don’t need to worry about which node receives the message; the router takes care of that detail for you. But you now have the benefit of added availability, given that nodes can leave the cluster without the client even realizing it. In fact, if you are using a pool router, when a node does leave the cluster, the pool router can even create new routees on other nodes in the cluster in order to compensate for the loss of that node.

One thing to be aware of is that the routing strategies work differently when using cluster-aware routers than they do with local routers. For example, with a smallest-mailbox router, no information is passed between nodes regarding the mailbox size. This means that the smallest mailbox router has no idea which node in the cluster has a smaller mailbox. In this case, the router uses the normal smallest-mailbox logic for all local routees, but remote routees are given lower priority.

So where might you use a cluster-aware router? Cluster-aware routers are a much more general-purpose construct than a cluster singleton, so their use cases are more varied. But a good example of where you might use this is in a worker pattern. In this case, you have a single actor that will push tasks to multiple workers. In our sample domain, for example, we might use the worker pattern to make schedule adjustments. We want to be able to do several schedule adjustments in parallel, particularly because the amount of work involved might be significant. In this case, we can push a message into a router, which will then distribute it to an available worker. Those workers can reside on any node in the cluster. If a node is lost, the workers are lost as well, but depending on the router configuration, they might be reallocated to another node in the cluster.

In practice, this worker pattern might look something like the following:

import akka.actor.{Actor, Props}

object Scheduler {
  case class ScheduleProject(project: Project)
  case class ProjectScheduled(project: Project)

  def props() = Props(new Scheduler)
}

class Scheduler extends Actor {
  import Scheduler._

  override def receive: Receive = {
    case ScheduleProject(project) =>
      // Do Work
      sender() ! ProjectScheduled(project)
  }
}

Again, as with the cluster singleton, there is no evidence here that we are going to use Akka Cluster. We can decide that later.

Indeed, creating instances of the workers is no different than creating instances of local actors:

val scheduler = system.actorOf(Scheduler.props(), "scheduler")

This example creates an instance of the actor and names it scheduler. Within your local system, you can treat this just as you would any other actor, because, in fact, it is just a standard actor. The magic happens on another clustered node when you want to refer to this actor:

import scala.collection.immutable

import akka.cluster.routing.{ClusterRouterGroup, ClusterRouterGroupSettings}
import akka.routing.RoundRobinGroup

val scheduler = system.actorOf(
  ClusterRouterGroup(RoundRobinGroup(Nil), ClusterRouterGroupSettings(
    totalInstances = 100,
    routeesPaths = immutable.Seq("/user/scheduler"),
    allowLocalRoutees = false,
    useRole = None
  )).props(),
  "scheduler"
)

This code configures a cluster-aware group router that references the scheduler. You can see that it has a path to the scheduler, which matches the path we used when creating the instance of the actor to which we want to route. In this case, we are allowing up to 100 instances to be used by this group router, but we have created only a single instance. If we had 100 nodes, though, each node could create and manage its own instance of the scheduler. We could also specify more than one path in routeesPaths if the actor's path differs from node to node.

This router will now do a lookup on the cluster nodes to find any instances of the scheduler. It will then route messages to those instances, in this case using a round-robin routing strategy. One thing to note, however, is that unlike the singleton proxy, the router does not buffer messages until a node can be found. This means that if you try to send a message before your node has connected to the cluster and found the routees, that message will be lost. You can help mitigate this by listening to cluster events and only starting to send messages after your cluster is fully established. Of course, Akka's delivery guarantee is still only at-most-once, so if you really need reliable delivery, you will need to use one of the other techniques to achieve that.
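For example, the Cluster extension lets you register a callback that runs once the local node has joined the cluster. This is only a sketch; when you consider the cluster "ready," and what you send at that point, will depend on your application (the project value here is assumed to come from the surrounding code):

import akka.cluster.Cluster

val cluster = Cluster(system)

// Hold off on routing work until this node is actually a member of the
// cluster; messages sent earlier could be silently dropped.
cluster.registerOnMemberUp {
  scheduler ! Scheduler.ScheduleProject(project)
}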

Distributed Data

Within an eventually consistent, distributed system, such as the one we are building, we sometimes encounter situations in which there is transient data. This data is not something that needs to be saved in a database. It exists only for the life of the application. If the application is terminated, it is safe to discard that data, as well. This could include things like user session information. When a user logs in, you might need to store information about that user. When did she log in? What security token did she use? What was the last activity she performed? Information like that is interesting, but keeping it isn't always valuable. After the user has logged out, that information becomes irrelevant and potentially needs to be cleaned up.

At the same time, it might be important that the transient information be available on all nodes. If a user is experiencing network problems and she becomes disconnected and then reconnected, she might end up reconnected to a different node. If that information is unavailable, the system loses its value. What this situation calls for is a way to maintain this data across several nodes in a cluster without saving it to a database.

In fact, there is a way. If you can represent that data by using data structures that follow specific criteria, you can replicate it reliably in an eventually consistent fashion. This replication can happen entirely in memory; no database needs to be involved. This gives you a distributed, eventually consistent method of storing and retrieving data.

These eventually consistent data types are called Conflict-Free Replicated Data Types, or CRDTs. CRDTs are an emerging concept. The idea first appeared in a paper in 2011 by Marc Shapiro et al. At this point, CRDTs are not widely used, but interest in them is growing, particularly for use cases in which a system handles very high volumes of data traffic but still needs to be performant.

Akka has its own implementation of CRDTs called Distributed Data. This is a new module in Akka, and it is still experiencing some flux. The API is changing as developers iron out the details of how it should work.

The basic idea behind CRDTs in Akka is that you have certain data types. These include Counters, Sets, Maps, and Registers. To be considered CRDTs, these data types must include a conflict-free merge function. This merge function has the job of taking two different states of the data (coming from two different locations in the cluster) and merging them together to create a final result. If this merge can be done without conflict, you can use this data structure to replicate across nodes.

Here’s how it works. As each node receives updates to its data, it broadcasts its current state to other nodes. Other nodes receive the updated state, merge it with their own state, and then store the end result.

CRDTs typically work by storing extra information along with the state. Additive operations are often safe, but removal operations become more complex. For example, what happens if you try to remove an element from a Set that has not yet received the update to add that element? How do you resolve that conflict? This is typically handled by making removal an additive operation. Rather than removing an item, you mark that item as removed, but it is still present in the data. This means that your data continually grows. Even as you try to remove information, you are in fact adding information. There are optimizations that you can apply in certain circumstances to help with this.
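To see what a conflict-free merge looks like, consider a grow-only counter, one of the simplest CRDTs. Each node increments only its own entry, and the merge takes the per-node maximum, so merging replicas in any order always produces the same result. This is an illustrative sketch, not Akka's implementation:

// A simplified grow-only counter (GCounter). Each node owns one entry in
// the map and only ever increments it, so merging two replicas is just a
// per-node maximum -- the merge can never produce a conflict.
case class GCounter(counts: Map[String, Long] = Map.empty) {
  def increment(node: String, amount: Long = 1): GCounter =
    copy(counts = counts.updated(node, counts.getOrElse(node, 0L) + amount))

  def value: Long = counts.values.sum

  def merge(other: GCounter): GCounter =
    GCounter(
      (counts.keySet ++ other.counts.keySet).map { node =>
        node -> math.max(counts.getOrElse(node, 0L), other.counts.getOrElse(node, 0L))
      }.toMap
    )
}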

So how do you use Akka Distributed Data? Let’s take a look at how you might replicate some simple session information among nodes in the cluster. In this example, let’s assume that you are replicating only the session IDs, which are a custom type that wraps a UUID. We use the ORSet data type for replication. An ORSet, or Observed Remove Set, is a special type of Set that uses a version vector to keep track of the creation of elements. This version vector is then used as part of the merge function to determine causality. In an ORSet, if an add and a remove happen out of order, or concurrently, the version vector will be used to resolve the conflict. You can’t remove a record if you haven’t yet seen the add for it. If an add and remove happen concurrently, the add wins. Here’s how it looks:

import java.util.UUID

import akka.actor.{Actor, ActorRef, Props}
import akka.cluster.Cluster
import akka.cluster.ddata.{DistributedData, ORSet, ORSetKey}
import akka.cluster.ddata.Replicator._

case class SessionId(value: UUID = UUID.randomUUID())

object SessionManager {
  case class CreateSession(sessionId: SessionId)
  case class SessionCreated(sessionId: SessionId)

  case class TerminateSession(sessionId: SessionId)
  case class SessionTerminated(sessionId: SessionId)

  case object GetSessionIds
  case class SessionIds(ids: Set[SessionId])

  def props() = Props(new SessionManager)
}

class SessionManager extends Actor {
  import SessionManager._

  private val replicator = DistributedData(context.system).replicator
  private val sessionIdsKey = ORSetKey[SessionId]("SessionIds")
  private implicit val cluster = Cluster(context.system)

  override def receive: Receive = {
    case CreateSession(id) =>
      replicator ! Update(
        sessionIdsKey,
        ORSet.empty[SessionId],
        WriteLocal,
        request = Some(sender() -> SessionCreated(id))
      ) { existingIds =>
        existingIds + id
      }

    case UpdateSuccess(
      `sessionIdsKey`,
      Some((originalSender: ActorRef, response: SessionCreated))
    ) => originalSender ! response

    case TerminateSession(id) =>
      replicator ! Update(
        sessionIdsKey,
        ORSet.empty[SessionId],
        WriteLocal,
        request = Some(sender() -> SessionTerminated(id))
      ) { existingIds =>
        existingIds - id
      }

    case UpdateSuccess(
      `sessionIdsKey`,
      Some((originalSender: ActorRef, response: SessionTerminated))
    ) => originalSender ! response

    case GetSessionIds =>
      replicator ! Get(sessionIdsKey, ReadLocal, request = Some(sender()))

    case result @ GetSuccess(`sessionIdsKey`, Some(originalSender: ActorRef)) =>
      originalSender ! SessionIds(result.get(sessionIdsKey).elements)
  }
}

You can see from this code that in order to use data replication, you need to access the replicator. This special actor can be obtained by using the DistributedData extension. You can update replicated data by sending an Update message to the replicator. As soon as the data is updated, you will receive an UpdateSuccess message. You can retrieve replicated data by sending a Get message to the replicator. The replicator will respond with a GetSuccess message.

There are other messages that you can send to the replicator, as well. You can subscribe to updates so that you will be notified when a value changes by using the Subscribe message, and you can delete records by using the Delete message.

The replicator will take care of replicating your state across the cluster. This data will be eventually consistent across the cluster. This means that on any node in the cluster that runs the SessionManager, you can request the list of SessionIds, and you will get a list of active sessions. Of course, because this is eventually consistent, the nodes might give you slightly different lists. You can control this eventual consistency by specifying different read/write consistency values.

In the previous example, local consistency is specified, which means the replicator only looks at the local node. However, you can specify stricter values that will cause the replicator to ensure that a certain number of nodes agree on the result before you consider it valid. ReadAll and WriteAll mean that all nodes must agree, but this also affects your availability. If a node is down, the replicator won't be able to reach the required consistency and your request will fail. You can also specify ReadMajority or WriteMajority, which ensure that a majority of nodes agree on the value. This gives you a nice balance of consistency versus availability.
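For example, within the SessionManager shown earlier, you could require majority agreement on both writes and reads by swapping the consistency levels. The five-second timeouts here are arbitrary values chosen for illustration:

import scala.concurrent.duration._

// Require a majority of nodes to acknowledge the write...
replicator ! Update(
  sessionIdsKey,
  ORSet.empty[SessionId],
  WriteMajority(5.seconds),
  request = Some(sender() -> SessionCreated(id))
) { existingIds =>
  existingIds + id
}

// ...and a majority of nodes to agree before returning the read.
replicator ! Get(sessionIdsKey, ReadMajority(5.seconds), request = Some(sender()))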

Graceful Degradation

One of the benefits of breaking your application into microservices is that it enables graceful degradation. Within an application, we enable graceful degradation by setting up failure zones in the application. By creating actors in hierarchies that allow a section of the application to fail without bringing down the entire system, we allow our application to degrade in pieces rather than failing all at once. Microservices enable this same behavior but spread across multiple Java Virtual Machines (JVMs) and potentially multiple machines.

We would like to avoid the situation implied in the old quote “If at first you don’t succeed, then perhaps skydiving is not for you.” We want our applications to continue even in the face of failure.

In our sample scheduling domain, we can separate out a few services. We have the scheduling engine, the Project Management service, the Person Management service, and the Skills service. This means that if our scheduling engine fails, it doesn’t prevent us from adding or removing projects or people. It only prevents us from scheduling people on a project. You can do this within a single monolithic application, by using actors to create failure zones, or you can do it by using microservices.

Graceful degradation means that while portions of an application might fail, the application as a whole can continue to operate, keeping it available even in the face of failure. It also means that noncritical portions of the application can be taken down, perhaps for maintenance or other reasons, without necessarily affecting your users in an adverse way.

Let's take a look at that. The scheduling engine in our system doesn't necessarily need to respond in a rapid fashion. It might be reasonable for a new project to be created and for the schedule to take a period of time, minutes or maybe even hours, before it is returned. There is no expectation that the moment a project is created it should be scheduled, as well. This means that if that section of the application were in need of some maintenance, perhaps a database upgrade, you could take that entire section down and perform the upgrade. In the meantime, users can still add new people and they can still add new projects; they just won't get the schedule for that project until after the maintenance is completed. And that's OK.

You can use the Circuit Breaker pattern discussed in Chapter 7 to provide graceful degradation. Typically, detecting the failure of an external system is a time-consuming operation. The system might need to wait for a connection to time out. If it did this for every subsequent request until the resource became available again, the system as a whole would take on an additional burden. By using the Circuit Breaker, the system is able to detect the first time a problem occurs and then quickly fail any additional requests until the timeout has elapsed. This keeps subsequent failing requests from taking longer than they need to, and it also reduces the load on the system so that it can recover. This improves availability because rather than waiting on timeouts, which can be time consuming, the system can quickly inform you about the error and which service is unavailable. Even though a portion of the system is unavailable, it is still responsive, even if all it's doing is notifying you of the error.
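Akka ships with a CircuitBreaker in the akka.pattern package. A minimal sketch of wrapping a call to an external service might look like the following; the callExternalService function and the particular thresholds are assumptions made for the example, not recommendations:

import scala.concurrent.Future
import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.pattern.CircuitBreaker

val system = ActorSystem("example")
import system.dispatcher

// Open the circuit after 5 consecutive failures, fail calls that take
// longer than 10 seconds, and allow a single retry after 30 seconds.
val breaker = new CircuitBreaker(
  system.scheduler,
  maxFailures = 5,
  callTimeout = 10.seconds,
  resetTimeout = 30.seconds
)

// Hypothetical call to an external dependency.
def callExternalService(): Future[String] = ???

// While the breaker is open, this fails immediately instead of waiting
// for another timeout, keeping the rest of the system responsive.
def schedule(): Future[String] =
  breaker.withCircuitBreaker(callExternalService())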

Deployment

After you have built your application to support availability using the techniques outlined, you then need to be able to deploy it in a way that maintains that availability. If the deployment process requires that you bring down the entire application suite, you haven’t achieved the desired goal.

With a bit of planning and forethought, though, you can come very close to 100 percent service availability, even during deployment.

In Akka, each executable process typically contains a single actor system, which joins the remainder of the processes in the cluster to create a single distributed actor system.

Staged Deployment/Rolling Restarts

One of the most common reasons for nodes to become unavailable in a system isn’t an error at all. Rather, it’s a normal operational concern: upgrades. When you have a new version of your application or service, you need to deploy it to your production environment at some point.

Automation can be critical here, particularly when you’re dealing with microservices. You will not have just a single copy of a monolithic application to deploy, but rather many copies of numerous microservices, potentially even hundreds in a large system. This would be a tedious and error-prone process if it were a manual operation, but with a bit of automation, you can make it simpler and more convenient.

The most commonly used approach with distributed systems is to use staged deployments, also called rolling restarts. The basic process for a rolling restart looks like this:

  1. Select a node in the cluster.

  2. Stop routing new traffic to this node.

  3. Allow any existing requests to complete. This is sometimes called “draining” the node.

  4. Gracefully terminate the application you are planning to upgrade. In an Akka Cluster, this would involve removing the node from the cluster.

  5. Copy the new executable to the node.

  6. Launch the new executable. In an Akka Cluster, the node would then need to join the cluster.

  7. Verify that the node is available and that it is the expected version (a monitoring status page can help here).

  8. Begin routing traffic to the new node.

  9. Repeat these steps for other nodes.

Rolling restarts require that you have multiple instances of a service running in order to maintain availability. When the node is taken down for deployment, other nodes need to be available to pick up the slack.
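For an Akka Cluster node, the graceful-termination step above can be done by asking the node to leave the cluster and shutting the process down only once it has actually been removed. A minimal sketch of that sequence might look like this:

import akka.actor.ActorSystem
import akka.cluster.Cluster

val system = ActorSystem("ClusterSystem")
val cluster = Cluster(system)

// Shut the actor system (and then the JVM) down only after the node has
// been removed from the cluster, so in-flight work can drain first.
cluster.registerOnMemberRemoved {
  system.terminate()
}

// Ask this node to gracefully leave the cluster.
cluster.leave(cluster.selfAddress)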

If you're using a cluster singleton, you can't maintain availability during an upgrade; the singleton pattern doesn't allow it. However, you can minimize the downtime by allowing other nodes to host the singleton. In this case, when you bring down the current node, the singleton can fail over to one of the other hosts, exposing only a very short period of downtime.

Blue/Green Deployment

An alternative to the rolling upgrade process is called blue/green deployment. For this method, you must have more nodes than you actually need. Usually, you would have double the number of nodes required for normal operation, with only half of the nodes actually servicing requests at any given time.

In this deployment model, you designate 50 percent of the nodes as "blue," and the other 50 percent as "green." Suppose that your green nodes are currently online and servicing requests. You can then shut down all the blue nodes, upgrade them, bring them back up, and check them for operational status (remember, no traffic is being routed to them). The final step is to swap the blue set for the green set, making the blue nodes active and the green nodes inactive. Now you are free to upgrade the green set using the same process.

One big advantage of this process is that if a problem arises with the newly activated set after the switchover, you can simply switch back to the previous set (in this example, the green set), which has not yet been upgraded. For this reason, it can be beneficial to keep your inactive set on the old version for a period of time while the new version runs, and watch for any issues. Only when you are confident that the new version is stable and working would you upgrade the inactive set.

Crash Recovery/Operational Monitoring

A critical element of availability in a clustered environment is the ability to recognize when a failure has occurred and then respond to it appropriately. Even better is when we can recognize the problem before it occurs and take steps to prevent it. To achieve this, we need to do operational monitoring. There are a variety of monitoring tools that you can use, and each of them has its place in a system.

Health Checks and Application Status Pages

Most monitoring tools have one thing in common: they usually rely on some sort of health-check mechanism. Perhaps the most common mechanism for performing health checks is using an HTTP status page.

An application status page is basically a URL that accepts a GET request. When it receives such a request, it performs any internal checks that are necessary and then returns the result of those checks to indicate whether the application is healthy. The page might return HTML, JSON, XML, or whatever is convenient for your environment.

Akka HTTP can help you here by offering a simple and lightweight way of providing those status pages. In fact, it isn’t a bad practice to include an Akka HTTP health-check page, even when your application does not otherwise require an HTTP interface. A microservice that communicates entirely by using actors can still benefit from having a few Akka HTTP endpoints for monitoring as well as general maintenance.

So what might a health check look like? What kind of information should be included in such a page?

It is a good idea to include a section that indicates whether external dependencies of the service are available. Can the service communicate with its database? Can it communicate with other external APIs? These checks don’t need to be heavy; a simple ping type operation is all you need. And depending on the service, the failure of an external dependency might not represent a failure of the application, which can continue to operate. You will need to judge on a case-by-case basis whether a failed dependency should constitute a failure of the application, or perhaps just a warning or alert.

Another useful piece of information to include on this page is the application version number or commit hash. This can be very helpful for determining whether an upgrade was successful. It can also be useful for tracing errors. If you know the version number, you can eliminate changes to the code that are not part of that version number, which might help you to determine the source of the problem.

Here is a very simple template of what a monitoring page might look like:

import akka.http.scaladsl.server.Directives._

case class Symptom(description: String)

sealed trait Diagnosis {
  def name: String
}

case class Healthy(name: String) extends Diagnosis

case class Unhealthy(name: String, symptoms: Set[Symptom]) extends Diagnosis

case class HealthReport(versionNumber: String, healthChecks: Seq[Diagnosis])

trait HealthCheck {
  def checkHealth(): Diagnosis
}

class HealthCheckRouting(applicationVersion: String, checks: Seq[HealthCheck])
  extends HealthReportProtocol {
  val routes = {
    path("health") {
      get {
        complete {
          HealthReport(applicationVersion, checks.map(_.checkHealth()))
        }
      }
    }
  }
}

This routing class will provide a health-monitoring page. It builds a HealthReport that contains the version information as well as the status of the various services in your application that have implemented the HealthCheck trait. The details of how these health checks are implemented will differ from service to service, but the basic idea is that when you perform a health check, any problems are returned as Symptoms in an Unhealthy Diagnosis. Of course, if the service is healthy, it will return a Healthy Diagnosis. The HealthReportProtocol takes the resulting report and converts it to the appropriate format for your use case (e.g., JSON, XML, etc.).
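An individual health check might then look something like the following. The DatabaseConnection type and its ping method are hypothetical stand-ins for whatever client your service actually uses:

import scala.util.{Failure, Success, Try}

// Hypothetical database client; substitute your real connection type.
trait DatabaseConnection {
  def ping(): Try[Unit]
}

class DatabaseHealthCheck(connection: DatabaseConnection) extends HealthCheck {
  override def checkHealth(): Diagnosis =
    connection.ping() match {
      case Success(_)  => Healthy("database")
      case Failure(ex) => Unhealthy("database", Set(Symptom(ex.getMessage)))
    }
}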

Metrics

Health checks are a great tool for alerting you to issues when they occur, and also for automated tools to monitor in order to take appropriate actions when a problem occurs (like notifying the team). However, they tend to be reactive, in the negative sense, rather than proactive. They inform you that a problem has occurred rather than warning you that a problem will occur. So, how can you detect problems before they occur so that you can take the necessary steps to prevent them?

One way to do this is by using a good metrics system. Usually, this comes in the form of a time-series database. It could be a special-purpose database, such as InfluxDB or Graphite, or it could be something more general purpose, like Cassandra or a SQL database, in which you have created time-based collections. In either case, the real power comes when you have a good visualization tool that you can use to view your metrics.

A good rule of thumb for any system, Akka based or otherwise, is to wrap timers around entry points to your system. Essentially these timers will record the start and end time of an operation, compute the difference, and then store that in the time-series database. This means that for every operation your system performs, you will know the time it took to perform that operation, and the time at which that operation occurred.
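In code, such a timer can be as simple as a wrapper function that measures an operation and hands the measurement to whatever metrics backend you use. The MetricsStore trait here is a hypothetical placeholder for that backend:

// Hypothetical sink for timing data; in practice this would write to
// InfluxDB, Graphite, or another time-series store.
trait MetricsStore {
  def recordTiming(operation: String, startMillis: Long, durationMillis: Long): Unit
}

def timed[T](operation: String, metrics: MetricsStore)(block: => T): T = {
  val start = System.currentTimeMillis()
  try {
    block
  } finally {
    // Record when the operation started and how long it took.
    metrics.recordTiming(operation, start, System.currentTimeMillis() - start)
  }
}

You would then wrap each entry point to the system in a call such as timed("scheduleProject", metrics) { ... }.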

Using this information, you can begin to see patterns within your application. For instance, you can see when your traffic peaks. You can see when operations begin to speed up or slow down. And with that information available, you can notice trends. The best way to see these trends is to put them on a visual graph. Graphs are far easier for visualizing patterns than presenting the information in a table. Looking at those graphs you might notice that when your application reaches a certain number of operations per second, it tends to crash shortly thereafter. Or you might observe that when operations begin to take too long, a system crash is usually not far behind.

This type of information, when combined with other information—CPU usage, memory graphs, application startups, deployments, and more—can reveal things that you might otherwise not have noticed. It is a crucial part of operational monitoring, especially in a large distributed system. You can use it to correlate certain events in your system with failures. You might notice that when your operations begin to drag on for too long, your memory usage spikes. You can then correlate this threshold with application failures. This might lead you to conclude that you need more memory, or it might reveal that the way you are using memory in your application is flawed (a memory leak, for example). Or perhaps you observe that your application was behaving fine until a recent deployment, after which problems began cropping up. From there, you can investigate what changes were introduced in that new version.

The graphs and metrics are just the first step, however. They show you that something is going wrong, but they don’t necessarily give you the details about what exactly the problem is. Part of that comes from a human interpreting the data. Humans are very good at recognizing patterns. Although you can have machines monitor for things like thresholds being breached, sometimes the problems that a human detects are not the types of things that machines are good at (at least not yet). Sometimes, a human observer can look at a graph that doesn’t cross any monitoring thresholds, and still detect that something looks off. Maybe you aren’t using too much memory, but the way that the memory is being used has changed. Perhaps you aren’t experiencing too much load, but it is simply more or less than is normal. Although these observations might be meaningless or coincidental, often they can be the first signs of an impending problem. It is therefore important to have a human observer looking at these graphs from time to time, to spot these deviations from the norm.

Logging

After you have detected that something is different in the graphs, the next step is to determine why. This is where logging becomes critical. Your graphs and your health checks are there to alert you to a problem. Your logging is there to help you diagnose it.

Much like with metrics, a good practice is to log as much information as you can any time there is an input to the system. If a REST call is made to the system, log the details of that call. If a top-level actor receives a message, log that message. And of course, if your application throws an exception, be sure to log that exception somewhere.
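With Akka, the ActorLogging trait makes this straightforward. A sketch of logging every message that reaches a top-level actor might look like the following; the handleRequest method is an assumption standing in for your real behavior:

import akka.actor.{Actor, ActorLogging}

class ApiGateway extends Actor with ActorLogging {
  override def receive: Receive = {
    case message =>
      // Log every input to the system so it is available when diagnosing problems.
      log.info("Received request: {}", message)
      handleRequest(message)
  }

  // Hypothetical handler for the incoming message.
  private def handleRequest(message: Any): Unit = ???
}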

These logs are going to be where you go when your application is misbehaving. You need to ensure that you have as much information available as possible. If you forget to include the logging, the information won’t be there when you need it. It’s better to log too much information than not enough, but be aware that too much logging can make it difficult to see what’s relevant and what isn’t.

Watchdog Tools

When you have your health checks, metrics, and logging in place, the next step is to introduce tools to take that information and automatically act on it so that you don’t need to.

First and foremost, you will probably want some sort of notification mechanism. These tools will need to monitor your health checks, and perhaps even your logging and metrics, and alert you through emails, phone calls, chat tools, or other mechanisms when a problem occurs.

Although alerts are important, they can be frustrating, as well. You don’t want to be the person getting the call at 3 A.M. on a weekend saying that your system is down and you need to fix it. It would be better if rather than simply alerting you, the tool could also take some corrective action.

If you have built your system using the "let it crash" mentality, when your application does crash, a logical course of action is to restart it. There are tools, available both free and commercially, that provide this functionality. Tools like Monit, Marathon, ConductR, and others allow you to monitor your application and, in the event of a failure, automatically restart it, perhaps even on a different node in the cluster, without the middle-of-the-night phone call. It might not even be necessary to send a notification, or the nature of that notification can change. Rather than calling, the system simply sends an email.

This is also an area for which it is important to be watching your graphs and your logs. Depending on how you have configured those restart mechanisms and the nature of the problem, it’s possible for your system to limp along for days, restarting constantly, but still managing to service requests. This isn’t a good situation. It will work, but you need to do something about it. If you aren’t paying attention to the notifications, graphs, and logging, you might never see that a problem has occurred. No one is going to phone you to complain, because from an outsider’s perspective, the system is working.

Operational monitoring is not about a single tool that does all the work for you. There are many different types of monitoring, and each of them provides a different feature set and a different kind of information. The combination of these tools is what allows you to really get a sense of the ebb and flow of your system. Health checks and metrics alert you to a problem, logs can help you diagnose the problem, and watchdog tools can automatically respond to the problem. On their own, each tool is useful, but together they are invaluable.

Conclusion

We have now seen that actor-based systems have different operational requirements: logging alone is not enough, and static testing before deployment is insufficient as well.

With proper monitoring, however, availability can be ensured, and the overall health of your system made immediately visible despite rapidly changing load (even in the face of the system itself being continually upgraded).

Having all of these benefits while still retaining high performance, however, requires careful attention. This will be the topic of Chapter 9, as we add the final piece you need for building your own actor-based systems.
