Chapter 9. Performance

Performance tuning an Akka application begins as it would with any application: first you isolate the areas of the application that have the highest cost in terms of time, and then you strive to reduce that cost.

Given that we always have a limited amount of time to apply to tuning, it makes good sense to begin with the greatest cost. Of course, it is difficult to know what this is without measurement. Therefore, as you would with any application, your first step in tuning Akka is generally to measure what takes a long time.

Many excellent performance and latency measurement tools exist for the Java Virtual Machine (JVM), and you can apply any of them to an Akka application. For instance, you can use the Gatling tool to measure the performance of an Akka application with an HTTP interface in the same way as any non-Akka application, and you can apply ScalaMeter to specific code segments.

In addition to the usual statistics on number of requests served, latency, and so on, there are a few special statistics unique to Akka that might be helpful.

Given that a slow consumer in an Akka application can become a serious runtime issue, you might need to measure mailbox size to ensure that any actor (or group of actors) isn’t growing its mailbox beyond reasonable bounds.

The time that it takes each message to be processed by a given actor is also very useful—when used in conjunction with information about the frequency of a message, it can help you narrow down bottlenecks. For instance, if a certain message (say, ComputeInvoice) takes a relatively long time (as measured by, for instance, a ScalaMeter microbenchmark), but is not that common or timing-critical, it’s not a good candidate to tune. A much more common message, say LookupInvoice, which takes a relatively short amount of time but is under heavy volume, might be a better choice.
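
As a rough sketch of what such a microbenchmark might look like, here is ScalaMeter’s inline benchmarking applied to a hypothetical computeInvoice function standing in for the real work behind a ComputeInvoice message:

import org.scalameter._

// Hypothetical stand-in for the work performed for a ComputeInvoice message.
def computeInvoice(): BigDecimal =
  (1 to 10000).map(BigDecimal(_)).sum

// Measure with a default warm-up phase to reduce JIT noise.
val time = withWarmer(new Warmer.Default) measure {
  computeInvoice()
}
println(s"ComputeInvoice handler took: $time")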

As you can see, you’ll need to be able to simulate a typical load on your application, and perhaps even predict what an extreme load would look like so that you can measure under realistic conditions.

Isolating Bottlenecks

There’s an old story about a factory that was having a problem with one of its huge machines. The factory manager called in an expert, who walked straight over to one part of the machine, took out a crescent wrench, and tightened one bolt a quarter-turn. The machine resumed running properly immediately, and the expert left. A few days later, the factory received an invoice from the expert for $5,000. The factory manager called the expert and complained, “Five thousand dollars? You were here for only five minutes!” The expert calmly replied, “I’ll send you a new invoice with a breakdown.” A short while later, another invoice arrived that said “1) Tighten bolt, $5; 2) Knowing which bolt to tighten, $4,995.”

Akka tuning is a bit like this: often you need to make only a tiny change to get the results you need, but knowing which change to make is the trick.

The first step in tuning most scalability or performance issues is to identify the bottleneck. What are you running out of that causes the system to not perform the way you want it to? Is it memory? Is it threads or cores? Is it I/O capacity?

To identify the bottleneck, you often need a combination of logging, monitoring, and live inspection, and tools such as VisualVM or YourKit can be essential for this.

If you eliminate all the usual suspects in JVM tuning (running out of memory, leaking objects, running out of file handles) and know that the Akka portion of your project contains the bottleneck, then and only then can you consider tuning Akka itself.

Keep in mind that JVM tuning issues can be much more subtle: for example, the garbage collector taking too much processing time and “thrashing” by running too often is not as obvious as an actual failure. JVM tuning is beyond the scope of this book, but many good resources exist.

Tuning Akka

After you’ve identified a potential area for tuning, what options do you have available? There are three major areas that you can tackle. Let’s take a look at each one.

Reduce or Isolate Blocking Sections

Even in a highly concurrent actor-based application, many blocking operations can still exist—an actor might need to access a nonasynchronous JDBC data source, for instance. Although the most common blocking operations are IO-based, there are others that hold the current thread captive, thereby reducing the threads available for other processes to continue, and slowing the entire system. We will elaborate on this further when we discuss dispatchers.
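
As a preview of that discussion, here is a minimal sketch of what isolating a blocking call can look like. It assumes a dispatcher named "blocking-io-dispatcher" has been defined in configuration (configurations like this appear later in the chapter) and uses a hypothetical blockingJdbcLookup call in place of real JDBC code:

import akka.actor.Actor
import scala.concurrent.{ExecutionContext, Future}

class InvoiceLookup extends Actor {
  // Assumes a "blocking-io-dispatcher" is defined in configuration.
  private implicit val blockingEc: ExecutionContext =
    context.system.dispatchers.lookup("blocking-io-dispatcher")

  override def receive: Receive = {
    case invoiceId: Long =>
      val replyTo = sender() // capture the sender before going asynchronous
      Future {
        // The blocking call runs on the dedicated dispatcher, leaving the
        // actor's own dispatcher free to keep processing other messages.
        blockingJdbcLookup(invoiceId)
      }.foreach(result => replyTo ! result)
  }

  // Hypothetical stand-in for a blocking JDBC query.
  private def blockingJdbcLookup(invoiceId: Long): String =
    s"invoice-$invoiceId"
}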

Make the Message Process in Less Time

Because actors are all about processing messages, anything you can do to reduce the processing time of each message benefits the entire system. If it is possible to break the problem into smaller problems that can be tackled concurrently, this is often a winning strategy, and it has the added benefit of separating logic into smaller and more isolated sections. Actors are inexpensive to create, and messages between actors (especially on the same JVM) are also inexpensive, so moving a problem from one actor to a group of them is generally a good tactic. Often this leads to opportunities for increased parallelism, as will be discussed shortly.

Engage More Actors on Processing the Messages

Another strategy, if the problem allows it, is to increase the number of instances of the same actor working on the messages to be handled; for example, to use a pool router to distribute work among a group of identical actors. This works only if the problem is compute-bound—that is, not constrained by external I/O—and if each message is entirely independent (it can be processed without depending on state held in the actor). The limit of this technique is essentially the number of available cores; increasing the number of actors much beyond this won’t help. However, it is possible to distribute the pool of actors across a cluster, effectively getting more cores than a single machine can bring to bear. This is a common Akka pattern, and if you apply it properly, this capability forms the basis of much of Akka’s power in distributed systems.
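
A minimal sketch of a local pool router follows; the InvoiceWorker actor and the pool size of eight are illustrative assumptions rather than fixed recommendations (the cluster-aware version is discussed later in this chapter):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Hypothetical worker; each instance handles a message independently,
// without relying on state shared with the other workers.
class InvoiceWorker extends Actor {
  override def receive: Receive = {
    case invoiceId: Long =>
      sender() ! s"invoice-$invoiceId" // stand-in for the real lookup
  }
}

object PoolExample extends App {
  val system = ActorSystem("invoicing")

  // A pool router distributes incoming messages across eight identical
  // workers; sizing the pool near the number of cores is a reasonable start.
  val workers = system.actorOf(
    RoundRobinPool(8).props(Props(new InvoiceWorker)), "invoice-workers")

  workers ! 42L
}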

This is very different from the technique of simply load-balancing requests among a group of identical nodes. Distributed actors can apply more than one node’s power to servicing a single request, which is beyond the reach of a simple node-balancing solution.

Dispatchers

Dispatchers are a critical component for tuning Akka. You can think of dispatchers as the engine that powers actor systems. Without them, the system doesn’t run. And if you are not careful with them, the system doesn’t perform.

The job of the dispatcher is to manage threads, allocate those threads to actors, and give those actors opportunities to process their mailboxes. How this works depends on the type of dispatcher as well as what configuration settings you have used for the dispatcher. So what kinds of dispatchers are available?

The Standard Dispatcher

The standard dispatcher is the one most commonly used. It is a good, general-purpose dispatcher that utilizes a customizable pool of threads. These threads are shared among the actors managed by the dispatcher, so depending on the number of actors and the number of threads, an actor might have to wait for a thread that is currently in use by another actor.

A simple configuration for this dispatcher looks like the following:

custom-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 4
    parallelism-factor = 3.0
    parallelism-max = 64
  }
  throughput = 5
}

In the preceding example, throughput is set to 5, meaning that at most 5 messages will be processed before the thread of execution is made available to other actors; the right value varies considerably based on your specific requirements.

The example also uses a fork-join-executor; however, that is not the only option. You can also use a thread-pool-executor. Depending on which you use, the tuning parameters are slightly different. You can use thread-pool-executor like this:

custom-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    core-pool-size-min = 4
    core-pool-size-factor = 3.0
    core-pool-size-max = 64
  }
  throughput = 5
}

Each executor has a minimum and maximum thread value to ensure that the number of available threads remains within this boundary. Each also has a factor, which is a multiplier: to calculate the number of available threads, you multiply the number of cores on the machine by the factor and then bind the result between the minimum and maximum. For example, a factor of 3.0 on an eight-core machine targets 24 threads, provided that falls within the configured bounds. This means that if you move your application to a machine with more cores, you will have more threads, assuming the minimum/maximum allows it. Note that with a fork-join-executor, parallelism-max is not the maximum number of threads; rather, it is the maximum number of hot, or active, threads. This is an important distinction because the fork-join-executor can create additional threads when existing threads become blocked, which can lead to an explosion of threads.

Your decision as to whether to use a fork-join-executor or a thread-pool-executor is largely based on whether the operations in that dispatcher will be blocking. A fork-join-executor gives you a maximum number of active threads, whereas a thread-pool-executor gives you a fixed number of threads. If threads are blocked, a fork-join-executor will create more, whereas a thread-pool-executor will not. For blocking operations, you are generally better off with a thread-pool-executor because it prevents your thread counts from exploding. More “reactive” operations are better in a fork-join-executor.

Throughput is an interesting option. The throughput value determines how many messages an actor is allowed to process before the dispatcher will attempt to release the thread to another actor. It determines how fair the actors will be in their sharing of the threads. A low throughput number like 1 indicates that the actors are as fair as possible. With a value of 1, each actor will process one message and then yield the thread to another actor. On the other hand, a value of 100 is much less fair. Each actor will occupy the thread for 100 messages before yielding. When should you use a high number and when should you use a low number? You’ll need to play with the numbers in your system to really find the sweet spot, but there are a few things to consider.

Each time you yield the thread, a context switch must happen. These context switches can be expensive if they happen often enough. This means that depending on your application and the size of your messages, too many context switches can begin to have a significant impact on the performance of your application. If your dispatcher has a large number of very fast messages flowing through it all the time, those context switches can hurt. In this case, you might want to tune the throughput number to be higher. Because the messages are fast, you won’t be stuck with idle actors waiting for long periods of time. Their wait times will be fairly low, despite the high throughput value. And by increasing that throughput, you will have allowed the system to minimize the impact of context switching, which can help improve the system’s performance.

However, if your messages are not fast and take a long time to process, a high throughput value can be crippling. In this case, you can end up with many of your actors sitting idle for long periods of time while they wait for the slow messages to push through. Context switching isn’t hurting you much because your messages are long running. This is where you want to tune your dispatcher for fairness. A low throughput value is probably what you want here. This will allow the long-running actors to process a single message and then yield to one of the waiting actors.

Using this dispatcher in your code is as simple as calling the withDispatcher method on the Props for your actor, as demonstrated here:

val actor = system.actorOf(Props(new MyActor).withDispatcher("custom-dispatcher"))

Pinned Dispatcher

In certain use cases, sharing threads can be detrimental to your actor system. In this case, you might opt to use a pinned dispatcher. Rather than using a thread pool for the actors, a pinned dispatcher allocates a single thread per actor. This eliminates the worry over tuning your throughput and thread counts, but it does come at a cost.

Threads aren’t free. There is a finite amount of hardware on which to run those threads, and they need to share that hardware. This sharing comes with its own form of context switching. A large number of threads means more context switching as those threads contend for the same resources. There are also limits to the number of threads you can create. Even though the concept of one thread per actor sounds appealing, in reality its uses are much more limited than you might expect. You are usually better off going with the standard dispatcher rather than a pinned dispatcher.

Another consideration with a pinned dispatcher is the fact that it allocates only a single thread per actor. If your actor is not mixing concurrency techniques, this is not a problem, but often actors might make use of futures internally. When doing this, it is common to use the actor’s dispatcher as an execution context, but remember that in this case the actor is being assigned just a single thread on which to operate. This means that using the actor’s dispatcher is going to give you access to only that single thread. Depending on the circumstances, this can either result in a performance hit because your futures now must operate in a single-threaded manner, or in some cases it might even result in a deadlock as multiple concurrent pieces of the system contend for the same thread.

So where are the right places to use a pinned dispatcher? Pinned dispatchers are useful for situations in which you have one actor, or a small number of actors, that need to be able to go “as fast as possible.” For example, if your system has a set of operations that are considered “high priority,” you probably don’t want those operations sharing resources. You probably want to avoid a thread pool because your high-priority operations would then be coupled to each other (and possibly other actors) by sharing a thread. They can create contention for the threads, which could cause them to slow down. In that case, a pinned dispatcher for those operations might be a good solution. A pinned dispatcher means that the actors don’t need to wait for one another.
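
A minimal configuration sketch for such a dispatcher follows; the name is only an example, and the allow-core-timeout setting keeps the dedicated thread alive so that the actor always runs on the same thread:

high-priority-pinned-dispatcher {
  type = PinnedDispatcher
  executor = "thread-pool-executor"
  thread-pool-executor.allow-core-timeout = off
}

You assign it exactly as before, by calling withDispatcher("high-priority-pinned-dispatcher") on the actor’s Props.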

Balancing Dispatcher

A balancing dispatcher is one that will redistribute work from busy actors to idle actors. It does this by using a shared mailbox: all messages go into a single mailbox, and the dispatcher can then move messages from one actor to another, allowing them to share the load.

However, a balancing dispatcher has limited use cases, and is often not the right choice. Because of the work-sharing nature of the dispatcher, you can’t use it as a general-purpose dispatcher: all actors that use it must be able to process all messages in the shared mailbox, which typically limits its use to actors of the same type.

Otherwise, a balancing dispatcher works much the same as the standard dispatcher. It uses a thread pool that you can tune in the same way as the standard dispatcher.

Balancing dispatchers can be useful for cases in which you have multiple actors of the same type and you want to share the workload between them, but other approaches are usually superior.
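
One of those approaches, in more recent versions of Akka, is the BalancingPool router, which provides the same shared-mailbox, work-sharing behavior without configuring a balancing dispatcher directly. A minimal sketch, using a hypothetical WorkProcessor actor:

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.BalancingPool

// All routees behind the balancing pool share one mailbox, so they must all
// be able to handle any message, which is why they are of the same type.
class WorkProcessor extends Actor {
  override def receive: Receive = {
    case item: String => sender() ! s"processed-$item"
  }
}

object BalancingExample extends App {
  val system = ActorSystem("balancing")

  // Busy routees leave messages in the shared mailbox for idle routees.
  val balanced = system.actorOf(
    BalancingPool(4).props(Props(new WorkProcessor)), "balanced-workers")

  balanced ! "some-work"
}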

Calling-Thread Dispatcher

The calling-thread dispatcher is designed for testing. In this case, no thread pool is allocated. All operations are performed on the same thread that sent the original message. This is useful in tests for removing a lot of the concurrency concerns that make testing difficult. When everything operates on the same thread, concurrency is no longer an issue.
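
Using it requires the akka-testkit module; a minimal sketch with a hypothetical Echo actor looks like this:

import akka.actor.{Actor, ActorSystem, Props}
import akka.testkit.CallingThreadDispatcher

class Echo extends Actor {
  override def receive: Receive = {
    case msg => sender() ! msg
  }
}

object CallingThreadExample extends App {
  val system = ActorSystem("testing")

  // The actor now processes each message on the thread that sent it.
  val echo = system.actorOf(
    Props(new Echo).withDispatcher(CallingThreadDispatcher.Id))

  echo ! "ping"
}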

Of course, removing the concurrency of the system is an artificial situation. In your live system, you will not be operating in a single-threaded manner, so even though simulating this might be useful for tests, it can give you a false impression of how the system behaves. Worse, it can also lead to deadlocks in your actors as they suddenly find themselves needing to asynchronously wait but are unable to do so. It is generally better, even in tests, to use a proper thread pool–based dispatcher.

In addition, the Akka TestKit provides many facilities for testing that make explicit use of a calling-thread dispatcher largely unnecessary.

When to Use Your Own Dispatchers

When you create an actor without explicitly assigning a dispatcher, that actor is created as part of the default dispatcher. As such, it will share a thread pool with all other actors. For many actors, this is perfectly acceptable. There are certain conditions, however, that warrant a separate dispatcher.

You want to try to avoid blocking or long-running operations as much as possible inside of actors. These blocking operations can have a significant effect on the availability of threads in your thread pool, and they can hinder the performance of the application. But this isn’t always possible. There are times when despite your best efforts, you must block. In these cases, you should use a separate dispatcher.

Let’s begin by analyzing what happens when you don’t use a separate dispatcher. It is fairly common when building a system to provide a special “monitoring” or “health check” operation. Such an operation might do something fairly simple just to verify that the application is running and responsive. It can be as simple as a ping/pong test, or it can be more complex, checking database connections and exercising various requests. For the moment, we will consider the simple ping/pong test. This operation should be very fast; it doesn’t need to communicate with a database or do any computation:

import akka.actor.{Actor, Props}

object Monitoring {
  case object Ping
  case object Pong

  def props(): Props = Props(new Monitoring)
}

class Monitoring extends Actor {
  import Monitoring._

  override def receive: Receive = {
    case Ping =>
      sender() ! Pong
  }
}

What if, within the same API, we have another, more complex operation? This operation needs to read data from a database, transform it, and eventually return it. And let’s further assume that the database driver we are using is a blocking driver:

import akka.actor.{Actor, Props}

object DataReader {
  case object ReadData
  // The payload type here is only a placeholder; the real shape of the
  // data depends on your domain.
  case class Data(payload: String)

  def props(): Props = Props(new DataReader)
}

class DataReader extends Actor {
  import DataReader._

  override def receive: Receive = {
    case ReadData =>
      sender() ! Data(readFromDatabase())
  }

  private def readFromDatabase() = {
    // This reads from a database in a blocking fashion.
    // It then transforms the result into the required format.
    ...
  }
}

If these two operations are performed in the same dispatcher, how do they affect each other? If several requests to the long-running operation come in at the same time, it is possible that those requests will block all the available threads in the dispatcher. These threads will be unavailable until the operation completes. This means that when the monitoring service makes the ping/pong request to verify that the API is up and available, it will be delayed or it will fail. There are no available threads on which to perform the ping/pong request, so the service must wait.

In this very simple example, there is coupling between the two operations. Even though these operations perform very different tasks and in many respects are entirely unrelated, they are coupled to the resources that they share—in this case, the thread pool. As a result of this coupling, poor performance on one of the operations can cause poor performance on the other. This is probably not desirable.

To solve this problem, you can introduce a separate dispatcher. By putting the actors that perform each task on a separate dispatcher, you break this coupling. Now, each operation has its own pool of threads. So, when the long-running operation blocks all of its available threads, there are still additional threads, in a separate dispatcher, available to perform the ping/pong request. Thus, monitoring can remain operational even when other parts of the system have become blocked, as illustrated here:

data-reader-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    core-pool-size-min = 4
    core-pool-size-max = 16
  }
  throughput = 1
}


// For monitoring, because it is a very fast operation,
// it is sufficient to leave it in the default dispatcher.
val monitoring = system.actorOf(Monitoring.props())

// For our data reader, we have a separate dispatcher created to manage
// operations for reading data.
val reader = system.actorOf(
   DataReader.props().withDispatcher("data-reader-dispatcher"))

There are many cases for which you might want to look at using a separate dispatcher. A separate dispatcher makes it possible for you to tune groups of actors independently. If one group of actors requires a high-throughput setting, whereas a different group would work better with a low throughput setting, separating them into different dispatchers will enable this independent tuning. Or, if you have one set of actors that receives a large volume of messages and another that receives a lower volume, you might want to split them simply to be more fair. And what about priority? If certain messages are high priority, you might want to isolate them in their own dispatcher so that they are not contending with other actors for threads.

The key is that you don’t want to use the same dispatcher everywhere. One of the primary mistakes people make when starting out with Akka is using the default dispatcher for all actors. This can result in a system that is very unresponsive, despite the fact that it isn’t consuming a lot of hardware resources. You can end up with a system in which your CPU is essentially idle, and yet the system is not responding. Creating separate dispatchers and tuning them for specific use cases is the first step toward solving this problem.

Increase Parallelism

Increasing parallelism is another approach to tuning Akka. Often, the work itself must be made more parallelizable before increasing the number of actors processing a request can really help, and parallelism, of course, does not necessarily mean actors.

How to increase parallelism in any given algorithm is a wide topic and could warrant an entire book of its own, but some general principles apply. Reducing contention by reducing coupling and shared data is one key element; breaking the problem into those portions of the problem that do not depend on others having completed is another.

Assuming that you are able to refactor an algorithm to permit more parallelism, you will still be constrained, as discussed a bit earlier, by the number of cores available on any single node in how much true parallelism you can achieve—this is where the Actor Model allows you to surpass limitations that existed before.

A single request or algorithm can now have its parallelizable elements distributed—that is, handled by more than one node. Each node can then bring to bear all of the cores available on that node, allowing the total parallelism of a problem to be taken much higher than is possible on a single node.

The easiest application of this is often an algorithm that is compute-bound; in other words, it cannot be processed any faster on a single node because it runs out of CPU resources. By breaking that algorithm into portions that don’t depend on one another and distributing those portions, you can effectively run the problem on a larger “machine” than any one node can provide.
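
One way to achieve this, assuming the akka-cluster module is in place, is a cluster-aware pool router configured through deployment settings. The path name and sizes below are illustrative only, reusing the hypothetical invoice-workers pool from earlier in the chapter:

akka.actor.deployment {
  /invoice-workers {
    router = round-robin-pool
    cluster {
      enabled = on
      max-nr-of-instances-per-node = 4
      allow-local-routees = on
    }
  }
}

The router itself is then created with FromConfig.props(Props(new InvoiceWorker)) so that all of its settings, including the cluster behavior, come from configuration, and callers interact with it exactly as they would with a purely local pool.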

Not every problem is compute-bound, however; many are I/O bound on data access from some kind of persistent store. In this case, the distributed domain-driven design (DDDD) pattern can be applied, allowing the data to be wrapped with instances of actors, and held in memory. Just like the compute-bound problem gets more CPU applied to it than any one machine could bring to bear, the distributed-domain problem gets more memory available to it than is available on any one machine, and converts an I/O-bound problem into something more closely resembling a compute-bound problem.

There is often an element of persistence needed in any case, though—perhaps an event journal must be emitted and consumed by the actors in our system. If the actor needing such access is on a different node, you again end up I/O bound, sometimes on the network instead of the disk.

Conclusion

In this chapter, you have seen how careful tuning and consideration of performance avoid critical bottlenecks that can otherwise counteract the advantages of a highly scalable and available actor-based system.

You have now seen all of the essential aspects of building powerful applications with Akka and the Actor Model, from the most granular to the highest levels.
