Fault tolerance

Real programs fail, and they fail in unpredictable ways. Akka, and the Scala community in general, favors planning explicitly for failure rather than trying to write infallible applications. A fault tolerant system is a system that can continue to operate when one or more of its components fails. The failure of an individual subsystem does not necessarily mean the failure of the application. How does this apply to Akka?

The actor model provides a natural unit to encapsulate failure: the actor. When an actor throws an exception while processing a message, the default behavior is for the actor to restart, but the exception does not leak out and affect the rest of the system. For instance, let's introduce an arbitrary failure in the response interpreter. We will modify the receive method to throw an exception when it is asked to interpret the response for misto, one of Martin Odersky's followers:

// ResponseInterpreter.scala
def receive = {
  case InterpretResponse("misto", r) => 
    throw new IllegalStateException("custom error")
  case InterpretResponse(login, r) => interpret(login, r)
}

If you rerun the code through SBT, you will notice that an error gets logged. The program does not crash, however. It just continues as normal:

[ERROR] [11/07/2015 12:05:58.938] [GithubFetcher-akka.actor.default-dispatcher-2] [akka://GithubFetcher/user/$a/$b] custom error
java.lang.IllegalStateException: custom error
  at ResponseInterpreter$
   ...
[INFO] [11/07/2015 12:05:59.117] [GithubFetcher-akka.actor.default-dispatcher-2] [akka://GithubFetcher/user/$a] Pushing samfoo onto queue

None of the followers of misto will get added to the queue: he never made it past the ResponseInterpreter stage. Let's step through what happens when the exception gets thrown:

  • The interpreter is sent the InterpretResponse("misto", ...) message. This causes it to throw an exception and it dies. None of the other actors are affected by the exception.
  • A fresh instance of the response interpreter is created with the same Props instance as the recently deceased actor.
  • When the response interpreter has finished initializing, it gets bound to the same ActorRef as the deceased actor. This means that, as far as the rest of the system is concerned, nothing has changed.
  • The mailbox is tied to ActorRef rather than the actor, so the new response interpreter will have the same mailbox as its predecessor, without the offending message.

Thus, if, for whatever reason, our crawler crashes when fetching or parsing the response for a user, the application will be minimally affected—we will just not fetch this user's followers.

Any internal state that an actor carries is lost when it restarts. Thus, if, for instance, the fetcher manager died, we would lose the current value of the queue and visited users. The risks associated with losing the internal state can be mitigated by the following:

  • Adopting a different strategy for failure: we can, for instance, carry on processing messages without restarting the actor in the event of failure. Of course, this is of little use if the actor died because its internal state is inconsistent. In the next section, we will discuss how to change the failure recovery strategy.
  • Backing up the internal state by writing it to disk periodically and loading from the backup on restart.
  • Protecting actors that carry critical state by ensuring that all "risky" operations are delegated to other actors. In our crawler example, all the interactions with external services, such as querying the GitHub API and parsing the response, happen with actors that carry no internal state. As we saw in the previous example, if one of these actors dies, the application is minimally affected. By contrast, the precious fetcher manager is only allowed to interact with sanitized inputs. This is called the error kernel pattern: code likely to cause errors is delegated to kamikaze actors.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset