Timeouts

In the early days, networking issues affected only programmers working on low-level software: operating systems, network protocols, remote filesystems, and so on. Today, every system is a distributed system. Every application must grapple with the fundamental nature of networks: networks are fallible. The wire could be broken, some switch or router along the way could be broken, or the computer you are addressing could be broken. Your thermostat can’t talk to your TV because the microwave is on. Even if you’ve already established communication, any of these elements could break at any time. When that happens, your code can’t just wait forever for a response that might never come; sooner or later, it needs to give up. Hope is not a design method.

The timeout is a simple mechanism allowing you to stop waiting for an answer once you think it won’t come. I once had a project to port the BSD sockets library to a mainframe-based UNIX environment. I attacked the project with a stack of RFCs and a dusty pile of source code for UNIX System V Release 4. Two issues nagged at me throughout the entire project. First, heavy use of “#ifdef” blocks for different architectures made it look less like a portable operating system than twenty different operating systems intermingled. Second, the networking code was absolutely riddled with error handling for different flavors of timeouts. By the project’s end, I had grown to understand and appreciate the significance of timeouts.

Well-placed timeouts provide fault isolation—a problem in some other service or device does not have to become your problem. Unfortunately, at higher levels of abstraction, further from the dirty world of hardware, good placement of timeouts becomes increasingly rare. Indeed, some high-level APIs have few or no explicit timeout settings. Presumably the designers behind these APIs have never been awakened in the wee hours to recover a crashed system. Many APIs offer both a call with a timeout and a simpler, easier call that blocks forever. It would be better if, instead of overloading a single function, the no-timeout version were labeled “CheckoutAndMaybeKillMySystem.”

Commercial software client libraries are notoriously devoid of timeouts. These libraries often do direct socket calls on behalf of the system. By hiding the socket from your code, they also prevent you from setting vital timeouts.
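When you do control the socket, setting the timeout is usually a one-line affair. The following is a minimal sketch using Python's standard `socket` module; the function name `fetch_line` and the two-second default are illustrative assumptions, not taken from any particular client library.

```python
import socket


def fetch_line(host: str, port: int, timeout_s: float = 2.0) -> bytes:
    """Connect and read up to 1024 bytes, giving up after timeout_s per step."""
    # The timeout given here bounds the connect attempt and stays set on the
    # socket, so it also bounds the recv() call below.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        # Raises socket.timeout if the remote end never sends anything.
        return sock.recv(1024)
```

A caller would catch `socket.timeout` (and `OSError` for refused or unreachable hosts) and decide what to return instead of waiting forever.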

Timeouts can also be relevant within a single service. Any resource pool can be exhausted. Conventional usage dictates that the calling thread should be blocked until one of the resources is checked in. (See Blocked Threads.)

Any resource pool that blocks threads must have a timeout, so that calling threads eventually unblock whether resources become available or not.
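As a rough illustration, a pool built on Python's `queue.Queue` gets a checkout timeout almost for free. The class and exception names here (`ResourcePool`, `PoolExhausted`) are hypothetical, chosen only for this sketch.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no resource becomes free within the timeout."""


class ResourcePool:
    def __init__(self, resources):
        self._idle = queue.Queue()
        for resource in resources:
            self._idle.put(resource)

    def checkout(self, timeout_s: float):
        # Block for at most timeout_s, then give up instead of waiting forever.
        try:
            return self._idle.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolExhausted(f"no resource free after {timeout_s}s") from None

    def checkin(self, resource) -> None:
        self._idle.put(resource)
```

The caller still has to decide what to do with `PoolExhausted`, but the calling thread is no longer hostage to whoever is holding the resources.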

Also beware of language-level synchronization or mutexes. Always use the form that takes a timeout argument.

One approach to dealing with pervasive timeouts is to organize long-running operations into a set of primitives that you can reuse in many places. For example, suppose you need to check out a database connection from a resource pool, run a query, turn the result set into objects, and then check the database connection back into the pool. At least three points in that interaction could hang indefinitely. Instead of coding that sequence of interactions in dozens of places, with all the associated handling of timeouts (not to mention other kinds of errors), create a query object (see Patterns of Enterprise Application Architecture [Fow03]) to represent the part of the interaction that changes.

Use a generic gateway to provide the template for connection handling, error handling, query execution, and result processing. That way you only need to get it right in one place, and calling code can provide just the essential logic. Collecting this common interaction pattern into a single class also makes it easier to apply the Circuit Breaker pattern.
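One possible shape for the query object and gateway is sketched below, using `sqlite3` and `queue.Queue` from the Python standard library as stand-ins for a real driver and pool. All names (`Query`, `Gateway`, `GatewayTimeout`) are hypothetical. Only the connection checkout is bounded here; a production gateway would also apply whatever statement and read timeouts its database driver offers.

```python
import queue
import sqlite3
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Query:
    """The part of the interaction that changes from caller to caller."""
    sql: str
    params: Sequence[Any]
    map_row: Callable[[tuple], Any]


class GatewayTimeout(Exception):
    pass


class Gateway:
    """The part that stays the same: checkout, execute, map, check in."""

    def __init__(self, connections, checkout_timeout_s: float = 1.0):
        self._pool = queue.Queue()
        for conn in connections:
            self._pool.put(conn)
        self._checkout_timeout_s = checkout_timeout_s

    def run(self, query: Query) -> list:
        try:
            conn = self._pool.get(timeout=self._checkout_timeout_s)
        except queue.Empty:
            raise GatewayTimeout("no connection available") from None
        try:
            # Assumption: a real driver's statement timeout would be set here.
            rows = conn.execute(query.sql, query.params).fetchall()
            return [query.map_row(row) for row in rows]
        finally:
            self._pool.put(conn)  # always check the connection back in


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'ada')")
    by_id = Query("SELECT id, name FROM users WHERE id = ?", (1,),
                  lambda row: {"id": row[0], "name": row[1]})
    print(Gateway([db]).run(by_id))
```

Calling code supplies only the SQL, the parameters, and the row mapping; the timeout and cleanup logic lives in one place.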

Make full use of your platform. Infrastructure services like Amazon API Gateway can handle a lot of the dirty details for you. Language runtimes that use callbacks or reactive programming styles also let you specify timeouts more easily.
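Python's `asyncio` is one example of a runtime where a timeout wraps any awaitable in a single call. In this sketch, `slow_remote_call` is a made-up stand-in for a real remote operation.

```python
import asyncio


async def slow_remote_call(delay_s: float) -> str:
    # Stand-in for a remote operation that takes delay_s seconds to answer.
    await asyncio.sleep(delay_s)
    return "answer"


async def call_with_deadline(delay_s: float, timeout_s: float) -> str:
    try:
        # wait_for cancels the inner call if it exceeds timeout_s.
        return await asyncio.wait_for(slow_remote_call(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "gave up"
```

Running `asyncio.run(call_with_deadline(10, 0.05))` returns the fallback after about 50 milliseconds rather than waiting the full ten seconds.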

Timeouts are often found in the company of retries. Under the philosophy of “best effort,” the software attempts to repeat an operation that timed out. Immediately retrying an operation after a failure has a number of consequences, but only some of them are beneficial. If the operation failed because of any significant problem, it’s likely to fail again if retried immediately. Some kinds of transient failures might be overcome with a retry (for example, dropped packets over a WAN). Within the walls of a data center, however, the failure is probably because of something wrong with the other end of a connection. My experience has been that problems on the network, or with other servers, tend to last for a while. Thus, fast retries are very likely to fail again.

From the client’s perspective, making me wait longer is a very bad thing. If you cannot complete an operation because of some timeout, it is better for you to return a result. It can be a failure, a success, or a note that you’ve queued the work for later execution (if I should care about the distinction). In any case, just come back with an answer. Making me wait while you retry the operation might push your response time past my timeout. It certainly keeps my resources busy longer than needed.

On the other hand, queuing the work for a slow retry later is a good thing, making the system more robust. Imagine if every mail server between the sender and receiver had to be online, ready to process your mail, and had to respond within sixty seconds in order for email to make it through. How well would the global email system scale? The store-and-forward approach obviously makes much more sense. In the case of failure in a remote server, queue-and-retry ensures that once the remote server is healthy again, the overall system will recover. Work does not need to be lost completely just because part of the larger system isn’t functioning. How fast is fast enough? It depends on your application and your users. For a service behind a web API, “fast enough” is probably between 10 and 100 milliseconds. Beyond that, you’ll start to lose capacity and customers.
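The contrast between an immediate retry and queue-and-retry can be sketched in a few lines. This is an in-memory illustration only, with hypothetical names (`RetryQueue`, `submit`, `retry_due`); a real system would persist the queue and cap the number of attempts.

```python
import time
from collections import deque


class RetryQueue:
    def __init__(self, retry_delay_s: float):
        self._retry_delay_s = retry_delay_s
        self._pending = deque()  # entries are (not_before, operation)

    def submit(self, operation) -> str:
        """Try once and answer right away; on timeout, queue for later."""
        try:
            operation()
            return "done"
        except TimeoutError:
            not_before = time.monotonic() + self._retry_delay_s
            self._pending.append((not_before, operation))
            return "queued"

    def retry_due(self) -> int:
        """Retry operations whose delay has elapsed; return how many succeeded."""
        now = time.monotonic()
        succeeded = 0
        for _ in range(len(self._pending)):
            not_before, operation = self._pending.popleft()
            if not_before > now:
                self._pending.append((not_before, operation))  # not due yet
                continue
            try:
                operation()
                succeeded += 1
            except TimeoutError:
                # Still failing: push it back with a fresh delay.
                self._pending.append((now + self._retry_delay_s, operation))
        return succeeded
```

The caller of `submit` gets an answer immediately, either "done" or "queued", while a background task calls `retry_due` periodically to drain the backlog once the remote server is healthy again.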

Timeouts have natural synergy with circuit breakers. A circuit breaker can tabulate timeouts, tripping to the “off” state if too many occur.
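A stripped-down sketch of that tabulation follows. The names (`TimeoutCountingBreaker`, `CircuitOpen`) and the consecutive-count threshold are assumptions for illustration; the reset timer and half-open probing of a full circuit breaker are omitted.

```python
class CircuitOpen(Exception):
    """Raised when the breaker has tripped and calls are refused."""


class TimeoutCountingBreaker:
    def __init__(self, max_timeouts: int):
        self._max_timeouts = max_timeouts
        self._timeouts = 0  # consecutive timeouts seen so far

    @property
    def tripped(self) -> bool:
        return self._timeouts >= self._max_timeouts

    def call(self, operation):
        if self.tripped:
            # Refuse immediately instead of waiting for yet another timeout.
            raise CircuitOpen("too many timeouts; not calling remote system")
        try:
            result = operation()
        except TimeoutError:
            self._timeouts += 1
            raise
        self._timeouts = 0  # a success resets the count
        return result
```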

The Timeouts pattern and the Fail Fast pattern (which I discuss in Fail Fast) both address latency problems. The Timeouts pattern is useful when you need to protect your system from someone else’s failure. Fail Fast is useful when you need to report why you won’t be able to process some transaction. Fail Fast applies to incoming requests, whereas the Timeouts pattern applies primarily to outbound requests. They’re two sides of the same coin.

Timeouts can also help with unbounded result sets by preventing the client from processing the entire result set, but they aren’t the most effective approach to that particular problem. They’d be a stopgap, but not much more than that.

Timeouts apply to a general class of problems. As such, they help systems recover from unanticipated events.

Remember This

Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses.

The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures.

Apply Timeouts to recover from unexpected failures.

When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.

Consider delayed retries.

Most of the explanations for a timeout involve problems in the network or the remote system that won’t be resolved right away. Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.
