7. Multiprogramming: Separating Processes to Separate Function

If we believe in data structures, we must believe in independent (hence simultaneous) processing. For why else would we collect items within a structure? Why do we tolerate languages that give us the one without the other?

Epigrams in Programming, in ACM SIGPLAN (Vol 17 #9, 1982)
—Alan Perlis

The most characteristic program-modularization technique of Unix is splitting large programs into multiple cooperating processes. This has usually been called ’multiprocessing’ in the Unix world, but in this book we revive the older term ’multiprogramming’ to avoid confusion with multiprocessor hardware implementations.

Multiprogramming is a particularly murky area of design, one in which there are few guidelines to good practice. Many programmers with excellent judgment about how to break up code into subroutines nevertheless wind up writing whole applications as monster single-process monoliths that founder on their own internal complexity.

The Unix style of design applies the do-one-thing-well approach at the level of cooperating programs as well as cooperating routines within a program, emphasizing small programs connected by well-defined interprocess communication or by shared files. Accordingly, the Unix operating system encourages us to break our programs into simpler subprocesses, and to concentrate on the interfaces between these subprocesses. It does this in at least three fundamental ways:

• by making process-spawning cheap;

• by providing methods (shellouts, I/O redirection, pipes, message-passing, and sockets) that make it relatively easy for processes to communicate;

• by encouraging the use of simple, transparent, textual data formats that can be passed through pipes and sockets.

Inexpensive process-spawning and easy process control are critical enablers for the Unix style of programming. On an operating system such as VAX VMS, where starting processes is expensive and slow and requires special privileges, one must build monster monoliths because one has no choice. Fortunately the trend in the Unix family has been toward lower fork(2) overhead rather than higher. Linux, in particular, is famously efficient this way, with a process-spawn faster than thread-spawning on many other operating systems.1

1 See, for example, the results quoted in Improving Context Switching Performance of Idle Tasks under Linux [Appleton].

Historically, many Unix programmers have been encouraged to think in terms of multiple cooperating processes by experience with shell programming. Shell makes it relatively easy to set up groups of multiple processes connected by pipes, running either in background or foreground or a mix of the two.

In the remainder of this chapter, we’ll look at the implications of cheap process-spawning and discuss how and when to apply pipes, sockets, and other interprocess communication (IPC) methods to partition your design into cooperating processes. (In the next chapter, we’ll apply the same separation-of-functions philosophy to interface design.)

While the benefit of breaking programs up into cooperating processes is a reduction in global complexity, the cost is that we have to pay more attention to the design of the protocols which are used to pass information and commands between processes. (In software systems of all kinds, bugs collect at interfaces.)

In Chapter 5 we looked at the lower level of this design problem—how to lay out application protocols that are transparent, flexible and extensible. But there is a second, higher level to the problem which we blithely ignored. That is the problem of designing state machines for each side of the communication.

It is not hard to apply good style to the syntax of application protocols, given models like SMTP or BEEP or XML-RPC. The real challenge is not protocol syntax but protocol logic—designing a protocol that is both sufficiently expressive and deadlock-free. Almost as importantly, the protocol has to be seen to be expressive and deadlock-free; human beings attempting to model the behavior of the communicating programs in their heads and verify its correctness must be able to do so.

In our discussion, therefore, we will focus on the kinds of protocol logic one naturally uses with each kind of interprocess communication.

7.1 Separating Complexity Control from Performance Tuning

First, though, we need to dispose of a few red herrings. Our discussion is not going to be about using concurrency to improve performance. Putting that concern before developing a clean architecture that minimizes global complexity is premature optimization, the root of all evil (see Chapter 12 for further discussion).

A closely related red herring is threads (that is, multiple concurrent processes sharing the same memory-address space). Threading is a performance hack. To avoid a long diversion here, we’ll examine threads in more detail at the end of this chapter; the summary is that they do not reduce global complexity but rather increase it, and should therefore be avoided save under dire necessity.

Respecting the Rule of Modularity, on the other hand, is not a red herring; doing so can make your programs—and your life—simpler. All the reasons for process partitioning are continuous with the reasons for module partitioning that we developed in Chapter 4.

Another important reason for breaking up programs into cooperating processes is for better security. Under Unix, programs that must be run by ordinary users, but must have write access to security-critical system resources, get that access through a feature called the setuid bit.2 Executable files are the smallest unit of code that can hold a setuid bit; thus, every line of code in a setuid executable must be trusted. (Well-written setuid programs, however, take all necessary privileged actions first and then drop their privileges back to user level for the remainder of their existence.)

2 A setuid program runs not with the privileges of the user calling it, but with the privileges of the owner of the executable. This feature can be used to give restricted, program-controlled access to things like the password file that nonadministrators should not be allowed to modify directly.

Usually a setuid program only needs its privileges for one or a small handful of operations. It is often possible to break up such a program into cooperating processes, a smaller one that needs setuid and a larger one that does not. When we can do this, only the code in the smaller program has to be trusted. It is in significant part because this kind of partitioning and delegation is possible that Unix has a better security track record3 than its competitors.

3 That is, a better record measured in security breaches per total machine hours of Internet exposure.

7.2 Taxonomy of Unix IPC Methods

As in single-process program architectures, the simplest organization is the best. The remainder of this chapter will present IPC techniques roughly in order of escalating complexity of programming them. Before using a later, more complex technique, you should prove by demonstration—with prototypes and benchmark results—that no earlier and simpler technique will do. Often you will surprise yourself.

7.2.1 Handing off Tasks to Specialist Programs

In the simplest form of interprogram cooperation enabled by inexpensive process spawning, a program runs another to accomplish a specialized task. Because the called program is often specified as a Unix shell command through the system(3) call, this is often called shelling out to the called program. The called program inherits the user’s keyboard and display and runs to completion. When it exits, the calling program resumes control of the keyboard and display and resumes execution.4 Because the calling program does not communicate with the called program during the callee’s execution, protocol design is not an issue in this kind of cooperation, except in the trivial sense that the caller may pass command-line arguments to the callee to change its behavior.

4 A common error in programming shellouts is to forget to block signals in the parent while the subprocess runs. Without this precaution, an interrupt typed to the subprocess can have unwanted side effects on the parent process.

The classic Unix case of shelling out is calling an editor from within a mail or news program. In the Unix tradition one does not bundle purpose-built editors into programs that require general text-edited input. Instead, one allows the user to specify an editor of his or her choice to be called when editing needs to be done.

The specialist program usually communicates with its parent through the file system, by reading or modifying file(s) with specified location(s); this is how editor or mailer shellouts work.

In a common variant of this pattern, the specialist program may accept input on its standard input, and be called with the C library entry point popen(..., "w") or as part of a shellscript. Or it may send output to its standard output, and be called with popen(..., "r") or as part of a shellscript. (If it both reads from standard input and writes to standard output, it does so in a batch mode, completing all reads before doing any writes.) This kind of child process is not usually referred to as a shellout; there is no standard jargon for it, but it might well be called a ’bolt-on’.

They key point about all these cases is that the specialist programs don’t handshake with the parent while they are running. They have an associated protocol only in the trivial sense that whichever program (master or slave) is accepting input from the other has to be able to parse it.

7.2.1.1 Case Study: The mutt Mail User Agent

The mutt mail user agent is the modern representative of the most important design tradition in Unix email programs. It has a simple screen-oriented interface with single-keystroke commands for browsing and reading mail.

When you use mutt as a mail composer (either by calling it with an address as a command-line argument or by using one of the reply commands), it examines the process environment variable EDITOR, and then generates a temporary file name. The value of the EDITOR variable is called as a command with the tempfile name as an argument.5 When that command terminates, mutt resumes on the assumption that the temporary file contains the desired mail text.

5 Actually, the above is a slight oversimplification. See the discussion of EDITOR and VISUAL in Chapter 10 for the rest of the story.

Almost all Unix mail- and netnews-composition programs observe the same convention. Because they do, composer implementers don’t need to write a hundred inevitably diverging editors, and users don’t need to learn a hundred divergent interfaces. Instead, users can carry their chosen editors with them.

An important variant of this strategy shells out to a small proxy program that passes the specialist job to an already-running instance of a big program, like an editor or a Web browser. Thus, developers who normally have an instance of emacs running on their X display can set EDITOR=emacsclient, and have a buffer pop open in their emacs when they request editing in mutt. The point of this is not really to save memory or other resources, it’s to enable the user to unify all editing in a single emacs process (so that, for example, cut and paste among buffers can carry along internal emacs state information like font highlighting).

7.2.2 Pipes, Redirection, and Filters

After Ken Thompson and Dennis Ritchie, the single most important formative figure of early Unix was probably Doug McIlroy. His invention of the pipe construct reverberated through the design of Unix, encouraging its nascent do-one-thing-well philosophy and inspiring most of the later forms of IPC in the Unix design (in particular, the socket abstraction used for networking).

Pipes depend on the convention that every program has initially available to it (at least) two I/O data streams: standard input and standard output (numeric file descriptors 0 and 1 respectively). Many programs can be written as filters, which read sequentially from standard input and write only to standard output.

Normally these streams are connected to the user’s keyboard and display, respectively. But Unix shells universally support redirection operations which connect these standard input and output streams to files. Thus, typing

ls >foo

sends the output of the directory lister ls(1) to a file named ’foo’. On the other hand, typing:

wc <foo

causes the word-count utility wc(1) to take its standard input from the file ’foo’, and deliver a character/word/line count to standard output.

The pipe operation connects the standard output of one program to the standard input of another. A chain of programs connected in this way is called a pipeline. If we write

ls | wc

we’ll see a character/word/line count for the current directory listing. (In this case, only the line count is really likely to be useful.)

One favorite pipeline was “bc | speak”—a talking desk calculator. It knew number names up to a vigintillion.

—Doug McIlroy

It’s important to note that all the stages in a pipeline run concurrently. Each stage waits for input on the output of the previous one, but no stage has to exit before the next can run. This property will be important later on when we look at interactive uses of pipelines, like sending the lengthy output of a command to more(1).

It’s easy to underestimate the power of combining pipes and redirection. As an instructive example, The Unix Shell As a 4GL [Schaffer-Wolf] shows that with these facilities as a framework, a handful of simple utilities can be combined to support creating and manipulating relational databases expressed as simple textual tables.

The major weakness of pipes is that they are unidirectional. It’s not possible for a pipeline component to pass control information back up the pipe other than by terminating (in which case the previous stage will get a SIGPIPE signal on the next write). Accordingly, the protocol for passing data is simply the receiver’s input format.

So far, we have discussed anonymous pipes created by the shell. There is a variant called a named pipe which is a special kind of file. If two programs open the file, one for reading and the other for writing, a named pipe acts like a pipe-fitting between them. Named pipes are a bit of a historical relic; they have been largely displaced from use by named sockets, which we’ll discuss below. (For more on the history of this relic, see the discussion of System V IPC below.)

7.2.2.1 Case Study: Piping to a Pager

Pipelines have many uses. For one example, Unix’s process lister ps(1) lists processes to standard output without caring that a long listing might scroll off the top of the user’s display too quickly for the user to see it. Unix has another program, more(1), which displays its standard input in screen-sized chunks, prompting for a user keystroke after displaying each screenful.

Thus, if the user types “ps | more”, piping the output of ps(1) to the input of more(1), successive page-sized pieces of the list of processes will be displayed after each keystroke.

The ability to combine programs like this can be extremely useful. But the real win here is not cute combinations; it’s that because both pipes and more(1) exist, other programs can be simpler. Pipes mean that programs like ls(1) (and other programs that write to standard out) don’t have to grow their own pagers—and we’re saved from a world of a thousand built-in pagers (each, naturally, with its own divergent look and feel). Code bloat is avoided and global complexity reduced.

As a bonus, if anyone needs to customize pager behavior, it can be done in one place, by changing one program. Indeed, multiple pagers can exist, and will all be useful with every application that writes to standard output.

In fact, this has actually happened. On modern Unixes, more(1) has been largely replaced by less(1), which adds the capability to scroll back in the displayed file rather than just forward.6 Because less(1) is decoupled from the programs that use it, it’s possible to simply alias ’more’ to ’less’ in your shell, set the environment variable PAGER to ’less’ (see Chapter 10), and get all the benefits of a better pager with all properly-written Unix programs.

6 The less(1) man page explains the name by observing “Less is more”.

7.2.2.2 Case Study: Making Word Lists

A more interesting example is one in which pipelined programs cooperate to do some kind of data transformation for which, in less flexible environments, one would have to write custom code.

Consider the pipeline

tr -c '[:alnum:]' '[ *]' | sort -iu | grep -v '^[0-9]*$'

The first command translates non-alphanumerics on standard input to newlines on standard output. The second sorts lines on standard input and writes the sorted data to standard output, discarding all but one copy of spans of adjacent identical lines. The third discards all lines consisting solely of digits. Together, these generate a sorted wordlist to standard output from text on standard input.

7.2.2.3 Case Study: pic2graph

Shell source code for the program pic2graph(1) ships with the groff suite of text-formatting tools from the Free Software Foundation. It translates diagrams written in the PIC language to bitmap images. Example 7.1 shows the pipeline at the heart of this code.

Example 7.1. The pic2graph pipeline.

images

The pic2graph(1) implementation illustrates how much one pipeline can do purely by calling pre-existing tools. It starts by massaging its input into an appropriate form, continues by feeding it through groff(1) to produce PostScript, and finishes by converting the PostScript to a bitmap. All these details are hidden from the user, who simply sees PIC source go in one end and a bitmap ready for inclusion in a Web page come out the other.

This is an interesting example because it illustrates how pipes and filtering can adapt programs to unexpected uses. The program that interprets PIC, pic(1), was originally designed only to be used for embedding diagrams in typeset documents. Most of the other programs in the toolchain it was part of are now semiobsolescent. But PIC remains handy for new uses, such as describing diagrams to be embedded in HTML. It gets a renewed lease on life because tools like pic2graph(1) can bundle together all the machinery needed to convert the output of pic(1) into a more modern format.

We’ll examine pic(1) more closely, as a minilanguage design, in Chapter 8.

7.2.2.4 Case Study: bc(1) and dc(1)

Part of the classic Unix toolkit dating back to Version 7 is a pair of calculator programs. The dc(1) program is a simple calculator that accepts text lines consisting of reverse-Polish notation (RPN) on standard input and emits calculated answers to standard output. The bc(1) program accepts a more elaborate infix syntax resembling conventional algebraic notation; it includes as well the ability to set and read variables and define functions for elaborate formulas.

While the modern GNU implementation of bc(1) is standalone, the classic version passed commands to dc(1) over a pipe. In this division of labor, bc(1) does variable substitution and function expansion and translates infix notation into reverse-Polish—but doesn’t actually do calculation itself, instead passing RPN translations of input expressions to dc(1) for evaluation.

There are clear advantages to this separation of function. It means that users get to choose their preferred notation, but the logic for arbitrary-precision numeric calculation (which is moderately tricky) does not have to be duplicated. Each of the pair of programs can be less complex than one calculator with a choice of notations would be. The two components can be debugged and mentally modeled independently of each other.

In Chapter 8 we will reexamine these programs from a slightly different example, as examples of domain-specific minilanguages.

7.2.2.5 Anti-Case Study: Why Isn’t fetchmail a Pipeline?

In Unix terms, fetchmail is an uncomfortably large program that bristles with options. Thinking about the way mail transport works, one might think it would be possible to decompose it into a pipeline. Suppose for a moment it were broken up into several programs: a couple of fetch programs to get mail from POP3 and IMAP sites, and a local SMTP injector. The pipeline could pass Unix mailbox format. The present elaborate fetchmail configuration could be replaced by a shellscript containing command lines. One could even insert filters in the pipeline to block spam.

images

This would be very elegant and Unixy. Unfortunately, it can’t work. We touched on the reason earlier; pipelines are unidirectional.

One of the things the fetcher program (imap or pop) would have to do is decide whether to send a delete request for each message it fetches. In fetchmail’s present organization, it can delay sending that request to the POP or IMAP server until it knows that the local SMTP listener has accepted responsibility for the message. The pipelined, small-component version would lose that property.

Consider, for example, what would happen if the smtp injector fails because the SMTP listener reports a disk-full condition. If the fetcher has already deleted the mail, we lose. This means the fetcher cannot delete mail until it is notified to do so by the smtp injector. This in turn raises a host of questions. How would they communicate? What message, exactly, would the injector pass back? The global complexity of the resulting system, and its vulnerability to subtle bugs, would almost certainly be higher than that of a monolithic program.

Pipelines are a marvelous tool, but not a universal one.

7.2.3 Wrappers

The opposite of a shellout is a wrapper. A wrapper creates a new interface for a called program, or specializes it. Often, wrappers are used to hide the details of elaborate shell pipelines. We’ll discuss interface wrappers in Chapter 11. Most specialization wrappers are quite simple, but nevertheless very useful.

As with shellouts, there is no associated protocol because the programs do not communicate during the execution of the callee; but the wrapper usually exists to specify arguments that modify the callee’s behavior.

7.2.3.1 Case Study: Backup Scripts

Specialization wrappers are a classic use of the Unix shell and other scripting languages. One kind of specialization wrapper that is both common and representative is a backup script. It may be a one-liner as simple as this:

tar -czvf /dev/st0 "$@"

This is a wrapper for the tar(1) tape archiver utility which simply supplies one fixed argument (the tape device /dev/st0) and passes to tar all the other arguments supplied by the user (“$@”).7

7 A common error is to use $* rather than “$@”. This does bad things when handed a filename with embedded spaces.

7.2.4 Security Wrappers and Bernstein Chaining

One common use of wrapper scripts is as security wrappers. A security script may call a gatekeeper program to check some sort of credential, then conditionally execute another based on the status value returned by the gatekeeper.

Bernstein chaining is a specialized security-wrapper technique first invented by Daniel J. Bernstein, who has employed it in a number of his packages. (A similar pattern appears in commands like nohup(1) and su(1), but the conditionality is absent.) Conceptually, a Bernstein chain is like a pipeline, but each successive stage replaces the previous one rather than running concurrently with it.

The usual application is to confine security-privileged applications to some sort of gatekeeper program, which can then hand state to a less privileged one. The technique pastes several programs together using execs, or possibly a combination of forks and execs. The programs are all named on one command line. Each program performs some function and (if successful) runs exec(2) on the rest of its command line.

Bernstein’s rblsmtpd package is a prototypical example. It serves to look up a host in the antispam DNS zone of the Mail Abuse Prevention System. It does this by doing a DNS query on the IP address passed into it in the TCPREMOTEIP environment variable. If the query is successful, then rblsmtpd runs its own SMTP that discards the mail. Otherwise the remaining command-line arguments are presumed to constitute a mail transport agent that knows the SMTP protocol, and are handed to exec(2) to be run.

Another example can be found in Bernstein’s qmail package. It contains a program called condredirect. The first parameter is an email address, and the remainder a gatekeeper program and arguments. condredirect forks and execs the gatekeeper with its arguments. If the gatekeeper exits successfully, condredirect forwards the email pending on stdin to the specified email address. In this case, opposite to that of rblsmtpd, the security decision is made by the child; this case is a bit more like a classical shellout.

A more elaborate example is the qmail POP3 server. It consists of three programs, qmail-popup, checkpassword, and qmail-pop3d. Checkpassword comes from a separate package cleverly called checkpassword, and unsurprisingly it checks the password. The POP3 protocol has an authentication phase and mailbox phase; once you enter the mailbox phase you cannot go back to the authentication phase. This is a perfect application for Bernstein chaining.

The first parameter of qmail-popup is the hostname to use in the POP3 prompts. The rest of its parameters are forked and passed to exec(2), after the POP3 username and password have been fetched. If the program returns failure, the password must be wrong, so qmail-popup reports that and waits for a different password. Otherwise, the program is presumed to have finished the POP3 conversation, so qmail-popup exits.

The program named on qmail-popup’s command line is expected to read three null-terminated strings from file descriptor 3.8 These are the username, password, and response to a cryptographic challenge, if any. This time it’s checkpassword which accepts as parameters the name of qmail-pop3d and its parameters. The checkpassword program exits with failure if the password does not match; otherwise it changes to the user’s uid, gid, and home directory, and executes the rest of its command line on behalf of that user.

8 qmail-popup’s standard input and standard output are the socket, and standard error (which will be file descriptor 2) goes to a log file. File descriptor 3 is guaranteed to be the next to be allocated. As an infamous kernel comment once observed: “You are not expected to understand this”.

Bernstein chaining is useful for situations in which the application needs setuid or setgid privileges to initialize a connection, or to acquire some credential, and then drop those privileges so that following code does not have to be trusted. Following the exec, the child program cannot set its real user ID back to root. It’s also more flexible than a single process, because you can modify the behavior of the system by inserting another program into the chain.

For example, rblsmtpd (mentioned above) can be inserted into a Bernstein chain, in between tcpserver (from the ucspi-tcp package) and the real SMTP server, typically qmail-smtpd. However, it works with inetd(8) and sendmail -bs as well.

7.2.5 Slave Processes

Occasionally, child programs both accept data from and return data to their callers through pipes connected to standard input and output, interactively. Unlike simple shellouts and what we have called ’bolt-ons’ above, both master and slave processes need to have internal state machines to handle a protocol between them without deadlocking or racing. This is a drastically more complex and more difficult-to-debug organization than a simple shellout.

Unix’s popen(3) call can set up either an input pipe or an output pipe for a shellout, but not both for a slave process—this seems intended to encourage simpler programming. And, in fact, interactive master-slave communication is tricky enough that it is normally only used when either (a) the implied protocol is utterly trivial, or (b) the slave process has been designed to speak an application protocol along the lines we discussed in Chapter 5. We’ll return to this issue, and ways to cope with it, in Chapter 8.

When writing a master/slave pair, it is good practice for the master to support a command-line switch or environment variable that allows callers to set their own slave command. Among other things, this is useful for debugging; you will often find it handy during development to invoke the real slave process from within a harness that monitors and logs transactions between slave and master.

If you find that master/slave interactions in your program are becoming nontrivial, it may be time to think about going the rest of the way to a more peer-to-peer organization, using techniques like sockets or shared memory.

7.2.5.1 Case Study: scp and ssh

One common case in which the implied protocol really is trivial is progress meters. The scp(1) secure-copy command calls ssh(1) as a slave process, intercepting enough information from ssh’s standard output to reformat the reports as an ASCII animation of a progress bar.9

9 The friend who suggested this case study comments: “Yes, you can get away with this technique...if there are just a few easily-recognizable nuggets of information coming back from the slave process, and you have tongs and a radiation suit”.

7.2.6 Peer-to-Peer Inter-Process Communication

All the communication methods we’ve discussed so far have a sort of implicit hierarchy about them, with one program effectively controlling or driving another and zero or limited feedback passing in the opposite direction. In communications and networking we frequently need channels that are peer-to-peer, usually (but not necessarily) with data flowing freely in both directions. We’ll survey peer-to-peer communications methods under Unix here, and develop some case studies in later chapters.

7.2.6.1 Tempfiles

The use of tempfiles as communications drops between cooperating programs is the oldest IPC technique there is. Despite drawbacks, it’s still useful in shellscripts, and in one-off programs where a more elaborate and coordinated method of communication would be overkill.

The most obvious problem with using tempfiles as an IPC technique is that it tends to leave garbage lying around if processing is interrupted before the tempfile can be deleted. A less obvious risk is that of collisions between multiple instances of a program using the same name for a tempfile. This is why it is conventional for shellscripts that make tempfiles to include $$ in their names; this shell variable expands to the process-ID of the enclosing shell and effectively guarantees that the filename will be unique (the same trick is supported in Perl).

Finally, if an attacker knows the location to which a tempfile will be written, it can overwrite on that name and possibly either read the producer’s data or spoof the consumer process by inserting modified or spurious data into the file.10 This is a security risk. If the processes involved have root privileges, this is a very serious risk. It can be mitigated by setting the permissions on the tempfile directory carefully, but such arrangements are notoriously likely to spring leaks.

10 A particularly nasty variant of this attack is to drop in a named Unix-domain socket where the producer and consumer programs are expecting the tempfile to be.

All these problems aside, tempfiles still have a niche because they’re easy to set up, they’re flexible, and they’re less vulnerable to deadlocks or race conditions than more elaborate methods. And sometimes, nothing else will do. The calling conventions of your child process may require that it be handed a file to operate on. Our first example of a shellout to an editor demonstrates this perfectly.

7.2.6.2 Signals

The simplest and crudest way for two processes on the same machine to communicate with each other is for one to send the other a signal. Unix signals are a form of soft interrupt; each one has a default effect on the receiving process (usually to kill it). A process can declare a signal handler that overrides the default action for the signal; the handler is a function that is executed asynchronously when the signal is received.

Signals were originally designed into Unix as a way for the operating system to notify programs of certain errors and critical events, not as an IPC facility. The SIGHUP signal, for example, is sent to every program started from a given terminal session when that session is terminated. The SIGINT signal is sent to whatever process is currently attached to the keyboard when the user enters the currently-defined interrupt character (often control-C). Nevertheless, signals can be useful for some IPC situations (and the POSIX-standard signal set includes two signals, SIGUSR1 and SIGUSR2, intended for this use). They are often employed as a control channel for daemons (programs that run constantly, invisibly, in background), a way for an operator or another program to tell a daemon that it needs to either reinitialize itself, wake up to do work, or write internal-state/debugging information to a known location.

I insisted SIGUSR1 and SIGUSR2 be invented for BSD. People were grabbing system signals to mean what they needed them to mean for IPC, so that (for example) some programs that segfaulted would not coredump because SIGSEGV had been hijacked.

This is a general principle—people will want to hijack any tools you build, so you have to design them to either be un-hijackable or to be hijacked cleanly. Those are your only choices. Except, of course, for being ignored—a highly reliable way to remain unsullied, but less satisfying than might at first appear.

—Ken Arnold

A technique often used with signal IPC is the so-called pidfile. Programs that will need to be signaled will write a small file to a known location (often in /var/run or the invoking user’s home directory) containing their process ID or PID. Other programs can read that file to discover that PID. The pidfile may also function as an implicit lock file in cases where no more than one instance of the daemon should be running simultaneously.

There are actually two different flavors of signals. In the older implementations (notably V7, System III, and early System V), the handler for a given signal is reset to the default for that signal whenever the handler fires. The result of sending two of the same signal in quick succession is therefore usually to kill the process, no matter what handler was set.

The BSD 4.x  versions of Unix changed to “reliable” signals, which do not reset unless the user explicitly requests it. They also introduced primitives to block or temporarily suspend processing of a given set of signals. Modern Unixes support both styles. You should use the BSD-style nonresetting entry points for new code, but program defensively in case your code is ever ported to an implementation that does not support them.

Receiving N signals does not necessarily invoke the signal handler N times. Under the older System V signal model, two or more signals spaced very closely together (that is, within a single timeslice of the target process) can result in various race conditions11 or anomalies. Depending on what variant of signals semantics the system supports, the second and later instances may be ignored, may cause an unexpected process kill, or may have their delivery delayed until earlier instances have been processed (on modern Unixes the last is most likely).

11 A ’race condition’ is a class of problem in which correct behavior of the system relies on two independent events happening in the right order, but there is no mechanism for ensuring that they actually will. Race conditions produce intermittent, timing-dependent problems that can be devilishly difficult to debug.

The modern signals API is portable across all recent Unix versions, but not to Windows or classic (pre-OS X) MacOS.

7.2.6.3 System Daemons and Conventional Signals

Many well-known system daemons accept SIGHUP (originally the signal sent to programs on a serial-line drop, such as was produced by hanging up a modem connection) as a signal to reinitialize (that is, reload their configuration files); examples include Apache and the Linux implementations of bootpd(8), gated(8), inetd(8), mountd(8), named(8), nfsd(8), and ypbind(8). In a few cases, SIGHUP is accepted in its original sense of a session-shutdown signal (notably in Linux pppd(8)), but that role nowadays generally goes to SIGTERM.

SIGTERM (’terminate’) is often accepted as a graceful-shutdown signal (this is as distinct from SIGKILL, which does an immediate process kill and cannot be blocked or handled). SIGTERM actions often involve cleaning up tempfiles, flushing final updates out to databases, and the like.

When writing daemons, follow the Rule of Least Surprise: use these conventions, and read the manual pages to look for existing models.

7.2.6.4 Case Study: fetchmail’s Use of Signals

The fetchmail utility is normally set up to run as a daemon in background, periodically collecting mail from all remote sites defined in its run-control file and passing the mail to the local SMTP listener on port 25 without user intervention. fetchmail sleeps for a user-defined interval (defaulting to 15 minutes) between collection attempts, so as to avoid constantly loading the network.

When you invoke fetchmail with no arguments, it checks to see if you have a fetchmail daemon already running (it does this by looking for a pidfile). If no daemon is running, fetchmail starts up normally using whatever control information has been specified in its run-control file. If a daemon is running, on the other hand, the new fetchmail instance just signals the old one to wake up and collect mail immediately; then the new instance terminates. In addition, fetchmail -q sends a termination signal to any running fetchmail daemon.

Thus, typing fetchmail means, in effect, “poll now and leave a daemon running to poll later; don’t bother me with the detail of whether a daemon was already running or not”. Observe that the detail of which particular signals are used for wakeup and termination is something the user doesn’t have to know.

7.2.6.5 Sockets

Sockets were developed in the BSD lineage of Unix as a way to encapsulate access to data networks. Two programs communicating over a socket typically see a bidirectional byte stream (there are other socket modes and transmission methods, but they are of only minor importance). The byte stream is both sequenced (that is, even single bytes will be received in the same order sent) and reliable (socket users are guaranteed that the underlying network will do error detection and retry to ensure delivery). Socket descriptors, once obtained, behave essentially like file descriptors.

Sockets differ from read/write in one important case. If the bytes you send arrive, but the receiving machine fails to ACK, the sending machine’s TCP/IP stack will time out. So getting an error does not necessarily mean that the bytes didn’t arrive; the receiver may be using them. This problem has profound consequences for the design of reliable protocols, because you have to be able to work properly when you don’t know what was received in the past. Local I/O is ’yes/no’. Socket I/O is ’yes/no/maybe’. And nothing can ensure delivery—the remote machine might have been destroyed by a comet.

—Ken Arnold

At the time a socket is created, you specify a protocol family which tells the network layer how the name of the socket is interpreted. Sockets are usually thought of in connection with the Internet, as a way of passing data between programs running on different hosts; this is the AF_INET socket family, in which addresses are interpreted as host-address and service-number pairs. However, the AF_UNIX (aka AF_LOCAL) protocol family supports the same socket abstraction for communication between two processes on the same machine (names are interpreted as the locations of special files analogous to bidirectional named pipes). As an example, client programs and servers using the X windowing system typically use AF_LOCAL sockets to communicate.

All modern Unixes support BSD-style  sockets, and as a matter of design they are usually the right thing to use for bidirectional IPC no matter where your cooperating processes are located. Performance pressure may push you to use shared memory or tempfiles or other techniques that make stronger locality assumptions, but under modern conditions it is best to assume that your code will need to be scaled up to distributed operation. More importantly, those locality assumptions may mean that portions of your system get chummier with each others’ internals than ought to be the case in a good design. The separation of address spaces that sockets enforce is a feature, not a bug.

To use sockets gracefully, in the Unix tradition, start by designing an application protocol for use between them—a set of requests and responses which expresses the semantics of what your programs will be communicating about in a succinct way. We’ve already discussed the some major issues in the design of application protocols in Chapter 5.

Sockets are supported in all recent Unixes, under Windows, and under classic MacOS as well.

7.2.6.5.1 Case Study: PostgreSQL

PostgreSQL is an open-source database program. Had it been implemented as a monster monolith, it would be a single program with an interactive interface that manipulates database files on disk directly. Interface would be welded together with implementation, and two instances of the program attempting to manipulate the same database at the same time would have serious contention and locking issues.

Instead, the PostgreSQL suite includes a server called postmaster and at least three client applications. One postmaster server process per machine runs in background and has exclusive access to the database files. It accepts requests in the SQL query minilanguage through TCP/IP sockets, and returns answers in a textual format as well. When the user runs a PostgreSQL client, that client opens a session to postmaster and does SQL transactions with it. The server can handle several client sessions at once, and sequences requests so that they don’t interfere with each other.

Because the front end and back end are separate, the server doesn’t need to know anything except how to interpret SQL requests from a client and send SQL reports back to it. The clients, on the other hand, don’t need to know anything about how the database is stored. Clients can be specialized for different needs and have different user interfaces.

This organization is quite typical for Unix databases—so much so that it is often possible to mix and match SQL clients and SQL servers. The interoperability issues are the SQL server’s TCP/IP port number, and whether client and server support the same dialect of SQL.

7.2.6.5.2 Case Study: Freeciv

In Chapter 6, we introduced Freeciv as an example of transparent data formats. But more critical to the way it supports multiplayer gaming is the client/server partitioning of the code. This is a representative example of a program in which the application needs to be distributed over a wide-area network and handles communication through TCP/IP sockets.

The state of a running Freeciv game is maintained by a server process, the game engine. Players run GUI clients which exchange information and commands with the server through a packet protocol. All game logic is handled in the server. The details of GUI are handled in the client; different clients support different interface styles.

This is a very typical organization for a multiplayer online game. The packet protocol uses TCP/IP as a transport, so one server can handle clients running on different Internet hosts. Other games that are more like real-time simulations (notably first-person shooters) use raw Internet datagram protocol (UDP) and trade lower latency for some uncertainty about whether any given packet will be delivered. In such games, users tend to be issuing control actions continuously, so sporadic dropouts are tolerable, but lag is fatal.

7.2.6.6 Shared Memory

Whereas two processes using sockets to communicate may live on different machines (and, in fact, be separated by an Internet connection spanning half the globe), shared memory requires producers and consumers to be co-resident on the same hardware. But, if your communicating processes can get access to the same physical memory, shared memory will be the fastest way to pass information between them.

Shared memory may be disguised under different APIs, but on modern Unixes the implementation normally depends on the use of mmap(2) to map files into memory that can be shared between processes. POSIX defines a shm_open(3) facility with an API that supports using files as shared memory; this is mostly a hint to the operating system that it need not flush the pseudofile data to disk.

Because access to shared memory is not automatically serialized by a discipline resembling read and write calls, programs doing the sharing must handle contention and deadlock issues themselves, typically by using semaphore variables located in the shared segment. The issues here resemble those in multithreading (see the end of this chapter for discussion) but are more manageable because default is not to share memory. Thus, problems are better contained.

On systems where it is available and reliable, the Apache web server’s scoreboard facility uses shared memory for communication between an Apache master process and the load-sharing pool of Apache images that it manages. Modern X implementations also use shared memory, to pass large images between client and server when they are resident on the same machine, to avoid the overhead of socket communication. Both uses are performance hacks justified by experience and testing, rather than being architectural choices.

The mmap(2) call is supported under all modern Unixes, including Linux and the open-source BSD versions; this is described in the Single Unix Specification. It will not normally be available under Windows, MacOS classic, and other operating systems.

Before purpose-built mmap(2) was available, a common way for two processes to communicate was for them to open the same file, and then delete that file. The file wouldn’t go away until all open filehandles were closed, but some old Unixes took the link count falling to zero as a hint that they could stop updating the on-disk copy of the file. The downside was that your backing store was the file system rather than a swap device, the file system the deleted file lived on couldn’t be unmounted until the programs using it closed, and attaching new processes to an existing shared memory segment faked up in this way was tricky at best.

After Version 7 and the split between the BSD and System V lineages, the evolution of Unix interprocess communication took two different directions. The BSD direction led to sockets. The AT&T lineage, on the other hand, developed named pipes (as previously discussed) and an IPC facility, specifically designed for passing binary data and based on shared-memory bidirectional message queues. This is called ’System V IPC’—or, among old timers, ’Indian Hill’ IPC after the AT&T facility where it was first written.

The upper, message-passing layer of System V IPC has largely fallen out of use. The lower layer, which consists of shared memory and semaphores, still has significant applications under circumstances in which one needs to do mutual-exclusion locking and some global data sharing among processes running on the same machine. These System V shared memory facilities evolved into the POSIX shared-memory API, supported under Linux, the BSDs, MacOS X and Windows, but not classic MacOS.

By using these shared-memory and semaphore facilities (shmget(2), semget(2), and friends) one can avoid the overhead of copying data through the network stack. Large commercial databases (including Oracle, DB2, Sybase, and Informix) use this technique heavily.

7.3 Problems and Methods to Avoid

While BSD-style sockets over TCP/IP have become the dominant IPC method under Unix, there are still live controversies over the right way to partition by multiprogramming. Some obsolete methods have not yet completely died, and some techniques of questionable utility have been imported from other operating systems (often in association with graphics or GUI programming). We’ll be touring some dangerous swamps here; beware the crocodiles.

7.3.1 Obsolescent Unix IPC Methods

Unix (born 1969) long predates TCP/IP (born 1980) and the ubiquitous networking of the 1990s and later. Anonymous pipes, redirection, and shellout have been in Unix since very early days, but the history of Unix is littered with the corpses of APIs tied to obsolescent IPC and networking models, beginning with the mx() facility that appeared in Version 6 (1976) and was dropped before Version 7 (1979).

Eventually BSD sockets won out as IPC was unified with networking. But this didn’t happen until after fifteen years of experimentation that left a number of relics behind. It’s useful to know about these because there are likely to be references to them in your Unix documentation that might give the misleading impression that they’re still in use. These obsolete methods are described in more detail in Unix Network Programming [Stevens90].

The real explanation for all the dead IPC facilities in old AT&T Unixes was politics. The Unix Support Group was headed by a low-level manager, while some projects that used Unix were headed by vice presidents. They had ways to make irresistible requests, and would not brook the objection that most IPC mechanisms are interchangeable.

—Doug McIlroy

7.3.1.1 System V IPC

The System V IPC facilities are message-passing facilities based on the System V shared memory facility we described earlier.

Programs that cooperate using System V IPC usually define shared protocols based on exchanging short (up to 8K) binary messages. The relevant manual pages are msgctl(2) and friends. As this style has been largely superseded by text protocols passed between sockets, we do not give an example here.

The System V IPC facilities are present in Linux and other modern Unixes. However, as they are a legacy feature, they are not exercised very often. The Linux version is still known to have bugs as of mid-2003. Nobody seems to care enough to fix them.

7.3.1.2 Streams

Streams networking was invented for Unix Version 8 (1985) by Dennis Ritchie. A re-implementation called STREAMS (yes, it is all-capitals in the documentation) first became available in the 3.0 release of System V Unix (1986). The STREAMS facility provided a full-duplex interface (functionally not unlike a BSD socket, and like sockets, accessible through normal read(2) and write(2) operations after initial setup) between a user process and a specified device driver in the kernel. The device driver might be hardware such as a serial or network card, or it might be a software-only pseudodevice set up to pass data between user processes.

An interesting feature of both streams and STREAMS12 is that it is possible to push protocol-translation modules into the kernel’s processing path, so that the device the user process ’sees’ through the full-duplex channel is actually filtered. This capability could be used, for example, to implement a line-editing protocol for a terminal device. Or one could implement protocols such as IP or TCP without wiring them directly into the kernel.

12 STREAMS was much more complex. Dennis Ritchie is reputed to have said “Streams means something different when shouted”.

Streams originated as an attempt to clean up a messy feature of the kernel called ’line disciplines’—alternative modes of processing character streams coming from serial terminals and early local-area networks. But as serial terminals faded from view, Ethernet LANs became ubiquitous, and TCP/IP drove out other protocol stacks and migrated into Unix kernels, the extra flexibility provided by STREAMS had less and less utility. In 2003, System V Unix still supports STREAMS, as do some System V/BSD hybrids such as Digital Unix and Sun Microsystems’ Solaris.

Linux and other open-source Unixes have effectively discarded STREAMS. Linux kernel modules and libraries are available from the LiS <http://www.gcom.com/home/linux/lis/> project, but (as of mid-2003) are not integrated into the stock Linux kernel. They will not be supported under non-Unix operating systems.

7.3.2 Remote Procedure Calls

Despite occasional exceptions such as NFS (Network File System) and the GNOME project, attempts to import CORBA, ASN.1, and other forms of remote-procedure-call interface have largely failed—these technologies have not been naturalized into the Unix culture.

There seem to be several underlying reasons for this. One is that RPC interfaces are not readily discoverable; that is, it is difficult to query these interfaces for their capabilities, and difficult to monitor them in action without building single-use tools as complex as the programs being monitored (we examined some of the reasons for this in Chapter 6). They have the same version skew problems as libraries, but those problems are harder to track because they’re distributed and not generally obvious at link time.

As a related issue, interfaces that have richer type signatures also tend to be more complex, therefore more brittle. Over time, they tend to succumb to ontology creep as the inventory of types that get passed across interfaces grows steadily larger and the individual types more elaborate. Ontology creep is a problem because structs are more likely to mismatch than strings; if the ontologies of the programs on each side don’t exactly match, it can be very hard to teach them to communicate at all, and fiendishly difficult to resolve bugs. The most successful RPC applications, such as the Network File System, are those in which the application domain naturally has only a few simple data types.

The usual argument for RPC is that it permits “richer” interfaces than methods like text streams—that is, interfaces with a more elaborate and application-specific ontology of data types. But the Rule of Simplicity applies! We observed in Chapter 4 that one of the functions of interfaces is as choke points that prevent the implementation details of modules from leaking into each other. Therefore, the main argument in favor of RPC is also an argument that it increases global complexity rather than minimizing it.

With classical RPC, it’s too easy to do things in a complicated and obscure way instead of keeping them simple. RPC seems to encourage the production of large, baroque, over-engineered systems with obfuscated interfaces, high global complexity, and serious version-skew and reliability problems—a perfect example of thick glue layers run amok.

Windows COM and DCOM are perhaps the archetypal examples of how bad this can get, but there are plenty of others. Apple abandoned OpenDoc, and both CORBA and the once wildly hyped Java RMI have receded from view in the Unix world as people have gained field experience with them. This may well be because these methods don’t actually solve more problems than they cause.

Andrew S. Tanenbaum and Robbert van Renesse have given us a detailed analysis of the general problem in A Critique of the Remote Procedure Call Paradigm [Tanenbaum-VanRenesse], a paper which should serve as a strong cautionary note to anyone considering an architecture based on RPC.

All these problems may predict long-term difficulties for the relatively few Unix projects that use RPC. Of these projects, perhaps the best known is the GNOME desktop effort.13 These problems also contribute to the notorious security vulnerabilities of exposing NFS servers.

13 GNOME’s main competitor, KDE, started with CORBA but abandoned it in their 2.0 release. They have been on a quest for lighter-weight IPC methods ever since.

Unix tradition, on the other hand, strongly favors transparent and discoverable interfaces. This is one of the forces behind the Unix culture’s continuing attachment to IPC through textual protocols. It is often argued that the parsing overhead of textual protocols is a performance problem relative to binary RPCs—but RPC interfaces tend to have latency problems that are far worse, because (a) you can’t readily anticipate how much data marshaling and unmarshaling a given call will involve, and (b) the RPC model tends to encourage programmers to treat network transactions as cost-free. Adding even one additional round trip to a transaction interface tends to add enough network latency to swamp any overhead from parsing or marshaling.

Even if text streams were less efficient than RPC, the performance loss would be marginal and linear, the kind better addressed by upgrading your hardware than by expending development time or adding architectural complexity. Anything you might lose in performance by using text streams, you gain back in the ability to design systems that are simpler—easier to monitor, to model, and to understand.

Today, RPC and the Unix attachment to text streams are converging in an interesting way, through protocols like XML-RPC and SOAP. These, being textual and transparent, are more palatable to Unix programmers than the ugly and heavyweight binary serialization formats they replace. While they don’t solve all the more general problems pointed out by Tanenbaum and van Renesse, they do in some ways combine the advantages of both text-stream and RPC worlds.

7.3.3 Threads—Threat or Menace?

Though Unix developers have long been comfortable with computation by multiple cooperating processes, they do not have a native tradition of using threads (processes that share their entire address spaces). These are a recent import from elsewhere, and the fact that Unix programmers generally dislike them is not merely accident or historical contingency.

From a complexity-control point of view, threads are a bad substitute for lightweight processes with their own address spaces; the idea of threads is native to operating systems with expensive process-spawning and weak IPC facilities.

By definition, though daughter threads of a process typically have separate local-variable stacks, they share the same global memory. The task of managing contentions and critical regions in this shared address space is quite difficult and a fertile source of global complexity and bugs. It can be done, but as the complexity of one’s locking regime rises, the chance of races and deadlocks due to unanticipated interactions rises correspondingly.

Threads are a fertile source of bugs because they can too easily know too much about each others’ internal states. There is no automatic encapsulation, as there would be between processes with separate address spaces that must do explicit IPC to communicate. Thus, threaded programs suffer from not just ordinary contention problems, but from entire new categories of timing-dependent bugs that are excruciatingly difficult to even reproduce, let alone fix.

Thread developers have been waking up to this problem. Recent thread implementations and standards show an increasing concern with providing thread-local storage, which is intended to limit problems arising from the shared global address space. As threading APIs move in this direction, thread programming starts to look more and more like a controlled use of shared memory.

Threads often prevent abstraction. In order to prevent deadlock, you often need to know how and if the library you are using uses threads in order to avoid deadlock problems. Similarly, the use of threads in a library could be affected by the use of threads at the application layer.

—David Korn

To add insult to injury, threading has performance costs that erode its advantages over conventional process partitioning. While threading can get rid of some of the overhead of rapidly switching process contexts, locking shared data structures so threads won’t step on each other can be just as expensive.

The X server, able to execute literally millions of ops/second, is not threaded; it uses a poll/select loop. Various efforts to make a multithreaded implementation have come to no good result. The costs of locking and unlocking get too high for something as performance-sensitive as graphics servers.

—Jim Gettys

This problem is fundamental, and has also been a continuing issue in the design of Unix kernels for symmetric multiprocessing. As your resource-locking gets finer-grained, latency due to locking overhead can increase fast enough to swamp the gains from locking less core memory.

One final difficulty with threads is that threading standards still tend to be weak and underspecified as of mid-2003. Theoretically conforming libraries for Unix standards such as POSIX threads (1003.1c) can nevertheless exhibit alarming differences in behavior across platforms, especially with respect to signals, interactions with other IPC methods, and resource cleanup times. Windows and classic MacOS have native threading models and interrupt facilities quite different from those of Unix and will often require considerable porting effort even for simple threading cases. The upshot is that you cannot count on threaded programs to be portable.

For more discussion and a lucid contrast with event-driven programming, see Why Threads Are a Bad Idea [Ousterhout96].

7.4 Process Partitioning at the Design Level

Now that we have all these methods, what should we do with them?

The first thing to notice is that tempfiles, the more interactive sort of master/slave process relationship, sockets, RPC, and all other methods of bidirectional IPC are at some level equivalent—they’re all just ways for programs to exchange data during their lifetimes. Much of what we do in a sophisticated way using sockets or shared memory we could do in a primitive way using tempfiles as mailboxes and signals for notification. The differences are at the edges, in how programs establish communication, where and when one does the marshalling and unmarshalling of messages, in what sorts of buffering problems you may have, and atomicity guarantees you get on the messages (that is, to what extent you can know that the result of a single send action from one side will show up as a single receive event on the other).

We’ve seen from the PostgreSQL study that one effective way to hold down complexity is to break an application into a client/server pair. The PostgreSQL client and server communicate through an application protocol over sockets, but very little about the design pattern would change if they used any other bidirectional IPC method.

This kind of partitioning is particularly effective in situations where multiple instances of the application must manage access to resources that are shared among all. A single server process may manage all resource contention, or cooperating peers may each take responsibility for some critical resource.

Client-server partitioning can also help distribute cycle-hungry applications across multiple hosts. Or it may make them suitable for distributed computing across the Internet (as with Freeciv). We’ll discuss the related CLI server pattern in Chapter 11.

Because all these peer-to-peer IPC techniques are alike at some level, we should evaluate them mainly on the amount of program-complexity overhead they incur, and how much opacity they introduce into our designs. This, ultimately, is why BSD sockets have won over other Unix IPC methods, and why RPC has generally failed to get much traction.

Threads are fundamentally different. Rather than supporting communication among different programs, they support a sort of timesharing within an instance of a single program. Rather than being a way to partition a big program into smaller ones with simpler behavior, threading is strictly a performance hack. It has all the problems normally associated with performance hacks, and a few special ones of its own.

Accordingly, while we should seek ways to break up large programs into simpler cooperating processes, the use of threads within processes should be a last resort rather than a first. Often, you may find you can avoid them. If you can use limited shared memory and semaphores, asynchronous I/O using SIGIO, or poll(2)/select(2) rather than threading, do it that way. Keep it simple; use techniques earlier on this list and lower on the complexity scale in preference to later ones.

The combination of threads, remote-procedure-call interfaces, and heavyweight object-oriented design is especially dangerous. Used sparingly and tastefully, any of these techniques can be valuable—but if you are ever invited onto a project that is supposed to feature all three, fleeing in terror might well be an appropriate reaction.

We have previously observed that programming in the real world is all about managing complexity. Tools to manage complexity are good things. But when the effect of those tools is to proliferate complexity rather than to control it, we would be better off throwing them away and starting from zero. An important part of the Unix wisdom is to never forget this.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset