5. Textuality: Good Protocols Make Good Practice

It’s a well-known fact that computing devices such as the abacus were invented thousands of years ago. But it’s not well known that the first use of a common computer protocol occurred in the Old Testament. This, of course, was when Moses aborted the Egyptians’ process with a control-sea.

rec.arts.comics, February 1992
—Tom Galloway

In this chapter, we’ll look at what the Unix tradition has to tell us about two different kinds of design that are closely related: the design of file formats for retaining application data in permanent storage, and the design of application protocols for passing data and commands between cooperating programs, possibly over a network.

What unifies these two kinds of design is that they both involve the serialization of in-memory data structures. For the internal operation of computer programs, the most convenient representation of a complex data structure is one in which all fields have the machine’s native data format (e.g. two’s-complement binary for integers) and all pointers are actual memory addresses (as opposed, say, to being named references). But these representations are not well suited to storage and transmission; memory addresses in the data structure lose their meaning outside memory, and emitting raw native data formats causes interoperability problems passing data between machines with different conventions (big- vs. little-endian, say, or 32-bit vs. 64-bit).
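The byte-order problem is easy to reproduce. The following is a minimal sketch using Python's standard struct module; the value 1025 is arbitrary and chosen only because its bytes are easy to tell apart.

```python
import struct

value = 1025  # 0x00000401, an arbitrary illustrative value

little = struct.pack("<I", value)   # 4-byte unsigned int, little-endian
big = struct.pack(">I", value)      # 4-byte unsigned int, big-endian
text = str(value).encode("ascii")   # the same number as decimal text

# A reader that assumes the wrong byte order gets no error, just a
# different number. The decimal text has no such ambiguity.
misread = struct.unpack(">I", little)[0]

print(little.hex(), big.hex(), misread, int(text))
```

The two packed forms hold the same four bytes in opposite order, and nothing in the data says which order was used; that knowledge has to live in the format's documentation or in the reader's head.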

For transmission and storage, the traversable, quasi-spatial layout of data structures like linked lists needs to be flattened or serialized into a byte-stream representation from which the structure can later be recovered. The serialization (save) operation is sometimes called marshaling and its inverse (load) operation unmarshaling. These terms are usually applied with respect to objects in an OO language like C++ or Python or Java, but could be used with equal justice of operations like loading a graphics file into the internal storage of a graphics editor and saving it out after modifications.

A significant percentage of what C and C++ programmers maintain is ad-hoc code for marshaling and unmarshaling operations—even when the serialized representation chosen is as simple as a binary structure dump (a common technique under non-Unix environments). Modern languages like Python and Java tend to have built-in unmarshal and marshal functions that can be applied to any object or byte-stream representing an object, and that reduce this labor substantially.

But these naïve methods are often unsatisfactory for various reasons, including both the machine-interoperability problems we mentioned above and the negative trait of being opaque to other tools. When the application is a network protocol, economy may demand that an internal data structure (such as, say, a message with source and destination addresses) be serialized not into a single blob of data but into a series of attempted transactions or messages which the receiving machine may reject (so that, for example, a large message can be rejected if the destination address is invalid).

Interoperability, transparency, extensibility, and storage or transaction economy: these are the important themes in designing file formats and application protocols. Interoperability and transparency demand that we focus such designs on clean data representations, rather than putting convenience of implementation or highest possible performance first. Extensibility also favors textual protocols, since binary ones are often harder to extend or subset cleanly. Transaction economy sometimes pushes in the opposite direction—but we shall see that putting that criterion first is a form of premature optimization that it is often wise to resist.

Finally, we must note a difference between data file formats and the run-control files that are often used to set the startup options of Unix programs. The most basic difference is that (with sporadic exceptions like GNU Emacs’s configuration interface) programs don’t normally modify their own run-control files—the information flow is one-way, from file read at startup time to application settings. Data-file formats, on the other hand, associate properties with named resources and are both read and written by their applications. Configuration files are generally hand-edited and small, whereas data files are program-generated and can become arbitrarily large.

Historically, Unix has related but different sets of conventions for these two kinds of representation. The conventions for run control files are surveyed in Chapter 10; only conventions for data files are examined in this chapter.

5.1 The Importance of Being Textual

Pipes and sockets will pass binary data as well as text. But there are good reasons the examples we’ll see in Chapter 7 are textual: reasons that hark back to Doug McIlroy’s advice quoted in Chapter 1. Text streams are a valuable universal format because they’re easy for human beings to read, write, and edit without specialized tools. These formats are (or can be designed to be) transparent.

Also, the very limitations of text streams help enforce encapsulation. By discouraging elaborate representations with rich, densely encoded structure, they discourage programs from being promiscuous with each other about their internal states. We’ll return to this point at the end of Chapter 7 when we discuss RPC.

When you feel the urge to design a complex binary file format, or a complex binary application protocol, it is generally wise to lie down until the feeling passes. If performance is what you’re worried about, implementing compression on the text protocol stream either at some level below or above the application protocol will give you a cleaner and perhaps better-performing design than a binary protocol (text compresses well, and quickly).
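To get a feel for how well text compresses, here is a rough sketch using Python's standard zlib module. The records are hypothetical, generated only for this illustration, and they are more repetitive than most real data, so the ratio they produce should be read as an optimistic case rather than a benchmark.

```python
import zlib

# Hypothetical colon-separated text records, invented for this sketch.
records = "".join(
    f"user{n}:x:{1000 + n}:{1000 + n}::/home/user{n}:/bin/sh\n"
    for n in range(500)
).encode("ascii")

packed = zlib.compress(records)      # compress the text stream
restored = zlib.decompress(packed)   # and recover it unchanged
ratio = len(packed) / len(records)

print(len(records), len(packed), round(ratio, 2))
```

The application on either side of the compression layer still sees plain text, so nothing is lost in transparency.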

A bad example of binary formats in Unix history was the way device-independent troff read a binary file containing device information, supposedly for speed. The initial implementation generated that binary file from a text description in a somewhat unportable way. Faced with a need to port ditroff quickly to a new machine, rather than reinvent the binary goo, I ripped it out and just had ditroff read the text file. With carefully crafted file-reading code, the speed penalty was negligible.

—Henry Spencer

Designing a textual protocol tends to future-proof your system. One specific reason is that ranges on numeric fields aren’t implied by the format itself. Binary formats usually specify the number of bits allocated to a given value, and extending them is difficult. For example, IPv4 only allows 32 bits for an address. To extend address size to 128 bits (as done by IPv6) requires a major revamping.[1] In contrast, if you need a larger value in a text format, just write it. It may be that a given program can’t receive values in that range, but it’s usually easier to modify the program than to modify all the data stored in that format.

[1] There is a legend that some early airline reservation systems allocated exactly one byte for a plane’s passenger count. Supposedly they became very confused by the arrival of the Boeing 747, the first plane that could carry more than 255 passengers.
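The difference between a range fixed by the format and a range fixed only by the reading program can be sketched in a few lines of Python. The passenger count of 416 is a hypothetical figure picked to exceed one byte.

```python
import struct

passengers = 416  # hypothetical count, too large for a single byte

try:
    struct.pack("B", passengers)   # "B" is one unsigned byte, 0..255
    fits_in_byte = True
except struct.error:
    fits_in_byte = False           # the format itself rejects the value

as_text = str(passengers)          # a text field just gets one digit longer
round_trip = int(as_text)

print(fits_in_byte, as_text, round_trip)
```

With the one-byte field, every stored record and every program that touches the format must change. With the text field, only programs that made their own assumptions about the range need attention.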

The only good justification for a binary protocol is if you’re going to be manipulating large enough data sets that you’re genuinely worried about getting the most bit-density out of your media, or if you’re very concerned about the time or instruction budget required to interpret the data into an in-core structure. Formats for large images and multimedia are sometimes an example of the former, and network protocols with hard latency requirements sometimes an example of the latter.

The reciprocal problem with SMTP or HTTP-like text protocols is that they tend to be expensive in bandwidth and slow to parse. The smallest X request is 4 bytes: the smallest HTTP request is about 100 bytes. X requests, including amortized overhead of transport, can be executed in the order of 100 instructions; at one point, an Apache [web server] developer proudly indicated they were down to 7000 instructions. For graphics, bandwidth becomes everything on output; hardware is designed such that these days the graphics-card bus is the bottleneck for small operations, so any protocol had better be very tight if it is not to be a worse bottleneck. This is the extreme case.

—Jim Gettys

These concerns are valid in other extreme cases as well as in X—for example, in the design of graphics file formats intended to hold very large images. But they are usually just another case of premature-optimization fever. Textual formats don’t necessarily have much lower bit density than binary ones; they do after all use seven out of eight bits per byte. And what you gain by not having to parse text, you generally lose the first time you need to generate a test load, or to eyeball a program-generated example of your format and figure out what’s in there.

In addition, the kind of thinking that goes into designing tight binary formats tends to fall down on making them cleanly extensible. The X designers experienced this:

Against the current X framework is the fact we didn’t design enough of a structure to make it easier to ignore trivial extensions to the protocol; we can do this some of the time, but a bit better framework would have been good.

—Jim Gettys

When you think you have an extreme case that justifies a binary file format or protocol, you need to think very carefully about extensibility and leaving room in the design for growth.

5.1.1 Case Study: Unix Password File Format

On many operating systems, the per-user data required to validate logins and start a user’s session is an opaque binary database. Under Unix, by contrast, it’s a text file with records one per line and colon-separated fields.

Example 5.1 consists of some randomly-chosen example lines:

Example 5.1. Password file example.

[Example listing not reproduced here.]

Without even knowing anything about the semantics of the fields, we can notice that it would be hard to pack the data much tighter in a binary format. The colon sentinel characters would have to have functional equivalents taking at least as much space (usually either count bytes or NULs). The per-user records would either have to have terminators (which could hardly be shorter than a single newline) or else be wastefully padded out to a fixed length.

Actually the prospects for saving space through binary encoding pretty much vanish if you know the actual semantics of the data. The numeric user ID (3rd) and group ID (4th) fields are integers, thus on most machines a binary representation would be at least 4 bytes, and longer than the text for values up to 999. But let’s agree to ignore this for now and suppose the best case that the numeric fields have a 0-255 range.

We could tighten up the numeric fields (3rd and 4th) by collapsing the numerics to single bytes, and the password strings (2nd) to an 8-bit encoding. On this example, that would give about an 8% size decrease.

That 8% of putative inefficiency buys us a lot. It avoids putting an arbitrary limit on the range of the numeric fields. It gives us the ability to modify the password file with any old text editor of our choice, rather than having to build a specialized tool to edit a binary format (though in the case of the password file itself, we have to be extra careful about concurrent edits). And it gives us the ability to do ad-hoc searches and filters and reports on the user account information with text-stream tools such as grep(1).
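The same transparency that makes grep(1) useful makes the format trivial to read from a scripting language. Below is a minimal sketch in Python. The sample lines are invented, and the names given to the fifth through seventh fields are the conventional ones rather than anything established above; the sketch also ignores escaped colons for now.

```python
# Hypothetical sample lines in password-file layout.
SAMPLE = """\
root:x:0:0:root:/root:/bin/sh
alice:x:1000:1000:Alice Example:/home/alice:/bin/sh
bob:x:1001:1001:Bob Example:/home/bob:/bin/csh
"""

# Field names are assumptions for readability; the file stores positions only.
FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")

def parse_passwd(text):
    """Turn colon-separated lines into dictionaries keyed by field name."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = dict(zip(FIELDS, line.split(":")))
        record["uid"] = int(record["uid"])   # numeric fields are plain decimal
        record["gid"] = int(record["gid"])
        records.append(record)
    return records

users = parse_passwd(SAMPLE)
# An ad-hoc report: which accounts use a csh-family shell?
csh_users = [u["name"] for u in users if u["shell"].endswith("csh")]
print(csh_users)
```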

We do have to be a bit careful about not embedding a colon in any of the textual fields. Good practice is to tell the file write code to precede embedded colons with an escape character, and then to tell the file read code to interpret it. Unix tradition favors backslash for this use.

The fact that structural information is conveyed by field position rather than an explicit tag makes this format faster to read and write, but a bit rigid. If the set of properties associated with a key is expected to change with any frequency, one of the tagged formats described below might be a better choice.

Economy is not a major issue with password files to begin with, as they’re normally read seldom[2] and infrequently modified. Interoperability is not an issue, since various data in the file (notably user and group numbers) are not portable off the originating machine. For password files, it’s therefore quite clear that going where the transparency criterion leads was the right thing.

[2] Password files are normally read once per user session at login time, and after that occasionally by file-system utilities like ls(1) that must map from numeric user and group IDs to names.

5.1.2 Case Study: .newsrc Format

Usenet news is a worldwide distributed bulletin-board system that anticipated today’s P2P networking by two decades. It uses a message format very similar to that of RFC 822 electronic-mail messages, except that instead of being directed to personal recipients messages are sent to topic groups. Articles posted at any participating site are broadcast to each site that it has registered as a neighbor, and eventually flood-fill to all news sites.

Almost all Usenet news readers understand the .newsrc file, which records which Usenet messages have been seen by the calling user. Though it is named like a run-control file, it is not only read at startup but typically updated at the end of the newsreader run. The .newsrc format has been fixed since the first newsreaders around 1980. Example 5.2 is a representative section from a .newsrc file.

Example 5.2. A .newsrc example.

[Example listing not reproduced here.]

Each line sets properties for the newsgroup named in the first field. The name is immediately followed by a character that indicates whether the owning user is currently subscribed to the group or not; a colon indicates subscription, and an exclamation mark indicates nonsubscription. The remainder of the line is a sequence of comma-separated article numbers or ranges of article numbers, indicating which articles the user has seen.
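A reader for this layout fits in a couple of dozen lines. The following Python sketch follows the description above; the function name and the sample lines are hypothetical, and a real newsreader would need more error handling.

```python
def parse_newsrc_line(line):
    """Return (group, subscribed, ranges) for one .newsrc line."""
    for i, ch in enumerate(line):
        if ch in ":!":               # first marker ends the group name
            break
    else:
        raise ValueError("no subscription marker in %r" % line)
    group, subscribed = line[:i], (line[i] == ":")
    ranges = []
    for item in line[i + 1:].strip().split(","):
        item = item.strip()
        if not item:
            continue                 # group with no articles seen yet
        low, _, high = item.partition("-")
        ranges.append((int(low), int(high or low)))  # single number = 1-item range
    return group, subscribed, ranges

# Hypothetical lines, invented for this sketch.
print(parse_newsrc_line("rec.arts.sf.misc! 1-14774,14786,14789"))
print(parse_newsrc_line("comp.lang.c: "))
```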

Non-Unix programmers might have automatically tried to design a fast binary format in which each newsgroup status was described by either a long but fixed-length binary record, or a sequence of self-describing binary packets with internal length fields. The main point of such a binary representation would be to express ranges with binary data in paired word-length fields, in order to avoid the overhead of parsing all the range expressions at startup.

Such a layout could be read and written faster than a textual format, but it would have other problems. A naïve implementation in fixed-length records would have placed artificial length limits on newsgroup names and (more seriously) on the maximum number of ranges of seen-article numbers. A more sophisticated binary-packet format would avoid the length limits, but could not be edited with the user’s eyeballs and fingers—a capability that can be quite useful when you want to reset just some of the read bits in an individual newsgroup. Also, it would not necessarily be portable to different machine types.

The designers of the original newsreader chose transparency and interoperability over economy. The case for going in the other direction was not completely ridiculous; .newsrc files can get very large, and one modern reader (GNOME’s Pan) uses a speed-optimized private format to avoid startup lag. But to other implementers, textual representation looked like a good tradeoff in 1980, and has looked better as machines increased in speed and storage dropped in price.

5.1.3 Case Study: The PNG Graphics File Format

PNG (Portable Network Graphics) is a file format for bitmap graphics. It is like GIF, and unlike JPEG, in that it uses lossless compression and is optimized for applications such as line art and icons rather than photographic images. Documentation and open-source reference libraries of high quality are available at the Portable Network Graphics website <http://www.libpng.org/pub/png/>.

PNG is an excellent example of a thoughtfully designed binary format. A binary format is appropriate since graphics files may contain very large amounts of data, such that storage size and Internet download time would go up significantly if the pixel data were stored textually. Transaction economy was the prime consideration, with transparency sacrificed.[3] The designers were, however, careful about interoperability; PNG specifies byte order, integer word lengths, and (lack of) padding between fields.

[3] Confusingly, PNG supports a different kind of transparency: transparent pixels in the PNG image.

A PNG file consists of a sequence of chunks, each in a self-describing format beginning with the chunk type name and the chunk length. Because of this organization, PNG does not need a release number. New chunk types can be added at any time; the case of the first letter in the chunk type name informs PNG-using software whether or not each chunk can be safely ignored.

The PNG file header also repays study. It has been cleverly designed to make various common kinds of file corruption (e.g., by 7-bit transmission links, or mangling of CR and LF characters) easy to detect.
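Both properties can be seen in a short Python sketch that builds a tiny PNG in memory and walks its chunks. The details used here (the 8-byte signature, the length-type-data-CRC chunk layout, and the lowercase first letter marking an ignorable chunk) come from the PNG specification rather than from the text above, and the helper names are our own.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # high bit, CR-LF, ^Z, LF: all tripwires

def make_chunk(ctype, data):
    """Assemble one chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

def iter_chunks(png):
    """Yield (type, length, ancillary) for each chunk, checking integrity."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file, or damaged in transit")
    pos = 8
    while pos < len(png):
        length, ctype = struct.unpack(">I4s", png[pos:pos + 8])
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", png[pos + 8 + length:pos + 12 + length])
        if crc != zlib.crc32(ctype + data) & 0xFFFFFFFF:
            raise ValueError("bad CRC in chunk %r" % ctype)
        ancillary = bool(ctype[0] & 0x20)   # lowercase first letter
        yield ctype.decode("ascii"), length, ancillary
        pos += 12 + length

# A 1x1 grayscale image plus a text comment, built only for this sketch.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")           # filter byte plus one pixel
png = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
       + make_chunk(b"tEXt", b"Comment\x00demo")
       + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))

chunks = list(iter_chunks(png))
print(chunks)
```

A reader that has never heard of tEXt can skip it by its length field and carry on, which is what lets the format grow without a version number. A transfer that rewrites CR-LF as LF changes the signature and is caught before any chunk is read.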

The PNG standard is precise, comprehensive, and well written. It could serve as a model for how to write file format standards.

5.2 Data File Metaformats

A data file metaformat is a set of syntactic and lexical conventions that is either formally standardized or sufficiently well established by practice that there are standard service libraries to handle marshaling and unmarshaling it.

Unix has evolved or adopted metaformats suitable for a wide range of applications. It is good practice to use one of these (rather than an idiosyncratic custom format) wherever possible. The benefits begin with the amount of custom parsing and generation code that you may be able to avoid writing by using a service library. But the most important benefit is that developers and even many users will instantly recognize these formats and feel comfortable with them, which reduces the friction costs of learning new programs.

In the following discussion, when we refer to “traditional Unix tools” we mean the combination of grep(1), sed(1), awk(1), tr(1), and cut(1) for doing text searches and transformations. Perl and other scripting languages tend to have good native support for parsing the line-oriented formats that these tools encourage.

Here, then, are the standard formats that can serve you as models.

5.2.1 DSV Style

DSV stands for Delimiter-Separated Values. Our first case study in textual metaformats was the /etc/passwd file, which is a DSV format with colon as the value separator. Under Unix, colon is the default separator for DSV formats in which the field values may contain whitespace.

/etc/passwd format (one record per line, colon-separated fields) is very traditional under Unix and frequently used for tabular data. Other classic examples include the /etc/group file describing security groups and the /etc/inittab file used to control startup and shutdown of Unix service programs at different run levels of the operating system.

Data files in this style are expected to support inclusion of colons in the data fields by backslash escaping. More generally, code that reads them is expected to support record continuation by ignoring backslash-escaped newlines, and to allow embedding nonprintable character data by C-style backslash escapes.
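The escaping rule is simple enough that a reader and writer for it can be sketched in a few lines of Python. The function names are hypothetical, and this sketch covers only the separator and the backslash itself; continuation lines and C-style escapes such as a newline code would be layered on top.

```python
def split_dsv(line, sep=":"):
    """Split one DSV record, honoring backslash escapes."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            # Whatever follows the escape is taken literally; a lone
            # trailing backslash stands for itself.
            current.append(next(chars, "\\"))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

def join_dsv(fields, sep=":"):
    """Write one DSV record, escaping backslashes first, then separators."""
    def escape(field):
        return field.replace("\\", "\\\\").replace(sep, "\\" + sep)
    return sep.join(escape(f) for f in fields)

print(split_dsv(r"a\:b:c\\d:e"))
print(join_dsv(["10:30", "C:\\temp", ""]))
```

Note that the reader has exactly one special case to test for, which is the property the comparison with CSV later in this section turns on.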

This format is most appropriate when the data is tabular, keyed by a name (in the first field), and records are typically short (less than 80 characters long). It works well with traditional Unix tools.

One occasionally sees field separators other than the colon, such as the pipe character | or even an ASCII NUL. Old-school Unix practice used to favor tabs, a preference reflected in the defaults for cut(1) and paste(1); but this has gradually changed as format designers became aware of the many small irritations that ensue from the fact that tabs and spaces are not visually distinguishable.

This format is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.

In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a doubled backslash represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and each double quote inside the field must be doubled to indicate that it doesn’t end the field.

The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote—but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).
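The quoting rules can be observed with the csv module in Python's standard library, whose default dialect follows the Excel conventions. The row below is invented for this sketch.

```python
import csv
import io

# One field contains both a comma and a double quote.
row = ["widget", 'the 3" model, blue', "4.50"]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()

# The field is wrapped in quotes and its inner quote is doubled.
print(repr(encoded))

decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == row)
```

A reader of this format has to track whether it is inside a quoted field, and must look one character ahead at every double quote to decide whether it closes the field or is half of an embedded pair. That is the extra state the backslash convention avoids.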

5.2.2 RFC 822 Format

The RFC 822 metaformat derives from the textual format of Internet electronic mail messages; RFC 822 is the principal Internet RFC describing this format (since superseded by RFC 2822). MIME (Multipurpose Internet Mail Extensions) provides a way to embed typed binary data within RFC-822-format messages. (Web searches on either of these names will turn up the relevant standards.)

In this metaformat, record attributes are stored one per line, named by tokens resembling mail header-field names and terminated with a colon followed by whitespace. Field names do not contain whitespace; conventionally a dash is substituted instead. The attribute value is the entire remainder of the line, exclusive of trailing whitespace and newline. A physical line that begins with tab or whitespace is interpreted as a continuation of the current logical line. A blank line may be interpreted either as a record terminator or as an indication that unstructured text follows.
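Because the format is so widely used, many languages ship a parser for it. Here is a short sketch using the email package from Python's standard library; the header names and values are hypothetical.

```python
from email import message_from_string

# A hypothetical record: three attributes, one of them continued onto a
# second physical line, then a blank line and unstructured text.
RAW = (
    "Subject: Metaformat demo\n"
    "X-Planet-Name: Mercury\n"
    "Comment: a long value that is\n"
    "\tcontinued on a second physical line\n"
    "\n"
    "Unstructured body text follows the blank line.\n"
)

msg = message_from_string(RAW)

print(msg["subject"])                  # lookup ignores case in field names
print(msg["X-Planet-Name"])
comment = " ".join(msg["Comment"].split())   # collapse the folded line
print(comment)
print(msg.get_payload())
```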

Under Unix, this is the traditional and preferred textual metaformat for attributed messages or anything that can be closely analogized to electronic mail. More generally, it’s appropriate for records with a varying set of fields in which the hierarchy of data is flat (no recursion or tree structure).

Usenet news uses it; so do the HTTP 1.1 (and later) formats used by the World Wide Web. It is very convenient for editing by humans. Traditional Unix search tools are still good for attribute searches, though finding record boundaries will be a little more work than in a record-per-line format.

One weakness of RFC 822 format is that when more than one RFC 822 message or record is put in a file, the record boundaries may not be obvious—how is a poor literal-minded computer to know where the unstructured text body of a message ends and the next header begins? Historically, there have been several different conventions for delimiting messages in mailboxes. The oldest and most widely supported, leading each message with a line that begins with the string "From " and sender information, is not appropriate for other kinds of records; it also requires that lines in message text beginning with "From " be escaped (typically with >)—a practice which not infrequently leads to confusion.

Some mail systems use delimiter lines consisting of control characters unlikely to appear in messages, such as several ASCII 01 (control-A) characters in succession. The MIME standard gets around the problem by including an explicit message length in the header, but this is a fragile solution which is very likely to break if messages are ever manually edited. For a somewhat better solution, see the record-jar style described later in this chapter.

For examples of RFC 822 format, look in your mailbox.

5.2.3 Cookie-Jar Format

Cookie-jar format is used by the fortune(1) program for its database of random quotes. It is appropriate for records that are just bags of unstructured text. It simply uses newline followed by %% (or sometimes newline followed by %) as a record separator. Example 5.3 is an example section from a file of email signature quotes:

Example 5.3. A fortune file example.

[Example listing not reproduced here.]

It is good practice to accept whitespace after % when looking for record delimiters. This helps cope with human editing mistakes. It’s even better practice to use %%, and ignore all text from %% to end-of-line.

The cookie-jar separator was originally %%. I wanted something a bit more visible than % would have been. In fact, any stuff after the %% is treated as a comment (or at least that’s how I wrote it).

—Ken Arnold

Simple cookie-jar format is appropriate for pieces of text that have no natural ordering, distinguishable structure above word level, or search keys other than their text context.
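A tolerant reader for this format, following the advice above about trailing whitespace and text after %%, might look like the following Python sketch. The function names and the sample quotes are hypothetical.

```python
def is_delimiter(line):
    """True for '%%' plus any trailing text, or '%' plus optional whitespace."""
    return line.startswith("%%") or line.rstrip() == "%"

def split_cookie_jar(text):
    """Return the list of text records in a cookie-jar file."""
    records, current = [], []
    for line in text.splitlines():
        if is_delimiter(line):
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:                      # last record needs no trailing delimiter
        records.append("\n".join(current))
    return records

# Invented sample; note the comment after %% and the blanks after %.
JAR = (
    "First hypothetical quote,\n"
    "spanning two lines.\n"
    "%% text after the delimiter is ignored\n"
    "Second quote.\n"
    "%  \n"
    "Third quote.\n"
)

for record in split_cookie_jar(JAR):
    print(repr(record))
```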

5.2.4 Record-Jar Format

Cookie-jar record separators combine well with the RFC 822 metaformat for records, yielding a format we’ll call “record-jar”. If you need a textual format that will support multiple records with a variable repertoire of explicit fieldnames, one of the least surprising and human-friendliest ways to do it would look like Example 5.4.

Example 5.4. Basic data for three planets in a record-jar format.

[Example listing not reproduced here.]

Of course, the record delimiter could be a blank line, but a line consisting of “%%” is more explicit and less likely to be introduced by accident during editing (two printable characters are better than one because a pair can’t be generated by a single-character typo). In a format like this it is good practice to simply ignore blank lines.

If your records have an unstructured text part, your record-jar format is closely approaching a mailbox format. In this case, it’s important that you have a well-defined way to escape the record delimiter so it can appear in text; otherwise, your record reader is going to choke on an ill-formed text part someday. Some technique analogous to byte-stuffing (described later in this chapter) is indicated.

Record-jar format is appropriate for sets of field-attribute associations that are like DSV files, but have a variable repertoire of fields, and possibly unstructured text associated with them.
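Putting the two metaformats together takes very little code. The Python sketch below reads records separated by %% lines, with RFC-822-style fields and continuation lines, and ignores blank lines. The function name, field names, and sample data are illustrative assumptions, not the contents of Example 5.4, and the sketch does not handle an unstructured text part.

```python
def parse_record_jar(text):
    """Return a list of dicts, one per %%-delimited record."""
    records, record, last = [], {}, None
    for line in text.splitlines():
        if line.startswith("%%"):            # record delimiter
            if record:
                records.append(record)
            record, last = {}, None
        elif not line.strip():               # blank lines are ignored
            continue
        elif line[0] in " \t" and last is not None:
            record[last] += " " + line.strip()   # continuation line
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError("malformed field line: %r" % line)
            last = name.strip()
            record[last] = value.strip()
    if record:
        records.append(record)
    return records

# Illustrative data; the third record carries a field the others lack.
JAR = """\
Planet: Mercury
Diameter: 4,880 km
%%
Planet: Venus
Diameter: 12,104 km
%%
Planet: Earth
Diameter: 12,756 km
Moons: Luna
"""

planets = parse_record_jar(JAR)
print([p["Planet"] for p in planets])
```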

5.2.5 XML

XML is a very simple syntax resembling HTML—angle-bracketed tags and ampersand-led literal sequences. It is about as simple as a plain-text markup can be and yet express recursively nested data structures. XML is just a low-level syntax; it requires a document type definition (such as XHTML) and associated application logic to give it semantics.

XML is well suited for complex data formats (the sort of things for which the old-school Unix tradition would use an RFC-822-like stanza format) though overkill for simpler ones. It is especially appropriate for formats that have a complex nested or recursive structure of the sort that the RFC 822 metaformat does not handle well. For a good introduction to the format, see XML in a Nutshell [Harold-Means].

Among the hardest things to get right in designing any text file format are issues of quoting, whitespace and other low-level syntax details. Custom file formats often suffer from slightly broken syntax that doesn’t quite match other similar formats. Using a standard format such as XML, which is verifiable and parsed by a standard library, eliminates most of these issues.

—Keith Packard

Example 5.5 is a simple example of an XML-based configuration file. It is part of the kdeprint tool shipped with the open-source KDE office suite hosted under Linux. It describes options for an image-to-PostScript filtering operation, and how to map them into arguments for a filter command. For another instructive example, see the discussion of Glade in Chapter 8.

Example 5.5. An XML example.

[Example listing not reproduced here.]

One advantage of XML is that it is often possible to detect ill-formed, corrupted, or incorrectly generated data through a syntax check, without knowing the semantics of the data.
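A syntax-only check of this kind can be sketched with the ElementTree parser from Python's standard library. The element and attribute names are hypothetical; the point is that the checker needs to know nothing about what they mean.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document):
    """Report whether a string parses as XML, ignoring its semantics."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Hypothetical fragments: the second one never closes its <arg> element.
good = "<filter><arg name='center' default='true'/></filter>"
bad = "<filter><arg name='center' default='true'></filter>"

print(is_well_formed(good), is_well_formed(bad))
```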

The most serious problem with XML is that it doesn’t play well with traditional Unix tools. Software that wants to read an XML format needs an XML parser; this means bulky, complicated programs. Also, XML is itself rather bulky; it can be difficult to see the data amidst all the markup.

One application area in which XML is clearly winning is in markup formats for document files (we’ll have more to say about this in Chapter 18). Tagging in such documents tends to be relatively sparse among large blocks of plain text; thus, traditional Unix tools still work fairly well for simple text searches and transformations.

One interesting bridge between these worlds is PYX format—a line-oriented translation of XML that can be hacked with traditional line-oriented Unix text tools and then losslessly translated back to XML. A Web search for “Pyxie” will turn up resources. The xmltk toolkit takes the opposite tack, providing stream-oriented tools analogous to grep(1) and sort(1) for filtering XML documents; Web search for “xmltk” to find it.

XML can be a simplifying choice or a complicating one. There is a lot of hype surrounding it, but don’t become a fashion victim by either adopting or rejecting it uncritically. Choose carefully and bear the KISS principle in mind.

5.2.6 Windows INI Format

Many Microsoft Windows programs use a textual data format that looks like Example 5.6. This example associates optional resources named account, directory, numeric_id, and developer with named projects python, sng, fetchmail, and py-howto. The DEFAULT entry supplies values that will be used when a named entry fails to supply them.

Example 5.6. A .INI file example.

[Example listing not reproduced here.]

This style of data-file format is not native to Unix, but some Linux programs (notably Samba, the suite of tools for accessing Windows file shares from Linux) support it under Windows’s influence. This format is readable and not badly designed, but like XML it doesn’t play well with grep(1) or conventional Unix scripting tools.

The .INI format is appropriate if your data naturally falls into its two-level organization of name-attribute pairs clustered under named records or sections. It’s not good for data with a fully recursive treelike structure (XML is more appropriate for that), and it would be overkill for a simple list of name-value associations (use DSV format for that).
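Python's standard configparser module reads this style of file, including the fallback behavior of a DEFAULT section. In the sketch below the section and key names follow the description above, but every value is a hypothetical placeholder.

```python
import configparser

# Placeholder values, invented for this sketch.
INI = """\
[DEFAULT]
account = example

[python]
directory = /home/example/python/
developer = 1

[sng]
directory = /home/example/sng/
numeric_id = 1012
"""

parser = configparser.ConfigParser()
parser.read_string(INI)

print(parser.sections())                    # DEFAULT is not listed as a section
print(parser["sng"]["account"])             # falls back to the DEFAULT value
print(parser["python"].get("numeric_id"))   # optional key that is absent
```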

5.2.7 Unix Textual File Format Conventions

There are long-standing Unix traditions about how textual data formats ought to look. Most of these derive from one or more of the standard Unix metaformats we’ve just described. It is wise to follow these conventions unless you have strong and specific reasons to do otherwise.

In Chapter 10 we will discuss a different set of conventions used for program run-control files, but you should notice that it will share some of these same rules (especially about the lexical level, the rules by which characters are assembled into tokens).

One record per newline-terminated line, if possible. This makes it easy to extract records with text-stream tools. For data interchange with other operating systems, it’s wise to make your file-format parser indifferent to whether the line ending is LF or CR-LF. It’s also conventional to ignore trailing whitespace in such formats; this protects against common editor bobbles.

Less than 80 characters per line, if possible. This makes the format browseable in an ordinary-sized terminal window. If many records must be longer than 80 characters, consider a stanza format (see below).

Use # as an introducer for comments. It is good to have a way to embed annotations and comments in data files. It’s best if they’re actually part of the file structure, and so will be preserved by tools that know its format. For comments that are not preserved during parsing, # is the conventional start character.

Support the backslash convention. The least surprising way to support embedding nonprintable control characters is by parsing C-like backslash escapes: \n for a newline, \r for a carriage return, \t for a tab, \b for backspace, \f for formfeed, \e for ASCII escape (27), \nnn or \onnn or \0nnn for the character with octal value nnn, \xnn for the character with hexadecimal value nn, \dnnn for the character with decimal value nnn, \\ for a literal backslash. A newer convention, but one worth following, is the use of \unnnn for a hexadecimal Unicode literal.

In one-record-per-line formats, use colon or any run of whitespace as a field separator. The colon convention seems to have originated with the Unix password file. If your fields must contain instances of the separator(s), use a backslash as the prefix to escape them.

Do not allow the distinction between tab and whitespace to be significant. This is a recipe for serious headaches when the tab settings on your users’ editors are different; more generally, it’s confusing to the eye. Using tab alone as a field separator is especially likely to cause problems; allowing any run of tabs and spaces to be a field separator, on the other hand, works well.

Favor hex over octal. Hex-digit pairs and quads are easier to eyeball-map into bytes and today’s 32- and 64-bit words than octal digits of three bits each; they are also marginally more efficient. This rule needs emphasizing because some older Unix tools such as od(1) violate it; that’s a legacy from the instruction field sizes in the machine languages of older PDP minicomputers.

For complex records, use a ‘stanza’ format: multiple lines per record, with a record separator line of %%\n or %\n. The separators make useful visual boundaries for human beings eyeballing the file.

In stanza formats, either have one record field per line or use a record format resembling RFC 822 electronic-mail headers, with colon-terminated field-name keywords leading fields. The second choice is appropriate when fields are often either absent or longer than 80 characters, or when records are sparse (e.g., often with empty fields).

In stanza formats, support line continuation. When interpreting the file, either discard backslash followed by whitespace or interpret newline followed by whitespace equivalently to a single space, so that a long logical line can be folded into short (easily editable!) physical lines. It’s also conventional to ignore trailing whitespace in these formats; this convention protects against common editor bobbles.

Either include a version number or design the format as self-describing chunks independent of each other. If there is even the faintest possibility that the format will have to be changed or extended, include a version number so your code can conditionally do the right thing on all versions. Alternatively, design the format as self-describing chunks so that you can add new chunk types without instantly breaking old code.

Beware of floating-point round-off problems. Conversion of floating-point numbers from binary to text format and back can lose precision, depending on the quality of the conversion library you are using. If the structure you are marshaling/unmarshaling contains floating point, you should test the conversion in both directions. If it looks like conversion in either direction is subject to roundoff errors, be prepared to dump the floating-point field as raw binary instead, or a string encoding thereof. If you’re coding in C or some language that has access to C printf/scanf, the C99 %a specifier may solve this problem.

Don’t bother compressing or binary-encoding just part of the file. See below...
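Several of the line-level conventions above can be combined in a few lines of code. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, only a small subset of the backslash escapes is handled, and the field separator is assumed to be a colon.

```python
def split_fields(line, sep=":"):
    """Split one record on sep, honoring a subset of backslash escapes."""
    escapes = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Known escapes map to control characters; anything else
            # (including an escaped separator) is taken literally.
            current.append(escapes.get(nxt, nxt))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

def parse_records(text):
    """Yield a list of fields for each non-blank, non-comment line."""
    # splitlines() treats LF and CR-LF (and a few other boundaries) alike.
    for raw in text.splitlines():
        line = raw.rstrip()            # ignore trailing whitespace
        if not line or line.lstrip().startswith("#"):
            continue                   # skip blank lines and # comments
        yield split_fields(line)

sample = "# user table\r\nroot:x:0\r\nweb\\:admin:y:1   \n"
for record in parse_records(sample):
    print(record)
```

A production parser would also need to decide how to treat an escaped trailing space and the numeric escapes, which this sketch leaves out.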

5.2.8 The Pros and Cons of File Compression

Many modern Unix projects, such as OpenOffice.org and AbiWord, now use XML compressed with zip(1) or gzip(1) as a data file format. Compressed XML combines space economy with some of the advantages of a textual format—notably, it avoids the problem that binary formats must often allocate space for information that may not be used in particular cases (e.g., for unusual options or large ranges). But there is some dispute about this, dispute which turns on some of the central tradeoffs discussed in this chapter.

On the one hand, experiments have shown that documents in a compressed XML file are usually significantly smaller than the same documents in Microsoft Word’s native file format, a binary format that one might imagine would take less space. The reason relates to a fundamental of the Unix philosophy: Do one thing well. Creating a single tool to do the compression job well is more effective than ad-hoc compression on parts of the file, because the tool can look across all the data and exploit all repetition in the information.

Also, by separating the representation design from the particular compression method used, you leave open the possibility of using different compression methods in the future with no more than minimal changes to the actual file parsing—perhaps, with no changes at all.

On the other hand, compression does some damage to transparency. While a human being can estimate from context whether uncompressing the file is likely to show him anything useful, tools such as file(1) cannot as of mid-2003 see through the wrapping.

Some would advocate a less structured compression format—straight gzip(1)-compressed XML data, say, without the internal structure and self-identifying header chunk provided by zip(1). While using a format similar to that of zip(1) solves the identification problem, it means that decoding such files will be tricky for programs written in the simpler scripting languages.

Any of these solutions (straight text, straight binary, or compressed text) may be optimal depending on the relative weight you give to storage economy, discoverability, or making browsing tools as simple as possible to write. The point of the preceding discussion is not to advocate any one of these approaches over the others, but rather to suggest how you can think about the options and design tradeoffs clearly.

This having been said, the truly Unixy solution would probably be to fix file(1) to see file prefixes through the compression—and, failing that, to write a shellscript wrapper around file(1) that would interpret compression as a direction to apply gunzip(1) and take a second look.
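The “take a second look” idea can be sketched without invoking file(1) at all. The code below is an assumption-laden illustration in Python rather than a shell script: the function name and the coarse type labels are hypothetical, and it recognizes only the gzip magic number and an XML declaration.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream

def sniff(data, peek=64):
    """Guess a coarse content type from the leading bytes of data."""
    if data.startswith(GZIP_MAGIC):
        try:
            # wbits = MAX_WBITS | 16 selects gzip framing; decompress
            # only a short prefix, which is enough for a second look.
            unpacker = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
            inner = unpacker.decompress(data, peek)
        except zlib.error:
            return "gzip-compressed data (unreadable)"
        return "gzip-compressed " + sniff(inner, peek)
    if data.lstrip().startswith(b"<?xml"):
        return "XML"
    return "data"

packed = gzip.compress(b'<?xml version="1.0"?><doc/>')
print(sniff(packed))
```

Because the wrapper recurses on the decompressed prefix, it would also see through doubly compressed files.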

5.3 Application Protocol Design

In Chapter 7, we’ll discuss the advantages of breaking complicated applications up into cooperating processes speaking an application-specific command set or protocol with each other. All the good reasons for data file formats to be textual apply to these application-specific protocols as well.

When your application protocol is textual and easily parsed by eyeball, many good things become easier. Transaction dumps become much easier to interpret. Test loads become easier to write.

Server processes are often invoked by harness programs such as inetd(8) in such a way that the server sees commands on standard input and ships responses to standard output. We describe this “CLI server” pattern in more detail in Chapter 11.

A CLI server with a command set that is designed for simplicity has the valuable property that a human tester will be able to type commands direct to the server process to probe the software’s behavior.

Another issue to bear in mind is the end-to-end design principle. Every protocol designer should read the classic End-to-End Arguments in System Design [Saltzer]. There are often serious questions about which level of the protocol stack should handle features like security and authentication; this paper provides some good conceptual tools for thinking about them. Yet a third issue is designing application protocols for good performance. We’ll cover that issue in more detail in Chapter 12.

The traditions of Internet application protocol design evolved separately from Unix before 1980 [4]. But since the 1980s these traditions have become thoroughly naturalized into Unix practice.

[4] One relic of this pre-Unix history is that Internet protocols normally use CR-LF as a line terminator rather than Unix’s bare LF.

We’ll illustrate the Internet style by looking at three application protocols that are both among the most heavily used, and are widely regarded among Internet hackers as paradigmatic: SMTP, POP3, and IMAP. All three address different aspects of mail transport (one of the net’s two most important applications, along with the World Wide Web), but the problems they address (passing messages, setting remote state, indicating error conditions) are generic to non-email application protocols as well and are normally addressed using similar techniques.

5.3.1 Case Study: SMTP, a Simple Socket Protocol

Example 5.7 is an example transaction in SMTP (Simple Mail Transfer Protocol), which is described by RFC 2821. In the example, C: lines are sent by a mail transport agent (MTA) sending mail, and S: lines are returned by the MTA receiving it. Text emphasized like this is comments, not part of the actual transaction.

Example 5.7. An SMTP session example.

This is how mail is passed among Internet machines. Note the following features: the command-argument format of the requests, responses consisting of a status code followed by an informational message, and the fact that the payload of the DATA command is terminated by a line consisting of a single dot.

SMTP is one of the two or three oldest application protocols still in use on the Internet. It is simple, effective, and has withstood the test of time. The traits we have called out here are tropes that recur frequently in other Internet protocols. If there is any single archetype of what a well-designed Internet application protocol looks like, SMTP is it.
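The “status code followed by an informational message” shape of SMTP responses is easy to parse mechanically. Here is a minimal sketch; the function name is hypothetical, and it relies on the SMTP convention that a hyphen after the three-digit code marks a continuation line of a multiline reply while a space marks the final line.

```python
def parse_reply_line(line):
    """Split an SMTP-style reply line into (code, is_final, text)."""
    line = line.rstrip("\r\n")
    code, marker, text = line[:3], line[3:4], line[4:]
    digits_ok = len(code) == 3 and all(c in "0123456789" for c in code)
    if not digits_ok or marker not in ("", " ", "-"):
        raise ValueError("malformed reply line: %r" % line)
    # A hyphen after the code means more lines follow; a space (or
    # nothing at all) marks the last line of the reply.
    return int(code), marker != "-", text

print(parse_reply_line("250 OK\r\n"))
print(parse_reply_line("250-mail.example.org greets you\r\n"))
```

A program can act on the numeric code alone, leaving the trailing text for human readers of a transaction dump.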

5.3.2 Case Study: POP3, the Post Office Protocol

Another one of the classic Internet protocols is POP3, the Post Office Protocol. It is also used for mail transport, but where SMTP is a ‘push’ protocol with transactions initiated by the mail sender, POP3 is a ‘pull’ protocol with transactions initiated by the mail receiver. Internet users with intermittent access (like dial-up connections) can let their mail pile up on a mail-drop machine, then use a POP3 connection to pull mail up the wire to their personal machines.

Example 5.8 is an example POP3 session. In the example, C: lines are sent by the client, and S: lines by the mail server. Observe the many similarities with SMTP. This protocol is also textual and line-oriented, sends payload message sections terminated by a line consisting of a single dot followed by a line terminator, and even uses the same exit command, QUIT. As in SMTP, each client operation is acknowledged by a reply line that begins with a status code and includes an informational message meant for human eyes.

Example 5.8. A POP3 example session.

There are a few differences. The most obvious one is that POP3 uses status tokens rather than SMTP’s 3-digit status codes. Of course the requests have different semantics. But the family resemblance (one we’ll have more to say about when we discuss the generic Internet metaprotocol later in this chapter) is clear.
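To make the contrast with SMTP's numeric codes concrete, here is a small sketch of parsing a POP3 status line. The function name is hypothetical; the two status tokens, +OK and -ERR, are the ones POP3 defines.

```python
def parse_pop3_status(line):
    """Return (ok, message) for a POP3 status line."""
    token, _, message = line.rstrip("\r\n").partition(" ")
    if token == "+OK":
        return True, message
    if token == "-ERR":
        return False, message
    raise ValueError("not a POP3 status line: %r" % line)

print(parse_pop3_status("+OK 2 messages\r\n"))
print(parse_pop3_status("-ERR no such message\r\n"))
```

The structure is the same as in SMTP (a machine-readable status first, human-readable text after it); only the vocabulary of statuses is smaller.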

5.3.3 Case Study: IMAP, the Internet Message Access Protocol

To complete our triptych of Internet application protocol examples, we’ll look at IMAP, another post office protocol designed in a slightly different style. See Example 5.9; as before, C: lines are sent by the client, and S: lines by the mail server. Text emphasized like this is comments, not part of the actual transaction.

Example 5.9. An IMAP session example.

IMAP delimits payloads in a slightly different way. Instead of ending the payload with a dot, the payload length is sent just before it. This increases the burden on the server a little bit (messages have to be composed ahead of time, they can’t just be streamed up after the send initiation) but makes life easier for the client, which can tell in advance how much storage it will need to allocate to buffer the message for processing as a whole.

Also, notice that each response is tagged with a sequence label supplied by the request; in this example they have the form A000n, but the client could have generated any token into that slot. This feature makes it possible for IMAP commands to be streamed to the server without waiting for the responses; a state machine in the client can then simply interpret the responses and payloads as they come back. This technique cuts down on latency.
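The tagging and length-prefix ideas can be sketched as follows. This is an illustration under stated assumptions, not an IMAP client: the class and function names are hypothetical, the tag format simply imitates the A000n labels mentioned above, and the length prefix is assumed to use IMAP's {n} literal notation at the end of a line.

```python
import itertools
import re

class TagTracker:
    """Match tagged responses to the commands that produced them."""

    def __init__(self, prefix="A"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self.pending = {}              # tag -> command awaiting a response

    def send(self, command):
        """Return the wire form of command and remember its tag."""
        tag = "%s%04d" % (self._prefix, next(self._counter))
        self.pending[tag] = command
        return "%s %s\r\n" % (tag, command)

    def receive(self, line):
        """Return (command, rest); command is None for untagged lines."""
        tag, _, rest = line.rstrip("\r\n").partition(" ")
        if tag in self.pending:
            return self.pending.pop(tag), rest
        return None, rest              # untagged ("*") or unknown tag

def literal_length(line):
    """Return n if line ends with a {n} length prefix, else None."""
    match = re.search(r"\{(\d+)\}$", line.rstrip("\r\n"))
    return int(match.group(1)) if match else None

tracker = TagTracker()
tracker.send("NOOP")                   # both commands are sent before
tracker.send("SELECT INBOX")           # any response has been read
print(tracker.receive("A0002 OK done\r\n"))
print(literal_length("* 1 FETCH (BODY[] {342}\r\n"))
```

Because responses carry their own tags, the client does not need to assume they arrive in the order the commands were sent.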

IMAP (which was designed to replace POP3) is an excellent example of a mature and powerful Internet application protocol design, one well worth study and emulation.

5.4 Application Protocol Metaformats

Just as data file metaformats have evolved to simplify serialization for storage, application protocol metaformats have evolved to simplify serialization for transactions across networks. The tradeoffs are a little different in this case; because network bandwidth is more expensive than storage, there is more of a premium on transaction economy. Still, the transparency and interoperability benefits of textual formats are sufficiently strong that most designers have resisted the temptation to optimize for performance at the cost of readability.

5.4.1 The Classical Internet Application Metaprotocol

Marshall Rose’s RFC 3117, On the Design of Application Protocols [5], provides an excellent overview of the design issues in Internet application protocols. It makes explicit several of the tropes in classical Internet application protocols that we observed in our examination of SMTP, POP, and IMAP, and provides an instructive taxonomy of such protocols. It is recommended reading.

[5] See RFC 3117 <ftp://ftp.rfc-editor.org/in-notes/rfc3117.txt>.

The classical Internet metaprotocol is textual. It uses single-line requests and responses, except for payloads which may be multiline. Payloads are shipped either with a preceding length in octets or with a terminator that is the line ".\r\n". In the latter case the payload is byte-stuffed; all lines that start with a period get another period prepended, and the receiver side is responsible for both recognizing the termination and stripping away the stuffing. Response lines consist of a status code followed by a human-readable message.
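The byte-stuffing rule is short enough to show in full. The following sketch uses hypothetical function names and works on text lines rather than raw octets, which is a simplification.

```python
def stuff(payload_lines):
    """Encode lines (without terminators) as a dot-terminated payload."""
    out = []
    for line in payload_lines:
        if line.startswith("."):
            line = "." + line          # sender prepends an extra period
        out.append(line + "\r\n")
    out.append(".\r\n")                # terminator: a lone dot
    return "".join(out)

def unstuff(wire_text):
    """Decode a dot-terminated payload back into its original lines."""
    lines = []
    for line in wire_text.split("\r\n"):
        if line == ".":
            return lines               # terminator recognized
        if line.startswith("."):
            line = line[1:]            # receiver strips the stuffing
        lines.append(line)
    raise ValueError("payload terminator not found")

message = ["Hello", ".hidden", ".", "bye"]
wire = stuff(message)
print(repr(wire))
print(unstuff(wire) == message)
```

A payload line consisting of a single dot goes over the wire as two dots, so it can never be mistaken for the terminator.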

One final advantage of this classical style is that it is readily extensible. The parsing and state-machine framework doesn’t need to change much to accommodate new requests, and it is easy to code implementations so that they can parse unknown requests and return an error or simply ignore them. SMTP, POP3, and IMAP have all been extended in minor ways fairly often during their lifetimes, with minimal interoperability problems. Naïvely designed binary protocols are, by contrast, notoriously brittle.

5.4.2 HTTP as a Universal Application Protocol

Ever since the World Wide Web reached critical mass around 1993, application protocol designers have shown an increasing tendency to layer their special-purpose protocols on top of HTTP, using web servers as generic service platforms.

This is a viable option because, at the transaction layer, HTTP is very simple and general. An HTTP request is a message in an RFC-822/MIME-like format; typically, the headers contain identification and authentication information, and the first line is a method call on some resource specified by a Uniform Resource Identifier (URI). The most important methods are GET (fetch the resource), PUT (modify the resource) and POST (ship data to a form or back-end process). The most important form of URI is a URL or Uniform Resource Locator, which identifies the resource by service type, host name, and a location on the host. An HTTP response is simply an RFC-822/MIME message and can contain arbitrary content to be interpreted by the client.
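Because the header block is RFC-822-like, a general mail-header parser can read it. The sketch below builds a minimal request and parses it back with Python's standard email.parser module; the helper names, host, and path are made up for illustration.

```python
from email.parser import Parser

def build_request(method, path, host, headers=None):
    """Return the text of a minimal HTTP/1.1 request with no body."""
    lines = ["%s %s HTTP/1.1" % (method, path), "Host: %s" % host]
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    # A blank line ends the header block.
    return "\r\n".join(lines) + "\r\n\r\n"

def parse_head(head):
    """Split a message head into its first-line parts and its headers."""
    first, _, rest = head.partition("\r\n")
    # Everything after the first line is in RFC-822 header format.
    return first.split(" ", 2), Parser().parsestr(rest, headersonly=True)

request = build_request("GET", "/index.html", "www.example.org",
                        {"Accept": "text/html"})
start, headers = parse_head(request)
print(start)
print(headers["Host"])
```

The first line carries the method call; everything else is ordinary name-value headers, looked up case-insensitively.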

Web servers handle the transport and request-multiplexing layers of HTTP, as well as standard service types like http and ftp. It is relatively easy to write web server plugins that will handle custom service types, and to dispatch on other elements of the URI format.

Besides avoiding a lot of lower-level details, this method means the application protocol will tunnel through the standard HTTP service port and not need a TCP/IP service port of its own. This can be a distinct advantage; most firewalls leave port 80 open, but trying to punch another hole through can be fraught with both technical and political difficulties.

With this advantage comes a risk. It means that your web server and its plugins grow more complex, and cracks in any of that code can have large security implications. It may become more difficult to isolate and shut down problem services. The usual tradeoffs between security and convenience apply.

RFC 3205, On the Use of HTTP As a Substrate [6], has good design advice for anyone considering using HTTP as the underlayer of an application protocol, including a summary of the tradeoffs and problems involved.

[6] See RFC 3205 <http://www.faqs.org/rfcs/rfc3205.html>.

5.4.2.1 Case Study: The CDDB/freedb.org Database

Audio CDs consist of a sequence of music tracks in a digital format called CDDA-WAV. They were designed to be played by very simple consumer-electronics devices a few years before general-purpose computers developed enough raw speed and sound capability to decode them on the fly. Because of this, there is no provision in the format for even simple metainformation such as the album and track titles. But modern computer-hosted CD players want this information so the user can assemble and edit play lists.

Enter the Internet. There are at least two repositories that provide a mapping between a hash code computed from the track-length table on a CD and artist/album-title/track-title records. The original was cddb.org, but another site called freedb.org is probably now more complete and widely used. Both sites rely on their users for the enormous task of keeping the database current as new CDs come out; freedb.org arose from a developer revolt after CDDB elected to take all that user-contributed information proprietary.

Queries to these services could have been implemented as a custom application protocol on top of TCP/IP, but that would have required steps such as getting a new TCP/IP port number assigned and fighting to get a hole for it punched through thousands of firewalls. Instead, the service is implemented over HTTP as a simple CGI query (as if the CD’s hash code had been supplied by a user filling in a Web form).

This choice makes all the existing infrastructure of HTTP and Web-access libraries in various programming languages available to support programs for querying and updating this database. As a result, adding such support to a software CD player is nearly trivial, and effectively every software CD player knows how to use them.
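To suggest how little code such a query needs, here is a sketch that builds a CGI-style query URL with Python's standard urllib.parse module. Everything specific in it is hypothetical: the endpoint, the parameter names, and the hash value are placeholders, not the actual query format of either service.

```python
from urllib.parse import urlencode

def build_query_url(base_url, disc_hash, track_count):
    """Return a GET URL carrying the lookup key as form-style fields."""
    # Parameter names here are invented for illustration only.
    query = urlencode({"hash": disc_hash, "tracks": track_count})
    return base_url + "?" + query

url = build_query_url("http://cddb.example.org/cgi-bin/lookup",
                      "940aac0d", 13)
print(url)
```

The resulting URL could be fetched with any HTTP library, exactly as if a user had submitted a Web form.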

5.4.2.2 Case Study: Internet Printing Protocol

Internet Printing Protocol (IPP) is a successful, widely implemented standard for the control of network-accessible printers. Pointers to RFCs, implementations, and much other related material are available at the IETF’s Printer Working Group <http://www.pwg.org/ipp/> site.

IPP uses HTTP 1.1 as a transport layer. All IPP requests are passed via an HTTP POST method call; responses are ordinary HTTP responses. (Section 4.2 of RFC 2568, Rationale for the Structure of the Model and Protocol for the Internet Printing Protocol, does an excellent job of explaining this choice; it repays study by anyone considering writing a new application protocol.)

From the software side, HTTP 1.1 is widely deployed. It already solves many of the transport-level problems that would otherwise distract protocol developers and implementers from concentrating on the domain semantics of printing. It is cleanly extensible, so there is room for IPP to grow. The CGI programming model for handling the POST requests is well understood and development tools are widely available.

Most network-aware printers already embed a web server, because that’s the natural way to make the status of the printer remotely queryable by human beings. Thus, the incremental cost of adding IPP service to the printer firmware is not large. (This is an argument that could be applied to a remarkably wide range of other network-aware hardware, including vending machines and coffee makers [7] and hot tubs!)

[7] See RFC 2324 <http://www.ietf.org/rfc/rfc2324.txt> and RFC 2325 <http://www.ietf.org/rfc/rfc2325.txt>.

About the only serious drawback of layering IPP over HTTP is that the protocol is completely driven by client requests. Thus there is no space in the model for printers to ship asynchronous alert messages back to clients. (However, smarter clients could run a trivial HTTP server to receive such alerts formatted as HTTP requests from the printer.)

5.4.3 BEEP: Blocks Extensible Exchange Protocol

BEEP (formerly BXXP) is a generic protocol machine that competes with HTTP for the role of universal underlayer for application protocols. There is a niche open because there is not as yet any other more established metaprotocol that is appropriate for truly peer-to-peer applications, as opposed to the client-server applications that HTTP handles well. A project website <http://www.beepcore.org/beepcore/docs/sl-beep.jsp> provides access to standards and open-source implementations in several languages.

BEEP has features to support both client-server and peer-to-peer modes. The authors designed the BEEP protocol and support library so that picking the right options abstracts away messy issues like data encoding, flow control, congestion-handling, support of end-to-end encryption, and assembling a large response composed of multiple transmissions.

Internally, BEEP peers exchange sequences of self-describing binary packets not unlike chunk types in PNG. The design is tuned more for economy and less for transparency than the classical Internet protocols or HTTP, and might be a better choice when data volumes are large. BEEP also avoids the HTTP problem that all requests have to be client-initiated; it would be better in situations in which a server needs to send asynchronous status messages back to the client.

BEEP is still new technology in mid-2003, and has only a few demonstration projects. But the BEEP papers are good analytical surveys of best practice in protocol design; even if BEEP itself fails to gain widespread adoption, the papers will retain considerable tutorial value.

5.4.4 XML-RPC, SOAP, and Jabber

There is a developing trend in application protocol design toward using XML within MIME to structure requests and payloads. BEEP peers use this format for channel negotiations. Three major protocols are going the XML route throughout: XML-RPC and SOAP (Simple Object Access Protocol) for remote procedure calls, and Jabber for instant messaging and presence. All three are XML document types.

XML-RPC is very much in the Unix spirit (its author observes that he learned how to program in the 1970s by reading the original source code for Unix). It’s deliberately minimalist but nevertheless quite powerful, offering a way for the vast majority of RPC applications that can get by on passing around scalar boolean/integer/float/string datatypes to do their thing in a way that is lightweight and easy to understand and monitor. XML-RPC’s type ontology is richer than that of a text stream, but still simple and portable enough to act as a valuable check on interface complexity. An excellent XML-RPC home page <http://www.xmlrpc.com/> points to specifications and multiple open-source implementations.
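The scalar type ontology is easy to see by marshaling a call and reading the result. This sketch uses Python's standard xmlrpc.client module; the wrapper names and the method name example.echo are hypothetical, and no server is contacted.

```python
import xmlrpc.client

def marshal_call(method, *params):
    """Serialize a method call to an XML-RPC request document."""
    return xmlrpc.client.dumps(params, methodname=method)

def unmarshal_call(request_xml):
    """Recover (method, params) from an XML-RPC request document."""
    params, method = xmlrpc.client.loads(request_xml)
    return method, params

request_xml = marshal_call("example.echo", 2, 3.5, "spam", True)
print(request_xml)                     # readable XML, easy to monitor
print(unmarshal_call(request_xml))
```

Each parameter travels as a typed XML element, so a dump of the traffic shows both the values and their types.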

SOAP is a more heavyweight RPC protocol with a richer type ontology that includes arrays and C-like structs. It was inspired by XML-RPC, but has been plausibly accused of being an overdesigned victim of the second-system effect. As of mid-2003 the SOAP standard is still a work in progress, but a trial implementation in Apache is tracking the drafts. Open-source client modules in Perl, Python, Tcl, and Java are readily discoverable by a Web search. The W3C draft specification is available on the Web <http://www.w3.org/TR/SOAP/>.

XML-RPC and SOAP, considered as remote procedure call methods, have some associated risks that we discuss at the end of Chapter 7.

Jabber is a peer-to-peer protocol designed to support instant messaging and presence. What makes it interesting as an application protocol is that it supports passing around XML forms and live documents. Specifications, documentation, and open-source implementations are available at the Jabber Software Foundation <http://www.jabber.org/about/overview.html> site.
