CHAPTER 2 LITERATURE AND THE WEB 

The subject of this book, the web, is a young teenager. It’s hard to believe it was in utero in 1990, emerged from the cradle with the introduction of the Netscape browser five years later, and only became widely known to the outside world around the turn of the millennium. Teenager, a juvenile between the onset of puberty and maturity, is an apt metaphor: agile, bumptious, cool, disrespectful, energetic, fun, gangly, hilarious, inane, jaunty, keen, loyal, noisy, obstreperous, promiscuous, quirky—the web is all these and more.

Today, this teenager has all but usurped our literature. The web is where people go to find things out. And while it is often denigrated as a damaged and unreliable source of questionable information, its aspiration knows no bounds: it is now threatening to take over our libraries. In future chapters, we will dissect the web, measure its size, probe its flaws and failings, review its past, and assess its future. But first, this chapter takes a look at the kind of literature that is stored in libraries: how much there is, how it has grown, how it is organized … and what is happening to it now. To understand a revolution, we must place it in the context of what has gone before.

You probably think of libraries as bastions of conservatism, institutions that have always been there and always will be, musty and dank, unlit by the beacons of progress in our society. But you would be wrong: they have changed greatly over the centuries. They have had their own revolutions, seen turmoil and fire, even rape and plunder. And there has been no time of greater uncertainty than the present. For librarians are without doubt among those whose working lives are most radically affected by the tremendous explosion of networked information sparked by the Internet. For them, advances in information technology generate shockwave after shockwave of opportunities and problems that must seem like a fierce, and sustained, onslaught on their own self-image.

How would you react if a technology juggernaut like Google suddenly declared that its mission was something that you thought society had entrusted to you alone, which you had been doing well for centuries: “to organize the world’s information and make it universally accessible and useful”? How do librarians feel when people turn from their doors and instead visit web bookstores to check their references, to ferret out related books, or to see what others recommend—even to check out facts in standard reference works? Like everyone else, they are acutely aware that the world is changing, and they know their job descriptions and institutions must follow suit. Their clients now work online, and part of the contemporary librarian’s job is to create websites, provide web interfaces to the catalogue, link to full text where available, and tend the e-mail help desk.

This is not normal evolution, it’s violent revolution. The bumptious teenager is really throwing his weight around. Now they’re digitizing libraries and chucking out the books!—not (usually) into the garbage; they’re going into remote storage depots where they will never again see the light of day nor illuminate the minds of readers who have been used to browsing library shelves. Massive digitization programs are underway that will change forever the way in which people find and use information. The events of the last quarter-century have shaken our confidence in the continued existence of the traditional library. Of course, there is a backlash: defensive tracts with titles like Future Libraries: Dreams, Madness and Reality deride “technolust” and the empty promises of early technophiles. But it is a plain fact that the web is placing libraries as we know them in serious jeopardy.

THE CHANGING FACE OF LIBRARIES

Libraries are society’s repositories for knowledge—temples, if you like, of culture and wisdom. Born in an era where agriculture was humankind’s greatest preoccupation, they experienced a resurgence with the invention of printing in the Renaissance, and really began to flourish when the industrial revolution prompted a series of inventions—such as the steam press—that mechanized the printing process.

Although libraries have been around for more than twenty-five centuries, only one has survived more than five or six centuries, and most are far younger. The exception is a collection of two thousand engraved stone slabs or “steles” situated in Xi’an, an ancient walled city in central China. This collection was established in the Song dynasty (ca. A.D. 1100) and has been gradually expanded with new work ever since. Each stele stands 2 or 3 meters high and is engraved with a poem, story, or historical record (see Figure 2.1). Confucius’s works are here, as is much classic poetry, and an account of how a Christian sect spread eastward to China along the Silk Road. Chinese writing is an art form, and this library gathers together the work of outstanding calligraphers over a period of two millennia. It contains the weightiest books in the world!


Figure 2.1 Rubbing from a stele in Xi’an.

We think of the library as the epitome of a stable, solid, unchanging institution, and indeed the silent looming presence of two thousand enormous stone slabs—tourist guidebooks call it the “forest of steles”—certainly projects an air of permanence. But this is an exception. Over the years, libraries have evolved beyond recognition. Originally intended for storage and preservation, they have refocused to place users and their information needs at center stage.

Ancient libraries were only useful to the tiny minority of people who could read. Even then access was strictly controlled. Medieval monastic and university libraries held chained copies of books in public reading areas. Other copies were available for loan, although substantial security was demanded for each volume borrowed.

The public library movement took hold in the nineteenth century. Libraries of the day had bookstacks that were closed to the public. Patrons perused the catalog, chose their books, and received them over the counter. In continental Europe, most libraries continue to operate this way; so do many research libraries in other countries. However, progressive twentieth-century librarians came to realize the advantage of allowing readers to browse the shelves and make their own selection. The idea of open-shelf libraries gained wide acceptance in English-speaking countries, marking the fulfillment of the principle of free access to libraries by all—the symbolic snapping of the links of the chained book.

Today we stand on the threshold of the digital library. The information revolution not only supplies the technological horsepower that drives digital libraries, it fuels an unprecedented demand for storing, organizing, and accessing information—a demand that is, for better or worse, driven by economics rather than, as in days gone by, by curiosity. If information is the currency of the knowledge economy, digital libraries are the banks where it is invested. Indeed, Goethe once remarked that visiting a library was like entering the presence of great wealth that was silently paying untold dividends.

BEGINNINGS

The fabled library of Alexandria, which Ptolemy I created around 300 B.C., is widely recognized as the first great library. There were precedents, though. Some 350 years earlier the King of Assyria had established a comprehensive, well-organized collection of tens of thousands of clay tablets, and long before that Chinese written records began, extending at least as far back as the eighteenth century B.C.

The Alexandrian Library grew at a phenomenal rate and, according to legend, contained 200,000 volumes within ten years. The work of the acquisitions department was rather more dramatic than in today’s libraries. During a famine, the king refused to sell grain to the Athenians unless he received in pledge the original manuscripts of leading authors. The manuscripts were diligently copied and the copies returned to the owners, while the originals went into the library. By far the largest single acquisition occurred when Mark Antony stole the rival library of Pergamum and gave it lock, stock, and barrel—200,000 volumes—to Cleopatra as a love token. She passed it over to Alexandria for safe keeping.

By the time Julius Caesar fired the harbor of Alexandria in 47 B.C., the library had grown to 700,000 volumes. More than 2000 years would pass before another library attained this size, notwithstanding technological innovations such as the printing press. Tragically, the Alexandrian Library was destroyed. Much remained after Caesar’s fire, but this was willfully laid waste (according to the Moslems) by Christians in A.D. 391 or (according to the Christians) by Moslems in A.D. 641. In the Arab conquest, Amru, the captain of Caliph Omar’s army, would have been willing to spare the library, but the fanatical Omar is said to have disposed of the problem of the information explosion with the tragic words, “If these writings of the Greeks agree with the Koran, they are useless and need not be preserved; if they disagree, they are pernicious and ought to be destroyed.”

THE INFORMATION EXPLOSION

Moving ahead a thousand years, let us peek into the library of a major university near the center of European civilization a century or two after Gutenberg’s introduction of the movable-type printing press around 1450.4 Trinity College, Dublin, one of the oldest universities in Western Europe, was founded in 1592 by Queen Elizabeth I. In 1600 the library contained a meager collection of 30 printed books and 10 handwritten manuscripts. This grew rapidly, by several thousand, when two of the Fellows mounted a shopping expedition to England, and by a further 10,000 when the library received the personal collection of Archbishop Ussher, a renowned Irish man of letters, on his death in 1661.

There were no journals in Ussher’s collection. The first scholarly journals appeared just after his death: the Journal des Sçavans began in January 1665 in France, and the Philosophical Transactions of the Royal Society began in March 1665 in England. These two have grown, hydra-like, into hundreds of thousands of scientific journals today—although many are being threatened with replacement by electronic archives.

Trinity’s library was dwarfed by the private collection of Duke August of Wolfenbüttel, Germany, which reached 135,000 works by his death in 1666. It was the largest contemporary library in Europe, acclaimed as the eighth wonder of the world. The pages, or “imprints,” were purchased in quires (i.e., unbound) and shipped in barrels to the duke, who had them bound into 31,000 volumes with pale parchment bindings that you can still see today. Incidentally, Casanova, after visiting the library for a week in 1764, declared that “I have sometimes thought that the life of those in heaven may be somewhat similar to [this visit]”—high praise indeed from the world’s most renowned lover!

But Trinity College was to have a windfall that would vault its library far ahead of Wolfenbüttel’s. In 1801, the British Parliament passed an act decreeing that a copy of every book printed in the British Isles was to be donated to the College. The privilege extends to this day, and is shared by five other libraries—the British National Library, the University Libraries of Oxford and Cambridge, and the National Libraries of Scotland and Wales. This “legal deposit” law had a precedent in France, where King François I decreed in 1537 that a copy of every book published was to be placed in the Bibliothèque du Roi (long since incorporated into the French National Library). Likewise, the Library of Congress receives copies of all books published in the United States.

In the eighteenth century, the technology of printing really took hold. For example, more than 30,000 titles were published in France during a sixty-year period in the mid-1700s. Ironically, the printing press that Gutenberg had developed in order to make the Bible more widely available became the vehicle for disseminating the European Enlightenment—the emancipation of human thinking from the weight of authority of the Church that we mentioned in Chapter 1—some 300 years later.

Across the Atlantic, the United States was a late starter. President John Adams created a reference library for Congress when the seat of government was moved to the new capital city of Washington in 1800, the year before the British Parliament belatedly passed their legal deposit law. He began by providing $5000 “for the purchase of such books as may be necessary for the use of Congress—and for putting up a suitable apartment for containing them therein.” The first books were ordered from England and shipped across the Atlantic in eleven trunks and a map case.

The library was housed in the new Capitol until August 1814, when—in a miniature replay of Julius Caesar’s exploits in Alexandria—British troops invaded Washington and burned the building. The small congressional collection of 3000 volumes was lost in the fire, and the library began afresh with the purchase of Jefferson’s personal library. Another fire destroyed two-thirds of the collection in 1851. Unlike Alexandria, however, the Library of Congress has regrown—indeed, its rotunda is a copy of the one built in Wolfenbüttel two centuries earlier. Today it’s the largest library in the world, with 128 million items on 530 miles of bookshelves. Its collection includes 30 million books and other printed material, 60 million manuscripts, 12 million photographs, 5 million maps, and 3 million sound recordings.

Back in Trinity College, the information explosion began to hit home in the middle of the nineteenth century. Work started in 1835 on the production of a printed library catalog, but by 1851 only the first volume, covering the letters A and B, had been completed. The catalog was finally finished in 1887, but only by restricting the books that appeared in it to those published up to the end of 1872. Other libraries were wrestling with far larger volumes of information. By the turn of the century, Trinity had a quarter of a million books, while the Library of Congress had nearly three times as many. Both were dwarfed by the British Museum (now part of the British National Library), which at the time had nearly 2 million books, and the French National Library in Paris with more than 2.5 million.


Figure 2.2 A page from the original Trinity College Library catalog.

THE ALEXANDRIAN PRINCIPLE: ITS RISE, FALL, AND REBIRTH

An Alexandrian librarian was reported as being “anxious to collect, if he could, all the books in the inhabited world, and, if he heard of, or saw, any book worthy of study, he would buy it.” Two millennia later, this early statement of library policy was formulated as a self-evident principle of librarianship: It is a librarian’s duty to increase the stock of his library. When asked how large a library should be, librarians answered “bigger. And with provision for further expansion.”

Only recently did the Alexandrian principle come under question. In 1974, commenting on a ten-year building boom unprecedented in library history, the Encyclopedia Britannica observed that “even the largest national libraries are doubling in size every 16 to 20 years,” and warned that such an increase can hardly be supported indefinitely. In the twentieth century’s closing decade, the national libraries of the United Kingdom, France (see Figure 2.3), Germany, and Denmark all opened new buildings. The first two are monumental in scale: each is its country’s largest public building of the century. Standing on the bank of the Seine, the Bibliothèque Nationale de France consists of four huge towers that appear like open books, surrounding a sunken garden plaza. The reading rooms occupy two levels around the garden, with bookshelves encircling them on the outer side.


Figure 2.3 The Bibliothèque Nationale de France; Dominique Perrault, architect.

Copyright © Alain Goustard, photographer.

Sustained exponential growth cannot continue, at least not in the physical world. A collection of essays published in 1976 entitled Farewell to Alexandria: Solutions to Space, Growth, and Performance Problems of Libraries dwells on the issues that arise when growth must be abandoned in favor of a steady state. Sheer physical space has forced librarians to rethink their principles. Now they talk about weeding and culling, no-growth libraries, the optimum size for collections, and even dare to ask “could smaller be better?” In a striking example of aggressive weeding, the library world was rocked in 1996 when it learned that the San Francisco Public Library had surreptitiously dumped 200,000 books, or 20 percent of its collection, into landfills, because its new building, though lavishly praised by architecture critics, was too small for all the books. Of course, the role of a public library is to serve the needs of the local community, whereas national libraries have a commitment to preserve the nation’s heritage.

Over the last fifty years, the notion of focused collections designed to fulfill the needs of the reader has gradually replaced the Alexandrian ideal of a library that is vast and eternally growing. The dream of a repository for all the world’s knowledge gave way to the notion of service to library users. Libraries that had outgrown their physical boundaries in the city center or university campus retired their books, or (more commonly) moved them to vast storehouses out of town, where they could be kept far more cheaply and still ordered by readers on request.

But everything changed at the start of the millennium. Our brash teenager, the World Wide Web, was born into a family where sustained exponential growth was the norm and had been for years. Gordon Moore, co-founder of Intel, observed in 1965 that the number of transistors that could be squeezed onto an electronic chip doubled every two years, a phenomenon that continues even today—only better, doubling every eighteen months or so. The term Moore’s law is commonly used to refer to the rapidly continuing advance in computing power per unit cost. Similar advances occur in disk storage—in fact, progress over the past decade has outpaced semiconductors: the cost of storage halves, and capacity doubles, roughly every twelve months. That Alexandrian librarian has come home at last: now he can have his heart’s desire. There are no physical limitations any more. Today, all the text in your local library might fit on your teenage child’s handheld digital music player.5

THE BEAUTY OF BOOKS

What will become of books in the brave new web world? Bibliophiles love books as much for the statements they make as objects as for the statements they contain as text. Early books were works of art. The steles in Xi’an are a monumental example. They are studied as much for their calligraphic beauty as for the philosophy, poetry and history that they record. They are a permanent record of earlier civilizations. The Book of Kells in Trinity College Library, laboriously lettered by Irish monks at the scriptorium of Iona about 1200 years ago, is one of the masterpieces of Western art; a thirteenth-century scholar fancifully enthused that “you might believe it was the work of an angel rather than a human being.” Figure 2.4 shows part of a page from the Book of Kells that exhibits an extraordinary array of pictures, interlaced shapes and ornamental details.


Figure 2.4 Part of a page from the Book of Kells.

Beautiful books have always been prized for their splendid illustrations, for colored impressions, for beautifully decorated illuminated letters, for being printed on uncommon paper (or uncommon materials), for their unusual bindings, or for their rarity and historic significance. In India, you can see ancient books—some 2000 years old—written on palm leaves, bound with string threaded through holes in the leaves. Figure 2.5 shows an example, which includes a picture of a deity (Sri Ranganatha) reclining with his consort on a serpent (Adishesha).


Figure 2.5 Pages from a palm-leaf manuscript in Thanjavur, India.

Thanjavur Maharaja Serfoji’s Sarasvati Mahal Library, Tamil Nadu (1995).

For many readers, handling a book is an exquisitely enjoyable part of the information-seeking process. Beauty is functional: well-crafted books give their readers an experience that is rich, enlightening, memorable. Physical characteristics—the book’s size, heft, the patina of use on its pages, and so on—communicate ambient qualities of the work. Electronically presented books may never achieve the standards set by the beautiful books of a bygone age. Most web pages are appallingly brash and ugly; the better ones are merely bland and utilitarian. Designers of electronic books do pay some attention to the look and feel of the pages, with crisp text clearly formatted and attractively laid out. Many digital collections offer page images rather than electronic text; although the images are sometimes rather beautiful, they are blandly presented as flat, two-dimensional framed objects.

The British National Library has made a rare attempt to provide an electronic reading experience that more closely resembles reading a traditional book. Readers sit at a large screen showing a double-page spread of what appears to be a physical book. They flick their finger across the touch-sensitive screen to metaphorically pick up a page and turn it. Pages look three-dimensional; they flip over, guided by your finger, and the book’s binding eases imperceptibly as each one is turned. Page edges to right and left indicate how far through you are. The simulation is compelling, and users rapidly become absorbed in the book itself. Families cluster around the screen, discussing a beautifully illustrated text, turning pages unthinkingly. What a contrast from a web page and scroll bar! But the drawback is cost: these books are painstakingly photographed in advance to give a slow movie animation of every single page-turn.

A simulated three-dimensional book can be obtained from any electronic text by creating a computer model of the page-turn using techniques developed by the entertainment industry for animated movies. This yields a dynamic representation of a physical document that imitates the way it looks in nature, achieved by interactive graphics—albeit rendered on a two-dimensional screen and manipulated using an ordinary mouse or touch-panel. The simulation nevertheless conveys the impression of handling a physical book.

The electronic book of Figure 2.6 uses a model whose parameters are page size and thickness, number of pages, paper texture, cover image, and so on. The model captures, in excruciating detail, the turning of each page of a completely blank book, depending on where you grab it (with the mouse or touch-panel). Then the textual page images of the actual book are mapped onto the blank pages before the page-turning operation is applied, a job that the computer can do in somewhat less than the twinkling of an eye. Future library patrons can (metaphorically) handle the book, heft it, flip through the pages, sample excerpts, scan for pictures, locate interesting parts, and so on. This gives a sense of numerous salient properties: the book’s thickness, layout, typographic style, density of illustrations, color plates, and so on. When it comes to actual reading, page by page, readers switch to a conventional two-dimensional view, optimized for legibility and sequential access.
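The book gives no implementation details, but the basic idea of a parameterized page-turn model, with page images mapped onto blank geometry, can be sketched in a few lines of Python. Everything here (the class name, the parameters, the rigid unbent page) is an illustrative assumption rather than a description of any particular system.

from dataclasses import dataclass
from math import cos, sin, pi

@dataclass
class BookModel:
    # Parameters of the blank book, echoing those mentioned in the text.
    page_width: float = 0.15        # metres
    page_height: float = 0.23
    page_thickness: float = 0.0001
    num_pages: int = 320
    texture_files: list = None      # page images mapped onto the blank pages

    def page_turn(self, page: int, t: float):
        """Position of the turning page's outer corner, from t = 0.0 (flat on
        the right) to t = 1.0 (flat on the left), rotating about the spine."""
        angle = t * pi
        x = self.page_width * cos(angle)    # corner sweeps across the spine
        z = self.page_width * sin(angle)    # lifts up, then settles back down
        texture = (self.texture_files[page]
                   if self.texture_files else f"blank-{page}")
        return {"corner": (x, self.page_height, z), "texture": texture}

book = BookModel(texture_files=[f"page-{i:03}.png" for i in range(320)])
print(book.page_turn(page=42, t=0.25))      # a quarter of the way through a turn

A real renderer would also bend the paper and shade it with graphics hardware; the point is simply that one generic blank-book model, plus a stack of page images, is enough to simulate any book.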


Figure 2.6 Views of an electronic book.

People have valuable and enjoyable interactions with books without necessarily reading them from cover to cover. And even when they do read right through, they invariably take a good look at the outside—and inside—first. When web bookstores opened, they soon began to provide front and back cover images, and progressed to the table of contents and subject index, then to sample chapters, and finally to full-page images of the books in their stock. These clues help provide readers with a sense of what it is that they are going to acquire.

Real documents give many clues to their age and usage history. Pages change color and texture with age, becoming yellow and brittle. Books fall open to well used parts, which bear the marks of heavier handling. Simulations can emulate these features and use them to enhance the reader’s experience. But computers can go further, transcending reality. Other information can be used to augment the physical model. Fingertip recesses are cut into the page edges of old-fashioned dictionaries and reference works to indicate logical sections—letters of the alphabet, in the case of dictionaries. This can be simulated in a three-dimensional book model to show chapters and sections. Simulation offers key advantages over real life: the process can be applied to any book, and it can be switched on and off according to the user’s taste and current needs.

Future virtual books will do far more than just mimic their physical counterparts. Section headings will pop out from the side, forming a complete table of contents that is keyed to physical locations within the book, or—less intrusively—as “rollover text” that pops up when your finger strokes the page edges. The result of textual searches will be indicated by coloring page edges to indicate clusters of terms. Back-of-book indexes, or lists of automatically extracted terms and phrases, will be keyed to pages in the book. It will be easy to flick through just the illustrations, as many people do when encountering a new book. (Did you, when you picked this one up?)

The invention of the “codex,” or book form, counts as one of history’s greatest innovations. Unlike the papyrus scrolls in the Library of Alexandria, books provide random access. They can be opened flat at any page. They can be read easily—and shelved easily, with title, author, and catalog number on the spine. From a starkly utilitarian point of view, three-dimensional visualizations communicate interesting and useful document features to the reader.

Life, including our interactions with digital technology, involves—or should involve—far more than merely maximizing efficiency and effectiveness. Subjective impact—how much pleasure users gain, how engaged they become—should not be overlooked or downplayed. Readers will be intrigued and enthusiastic as they interact with future three-dimensional books. The familiar, comforting, enjoyable, and useful nature of book visualizations will go some way toward helping to open up digital browsing to a broader audience than technophiles—including those with little enthusiasm for bland information experiences.

METADATA

The contents of the World Wide Web are often called “virtual.” To paraphrase the dictionary definition, something is virtual if it exists in essence or effect though not in actual fact, form, or name. In truth, a virtual representation of books has been at the core of libraries right from the beginning: the catalog. Even before Alexandria, libraries were arranged by subject, and catalogs gave the title of each work, the number of lines, the contents, and the opening words. In 240 B.C., an index was produced to provide access to the books in the Alexandrian Library that was a classified subject catalog, a bibliography, and a biographical dictionary all in one.

A library catalog is a complete model that represents, in a predictable manner, the books within. It provides a summary of—sometimes a surrogate for—the library’s holdings. Today we call this information “metadata,” often dubbed “data about data.”6 And it is highly valuable in its own right. In the generous spirit of self-congratulation, a late nineteenth-century librarian declared:

Librarians classify and catalog the records of ascertained knowledge, the literature of the whole past. In this busy generation, the librarian makes time for his fellow mortals by saving it. And this function of organizing, of indexing, of time-saving and thought-saving, is associated peculiarly with the librarian of the 19th Century.

Bowker (1883)

Along with the catalog, libraries contain other essential aids to information seeking, such as published bibliographies and indexes. Like catalogs, these are virtual representations—metadata—and they provide the traditional means of gaining access to journal articles, government documents, microfiche and microfilm, and special collections.

Organizing information on a large scale is far more difficult than it seems at first sight. In his 1674 Preface to the Catalogue for the Bodleian Library in Oxford, Thomas Hyde lamented the lack of understanding shown by those who had never been charged with building a catalog:

“What can be more easy (those lacking understanding say), having looked at the title-pages than to write down the titles?” But these inexperienced people, who think making an index of their own few private books a pleasant task of a week or two, have no conception of the difficulties that arise or realize how carefully each book must be examined when the library numbers myriads of volumes. In the colossal labour, which exhausts both body and soul, of making into an alphabetical catalogue a multitude of books gathered from every corner of the earth there are many intricate and difficult problems that torture the mind.

Hyde (1674)

Two centuries later, the Trinity College librarians, still laboring over their first printed catalog, would surely have agreed.

Librarians have a wealth of experience in classifying and organizing information in ways that make relevant documents easy to find. Although some aspects are irrelevant for digital collections—such as the physical constraint of arranging the library’s contents on a linear shelving system—we can still benefit from their experience. Casual library patrons often locate information by finding one relevant book on the shelves and then looking around for others in the same area, but libraries provide far more systematic and powerful retrieval structures.

THE LIBRARY CATALOG

Few outside the library profession appreciate the amount of effort that goes into cataloging books. Librarians have created a detailed methodology for assigning catalog information, or metadata, to documents in a way that makes them easy to locate. The methodology is designed for use by professionals who learn about cataloging during intensive courses in librarianship graduate programs. The idea is that two librarians, given the same document to catalog, will produce exactly the same record.

A conscientious librarian takes an average of two hours to create a catalog record for a single item, and it has been estimated that a total of more than a billion records have been produced worldwide—a staggering intellectual investment. How can it take two hours to write down title, author, and publisher information for a book? It’s not that librarians are slow; it’s that the job is far harder than it first sounds. Of course, for web documents, the painstaking intellectual effort that librarians put into cataloging is completely absent. Most web users don’t know what they’re missing. What are some of the complexities? Let’s look at what librarians have to reckon with.
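To get a feel for the scale of that investment, here is a back-of-the-envelope calculation in Python. The two hours per record and the billion records are the figures quoted above; the 2,000-hour working year is an assumption added only for the sake of the arithmetic.

hours_per_record = 2                      # "an average of two hours" per item
records_worldwide = 1_000_000_000         # "more than a billion records"
working_hours_per_year = 2_000            # assumed full-time working year

total_hours = hours_per_record * records_worldwide
person_years = total_hours / working_hours_per_year
print(f"{total_hours:.1e} hours, or about {person_years:,.0f} person-years")
# 2.0e+09 hours, or about 1,000,000 person-years of cataloging effort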

Authors

Authors’ names seem the most straightforward of all bibliographic entities. But they often present problems. Some works emanate from organizations or institutions: are they the author? Modern scientific and medical papers can have dozens of authors because of the collaborative nature of the work and institutional conventions about who should be included. Many works are anthologies: is the editor an “author”? If not, what about anthologies that include extensive editorial commentaries: when is this deemed worthy of authorship? And what about ghostwriters?

Digital documents may or may not represent themselves as being written by particular authors. If they do, authorship is generally taken at face value. However, this is problematic—not so much because people misrepresent authorship, but because differences arise in how names are written. For one thing, people sometimes use pseudonyms. But a far greater problem is simply inconsistency in spelling and formatting. Librarians go to great pains to normalize names into a standard form so that you can tell whether two documents are by the same person or not.

Table 2.1, admittedly an extreme example, illustrates just how difficult the problem can become. It shows different ways in which the name of Muammar Qaddafi (the Libyan leader) is represented on documents received by the Library of Congress. The catalog chooses one of these forms, usually the most common—Qaddafi, Muammar in this case—and groups all variants under this spelling, with cross-references from all the variants. In this case, ascribing authorship by taking documents at face value would yield 47 different authors.

Table 2.1 Spelling Variants of the Name Muammar Qaddafi

Qaddafi, Muammar
Gadhafi, Mo ammar
Kaddafi, Muammar
Qadhafi, Muammar
El Kadhafi, Moammar
Kadhafi, Moammar
Moammar Kadhafi
Gadafi, Muammar
Mu ammar al-Qadafi
Moamer El Kazzafi
Moamar al-Gaddafi
Mu ammar Al Qathafi
Muammar Al Qathafi
Mo ammar el-Gadhafi
Muammar Kaddafi
Moamar El Kadhafi
Muammar al-Qadhafi
Mu ammar al-Qadhdhafi
Qadafi, Mu ammar
El Kazzafi, Moamer
Gaddafi, Moamar
Al Qathafi, Mu ammar
Al Qathafi, Muammar
Qadhdhafi, Mu ammar
Kaddafi, Muammar
Muammar al-Khaddafi
Mu amar al-Kad’afi
Kad’afi, Mu amar alxd-
Gaddafy, Muammar
Gadafi, Muammar
Gaddafi, Muammar
Kaddafi, Muamar
Qathafi, Muammar
Gheddafi, Muammar
Muammar Gaddafy
Muammar Ghadafi
Muammar Ghaddafi
Muammar Al-Kaddafi
Muammar Qathafi
Muammar Gheddafi
Khadafy, Moammar
Qudhafi, Moammar
Qathafi, Mu’Ammar el
El Qathafi, Mu’Ammar
Kadaffi, Momar
Ed Gaddafi, Moamar
Moamar el Gaddafi

The use of standardized names is called authority control, and the files that record this information are authority files. This is an instance of the general idea of using a controlled vocabulary or set of preferred terms to describe entities. Terms that are not preferred are “deprecated” (which does not necessarily imply disapproval) and are listed explicitly, with a reference to the preferred term. Controlled vocabularies contrast with the gloriously unrestricted usage found in free text. Poets exploit the fact that there are no restrictions at all on how authors may choose to express what they want to say.
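In software terms, an authority file is little more than a mapping from every deprecated variant to its preferred heading. Here is a minimal sketch in Python, using a few of the variants from Table 2.1; the function name and the tiny sample are purely illustrative.

# A tiny authority file: each deprecated variant points to the preferred heading.
AUTHORITY = {
    "Gadhafi, Mo ammar": "Qaddafi, Muammar",
    "Kaddafi, Muammar": "Qaddafi, Muammar",
    "El Kadhafi, Moammar": "Qaddafi, Muammar",
    "Moamar al-Gaddafi": "Qaddafi, Muammar",
    # ... and so on, one entry per variant in Table 2.1
}

def authorized_form(name: str) -> str:
    """Return the preferred heading, or the name itself if it is not listed."""
    return AUTHORITY.get(name, name)

# Documents whose title pages spell the author differently are grouped
# under a single heading in the catalog.
print(authorized_form("Moamar al-Gaddafi"))   # -> Qaddafi, Muammar
print(authorized_form("Qaddafi, Muammar"))    # -> Qaddafi, Muammar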

Titles

Most documents have titles. In digital collections, they (like authors) are often taken at face value from the documents themselves. In the world of books, they can exhibit wide variation, and vocabulary control is used for titles as well as for authors. Table 2.2 shows what is represented on the title pages of fifteen different editions of Shakespeare’s Hamlet.

Table 2.2 Title Pages of Different Editions of Hamlet

Amleto, Principe di Danimarca
Der erste Deutsche Bühnen-Hamlet
The First Edition of the Tragedy of Hamlet
Hamlet, A Tragedy in Five Acts
Hamlet, Prince of Denmark
Hamletas, Danijos Princas
Hamleto, Regido de Danujo
The Modern Reader’s Hamlet
Montale Traduce Amleto
Shakespeare’s Hamlet
Shakspeare’s Hamlet
The Text of Shakespeare’s Hamlet
The Tragedy of Hamlet
The Tragicall Historie of Hamlet
La Tragique Histoire d’Hamlet

Subjects

Librarians assign subject headings to documents from a standard controlled vocabulary—such as the Library of Congress Subject Headings. Subjects are far more difficult to assign objectively than titles or authors, and involve a degree of, well, subjectivity. The dictionary tellingly defines subjective as both

pertaining to the real nature of something; essential

and

proceeding from or taking place within an individual’s mind such as to be unaffected by the external world.

The evident conflict between these meanings speaks volumes about the difficulty of defining subjects objectively!

It is far easier to assign subject descriptors to scientific documents than to literary ones, particularly works of poetry. Many literary compositions and artistic works—including audio, pictorial, and video compositions—have subjects that cannot readily be named. Instead, they are distinguished as having a definite style, form, or content, using artistic categories such as “genre.”

Classification Codes

Library shelves are arranged by classification code. Each work is assigned a unique code, and books are placed on the shelves in the corresponding order. Classification codes are not the same as subject headings: any particular item has several subjects but only one classification. Their purpose is to place works into categories so that volumes treating similar topics fall close together. Classification systems that are in wide use in the English-speaking world are the Library of Congress Classification, the Dewey Decimal Classification, and India’s Colon Classification System.

Readers who browse library shelves have immediate access to the full content of the books, which is quite different from browsing catalog entries that give only summary metadata. Placing like books together adds a pinch of serendipity to searching that would please the Three Princes of Chapter 1. You catch sight of an interesting book whose title seems unrelated, and a quick glance inside—the table of contents, chapter headings, illustrations, graphs, examples, tables, bibliography—gives you a whole new perspective on the subject.

Physical placement on shelves, a one-dimensional linear arrangement, is a far less expressive way of linking content than the rich hierarchy that subject headings provide. But digital collections have no dimensionality restriction. Their entire contents can be rearranged at the click of a mouse, and rearranged again, and again, and again, in different ways depending on how you are thinking.
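The contrast is easy to make concrete: a shelf fixes one ordering, while a digital catalog can be re-sorted along any field whenever the reader's perspective changes. The following sketch uses a few invented records purely to show the idea.

# Three invented catalog records, reduced to a few fields.
records = [
    {"title": "Hamlet", "author": "Shakespeare, William", "class_code": "822.33"},
    {"title": "The Odyssey", "author": "Homer", "class_code": "883.01"},
    {"title": "King Lear", "author": "Shakespeare, William", "class_code": "822.33"},
]

# A physical shelf admits exactly one linear order, say classification then title.
shelf_order = sorted(records, key=lambda r: (r["class_code"], r["title"]))

# A digital collection can be rearranged instantly along any other dimension.
by_author = sorted(records, key=lambda r: r["author"])
by_title = sorted(records, key=lambda r: r["title"])

for r in shelf_order:
    print(r["class_code"], r["title"])
print([r["title"] for r in by_author])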

THE DUBLIN CORE METADATA STANDARD

As we have seen, creating accurate library catalog records is a demanding job for trained professionals. The advent of the web, with its billions of documents, calls for a simpler way of assigning metadata. Dublin Core is a minimalist standard, intended for ordinary people, designed specifically for web documents.7 It is used not just for books, but for what it terms “resources.” This subsumes pictures, illustrations, movies, animations, simulations, even virtual reality artifacts, as well as textual documents. A resource has been defined as “anything that has identity.” That includes you and me.

Table 2.3 shows the metadata standard. These fifteen attributes form a “core” set that may be augmented by additional ones for local purposes. In addition, many of the attributes can be refined through the use of qualifiers. Each one can be repeated where desired.

Table 2.3 The Dublin Core Metadata Standard

Title: A name given to the resource
Creator: An entity primarily responsible for making the content of the resource
Subject: A topic of the content of the resource
Description: An account of the content of the resource
Publisher: An entity responsible for making the resource available
Contributor: An entity responsible for making contributions to the content of the resource
Date: A date of an event in the lifecycle of the resource
Type: The nature or genre of the content of the resource
Format: The physical or digital manifestation of the resource
Identifier: An unambiguous reference to the resource within a given context
Source: A reference to a resource from which the present resource is derived
Language: A language of the intellectual content of the resource
Relation: A reference to a related resource
Coverage: The extent or scope of the content of the resource
Rights: Information about rights held in and over the resource

The Creator might be a photographer, illustrator, or author. Subject is typically expressed as a keyword or phrase that describes the topic or content of the resource. Description might be a summary of a textual document, or a textual account of a picture or animation. Publisher is generally a publishing house, university department, or corporation. Contributor could be an editor, translator, or illustrator. Date is the date of resource creation, not the period covered by its contents. A history book will have a separate Coverage date range that defines the time period to which the book relates. Coverage might also include geographical locations that pertain to the content of the resource. Type might indicate a home page, research report, working paper, poem, or any of the media types listed before. Format can be used to identify software systems needed to run the resource.

This standard does not impose any kind of vocabulary control or authority files. Two people might easily generate quite different descriptions of the same resource. However, work is underway to encourage uniformity by specifying recommended sets of values for certain attributes. For example, the Library of Congress Subject Headings are encouraged as one way of specifying the Subject. There are standard schemes for encoding dates and languages, which Dublin Core adopts.

The original minimalist Dublin Core standard is being augmented with new ways to increase expressiveness by accommodating complexity. Certain attributes can be refined. For example, Date can be qualified as date created, date valid, date available, date issued, or date modified; multiple specifications are possible. Description can be couched as an abstract or a table of contents. Standard refinements of the Relation field include is version of, is part of, replaces, requires, and references.
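As a concrete illustration, here is what a Dublin Core description might look like when written out as simple attribute-value pairs, sketched in Python. The element names come from Table 2.3; the dotted qualifiers follow one common convention for refinements; the values themselves are invented for the example.

# A Dublin Core record written as attribute-value pairs (values invented).
record = {
    "Title": "Rubbing from a stele in Xi'an",
    "Creator": "Anonymous calligrapher",
    "Subject": ["Calligraphy, Chinese", "Steles"],       # repeated attribute
    "Description": "Ink rubbing taken from an engraved stone slab.",
    "Publisher": "Example Digital Library",
    "Date.created": "1100",                              # qualified dates
    "Date.issued": "2005-06-01",
    "Type": "Image",
    "Format": "image/jpeg",
    "Identifier": "http://example.org/steles/0042",
    "Language": "zh",
    "Rights": "Public domain",
}

for element, value in record.items():
    print(f"{element}: {value}")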

DIGITIZING OUR HERITAGE

How big is our literature? And how does it compare with the web? The Library of Congress has 30 million books. The book you are reading contains 103,304 words, or 638,808 characters (including spaces)—say 650,000 bytes of text in uncompressed form (compression might reduce it to 25 percent of that size without sacrificing any accuracy). Although it has few illustrations, they add quite a bit, depending on how they are stored. But let’s consider the words alone. Suppose the average size of a book in the Library of Congress is a megabyte. That makes 30 terabytes for the Library’s entire textual content—or maybe 100,000 copies of the Encyclopedia Britannica.
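The arithmetic behind those figures is easy to check. The byte counts below are the rough ones quoted in the text, not measurements.

bytes_per_book = 1_000_000          # assume an average book is about 1 MB of text
books_in_library_of_congress = 30_000_000

total_bytes = bytes_per_book * books_in_library_of_congress
print(total_bytes / 1e12, "terabytes")              # 30.0

# "100,000 copies of the Encyclopedia Britannica" then implies roughly
# 300 MB of text per copy, plausible for a large multivolume work.
print(total_bytes / 100_000 / 1e6, "MB per copy")   # 300.0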

How big is the web? In the next chapter, we will learn that it was estimated at 11.5 billion pages in early 2005, totaling perhaps 40 terabytes of textual content. That’s a little more than the Library of Congress. Or take the Internet Archive. Today, according to its website,8 its historical record of the web (which probably includes a great deal of duplication) contains approximately one petabyte (1,024 terabytes) of data and is growing at the rate of 20 terabytes per month—all the text in the Library of Congress every six weeks.

The amounts we have been talking about are for text only. Text is minuscule in size compared with other electronic information. What about all the information produced on computers everywhere, not just the web? It took two centuries to fill the Library of Congress, but today’s world takes about 15 minutes to churn out an equivalent amount of new digital information, stored on print, film, magnetic, and optical media. More than 90 percent of this is stored on ordinary hard disks. And the volume doubles every three years.

You might take issue with these figures. They’re rough and ready and could be off by a large factor. But in today’s exponentially growing world, large factors are overcome very quickly. Remember the local library that we mentioned might fit on a teenager’s portable digital music player? It had half a million books. Wait six years until storage has improved by a factor of 60, and the Library of Congress will fit there. Wait another couple to store the world’s entire literature. What’s a factor of 10, or 100, in a world of exponential growth? Our teenager the web has reached the point where it dwarfs its entire ancestry.
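Those claims follow directly from the doubling rates quoted earlier in the chapter. A quick check, assuming capacity doubles roughly once a year and that the local library's half-million books average a megabyte of text each:

local_library_books = 500_000
bytes_per_book = 1_000_000
player_capacity_now = local_library_books * bytes_per_book    # about 0.5 TB

years = 6
growth = 2 ** years              # doubling once a year gives x64, the "factor of 60"
player_capacity_then = player_capacity_now * growth

library_of_congress_text = 30e12                    # 30 terabytes, from above
print(player_capacity_then / 1e12, "TB")            # 32.0
print(player_capacity_then >= library_of_congress_text)   # True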

What are the prospects for getting all our literature onto the web? Well, storage is not a problem. And neither is cost, really. If we show page images, backed up by the kind of low-accuracy searchable text that automatic optical character recognition can produce, it might cost $10 per book to digitize all the pages. The 30 million books in the Library of Congress would cost $300 million. That’s half the Library’s annual budget.

The main problem, of course, is copyright.9 There are three broad classes of material: works that are in the public domain, commercially viable works currently in print and being sold by publishers, and works still under copyright that are no longer being commercially exploited. We briefly discuss current projects that are digitizing material in these three areas. But be warned: things are moving very quickly. Indeed, the doubling every two years that Gordon Moore observed in semiconductor technology seems rather sluggish by the standards of large-scale digitization projects in the early twenty-first century! Many radical new developments will have occurred by the time you read these pages.

PROJECT GUTENBERG

Project Gutenberg was conceived in 1971 by Michael Hart, then a student, with the goal of creating and disseminating public domain electronic text. Its aim was to have 10,000 texts in distribution thirty years later. The rate of additions doubled every year: one book per month in 1991, two per month in 1992, four in 1993, and so on. The total should have been reached in 2001, but the schedule slipped a little, and in October 2003 the 10,000th electronic text was added to the collection—the Magna Carta. A new goal was promptly announced: to grow the collection to one million by the end of the year 2015. By early 2006, the project claimed 18,000 books and was adding about 50 per week.

Gutenberg’s first achievement was an electronic version of the U.S. Declaration of Independence, followed by the Bill of Rights and the Constitution. The Bible and Shakespeare came later. Unfortunately, however, the latter was never released, due to copyright restrictions. You might wonder why: surely Shakespeare was published long enough ago to be indisputably in the public domain? The reason is that publishers change or edit the text enough to qualify as a new edition, or add new extra material such as an introduction, critical essays, footnotes, or an index, and then put a copyright notice on the whole book. As time goes by, the number of original surviving editions shrinks, and eventually it becomes hard to prove that the work is in the public domain since few ancient copies are available as evidence.

Project Gutenberg is a grass-roots phenomenon. Text is input by volunteers, each of whom can enter however much they want—a book a week, a book a year, or just one book in a lifetime. The project does not direct the volunteers’ choice of material; instead, people are encouraged to choose books they like and enter them in a manner in which they are comfortable. In the beginning, books were typed in. Now they are scanned using optical character recognition but are still carefully proofread and edited before being added to the collection. An innovative automated scheme distributes proofreading among volunteers on the Internet. Each page passes through two proofreading rounds and two formatting rounds. With thousands of volunteers each working on one or more pages, a book can be proofed in hours.

Gutenberg deals only with works that are in the public domain. It has sister projects in other countries, where the laws are sometimes slightly different.10

MILLION BOOK PROJECT

The Million Book project was announced in 2001 by Carnegie Mellon University, with the goal of digitizing one million books within a few years. The idea was to create a free-to-read, searchable digital library about as big as Carnegie Mellon University’s library, and far larger than any high school library. The task involves scanning the books, recognizing the characters in them, and indexing the full text. Pilot Hundred- and Thousand-book projects were undertaken to test the concept.

Partners were quickly established in India and China to do the work. The United States supplies equipment, expertise, training, and copyright experts, while the partner countries provide labor and perform the actual digitization. To kick-start the venture, Carnegie Mellon University pulled books published before 1923 from its shelves and boxed them for shipment to India, sending a total of 45,000 titles, mainly biographies and science books. Further material originates in the partner countries. The Chinese are digitizing rare collections from their own libraries, while the Indians are digitizing government-sponsored textbooks published in many of the country’s eighteen official languages. By the end of 2005, more than half a million books had been scanned: 28 percent in India and 69 percent in China, with the remaining 3 percent in Egypt. Roughly 135,000 of the books are in English; the others are in Indian languages, Chinese, Arabic, French, and other languages. However, not all are yet available online, and they are distributed among sites in three countries. The project is on track to complete a million books by 2007. It recently joined the Open Content Alliance (see page 55).

Unlike Project Gutenberg, which places great emphasis on accurate electronic text that can be both displayed and searched, the Million Book Project makes books available in the form of highly compressed image files (using DjVu, which is a commercial technology). Part of the project is developing optical character recognition technology for Chinese and for the Indian languages, an ambitious goal considering the variability and intricacy of the scripts. The project is daunting in scale and many details are unclear, such as how many of the currently scanned books have actually been converted to electronic text form, how much emphasis is being placed on accuracy, and to what extent the output is being manually corrected. It is intended that 10,000 of the million books will be available in more than one language, providing a test bed for machine translation and cross-language information retrieval research.

As well as public domain works, in-copyright but out-of-print material will be digitized. Books will be free to read on the web, but printing and saving may be restricted to a page at a time to deter users from printing or downloading entire books—this is facilitated by the use of DjVu technology for display (although open source viewers exist). Publishers are being asked for permission to digitize all their out-of-print works; others who wish to have book collections digitized are invited to contact the project. Donors receive digital copies of their books.

INTERNET ARCHIVE AND THE BIBLIOTHECA ALEXANDRINA

The Internet Archive was established in 1996 with the aim of maintaining a historical record of the World Wide Web. Its goal is to preserve human knowledge and culture by creating an Internet library for researchers, historians, and scholars. By 2003, its total collection was around 100 terabytes, growing by 12 terabytes per month; today it includes more than 40 billion web pages and occupies a petabyte of storage, growing by 20 terabytes per month.

The Internet Archive is deeply involved in digitization initiatives and now provides storage for the Million Book Project. On April 23, 2002, UNESCO’s International Day of the Book, a partnership was announced with Bibliotheca Alexandrina, an Egyptian institution that has established a great library near the site of the original Library of Alexandria. A copy of the entire Internet Archive is now held there—including the Million Book Project.

In December 2004, the Internet Archive announced a new collaboration with several international libraries to put digitized books into open-access archives. At that time, their text archive contained 27,000 books, and a million books had already been committed. This collaboration includes Carnegie Mellon University and subsumes the Million Book Project. It includes some of the original partners in India and China, as well as the Bibliotheca Alexandrina in Egypt and new partners in Canada and the Netherlands.

The Internet Archive does not target commercially successful books—ones that are available in bookstores. But neither does it want to restrict scope to the public domain. It has its eye on what are often called “orphan” works—ones that are still in copyright but not commercially viable—and is taking legal action in an attempt to obtain permission to include such books. In 2004, it filed a suit against the U.S. Attorney General claiming that statutes that extend copyright terms unconditionally—like the 1998 Sonny Bono Copyright Term Extension Act discussed in Chapter 6—are unconstitutional under the free speech clause of the First Amendment. It is requesting a judgment that copyright restrictions on works that are no longer available violate the U.S. Constitution. The U.S. District Court dismissed the case, but the ruling has been appealed.

AMAZON: A BOOKSTORE

What about commercially successful books? In October 2003, Amazon, the world’s leading online bookstore, announced “Search inside the book,” a new service providing online access to the full text of publications. This dramatic initiative opened with a huge collection of 120,000 fiction and nonfiction titles supplied by nearly 200 book publishers, with an average of 300 pages per title. The goal was to quickly add most of Amazon’s multimillion-title catalog.

The “search inside” feature allows customers to do just that: search the full text of books and read a few pages. Amazon restricts the feature to registered users, or ones who provide a credit card, and limits the number of pages you can see. Each user is restricted to a few thousand pages per month and at most 20 percent of any single book. You cannot download or copy the pages, or read a book from beginning to end. There’s no way to link directly to any page of a book. You can’t print using the web browser’s print function, though even minor hackers can easily circumvent this. Anyone could save a view of their screen and print that, but it’s a tedious way of obtaining low-quality hardcopy.

Our teenager is flexing its muscles. Publishers did not welcome this initiative, but they had no real choice—you can’t ignore the call of the web.11 The fact that Amazon’s mission is to sell books helped convince publishers to agree to have their content included. When the service was introduced, there was a lively debate over whether it would help sales or damage them. The Authors Guild was skeptical: its staff managed to print out 100 consecutive pages from a best-selling book using a process that was quite simple though a bit inconvenient (Amazon immediately disabled the print function). Amazon reported that early participants experienced sales gains of 9 percent from the outset. Customers reported that the service allowed them to find things they would never have seen otherwise, and gave them a far stronger basis for making a purchase decision. Pundits concluded that while it may not affect a book’s central audience, it would extend the edges. Some felt that it would hurt the sales of reference books because few people read them from cover to cover, but now most major publishers have joined the program to avoid the risk of losing significant sales to the competition.

There is evidence that putting full text on the web for free access can actually boost sales. When the National Academy Press did this for all new books, print sales increased (to the surprise of many). The Brookings Institution placed an experimental 100 books online for free and found that their paper sales doubled.

Amazon’s service is entirely voluntary: publishers choose whether to make their books available. Before doing so, they must ensure that their contract with the author includes a provision for digital rights that entitles them to exploit the work in this way. Then they supply the electronic text or a physical copy. Amazon sends most books to scanning centers in low-wage countries like India and the Philippines. Some are scanned in the United States using specialist machines to handle oversize volumes and ensure accurate color reproduction.

This initiative heralds a profound change in the bookselling business. Amazon provides an electronic archive in which readers can find books, sample them, and, if they like what they see, purchase them. It is now able to earn a profit by selling a far wider variety of books than any previous retailer. The popularity curve has a long tail—there are many books that are not profitable to stock, though they would sell a few copies every year. Under the traditional publishing model, a title becomes uneconomic once its sales drop to a few hundred copies per year, whereas Amazon’s model remains profitable at far lower volumes.

GOOGLE: A SEARCH ENGINE

Just a year later, in October 2004, Google announced its own book search service. There is a key difference from Amazon’s program: Google does not sell books. When Google Book Search generates a search result, it lists online bookstores: you can click on one to buy the book. Publishers could purchase one of the links and sell the book directly, thereby reducing their reliance on book retailers. This provides an intriguing new business model. Google underwrites the cost of digitizing out-of-print editions for publishers who want to have their books in the Google index; it then splits advertising revenue with the publisher.

Of course, viewing restrictions like Amazon’s are imposed. Users can see only a few pages around their search hits, and there is an upper limit on the number of pages they can view, determined in this case by the publisher. To help protect copyright, certain pages of every in-copyright book are withheld from all users.

A few months later, at the end of 2004, Google announced a collaboration with five major libraries12 to digitize vast quantities of books. Works are divided into three categories. Those in the public domain will be available on an open-access basis: you can read their full content. For copyrighted material covered by an agreement between the publisher and Google, you can view pages up to a predetermined maximum, as with Amazon. For copyrighted books scanned without the copyright holder’s permission, users see only small snippets of text, a line or two containing their query in context, plus bibliographic information. The snippets are intended to tell readers whether the book is relevant to their subject of inquiry. Google will place links to bookstores and libraries next to search results.
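The three categories amount to a simple access policy. The sketch below is our own schematic rendering of the rules just described, not Google’s implementation; the function name and the example page cap are invented for illustration.

```python
# Schematic of the three access categories described above (not Google's code).
from enum import Enum
from typing import Optional


class Category(Enum):
    PUBLIC_DOMAIN = "public domain"                 # full text readable
    PUBLISHER_AGREEMENT = "publisher agreement"     # limited pages, cap set by publisher
    SCANNED_WITHOUT_PERMISSION = "no permission"    # snippets only


def what_the_user_sees(category: Category, publisher_page_cap: Optional[int] = None) -> str:
    if category is Category.PUBLIC_DOMAIN:
        return "the full content, readable from cover to cover"
    if category is Category.PUBLISHER_AGREEMENT:
        return (f"pages around each search hit, up to a publisher-set maximum "
                f"of {publisher_page_cap} pages")
    return "a snippet of a line or two containing the query, plus bibliographic information"


print(what_the_user_sees(Category.PUBLISHER_AGREEMENT, publisher_page_cap=20))
```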

Some of the collaborating libraries intend to offer their entire book collections for Google to scan; others will offer just a portion; still others only rare materials. A total of 15 million books are involved. What’s in it for the libraries? Google gives the digitized versions back to them to use in whatever ways they like. Although Google cannot offer the full content publicly, libraries can show electronic copies to their own patrons. For libraries that have already been spending big bucks to digitize parts of their collections, this is an offer from heaven.

The announcement rocked the world of books. There have been many enthusiasts and a few detractors. Some denounce the project as an exercise in futility, others as a massive violation of copyright; still others fret that it will distort the course of scholarship. The president of France’s National Library warned of the risk of “crushing American domination in the definition of how future generations conceive the world” and threatened to build a European version. To which Google characteristically replied, “Cool!”

OPEN CONTENT ALLIANCE

One year further on, toward the end of 2005, the Open Content Alliance was announced. Prompted by the closed nature of Google’s initiative, the Internet Archive, Yahoo, and Microsoft, along with technology organizations such as Adobe and HP Labs and many libraries, formed a consortium dedicated to building a free archive of digital text and multimedia. Initially, the Internet Archive will host the collection and Yahoo will index the content. The cost of selecting and digitizing material is borne by participating institutions based on their own fundraising and business models; the Internet Archive undertakes nondestructive digitization at 10¢ per page. Yahoo is funding the digitization of a corpus of American literature selected by the University of California Digital Library.

The aim is to make available high-resolution, downloadable, reusable files of the entire public domain, along with their metadata. Donors have the option to restrict bulk availability of substantial parts of a collection, although in fact Yahoo and the University of California have decided not to place any restrictions. Use will be made of the Creative Commons licenses described in Chapter 6 (page 197), which allow authors to grant broad rights of use explicitly, or even to relinquish copyright and place their work in the public domain.

Whether an open consortium can rival Google’s project is moot. Google has the advantage of essentially limitless funding. But the Open Content Alliance is distinguishing its efforts from Google’s by stressing its “copyright-friendly” policy: it intends to respect the rights of copyright holders before digitizing works that are still under copyright. This may restrict the Alliance’s scope, but will leave it less exposed to discontent.

NEW MODELS OF PUBLISHING

Project Gutenberg is zealously noncommercial; it digitizes only books in the public domain and publishes an accurate rendition of the full electronic text (but not the formatting). The Million Book Project, the Internet Archive, and the Open Content Alliance are also noncommercial and target public-domain books, but they aspire to process copyrighted “orphan” books as well, and they show users page images backed up by full-text searching.

Amazon and Google are commercial. As a bookseller, Amazon deals only with commercially viable books—but has aspirations to expand viability to works with very low sales. Google sells advertising, not books, and plans to offer all three categories of books. For public-domain works, it presents the full text—but you cannot print or save it. You can read books in full, but you must read them in Google. For commercial works, it presents excerpts, as Amazon does, and provides links for you to purchase the book—but not links to local libraries that have copies. For orphan works, it shows snippets and provides links to bookstores and local libraries.

These innovations—particularly the commercial ones—will have enormous repercussions. Publishers will accelerate production of digitized versions of their titles—e-Books—and sell them online through keyword advertising on channels such as Google Book Search. They will continue to rely on conventional retailers for the sale of paper copies. They will experiment with combining e-Books with pre- and post-release paper copies. E-Book technology will be forced to standardize.

Eventually, different sales models for e-Books will emerge. Highly effective search engine advertising will level the market, providing more opportunities for small publishers and self-publishing. Physical bookstores will find themselves bypassed. E-Books will provide preview options (flip through the pages at no cost) and will be rentable on a time-metered, or absolute duration, basis (like video stores), perhaps with an option to purchase. Publishers will experiment with print-on-demand technology.

E-Books present a potential threat to readers. Content owners may adopt technical and legal means to implement restrictive policies governing access to the information they sell. E-Books can restrict access to a particular computer—no lending to friends, no sharing between your home computer and your office, no backing up on another machine. They can forbid resale—no secondhand market. They can impose expiry dates—no permanent collections, no archives. These measures go far beyond the traditional legal bounds of copyright, and the result could immeasurably damage the flow of information in our society.

Can content owners really do this? To counter perceived threats of piracy, the entertainment industry is promoting digital rights management schemes that control what users can do. These schemes are concerned solely with content owners’ rights, not at all with users’ rights. They do not implement “permissions,” which is what copyright authorizes, but absolute, mechanically enforced “controls.” Anti-circumvention rules are sanctioned by the Digital Millennium Copyright Act (DMCA) in the United States, and similar legislation is being enacted elsewhere (see Chapter 6, page 201, which gives some background on this act). Control is already firmly established by the motion picture industry, which can compel manufacturers to incorporate encryption into their products because it holds key patents on DVD players.

Commercial book publishers are promoting e-Book readers that, if widely adopted, would allow the same kind of control to be exerted over reading material. Basic rights that we take for granted (and are legally enshrined in the concept of copyright) are in jeopardy. Digital rights management allows reading rights to be controlled, monitored, and withdrawn, and DMCA legislation makes it illegal for users to seek redress by taking matters into their own hands. However, standardization and compatibility issues are delaying consumer adoption of e-Books.

In scholarly publishing, digital rights management is already well advanced. Academic libraries license access to content in electronic form, often in tandem with purchase of print versions. Because they form the entire market, they have been able to negotiate reasonable conditions with publishers. However, libraries have far less power in the consumer book market. One can envisage a scenario where publishers establish a system of commercial pay-per-view libraries for e-Books and refuse public libraries access to books in a form that can be circulated.

SO WHAT?

We live in interesting times. The World Wide Web did not come out of the library world—which has a fascinating history in its own right—but is now threatening to subsume it. This has the potential to be a great unifying and liberating force in the way in which we find and use information. But there is a disturbing side too: the web dragons, which offer an excellent free service to society, centralize access to information on a global scale. This has its own dangers.

Subsequent chapters focus on the web. But as we have seen, the web is taking over our literature. Depending on how the copyright issues (which we return to in Chapter 6) play out, the very same dragons may end up controlling all our information, including the treasury of literature held in libraries. The technical and social issues identified in this book transcend the World Wide Web as we know it today.

WHAT CAN YOU DO ABOUT ALL THIS?

• Don’t forget your local library when seeking information!

• Learn to use the library catalog.

• Look up the Library of Congress Subject Headings.

• Read about the Bio-Optic Organized Knowledge Device on the web (the acronym hints at what you will find).

• Watch a demo of a 3D book visualizer (e.g., open_the_book).

• Download an e-Book from Project Gutenberg (try a talking book).

• Investigate the Internet Archive’s Wayback Machine.

• Try searching book contents on Amazon and Google.

• Keep a watching brief on the Open Content Alliance.

• Use CiteSeer and Google Scholar to retrieve some scientific literature.

NOTES AND SOURCES

A good source for the development of libraries in general is Thompson (1997), who formulates principles of librarianship, including the one quoted, that it is a librarian’s duty to increase the stock of his (sic) library. Thompson also came up with the metaphor of snapping the links of the chained book—in fact, he formulates open access as another principle: libraries are for all. Gore (1976) recounts the fascinating history of the Alexandrian Library in his book Farewell to Alexandria. Some of this chapter is a reworking of material in How to Build a Digital Library by Witten and Bainbridge (2003).

The information on Trinity College Dublin was kindly supplied by David Abrahamson; that about the Library of Congress was retrieved from the Internet. Thomas Mann (1993), a reference librarian at the Library of Congress, has prepared a wonderful source of information on libraries and librarianship, full of practical assistance on how to use conventional library resources to find things. The imaginative architectural developments that have occurred in physical libraries at the close of the twentieth century are beautifully illustrated by Wu (1999).

The view of the electronic book in Figure 2.6, earlier in this chapter, is from Chu et al. (2004). The first book on digital libraries was Practical Digital Libraries by Lesk (1997), now in an excellent second edition (2005). In contrast, Crawford and Gorman (1995) in Future Libraries: Dreams, Madness, and Reality fear that virtual libraries are real nonsense that will devastate the cultural mission of libraries.

The nineteenth-century librarian who “makes time for his fellow mortals” is Bowker (1883), quoted by Crawford and Gorman (1995). The catchphrase “data about data” for metadata is glib but content-free. Lagoze and Payette (2000) give an interesting and helpful discussion of what metadata is and isn’t. Our description of bibliographic organization draws on the classic works of Svenonius (2000) and Mann (1993). It is difficult to make books on library science racy, but these come as close as you are ever likely to find.

The Dublin Core metadata initiative is described by Weibel (1999).13 Thiele (1998) reviews related topics, while current developments are documented on the official Dublin Core Metadata Initiative website run by the Online Computer Library Center (OCLC).14

The Internet Archive project is described by Kahle (1997). The factoid that it takes just 15 minutes for the world to churn out an amount of new digital information equivalent to the entire holdings of the Library of Congress is from Smith (2005). The figure dates from 2002; it has surely halved since.15 We find that Wikipedia (www.wikipedia.org) is a valuable source of information, and gleaned some of the details about Project Gutenberg from it. The remark on page 53 about increasing sales by putting books on the web is from Lesk (2005, p. 276).

 Can the alchemists transmute a mess of books into an ethereal new structure?

4The printing press was invented in China much earlier, around five centuries before Gutenberg.

5Assume the library contains half a million 80,000-word novels. At 6 letters per word (including the inter-word space), each novel comes to half a million bytes, or 150,000 bytes when compressed. The whole library amounts to about 75 gigabytes, about the capacity of a high-end iPod today (2006). This is text only: it does not allow for illustrations. (This arithmetic is rechecked in a short sketch at the end of these notes.)

6Because most traditional library catalogers are women whereas men tend to dominate digital libraries, metadata has been defined tongue-in-cheek as “cataloging for men.”

7Named after a meeting held in Dublin, Ohio, not Molly Malone’s fair city, in 1995.

8www.archive.org/about/faqs.php

9Chapter 6 introduces salient aspects of copyright law.

10For example, the Australian project has more freedom because until recently its copyright laws were more permissive—works entered the public domain only 50 years after the author’s death. In early 2005, the copyright term was extended by 20 years as part of the Australia–U.S. Free Trade Agreement, bringing it into line with U.S. practice. However, material for which copyright had previously expired—that is, if the author died in 1954 or before—remains in the public domain in Australia, though not in the United States. We discuss copyright further in Chapter 6.

11However, many publishers balk at making popular textbooks available through this program, for they fear that a concerted effort by a group of bright students could completely undermine their copyright.

12The University of Michigan, Harvard, Oxford, and Stanford; and the New York Public Library.

13You can download the standard from www.niso.org.

14dublincore.org.

15More information on this extraordinary estimate can be found at www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm.
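Finally, the back-of-the-envelope arithmetic in note 5 above is easy to recheck; the few lines below, a minimal sketch in Python, simply redo the calculation using the figures stated in the note (the roughly 3:1 compression ratio is implied by those figures).

```python
# Rechecking the estimate in note 5.
novels = 500_000                  # half a million novels
words_per_novel = 80_000
bytes_per_word = 6                # 6 letters per word, including the inter-word space

raw_bytes = words_per_novel * bytes_per_word     # 480,000 bytes, about half a million
compressed_bytes = 150_000                       # per novel, as stated (roughly 3:1)
library_bytes = novels * compressed_bytes        # 75,000,000,000 bytes

print(f"One novel, uncompressed: {raw_bytes:,} bytes")
print(f"Whole library, compressed: {library_bytes / 1e9:.0f} GB")   # about 75 GB
```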
