CHAPTER 6 WHO CONTROLS INFORMATION? 

The underlying mantra of the web is that it’s free for all, accessible by everyone. Even before the beginning, in the pre-dawn darkness, Tim Berners-Lee (1989) foresaw that “for an international hypertext to be worthwhile … many people would have to post information.” The idea was always that anyone should be able to publish, and everyone should be able to read: there would be no central control. It was understood that the information itself would be the right stuff: good, useful, and reliable, even if not always completely correct. And the hyperlinks would be apt and relevant; they would take you on instructive and informative voyages of exploration. But if we look dispassionately at the web as it is today, we have to acknowledge that some of the characteristics it has acquired do not really comply with this wholesome vision.

The web is a victim of its own success. So many people are eager to publish that no one could possibly check that all the information is presented in good faith—let alone that it is correct and reliable. Indeed, it serves so many different communities and the content is so diverse that the very idea of information being “correct and reliable” and presented “in good faith” is hopelessly naïve, almost laughable—although in the early days, the user community, far smaller and more focused back then, would have shared a broad consensus on such matters.

This chapter is about the mechanisms that control the flow of information. A cardinal advantage of the web is that it eliminates many intermediaries that the traditional publishing industry places between the demand for information and its supply. But search engines are a new kind of intermediary—an extraordinarily powerful one—that stand between users and their right to be informed. In practice, they play a central role in controlling access to information. What does this mean for democracy on the web?

Despite their youthful immaturity and the crudeness of their retrieval mechanisms, search engines have become absolutely indispensable when working with online information today. The presence of these intermediaries between users and the information they seek is double-edged. On the one hand, we must thank the dragons for the invaluable service they offer to the community (and for free). On the other, we have to accept them as arbiters of our searches.

Search engines are the world’s most popular information services. Almost all web users (85 percent) employ them. More than half (55 percent) use them every day. Without them, most sites would be practically inaccessible. They represent the only hope of visibility for millions of authors. They play a pivotal role in the distribution of information and consequently in the organization of knowledge. But there is no transparency in their operation. Their architecture is shrouded in mystery. The algorithms they follow are secret. They are accountable to no one. Nevertheless, they furnish our most widely used public service for online information retrieval. As we learned in the last chapter, they have not necessarily chosen secrecy: it is forced upon them by the dynamics of spam.

The search business is hotly competitive, and the dragons must offer excellent service in order to convince people to use them. However, as public companies, their primary responsibility must be to their shareholders, not their users. The object is to make a profit. Profit is a legitimate, honest, fair, and necessary aim of any commercial enterprise. But in this case, there is a delicate balance between performing the public service of ensuring information access and making money out of it on the way. Since they are the visibility arbiters of online information, the raw material from which new knowledge is wrought passes through the dragons’ gates. This raises technical, economic, ethical, and political issues.

In traditional publishing, it is copyright law that controls the use and reuse of information. Copyright is society’s answer to the question of how to reward authors for their creative work. On the web, most authors are not rewarded—at least, not directly. But when asking who controls information, we should take a close look at copyright law because, as we learned in Chapter 2, immense efforts are underway to digitize traditional libraries and place them online.

Does copyright apply to material that was born digital? The situation with regard to computer files, and particularly documents published on the web, is murky. Lawyers have questioned whether it is legal even to view a document on the web, since one’s browser inevitably makes a local copy that has not explicitly been authorized. Of course, it is widely accepted that you can look at web documents—after all, that’s what they’re there for. If we allow that you can view them, next comes the question of whether you can save them for personal use—or print them, link to them, copy them, share them, or distribute them to others. Note that computers copy and save documents behind the scenes all over the place. For example, web browsers accelerate delivery by interacting with web proxy servers that selectively cache pages (see Chapter 5, pages 147–148).

Search tools are at the confluence of two powerful sources of data. They enable access to all the information on the web, and they log user queries over time. Like telephone calls, query records are sensitive information: they allow customer profiling, and they reflect citizens’ thought processes. The privacy policy that they adopt is of crucial importance for all of us.

In 1986, before the web era began, Richard Mason, a pioneer in the field of information ethics, wrote an influential and provocative article that identified four critical challenges we face in the information age: privacy, accuracy, property, and accessibility. Privacy concerns the question of what information people can keep to themselves and what they are forced to reveal. Accuracy deals with the authenticity and fidelity of information—and who is accountable for error. Property is about who owns information and what right they have to determine a fair price for its exchange. Accessibility relates to the criteria and conditions of access by people or organizations to particular pieces of information. Search engines, at the heart of information distribution on the web, are critically involved in all four areas.

THE VIOLENCE OF THE ARCHIVE

The web is a kind of collective memory that is fully open and public. Its contents can be examined, commented on, cataloged, referred to, retrieved, and read by everyone. You could think of it as a world archive whose contents—for the first time in the history of mankind—are not controlled by authorities and institutions, but rest directly in the hands of those individuals who can access and nurture it.

You might be surprised to learn that archives, which you probably think of as dusty caverns of forgotten nostalgia, are historical dynamite. They can be used to rewrite history by making parts of it simply vanish. Archives are not neutral records because history is written by the winners. In the case of conflict, the records that survive are those that support the interpretation of the conquerors. Archives do violence to the truth.

Controlling the means of information preservation—the archives—bestows the ultimate power. You can decide what is public and what is private, what is secret, and what is up for discussion. Once these issues have been settled, the archive’s conservation policy sets them in stone for time immemorial. Archives are both revolutionary and conservative: revolutionary in determining what can and cannot be talked about, and conservative in propagating the decisions indefinitely. But information on the web belongs to society as a whole. Social control of the archive implies collective possession of history.

Free access to all was a signature of the early web. So was self-organization—or lack of it. Material accumulated chaotically, anarchically. But to be useful, information needs to be organized: it must be retrieved, studied, and interpreted before it can form the basis for new knowledge. As a leading contemporary philosopher put it in a book on the role of archives in the organization of knowledge:

The archive is what determines that all these things said do not accumulate endlessly in an amorphous mass, nor are they inscribed in an unbroken linearity, nor do they disappear at the mercy of chance external accidents; but they are grouped together in distinct figures, composed together in accordance with specific regularities.

Foucault (1989)

The web by itself is not an archive in this sense. Archives involve rules that govern how their contents are organized and accessed and how they are maintained and preserved.

Transferring the ability to publish from institutions to individuals promises a welcome innovation: the democratization of information. The catch is that the effectiveness of publication is governed by the tools available for organizing information and retrieving content. And these tools are not publicly controlled; they’re supplied by private companies.

Although traditional archives preserve knowledge (and therefore the possible directions for creating new knowledge), their contents do not evolve naturally but are controlled by man-made laws that govern what is and is not included. The postmodern deconstructionist philosopher Derrida noted this in a lecture given in London in 1994—coincidentally, just as the web was taking off. An archive is the historical a priori, a record of facts, discovered not deduced, that support subsequent interpretations of history. Derrida spoke of the “violence of the archive,” which does not merely involve remembering and forgetting, inclusion and exclusion, but also occlusion—the deliberate obstruction of the record of memory. By violence, Derrida suggests the conflicts that arise as the archive suffers regressions, repressions, suppressions, even the annihilation of memory, played out against the larger cultural and historical backdrop. He broods on revisionist histories that have been based on material in archives—archives of evil—that do violence to the truth.

Viewed as an archive, the web embodies deep contradictions. No one controls its content: everybody is free to contribute whatever he or she feels is needed to build and preserve a memory of our society. However, nothing is permanent: online content suffers from link rot, and the only accurate references are spatio-temporal ones: location and access date. Existence does not necessarily imply access: in order to find things, we must consult the gatekeepers. The dragons are the de facto arbiters of online memory.

Another conundrum is that the notion of “archive” implies both an interior and an exterior: there must be some things that it does not contain. As Derrida enigmatically puts it:

There is no archive without a place of consignation, without a technique of repetition, and without a certain exteriority. No archive without outside.

Derrida (1995, p. 11)

You need somewhere to put the archive (“consignation”), some way of accessing it (“repetition”), and an exterior to make it meaningful to talk about the interior. But the web is everywhere; it holds everything that is produced (or is at least capable of doing so). It is in constant flux. Where does its outside commence?

Again, the answer lies in the distinction between the web as a repository and the web as a retrieval system. Information retrieval is by nature selective, and different tools embody different biases. The web’s exterior is represented not by absent documents, but by inherent bias in the tools we use to access it. The spirit of the web—the spirit of those pioneers who conceived and developed it—is kept alive by a continual process of conception and invention of new means of gathering, accessing, and presenting information. Users and developers share responsibility for ensuring that the web does not degenerate into a closed machine that can be interrogated in just one way, an oracle that spouts the same answers to everybody. The web was founded on the freedom of all users to share information. The same freedom should be reflected today and tomorrow in the continued evolution of new tools to access it.

WEB DEMOCRACY

In the beginning, life was simple: when you inserted links into your page, the targets were the pages that were most appropriate, most relevant. Today the web is so immense that visibility has become the dominant factor. If you don’t know about my page, you certainly won’t link to it. Nodes with many links are de facto more visible and tend to attract yet more links. Popularity is winning out over fitness. This makes it difficult for newcomers to gain prominence—but not impossible, as the fairytale success of Google, a latecomer in the search engine arena, illustrates. As the web becomes commercial, another factor in link placement is the questionable desire to maximize visibility artificially, as we learned in the previous chapter. Even outside the business world of search engine optimization, webmasters continually receive e-mail suggesting that “if you link to me, I’ll link to you: it’ll be good for us both.”

As we all know, it’s really hard to find good, reliable information. The problem is not just one of determining whether what you are looking at represents the right stuff—correct, reliable, presented in good faith, the right kind of truth for you. It’s also that given a particular information need, crudely translated into a search engine query, there can be no single definitive list of the “best” results. The relevance of information is overwhelmingly modulated by personal preference, the context of the request, the user’s educational level, background knowledge, linguistic ability, and so on. There are as many good replies to an inquiry as there are people who pose the question.

THE RICH GET RICHER

Received wisdom, underscored by rhetoric from all quarters, is that the web is the universal key to information access. However, we now know that things are not as straightforward as they appear at first sight. Over the last decade or so, the network theory sketched in Chapter 3 has been developed to explain how structure emerges spontaneously in linked organisms, like the web, that evolve continuously over time; structure that is shared by all scale-free networks. The very fact of growth determines the network’s topological form. This structure, clearly understandable from the logical point of view, implies that when people publish information online, there is scant chance that it will actually be read by others. Indeed, Barabási, one of the pioneers of the study of the web as an enormous scale-free network, went so far as to observe:

The most intriguing result of our web-mapping project was the complete absence of democracy, fairness, and egalitarian values on the web. We learned that the topology of the web prevents us from seeing anything but a mere handful of the billion documents out there.

Barabási (2002, p. 56)

How can this be? To probe further, let’s examine a 2003–2004 study that took a snapshot of part of the web and then another one seven months later. A sample of 150 websites was chosen, and all their pages were downloaded, plus the pages they linked to—a total of 15 million of them. The two snapshots had 8 million pages in common. The goal was to examine how the number of inlinks to each page changed over the period.

There were about 12 million new links. But surprisingly, a full 70 percent of these went to the 20 percent of the pages that were originally most popular in terms of their number of inlinks. The next 20 percent of pages received virtually all the remaining new links. In other words, the 60 percent of pages that were least popular in the original sample received essentially no new incoming links whatsoever! Even when the new links were expressed as a proportion of the original number of links, there was a negligible increase for the 60 percent of less-popular pages. This extraordinary result illustrates that, on the web, the rich get richer.

Suppose the increase was measured by PageRank rather than the number of inlinks. It’s still the case that the top 20 percent of the original sample receives the lion’s share of the increase, and the next 20 percent receives all the rest. But the total amount is conserved, for, as we learned in Chapter 4, PageRank reflects the proportion of time that a random surfer will spend on the page. It follows that the remaining 60 percent of pages suffer substantial decreases. Not only do the rich get richer, but the poor get poorer as well. In this sense, the web reflects one of the tragedies of real life.

Of course, you’d expect the rich to get richer, for two reasons. First, pages with good content attract more links. The fact that a page already has many inlinks is one indication that it has good content. If new links are added to the system, it’s natural that more of them will point to pages with good content than to ones with poor content. Second, random surfers will be more likely to encounter pages with many inlinks, and therefore more likely to create new links to them. You cannot create a link to something you have not seen.
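The dynamic described above is the classic preferential-attachment process from network theory. A minimal simulation makes it tangible (the page counts and link counts here are arbitrary illustrative assumptions, not the figures from the 2003–2004 study):

```python
import random

def preferential_attachment(n_pages=1000, n_new_links=10000, seed=42):
    """Simulate link growth in which a page's chance of receiving a
    new link is proportional to the inlinks it already has."""
    rng = random.Random(seed)
    inlinks = [1] * n_pages          # every page starts with one inlink
    tickets = list(range(n_pages))   # one "ticket" per existing inlink
    for _ in range(n_new_links):
        page = rng.choice(tickets)   # chosen proportional to current inlinks
        inlinks[page] += 1
        tickets.append(page)         # the new link adds another ticket
    return inlinks

counts = preferential_attachment()
ranked = sorted(counts, reverse=True)
share = sum(ranked[:200]) / sum(ranked)   # links held by the top 20% of pages
print(f"top 20% of pages hold {share:.0%} of all inlinks")
```

Even this crude model typically leaves the top fifth of pages holding around half of all links, echoing the lopsided concentration the study observed.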

But these effects do not explain why the bottom 60 percent of pages in the popularity ranking get virtually no new links. This is extreme. What could account for it? The dragons.

THE EFFECT OF SEARCH ENGINES

In practice, the way we all find things is through web dragons, the search engines. Their role is to bestow visibility on as much information as possible by connecting all relevant content to well-posed queries. They are the web’s gatekeepers: they let people in and help them find what they’re looking for; they let information out so it can reach the right audience. What’s their impact on the evolution of the web? The links to your page determine where it appears in the search results. But people can only link to your page if they find it in the first place. And most things are found using search engines.

At first glance, things look good. A democratic way for the web to evolve would be for random surfers to continually explore it and place links to pages they find interesting. In practice, rather than surfing, search engines are used for exploration. Consider a searcher who issues a random query and clicks on the results randomly, giving preference to ones higher up the list. Typical search engines take PageRank into account when ordering their results, placing higher-ranking pages earlier in the list. As we learned in Chapter 4 (pages 116–117), PageRank reflects the experience of a random surfer, and so it seems that the experience of a random searcher will approximately mirror that of a random surfer. If that were the case, it would not matter whether links evolved through the actions of searchers or surfers, and the evolution of the web would be unaffected by the use of search engines.

Not so fast. Although this argument is plausible qualitatively, the devil is in the details. It implicitly assumes that random searchers distribute their clicks in the list of search results in the same quantitative way that random surfers visit pages. And it turns out that for typical click distributions, this is not the case. Buried deep in the list from, say, the 100th item onward, lie a huge number of pages that the searcher will hardly ever glance at. Although random surfers certainly do not frequent these pages, they do stumble upon them occasionally. And since there are so many such pages, surfers spend a fair amount of time in regions that searchers never visit.

To quantify this argument would lead us into realms of theory that are inappropriate for this book. But you can see the point by imagining that searchers never click results beyond, say, the 100th item (when was the last time you clicked your way through more than ten pages of search results?). The millions of pages that, for typical searches, appear farther down the list are completely invisible. They might as well not be there. The rich get richer and the poor do not stand a chance. More detailed analysis with a plausible model of click behavior has shown that a new page takes 60 times longer to become popular under the search-dominant model than it does under the random surfer model.
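The contrast can be made concrete with a toy calculation (the figures below are illustrative assumptions, not parameters from the detailed analysis cited above). Suppose a searcher's clicks fall off steeply with rank and stop entirely at result 100, while a surfer with PageRank-style teleportation reaches every page with some minimum probability:

```python
# Toy model: attention given to a page at a given rank, for a query
# with N candidate answers. All numbers are assumptions for illustration.
N = 1_000_000          # pages answering a typical query (assumed)
ALPHA = 0.15           # teleportation probability, as in PageRank

# Searcher: clicks follow a truncated power law over the first 100 results.
weights = [1 / r**2 for r in range(1, 101)]
total = sum(weights)

def p_searcher(rank):
    """Probability a random searcher clicks the result at this rank."""
    return weights[rank - 1] / total if rank <= 100 else 0.0

def p_surfer_floor(rank):
    """Teleportation guarantees every page a sliver of surfer attention."""
    return ALPHA / N

for rank in (1, 100, 101, 500_000):
    print(rank, p_searcher(rank), p_surfer_floor(rank))
```

Beyond rank 100 the searcher's attention is exactly zero, while the surfer's never is: pages deep in the list are invisible under search-driven exploration, however small their surfer-driven visibility may be.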

Google shot from obscurity (less than 5 percent of web users visited once a week) to popularity (more than 30 percent) in two years, from 2001 through early 2003. According to the preceding model, today, in a world where people only access the web through search, this feat would take 60 times as long—more than a century!

Surely, this must be an exaggeration. Well, it is—in a way. It assumes that people only find new pages using search engines, and that they all use the same one so they all see the same ranking. Fortunately, word of mouth still plays a role in information discovery. (How did you find your favorite web dragon?—probably not by searching for it!) And fortunately, we do not all favor the same dragon. Nevertheless, this is a striking illustration of what might happen if we all used the same mechanism to discover our information.

POPULARITY VERSUS AUTHORITY

Do web dragons simply manage goods—namely, information—in a high-tech marketplace? We don’t think so. Information is not like meat or toothpaste: it enjoys special status because it is absolutely central to democracy and freedom. Special attention must be paid to how information circulates because of the extraordinarily wide impact it has on the choices of citizens. When people make decisions that affect social conditions or influence their activities, they must have the opportunity to be fully informed. To exacerbate matters, in the case of web search users are generally blithely unaware not only of just how results are selected (we’re all unaware of that!), but of the very fact that the issue might be controversial.

As explained in Chapter 4, despite the internal complexity of search engines, a basic principle is to interpret a link from one page to another as a vote for it. Links are endorsements by the linker of the information provided by the target. Each page is given one vote that it shares equally among all pages it links to: votes are fractional. But pages do not have equal votes: more weight is given to votes from prestigious pages. The definition of prestige is recursive: the more votes a page receives, the more prestigious it is deemed to be. A page shares its prestige equally among all pages it chooses to vote for.

Search engines base prestige solely on popularity, not on any intrinsic merit. In real life, raw popularity is a questionable metric. The most popular politician is not always the most trustworthy person in the country. And though everyone knows their names, the most popular singers and actors are not considered to dispense authoritative advice on social behavior—despite the fact that the young adulate and emulate them. Why, then, do people accept popularity-based criteria in search engines? Do they do this unconsciously? Or is it simply because, despite its faults, it is the most efficacious method we know?

The way web pages relate to one another leads to the idea of an economy of links. Choosing to create a link has a positive value: it is a meaningful, conscious, and socially engaged act. Links not only imply endorsement within the abstract world of the web, they have a strong impact on commercialization of products and services in the real world. Their metaphysical status is ambiguous, falling somewhere between intellectual endorsement and commercial advertising. As we learned in the last chapter, the economy of links has real monetary value: there is a black market, and you can buy apparently prestigious endorsements from link farms. This offends common sense. It’s a clear abuse of the web, a form of both pollution and prostitution that stems from a failure to respect the web as a common good. Nowadays such practices cannot be prosecuted by law, but they are firmly—and, some would say, arbitrarily—punished by today’s search engines.

Search engines make sense of the universe that they are called upon to interpret. Their operators are strenuously engaged in a struggle against abuse. But despite their goodwill, despite all their efforts on our behalf, one cannot fail to observe that they enjoy a very unusual position of power. Although they play the role of neutral arbiter, in reality they choose a particular criterion which, whatever it is, has the power to define what is relevant and what is insignificant. They did not ask for this awesome responsibility, and they may not relish it. The consequences of their actions are immeasurably complex. No one can possibly be fully aware of them. Of course, we are not compelled to use their service—and we do not pay for it.

Defining the prestige of a web page in terms of its popularity has broad ramifications. Consider minority pages that lie outside the mainstream. They might be couched in a language that few people speak. They might not originate in prominent institutions: government departments, prestigious universities, international agencies, and leading businesses. They might not treat topics of contemporary interest. No matter how relevant they are to our query, we have to acknowledge the possibility that such pages will not be easily located in the list of results that a search engine returns, even though they do not belong to the deep web mentioned in Chapter 3. Without any conscious intent or plan—and certainly without any conspiracy theory—such pages are simply and effectively obliterated, as a by-product of the ranking. This will happen regardless of the particular strategy the dragons adopt. Obviously, no single ranking can protect all minorities. Selective discrimination is an inescapable consequence of the “one size fits all” search model.

There are so many different factors to take into account that it is impossible to evaluate the performance of a search engine fairly. Chapter 4 introduced the twin measures of precision and recall (pages 110–111). Recall concerns the completeness of the search results list in terms of the number of relevant documents it contains, while precision concerns the purity of the list in terms of the number of irrelevant documents it contains. But these metrics apply to a single query, and there is no such thing as a representative query. Anyway, on the web, recall is effectively incomputable because one would need to identify the total number of relevant documents, which is completely impractical. And, of course, the whole notion of “relevance” is contextual—the dilemma of inquiry is that no one can determine what is an effective response for questions from users who do not know exactly what they are seeking. A distinction can be made between relevance—whether a document is “about” the right thing in terms of the question being asked—and pertinence—whether it is useful to the user in that it satisfies their information need. It is very hard to measure pertinence. Add to this the fact that search engines change their coverage, and their ranking methods, in arbitrary ways and at unpredictable times.
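The twin measures can be stated concretely. Given the set of documents relevant to a query (knowable only in a curated test collection, never for the whole web, which is why recall is incomputable in practice):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
       Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 8 of the 10 documents returned are relevant,
# out of 40 relevant documents in the whole collection.
p, r = precision_recall(range(10), list(range(8)) + list(range(100, 132)))
print(p, r)   # 0.8 0.2
```

A pure result list scores high precision; a complete one scores high recall. Note that neither measure touches pertinence: a document can be squarely "about" the query and still fail to satisfy the user's actual information need.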

This makes it clear that users should interact in a critical way with search engines and employ a broad range of devices and strategies for information retrieval tasks in order to uncover the treasure hidden in the web.

PRIVACY AND CENSORSHIP

Although everyone has a good idea of what privacy means, a clear and precise definition is elusive and controversial—and contextual. In many circumstances, we voluntarily renounce privacy, either in circumscribed situations or because the end justifies the means. Life partners arrive at various different understandings as to what they will share, some jealous of individual privacy and others divulging every little secret to their partner. In families, parents give their children privacy up to a certain level and then exercise their right to interfere, educating them and protecting them from external menaces—drugs, undesirable companions, violent situations. For certain purposes, we are all prepared to divulge private information. Whenever you take a blood test, you give the medical technician—a complete stranger—full access to extremely personal information. Privacy is not so much about keeping everything secret; it’s more about the ability to selectively decline our right to protection when we want to. It’s a matter of control.

Actual privacy legislation differs from one country to another. In this section, we examine privacy protection from a general, commonsense perspective and illustrate the issues raised by widespread use of pervasive information technology to manage and manipulate personal data. Protection is not just a matter of legal rules. We want to understand the underlying issues, issues that apply in many different contexts.

There are two major approaches to privacy: restricted access and control. The former identifies privacy as a limitation of other people’s access to an individual. It is clear that a crucial factor is just who the “other people” are. Prisoners are watched by guards day and night, but no one else has access, and guards are prohibited from revealing anything outside the prison. There is no access from outside—complete privacy—but full access from inside—zero privacy. As another example, disseminating information about oneself must be considered a loss of privacy from the restricted-access point of view. However, revealing a secret to a close friend is quite different from unwittingly having your telephone tapped. The question is whether you have consented to the transfer of personal information.

According to the second view, privacy is not so much about hiding information about oneself from the minds of others, but the control that we can exercise over information dissemination. There is a clear distinction between involuntary violation of privacy and a decision to reveal information to certain people. One question is whether we are talking about actual or potential revelation of information. We might live in an apartment that is visible from outside with the aid of a telescope, but in a community where nobody ever uses such a device to spy. In strict terms of control, we sacrifice privacy because our neighbors’ behavior is beyond our control, but the question is academic because no one actually exploits our lack of control.

This issue is particularly germane to cyberspace. Ubiquitous electronics record our decisions, our choices in the supermarket, our financial habits, our comings and goings. We swipe our way through the world, every swipe a record in a database. The web overwhelms us with information; meanwhile every choice we make is recorded. Internet providers know everything we do online and in some countries are legally required to keep a record of all our behavior for years. Mobile telephone operators can trace our movements and all our conversations. Are we in control of the endless trail of data that we leave behind us as we walk through life? Can we dictate the uses that will be made of it? Of course not. This state of affairs forces us to reflect anew on what privacy is and how to protect it in the age of universal online access.

PRIVACY ON THE WEB

When defining privacy, various modalities of protection of personal data can be distinguished, depending on the communication situation. In private situations, of which the extreme is one-on-one face-to-face interaction, we are less protective of the data we distribute but in full control of our audience. In public situations the reverse is true; the audience is outside our control, but we carefully choose what information to reveal. A third communication situation concerns channels for the processing of electronically recorded information.

In this third situation, there are two different principles that govern online privacy. The first, user predictability, delimits the reasonable expectations of a person about how his or her personal data will be processed and the objectives to which the processing will be applied. It is widely accepted that before people make a decision to provide personal information, they have a right to know how it will be used and what it will be used for, what steps will be taken to protect its confidentiality and integrity, what the consequences of supplying or withholding the information are, and any rights of redress they may have. The second principle, social justifiability, holds that some data processing is justifiable as a social activity even when subjects have not expressly consented to it. This does not include the processing of sensitive data, which always needs the owner’s explicit consent. It is clear that these two principles may not always coincide, and for this reason there will never be a single universally applicable privacy protection policy. It is necessary to determine on a case by case basis whether a particular use of personal data is predictable and whether and to what extent it is socially justifiable.

User predictability resembles the medical ethics notion of “informed consent,” which is the legal requirement for patients to be fully informed participants in choices about their health care. This originates from two sources: the physician’s ethical duty to involve patients in their health care, and the patients’ legal right to direct what happens to their body. Informed consent is also the cornerstone of research ethics. While not necessarily a legal requirement, most institutions insist on informed consent prior to any kind of research that involves human subjects. Social justifiability extends informed consent and weakens it in a way that would not be permitted in the ultra-sensitive medical context.

Every user’s network behavior is monitored by many operators: access providers, websites, e-mail services, chat, voice over IP (VoIP) telephone services, and so on. Users who are aware of what is going on (and many aren’t) implicitly consent to the fact that private information about their online habits will be registered and retained for some period of time. We can assess the privacy implications by considering every piece of data in terms of its predictability and justifiability, and verify the system’s legitimacy—both moral and legal—according to these principles.

Our information society involves continual negotiation between the new possibilities for communication that technology presents and the new opportunities it opens up for monitoring user behavior. There is both a freedom and a threat; privacy protection lies on the boundary. As users, we choose to trust service providers because we believe the benefits outweigh the drawbacks. The risk is loss of control of all the personal information we are obliged to divulge when using new tools.

We are never anonymous when surfing the net: the system automatically registers every click. Service providers retain data in accordance with national laws such as the U.S.A. Patriot Act.41 Although they are not compelled to do so, many individual websites keep track of our activities using the cookie mechanism. As explained in Chapter 3 (pages 73–74), cookies reside on the user’s computer and can be erased at will, either manually or by installing software that periodically removes them, paying special attention to suspected spyware. Technically speaking, users retain control of this privacy leak. But practically speaking, they do not, for most are blithely unaware of what is going on, let alone how to delete cookies (and the risks of doing so). Moreover, the balance between freedom and threat arises again, for cookies allow websites to personalize their services to each individual user—recognizing your profile and preferences, offering recommendations, advice or opportunities, and so on.
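The mechanism is simple enough to sketch. The following illustration, using Python’s standard http.cookies module, shows how a site plants an identifier in a visitor’s browser and recognizes it on a later visit; the cookie name and value are hypothetical.

```python
from http.cookies import SimpleCookie

# Server side: store an identifier in the visitor's browser.
cookie = SimpleCookie()
cookie["visitor_id"] = "abc123"          # hypothetical tracking identifier
cookie["visitor_id"]["max-age"] = 86400  # persists for a day unless deleted

# The Set-Cookie header line the server would send with its response:
header = cookie["visitor_id"].OutputString()
print(header)

# On the next visit the browser sends the cookie back, and the server
# parses it to recognize the same individual:
returned = SimpleCookie("visitor_id=abc123")
print(returned["visitor_id"].value)  # → abc123
```

Deleting the cookie file on your own machine breaks this chain of recognition, which is precisely why users who erase cookies lose the personalization that the tracking enables.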

PRIVACY AND WEB DRAGONS

It is impossible to draw a clear line between protection of privacy and access to useful services. Certainly users should be aware of what is going on—or at the very least have the opportunity to learn—and know how to use whatever controls are available. Every website should disclose its privacy policy. Many do, and if you have the patience to find and read them, you may discover that popular websites have policies that allow them to do more than you realized with the personal data you give them.

Here is a lightly paraphrased version of the privacy policy of a major search engine (Google):

We only share personal information with other companies or individuals if we have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request, (b) enforce applicable Terms of Service, including investigation of potential violations thereof, (c) detect, prevent, or otherwise address fraud, security or technical issues, or (d) protect against imminent harm to the rights, property or safety of the company, its users or the public as required or permitted by law.42

42Extracted from Google’s privacy policy at www.google.com/intl/en/privacypolicy.html#infochoices.

This example is typical: the policies of other major search engines are not materially different. Yahoo pledges that it will not share personal information about you except with trusted partners who work on its behalf, and these partners must in turn pledge that they will not use the information for their own business purposes. For instance, an insurance company interested in purchasing personal queries on medical conditions cannot procure them from search engines, either directly or indirectly. However, as you can see from the policy quoted here, under some circumstances personal information can be shared. Governments can access it if it relates to criminal activity, and the company can use it to investigate security threats. A further clause states that if the company is acquired or taken over, users must be informed before the privacy policy is changed. In a sense, the philosophy behind search engine privacy reflects the norm for telephone companies: they must prevent the circulation of personal data.

Most companies routinely track their customers’ behavior. There is enormous added value in being able to identify individual customers’ transaction histories. This has led to a proliferation of discount or “loyalty” cards that allow retailers to identify customers whenever they make a purchase. If individuals’ purchasing history can be analyzed for patterns, precisely targeted special offers can be mailed to prospective customers.

Your web queries relate to all aspects of your life (at least, ours do). People use their favorite search engine for professional activities, for leisure, to learn what their friends are doing, or to get a new date—and check up on his or her background. If you subscribe to their services, web dragons monitor your queries in order to establish your profile so that they can offer you personalized responses. In doing so, they keep track of what you are thinking about. When opting for personalized service, you face an inescapable dilemma: the more facilities the service offers, the more privacy you will have to renounce. However, unless you are working outside the law, your private data is not expected to circulate outside the dragon’s lair.

As an example of unanticipated circulation of personal information, in April 2005, dissident Chinese journalist Shi Tao was convicted of disclosing secret information to foreign websites and sentenced to ten years in jail. Vital incriminating evidence was supplied to the Chinese government by Yahoo’s Hong Kong branch, which traced him and revealed his e-mail address, Internet site, and the content of some messages. Commenting on the case during an Internet conference in China, Yahoo’s founder Jerry Yang declared:

We don’t know what they wanted that information for, we’re only told what to look for. If they give us the proper documentation in a court order, we give them things that satisfy local laws. I don’t like the outcome of what happens with these things. But we have to follow the law.43

CENSORSHIP ON THE WEB

During 2004, Google launched a Chinese-language edition of its news service, but in an emasculated version:

For users inside the People’s Republic of China, we have chosen not to include sources that are inaccessible from within that country.44

44googleblog.blogspot.com/2004/09/china-google-news-and-source-inclusion.html

Self-censorship was judged necessary in order to launch the service. In 2002, the company had crossed swords with the Chinese government, and in some areas of the country it seems that Google cooperated by withdrawing selected pages and redirecting to a Chinese search engine from Google’s home page. In summer 2004, Google invested heavily in Baidu, a government-approved Chinese search engine. Early in 2006, after a long period of uncertainty, Google finally announced that it was actively assisting the Chinese government in censoring results from its main search service.

Human rights groups responded in a blaze of outrage whose flames were fanned by the company’s self-proclaimed motto, “Don’t be evil.” Of course, web dragons cannot impose their service in defiance of a hostile government that is prepared to censor access from any Internet address. They have just two choices: to go along with censorship or to refrain from offering their service. Meanwhile, when Microsoft was criticized in mid-2005 for censoring the content of Chinese blogs that it hosted, it retorted:

Microsoft works to bring our technology to people around the world to help them realize their full potential. Like other global organizations we must abide by the laws, regulations and norms of each country in which we operate.45

45news.bbc.co.uk/1/hi/world/asia-pacific/4221538.stm

China today presents a golden opportunity for profit in the business of web search and related technologies. Major search engines and leading Internet companies are jockeying for a favorable market position. The problem is that local regulations sometimes dictate controversial decisions concerning personal freedom. In parts of the world where free speech is in jeopardy, the enormous power wielded by information filters placed between users and Internet services becomes all too evident, as does the danger of monitoring user behavior. The flip side is that withdrawing search services could cause damage by restricting cultural and business opportunities.

Privacy and censorship are critical problems on the web. Both involve a balance between transparency and protection of individuals from information disclosure. People need to be protected both by keeping private information secret and by censoring certain kinds of information. The problem is to define the boundary between permissible and illicit diffusion of information, balancing the rights of the individual against community norms. For example, criminals sacrifice their right to privacy: when the police demand information about a cyber thief, the community’s right to be protected against illegal appropriation may outweigh the suspect’s right to privacy. Censorship is justified when transparency is deemed to damage a segment of society. Almost everyone would agree that young children should be protected from certain kinds of content: filtering software helps parents censor their children’s viewing.

These issues cannot be dealt with mechanically because it is impossible to draw a line based on formal principles between the community’s right to know and the individual’s right to privacy. Moreover, any solutions should respect cultural diversity. Decisions can only be made on a case by case basis, taking into account the context and the type of information. These difficult judgments depend on social norms and pertain to society as a whole. They should not be left to automatic filters or to private companies with legitimate business interests to defend. Society must find a way to participate in the decision process, open it up, allow debate and appeal. Online privacy and censorship will remain open questions. There is an inescapable tradeoff between the opportunities offered by new information technologies and the threats that they (and their administrators) pose to our core freedoms.

COPYRIGHT AND THE PUBLIC DOMAIN

Digital collections are far more widely accessible than physical ones. This creates a problem: access to the information is far less controlled than in physical libraries. Digitizing information has the potential to make it immediately available to a virtually unlimited audience. This is great news. For the user, information around the world becomes available wherever you are. For the author, a wider potential audience can be reached than ever before. And for publishers, new markets open up that transcend geographical limitations. But there is a flip side. Authors and publishers ask, how many copies of a work will be sold if networked digital libraries enable worldwide access to an electronic copy of it? Their nightmare is that the answer is one. How many books will be published if the entire market can be extinguished by the sale of one electronic copy to a public library?

COPYRIGHT LAW

Possessing a copy of a document certainly does not constitute ownership in terms of copyright law. Though there may be many copies, every work has only one copyright owner. This applies not just to physical copies of books, but to computer files too, whether they have been digitized from a physical work or created electronically in the first place—born digital, you might say. When you buy a copy of a work, you can resell it but you certainly do not buy the right to redistribute it. That right rests with the copyright owner.

Copyright subsists in a work rather than in any particular embodiment of it. A work is an intangible intellectual object, of which a document is a physical manifestation. Lawyers use the word subsists, which in English means “to remain or continue in existence,” because copyright has no separate existence without the work. Copyright protects the way ideas are expressed, not the ideas themselves. Two works that express the same idea in different ways are independent in copyright law.

Who owns a particular work? The creator is the initial copyright owner, unless the work is made for hire. If the work is created by an employee within the scope of her employment, or under a specific contract that explicitly designates it as being made for hire, it is the employer or contracting organization that owns the copyright. Any copyright owner can transfer or “assign” copyright to another party through a specific contract, made in writing and signed. Typically an author sells copyright to a publisher (or grants an exclusive license), who reproduces, markets, and sells the work.

The copyright owner has the exclusive right to do certain things with the work: thus copyright is sometimes referred to as a “bundle” of rights. In general, there are four rights, though details vary from country to country. The reproduction right allows the owner to reproduce the work freely. The distribution right allows him to distribute it. This is a one-time right: once a copy has been distributed, the copyright owner has no right to control its subsequent distribution. For example, if you buy a book, you can do whatever you want with your copy—such as resell it. The public lending right compensates authors for public lending of their work—though an exception is granted for not-for-profit and educational uses, which do not require the copyright holder’s consent. The remaining rights, called other rights, include permitting or refusing public performance of the work, and making derivative works such as plays or movies.

Copyright law is complex, arcane, and varies from one country to another. The British Parliament adopted the first copyright act in 1710; the U.S. Congress followed suit in 1790. Although copyright is national law, most countries today have signed the Berne Convention of 1886, which lays down a basic international framework. According to this Convention, copyright subsists without formality, which means that (unlike patents) it does not depend on registering a work with the authorities or depositing a copy in a national library. It applies regardless of whether the document bears the international copyright symbol ©. You automatically have copyright over works you create (unless they are made for hire). Some countries—including the United States—maintain a copyright registry even though they have signed the Berne Convention; registration makes it easier for a copyright holder to take legal action against infringers. Nevertheless, copyright still subsists even if the work has not been registered.

The Berne Convention decrees that it is always acceptable to make limited quotations from protected works, with acknowledgement and provided it is done fairly. The United States has a particular copyright principle called “fair use” which allows material to be copied by individuals for research purposes. The U.K. equivalent, which has been adopted by many countries whose laws were inherited from Britain in colonial times, is called “fair dealing” and is slightly more restrictive than fair use.

Making copies of works under copyright for distribution or resale is prohibited. That is the main economic point of the copyright system. The Berne Convention also recognizes certain moral rights. Unlike economic rights, these cannot be assigned or transferred to others; they remain with the author forever. They give authors the right to the acknowledgment of their authorship and to the integrity of their work—which means that they can object to a derogatory treatment of it.

THE PUBLIC DOMAIN

Works not subject to copyright are said to be in the “public domain,” which comprises the cultural and intellectual heritage of humanity that anyone may use or exploit. Often, works produced by the government are automatically placed in the public domain, or else the government sets out generous rules for their use by not-for-profit organizations. This applies only in the country of origin: thus, works produced by the U.S. government are in the public domain in the United States but are subject to U.S. copyright rules in other countries.

Copyright does not last forever, but eventually expires. When that happens, the work passes into the public domain, free of all copyright restrictions. No permission is needed to use it in any way, incorporate any of its material into other works, or make any derivative works. You can copy it, sell it, excerpt it—or digitize it and put it on the web. The author’s moral rights still hold, however, so you must provide due attribution.

Today the internationally agreed Berne Convention sets out a minimum copyright term of life plus 50 years, that is, until 50 years after the author dies. This presents a practical difficulty, for it is often hard to find out when the author died. One way is to consult the authors’ association in the appropriate country, which maintains links to databases kept by similar associations around the world.

The duration of copyright has an interesting and controversial history. Many countries specify a longer term than the minimum, and this changes over the years. The original British 1710 act provided a term of 14 years, renewable once if the author was alive; it also decreed that all works already published by 1710 would get a single term of 21 further years. The 1790 U.S. law followed suit, with a 14-year once-renewable term. Again, if an author did not renew copyright, the work automatically passed into the public domain.

In 1831, the U.S. Congress extended the initial period of copyright from 14 to 28 years, and in 1909 it extended the renewal term from 14 to 28 years, giving a maximum term of 56 years. From 1962 onward, it enacted a continual series of copyright extensions, some one or two years, others 19 or 20 years. In 1998, Congress passed the Sonny Bono Copyright Term Extension Act, which extended the term of existing and future copyrights by 20 years.46

The motivation behind these moves comes from large, powerful corporations who seek protection for a minuscule number of cultural icons; opponents call them the “Mickey Mouse” copyright extensions. Many parts of the world (notably the United Kingdom) have followed suit by extending their copyright term to life plus 70 years—and in some countries (e.g., Italy), the extension was retroactive so that books belonging to the public domain were suddenly removed from it.

The upshot is that copyright protection ends at different times depending on when the work was created. It also begins at different times. In the United States, older works are protected for 95 years from the date of first publication. Through the 1998 Copyright Extension Act, newer ones are protected from the “moment of their fixation in a tangible medium of expression” until 70 years after the author’s death. Works made for hire—that is, ones belonging to corporations—are protected for 95 years after publication or 120 years after creation, whichever comes first.

The original copyright term was renewable once, if the copyright holder wished to do so. In fact, few did. Focusing again on the United States, in 1973 more than 85 percent of copyright holders failed to renew their copyright, which meant that, at the time, the average term of copyright was just 32 years. Today there is no renewal requirement for works created before 1978: copyright is automatically given for a period of 95 years—tripling the average duration of protection.

No copyrights will expire in the twenty-year period between 1998 and 2018. To put this into perspective, one million patents will pass into the public domain during the same period. The effect of this extension is dramatic. Of the 10,000 books published in 1930, only a handful (less than 2 percent) are still in print. If the recent copyright extensions had not occurred, all 10,000 of these books would by now be in the public domain, their copyright having expired in 1958 (after 28 years) if it was not renewed, or 1986 (after a further 28 years) if it was renewed. Unfortunately, that is not the case. If you want to digitize one of these books and make it available on the Internet, you will have to seek the permission of the copyright holder. No doubt 98 percent of them would be perfectly happy to give you permission. Indeed, many of these books are already in the public domain, having expired in 1958. You could find out which, for there is a registry of copyright renewal (though it is not available online) that lists the books that were renewed in 1959. However, for the remaining works, you would have to contact the copyright holders, for which there is no such registry. There will be a record from 1930, and again from 1959, of who registered and extended the copyright. But you’d have to track these people down.
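The arithmetic behind these dates can be sketched in a few lines. The function names here are ours, and the sketch simply restates the terms described above—it is an illustration, not legal advice:

```python
def expiry_1909_act(published, renewed):
    """Expiry year under the 1909 U.S. act: a 28-year term,
    plus a further 28 years if the copyright was renewed."""
    return published + 28 + (28 if renewed else 0)

def expiry_after_1998_extension(published):
    """Expiry year for older works after the 1998 extension:
    95 years from first publication."""
    return published + 95

# A book published in 1930:
print(expiry_1909_act(1930, renewed=False))   # 1958
print(expiry_1909_act(1930, renewed=True))    # 1986
print(expiry_after_1998_extension(1930))      # 2025
```

The gap between 1986 and 2025 is the windfall that the extensions conferred—nearly four extra decades during which a 1930 book remains locked out of the public domain.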

The problem is a large one. It has been estimated that of the 32 million unique titles that are to be found in America’s collective libraries, only 6 percent are still in print, 20 percent have now passed into the public domain, and the remainder—almost 75 percent—are orphan works: out of print but still under copyright. As Lawrence Lessig put it in his book Free Culture (from which much of the preceding information was taken),

Now that technology enables us to rebuild the library of Alexandria, the law gets in the way. And it doesn’t get in the way for any useful copyright purpose, for the purpose of copyright is to enable the commercial market that spreads culture. No, we are talking about culture after it has lived its commercial life. In this context, copyright is serving no purpose at all related to the spread of knowledge. In this context, copyright is not an engine of free expression. Copyright is a brake.

Lessig (2004, p. 227)

RELINQUISHING COPYRIGHT

As we have explained, the Berne Convention makes copyright apply without formality, without registration. Anything you write, every creative act that is “fixed in a tangible medium of expression”—be it a book, an e-mail, or a grocery list—is automatically protected by copyright until 50 years after you die (according to the Berne Convention’s minimum restrictions) or, today in the United States, 70 years after you die (assuming you did not write your grocery list as a work made for hire). People can quote from it under the principle of fair use, but they cannot otherwise use your work until such time as copyright expires, unless you reassign the copyright.

If you wish to relinquish this protection, you must take active steps to do so. In fact, it’s quite difficult. To facilitate this, a nonprofit organization called the Creative Commons has developed licenses that people can attach to the content they create. Each license is expressed in three ways: a legal version, a human-readable description, and a machine-readable tag. Content is marked with the CC mark, which does not mean that copyright is waived but that freedoms are given to others to use the material in ways that would not be permissible under copyright.

These freedoms all go beyond traditional fair use, but their precise nature depends on your choice of license. One license permits any use so long as attribution is given. Another permits only noncommercial use. A third permits any use within developing nations. Or any educational use. Or any use except for the creation of derivative works. Or any use so long as the same freedom is given to other users. Most important, according to the Creative Commons, is that these licenses express what people can do with your work in a way they can understand and rely upon without having to hire a lawyer. The idea is to help reconstruct a viable public domain.

The term “copyleft” is sometimes used to describe a license on a derivative work that imposes the same terms as imposed by the license on the original work. The GNU Free Documentation License is a copyleft license that is a counterpart to the GNU General Public License for software. It was originally designed for manuals and documentation that often accompany GNU-licensed software. The largest project that uses it is Wikipedia, the online encyclopedia mentioned in Chapter 2 (page 65). Its main stipulation is that all copies and derivative works must be made available under the very same license.

COPYRIGHT ON THE WEB

The way that computers work has led people to question whether the notion of a copy is the appropriate foundation for copyright law in the digital age. Legitimate copies of digital information are made so routinely that the act of copying has lost its applicability in regulating and controlling use on behalf of copyright owners. Computers make many internal copies whenever they are used to access information: the fact that a copy has been made says little about the intention behind the behavior. In the digital world, copying is so bound up with the way computers work that controlling it provides unexpectedly broad powers, far beyond those intended by copyright law.

As a practical example, what steps would you need to take if you wanted to digitize some documents and make them available publicly or within your institution? First, you determine whether the work to be digitized is in the public domain or attempts to faithfully reproduce a work in the public domain. If the answer to either question is yes, you may digitize it without securing anyone’s permission. Of course, the result of your own digitization labor will not be protected by copyright either, unless you produce something more than a faithful reproduction of the original. If material has been donated to your institution and the donor is the copyright owner, you can certainly go ahead, provided the donor gave your institution the right to digitize. Even without a written agreement, it may reasonably be assumed that the donor implicitly granted the right to take advantage of new media, provided the work continues to be used for the purpose for which it was donated. You do need to ensure, of course, that the donor is the original copyright owner and has not transferred copyright. You cannot, for example, assume permission to digitize letters written by others.

If you want to digitize documents and the considerations just noted do not apply, you should consider whether you can go ahead under the concept of fair use. This is a difficult judgment to make. You need to reflect on how things look from the copyright owner’s point of view, and address their concerns. Institutional policies about who can access the material, backed up by practices that restrict access appropriately, can help. Finally, if you conclude that fair use does not apply, you will have to obtain permission to digitize the work or acquire access by licensing it.

People who build and distribute digital collections must pay serious attention to the question of copyright. Such projects must be undertaken with a full understanding of ownership rights, and full recognition that permissions are essential to convert materials that are not in the public domain. Because of the potential for legal liability, prudent collection-builders will consider seeking professional advice. A full account of the legal situation is far beyond the scope of this book, but the Notes and Sources section at the end contains some pointers to sources of further practical information about copyright, including information on how to interpret fair use and the issues involved when negotiating copyright permission or licensing.

Looking at the situation from an ethical rather than a legal point of view helps crystallize the underlying issues. It is unethical to steal: deriving profit by distributing a book on which someone else has rightful claim to copyright is wrong. It is unethical to deprive someone of the fruit of their labor: giving away electronic copies of a book on which someone else has rightful claim to copyright is wrong. It is unethical to pass someone else’s work off as your own: making any collection without due acknowledgement is wrong. It is unethical to willfully misrepresent someone else’s point of view: modifying documents before including them in the collection is wrong even if authorship is acknowledged.

WEB SEARCHING AND ARCHIVING

What are the implications of copyright for web searching and archiving? These activities are in a state of rapid transition. It is impractical, and also inappropriate, for legal regulation to try to keep up with a technology in transition. If any legislation is needed, it should be designed to minimize harm to interests affected by technological change while enabling and encouraging effective lines of development. Legislators are adopting a wait-and-see policy, while leading innovators bend over backward to ensure that what they do is reasonable and accords with the spirit—if not necessarily the letter—of copyright law.

As you know, search engines are among the most widely used Internet services. We learned in Chapter 3 (page 71) that websites can safeguard against indiscriminate crawling by using the robot exclusion protocol to prevent their sites from being downloaded and indexed. This protocol is entirely voluntary, though popular search engines certainly comply. But the onus of responsibility has shifted. Previously, to use someone else’s information legitimately, one had to seek permission from the copyright holder. Now—reflecting the spirit of the web from the very beginning—the convention is to assume permission unless the provider has set up a blocking mechanism. This is a key development with wide ramifications. And some websites threaten dire consequences for computers that violate the robot exclusion protocol, such as denial of service attacks that will effectively disable the violating computer. This is law enforcement on the wild web frontier.
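To illustrate how a well-behaved crawler honors the protocol, here is a minimal sketch using Python’s standard urllib.robotparser module; the site, path, and crawler name are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that asks all crawlers to stay out of /private/:
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler consults the rules before fetching each page:
print(parser.can_fetch("MyCrawler", "http://example.org/index.html"))      # True
print(parser.can_fetch("MyCrawler", "http://example.org/private/a.html"))  # False
```

Note that nothing enforces the check: the crawler itself must choose to call can_fetch and respect the answer, which is exactly why the protocol is described as voluntary.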

Other copyright issues are raised by projects such as the Internet Archive (mentioned in Chapter 2, page 51) that are storing the entire World Wide Web in order to supply documents that are no longer available and maintain a “copy of record” that will form the raw material for historical studies. Creating such an archive raises many interesting issues involving privacy and copyright, issues that are not easily resolved.

What if a college student created a Web page that had pictures of her then current boyfriend? What if she later wanted to “tear them up,” so to speak, yet they lived on in the archive? Should she have the right to remove them? In contrast, should a public figure—a U.S. senator, for instance—be able to erase data posted from his or her college years? Does collecting information made available to the public violate the “fair use” provisions of copyright law?

Kahle (1997)

Copyright is a complex and slippery business. Providing a pointer to someone else’s document is one thing, whereas in law, serving up a copy of a document is quite a different matter. There have even been lawsuits about whether it is legal to link to a web document, on the basis that you could misappropriate advertising revenue that rightfully belongs to someone else by attaching a link to their work with your own advertising. The web is pushing at the frontiers of society’s norms for dealing with the protection and distribution of intellectual property.

The legal system is gradually coming to grips with the issues raised by web search. In 2004, a writer filed a lawsuit claiming that Google infringed copyright by archiving a posting of his and by providing excerpts from his website in their search results. In a 2006 decision, a U.S. district court judge likened search engines to Internet service providers in that they store and transmit data to users without human intervention, and ruled that these activities do not constitute infringement because “the necessary element of volition is missing.” Earlier that year, a federal judge in Nevada concluded that Google’s cached versions of web pages do not infringe copyright. However, at the same time, a Los Angeles federal judge opined that its image search feature, in displaying thumbnail versions of images found on an adult photo site, does violate U.S. copyright law.

Those who run public Internet information services tell fascinating tales of people’s differing expectations of what it is reasonable for their services to do. Search companies receive calls from irate users who have discovered that documents of theirs, which they believe should never have been indexed, turn up in search results. Sometimes users believe their pages couldn’t possibly have been captured legitimately because there are no links to them—whereas, in fact, at one time a link existed that they have overlooked. You might put confidential documents into a folder that is open to the web, perhaps only momentarily while you change its access permissions, only to have them grabbed by a search engine and published for the entire world to find. Fortunately, major search engines make provision for removing cached pages following a request by the page author.

Search technology makes information readily available that, though public in principle, was previously impossible to find in practice. Usenet is a huge corpus of Internet discussion groups on a wide range of topics dating from the 1980s—well before anyone would have thought Internet searching possible. When a major search engine took over the archives, it received pleas from contributors to retract indiscreet postings they had made in their youth, because, being easily available for anyone to find, they were now causing their middle-aged authors embarrassment.

THE WIPO TREATY

A treaty adopted by the World Intellectual Property Organization (WIPO) in 1996 addresses the copyright issues raised by digital technology and networks in the modern information era. It decrees that computer programs should be protected as literary works and that the arrangement and selection of material in databases is protected. It provides authors of works with more control over their rental and distribution than the Berne Convention does. It introduces an important but controversial requirement that countries must provide effective legal measures against the circumvention of technical protection measures (digital rights management schemes, mentioned in Chapter 2, page 56) and against the removal of electronic rights management information, that is, data identifying the author and other details of the work.

Many countries have not yet implemented the WIPO Treaty into their laws. One of the first pieces of national legislation to do so was the U.S. Digital Millennium Copyright Act (DMCA), one of whose consequences is that it may be unlawful to publish information that exposes the weaknesses of technical protection measures. The European Council has approved the treaty on behalf of the European Community and has issued directives that largely cover its subject matter.

THE BUSINESS OF SEARCH

The major search engines were born not by design, but almost by chance, through the ingenuity of talented young graduate students who worked on challenging problems for the sheer joy of seeing what they could do. They realized how rewarding it would be to help others dig useful information out of the chaotic organization of the web. The spirit of entrepreneurship helped turn the rewards into immensely profitable businesses. Some were more successful than others: luck, timing, skill, and inspiration all played a role. Early market shakedowns saw many acquisitions and mergers.

By the end of 2005, three major search engines (Google, Yahoo, and MSN) accounted for more than 80 percent of online searches in the United States. Measurements are made in the same way as for TV ratings: around 1 million people have agreed to have a meter installed on their computer to monitor their search behavior. And these three were serviced by only two paid advertising providers (Google uses its own AdWords/AdSense; MSN and Yahoo both use Overture, owned by Yahoo).

Google leads the growing market of paid search advertisements. Its $6 billion revenue in 2005 almost doubled the previous year’s figure and was predicted to increase by 40 percent in 2006 and maintain a compound annual growth of 35 to 40 percent over the following five years. By 2005, Google was already larger than many newspaper chains and TV channels; during 2006, it became the fourth largest media company in the United States (after Viacom, News Corporation, and Disney, but ahead of giants such as NBC Universal and Time Warner).

Paid search accounts for nearly half of all Internet advertising revenue (in 2005). Spending on online advertising is projected to double by 2010, when it will constitute more than 10 percent of the general advertising market in the United States. One-fifth of Internet users already consider online advertising more effective than radio, newspapers, and magazines. Not only will paid search take the lion’s share of online advertising, it will also drain revenue from conventional broadcast media.

THE CONSEQUENCES OF COMMERCIALIZATION

This lively commercial interest pushes in two directions. On the one hand, websites strive to gain good positions in search engine rankings in order to further their commercial visibility. On the other, search engines strive to outdo one another in the quality of service they offer in order to attract and retain customers. The resulting tension explains the web wars of Chapter 5. There is huge commercial value in being listed early in the results of search engines. After all, there’s no need to waste money on advertising if you are already at the top of the class! It means life or death to many small online businesses.

Countless poignant real-life stories describe casualties of the war between search engines and spammers. You start a small specialist Internet business. Quickly realizing the importance of having high rankings on major search engines, you subscribe to a commercial service that promises to boost them. Little do you know that it utilizes dubious techniques for website promotion of the kind described in Chapter 5. In fact, such considerations are quite outside your experience, knowledge, and interests—your expertise is in your business, not in how the web works. The crash comes suddenly, with no warning whatsoever. Overnight, your ranking drops from top of the search results for relevant queries to the millionth position. Only later do you figure out the problem: the search engine has decided that your promotional efforts amount to spam.

The web is a capricious and erratic place. Meanwhile, users (as we saw earlier) place blind trust in their search results, as though they represented objective reality. They hardly notice the occasional seismic shifts in the world beneath their feet. They feel solidly in touch with their information, blissfully unaware of the instability of the mechanisms that underlie search. For them, the dragons are omniscient.

THE VALUE OF DIVERSITY

You might recall from Chapter 2 that France’s national librarian warned of the risk of “crushing American domination in the definition of how future generations conceive the world” and threatened to build a European version (page 54). Shortly afterward, Jacques Chirac, President of the French Republic, initiated a mega-project: a Franco-German joint venture to create European search engine technology, funded from both public coffers and private enterprise. Quaero, Latin for I search, is a massive effort explicitly aimed at challenging U.S. supremacy in web search. The project involves many research institutes in France and Germany, as well as major businesses. Project leaders understand that they must be creative to succeed in the ultra-competitive market of online search. Quaero will target its services to various market areas. One is general web search; another will be designed for businesses that work with digital content; and a third version is aimed at the mobile phone market.

European leaders believe it is politically unwise to leave the fabulously profitable and influential business of online search to foreign entrepreneurs. In addition to economic concerns, they worry about the cultural value of services that organize and access the information on the web, services that are used to preserve and transmit cultural heritage. The Franco-German project aims to counterbalance perceived American cultural hegemony with a European perspective. But wait, there’s more. Japan has decided to create its own national search engine, through a partnership of major Japanese technical and media companies: NEC, Fujitsu, NTT, Matsushita, and Nippon public television (NHK).

Large cooperative enterprises, organized on a national or regional scale with a component of public leadership and funding, are a far cry from the lone young geniuses working for love rather than money, who created the search engines we have today and grew into talented entrepreneurs whose dragons are breathing fire at the advertising legends of Madison Avenue. Will the new approach lead to better search? If the object is really to forestall cultural hegemony, will search engines be nationalized and funded from taxes? Does jingoism have a role to play in the international web world? These are interesting questions, and it will be fascinating to watch the results play out—from the safety of the sidelines.

However, we believe these efforts are addressing the wrong problem. The real issue is not one of cultural dominance in a nationalistic sense, but rather cultural dominance through a universal, centralized service, or a small set of services that all operate in essentially the same way. The activity of searching should recognize the multifarious nature of the web, whose unique characteristic is the collective production of an immense panoply of visions, the coexistence of countless communities. What users need is not a competition between a few grand national perspectives, but inbuilt support for the very idea of different perspectives. While it is understandable for them to want a piece of the action, the Europeans and Japanese have declared war on the wrong thing.

In order to truly represent the immense richness of the treasure, preserve it, and provide a basis for its continued development and renewal, we need an equally rich variety of methods to access it, methods that are based on different technologies and administered by different regimes—some distributed and localized, others centralized. Not only is content socially determined; so is the search process itself. Although we talk of search as though it were a single concept, it covers a grand scope of activities that share only a broad family resemblance. We search for different kinds of objects, from apples to archangels, from biology to biography, from cattle prods to cathodes. And different objects imply different kinds of search. When we seek an evening’s entertainment, we are not doing the same thing as when we seek precedents for a homicide case. Different activities imply different criteria and involve different elements for success.

Search is a voyage of understanding. In the real world, early European explorers set the scene for what was to become a huge decrease in our planet’s ecological diversity. Today, in the web world, we are concerned for the survival of diversity in our global knowledge base.

PERSONALIZATION AND PROFILING

Sometimes we search for a particular object, such as an article we read long ago or the name of the heroine of Gone with the Wind. More often, what we seek is imprecise; the question is not well formulated, and our goal is unclear even to ourselves. We are looking more for inspiration than for an answer. Wittgenstein, the twentieth-century philosopher we met in Chapter 1, thought that search is more about formulating the right question than finding the answer to it. Once you know the question, you also have the response. When we embark on a quest online, we are looking more for education than for a fact. Too often, what we get is a factoid—a brief item that is factual but trivial. The trouble is, this affects the way we think and what we tend to value.

History and legend are replete with quests that are ostensibly a search for a physical treasure (the Golden Fleece, the Holy Grail, a missing wife) but end up transforming the hero and his friends from callow youths to demigods (Jason and the Argonauts, the medieval knights Parsifal, Galahad, and Bors, D’Artagnan and his trusty musketeers). Are today’s search tools up to supporting epic quests? If only he’d had a search engine, Jason might have ended up as he began, an unknown political refugee from Thessaly.

Everyone who works in the search business knows that to respond properly to a question, it is necessary to take the user, and his or her context, into account. Keyword input is not enough. Even a full statement of the question would not be enough (even if computers could understand it). In order to resolve the inevitable ambiguities of a query, it is essential to know the user’s perspective, where she is coming from, why she is asking the question. We need the ability to represent and refine the user’s profile.

From a search engine’s point of view, personalization has two distinct aspects. First, it is necessary to know the user in order to answer his questions in a way that is both pertinent and helpful. Second, knowing the user allows more precisely targeted advertising that maximizes clickthrough rate and hence profit. Google’s AdWords system, for example, is an advanced artificial intelligence application that applies contextual analysis to connect user keywords with sponsored links in the most effective possible way. MSN demonstrated a demographic targeting tool for its contextual advertising service at the 2005 Search Engine Strategies Conference in Chicago. Perhaps search could be even more effective if the same degree of attention were paid to exploiting context in generating the search results themselves, in order to satisfy the user’s real information need.

User profiles are based on two kinds of information: the user’s “clickstream” and his geographical location. Search engines can analyze each user’s actions in order to get behind individual queries and understand more of the user’s thoughts and interests. It is difficult to encapsulate a request into a few keywords for a query, but by taking history into account, search engines might be able to fulfill your desires more accurately than you can express them. Localization provides a powerful contextual clue: you can be given a response that is personalized to your particular whereabouts—even, with GPS, when you are browsing from a mobile phone. This maximizes advertising revenue by sharply focusing commercial information to geographically relevant customers. It increases clickthrough by not overloading users with irrelevant advertisements. It saves sponsors from wasting their money on pointless advertising that may even aggravate potential customers. It provides an attractive alternative to broadcast media like TV advertising.

Personalization is a major opportunity for search companies. Personal service that disappeared when we abandoned village shops for centralized supermarkets is making a comeback on the web. Leading online marketers like Amazon already provide impressive personal service. A user’s location and clickstream give deep insight into what she is doing and thinking. The privacy dilemma emerges in a different guise. The tradeoff between personalization and privacy will be one of the most challenging questions to arise in years to come.

Web search is by no means the only field where new technology raises serious privacy concerns. Consider the vast store of call information collected by telecommunication companies, information that is even more sensitive. In principle, search technology could retrieve information from automatic transcriptions created by state-of-the-art speech recognition systems. Policies can change and evolve in different ways, and users should be keenly aware of the risks associated with the use of any information service—particularly ones that are funded from advertising revenue.

SO WHAT?

Information is not as neutral as you might have supposed. Ever since humans began keeping written records, control over the supply and archiving of information has been a powerful social, political, and historical force. But the web is no ordinary archive: it’s created and maintained by everyone—an archive of the people, for the people, and by the people.

Access, however, is mediated by the dragons. They strive to determine the authority of each page as judged by the entire web community, and reflect this in their search results. Though it sounds ultra-democratic, this has some unexpected and undesirable outcomes: the rich get richer, popularity is substituted for genuine authority, and the sensitive balance between privacy and censorship is hard to maintain. In a global village that spans the entire world, the tyranny of the majority reigns.

How will these challenges be met? While we cannot predict the future, the next and final chapter gives some ideas. An important underlying theme is the reemergence of communities in our world of information.

WHAT CAN YOU DO ABOUT ALL THIS?

• Hunt down private information about some important personage.

• Next time you fill out a web form, analyze each field to determine why (and whether) the site needs this information.

• Find the “prejudice map” and learn how it works.

• Compare results for sensitive searches in different countries.

• Learn what your country’s doing about the WIPO copyright treaty.

• Slap a Creative Commons license on the next thing you write.

• Read the story of someone who feels a dragon has threatened his livelihood.

• Find discussions on the web about blogs and democracy.

• Try separate searches for countries—for example, united states, united kingdom, south africa, new zealand—and compare the results.

• Investigate the privacy policy of major search engines.

NOTES AND SOURCES

Tim Berners-Lee wrote the prophetic words at the beginning of this chapter in 1989. The Pew Internet & American Life Project[47] has surveyed search engine users (the results quoted on page 178 are from a January 2005 survey). A pioneering paper on information ethics is Mason (1986). Brennan and Johnson (2004), Moore (2005), and Spinello and Tavani (2006) are recent collections that discuss the ethical issues raised by information technology. In 2005, the International Review of Information Ethics devoted a special issue to ethical questions concerning search engines (Vol. 3).

Derrida’s (1996) charming, though complex, monograph on Freud’s inheritance gives a provocative and stimulating account of the role of archives in contemporary society; we have drawn upon the interesting commentary by Steedman (2001). Introna and Nissenbaum (2000), writing well before most people realized that there was any problem at all, discuss the links between web democracy and search engines. A well-informed book that raises this issue is Battelle’s The Search (2005), which includes many fascinating facts about the commercialization of search engines. The study of web snapshots taken seven months apart (page 183) is described by Cho and Roy (2004); it was they who modeled the random searcher and contrasted him with the random surfer to obtain the extraordinary result that it takes 60 times longer for a new page to become popular under random searching. Belew (2000) gives a theoretical perspective on traps that people fall into when finding out new information.

The notion of privacy as a limitation of other people’s access to an individual is defended by Gavison (1980). Various modalities of protection of personal data, and the two principles that relate to online privacy, are identified in European Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and the free movement of such data. This directive is discussed by Elgesem (1999) and can be contrasted with the American perspective found in Nagenborg (2004). Lessig (1999, 2002) argues that privacy should be considered as property, though this view is by no means universal. Tavani and Moor (2000) argue for the restricted-control approach.

The information about the dissident Chinese journalist Shi Tao of the Dangdai Shang Bao (Contemporary Business News) is from a September 2004 press release from Reporters Without Borders[48] (see also the Washington Post, September 11, 2005).[49] The OpenNet Initiative (2005) has published a report on the filtering of the Internet in China. Hinman (2005) contains an interesting discussion of online censorship, with many pointers to further information. There are websites that display Google results from selected countries (e.g., the United States and China) side by side, to allow you to explore the differences for yourself.[50] Document queries for falun gong and image queries for tiananmen yield interesting comparisons.

Turning now to copyright, Samuelson and Davis (2000) provide an excellent and thought-provoking overview of copyright and related issues in the Information Age, which is a synopsis of a larger report published by the National Academy of Sciences Press (Committee on Intellectual Property Rights, 2000). An earlier paper by Samuelson (1998) discusses specific digital library issues raised by copyright and intellectual property law, from a U.S. perspective. The Association for Computing Machinery has published a collection of papers on the effect of emerging technologies on intellectual property issues (White, 1999). We learned a lot from Lessig’s great book Free Culture (Lessig, 2004), which has strongly influenced our perspective. Lessig is also the originator of the Creative Commons, whose licenses can be found on the web[51]—as can the GNU FDL license.[52] The lawsuit we cited on page 200 claiming that Google infringed copyright is Parker v. Google; the one about the legality of cached copies is Field v. Google; and the one concerning image search is Perfect 10 v. Google (Perfect 10 is a purveyor of nude photographs).

There’s plenty of information on copyright on the web. Staff at Virginia Tech have developed a useful site to share what they learned about policies and common practices relating to copyright.[53] It includes interpretations of U.S. copyright law, links to the text of the law, sample letters to request permission to use someone else’s work, links to publishers, advice for authors about negotiating to retain some rights, as well as current library policies. Georgia Harper at the University of Texas at Austin has created an excellent “Crash Course in Copyright” that is delightfully presented and well worth reading.[54] Some information we have presented about the duration of copyright protection is from the website of Lolly Gasaway, director of the Law Library and professor of law at the University of North Carolina.[55]

Two major measurement services of search engine performance are comScore’s qSearch system and Nielsen/NetRatings’ MegaView Search reporting service. The revenue and growth predictions for Google (page 202) are from Goldman Sachs and Piper Jaffray, a securities firm. Figures regarding advertising revenues come from the Interactive Advertising Bureau (IAB); projections are by Wang (2005). Two examples of the casualties of the war between search engines and spammers are the sad tales of exoticleatherwear.com and 2bigfeet.com, from the Wall Street Journal (February 26, 2003) and Chapter 7 of Battelle (2005), respectively.

[Figure: A human in the node, and the insidious braids of control.]

[41] The name is an acronym for Uniting and Strengthening America by Providing Appropriate Tools Required to Intercept and Obstruct Terrorism. The Act was passed in 2001 and was so controversial that it had to be renewed by the U.S. Senate every year; most of its provisions were finally made permanent in 2006. The Electronic Privacy Information Center (EPIC) has more information.

[46] The Act is named in memory of former musician Sonny Bono, who, according to his widow, believed that “copyrights should be forever.”

[47] www.pewinternet.org

[48] www.rsf.org

[49] www.washingtonpost.com/wp-dyn/content/article/2005/09/10/AR2005091001222_pf.html

[50] For example, CenSEARCHip at homer.informatics.indiana.edu/censearchip and ComputerBytesMan at www.computerbytesman.com/google

[51] creativecommons.org

[52] www.gnu.org/copyleft/fdl.html

[53] scholar.lib.vt.edu/copyright

[54] www.utsystem.edu/ogc/intellectualproperty/cprtindx.htm

[55] www.unc.edu/~unclng/public-d.htm
