PREFACE

In the eye-blink that has elapsed since the turn of the millennium, the lives of those of us who work with information have been utterly transformed. Much, most, perhaps even all, of what we need to know is on the World Wide Web; if not today, then tomorrow. The web is where society keeps the sum total of human knowledge. It’s where we learn and play, shop and do business, keep up with old friends and meet new ones. And what has made all this possible is not just the fantastic amount of information out there but also a fantastic new technology: search engines. Efficient and effective ways of searching through immense tracts of text are among the most striking technical advances of the last decade, and today search engines do the searching for us. They weigh and measure every web page to determine whether it matches our query. And they do it all for free. We call on them whenever we want to find something that we need to know. To learn how they work, read on!

We refer to search engines as “web dragons” because they are the gatekeepers of our society’s treasure trove of information. Dragons are all-powerful figures that stand guard over great hoards of treasure. The metaphor fits. Dragons are mysterious: no one really knows what drives them. They’re mythical: the subject of speculation, hype, legend, old wives’ tales, and fairy stories. In this case, the immense treasure they guard is society’s repository of knowledge. What could be more valuable than that? In oriental folklore, dragons not only enjoy awesome grace and beauty, they are endowed with immense wisdom. But in the West, they are often portrayed as evil—St. George vanquishes a fearsome dragon, as does Beowulf—though sometimes they are friendly (Puff). In both traditions, they are certainly magic, powerful, independent, and unpredictable. The ambiguity suits our purpose well because, in addition to celebrating the joy of being able to find stuff on the web, we want to make you feel uneasy about how everyone has come to rely on search engines so utterly and completely.

The web is where we record our knowledge, and the dragons are how we access it. This book examines their interplay from many points of view: the philosophy of knowledge; the history of technology; the role of libraries, our traditional knowledge repositories; how the web is organized; how it grows and evolves; how search engines work; how people and companies try to take advantage of them to promote their wares; how the dragons fight back; who controls information on the web and how; and what we might see in the future.

We have laid out our story from beginning to end, starting with early philosophers and finishing with visions of tomorrow. But you don’t have to read this book that way: you can start in the middle. To find out how search engines work, turn to Chapter 4. To learn about web spam, go to Chapter 5. For social issues about web democracy and the control of information, head straight for Chapter 6. To see how the web is organized and how its massively linked structure grows, start at Chapter 3. To learn about libraries and how they are finding their way onto the web, go to Chapter 2. For philosophical and historical underpinnings, read Chapter 1. Most books you start at the beginning and give up on when you run out of time or have had enough; we recommend that you consider reading this one starting in the middle and, if you can, continuing right to the end. You don’t really need the early chapters to understand the later parts, though they certainly provide context and add depth. To help you chart a passage, here’s a brief account of what each chapter has in store.

The information revolution is creating turmoil in our lives. For years it has been opening up a wondrous panoply of exciting new opportunities and simultaneously threatening to drown us in them, dragging us down, gasping, into murky undercurrents of information overload. Feeling confused? We all are. Chapter 1 sets the scene by placing things in a philosophical and historical context. The web is central to our thinking, and the way it works resembles the very way we think—by linking pieces of information together. Its growth reflects the growth in the sum total of human knowledge. It’s not just a storehouse into which we drop nuggets of information or pearls of wisdom. It’s the stuff out of which society’s knowledge is made, and how we use it determines how humankind’s knowledge will grow. That’s why this is all so important. How we access the web is central to the development of humanity.

The World Wide Web is becoming ever larger, qualitatively as well as quantitatively. It is slowly but surely beginning to subsume “the literature,” which up to now has been locked away in libraries. Chapter 2 gives a bird’s-eye view of the long history of libraries and then describes how today’s custodians are busy putting their books on the web, and in their public-spirited way giving as much free access to them as they can. Initiatives such as Project Gutenberg, the Million Book Project (a collaboration among the United States, China, and India), and the Open Content Alliance are striving to create open collections of public domain material. Web bookstores such as Amazon present pages from published works and let you sample them. Google is digitizing the collections of major libraries and making them searchable worldwide. We are witnessing a radical convergence of online and print information, and of commercial and noncommercial information sources.

Chapter 3 paints a picture of the overall size, scale, construction, and organization of the web, a big picture that transcends the details of all those millions of websites and billions of web pages. How can you measure the size of this beast? How fast is it growing? What about its connectivity: is it one network, or does it drop into disconnected parts? What’s the likelihood of being able to navigate through the links from one randomly chosen page to another? You’ve probably heard that complete strangers are joined by astonishingly short chains of acquaintanceship: one person knows someone who knows someone who…through about six degrees of separation…knows the other. How far apart are web pages? Does this affect the web’s robustness to random failure—and to deliberate attack? And what about the deep web—those pages that are generated dynamically in response to database queries—and sites that require registration or otherwise limit access to their contents?

Having surveyed the information landscape, Chapter 4 tackles the key ideas behind full-text searching and web search engines, the Internet’s new “killer app.” Although search engines are intricate pieces of software, the underlying ideas are simple, and we describe them in plain English. Full-text search is an embodiment of the classical concordance, with the advantage that, being computerized, it works for all documents, no matter how banal, and not just for sacred texts and outstanding works of literature. Multiword queries are answered by combining concordance entries and ranking the results, weighing rare words more heavily than commonplace ones. Web search services augment full-text search with the notion of the prestige of a source, which they estimate by counting the web pages that cite the source and taking into account the prestige of those citing pages, in effect weighting popular works highly. This book focuses exclusively on techniques for searching text, for even when we seek pictures and movies, today’s search engines usually find them for us by analyzing associated textual descriptions.
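
For readers who like to see an idea in code, here is a minimal sketch of the concordance idea described above. It is an illustration under stated assumptions, not the method of any real search engine: the function names `build_concordance` and `search`, the three toy documents, and the particular rarity weight (a logarithm of how few documents contain the word, plus one) are all our own hypothetical choices. It leaves out the prestige of sources altogether.

```python
import math
from collections import defaultdict


def build_concordance(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, num_docs, query):
    """Rank documents by summing a rarity weight for each query word they contain."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, set())
        if not postings:
            continue  # a word that appears nowhere contributes nothing
        # Assumed weighting: the fewer documents contain the word, the more it counts.
        weight = math.log(num_docs / len(postings)) + 1.0
        for doc_id in postings:
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy collection, purely for illustration.
docs = {
    "a": "the dragon guards the treasure",
    "b": "the web is the treasure",
    "c": "the dragon sleeps",
}
index = build_concordance(docs)
results = search(index, len(docs), "the dragon treasure")
print(results)
```

Document "a" comes first because it contains both of the rarer query words, while the commonplace word "the" adds the same small amount to every document and so does nothing to separate them.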

Chapter 5 turns to the dark side. Once the precise recipe for attribution of prestige is known, it can be circumvented, or “spammed,” by commercial interests intent on artificially raising their profile. On the web, visibility is money. It’s excellent publicity—better than advertising—and it’s free. We describe some of the techniques of spamming, techniques that are no secret to the spammers, but will come as a surprise to web users. Like e-mail spam, this is a scourge that will pollute our lives. Search engine operators strive to root it out and neutralize it in an escalating war against misuse of the web. And that’s not all. Unscrupulous firms attack the advertising budget of rival companies by mindlessly clicking on their advertisements, for every referral costs money. Some see click fraud as the dominant threat to the search engine business.

There’s another problem: access to information is controlled by a few commercial enterprises that operate in secret. This raises ethical concerns that have been concealed by the benign philosophy of today’s dominant players and the exceptionally high utility of their product. Chapter 6 discusses the question of democracy (or lack of it) in cyberspace. We also review the age-old system of copyright—society’s way of controlling the flow of information to protect the rights of authors. The fact that today’s web concentrates enormous power over people’s information-seeking activities into a handful of major players has led some to propose that the search business should be nationalized—or perhaps “internationalized”—into public information utilities. But we disagree, for two reasons. First, the apolitical nature of the web—it is often described as anarchic—is one of its most alluring features. Second, today’s exceptionally effective large-scale search engines could only have been forged through intense commercial competition—particularly in a mere decade of development.

We believe that we stand on the threshold of a new era, and Chapter 7 provides a glimpse of what’s in store. Today’s search engines are just the first, most obvious, step. While centralized indexes will continue to thrive, they will be augmented—and for many purposes usurped—by local control and customization. Search engine companies are already experimenting with personalization features, on the assumption that users will be prepared to sacrifice some privacy and identify themselves if they thereby receive better service. Localized rather than centralized control will make this more palatable and less susceptible to corruption. Information gleaned from end users—searchers and readers—will play a more prominent role in directing searches. The web dragons are diversifying from search alone toward providing general information processing services, which could generate a radically new computer ecosystem based on central hosting services rather than personal workstations. Future dragons will offer remote application software and file systems that will augment or even replace your desktop computer. Does this presage a new generation of operating systems?

We want you to get involved with this book. These are big issues. The natural reaction is to concede that they may be important in theory but to question what difference they really make in practice—and anyway, what can you do about them? To counter any feeling of helplessness, we’ve put a few activities at the end of each chapter in gray boxes: things you can do to improve life for yourself—perhaps for others too. If you like, peek ahead before reading each chapter to get a feeling for what practical actions it might suggest.

ACKNOWLEDGMENTS

The seeds for this project were sown during a brief visit by Ian Witten to Italy, sponsored by the Italian Artificial Intelligence Society, and the book was conceived and begun during a more extended visit generously supported by the University of Siena. We would all like to thank our home institutions for their support for our work over the years: the University of Waikato in New Zealand, and the Universities of Siena and Salerno in Italy. Most of Ian’s work on the book was done during a sabbatical period while visiting the École Nationale Supérieure des Télécommunications in Paris, Google in New York (he had to promise not to learn anything there), and the University of Cape Town in South Africa (where the book benefited from numerous discussions with Gary Marsden); the generous support of these institutions is gratefully acknowledged. Marco benefited from insightful discussions during a brief visit to the Université de Montréal, and from collaboration with the Automated Reasoning System division of IRST, Trento, Italy. Teresa would like to thank the Leverhulme Foundation for its generous support and the Logic group at the University of Rome, and in particular Jonathan Bowen, Roberto Cordeschi, Marcello Frixione, and Sandro Nannini for their interesting, wise, and stimulating comments.

In developing these ideas, we have all been strongly influenced by our students and colleagues; they are far too numerous to mention individually but gratefully acknowledged all the same. We particularly want to thank members of our departments and research groups: the Computer Science Department at Waikato, the Artificial Intelligence Research Group at Siena, and the Department of Communication Sciences at Salerno. Parts of Chapter 2 are adapted from How to Build a Digital Library by Witten and Bainbridge; parts of Chapter 4 come from Managing Gigabytes by Witten, Moffat, and Bell.

We must thank the web dragons themselves, not just for providing such an interesting topic for us to write about, but for all their help in ferreting out facts and other information while writing this book. We may be critical, but we are also grateful! In addition, we would like to thank all the authors in the Wikipedia community for their fabulous contributions to the spread of knowledge, from which we have benefited enormously.

The delightful cover illustration and chapter openers were drawn for us by Lorenzo Menconi. He did it for fun, with no thought of compensation, the only reward being to see his work in print. We thank him very deeply and sincerely hope that this will boost his sideline in imaginative illustration. We are extremely grateful to the reviewers of this book, who have helped us focus our thoughts and correct and enrich the text: Rob Akscyn, Ed Fox, Jonathan Grudin, Antonio Gulli, Gary Marchionini, Edie Rasmussen, and Sarah Shieff.

We received sterling support from Diane Cerra and Asma Palmeiro at Morgan Kaufmann while writing this book. Diane’s enthusiasm infected us from the very beginning, when she managed to process our book proposal and give us the go-ahead in record time. Marilyn Rash, our project manager, has made the production process go very smoothly for us.

Finally, without the support of our families, none of our work would have been possible. Thank you Agnese, Anna, Cecilia, Fabrizio, Irene, Nikki, and Pam; this is your book too!
