In this chapter, we will cover:
MapReduce frameworks are well suited for large-scale search and indexing applications. In fact, Google developed the original MapReduce framework specifically to facilitate the many operations involved in web searching. The Apache Hadoop project began as a support project for the Apache Nutch search engine, before being spun off as a separate top-level project.
Web searching consists of fetching, indexing, ranking, and retrieval. Given the size of the data, all of these operations need to be scalable. In addition, retrieval should also provide real-time access. Typically, fetching is performed through web crawling, where crawlers fetch a set of pages from the fetch queue, extract links from the fetched pages, add the extracted links back to the fetch queue, and repeat this process many times. Indexing parses, organizes, and stores the fetched data in a manner that is fast and efficient for querying and retrieval. Search engines perform offline ranking of documents based on algorithms such as PageRank, and real-time ranking of results based on the query.
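The fetch-queue loop described above can be sketched as follows. This is a minimal single-machine illustration, not a real crawler: it uses an in-memory link graph as a stand-in for HTTP fetching and link extraction, and the class name, graph, seed URL, and page limit are all illustrative assumptions.

```java
import java.util.*;

// Sketch of the crawl loop: fetch a page from the queue, extract its
// links, add unseen links back to the queue, and repeat up to a limit.
public class CrawlLoopSketch {
    public static List<String> crawl(Map<String, List<String>> linkGraph,
                                     String seed, int maxPages) {
        Deque<String> fetchQueue = new ArrayDeque<>();
        Set<String> fetched = new LinkedHashSet<>();
        fetchQueue.add(seed);
        while (!fetchQueue.isEmpty() && fetched.size() < maxPages) {
            String url = fetchQueue.poll();
            if (!fetched.add(url)) {
                continue; // already fetched; skip duplicates
            }
            // "Fetch" the page and extract its outgoing links.
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (!fetched.contains(link)) {
                    fetchQueue.add(link); // add extracted links back to the queue
                }
            }
        }
        return new ArrayList<>(fetched);
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a.com", List.of("b.com", "c.com"),
                "b.com", List.of("c.com", "d.com"),
                "c.com", List.of("a.com"));
        System.out.println(crawl(graph, "a.com", 10));
        // [a.com, b.com, c.com, d.com]
    }
}
```

A production crawler distributes this loop across many machines and adds politeness delays, URL normalization, and persistent queues, which is exactly the kind of work frameworks such as Apache Nutch handle on top of Hadoop.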
In this chapter, we will introduce you to several tools that can be used with Apache Hadoop to perform large-scale searching and indexing.