Chapter 7. Searching and Indexing

In this chapter, we will cover:

  • Generating an inverted index using Hadoop MapReduce
  • Intra-domain web crawling using Apache Nutch
  • Indexing and searching web documents using Apache Solr
  • Configuring Apache HBase as the backend data store for Apache Nutch
  • Deploying Apache HBase on a Hadoop cluster
  • Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
  • ElasticSearch for indexing and searching
  • Generating the in-links graph for crawled web pages

Introduction

MapReduce frameworks are well suited for large-scale search and indexing applications. In fact, Google developed the original MapReduce framework specifically to support the operations involved in web searching. The Apache Hadoop project itself was started as a support project for the Apache Nutch search engine, before being spun off as a separate top-level project.

Web searching consists of fetching, indexing, ranking, and retrieval. Given the size of the data, all of these operations need to be scalable, and retrieval additionally needs to happen in near real time. Typically, fetching is performed through web crawling, where crawlers fetch the pages in a fetch queue, extract links from the fetched pages, add the extracted links back to the fetch queue, and repeat this process many times. Indexing parses, organizes, and stores the fetched data in a manner that allows fast and efficient querying and retrieval. Search engines rank documents offline, using algorithms such as PageRank, and rank results in real time based on the query.
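To make the crawl loop concrete, the following minimal, single-machine Java sketch illustrates it; the fetchPage, extractLinks, and storeForIndexing helpers are hypothetical placeholders, and a production crawler such as Apache Nutch distributes the same loop across a cluster:

    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    // A minimal single-machine sketch of the crawl loop described above.
    // Production crawlers such as Apache Nutch distribute this loop across
    // a cluster, honor robots.txt, and prioritize the fetch queue by score.
    public class SimpleCrawler {

        public static void crawl(String seedUrl, int maxPages) {
            Queue<String> fetchQueue = new ArrayDeque<String>();
            Set<String> seen = new HashSet<String>();

            fetchQueue.add(seedUrl);
            seen.add(seedUrl);

            while (!fetchQueue.isEmpty() && seen.size() <= maxPages) {
                String url = fetchQueue.poll();
                String page = fetchPage(url);            // fetch a page from the queue
                storeForIndexing(url, page);             // hand the content to the indexer
                for (String link : extractLinks(page)) { // extract out-links
                    if (seen.add(link)) {                // enqueue only unseen links
                        fetchQueue.add(link);
                    }
                }
            }
        }

        // Hypothetical placeholder helpers: a real implementation would use
        // an HTTP client and an HTML parser, and would normalize URLs before
        // de-duplicating them.
        private static String fetchPage(String url) { return ""; }
        private static Iterable<String> extractLinks(String page) {
            return Collections.emptyList();
        }
        private static void storeForIndexing(String url, String page) { }
    }

Note that the seen set prevents the same URL from being fetched twice, and the maxPages bound guarantees that the loop terminates even though the web's link graph is effectively unbounded.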

In this chapter, we will introduce you to several tools that can be used with Apache Hadoop to perform large-scale searching and indexing.
