In this chapter, we will cover:
MapReduce frameworks are well suited for large-scale search and indexing applications. In fact, Google developed the original MapReduce framework specifically to facilitate the many operations involved in web searching. The Apache Hadoop project began as a support project for the Apache Nutch search engine, before being spun off as a separate top-level project.
Web searching consists of fetching, indexing, ranking, and retrieval. Given the size of the data, all of these operations need to be scalable. In addition, retrieval should also provide real-time access. Typically, fetching is performed through web crawling, where crawlers fetch a set of pages from the fetch queue, extract links from the fetched pages, add the extracted links back to the fetch queue, and repeat this process many times. Indexing parses, organizes, and stores the fetched data in a manner that is fast and efficient for querying and retrieval. Search engines perform offline ranking of documents based on algorithms such as PageRank, and real-time ranking of results based on the query.
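The fetch-queue loop described above can be sketched as follows. This is a minimal single-machine illustration, not a real crawler: it uses an in-memory link graph as a stand-in for HTTP fetching and link extraction, and the class name, graph, seed URL, and page limit are all illustrative assumptions.

```java
import java.util.*;

// Sketch of the crawl loop: fetch a page from the queue, extract its
// links, add unseen links back to the queue, and repeat up to a limit.
public class CrawlLoopSketch {
    public static List<String> crawl(Map<String, List<String>> linkGraph,
                                     String seed, int maxPages) {
        Deque<String> fetchQueue = new ArrayDeque<>();
        Set<String> fetched = new LinkedHashSet<>();
        fetchQueue.add(seed);
        while (!fetchQueue.isEmpty() && fetched.size() < maxPages) {
            String url = fetchQueue.poll();
            if (!fetched.add(url)) {
                continue; // already fetched; skip duplicates
            }
            // "Fetch" the page and extract its outgoing links.
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (!fetched.contains(link)) {
                    fetchQueue.add(link); // add extracted links back to the queue
                }
            }
        }
        return new ArrayList<>(fetched);
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a.com", List.of("b.com", "c.com"),
                "b.com", List.of("c.com", "d.com"),
                "c.com", List.of("a.com"));
        System.out.println(crawl(graph, "a.com", 10));
        // [a.com, b.com, c.com, d.com]
    }
}
```

A production crawler distributes this loop across many machines and adds politeness delays, URL normalization, and persistent queues, which is exactly the kind of work frameworks such as Apache Nutch handle on top of Hadoop.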
In this chapter, we will introduce you to several tools that can be used with Apache Hadoop to perform large-scale searching and indexing.