ElasticSearch for indexing and searching

ElasticSearch (http://www.elasticsearch.org/) is an Apache 2.0 licensed open source search solution built on top of Apache Lucene. ElasticSearch is a distributed, multi-tenant, document-oriented search engine. It supports distributed deployments by breaking an index down into shards and distributing those shards across the nodes of the cluster. While both ElasticSearch and Apache Solr use Apache Lucene as the core search engine, ElasticSearch aims to provide a more scalable, distributed solution that is better suited for cloud environments than Apache Solr.

Getting ready

Install Apache Nutch and crawl some web pages as per the Whole web crawling with Apache Nutch using an existing Hadoop/HBase cluster recipe or the Configuring Apache HBase local mode as the backend data store for Apache Nutch recipe. Make sure the backend HBase (or HyperSQL) data store for Nutch is still available.

How to do it

The following steps show you how to index and search the data crawled by Nutch using ElasticSearch.

  1. Download and extract ElasticSearch from http://www.elasticsearch.org/download/.
  2. Go to the extracted ElasticSearch directory and execute the following command to start the ElasticSearch server in the foreground.
    > bin/elasticsearch -f
    
  3. Run the following command in a new console to verify your installation.
    > curl localhost:9200
    {
      "ok" : true,
      "status" : 200,
      "name" : "Outlaw",
      "version" : {
        "number" : "0.19.11",
        "snapshot_build" : false
      },
      "tagline" : "You Know, for Search"
    }
    
  4. Go to the $NUTCH_HOME/runtime/deploy (or $NUTCH_HOME/runtime/local in case you are running Nutch in the local mode) directory. Execute the following command to index the data crawled by Nutch into the ElasticSearch server.
    > bin/nutch elasticindex elasticsearch -all
    12/11/01 06:11:07 INFO elastic.ElasticIndexerJob: Starting
    …...
    
  5. Issue the following command to perform a search:
    > curl -XGET 'http://localhost:9200/_search?q=hadoop'
    ....
    {"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":36,"max_score":0.44754887,"hits":[{"_index":"index","_type":"doc","_id":"org.apache.hadoop:http/","_score":0.44754887, ....
    
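The query-string search in step 5 can be refined with additional request parameters and Lucene query syntax. The following commands are a sketch, assuming the ElasticSearch server started in step 2 is running on localhost:9200; the title field name is an assumption about the schema of the Nutch-indexed documents.

```shell
# Sketch of refined query-string searches against a local ElasticSearch server.
# Assumes the server from step 2 is listening on localhost:9200; the "title"
# field is an assumed field name in the Nutch-indexed documents.
if curl -s localhost:9200 >/dev/null 2>&1; then
  # Pretty-print the JSON response and return only the first 5 hits
  curl -XGET 'http://localhost:9200/_search?q=hadoop&size=5&pretty=true'
  # Restrict the query to a single field using Lucene query syntax
  curl -XGET 'http://localhost:9200/_search?q=title:hadoop&pretty=true'
else
  echo "ElasticSearch is not running on localhost:9200"
fi
```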

How it works

Similar to Apache Solr, ElasticSearch is built using the Apache Lucene text search library. In the preceding steps, we export the data crawled by Nutch into an ElasticSearch instance for indexing and searching.

We add the -f switch to force ElasticSearch to run in the foreground, which makes the development and testing process easier.

bin/elasticsearch -f

You can also install ElasticSearch as a service. Refer to http://www.elasticsearch.org/guide/reference/setup/installation.html for more details on installing ElasticSearch as a service.

We use the ElasticIndex job of Nutch to import the data crawled by Nutch into the ElasticSearch server. The usage of the elasticindex command is as follows:

bin/nutch elasticindex <elastic cluster name> (<batchId> | -all | -reindex) [-crawlId <id>]

The elastic cluster name defaults to elasticsearch. You can change the cluster name by editing the cluster.name property in the config/elasticsearch.yml file. The cluster name is used for auto-discovery purposes and should be unique for each ElasticSearch deployment in a single network.
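Changing the cluster name can be sketched as follows; nutch-search is an illustrative value, not a name used elsewhere in this recipe.

```yaml
# config/elasticsearch.yml (the property ships commented out)
cluster.name: nutch-search
```

After restarting the server, pass the same name as the first argument to the elasticindex command, for example bin/nutch elasticindex nutch-search -all.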

See also

  • The Indexing and searching web documents using Apache Solr recipe of this chapter.