ElasticSearch (http://www.elasticsearch.org/) is an Apache 2.0 licensed open source search solution built on top of Apache Lucene. It is a distributed, multi-tenant, document-oriented search engine. ElasticSearch supports distributed deployments by breaking an index down into shards and distributing the shards across the nodes in the cluster. While both ElasticSearch and Apache Solr use Apache Lucene as the core search engine, ElasticSearch aims to provide a more scalable, distributed solution that is better suited to cloud environments than Apache Solr.
Install Apache Nutch and crawl some web pages as per the Whole web crawling with Apache Nutch using an existing Hadoop/HBase cluster recipe or the Configuring Apache HBase local mode as the backend data store for Apache Nutch recipe. Make sure the backend HBase (or HyperSQL) data store for Nutch is still available.
The following steps show you how to index and search the data crawled by Nutch using ElasticSearch.
> bin/elasticsearch -f
> curl localhost:9200
{
  "ok" : true,
  "status" : 200,
  "name" : "Outlaw",
  "version" : {
    "number" : "0.19.11",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}
Go to the $NUTCH_HOME/runtime/deploy (or $NUTCH_HOME/runtime/local in case you are running Nutch in the local mode) directory. Execute the following command to index the data crawled by Nutch into the ElasticSearch server.
> bin/nutch elasticindex elasticsearch -all
12/11/01 06:11:07 INFO elastic.ElasticIndexerJob: Starting
…...
> curl -XGET 'http://localhost:9200/_search?q=hadoop'
....
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":36,"max_score":0.44754887,"hits":[{"_index":"index","_type":"doc","_id":"org.apache.hadoop:http/","_score":0.44754887,
....
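The search response above is compact JSON, so a quick way to check that indexing worked is to pull out the total hit count. The following is only a rough sketch: es_total_hits is a hypothetical helper name, the sample response is an abbreviated copy of the output shown above, and the sed pattern assumes the compact, single-line layout of that output (a proper JSON parser would be more robust).

```shell
# Hypothetical helper: reads a compact search response on stdin and
# prints the number after "hits":{"total": (the total hit count).
es_total_hits() {
  sed -n 's/.*"hits":{"total":\([0-9][0-9]*\).*/\1/p'
}

# Abbreviated sample in the same shape as the response shown in the recipe.
sample='{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":36,"max_score":0.44754887,"hits":[]}}'
printf '%s\n' "$sample" | es_total_hits

# Against a live server (assumes the default host and port used in the recipe);
# curl -s suppresses the progress meter so only the JSON reaches the pipe:
# curl -s -XGET 'http://localhost:9200/_search?q=hadoop' | es_total_hits
```

Note that the pattern matches the "total" inside "hits" and not the one inside "_shards", because it is anchored on the "hits":{ prefix.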
Like Apache Solr, ElasticSearch is built on the Apache Lucene text search library. In the preceding steps, we exported the data crawled by Nutch into an instance of ElasticSearch for indexing and searching.
We add the -f switch to force ElasticSearch to run in the foreground, which makes development and testing easier.
bin/elasticsearch -f
You can also install ElasticSearch as a service. Refer to http://www.elasticsearch.org/guide/reference/setup/installation.html for more details on installing ElasticSearch as a service.
We use the ElasticIndex job of Nutch to import the data crawled by Nutch into the ElasticSearch server. The usage of the elasticindex command is as follows:
bin/nutch elasticindex <elastic cluster name> (<batchId> | -all | -reindex) [-crawlId <id>]
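To make the usage synopsis concrete, the following sketch assembles elasticindex command lines from its three parts. The function name nutch_elasticindex_cmd, the cluster name mycluster, and the crawl ID crawl01 are hypothetical and used for illustration only; the function just prints the command and does not run Nutch.

```shell
# Hypothetical helper: prints an elasticindex command line.
# Arguments: <cluster name> <batchId | -all | -reindex> [crawl ID]
nutch_elasticindex_cmd() {
  local cluster="${1:-elasticsearch}"  # cluster name; the recipe uses elasticsearch
  local selector="${2:--all}"          # a batch ID, -all, or -reindex
  local crawl_id="${3:-}"              # optional crawl ID
  local cmd="bin/nutch elasticindex ${cluster} ${selector}"
  if [ -n "${crawl_id}" ]; then
    cmd="${cmd} -crawlId ${crawl_id}"
  fi
  printf '%s\n' "${cmd}"
}

# The command used in the recipe: index everything into the default cluster.
nutch_elasticindex_cmd elasticsearch -all

# Re-index a specific crawl into a differently named (hypothetical) cluster.
nutch_elasticindex_cmd mycluster -reindex crawl01
```

The first call prints the same command used in the steps above; the second shows where the optional -crawlId argument goes.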
The elastic cluster name defaults to elasticsearch. You can change the cluster name by editing the cluster.name property in the config/elasticsearch.yml file. The cluster name is used for auto-discovery and should be unique for each ElasticSearch deployment in a single network.
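As an illustration of the cluster.name edit, the following sketch rewrites (or appends) the property in a YAML file. It is a minimal example under stated assumptions: set_cluster_name is a hypothetical helper, nutch-search is a made-up cluster name, the demo runs against a temporary file and not a real config/elasticsearch.yml, and the new name is assumed not to contain the characters | or &.

```shell
# Hypothetical helper: sets cluster.name in the given YAML file.
# Arguments: <path to config file> <new cluster name>
set_cluster_name() {
  config_file="$1"
  new_name="$2"
  if grep -Eq '^[#[:space:]]*cluster\.name:' "$config_file"; then
    # Replace the existing setting, uncommenting it if it was commented out.
    sed -E "s|^[#[:space:]]*cluster\.name:.*|cluster.name: ${new_name}|" \
      "$config_file" > "${config_file}.tmp" && mv "${config_file}.tmp" "$config_file"
  else
    # No setting present yet: append one.
    printf 'cluster.name: %s\n' "$new_name" >> "$config_file"
  fi
}

# Demo on a temporary file standing in for config/elasticsearch.yml.
demo_config="$(mktemp)"
printf '# cluster.name: elasticsearch\n' > "$demo_config"
set_cluster_name "$demo_config" nutch-search
cat "$demo_config"
```

If you change the cluster name this way, pass the same name as the first argument of the elasticindex command so that Nutch connects to the intended cluster.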