More like this

The ElasticSearch functionality is not only about searching documents based on selected criteria. For example, we can use it in our application to find products similar to the ones returned by a user query.

In fact, we already know something about this functionality from Chapter 2, Searching Your Data, where we saw the "more like this" query. In that query, however, we have to construct the like_text field ourselves. ElasticSearch can generate this data from an example document and provides a special endpoint for this.
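
To make the manual step concrete, here is a minimal sketch in Python of what constructing like_text by hand might look like. The build_mlt_query function and the example document are hypothetical names introduced for illustration, and the request body shape assumes the more like this query as described in Chapter 2.

```python
import json


def build_mlt_query(document, fields, min_term_freq=1, min_doc_freq=1):
    """Build a more_like_this query body from an example document (illustrative helper)."""
    values = []
    for field in fields:
        value = document[field]
        # A field may hold a single string or a list of strings (such as tags).
        values.extend(value if isinstance(value, list) else [value])
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                # This is the text we would otherwise have to assemble ourselves.
                "like_text": " ".join(values),
                "min_term_freq": min_term_freq,
                "min_doc_freq": min_doc_freq,
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical example document with a tags field.
    example = {"name": "a small cottage in the mountains",
               "tags": ["mountains", "italy", "hiking"]}
    print(json.dumps(build_mlt_query(example, ["tags"]), indent=2))
```

The endpoint described in this section saves us from writing such a helper, because ElasticSearch reads the field values from the stored document itself.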

Example data

For our example, let's imagine we have a travel agency where every available location is assigned a set of tags describing it. The simplified version of the data can look like this:

{ "index": { "_index" : "travel", "_type" : "loc", "_id" : 1}}
{ "name" : "beautiful hotel by the sea", "tags" : ["sea", "greece", "beach"] }
{ "index": { "_index" : "travel", "_type" : "loc", "_id" : 2}}
{ "name" : "a small cottage in the mountains", "tags" : ["mountains", "switzerland", "hiking"] }
{ "index": { "_index" : "travel", "_type" : "loc", "_id" : 3}}
{ "name" : "a small cottage in the mountains", "tags" : ["mountains", "italy", "hiking"] }
{ "index": { "_index" : "travel", "_type" : "loc", "_id" : 4}}
{ "name" : "at the seaside", "tags" : ["sea", "italy"] }

As in previous examples, we've written this data to the documents.json file and loaded it into ElasticSearch using the following command:

curl -XPOST 'localhost:9200/_bulk' --data-binary @documents.json
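
If the data lives in an application rather than in a hand-written file, the bulk file could be generated instead. The following is a minimal sketch in Python, assuming the same travel index and loc type as above; the LOCATIONS list and the build_bulk_body function are hypothetical names introduced for illustration. The bulk format expects one JSON object per line, with the last line also terminated by a newline character.

```python
import json

# Hypothetical in-memory copy of the example data from this section.
LOCATIONS = [
    {"id": 1, "name": "beautiful hotel by the sea", "tags": ["sea", "greece", "beach"]},
    {"id": 2, "name": "a small cottage in the mountains", "tags": ["mountains", "switzerland", "hiking"]},
    {"id": 3, "name": "a small cottage in the mountains", "tags": ["mountains", "italy", "hiking"]},
    {"id": 4, "name": "at the seaside", "tags": ["sea", "italy"]},
]


def build_bulk_body(locations, index="travel", doc_type="loc"):
    """Return the newline-delimited text for the _bulk endpoint (illustrative helper)."""
    lines = []
    for loc in locations:
        # Each document takes two lines: the action line and the source line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": loc["id"]}}
        source = {"name": loc["name"], "tags": loc["tags"]}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("documents.json", "w") as f:
        f.write(build_bulk_body(LOCATIONS))
```

The resulting documents.json file can then be sent with the same curl command.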

Finding similar documents

Now we can use the _mlt endpoint and provide the identifier of the document for which we would like to find similar documents. Look at the following command:

curl 'localhost:9200/travel/loc/3/_mlt?pretty&mlt_fields=tags&min_term_freq=1&min_doc_freq=0'

We've tried to find documents similar to the one with ID equal to 3. The mlt_fields parameter tells ElasticSearch which fields (separated by a comma character) from this document should be used for searching. In our example we want to find the documents that have similar tags (the tags field). Let's look at the results returned by ElasticSearch:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2169777,
    "hits" : [ {
      "_index" : "travel",
      "_type" : "loc",
      "_id" : "2",
      "_score" : 0.2169777, 
      "_source" : { "name" : "a small cottage in the mountains", "tags" : ["mountains", "switzerland", "hiking"] }
    }, {
      "_index" : "travel",
      "_type" : "loc",
      "_id" : "4",
      "_score" : 0.19178301, 
      "_source" : { "name" : "at the seaside", "tags" : ["sea", "italy"] }
    } ]
  }
}

As we can see, ElasticSearch thinks that if you are interested in hiking somewhere in the mountains in Italy, you may also consider a journey to Switzerland. Or maybe you should see more of Italy?

You may wonder why we used the additional min_term_freq and min_doc_freq parameters. They control which terms (words) are ignored in the comparison: min_term_freq is the minimum number of times a term must occur in the source document, and min_doc_freq is the minimum number of documents that must contain the term. ElasticSearch assumes that a useful term should be used neither too rarely in the index nor too frequently (as common words such as "and" are). Our index is very small, so we need to lower these thresholds. In real cases, the default values (2 for the min_term_freq parameter and 5 for the min_doc_freq parameter) should work better, but you can also experiment with the other parameters described in The more like this query section in Chapter 2, Searching Your Data, for best results.
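
To see why the defaults would return nothing on our tiny index, consider the following minimal sketch in Python. It is a simplified model of the two thresholds only, not ElasticSearch's actual implementation: it ignores analysis, stop words, the maximum-frequency limits, and scoring. The TAGS dictionary and the select_terms function are hypothetical names introduced for illustration.

```python
from collections import Counter

# Tags from the example data in this section, keyed by document ID.
TAGS = {
    1: ["sea", "greece", "beach"],
    2: ["mountains", "switzerland", "hiking"],
    3: ["mountains", "italy", "hiking"],
    4: ["sea", "italy"],
}


def select_terms(doc_id, docs, min_term_freq=2, min_doc_freq=5):
    """Keep the terms of one document that pass both frequency thresholds."""
    # How often each term occurs inside the source document.
    term_freq = Counter(docs[doc_id])
    # In how many documents of the index each term occurs.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    return sorted(
        term for term, tf in term_freq.items()
        if tf >= min_term_freq and doc_freq[term] >= min_doc_freq
    )


if __name__ == "__main__":
    print(select_terms(3, TAGS))                                   # []
    print(select_terms(3, TAGS, min_term_freq=1, min_doc_freq=0))  # ['hiking', 'italy', 'mountains']
```

Each tag occurs only once in document 3 and in at most two documents overall, so with the default thresholds no term survives and there is nothing left to compare. With the lowered values, all three tags are used.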
