The Elasticsearch out-of-the-box tools

Elasticsearch primarily works with two models of information retrieval: the Boolean model and the vector space model. Other scoring algorithms are also available in Elasticsearch, such as Okapi BM25, Divergence from Randomness (DFR), and Information Based (IB). Working with these three algorithms requires extensive mathematical knowledge and some extra configuration in Elasticsearch, both of which are beyond the scope of this book.

The Boolean model uses the AND, OR, and NOT conditions in a query to find all the matching documents. This Boolean model can be further combined with the Lucene scoring formula, TF/IDF (which we have already discussed in Chapter 2, Understanding Document Analysis and Creating Mappings), to rank documents.

The vector space model works differently from the Boolean model, as it represents both queries and documents as vectors. In the vector space model, each number in the vector is the weight of a term that is calculated using TF/IDF.

Queries and documents are compared using cosine similarity, which measures the angle between the query vector and each document vector. The smaller the angle, the more similar the two vectors are, and the more relevant the document is to the query.
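To make the comparison concrete, here is a minimal Python sketch of cosine similarity. The three-term vocabulary and the weights are made up for illustration; in Elasticsearch, the weights would come from TF/IDF and the computation happens inside Lucene, not in user code.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a vector with no terms is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical term weights over the vocabulary ["java", "python", "lucene"].
query = [1.0, 0.0, 0.0]   # the query mentions only "java"
doc_a = [1.0, 1.0, 0.0]   # the document mentions "java" and "python"
doc_b = [0.0, 1.0, 1.0]   # the document mentions "python" and "lucene"

print(round(cosine_similarity(query, doc_a), 3))  # about 0.707
print(round(cosine_similarity(query, doc_b), 3))  # 0.0, no shared terms
```

A value close to 1 means the vectors point in nearly the same direction, while 0 means that the query and the document have no weighted terms in common.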

An example: why defaults are not enough

Let's build an index with sample documents to understand the examples in a better way.

First, create an index with the name profiles:

curl -XPUT 'localhost:9200/profiles'

Then, put the mapping with the document type as candidate:

curl -XPUT 'localhost:9200/profiles/candidate/_mapping' -d '
{
 "properties": {
   "geo_code": {
     "type": "geo_point",
     "lat_lon": true
   }
 }
}'

Please note that the preceding mapping defines only the geo_code field, because a geo_point field must be mapped explicitly. The rest of the fields will be mapped dynamically when the documents are indexed.

Now, you can create a data.json file with the following content in it:

{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 1 }}
{ "name" : "Sam", "geo_code" : "12.9545163,77.3500487", "total_experience":5, "skills":["java","python"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 2 }}
{ "name" : "Robert", "geo_code" : "28.6619678,77.225706", "total_experience":2, "skills":["java"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 3 }}
{ "name" : "Lavleen", "geo_code" : "28.6619678,77.225706", "total_experience":4, "skills":["java","Elasticsearch"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 4 }}
{ "name" : "Bharvi", "geo_code" : "28.6619678,77.225706", "total_experience":3, "skills":["java","lucene"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 5 }}
{ "name" : "Nips", "geo_code" : "12.9545163,77.3500487", "total_experience":7, "skills":["grails","python"] }
{ "index" : { "_index" : "profiles", "_type" : "candidate", "_id" : 6 }}
{ "name" : "Shikha", "geo_code" : "28.4250666,76.8493508", "total_experience":10, "skills":["c","java"] }

Note

If you are indexing skills that contain spaces or special characters, for example, C++, C#, or Core Java, you need to map the skills field as not_analyzed in advance to get exact term matching.

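Once the file is created, execute the following command to send the data to the index we have just created through the bulk API. Make sure that the file ends with a newline character, which the bulk API requires:

curl -XPOST 'localhost:9200/_bulk' --data-binary @data.json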

If you look carefully at the example, the documents contain the data of the candidates who might be looking for jobs. For hiring candidates, a recruiter can have the following criteria:

  • Candidates should know about Java
  • Candidates should have experience of 3 to 5 years
  • Candidates should be located within 100 kilometers of the recruiter's office

You can construct a simple bool query that combines a term query on the skills field with a geo_distance filter on the geo_code field and a range filter on the total_experience field. However, does this give a relevant set of results? The answer is no.
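As a rough sketch, the strict query described above could be assembled as follows. It assumes Elasticsearch 2.x or later, where the bool query accepts a filter clause. The office coordinates and the build_candidate_query helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical office location; the recruiter's coordinates are an assumption.
OFFICE = {"lat": 28.6619678, "lon": 77.225706}

def build_candidate_query(skill, min_exp, max_exp, distance_km, office):
    """Build a bool query: one scored term clause plus two non-scoring filters."""
    return {
        "query": {
            "bool": {
                # The term query contributes to the relevance score.
                "must": [{"term": {"skills": skill}}],
                # Filters only include or exclude documents; they do not score.
                "filter": [
                    {"range": {"total_experience": {"gte": min_exp, "lte": max_exp}}},
                    {"geo_distance": {"distance": "%dkm" % distance_km,
                                      "geo_code": office}},
                ],
            }
        }
    }

body = build_candidate_query("java", 3, 5, 100, OFFICE)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body to localhost:9200/profiles/candidate/_search. Because both conditions are hard filters, a candidate who misses either boundary by a small margin is excluded entirely.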

The problem is that if you are restricting the range of experience and distance, you might even get zero results or no suitable candidates. For example, you can put a range of 0 to 100 kilometers of distance but your perfect candidate might be at a distance of 101 kilometers. At the same time, if you define a wide range, you might get a huge number of non-relevant results.

The other problem is that if you search for candidates who know Java, there is a chance that a person who knows only Java will be at the top, while a person who knows other languages in addition to Java will be at the bottom. This happens because TF/IDF ranking takes the length of the field into account: the shorter the field, the more weight a matching term receives, so the document is scored as more relevant.
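The effect of field length can be sketched with the length normalization used by Lucene's classic TF/IDF similarity, which is 1 divided by the square root of the number of terms in the field. The following snippet is a simplified illustration only. It ignores index-time boosts, and Lucene stores norms in a lossy compressed form, so the values in a real index differ slightly.

```python
import math

def length_norm(num_terms):
    """Classic TF/IDF field-length norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

# Skills of two candidates from the sample data; both match a search for "java".
skills = {
    "Robert": ["java"],
    "Sam": ["java", "python"],
}

for name, terms in skills.items():
    print(name, round(length_norm(len(terms)), 3))
```

Robert's single-term field gets a norm of 1.0, while Sam's two-term field gets about 0.707, so with everything else being equal, Robert is ranked above Sam even though Sam knows more languages.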

Elasticsearch is not intelligent enough to understand the semantic meaning of your queries, but for these scenarios, it offers you the full power to redefine how scoring and document ranking should be done.
