Chapter 6. Make Your Search Better

In the previous chapter, we focused on indexing operations and learned how to handle structured data. We started with indexing tree-like structures and JSON objects, then used nested objects and indexed documents using the parent-child functionality. Finally, at the end of the chapter, we used the Elasticsearch API to modify the structure of our indices. By the end of this chapter, you will have learned the following topics:

  • Understanding how Apache Lucene scoring works
  • Using scripting
  • Handling multilingual data
  • Using boosting to affect document scoring
  • Using synonyms
  • Understanding how your documents were scored

Introduction to Apache Lucene scoring

When talking about queries and their relevance, we can't omit the information about the scoring and where it comes from. But what is a score? The score is a property that describes the relevance of a document in the context of a query. In the following section, we will talk about the default Apache Lucene scoring mechanism, the TF/IDF algorithm, and how it affects the returned documents.

Note

TF/IDF is not the only scoring algorithm exposed by Elasticsearch. For more information about the available models, refer to the Available similarity models section in Chapter 2, Indexing Your Data. You can also refer to the books Mastering Elasticsearch and Mastering Elasticsearch Second Edition published by Packt Publishing.

When a document is matched

When a document is returned by Lucene, it means that it matched the query we sent to it. In most cases, each of the resulting documents in the response is given a score. The higher the score, the more relevant the document is from the search engine's point of view, always in the context of a given query. This means that the score calculated for the same document for two different queries will be different. Because of that, comparing scores between queries usually doesn't make much sense. Getting back to the scoring itself: to calculate the score property for a document, multiple factors are taken into account:

  • document boost: The boost value given for a document during indexing.
  • field boost: The boost value given for a field during querying and indexing.
  • coord: The coordination factor that is based on the number of query terms the document contains. It is responsible for giving more value to the documents that contain more of the search terms compared to the other documents.
  • inverse document frequency: The term based factor that tells the scoring formula how rare the given term is. The higher the inverse document frequency, the less common the term is.
  • length norm: The field based factor for normalization based on the number of terms the given field contains. The longer the field, the smaller the boost this factor will give. It basically means that documents with shorter fields will be favored.
  • term frequency: The term based factor describing how many times the given term occurs in a document. The higher the term frequency, the higher the score of the document will be.
  • query norm: The query based normalization factor that is calculated from the sum of the squared weights of the query terms. Query norm is meant to make scores from different queries more comparable, which, as we said, is not always easy or possible.
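
The term based and field based factors from the preceding list can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the formulas used by Lucene's classic TF/IDF similarity (the square root of the term frequency, a logarithmic inverse document frequency, and the inverse square root of the field length). The function names are ours, and the lossy single-byte encoding that Lucene applies to norms at index time is ignored.

```python
import math


def term_frequency(freq):
    """tf: grows with the number of occurrences, but sub-linearly."""
    return math.sqrt(freq)


def inverse_document_frequency(doc_freq, num_docs):
    """idf: the fewer documents contain the term, the higher the value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms, field_boost=1.0):
    """norm: shorter fields get a larger value; the index-time boost is folded in."""
    return field_boost / math.sqrt(num_terms)


# A term found in 9 of 1,000 documents is worth more than one found in 499.
rare = inverse_document_frequency(doc_freq=9, num_docs=1000)
common = inverse_document_frequency(doc_freq=499, num_docs=1000)

# A 4-term field is favored over a 100-term field.
short_field = length_norm(4)
long_field = length_norm(100)
```

With these assumed formulas, rare is larger than common and short_field is larger than long_field, which is exactly the behavior the factor descriptions promise.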

Default scoring formula

The practical formula for the TF/IDF algorithm looks as follows:

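$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \bigr)$$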

To adjust your query relevance, you don't need to remember the details of the equation, but it is very important to know how it works, or at least to be aware that there is an equation you can analyze. We can see that the score for the document is a function of query q and document d. There are also two factors that are not dependent directly on individual query terms: coord and queryNorm. These two elements of the formula are multiplied by the sum calculated over the terms in the query. Each element of that sum, on the other hand, is calculated by multiplying the term frequency for the given term, its inverse document frequency (squared in the formula), the term boost, and the norm, which is the length norm we discussed previously.
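
To see how the pieces of the practical formula fit together, the following Python sketch scores a single-field document against a list of query terms. It is an illustration under stated assumptions, not Lucene's implementation: the score function and its parameters are hypothetical, the individual factors follow the classic TF/IDF similarity formulas, index-time boosts are assumed to be 1.0, and norms are not compressed the way Lucene compresses them.

```python
import math


def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Practical TF/IDF score of one single-field document.

    query_terms: list of query terms
    doc_terms:   list of terms indexed for the document's field
    doc_freqs:   dict of term -> number of documents containing the term
    num_docs:    total number of documents in the index
    boosts:      optional dict of term -> query-time boost (default 1.0)
    """
    boosts = boosts or {}

    def idf(term):
        return 1.0 + math.log(num_docs / (doc_freqs.get(term, 0) + 1))

    # queryNorm: 1 / sqrt(sum of squared term weights); identical for every document.
    sum_sq = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)

    # coord: fraction of the query terms that the document contains.
    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)

    # norm: length norm of the field (index-time boosts assumed to be 1.0).
    norm = 1.0 / math.sqrt(len(doc_terms))

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))
        total += tf * idf(t) ** 2 * boosts.get(t, 1.0) * norm
    return coord * query_norm * total


doc_freqs = {"elasticsearch": 10, "server": 400}
query = ["elasticsearch", "server"]

s_short = score(query, ["elasticsearch", "server"], doc_freqs, 1000)
s_long = score(query, ["elasticsearch", "server"] + ["filler"] * 14, doc_freqs, 1000)
s_partial = score(query, ["server", "rack"], doc_freqs, 1000)
```

In this toy index, the two-term document that matches both query terms scores highest, the longer document with the same matches scores lower because of its length norm, and the document that matches only the common term scores lower still because of coord and the low inverse document frequency.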

Note

Note that the preceding formula is a practical one. You can find more information about the conceptual formula in Lucene Javadocs at http://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.

The good thing is that you don't need to remember all of the preceding details. What you should be aware of is what matters when it comes to the document score. Basically, there are a few rules that follow from the preceding equation:

  • The rarer the matched term is, the higher the score the document will have
  • The shorter the document fields are (the fewer terms they have), the higher the score the document will have
  • The higher the boost for the fields is, the higher the score the document will have

As we can see, Lucene gives a higher score to documents that match many of the query terms and have shorter fields (fewer terms indexed) that were used for matching, and it also favors rarer terms over common ones (of course, only the ones that matched).

Relevancy matters

In most cases, we want to get the best matching documents. However, the most relevant documents are not always the same as the best matches. Some use cases define very strict rules on why a given document should be higher on the results list. For example, one could say that, in addition to the document being a perfect match in terms of TF/IDF similarity, we have paying customers to consider. Depending on the customer plan, we want to give more importance to such documents. In such cases, we may want the documents of the customers that pay the most to be on top of the search results. Of course, TF/IDF knows nothing about such business rules.

The other example is yellow pages, where customers pay for more information describing the document. Such large documents may not be the most relevant ones according to TF/IDF, so you may want to adjust the scoring if you are working with such data.

These are very simple examples and Elasticsearch queries can become really complicated. We will talk about such queries in the Influencing scores with query boosts section in this chapter.
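
As a small preview of what such an adjustment can look like, the following Python sketch builds a request body that multiplies the text score by a per-document business weight using the Elasticsearch function_score query with field_value_factor. The description and plan_weight fields, as well as the paid_plan_query helper, are hypothetical names chosen for illustration; your mappings and weighting strategy will differ.

```python
def paid_plan_query(text, text_field="description", plan_field="plan_weight"):
    """Build a search body that scales the text score by a numeric plan weight.

    Both field names are assumptions; documents without the plan field keep
    their original score because the missing value is 1 and the scores are
    multiplied.
    """
    return {
        "query": {
            "function_score": {
                # The regular full-text part, scored by the default similarity.
                "query": {"match": {text_field: text}},
                # The business part: read a numeric value from the document.
                "field_value_factor": {
                    "field": plan_field,
                    "modifier": "none",
                    "missing": 1,
                },
                # Combine both parts by multiplication.
                "boost_mode": "multiply",
            }
        }
    }


body = paid_plan_query("emergency plumber")
```

The resulting dictionary can be serialized to JSON and sent to the search endpoint of an index. Multiplication is only one possible way to combine the two parts, and the right choice depends on your business rules.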

When working on search relevance, you should always remember that it is not a one-time process. Your data will change with time and your queries will need to be adjusted. In most cases, tuning the query relevancy will be constant work. You will need to react to your business rules and needs, to how the users behave, and so on. This is not something you can do once and then forget about.
