One of the great things in Elasticsearch is its scripting capabilities. You can use scripts for score calculation, text-based scoring, data filtering, and data analysis. Although scripting can be slow in some cases, such as calculating the score for each document, we think that this part of Elasticsearch is important. Because of this, we decided that this section should bring you information about the changes and extend the information present in the Elasticsearch Server Second Edition book.
Elasticsearch scripting has gone through a lot of refactoring in version 1.0 and in the versions that came after it. Because of those changes, some users were lost as to why their scripts stopped working when upgrading to version 1.2 of Elasticsearch and what was happening in general. This section will try to give you insight into what to expect.
During the lifetime of Elasticsearch 1.1, an exploit was published (see http://bouk.co/blog/elasticsearch-rce/): it showed that with the default configuration, Elasticsearch was not fully secure. Because of that, dynamic scripting was disabled by default in Elasticsearch 1.2. Although disabling dynamic scripting was enough to make Elasticsearch secure, it made script usage far more complicated.
With the release of Elasticsearch 1.3, we can use a new scripting language that will become the default in the next version: Groovy (see http://groovy.codehaus.org/). The reason for this is that it can be closed in its own sandbox, preventing dynamic scripts from doing any harm to the cluster and the operating system. Because Groovy can be sandboxed, Elasticsearch allows us to use it in dynamic scripts. Generally speaking, starting from version 1.3, if a scripting language can be sandboxed, it can be used in dynamic scripts. Groovy is not the only addition: Elasticsearch 1.3 also allows us to use Lucene expressions, which we will cover in this section. However, with the release of Elasticsearch 1.3.8 and 1.4.3, dynamic scripting was turned off even for Groovy. Because of that, if you still want to use dynamic scripting for Groovy, you need to add the script.groovy.sandbox.enabled property to elasticsearch.yml and set it to true, or make your Elasticsearch a bit less dynamic with stored scripts. Please be aware that enabling dynamic scripting exposes security issues and should be used with caution.
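For reference, the relevant elasticsearch.yml entry (based on the script.groovy.sandbox.enabled property mentioned above) would look as follows; enable it only when you understand the security implications:

```yaml
# elasticsearch.yml
# Re-enables dynamic Groovy scripting in Elasticsearch 1.3.8+ / 1.4.3+.
# WARNING: this exposes the cluster to script-based attacks.
script.groovy.sandbox.enabled: true
```

A restart of the node is needed for the setting to take effect.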
Because of the security issues and the introduction of Groovy, starting from Elasticsearch 1.4, MVEL is no longer available by default with the Elasticsearch distribution. The default language is Groovy, and MVEL is only available as a plugin installed on demand. Remember that if you want to drop MVEL, it is really easy to port your scripts to Groovy. Of course, you will be able to install the MVEL plugin, but dynamic scripting will still be forbidden for it.
Groovy is a dynamic language for the Java Virtual Machine. It was built on top of Java, with some inspiration from languages such as Python, Ruby, or Smalltalk. Even though Groovy is out of the context of this book, we decided to describe it because, as you know, it is the default scripting language starting from Elasticsearch 1.4. If you already know Groovy and you know how to use it in Elasticsearch, you can easily skip this section and move to the Scripting in full text context section of this book.
The thing to remember is that dynamic Groovy scripting is only enabled by default up to Elasticsearch 1.3.8 and 1.4.3. Starting from those versions, it is not possible to run dynamic Groovy scripts unless Elasticsearch is configured to allow them. All the queries in the examples that we will show next require you to add the script.groovy.sandbox.enabled property to elasticsearch.yml and set it to true.
Before we go into an introduction to Groovy, let's learn how to use it in Elasticsearch scripts. To do this, check the version you are using. If you are using an Elasticsearch version older than 1.4, you will need to add the lang property with the value groovy. For example:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "fields" : [ "_id", "_score", "title" ],
  "query" : {
    "function_score" : {
      "query" : { "match_all" : {} },
      "script_score" : {
        "lang" : "groovy",
        "script" : "_index[\"title\"].docCount()"
      }
    }
  }
}'
If you are using Elasticsearch 1.4 or newer, you can easily skip the scripting language definition because Elasticsearch will use Groovy by default.
Groovy allows us to define variables in scripts used in Elasticsearch. To define a new variable, we use the def keyword followed by the variable name and its value. For example, to define a variable named sum and assign an initial value of 0 to it, we would use the following snippet of code:
def sum = 0
Of course, we are not bound to simple variable definitions only. We can define lists, for example, a list of four values:
def listOfValues = [0, 1, 2, 3]
We can define a range of values, for example, from 0 to 9:
def rangeOfValues = 0..9
Finally, we can define maps:
def map = ['count':1, 'price':10, 'quantity': 12]
The preceding line of code will result in defining a map with three keys (count, price, and quantity) and three values corresponding to those keys (1, 10, and 12).
We are also allowed to use conditional statements in scripts. For example, we can use standard if - else if - else structures:
if (count > 1) {
  return count
} else if (count == 1) {
  return 1
} else {
  return 0
}
We can use the ternary operator:
def isHigherThanZero = (count > 0) ? true : false
The preceding code will assign a true value to the isHigherThanZero variable if the count variable is higher than 0. Otherwise, the value assigned to the isHigherThanZero variable will be false.
Of course, we are also allowed to use standard switch statements that allow us to use an elegant way of choosing the execution path based on the value of the statement:
def isEqualToTenOrEleven = false
switch (count) {
  case 10:
    isEqualToTenOrEleven = true
    break
  case 11:
    isEqualToTenOrEleven = true
    break
  default:
    isEqualToTenOrEleven = false
}
The preceding code will set the value of the isEqualToTenOrEleven variable to true if the count variable is equal to 10 or 11. Otherwise, the value of the isEqualToTenOrEleven variable will be set to false.
Of course, we can also use loops in Elasticsearch scripts written in Groovy. Let's start with the while loop, whose body is executed as long as the statement in the parentheses is true:
def i = 2
def sum = 0
while (i > 0) {
  sum = sum + i
  i--
}
The preceding loop will be executed twice and then end. In the first iteration, the i variable will have the value of 2, which means that the i > 0 statement is true. In the second iteration, the value of the i variable will be 1, which again makes the i > 0 statement true. In the third iteration, the i variable will be 0, which will cause the while loop not to execute its body and exit.
We can also use the for loop, which you are probably familiar with if you've used programming languages before. For example, to iterate 10 times over the for loop body, we could use the following code:
def sum = 0
for (i = 0; i < 10; i++) {
  sum += i
}
We can also iterate over a range of values:
def sum = 0
for (i in 0..9) {
  sum += i
}
Or iterate over a list of values:
def sum = 0
for (i in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) {
  sum += i
}
If we have a map, we can iterate over its entries:
def map = ['quantity':2, 'value':1, 'count':3]
def sum = 0
for (entry in map) {
  sum += entry.value
}
Now, after seeing some basics of Groovy, let's try to run an example script that will modify the score of our documents. We will implement the following algorithm for score calculation: if the year of publication is earlier than 1800, the score will be 1.0; if it is earlier than 1900, the score will be 2.0; otherwise, the score will be the year of publication minus 1000.
The query that does the preceding example looks as follows:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "fields" : [ "_id", "_score", "title", "year" ],
  "query" : {
    "function_score" : {
      "query" : { "match_all" : {} },
      "script_score" : {
        "lang" : "groovy",
        "script" : "def year = doc[\"year\"].value; if (year < 1800) { return 1.0 } else if (year < 1900) { return 2.0 } else { return year - 1000 }"
      }
    }
  }
}'
The result returned by Elasticsearch for the preceding query is as follows:
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 6, "max_score" : 961.0, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "2", "_score" : 961.0, "fields" : { "title" : [ "Catch-22" ], "year" : [ 1961 ], "_id" : "2" } }, { "_index" : "library", "_type" : "book", "_id" : "3", "_score" : 936.0, "fields" : { "title" : [ "The Complete Sherlock Holmes" ], "year" : [ 1936 ], "_id" : "3" } }, { "_index" : "library", "_type" : "book", "_id" : "1", "_score" : 929.0, "fields" : { "title" : [ "All Quiet on the Western Front" ], "year" : [ 1929 ], "_id" : "1" } }, { "_index" : "library", "_type" : "book", "_id" : "6", "_score" : 904.0, "fields" : { "title" : [ "The Peasants" ], "year" : [ 1904 ], "_id" : "6" } }, { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 2.0, "fields" : { "title" : [ "Crime and Punishment" ], "year" : [ 1886 ], "_id" : "4" } }, { "_index" : "library", "_type" : "book", "_id" : "5", "_score" : 1.0, "fields" : { "title" : [ "The Sorrows of Young Werther" ], "year" : [ 1774 ], "_id" : "5" } } ] } }
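The scoring logic from the Groovy script can be sketched in Python to verify these numbers (a hypothetical re-implementation for illustration only):

```python
def score(year):
    # Mirrors the Groovy script: very old books get small fixed scores,
    # newer books are scored as year - 1000.
    if year < 1800:
        return 1.0
    elif year < 1900:
        return 2.0
    return float(year - 1000)

# Publication years of the example documents and the scores Elasticsearch returned.
expected = {1961: 961.0, 1936: 936.0, 1929: 929.0, 1904: 904.0, 1886: 2.0, 1774: 1.0}
```

Running score() over the years confirms the values seen in the response above.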
Of course, the information we just gave is not a comprehensive guide to Groovy and was never intended to be one. Groovy is out of the scope of this book and we wanted to give you a glimpse of what to expect from it. If you are interested in Groovy and you want to extend your knowledge beyond what you just read, we suggest going to the official Groovy web page and reading the documentation available at http://groovy.codehaus.org/.
Of course, scripts are not only about modifying the score on the basis of data. In addition to this, we can use full text-specific statistics in our scripts, such as document frequency or term frequency. Let's look at these possibilities.
The first text-related information we can use in scripts is field-related statistics. The field-related statistics Elasticsearch allows us to use are as follows:
_index['field_name'].docCount(): the number of documents that contain a given field. This statistic doesn't take deleted documents into consideration.
_index['field_name'].sumttf(): the sum of total term frequencies, that is, the sum of the number of times all terms appear in a given field in all documents.
_index['field_name'].sumdf(): the sum of document frequencies, that is, the sum of the number of documents that each term of a given field appears in.
For example, if we would like to give our documents a score equal to the number of documents having the title field living in a given shard, we could run the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "fields" : [ "_id", "_score", "title" ],
  "query" : {
    "function_score" : {
      "query" : { "match_all" : {} },
      "script_score" : {
        "lang" : "groovy",
        "script" : "_index[\"title\"].docCount()"
      }
    }
  }
}'
If we would look at the response, we would see the following:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 6, "max_score" : 2.0, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "1", "_score" : 2.0, "fields" : { "title" : [ "All Quiet on the Western Front" ], "_id" : "1" } }, { "_index" : "library", "_type" : "book", "_id" : "6", "_score" : 2.0, "fields" : { "title" : [ "The Peasants" ], "_id" : "6" } }, { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 1.0, "fields" : { "title" : [ "Crime and Punishment" ], "_id" : "4" } }, { "_index" : "library", "_type" : "book", "_id" : "5", "_score" : 1.0, "fields" : { "title" : [ "The Sorrows of Young Werther" ], "_id" : "5" } }, { "_index" : "library", "_type" : "book", "_id" : "2", "_score" : 1.0, "fields" : { "title" : [ "Catch-22" ], "_id" : "2" } }, { "_index" : "library", "_type" : "book", "_id" : "3", "_score" : 1.0, "fields" : { "title" : [ "The Complete Sherlock Holmes" ], "_id" : "3" } } ] } }
As you can see, six documents were queried to return the preceding results. The first two documents have a score of 2.0, which means that they probably live in the same shard, while the four remaining documents have a score of 1.0, which means that they are alone in their shards.
The shard level statistics that we are allowed to use are as follows:
_index.numDocs(): the number of documents in a shard
_index.maxDoc(): the highest internal document identifier in a shard
_index.numDeletedDocs(): the number of deleted documents in a given shard
For example, if we would like to sort documents on the basis of the highest internal identifier each shard has, we could send the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "fields" : [ "_id", "_score", "title" ],
  "query" : {
    "function_score" : {
      "query" : { "match_all" : {} },
      "script_score" : {
        "lang" : "groovy",
        "script" : "_index.maxDoc()"
      }
    }
  }
}'
Of course, it doesn't make much sense to use those statistics alone, like we just did, but in combination with other text-related information, they can be very useful.
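For example, shard-level document counts can be combined with per-term statistics into an IDF-style weight. The following Python sketch uses hypothetical values for what _index.numDocs(), df(), and tf() might return on one shard; it only illustrates the kind of combination a script could compute, not an Elasticsearch formula:

```python
import math

# Hypothetical values a script might read on one shard:
num_docs = 6  # _index.numDocs()
df = 2        # _index['name']['document'].df()
tf = 2        # _index['name']['document'].tf() for the current document

# A classic TF-IDF-style weight built from those statistics.
idf = math.log(num_docs / df)
weight = tf * idf
```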
The next type of information that we can use in scripts is term level statistics. Elasticsearch allows us to use the following:
_index['field_name']['term'].df(): the number of documents a given term appears in, in a given field
_index['field_name']['term'].ttf(): the sum of the number of times a given term appears in all documents in a given field
_index['field_name']['term'].tf(): the number of times a given term appears in a given field in a document
To give a good example of how we can use the preceding statistics, let's index two documents by using the following commands:
curl -XPOST 'localhost:9200/scripts/doc/1' -d '{"name":"This is a document"}'
curl -XPOST 'localhost:9200/scripts/doc/2' -d '{"name":"This is a second document after the first document"}'
Now, let's try filtering documents on the basis of how many times a given term appears in the name field. For example, let's match only those documents that have the document term appearing at least twice in the name field. To do this, we could run the following query:
curl -XGET 'localhost:9200/scripts/_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : {
        "script" : {
          "lang" : "groovy",
          "script" : "_index[\"name\"][\"document\"].tf() > 1"
        }
      }
    }
  }
}'
The result of the query would be as follows:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "scripts", "_type" : "doc", "_id" : "2", "_score" : 1.0, "_source":{"name":"This is a second document after the first document"} } ] } }
As we can see, Elasticsearch did exactly what we wanted.
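To build intuition for these statistics, here is a rough Python sketch that computes tf, df, and ttf for the document term over the two example documents (a simplification of real analysis, assuming whitespace tokenization and lowercasing):

```python
from collections import Counter

docs = {
    1: "This is a document",
    2: "This is a second document after the first document",
}

def analyze(text):
    # Roughly what the standard analyzer does for these simple sentences.
    return text.lower().split()

term = "document"
tf = {doc_id: Counter(analyze(text))[term] for doc_id, text in docs.items()}
df = sum(1 for count in tf.values() if count > 0)  # documents containing the term
ttf = sum(tf.values())                             # total occurrences across documents

# The script filter tf() > 1 therefore matches only document 2.
matching = [doc_id for doc_id, count in tf.items() if count > 1]
```

The sketch yields tf values of 1 and 2, a df of 2, and a ttf of 3, matching what the script filter sees.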
In addition to the already presented information, we can also use term positions, offsets, and payloads in our scripts. To get those, we can use the _index['field_name'].get('term', OPTION) expression, where OPTION is one of the following:
_OFFSETS: term offsets
_PAYLOADS: term payloads
_POSITIONS: term positions
In addition to this, we can also use the _CACHE option. It allows us to iterate multiple times over all the term positions. Options can also be combined using the | operator; for example, if you would like to get term offsets and positions for the document term in the title field, you could use the following expression in your script:
_index['title'].get('document', _OFFSETS | _POSITIONS)
One thing to remember is that all the preceding options return an object that, depending on the options we have chosen, contains the following information:
startOffset: the start offset for the term
endOffset: the end offset for the term
payload: the payload for the term
payloadAsInt(value): returns the payload for the term converted to an integer, or the provided value if the current position doesn't have a payload
payloadAsFloat(value): returns the payload for the term converted to a float, or the provided value if the current position doesn't have a payload
payloadAsString(value): returns the payload for the term converted to a string, or the provided value if the current position doesn't have a payload
position: the position of the term
To illustrate an example, let's create a new index with the following mappings:
curl -XPOST 'localhost:9200/scripts2' -d '{
  "mappings" : {
    "doc" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "index_options" : "offsets"
        }
      }
    }
  }
}'
After this, we index two documents using the following commands:
curl -XPOST 'localhost:9200/scripts2/doc/1' -d '{"name":"This is the first document"}'
curl -XPOST 'localhost:9200/scripts2/doc/2' -d '{"name":"This is a second simple document"}'
Now, let's set the score of our documents to the sum of all the start offsets for the document term in the name field. To do this, we run the following query:
curl -XGET 'localhost:9200/scripts2/_search?pretty' -d '{
  "query" : {
    "function_score" : {
      "query" : { "match_all" : {} },
      "script_score" : {
        "lang" : "groovy",
        "script" : "def termInfo = _index[\"name\"].get(\"document\", _OFFSETS); def sum = 0; for (offset in termInfo) { sum += offset.startOffset; }; return sum;"
      }
    }
  }
}'
The results returned by Elasticsearch would be as follows:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 24.0, "hits" : [ { "_index" : "scripts2", "_type" : "doc", "_id" : "2", "_score" : 24.0, "_source":{"name":"This is a second simple document"} }, { "_index" : "scripts2", "_type" : "doc", "_id" : "1", "_score" : 18.0, "_source":{"name":"This is the first document"} } ] } }
As we can see, it works. If we look at the formatted script, we would see something like the following:
def termInfo = _index['name'].get('document', _OFFSETS)
def sum = 0
for (offset in termInfo) {
  sum += offset.startOffset
}
return sum
As you can see, it is nothing sophisticated. First, we get the information about the offsets in an object; next, we create a variable to hold the sum of the offsets. Then, we loop over all the offsets information (we can have multiple instances of offsets for different occurrences of the same term in a field) and, finally, we return the sum, which is used as the score of the document.
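To see where the scores 18 and 24 come from, here is a rough Python sketch approximating the start offsets the analyzer records for these two sentences (a simplification: the start offset of a token is simply its character position in the text):

```python
def start_offsets(text, term):
    # Naive whitespace tokenization; good enough for these example sentences.
    offsets = []
    position = 0
    for token in text.split(" "):
        if token.lower() == term:
            offsets.append(position)
        position += len(token) + 1  # +1 for the separating space
    return offsets

# "document" starts at character 18 in the first sentence and 24 in the second,
# which matches the scores Elasticsearch returned.
doc1 = sum(start_offsets("This is the first document", "document"))
doc2 = sum(start_offsets("This is a second simple document", "document"))
```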
In addition to all that we talked about in the preceding section, we are also able to get information about term vectors, if we turned them on during indexing. To do that, we can use the _index.termVectors() expression, which will return an Apache Lucene Fields object instance. You can find more about the Fields object in the Lucene Javadocs available at https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/Fields.html.
Although marked as experimental, we decided to talk about Lucene expressions because they are a new and very useful feature. What makes them handy is their speed: their execution is as fast as that of native scripts, yet they are used like dynamic scripts, with some limitations. This section will show you what you can do with Lucene expressions.
Lucene provides functionality to compile a JavaScript expression to Java bytecode. This is how Lucene expressions work and this is why they are as fast as native Elasticsearch scripts. Lucene expressions can be used in the following Elasticsearch functionalities:
the script_score script in the function_score query
script_fields
In addition to this, you have to remember that in Lucene expressions scripts you can use _score to access the document score and doc['field_name'].value to access the value of a single-valued numeric field in the document. Knowing the preceding information, we can try using Lucene expressions to modify the score of our documents. Let's get back to our library index and try to increase the score of each document by 10% of the year it was originally released. To do this, we could run the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "fields" : [ "_id", "_score", "title" ],
  "query" : {
    "function_score" : {
      "query" : { "match_all" : {} },
      "script_score" : {
        "lang" : "expression",
        "script" : "_score + doc[\"year\"].value * percentage",
        "params" : {
          "percentage" : 0.1
        }
      }
    }
  }
}'
The query is very simple, but let's discuss its structure. First, we are using the match_all query wrapped in the function_score query because we want all documents to match and we want to use a script for scoring. We are also setting the script language to expression (by setting the lang property to expression) to tell Elasticsearch that our script is a Lucene expressions script. Of course, we provide the script and we parameterize it, just like we would with any other script. The results of the preceding query look as follows:
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 6, "max_score" : 197.1, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "2", "_score" : 197.1, "fields" : { "title" : [ "Catch-22" ], "_id" : "2" } }, { "_index" : "library", "_type" : "book", "_id" : "3", "_score" : 194.6, "fields" : { "title" : [ "The Complete Sherlock Holmes" ], "_id" : "3" } }, { "_index" : "library", "_type" : "book", "_id" : "1", "_score" : 193.9, "fields" : { "title" : [ "All Quiet on the Western Front" ], "_id" : "1" } }, { "_index" : "library", "_type" : "book", "_id" : "6", "_score" : 191.4, "fields" : { "title" : [ "The Peasants" ], "_id" : "6" } }, { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 189.6, "fields" : { "title" : [ "Crime and Punishment" ], "_id" : "4" } }, { "_index" : "library", "_type" : "book", "_id" : "5", "_score" : 178.4, "fields" : { "title" : [ "The Sorrows of Young Werther" ], "_id" : "5" } } ] } }
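These scores are easy to verify: the match_all query assigns every document a base score of 1.0, so the expression evaluates to 1.0 + year * 0.1. A quick Python check using the years from the earlier example documents:

```python
base_score = 1.0   # score assigned by match_all
percentage = 0.1

years = {
    "Catch-22": 1961,
    "The Complete Sherlock Holmes": 1936,
    "All Quiet on the Western Front": 1929,
    "The Peasants": 1904,
    "Crime and Punishment": 1886,
    "The Sorrows of Young Werther": 1774,
}

# Mirrors the expression: _score + doc['year'].value * percentage
scores = {title: round(base_score + year * percentage, 1) for title, year in years.items()}
```

The computed values match the scores in the response above, for example 197.1 for Catch-22.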
Of course, the provided example is a very simple one. If you are interested in what Lucene expressions provide, please visit the official Javadocs available at http://lucene.apache.org/core/4_9_0/expressions/index.html?org/apache/lucene/expressions/js/package-summary.html. The documents available at the given URL provide more information about what Lucene exposes in the expressions module.