Scripting changes between Elasticsearch versions

One of the great things about Elasticsearch is its scripting capabilities. You can use scripts for score calculation, text-based scoring, data filtering, and data analysis. Although scripting can be slow in some cases, such as calculating the score for each document, we think that this part of Elasticsearch is important. Because of this, we decided that this section should bring you information about the changes and extend the information present in the Elasticsearch Server Second Edition book.

Scripting changes

Elasticsearch scripting has gone through a lot of refactoring in version 1.0 and in the versions that came after it. Because of those changes, some users were left wondering why their scripts stopped working after upgrading to Elasticsearch 1.2 and what was happening in general. This section will try to give you an insight into what to expect.

Security issues

During the lifetime of Elasticsearch 1.1, an exploit was published (see http://bouk.co/blog/elasticsearch-rce/): it showed that, with the default configuration, Elasticsearch was not fully secure. Because of that, dynamic scripting was disabled by default in Elasticsearch 1.2. Although disabling dynamic scripting was enough to make Elasticsearch secure, it made script usage far more complicated.

Groovy – the new default scripting language

With the release of Elasticsearch 1.3, we can use a new scripting language that will become the default in the next version: Groovy (see http://groovy.codehaus.org/). The reason for this is that it can be closed in its own sandbox, preventing dynamic scripts from doing any harm to the cluster and the operating system. Because Groovy can be sandboxed, Elasticsearch allows us to use dynamic scripting with it. Generally speaking, starting from version 1.3, if a scripting language can be sandboxed, it can be used in dynamic scripts. However, Groovy is not everything: Elasticsearch 1.3 also allows us to use Lucene expressions, which we will cover in this section. Note that with the release of Elasticsearch 1.3.8 and 1.4.3, dynamic scripting was turned off even for Groovy. Because of that, if you still want to use dynamic scripting with Groovy, you need to add the script.groovy.sandbox.enabled property to elasticsearch.yml and set it to true, or make your Elasticsearch a bit less dynamic with stored scripts. Please be aware, though, that enabling dynamic scripting exposes your cluster to security issues and should be done with caution.
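A minimal elasticsearch.yml fragment that re-enables dynamic Groovy scripting could look like the following sketch (the property name is the one mentioned above; remember the security implications before enabling it):

```yaml
# elasticsearch.yml fragment: re-enable dynamic Groovy scripting
# on Elasticsearch 1.3.8+ / 1.4.3+. Use with caution; see the
# security discussion above.
script.groovy.sandbox.enabled: true
```

The node has to be restarted for the change to take effect.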

Removal of MVEL language

Because of the security issues and the introduction of Groovy, starting from Elasticsearch 1.4, MVEL is no longer available by default with the Elasticsearch distribution. The default language is Groovy, and MVEL is only available as a plugin installed on demand. Remember that if you want to drop MVEL scripts, it is really easy to port them to Groovy. Of course, you will be able to install the MVEL plugin, but dynamic scripting will still be forbidden.
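To give you an idea of how small the porting effort usually is, the following sketch compares a hypothetical MVEL scoring script with its Groovy equivalent (the field name is ours; simple doc-value expressions are typically identical in both languages):

```groovy
// MVEL version of a simple scoring script:
//   doc['year'].value - 1000
// Groovy version: identical for such simple expressions.
doc['year'].value - 1000

// Differences show up in more advanced constructs; for example,
// MVEL's   foreach (item : list) { ... }
// becomes  for (item in list) { ... }   in Groovy.
```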

Short Groovy introduction

Groovy is a dynamic language for the Java Virtual Machine. It was built on top of Java, with some inspiration from languages such as Python, Ruby, and Smalltalk. Even though Groovy is outside the scope of this book, we decided to describe it because, as you know, it is the default scripting language starting from Elasticsearch 1.4. If you already know Groovy and how to use it in Elasticsearch, you can easily skip this section and move to the Scripting in full text context section of this book.

Note

The thing to remember is that Groovy is only sandboxed up to Elasticsearch 1.3.8 and 1.4.3. Starting from those versions, it is not possible to run dynamic Groovy scripts unless Elasticsearch is configured to allow them. All the queries in the examples that we will show next require you to add the script.groovy.sandbox.enabled property to elasticsearch.yml and set it to true.

Using Groovy as your scripting language

Before we go into an introduction to Groovy, let's learn how to use it in Elasticsearch scripts. To do this, first check the version you are using. If you are using a version of Elasticsearch older than 1.4, you will need to add the lang property with the value of groovy. For example:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "fields" : [ "_id", "_score", "title" ],
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "lang" : "groovy",
    "script" : "_index["title"].docCount()"
   }
  }
 }
}'

If you are using Elasticsearch 1.4 or newer, you can easily skip the scripting language definition because Elasticsearch will use Groovy by default.
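For example, on Elasticsearch 1.4 or newer, the query shown earlier can be issued without the lang property (this sketch assumes the same library index):

```shell
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "fields" : [ "_id", "_score", "title" ],
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "script" : "_index[\"title\"].docCount()"
   }
  }
 }
}'
```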

Variable definition in scripts

Groovy allows us to define variables in scripts used in Elasticsearch. To define a new variable, we use the def keyword followed by the variable name and its value. For example, to define a variable named sum and assign an initial value of 0 to it, we would use the following snippet of code:

def sum = 0

Of course, we are not bound to simple variable definitions only. We can also define lists; for example, a list of four values:

def listOfValues = [0, 1, 2, 3]

We can define a range of values, for example, from 0 to 9:

def rangeOfValues = 0..9

Finally, we can define maps:

def map = ['count':1, 'price':10, 'quantity': 12]

The preceding line of code will result in defining a map with three keys (count, price, and quantity) and three values corresponding to those keys (1, 10, and 12).
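To complete the picture, the following short sketch shows how such a map can be read and modified (the variable names are ours):

```groovy
def map = ['count':1, 'price':10, 'quantity': 12]
def count = map['count']      // subscript access
def price = map.price         // property-style access
map['total'] = count * price  // adding a new entry to the map
```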

Conditionals

We are also allowed to use conditional statements in scripts. For example, we can use the standard if - else if - else structure:

if (count > 1) {
  return count
} else if (count == 1) {
  return 1
} else {
  return 0
}

We can use the ternary operator:

def isHigherThanZero = (count > 0) ? true : false

The preceding code will assign a true value to the isHigherThanZero variable if the count variable is higher than 0. Otherwise, the value assigned to the isHigherThanZero variable will be false.

Of course, we are also allowed to use the standard switch statement, which gives us an elegant way of choosing the execution path based on the value of an expression:

def isEqualToTenOrEleven = false;
switch (count) {
  case 10:
    isEqualToTenOrEleven = true
    break
  case 11:
    isEqualToTenOrEleven = true
    break
  default:
    isEqualToTenOrEleven = false
}

The preceding code will set the value of the isEqualToTenOrEleven variable to true if the count variable is equal to 10 or 11. Otherwise, the value of the isEqualToTenOrEleven variable will be set to false.

Loops

Of course, we can also use loops in Elasticsearch scripts written in Groovy. Let's start with the while loop, which is executed as long as the statement in the parentheses is true:

def i = 2
def sum = 0
while (i > 0) {
  sum = sum + i
  i--
} 

The preceding loop will be executed twice and then end. In the first iteration, the i variable will have the value of 2, which means that the i > 0 statement is true. In the second iteration, the value of the i variable will be 1, which again makes the i > 0 statement true. In the third iteration, the i variable will be 0, which will cause the while loop to skip its body and exit.

We can also use the for loop, which you are probably familiar with if you've used programming languages before. For example, to iterate 10 times over the for loop body, we could use the following code:

def sum = 0
for ( i = 0; i < 10; i++) {
  sum += i
}

We can also iterate over a range of values:

def sum = 0
for ( i in 0..9 ) {
  sum += i
}

Or iterate over a list of values:

def sum = 0
for ( i in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ) {
  sum += i
}

If we have a map, we can iterate over its entries:

def map = ['quantity':2, 'value':1, 'count':3]
def sum = 0
for ( entry in map ) {
  sum += entry.value
}

An example

Now after seeing some basics of Groovy, let's try to run an example script that will modify the score of our documents. We will implement the following algorithm for score calculation:

  • if the year field holds a value lower than 1800, we will give the book a score of 1.0
  • if the year field holds a value from 1800 to 1899, we will give the book a score of 2.0
  • the rest of the books should have a score equal to the value of the year field minus 1000

The query that implements the preceding algorithm looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "fields" : [ "_id", "_score", "title", "year" ],
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "lang" : "groovy",
    "script" : "def year = doc["year"].value; if (year < 1800) {  return 1.0 } else if (year < 1900) { return 2.0 } else { return  year - 1000 }"
   }
  }
 }
}'

Note

You may have noticed that we've separated the def year = doc["year"].value statement in the script from the rest of it using the ; character. We did this because the script is written on a single line and we need to tell Groovy where our assignment statement ends and where the next statement starts.

The result returned by Elasticsearch for the preceding query is as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 961.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 961.0,
      "fields" : {
        "title" : [ "Catch-22" ],
        "year" : [ 1961 ],
        "_id" : "2"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 936.0,
      "fields" : {
        "title" : [ "The Complete Sherlock Holmes" ],
        "year" : [ 1936 ],
        "_id" : "3"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 929.0,
      "fields" : {
        "title" : [ "All Quiet on the Western Front" ],
        "year" : [ 1929 ],
        "_id" : "1"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "6",
      "_score" : 904.0,
      "fields" : {
        "title" : [ "The Peasants" ],
        "year" : [ 1904 ],
        "_id" : "6"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 2.0,
      "fields" : {
        "title" : [ "Crime and Punishment" ],
        "year" : [ 1886 ],
        "_id" : "4"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "5",
      "_score" : 1.0,
      "fields" : {
        "title" : [ "The Sorrows of Young Werther" ],
        "year" : [ 1774 ],
        "_id" : "5"
      }
    } ]
  }
}

As you can see, our script worked as we wanted it to.

There is more

Of course, the information we just gave is not a comprehensive guide to Groovy and was never intended to be one. Groovy is outside the scope of this book and we only wanted to give you a glimpse of what to expect from it. If you are interested in Groovy and want to extend your knowledge beyond what you just read, we suggest going to the official Groovy web page and reading the documentation available at http://groovy.codehaus.org/.

Scripting in full text context

Of course, scripts are not only about modifying the score on the basis of data. In addition to this, we can use full text-specific statistics in our scripts, such as document frequency or term frequency. Let's look at these possibilities.

Field-related information

The first text-related information we would like to talk about is field-related statistics. The field-related information Elasticsearch allows us to use is as follows:

  • _index['field_name'].docCount(): The number of documents that contain a given field. This statistic doesn't take deleted documents into consideration.
  • _index['field_name'].sumttf(): The sum of the number of times all terms appear in a given field, across all documents.
  • _index['field_name'].sumdf(): The sum of document frequencies; that is, the sum of the number of documents that each term of a given field appears in.

Note

Please remember that the preceding information is given for a single shard, not for the whole index, so they may differ between shards.

For example, if we would like to give our documents a score equal to the number of documents containing the title field that live in a given shard, we could run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "fields" : [ "_id", "_score", "title" ],
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "lang" : "groovy",
    "script" : "_index["title"].docCount()"
   }
  }
 }
}'

If we look at the response, we will see the following:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 2.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 2.0,
      "fields" : {
        "title" : [ "All Quiet on the Western Front" ],
        "_id" : "1"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "6",
      "_score" : 2.0,
      "fields" : {
        "title" : [ "The Peasants" ],
        "_id" : "6"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 1.0,
      "fields" : {
        "title" : [ "Crime and Punishment" ],
        "_id" : "4"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "5",
      "_score" : 1.0,
      "fields" : {
        "title" : [ "The Sorrows of Young Werther" ],
        "_id" : "5"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 1.0,
      "fields" : {
        "title" : [ "Catch-22" ],
        "_id" : "2"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 1.0,
      "fields" : {
        "title" : [ "The Complete Sherlock Holmes" ],
        "_id" : "3"
      }
    } ]
  }
}


As you can see, six documents were returned in the preceding results. The first two documents have a score of 2.0, which means that they probably live in the same shard, while the four remaining documents have a score of 1.0, which means that each of them is alone in its shard.

Shard level information

The shard level information that we are allowed to use is as follows:

  • _index.numDocs(): The number of documents in a shard
  • _index.maxDoc(): The highest internal Lucene document identifier in a shard, plus one (deleted documents included)
  • _index.numDeletedDocs(): The number of deleted documents in a given shard

Note

Please remember that the preceding information is given for a single shard, not for the whole index, so they may differ between shards.

For example, if we would like to sort documents on the basis of the highest internal identifier each shard has, we could send the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "fields" : [ "_id", "_score", "title" ],
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "lang" : "groovy",
    "script" : "_index.maxDoc()"
   }
  }
 }
}'

Of course, it doesn't make much sense to use those statistics alone, like we just did, but combined with other text-related information, they can be very useful.

Term level information

The next type of information that we can use in scripts is term level information. Elasticsearch allows us to use the following:

  • _index['field_name']['term'].df(): Returns the number of documents that a given term appears in, for a given field
  • _index['field_name']['term'].ttf(): Returns the sum of the number of times a given term appears in a given field, across all documents
  • _index['field_name']['term'].tf(): Returns the number of times a given term appears in a given field in the current document

To give a good example of how we can use the preceding statistics, let's index two documents by using the following commands:

curl -XPOST 'localhost:9200/scripts/doc/1' -d '{"name":"This is a document"}'
curl -XPOST 'localhost:9200/scripts/doc/2' -d '{"name":"This is a second document after the first document"}'

Now, let's try filtering documents on the basis of how many times a given term appears in the name field. For example, let's match only those documents in which the document term appears at least twice in the name field. To do this, we could run the following query:

curl -XGET 'localhost:9200/scripts/_search?pretty' -d '{
 "query" : {
  "filtered" : {
   "query" : {
    "match_all" : {}
   },
   "filter" : {
    "script" : {
     "lang" : "groovy",
     "script": "_index["name"]["document"].tf() > 1"
    }
   }
  }
 }
}'

The result of the query would be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "scripts",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name":"This is a second document after the first  document"}
    } ]
  }
}

As we can see, Elasticsearch did exactly what we wanted.

More advanced term information

In addition to the already presented information, we can also use term positions, offsets, and payloads in our scripts. To get those, we can use the _index['field_name'].get('term', OPTION) expression, where OPTION is one of the following:

  • _OFFSETS: Term offsets
  • _PAYLOADS: Term payloads
  • _POSITIONS: Term positions

Note

Please remember that the field you want to get offsets or positions for needs to have them enabled during indexing.

In addition to this, we can also use the _CACHE option. It allows us to iterate multiple times over all the term positions. Options can also be combined using the | operator; for example, if you would like to get term offsets and positions for the document term in the title field, you could use the following expression in your script:

_index['title'].get('document', _OFFSETS | _POSITIONS)

One thing to remember is that all the preceding options return an object that, depending on the options we have chosen, contains the following information:

  • startOffset: The start offset for the term
  • endOffset: The end offset for the term
  • payload: The payload for the term
  • payloadAsInt(value): Returns the payload for the term converted to an integer, or the provided value if the current position doesn't have a payload
  • payloadAsFloat(value): Returns the payload for the term converted to a float, or the provided value if the current position doesn't have a payload
  • payloadAsString(value): Returns the payload for the term converted to a string, or the provided value if the current position doesn't have a payload
  • position: The position of the term

To illustrate an example, let's create a new index with the following mappings:

curl -XPOST 'localhost:9200/scripts2' -d '{
 "mappings" : {
  "doc" : {
   "properties" : {
    "name" : { "type" : "string", "index_options" : "offsets" }
   }
  }
 }
}'

After this, we index two documents using the following commands:

curl -XPOST 'localhost:9200/scripts2/doc/1' -d '{"name":"This is the first document"}'
curl -XPOST 'localhost:9200/scripts2/doc/2' -d '{"name":"This is a second simple document"}'

Now, let's set the score of our documents to the sum of all the start offsets for the document term in the name field. To do this, we run the following query:

curl -XGET 'localhost:9200/scripts2/_search?pretty' -d '{
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "lang" : "groovy",
"script": "def termInfo =  _index["name"].get("document",_OFFSETS); def sum = 0; for (offset in termInfo) { sum += offset.startOffset; }; return sum;"
   }
  }
 }
}'

The results returned by Elasticsearch would be as follows:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 24.0,
    "hits" : [ {
      "_index" : "scripts2",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 24.0,
      "_source":{"name":"This is a second simple document"}
    }, {
      "_index" : "scripts2",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 18.0,
      "_source":{"name":"This is the first document"}
    } ]
  }
}

As we can see, it works. If we look at the formatted script, we will see something like the following:

def termInfo = _index['name'].get('document', _OFFSETS);
def sum = 0;
for (offset in termInfo) {
  sum += offset.startOffset;
}
return sum;

As you can see, it is nothing sophisticated. First, we get the information about the offsets into an object; next, we create a variable to hold the sum of the offsets. Then, we loop over all the offsets information (we can have multiple offsets for different occurrences of the same term in a field) and, finally, we return the sum, which sets the document's score to the returned value.
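A very similar script can be built with _POSITIONS instead of _OFFSETS. For example, the following sketch sums the positions of the document term in the name field (it assumes the same scripts2 index; the offsets index option also records positions):

```groovy
// Sum the positions of all occurrences of the term 'document'
// in the 'name' field of the current document.
def termInfo = _index['name'].get('document', _POSITIONS);
def sum = 0;
for (pos in termInfo) {
  sum += pos.position;
}
return sum;
```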

Note

In addition to all that we talked about in the preceding section, we are also able to get information about term vectors if we turned them on during indexing. To do that, we can use the _index.termVectors() expression, which will return an Apache Lucene Fields object instance. You can find more about the Fields object in the Lucene Javadocs available at https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/index/Fields.html.

Lucene expressions explained

Although Lucene expressions are marked as experimental, we decided to talk about them because they are a new and very good feature. What makes Lucene expressions very handy is that they are very fast: their execution is as fast as that of native scripts, yet they work like dynamic scripts, with some limitations. This section will show you what you can do with Lucene expressions.

The basics

Lucene provides the functionality to compile a JavaScript expression to Java bytecode. This is how Lucene expressions work, and this is why they are as fast as native Elasticsearch scripts. Lucene expressions can be used in the following Elasticsearch functionalities:

  • Scripts responsible for sorting
  • Aggregations that work on numeric fields
  • In the script_score parameter of the function_score query
  • In queries using script_fields

In addition to this, you have to remember that:

  • Lucene expressions can be only used on numeric fields
  • Stored fields can't be accessed using Lucene expressions
  • Missing values for a field will be given a value of 0
  • You can use _score to access the document score and doc['field_name'].value to access the value of a single valued numeric field in the document
  • No loops are possible, only single statements

An example

Knowing the preceding information, we can try using Lucene expressions to modify the score of our documents. Let's get back to our library index and try to increase the score of each document by 10 percent of the year it was originally released. To do this, we could run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "fields" : [ "_id", "_score", "title" ],
 "query" : {
  "function_score" : {
   "query" : {
    "match_all" : {}
   },
   "script_score" : {
    "lang" : "expression",
    "script" : "_score + doc["year"].value * percentage",
    "params" : {
     "percentage" : 0.1
    }
   }
  }
 }
}'

The query is very simple, but let's discuss its structure. First, we use the match_all query wrapped in the function_score query, because we want all documents to match and we want to use a script for scoring. We also set the lang property to expression to tell Elasticsearch that our script is a Lucene expressions script. Of course, we provide the script and parameterize it, just like we would with any other script. The results of the preceding query look as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 197.1,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 197.1,
      "fields" : {
        "title" : [ "Catch-22" ],
        "_id" : "2"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 194.6,
      "fields" : {
        "title" : [ "The Complete Sherlock Holmes" ],
        "_id" : "3"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 193.9,
      "fields" : {
        "title" : [ "All Quiet on the Western Front" ],
        "_id" : "1"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "6",
      "_score" : 191.4,
      "fields" : {
        "title" : [ "The Peasants" ],
        "_id" : "6"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 189.6,
      "fields" : {
        "title" : [ "Crime and Punishment" ],
        "_id" : "4"
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "5",
      "_score" : 178.4,
      "fields" : {
        "title" : [ "The Sorrows of Young Werther" ],
        "_id" : "5"
      }
    } ]
  }
}

As we can see, Elasticsearch did what it was asked to do.

There is more

Of course, the provided example is a very simple one. If you are interested in what Lucene expressions provide, please visit the official Javadocs available at http://lucene.apache.org/core/4_9_0/expressions/index.html?org/apache/lucene/expressions/js/package-summary.html. The documents available at the given URL provide more information about what Lucene exposes in the expressions module.
