Scripting capabilities of Elasticsearch

Elasticsearch has a few functionalities in which scripts can be used. You've already seen examples such as updating documents and searching, and we will also use the scripting capabilities of Elasticsearch when we discuss aggregations. Even though scripting may seem like a rather advanced topic, we will look at the possibilities offered by Elasticsearch, because scripts are invaluable in certain situations.

Elasticsearch can use several languages for scripting. When the language is not explicitly declared, it assumes that Groovy (www.groovy-lang.org/) is used. The other languages available out of the box are the Lucene expression language and Mustache (https://mustache.github.io/). Of course, we can install plugins that make Elasticsearch understand additional scripting languages, such as JavaScript, MVEL, and Python. It is worth mentioning that, independent of the scripting language we choose, Elasticsearch exposes certain objects that we can use in our scripts. Let's start by briefly looking at what kind of information we are allowed to use in them.

Objects available during script execution

During different operations, Elasticsearch allows us to use different objects in our scripts. To develop a script that fits our use case, we should be familiar with these objects.

For example, during a search operation, the following objects are available:

  • _doc (also available as doc): This is an instance of the org.elasticsearch.search.lookup.LeafDocLookup object. It gives us access to the currently found document, together with its calculated score and field values.
  • _source: This is an instance of the org.elasticsearch.search.lookup.SourceLookup object. It provides access to the source of the current document and the values defined in the source.
  • _fields: This is an instance of the org.elasticsearch.search.lookup.LeafFieldsLookup object. It can be used to access the values of the document fields.

On the other hand, during a document update operation, the previously mentioned variables are not accessible. Elasticsearch exposes only the ctx object with the _source property, which provides access to the document currently being processed in the update request.
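
For example, a document update using a script could look like the following sketch. Note that this is only an illustration: it assumes that inline scripting has been enabled (we describe how to do this later in this section) and that our library index uses the book type and contains a tags list, so adjust the type and field names to your own data:

curl -XPOST 'localhost:9200/library/book/4/_update?pretty' -d '{
  "script" : {
    "inline" : "ctx._source.tags += new_tag",
    "params" : {
      "new_tag" : "classic"
    }
  }
}'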

As we have previously seen, several methods are mentioned in the context of document fields and their values. Let's now look at examples of how to get the value of a particular field using the previously mentioned objects available during a search operation. In parentheses after each script snippet, you can see what Elasticsearch will return for one of our example documents from the library index (we will use the document with identifier 4); a query reproducing these values is sketched right after the list:

  • _doc.title.value (and)
  • _source.title (crime and punishment)
  • _fields.title.value (null)
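
If you would like to reproduce these values yourself, one way is a script_fields request similar to the following sketch (it assumes that inline scripting has been enabled, which we describe later in this section):

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "ids" : { "values" : [ "4" ] }
  },
  "script_fields" : {
    "doc_title" : {
      "script" : { "inline" : "_doc.title.value" }
    },
    "source_title" : {
      "script" : { "inline" : "_source.title" }
    },
    "fields_title" : {
      "script" : { "inline" : "_fields.title.value" }
    }
  }
}'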

A bit confusing, isn't it? During indexing, the original document is, by default, stored in the _source field, and all the fields are present in it. In addition to that, the document is parsed, and every field may be stored in the index if it is marked as stored (that is, if the store property is set to true; by default, fields are not stored). Finally, the field value may be configured as indexed, which means that the field value is analyzed and placed in the index. To sum up, a field may land in the Elasticsearch index in the following ways:

  • As a part of the _source document
  • As a stored and unparsed original value
  • As an indexed value that is processed by an analyzer

In scripts, we have access to all these field representations. The only exception is the update operation, which, as we've mentioned before, gives us access only to the document _source as part of the ctx variable. You may wonder which version you should use. Well, if you want access to the processed form, the answer is simple: use the _doc object. What about _source and _fields? In most cases, _source is a good choice. It is usually fast and requires fewer disk operations than reading the original field values from the index. This is especially true when you need to read the values of multiple fields in your scripts; fetching the _source once is faster than fetching multiple independently stored fields from the index.
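
This also explains why _fields.title.value returned null in our example: the title field is not marked as stored, so there is no stored value to read. If you want _fields to return data for a given field, its mapping must set the store property to true, roughly as in the following sketch (shown for a hypothetical new index, since an existing field can't simply be switched to stored):

curl -XPUT 'localhost:9200/library_stored' -d '{
  "mappings" : {
    "book" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "store" : true
        }
      }
    }
  }
}'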

Script types

Elasticsearch allows us to use scripts in three different ways:

  • Inline scripts: The source of the script is directly defined in the query
  • In file scripts: The source is defined in an external file placed in the Elasticsearch config/scripts directory
  • As a document in the dedicated index: The source of the script is defined as a document in a special index available by using the /_scripts API end-point

The choice of how to define scripts depends on several factors. If you have scripts that you will use in many different queries, a file or the dedicated index seems to be the best solution. Scripts in files are probably less convenient, but they are preferred from the security point of view: they can't be overwritten and injected into your query, causing a security breach.

In file scripts

Scripts stored in files are the only way to use scripting if we don't want to enable dynamic (inline or indexed) scripting in Elasticsearch. The idea is that every script used by the queries is defined in its own file placed in the config/scripts directory. We will now look at this method of using scripts. Let's create an example file called tag_sort.groovy and place it in the config/scripts directory of our Elasticsearch instance (or instances, if we run a cluster). The content of this file should look as follows:

_doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999'

After a few seconds, Elasticsearch will automatically load the new file. You should see something like the following in the Elasticsearch logs:

[2015-08-30 13:14:33,005][INFO ][script                   ] [Alex Wilder] compiling script file [/Users/negativ/Developer/ES/es-current/config/scripts/tag_sort.groovy]

Note

If you have a multi-node cluster, you have to make sure that the script is available on every node.

Now we are ready to use this script in our queries. You may remember that we used exactly the same script in the Sorting data section of Chapter 4, Extending Your Querying Knowledge. The modified query that uses our script stored in the file looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "file" : "tag_sort"
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

We will return to this query in a moment, but first let's look at the next possible way of defining scripts: inline scripts.

Inline scripts

Inline scripts are a more convenient way of using scripts, especially for constantly changing and ad hoc queries. The main drawback of this approach is security. If we allow users to run any kind of query, including scripts, we expose our Elasticsearch instance to attackers, who could execute arbitrary code on the server running Elasticsearch with the same rights as the user running the Elasticsearch process. In the worst-case scenario, an attacker could use security holes to gain superuser rights. This is the reason why inline scripts are disabled by default. After careful consideration, you can enable them by adding the following line to the elasticsearch.yml file:

script.inline: on

After allowing inline scripts to be executed, we can run a query that looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "inline" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : \"u19999\""
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

Indexed scripts

The last option for defining scripts is storing them in a dedicated Elasticsearch index. For the same security reasons, dynamic execution of indexed scripts is disabled by default. To enable indexed scripts, we have to add a configuration option similar to the one we added for inline scripts. We need to add the following line to the elasticsearch.yml file:

script.indexed: on

After adding the preceding property to all the nodes and restarting the cluster, we will be ready to start using the indexed scripts. Elasticsearch provides an additional, dedicated endpoint for this purpose. Let's store our script:

curl -XPOST 'localhost:9200/_scripts/groovy/tag_sort' -d '{
  "script" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : \"u19999\""
}'

The script is ready, but let's discuss what we just did. We sent an HTTP POST request to the special _scripts REST end-point, specifying the language of the script (groovy in our case) and its name (tag_sort) in the URL. The body of the request is the script itself.
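
The same end-point also allows us to read the script back or delete it, which is useful when you want to check what exactly is stored in the cluster:

curl -XGET 'localhost:9200/_scripts/groovy/tag_sort?pretty'
curl -XDELETE 'localhost:9200/_scripts/groovy/tag_sort'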

We can now move on to the query, which looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "id" : "tag_sort"
       },
       "type" : "string",
       "order" : "asc"
     }
  }
}'

As we see, the query is practically identical to the query used with the script defined in a file. The only difference is that we provided the identifier of the script using the id parameter instead of providing the file name.

Querying with scripts

If we look at any request made to Elasticsearch that uses scripts, we will notice some similar properties, which are as follows:

  • script: This property wraps the script definition.
  • inline: This property holds the code of the script itself.
  • id: This property defines the identifier of the indexed script.
  • file: This property defines the filename of the script (without the extension).
  • lang: This property defines the language of the script. If it is omitted, Elasticsearch assumes groovy.
  • params: This object contains the parameters and their values. Every defined parameter can be used inside the script by specifying that parameter's name. Parameters allow us to write cleaner code that is also executed more efficiently: scripts using parameters are executed faster than code with embedded constants because of caching.

Scripting with parameters

As our scripts become more and more complicated, the need for creating multiple, almost identical scripts can appear. Such scripts usually differ only in the values they use, while the logic behind them is exactly the same. In our simple example, we used a hardcoded value to mark documents with an empty tags list. Let's change this so that the value can be provided as a parameter. Let's use the in file script definition and create a tag_sort_with_param.groovy file with the following contents:

_doc.tags.values.size() > 0 ? _doc.tags.values[0] : tvalue

The only change we've made is the introduction of the parameter named tvalue, which can be set in the query in the following way:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "file" : "tag_sort_with_param",
        "params" : {
          "tvalue" : "000"
        }
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

The params section defines all the script parameters. In our simple example, we've only used a single parameter, but of course we can have multiple parameters in a single query.
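
For example, a hypothetical tag_sort_multi_param.groovy file could take both the index of the tag to sort on and the fallback value as parameters:

_doc.tags.values.size() > tindex ? _doc.tags.values[tindex] : tvalue

The params section of the query would then contain both tindex (for example, 1) and tvalue.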

Script languages

As we already said, the default language for scripting is Groovy. However, we are not limited to a single scripting language when using Elasticsearch. In fact, if you would like to, you can even use Java to write your scripts. In addition to that, the community behind Elasticsearch provides support for additional languages as plugins. So, if you are willing to install plugins, you can extend the list of scripting languages that Elasticsearch supports even further. You may wonder why you would even consider using a scripting language other than the default Groovy. The first reason is your own preference: if you are a Python enthusiast, you are probably already thinking about how to use Python for your Elasticsearch scripts. The other reason could be security. When we talked about inline scripts, we told you that they are turned off by default. This is not exactly true for all the scripting languages available out of the box. Inline scripts are disabled by default when using Groovy, but you can use Lucene expressions and Mustache without any issues. This is because those languages are sandboxed, which means that security-sensitive functions are turned off. And, of course, the last factor when choosing a language is performance. Theoretically, native scripts (written in Java) should perform better than the others, but you should remember that the difference can be insignificant. You should always weigh the cost of development and measure the performance.
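
For example, because Lucene expressions are sandboxed, an inline script sort using them works without changing any configuration. The following sketch assumes that the documents in the library index contain a numeric year field (expressions work only on numeric fields); the '\'' sequences are just shell escaping that puts single quotes inside the single-quoted request body:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "inline" : "doc['\''year'\''].value",
        "lang" : "expression"
      },
      "type" : "number",
      "order" : "desc"
    }
  }
}'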

Using languages other than the embedded ones

Using Groovy for scripting is a simple and sufficient solution for most use cases. However, you may prefer to use something different, such as JavaScript, Python, or MVEL. Before using another language, we must install an appropriate plugin. You can read more details about plugins in the Elasticsearch plugins section of Chapter 9, Elasticsearch Cluster. For now, we'll just run the following command from the Elasticsearch directory:

bin/plugin install lang-javascript

The preceding command will install a plugin that will allow the usage of JavaScript as the scripting language. The only change we should make in the request is to add the additional information about the language we are using for scripting and, of course, modify the script itself to correctly use the new language. Look at the following example:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "inline" : "_doc.tags.values.length > 0 ? _doc.tags.values[0] :"u19999";",
        "lang" : "javascript"
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

As you can see, we've used JavaScript for scripting instead of the default Groovy. The lang parameter informs Elasticsearch about the language being used.

Using native code

In case the scripts are too slow or you don't like scripting languages, Elasticsearch allows you to write Java classes and use them instead of scripts. There are two possible ways of adding native scripts: adding the classes defining the scripts to the Elasticsearch classpath, or providing the scripts as part of a plugin. We will describe the second solution, as it is more elegant.

The factory implementation

We need to implement at least two classes to create a new native script. The first one is a factory for our script. For now, let's focus on it. The following sample code illustrates the factory for our script:

package pl.solr.elasticsearch.examples.scripts;

import java.util.Map;

import org.elasticsearch.common.Nullable;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.NativeScriptFactory;

public class HashCodeSortNativeScriptFactory implements NativeScriptFactory {

  @Override
  public ExecutableScript newScript(@Nullable Map<String, Object> params) {
    return new HashCodeSortScript(params);
  }

  @Override
  public boolean needsScores() {
    return false;
  }

}

Our class should implement the org.elasticsearch.script.NativeScriptFactory interface, which forces us to implement two methods. The newScript() method takes the parameters defined in the API call and returns an instance of our script. Finally, the needsScores() method informs Elasticsearch whether we want to use scoring and, thus, whether the score should be calculated.

Implementing the native script

Now let's look at the implementation of our script. The idea is simple: our script will be used for sorting. Documents will be ordered by the hashCode() value of the chosen field, and documents without a value in that field will appear first on the results list. We know this logic doesn't make much sense, but it is simple and therefore good for demonstration purposes. The source code of our native script looks as follows:

package pl.solr.elasticsearch.examples.scripts;

import java.util.Map;

import org.elasticsearch.script.AbstractSearchScript;

public class HashCodeSortScript extends AbstractSearchScript {
  private String field = "name";

  public HashCodeSortScript(Map<String, Object> params) {
    if (params != null && params.containsKey("field")) {
      this.field = params.get("field").toString();
    }
  }

  @Override
  public Object run() {
    Object value = source().get(field);
    if (value != null) {
      return value.hashCode();
    }
    return 0;
  }

}

First of all, our class inherits from the org.elasticsearch.script.AbstractSearchScript class and implements the run() method. This is where we get the appropriate values from the current document, process them according to our logic, and return the result. You may notice the source() call; it gives access to exactly the same _source that we used when dealing with non-native scripts. The doc() and fields() methods are also available, and they follow the same logic we described earlier.

It is also worth looking at how we've used the parameters. We assume that a user can pass the field parameter, telling us which document field should be used, and we provide a default value (name) for this parameter.

The plugin definition

We said that we would install our script as part of a plugin, which is why we need some additional files. The first file is the plugin initialization class, where we tell Elasticsearch about our new script:

package pl.solr.elasticsearch.examples.scripts;

import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.script.ScriptModule;

public class ScriptPlugin extends Plugin {

  @Override
  public String description() {
    return "The example of native sort script";
  }

  @Override
  public String name() {
    return "naive-sort-plugin";
  }

  public void onModule(final ScriptModule module) {
    module.registerScript("native_sort", HashCodeSortNativeScriptFactory.class);
  }

}

The implementation is easy. The description() and name() methods are informational only, so let's focus on the onModule() method. In our case, we need access to the script module, the Elasticsearch service that handles scripts and scripting languages. This is why we define onModule() with a single ScriptModule argument. Thanks to Elasticsearch magic, we can use this module and register our script so that it can be found by the engine. We use the registerScript() method, which takes the script name and the previously defined factory class.

The second file we need is the plugin descriptor file: plugin-descriptor.properties. It defines the constants used by the Elasticsearch plugin subsystem. Let's look at the contents of this file:

jvm=true
classname=pl.solr.elasticsearch.examples.scripts.ScriptPlugin
elasticsearch.version=2.2.0
version=0.0.1-SNAPSHOT
name=native_script
description=Example Native Scripts
java.version=1.7

The appropriate lines have the following meaning:

  • jvm: Tells Elasticsearch that our plugin contains Java code
  • classname: Defines the main class with the plugin definition
  • elasticsearch.version and java.version: Define the Elasticsearch version supported by the plugin and the Java version required to run it
  • name and description: Provide an informative name and a short description of our plugin

And that's it. We have all the files needed to run our script. Please note that you can have more than one script packed in a single plugin.

Installing the plugin

Now it's time to install our native script embedded in the plugin. After packing the compiled classes as a JAR archive, we should put it in the Elasticsearch plugins/native-script directory. The native-script part is the root directory of our plugin, and you may name it as you wish. In this directory, you also need the prepared plugin-descriptor.properties file. This makes our plugin visible to Elasticsearch.
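
Assuming the compiled classes were packed into a file called native-script-0.0.1-SNAPSHOT.jar (the actual name of the archive doesn't matter), the plugin directory could look as follows:

plugins/
  native-script/
    native-script-0.0.1-SNAPSHOT.jar
    plugin-descriptor.properties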

Running the script

After restarting Elasticsearch (or the whole cluster if you run more than a single node), we can start sending the queries that use our native script. For example, we will send a query that uses our previously indexed data from the library index. This example query looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "script" : "native_sort",
        "lang" : "native",
        "params" : {
          "field" : "otitle"
        }
      },
      "type" : "string",
      "order" : "asc"
    }
  }
}'

Note the params part of the query. In this call, we want to sort on the otitle field. We provide the script name (native_sort) and the script language (native); the latter is required so that Elasticsearch knows it should look for a native script. If everything goes well, we should see our results sorted by our custom sort logic. If we look at the response from Elasticsearch, we will see that the documents without the otitle field are at the first few positions of the results list and that their sort value is 0.
