Creating the custom analysis plugin

The last thing we want to discuss when it comes to custom Elasticsearch plugins is extending the analysis process. We've chosen to show how to develop a custom analysis plugin because it is sometimes very useful, for example, when you want to introduce the custom analysis chain used in your company, or when you want to use a Lucene analyzer or filter that is not present in Elasticsearch itself or available as a plugin for it. Because creating an analysis extension is more complicated than developing the custom REST action we've seen earlier, we decided to leave it until the end of the chapter.

Implementation details

Developing a custom analysis plugin is the most complicated of the examples in this chapter, at least from the Elasticsearch point of view and in terms of the number of classes involved, so we will have more things to do compared to the previous examples. We will need to develop the following things:

  • The TokenFilter class extension (from the org.apache.lucene.analysis package) implementation that will be responsible for handling token reversing; we will call it CustomFilter
  • The AbstractTokenFilterFactory extension (from the org.elasticsearch.index.analysis package) that will be responsible for providing our CustomFilter instance to Elasticsearch; we will call it CustomFilterFactory
  • The custom analyzer, which will extend the org.apache.lucene.analysis.Analyzer class and provide the Lucene analyzer functionality; we will call it CustomAnalyzer
  • The analyzer provider, which we will call CustomAnalyzerProvider, which extends AbstractIndexAnalyzerProvider from the org.elasticsearch.index.analysis package, and which will be responsible for providing the analyzer instance to Elasticsearch
  • An extension of AnalysisModule.AnalysisBinderProcessor from the org.elasticsearch.index.analysis package, which will have information about the names under which our analyzer and token filter will be available in Elasticsearch; we will call it CustomAnalysisBinderProcessor
  • An extension of the AbstractComponent class from the org.elasticsearch.common.component package, which will inform Elasticsearch which factories should be used for our custom analyzer and token filter; we will call it CustomAnalyzerIndicesComponent
  • The AbstractModule extension (from the org.elasticsearch.common.inject package) that will inform Elasticsearch that our CustomAnalyzerIndicesComponent module should be a singleton; we will call it CustomAnalyzerModule
  • Finally, the usual AbstractPlugin extension (from the org.elasticsearch.plugins package) that will register our plugin; we will call it CustomAnalyzerPlugin

So let's start discussing the code.

Implementing TokenFilter

An interesting thing about the discussed plugin is that the whole analysis work is actually done at the Lucene level; what we need to do is write an org.apache.lucene.analysis.TokenFilter extension, which we will call CustomFilter. In order to do this, we need to initialize the super class and override the incrementToken method. Our class will be responsible for reversing the tokens, so that's the logic we want our analyzer and filter to have. The whole implementation of our CustomFilter class looks as follows:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomFilter extends TokenFilter {
  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

  protected CustomFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      // Only the first termAttr.length() characters of the buffer hold the
      // current term; the rest of the array may contain stale characters.
      int length = termAttr.length();
      if (length > 0) {
        StringBuilder builder = new StringBuilder(length);
        builder.append(termAttr.buffer(), 0, length).reverse();
        termAttr.setEmpty();
        termAttr.append(builder);
      }
      return true;
    } else {
      return false;
    }
  }
}

The first thing we see in the implementation is the following line:

private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

It allows us to retrieve the text of the token we are currently processing. In order to get access to other token information, we need to use other attributes. The list of attributes can be found by looking at the classes implementing Lucene's org.apache.lucene.util.Attribute interface (http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/util/Attribute.html). What you need to know now is that by calling the addAttribute method, which our filter inherits from Lucene's AttributeSource class, we can bind different attributes and use them during token processing.
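
For example, if a filter also needed each token's position in the original input, it could bind Lucene's OffsetAttribute next to the term attribute. The following class is only an illustration of the attribute binding pattern and is not part of our plugin:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Illustration only, not part of the plugin: a filter binding two attributes.
public class OffsetLoggingFilter extends TokenFilter {
  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAttr = addAttribute(OffsetAttribute.class);

  protected OffsetLoggingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Both attributes describe the token the stream is currently positioned on.
    System.out.println(termAttr.toString() + " starts at offset "
        + offsetAttr.startOffset() + " and ends at " + offsetAttr.endOffset());
    return true;
  }
}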

Then, we have the constructor, which is only used for super class initialization, so we can skip discussing it.

Finally, there is the incrementToken method, which returns true when there is a token left in the token stream to be processed, and false when there are no tokens left. So, the first thing we do is check whether there is a token to be processed by calling the incrementToken method of input, which is the TokenStream instance stored in the super class. Then, we read the term text from the attribute we bound in the first line of our class; note that only the first termAttr.length() characters of the array returned by the buffer method belong to the current token, which is why we copy exactly that many characters. If the term has any text (its length is greater than zero), we reverse it using a StringBuilder object, clear the term buffer (by calling setEmpty on the attribute), and append the reversed text to the emptied term buffer (by calling the append method of the attribute). After this, we return true, because our token is ready to be processed further. On the token filter level, we don't know whether the token will be processed further or not, so we need to be sure we return the correct information, just in case.
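
To see the filter behave exactly as described, we can drive it directly at the Lucene level, outside Elasticsearch. The following is a minimal, hypothetical test harness (the CustomFilterTest class is ours, not part of the plugin); note the mandatory reset, end, and close calls around the incrementToken loop:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomFilterTest {
  public static void main(String[] args) throws Exception {
    TokenStream stream = new CustomFilter(
        new WhitespaceTokenizer(new StringReader("mastering elasticsearch")));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();                        // mandatory before consuming the stream
    while (stream.incrementToken()) {
      System.out.println(term.toString()); // prints gniretsam, then hcraescitsale
    }
    stream.end();
    stream.close();
  }
}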

Implementing the TokenFilter factory

The factory for our token filter is one of the simplest classes in the discussed plugin. What we need to do is create an AbstractTokenFilterFactory (from the org.elasticsearch.index.analysis package) extension that overrides a single create method in which we create our token filter. The code of this class looks as follows:

import org.apache.lucene.analysis.TokenStream;

import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assisted.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.settings.IndexSettings;

public class CustomFilterFactory extends AbstractTokenFilterFactory {
  @Inject
  public CustomFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
  }

  @Override
  public TokenStream create(TokenStream tokenStream) {
    return new CustomFilter(tokenStream);
  }
}

As you can see, the class is very simple. We start with the constructor, which is needed, because we need to initialize the parent class. In addition to this, we have the create method, in which we create our CustomFilter class with the provided TokenStream object.

Before we go on, we would like to mention two more things: the @IndexSettings and @Assisted annotations. The first one results in the index settings being automatically injected into the constructor as a Settings object. The @Assisted annotation results in the annotated parameter's value being injected from the argument of the factory method.
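
To illustrate what the injected Settings objects can be used for, the following sketch shows a hypothetical variant of our factory that reads a filter-level option from the @Assisted Settings object; the ConfigurableFilterFactory class and its reverse option are made up for this example:

import org.apache.lucene.analysis.TokenStream;

import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assisted.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.settings.IndexSettings;

public class ConfigurableFilterFactory extends AbstractTokenFilterFactory {
  private final boolean reverse;

  @Inject
  public ConfigurableFilterFactory(Index index, @IndexSettings Settings indexSettings,
      @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
    // The @Assisted settings hold this filter's definition from the index
    // settings, so a made-up option such as "reverse" : false set there
    // would be visible here.
    this.reverse = settings.getAsBoolean("reverse", true);
  }

  @Override
  public TokenStream create(TokenStream tokenStream) {
    // Pass tokens through untouched when reversing is disabled.
    return reverse ? new CustomFilter(tokenStream) : tokenStream;
  }
}

With such a factory, a filter definition in the index settings could switch the reversing behavior on or off without any code changes.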

Implementing the custom analyzer

We wanted to keep the example implementation as simple as possible and, because of that, we've decided not to complicate the analyzer implementation. To implement our analyzer, we need to extend the abstract Analyzer class from Lucene's org.apache.lucene.analysis package, which is exactly what we did. The whole code of our CustomAnalyzer class looks as follows:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class CustomAnalyzer extends Analyzer {
  public CustomAnalyzer() {
  }

  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    // Tokenize on whitespace and wrap the tokenizer with our reversing filter.
    final Tokenizer src = new WhitespaceTokenizer(reader);
    return new TokenStreamComponents(src, new CustomFilter(src));
  }
}

Note

If you want to see more complicated analyzer implementations, please look at the source code of Apache Lucene, Apache Solr, and Elasticsearch.

The createComponents method is the one we need to implement, and it should return a TokenStreamComponents object (a class nested in Lucene's Analyzer class) for a given field name (the String type object, the first argument of the method) and data (the Reader type object, the second method argument). What we do is create a Tokenizer object using the WhitespaceTokenizer class available in Lucene. This results in the input data being tokenized on whitespace characters. Then, we create a Lucene TokenStreamComponents object, to which we give the source of tokens (our previously created Tokenizer object) and our CustomFilter object. This results in our CustomFilter object being used by CustomAnalyzer.
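
If we wanted a richer analysis chain, createComponents could stack several filters on top of the tokenizer. The following hypothetical variant (not the code used by our plugin) lower-cases tokens before reversing them; it assumes the version-less Lucene 4.10 constructors used elsewhere in this chapter:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Hypothetical variant: lower-case each token before reversing it.
public class LowercasedReversingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    final Tokenizer src = new WhitespaceTokenizer(reader);
    TokenStream chain = new LowerCaseFilter(src); // built-in Lucene filter
    chain = new CustomFilter(chain);              // our reversing filter
    return new TokenStreamComponents(src, chain);
  }
}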

Implementing the analyzer provider

Now let's look at the second provider implementation, in addition to the token filter factory we created earlier. This time, we need to extend AbstractIndexAnalyzerProvider from the org.elasticsearch.index.analysis package in order for Elasticsearch to be able to create our analyzer. The implementation is very simple, as we only need to implement the get method, in which we return our analyzer. The CustomAnalyzerProvider class code looks as follows:

import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assisted.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.settings.IndexSettings;

public class CustomAnalyzerProvider extends AbstractIndexAnalyzerProvider<CustomAnalyzer> {
  private final CustomAnalyzer analyzer;

  @Inject
  public CustomAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
    // A single, thread-safe analyzer instance that is reused for every request.
    analyzer = new CustomAnalyzer();
  }

  @Override
  public CustomAnalyzer get() {
    return this.analyzer;
  }
}

As you can see, we've implemented the constructor in order to be able to initialize the super class. In addition to that, we create a single instance of our analyzer, which we will return whenever Elasticsearch requests it. We do this because creating a new analyzer every time Elasticsearch asks for one would not be efficient. We don't need to worry about multithreading because our analyzer is thread-safe and, thus, a single instance can safely be reused. In the get method, we just return our analyzer.

Implementing the analysis binder

The binder is the part of our custom code that informs Elasticsearch about the names under which our analyzer and token filter will be available. Our CustomAnalysisBinderProcessor class extends AnalysisModule.AnalysisBinderProcessor from the org.elasticsearch.index.analysis package, and we override two methods of this class: processAnalyzers, in which we register our analyzer, and processTokenFilters, in which we register our token filter. If we had only an analyzer or only a token filter, we would only override a single method. The code of CustomAnalysisBinderProcessor looks as follows:

import org.elasticsearch.index.analysis.AnalysisModule;

public class CustomAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor {
  @Override
  public void processAnalyzers(AnalyzersBindings analyzersBindings) {
    // Makes our analyzer available under the mastering_analyzer name.
    analyzersBindings.processAnalyzer("mastering_analyzer", CustomAnalyzerProvider.class);
  }

  @Override
  public void processTokenFilters(TokenFiltersBindings tokenFiltersBindings) {
    // Makes our token filter available under the mastering_filter name.
    tokenFiltersBindings.processTokenFilter("mastering_filter", CustomFilterFactory.class);
  }
}

The first method, processAnalyzers, takes a single argument of the AnalyzersBindings type, which we can use to register our analyzer under a given name. We do this by calling the processAnalyzer method of the AnalyzersBindings object, passing in the name under which our analyzer will be available and the AbstractIndexAnalyzerProvider implementation responsible for creating our analyzer, which in our case is the CustomAnalyzerProvider class.

The second method, processTokenFilters, takes a single argument of the TokenFiltersBindings type, which enables us to register our token filter. We do this by calling the processTokenFilter method, passing the name under which our token filter will be available and the token filter factory class, which in our case is CustomFilterFactory.
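
To show what these registrations give us in practice, once the plugin is installed (which we cover later in this section), the registered filter name can be referenced from index settings just like any built-in filter. The following request is only a hypothetical illustration; the test_index index name and the reversed_lowercase analyzer name are made up for this example:

curl -XPOST 'localhost:9200/test_index/' -d '{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "reversed_lowercase" : {
            "tokenizer" : "whitespace",
            "filter" : [ "lowercase", "mastering_filter" ]
          }
        }
      }
    }
  }
}'

Fields mapped with the reversed_lowercase analyzer would then have their tokens lower-cased and reversed at index time.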

Implementing the analyzer indices component

Now, we need to implement a node-level component that will allow our analyzer and token filter to be reused. However, we will tell Elasticsearch that our analyzer should be reusable only at the indices level and not globally (just to show you how to do it). What we need to do is extend the AbstractComponent class from the org.elasticsearch.common.component package. In fact, we only need to develop a constructor for the class, which we called CustomAnalyzerIndicesComponent. The whole code for the mentioned class looks as follows:

import org.apache.lucene.analysis.TokenStream;

import org.elasticsearch.common.component.AbstractComponent;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.analysis.AnalyzerScope;
import org.elasticsearch.index.analysis.PreBuiltAnalyzerProviderFactory;
import org.elasticsearch.index.analysis.PreBuiltTokenFilterFactoryFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.indices.analysis.IndicesAnalysisService;

public class CustomAnalyzerIndicesComponent extends AbstractComponent {
  @Inject
  public CustomAnalyzerIndicesComponent(Settings settings, IndicesAnalysisService indicesAnalysisService) {
    super(settings);
    // Register the pre-built analyzer under the mastering_analyzer name.
    indicesAnalysisService.analyzerProviderFactories().put(
        "mastering_analyzer",
        new PreBuiltAnalyzerProviderFactory("mastering_analyzer", AnalyzerScope.INDICES, new CustomAnalyzer()));

    // Register the pre-built token filter under the mastering_filter name.
    indicesAnalysisService.tokenFilterFactories().put("mastering_filter",
        new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
          @Override
          public String name() {
            return "mastering_filter";
          }

          @Override
          public TokenStream create(TokenStream tokenStream) {
            return new CustomFilter(tokenStream);
          }
        }));
  }
}

First of all, we pass the constructor arguments to the super class in order to initialize it. After that, we register a new instance of our CustomAnalyzer class by using the following code snippet:

indicesAnalysisService.analyzerProviderFactories().put(
        "mastering_analyzer",
        new PreBuiltAnalyzerProviderFactory("mastering_analyzer", AnalyzerScope.INDICES, new CustomAnalyzer()));

As you can see, we've used the IndicesAnalysisService object and its analyzerProviderFactories method to get the map that holds PreBuiltAnalyzerProviderFactory objects as values and their names (String objects) as keys, and we've put a newly created PreBuiltAnalyzerProviderFactory object into it under the mastering_analyzer name. In order to create the PreBuiltAnalyzerProviderFactory object, we've used our CustomAnalyzer and the AnalyzerScope.INDICES enum value (from the org.elasticsearch.index.analysis package). The other values of the AnalyzerScope enum are GLOBAL and INDEX. If you would like the analyzer to be globally shared, you should use AnalyzerScope.GLOBAL; if it should be created separately for each index, use AnalyzerScope.INDEX.

In a similar way, we add our token filter, but this time, we use the tokenFilterFactories method of the IndicesAnalysisService object, which returns a map holding PreBuiltTokenFilterFactoryFactory objects as values and their names (String objects) as keys. We put into it a newly created PreBuiltTokenFilterFactoryFactory object, wrapping an anonymous TokenFilterFactory implementation, under the mastering_filter name.

Implementing the analyzer module

A simple class called CustomAnalyzerModule extends AbstractModule from the org.elasticsearch.common.inject package. It is used to tell Elasticsearch that our CustomAnalyzerIndicesComponent class should be used as a singleton; we do this because it's enough to have a single instance of that class. Its code looks as follows:

import org.elasticsearch.common.inject.AbstractModule;

public class CustomAnalyzerModule extends AbstractModule {
  @Override
  protected void configure() {
    // Bind our indices component as an eager singleton so that it is created
    // (and registers our analyzer and filter) as soon as the module loads.
    bind(CustomAnalyzerIndicesComponent.class).asEagerSingleton();
  }
}

As you can see, we implement a single configure method, which binds the CustomAnalyzerIndicesComponent class as an eager singleton.

Implementing the analyzer plugin

Finally, we need to implement the plugin class so that Elasticsearch knows that there is a plugin to be loaded. It should extend the AbstractPlugin class from the org.elasticsearch.plugins package and thus implement at least the name and description methods. However, we also want our plugin to register its module and analysis binder processor, and that's why we implement two additional methods, which we can see in the following code snippet:

import java.util.Collection;

import com.google.common.collect.ImmutableList;

import org.elasticsearch.common.inject.Module;
import org.elasticsearch.index.analysis.AnalysisModule;
import org.elasticsearch.plugins.AbstractPlugin;

public class CustomAnalyzerPlugin extends AbstractPlugin {
  @Override
  public Collection<Class<? extends Module>> modules() {
    return ImmutableList.<Class<? extends Module>>of(CustomAnalyzerModule.class);
  }

  // Called by Elasticsearch via reflection when the AnalysisModule is set up.
  public void onModule(AnalysisModule module) {
    module.addProcessor(new CustomAnalysisBinderProcessor());
  }

  @Override
  public String name() {
    return "AnalyzerPlugin";
  }

  @Override
  public String description() {
    return "Custom analyzer plugin";
  }
}

The name and description methods are quite obvious, as they return the name of the plugin and its description. The onModule method adds our CustomAnalysisBinderProcessor object to the AnalysisModule object provided to it.

The last method is the one we are not yet familiar with: the modules method:

public Collection<Class<? extends Module>> modules() {
  return ImmutableList.<Class<? extends Module>>of(CustomAnalyzerModule.class);
}

We override this method from the super class in order to return a collection of modules that our plugin is registering. In this case, we are registering a single module class—CustomAnalyzerModule—and we are returning a list with a single entry.

Informing Elasticsearch about our custom analyzer

Once we have our code ready, we need to add one additional thing: we need to let Elasticsearch know which class registers our plugin, the one we've called CustomAnalyzerPlugin. In order to do that, we create an es-plugin.properties file in the src/main/resources directory with the following content:

plugin=pl.solr.analyzer.CustomAnalyzerPlugin

We just specify the plugin property there, whose value should be the fully qualified name of the class we use to register our plugin (the one that extends the Elasticsearch AbstractPlugin class). This file will be included in the JAR file that is created during the build process and will be used by Elasticsearch during the plugin load process.
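
The es-plugin.properties descriptor is not limited to the plugin property; to the best of our knowledge, it can also carry a version property, which Elasticsearch reports when loading the plugin. A hypothetical descriptor using it could look as follows:

plugin=pl.solr.analyzer.CustomAnalyzerPlugin
version=1.4.1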

Testing our custom analysis plugin

Now, we want to test our custom analysis plugin just to be sure that everything works. In order to do that, we need to build our plugin, install it on all nodes in our cluster, and finally, use the Admin Indices Analyze API to see how our analyzer works. Let's do that.

Building our custom analysis plugin

We start with the easiest part: building our plugin. In order to do that, we run a simple command:

mvn compile package

We tell Maven that we want the code to be compiled and packaged. After the command finishes, we can find the archive with the plugin in the target/release directory (assuming you are using a project setup similar to the one we've described at the beginning of the chapter).

Installing the custom analysis plugin

To install the plugin, we will use the plugin command, just like we did previously. Assuming that we have our plugin archive stored in the /home/install/es/plugins directory, we would run the following command (from the Elasticsearch home directory):

bin/plugin --install analyzer --url file:/home/install/es/plugins/elasticsearch-analyzer-1.4.1.zip

We need to install the plugin on all the nodes in our cluster, because we want Elasticsearch to be able to find our analyzer and filter no matter on which node the analysis process is done. If we don't install the plugin on all nodes, we can be certain that we will run into issues.

Note

In order to learn more about installing Elasticsearch plugins, please refer to our previous book, Elasticsearch Server Second Edition, by Packt Publishing, or refer to the official Elasticsearch documentation.

After we have the plugin installed, we need to restart the Elasticsearch instance we installed it on. After the restart, we should see something like this in the logs:

[2014-12-03 22:39:11,231][INFO ][plugins                  ] [Tattletale] loaded [AnalyzerPlugin], sites []

With the preceding log line, Elasticsearch informs us that the plugin named AnalyzerPlugin was successfully loaded.

Checking whether our analysis plugin works

We can finally check whether our custom analysis plugin works as it should. In order to do that, we start by creating an empty index called analyzetest (the index name doesn't matter). We do this by running the following command:

curl -XPOST 'localhost:9200/analyzetest/'

After this, we use the Admin Indices Analyze API (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html) to see how our analyzer works. We do that by running the following command:

curl -XGET 'localhost:9200/analyzetest/_analyze?analyzer=mastering_analyzer&pretty' -d 'mastering elasticsearch'

So, what we should see in the response is two tokens, each reversed: mastering should become gniretsam, and elasticsearch should become hcraescitsale. The response Elasticsearch returns looks as follows:

{
  "tokens" : [ {
    "token" : "gniretsam",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "hcraescitsale",
    "start_offset" : 10,
    "end_offset" : 23,
    "type" : "word",
    "position" : 2
  } ]
}

As you can see, we've got exactly what we expected, so it seems that our custom analysis plugin works as intended.
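
Since mastering_filter was registered as a separate token filter, we can also exercise it on its own, without mastering_analyzer, by combining it with one of the built-in tokenizers in the same Analyze API call; the tokenizer and filters request parameters shown here are part of that API:

curl -XGET 'localhost:9200/analyzetest/_analyze?tokenizer=whitespace&filters=mastering_filter&pretty' -d 'mastering elasticsearch'

The response should again contain the two reversed tokens, which confirms that the token filter registration works independently of the analyzer registration.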
