The last thing we want to discuss when it comes to custom Elasticsearch plugins is extending the analysis process. We've chosen to show how to develop a custom analysis plugin because this is sometimes very useful, for example, when you want to introduce a custom analysis process used in your company, or when you want to use a Lucene analyzer or filter that is not available in Elasticsearch itself or as an existing plugin. Because creating an analysis extension is more complicated compared to what we've seen when developing a custom REST action, we decided to leave it until the end of the chapter.
Because developing a custom analysis plugin is the most complicated, at least from the Elasticsearch point of view and in terms of the number of classes involved, we will have more to do compared to the previous examples. We will need to develop the following things:
- A TokenFilter class extension (from the org.apache.lucene.analysis package) that will be responsible for handling token reversing; we will call it CustomFilter
- An AbstractTokenFilterFactory extension (from the org.elasticsearch.index.analysis package) that will be responsible for providing our CustomFilter instance to Elasticsearch; we will call it CustomFilterFactory
- A class that will extend the org.apache.lucene.analysis.Analyzer class and provide the Lucene analyzer functionality; we will call it CustomAnalyzer
- CustomAnalyzerProvider, which extends AbstractIndexAnalyzerProvider from the org.elasticsearch.index.analysis package and which will be responsible for providing the analyzer instance to Elasticsearch
- An AnalysisModule.AnalysisBinderProcessor extension (from the org.elasticsearch.index.analysis package) that will hold the names under which our analyzer and token filter will be available in Elasticsearch; we will call it CustomAnalysisBinderProcessor
- An extension of the AbstractComponent class (from the org.elasticsearch.common.component package) that will inform Elasticsearch which factories should be used for our custom analyzer and token filter; we will call it CustomAnalyzerIndicesComponent
- An AbstractModule extension (from the org.elasticsearch.common.inject package) that will inform Elasticsearch that our CustomAnalyzerIndicesComponent should be bound as a singleton; we will call it CustomAnalyzerModule
- An AbstractPlugin extension (from the org.elasticsearch.plugins package) that will register our plugin; we will call it CustomAnalyzerPlugin
So let's start discussing the code.
The interesting thing about the discussed plugin is that the whole analysis work is actually done at the Lucene level, and what we need to do is write an org.apache.lucene.analysis.TokenFilter extension, which we will call CustomFilter. In order to do this, we need to initialize the super class and override the incrementToken method. Our class will be responsible for reversing the tokens, so that's the logic we want our analyzer and filter to have. The whole implementation of our CustomFilter class looks as follows:
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomFilter extends TokenFilter {
  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

  protected CustomFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      int length = termAttr.length();
      if (length > 0) {
        // Only the first length characters of the buffer belong to the
        // current token; the rest may hold leftovers from longer tokens.
        StringBuilder builder = new StringBuilder(
            new String(termAttr.buffer(), 0, length)).reverse();
        termAttr.setEmpty();
        termAttr.append(builder.toString());
      }
      return true;
    } else {
      return false;
    }
  }
}
The first thing we see in the implementation is the following line:
private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
It allows us to retrieve the text of the token we are currently processing. In order to get access to the other token information, we need to use other attributes. The list of attributes can be found by looking at the classes implementing Lucene's org.apache.lucene.util.Attribute interface (http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/util/Attribute.html). What you need to know now is that by using the addAttribute method, we can bind different attributes and use them during token processing.
Then, we have the constructor, which is only used for super class initialization, so we can skip discussing it.
Finally, there is the incrementToken method, which returns true when there is a token left in the token stream to be processed, and false when there is no token left. So, what we do first is check whether there is a token to be processed by calling the incrementToken method of input, which is the TokenStream instance stored in the super class. Then, we get the term text by calling the buffer method of the attribute we bound in the first line of our class. If there is text in the term (its length is higher than zero), we use a StringBuilder object to reverse the text, clear the term buffer (by calling setEmpty on the attribute), and append the reversed text to the already emptied term buffer (by calling the append method of the attribute). After this, we return true, because our token is ready to be processed further; on a token filter level, we don't know whether the token will be processed further or not, so we need to be sure we return the correct information, just in case.
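The core of this logic can be illustrated without Lucene at all. The following is a minimal plain-Java sketch (the class and method names are ours, not part of the plugin) showing the same operation: take the valid portion of a character buffer and reverse it with StringBuilder, just as incrementToken does with the term attribute's buffer.

```java
public class ReverseSketch {
    // Reverse the first `length` characters of the buffer, mirroring
    // how CustomFilter treats CharTermAttribute's buffer and length.
    static String reverseToken(char[] buffer, int length) {
        if (length == 0) {
            return "";
        }
        return new StringBuilder(new String(buffer, 0, length)).reverse().toString();
    }

    public static void main(String[] args) {
        // The buffer is larger than the token; only the first 9 chars count.
        char[] buffer = {'m', 'a', 's', 't', 'e', 'r', 'i', 'n', 'g', 'x', 'x'};
        System.out.println(reverseToken(buffer, 9)); // prints "gniretsam"
    }
}
```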
The factory for our token filter implementation is one of the simplest classes in the case of the discussed plugin. What we need to do is create an AbstractTokenFilterFactory (from the org.elasticsearch.index.analysis package) extension that overrides a single create method, in which we create our token filter. The code of this class looks as follows:
public class CustomFilterFactory extends AbstractTokenFilterFactory {
  @Inject
  public CustomFilterFactory(Index index, @IndexSettings Settings indexSettings,
      @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
  }

  @Override
  public TokenStream create(TokenStream tokenStream) {
    return new CustomFilter(tokenStream);
  }
}
As you can see, the class is very simple. We start with the constructor, which is needed to initialize the parent class. In addition to this, we have the create method, in which we create our CustomFilter instance with the provided TokenStream object.
Before we go on, we would like to mention two more things: the @IndexSettings and @Assisted annotations. The first one results in the index settings being injected into the constructor as a Settings class object; of course, this is done automatically. The @Assisted annotation results in the annotated parameter value being injected from the argument of the factory method.
We wanted to keep the example implementation as simple as possible and, because of that, we've decided not to complicate the analyzer implementation. To implement our analyzer, we need to extend the abstract Analyzer class from Lucene's org.apache.lucene.analysis package, and that's what we did. The whole code of our CustomAnalyzer class looks as follows:
public class CustomAnalyzer extends Analyzer {

  public CustomAnalyzer() {
  }

  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    final Tokenizer src = new WhitespaceTokenizer(reader);
    return new TokenStreamComponents(src, new CustomFilter(src));
  }
}
The createComponents method is the one we need to implement, and it should return a TokenStreamComponents object (from the org.apache.lucene.analysis package) for a given field name (the String type object, the first argument of the method) and data (the Reader type object, the second method argument). What we do is create a Tokenizer using the WhitespaceTokenizer class available in Lucene. This results in the input data being tokenized on whitespace characters. Then, we create a Lucene TokenStreamComponents object, to which we give the source of tokens (our previously created Tokenizer) and our CustomFilter. This results in our CustomFilter being used by CustomAnalyzer.
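Conceptually, the analyzer just chains whitespace tokenization with the reversing filter. The following plain-Java sketch (our own illustration, with no Lucene classes involved) models that pipeline:

```java
import java.util.ArrayList;
import java.util.List;

public class AnalyzerSketch {
    // Model of CustomAnalyzer's pipeline: split the input on whitespace
    // (like WhitespaceTokenizer), then reverse each token (like CustomFilter).
    static List<String> analyze(String input) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(new StringBuilder(token).reverse().toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("mastering elasticsearch"));
        // prints [gniretsam, hcraescitsale]
    }
}
```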
In addition to the token filter factory we created earlier, we also need a provider implementation for our analyzer. This time, we need to extend AbstractIndexAnalyzerProvider from the org.elasticsearch.index.analysis package in order for Elasticsearch to be able to create our analyzer. The implementation is very simple, as we only need to implement the get method, in which we should return our analyzer. The CustomAnalyzerProvider class code looks as follows:
public class CustomAnalyzerProvider extends AbstractIndexAnalyzerProvider<CustomAnalyzer> {
  private final CustomAnalyzer analyzer;

  @Inject
  public CustomAnalyzerProvider(Index index, @IndexSettings Settings indexSettings,
      Environment env, @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
    analyzer = new CustomAnalyzer();
  }

  @Override
  public CustomAnalyzer get() {
    return this.analyzer;
  }
}
As you can see, we've implemented the constructor in order to be able to initialize the super class. In addition to that, we are creating a single instance of our analyzer, which we will return when Elasticsearch requests it. We do this because we don't want to create an analyzer every time Elasticsearch requests it; this is not efficient. We don't need to worry about multithreading, because our analyzer is thread-safe and, thus, a single instance can be reused. In the get method, we are just returning our analyzer.
The binder is a part of our custom code that informs Elasticsearch about the names under which our analyzer and token filter will be available. Our CustomAnalysisBinderProcessor class extends AnalysisModule.AnalysisBinderProcessor from org.elasticsearch.index.analysis, and we override two methods of this class: processAnalyzers, in which we will register our analyzer, and processTokenFilters, in which we will register our token filter. If we had only an analyzer or only a token filter, we would override only a single method. The code of CustomAnalysisBinderProcessor looks as follows:
public class CustomAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor {
  @Override
  public void processAnalyzers(AnalyzersBindings analyzersBindings) {
    analyzersBindings.processAnalyzer("mastering_analyzer", CustomAnalyzerProvider.class);
  }

  @Override
  public void processTokenFilters(TokenFiltersBindings tokenFiltersBindings) {
    tokenFiltersBindings.processTokenFilter("mastering_filter", CustomFilterFactory.class);
  }
}
The first method, processAnalyzers, takes a single AnalyzersBindings object, which we can use to register our analyzer under a given name. We do this by calling the processAnalyzer method of the AnalyzersBindings object and passing in the name under which our analyzer will be available along with the implementation of AbstractIndexAnalyzerProvider that is responsible for creating our analyzer, which in our case is the CustomAnalyzerProvider class.
The second method, processTokenFilters, again takes a single TokenFiltersBindings object, which enables us to register our token filter. We do this by calling the processTokenFilter method and passing the name under which our token filter will be available and the token filter factory class, which in our case is CustomFilterFactory.
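In essence, the binder maintains name-to-factory mappings. The following simplified model (entirely our own illustration; the real bindings hold Elasticsearch factory classes, not string functions) shows the idea of registering a component under a name and looking it up later:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class BinderSketch {
    // Hypothetical stand-in for TokenFiltersBindings: maps a filter name
    // to a factory; here, a simple String -> String transformation
    // represents the real TokenStream-based filter.
    private final Map<String, Function<String, String>> tokenFilters = new HashMap<>();

    void processTokenFilter(String name, Function<String, String> factory) {
        tokenFilters.put(name, factory);
    }

    String apply(String filterName, String token) {
        return tokenFilters.get(filterName).apply(token);
    }

    public static void main(String[] args) {
        BinderSketch bindings = new BinderSketch();
        // Register the reversing behavior under the name used in the chapter.
        bindings.processTokenFilter("mastering_filter",
            token -> new StringBuilder(token).reverse().toString());
        System.out.println(bindings.apply("mastering_filter", "mastering"));
        // prints gniretsam
    }
}
```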
Now, we need to implement a node-level component that will allow our analyzer and token filter to be reused. However, we will tell Elasticsearch that our analyzer should be reusable only on the indices level and not globally (just to show you how to do it). What we need to do is extend the AbstractComponent class from the org.elasticsearch.common.component package. In fact, we only need to develop a constructor for the class, which we called CustomAnalyzerIndicesComponent. The whole code for the mentioned class looks as follows:
public class CustomAnalyzerIndicesComponent extends AbstractComponent {
  @Inject
  public CustomAnalyzerIndicesComponent(Settings settings,
      IndicesAnalysisService indicesAnalysisService) {
    super(settings);
    indicesAnalysisService.analyzerProviderFactories().put(
        "mastering_analyzer",
        new PreBuiltAnalyzerProviderFactory("mastering_analyzer",
            AnalyzerScope.INDICES, new CustomAnalyzer()));
    indicesAnalysisService.tokenFilterFactories().put("mastering_filter",
        new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
          @Override
          public String name() {
            return "mastering_filter";
          }

          @Override
          public TokenStream create(TokenStream tokenStream) {
            return new CustomFilter(tokenStream);
          }
        }));
  }
}
First of all, we pass the constructor arguments to the super class in order to initialize it. After that, we register a new analyzer, which is our CustomAnalyzer class, by using the following code snippet:
indicesAnalysisService.analyzerProviderFactories().put(
    "mastering_analyzer",
    new PreBuiltAnalyzerProviderFactory("mastering_analyzer",
        AnalyzerScope.INDICES, new CustomAnalyzer()));
As you can see, we've used the IndicesAnalysisService object and its analyzerProviderFactories method to get the map of PreBuiltAnalyzerProviderFactory objects (with the factory as the value and the name as the key), and we've put a newly created PreBuiltAnalyzerProviderFactory object there under the name mastering_analyzer. In order to create the PreBuiltAnalyzerProviderFactory object, we've used our CustomAnalyzer and the AnalyzerScope.INDICES enum value (from the org.elasticsearch.index.analysis package). The other values of the AnalyzerScope enum are GLOBAL and INDEX. If you would like the analyzer to be globally shared, you should use AnalyzerScope.GLOBAL; with AnalyzerScope.INDEX, an analyzer instance is created for each index separately.
In a similar way, we add our token filter, but this time, we use the tokenFilterFactories method of the IndicesAnalysisService object, which returns a Map with PreBuiltTokenFilterFactoryFactory as the value and the name (a String object) as the key. We put a newly created TokenFilterFactory object there under the name mastering_filter.
A simple class called CustomAnalyzerModule extends AbstractModule from the org.elasticsearch.common.inject package. It is used to tell Elasticsearch that our CustomAnalyzerIndicesComponent class should be used as a singleton; we do this because it's enough to have a single instance of that class. Its code looks as follows:
public class CustomAnalyzerModule extends AbstractModule {
  @Override
  protected void configure() {
    bind(CustomAnalyzerIndicesComponent.class).asEagerSingleton();
  }
}
As you can see, we implement a single configure method, which tells Elasticsearch to bind the CustomAnalyzerIndicesComponent class as an eager singleton.
Finally, we need to implement the plugin class so that Elasticsearch knows there is a plugin to be loaded. It should extend the AbstractPlugin class from the org.elasticsearch.plugins package and thus implement at least the name and description methods. However, we want our plugin to register its functionality, and that's why we implement two additional methods, which we can see in the following code snippet:
public class CustomAnalyzerPlugin extends AbstractPlugin {
  @Override
  public Collection<Class<? extends Module>> modules() {
    return ImmutableList.<Class<? extends Module>>of(CustomAnalyzerModule.class);
  }

  public void onModule(AnalysisModule module) {
    module.addProcessor(new CustomAnalysisBinderProcessor());
  }

  @Override
  public String name() {
    return "AnalyzerPlugin";
  }

  @Override
  public String description() {
    return "Custom analyzer plugin";
  }
}
The name and description methods are quite obvious, as they return the name of the plugin and its description. The onModule method adds our CustomAnalysisBinderProcessor object to the AnalysisModule object provided to it. The last method is the one we are not yet familiar with: the modules method:
public Collection<Class<? extends Module>> modules() {
  return ImmutableList.<Class<? extends Module>>of(CustomAnalyzerModule.class);
}
We override this method from the super class in order to return a collection of modules that our plugin registers. In this case, we are registering a single module class, CustomAnalyzerModule, and we return a list with that single entry.
Once we have our code ready, we need to add one additional thing: we need to let Elasticsearch know which class registers our plugin, the one we've called CustomAnalyzerPlugin. In order to do that, we create an es-plugin.properties file in the src/main/resources directory with the following content:
plugin=pl.solr.analyzer.CustomAnalyzerPlugin
We just specify the plugin property there, whose value should be the class we use to register our plugin (the one that extends the Elasticsearch AbstractPlugin class). This file will be included in the JAR file created during the build process and will be used by Elasticsearch during the plugin load process.
Now, we want to test our custom analysis plugin just to be sure that everything works. In order to do that, we need to build our plugin, install it on all nodes in our cluster, and finally, use the Admin Indices Analyze API to see how our analyzer works. Let's do that.
We start with the easiest part: building our plugin. In order to do that, we run a simple command:
mvn compile package
We tell Maven that we want the code to be compiled and packaged. After the command finishes, we can find the archive with the plugin in the target/release directory (assuming you are using a project setup similar to the one we've described at the beginning of the chapter).
To install the plugin, we will use the plugin command, just like we did previously. Assuming that we have our plugin archive stored in the /home/install/es/plugins directory, we would run the following command (from the Elasticsearch home directory):
bin/plugin --install analyzer --url file:/home/install/es/plugins/elasticsearch-analyzer-1.4.1.zip
We need to install the plugin on all the nodes in our cluster, because we want Elasticsearch to be able to find our analyzer and filter no matter on which node the analysis process is done. If we don't install the plugin on all nodes, we can be certain that we will run into issues.
After we have the plugin installed, we need to restart the Elasticsearch instance on which we performed the installation. After the restart, we should see something like this in the logs:
[2014-12-03 22:39:11,231][INFO ][plugins ] [Tattletale] loaded [AnalyzerPlugin], sites []
With the preceding log line, Elasticsearch informs us that the plugin named AnalyzerPlugin was successfully loaded.
We can finally check whether our custom analysis plugin works as it should. In order to do that, we start by creating an empty index called analyzetest (the index name doesn't matter). We do this by running the following command:
curl -XPOST 'localhost:9200/analyzetest/'
After this we use the Admin Indices Analyze API (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html) to see how our analyzer works. We do that by running the following command:
curl -XGET 'localhost:9200/analyzetest/_analyze?analyzer=mastering_analyzer&pretty' -d 'mastering elasticsearch'
So, what we should see in the response is two reversed tokens: mastering reversed to gniretsam, and elasticsearch reversed to hcraescitsale. The response Elasticsearch returns looks as follows:
{
  "tokens" : [ {
    "token" : "gniretsam",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "hcraescitsale",
    "start_offset" : 10,
    "end_offset" : 23,
    "type" : "word",
    "position" : 2
  } ]
}
As you can see, we've got exactly what we expected, so it seems that our custom analysis plugin works as intended.
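As a quick sanity check of the start_offset and end_offset values in the response, a few lines of plain Java (our own verification snippet, independent of Elasticsearch) reproduce them by walking the input on whitespace boundaries:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheck {
    // Returns "reversedToken start-end" for each space-separated token,
    // where the offsets refer to positions in the original input.
    static List<String> tokensWithOffsets(String input) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (String token : input.split(" ")) {
            int end = start + token.length();
            String reversed = new StringBuilder(token).reverse().toString();
            result.add(reversed + " " + start + "-" + end);
            start = end + 1; // skip the single space separator
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : tokensWithOffsets("mastering elasticsearch")) {
            System.out.println(line);
        }
        // prints:
        // gniretsam 0-9
        // hcraescitsale 10-23
    }
}
```

The computed offsets, 0-9 and 10-23, match those returned by the Analyze API above.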