When a field is indexed by Lucene, it undergoes a parsing and conversion process called analysis. In Chapter 3, Performing Queries, we mentioned that the default analyzer tokenizes string fields, and that this behavior should be disabled if you plan to sort on that field.
However, much more is possible during analysis. Apache Solr components may be assembled in hundreds of combinations. They can manipulate text in various ways during indexing, and open the door to some really powerful search functionality.
In order to discuss the Solr components that are available, or how to assemble them into a custom analyzer definition, we must first understand the three phases of Lucene analysis:
Analysis begins by applying zero or more character filters, which strip or replace characters prior to any other processing. The filtered string then undergoes tokenization, splitting it into smaller tokens to make keyword searches more efficient. Finally, zero or more token filters remove or replace tokens before they are saved to the index.
These components are provided by the Apache Solr project, and they number over three dozen in total. This book cannot dive deeply into every single one, but we can take a look at a few key examples of the three types and see how to apply them generally.
The full documentation for all of these Solr analyzer components may be found at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, with Javadocs at http://lucene.apache.org/solr/api-3_6_1.
When defining a custom analyzer, character filtering is an optional step. Should this step be desired, there are only three character filter types available:
- MappingCharFilterFactory: This filter replaces characters (or sequences of characters) with specifically defined replacement text. For example, you might replace occurrences of 1 with one, 2 with two, and so on. The mappings between characters and their replacement values are stored in a resource file, using the standard java.util.Properties format, located somewhere in the application's classpath. For each property, the key is the sequence to look for, and the value is the mapped replacement. The classpath-relative location of this mappings file is passed to the MappingCharFilterFactory definition, as a parameter named mapping. The exact mechanism for passing this parameter will be illustrated shortly in the Defining and Selecting Analyzers section.
- PatternReplaceCharFilterFactory: This filter applies a regular expression, passed via a parameter named pattern. Any matches are replaced with a string of static text, passed via a replacement parameter.
- HTMLStripCharFilterFactory: This extremely useful filter removes HTML tags, and replaces escape sequences with their usual text forms (for example, &gt; becomes >).

Character and token filters are both optional when defining a custom analyzer, and you may combine multiple filters of both types. However, the tokenizer component is unique. An analyzer definition must contain one, and no more than one.
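To make the MappingCharFilterFactory description concrete, here is a hypothetical mappings file in the java.util.Properties format described above. The file name and entries are purely illustrative, not taken from the sample application:

```properties
# src/main/resources/mapping-chars.properties (hypothetical example)
# Each key is the character sequence to look for;
# each value is the replacement text.
1=one
2=two
3=three
```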
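To build intuition for what a pattern-replace character filter does, here is a plain-Java sketch using java.util.regex. This stands in for the Solr component's behavior only; it is not the actual factory API, and the class and method names are illustrative:

```java
import java.util.regex.Pattern;

public class PatternReplaceSketch {

    // Mimics the effect of a pattern-replace character filter: every
    // regex match in the input is replaced with static text, before
    // the string ever reaches the tokenizer.
    static String filter(String input, String pattern, String replacement) {
        return Pattern.compile(pattern).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // A typical use case: normalizing phone-number punctuation
        // so that "555-123-4567" and "5551234567" index identically.
        System.out.println(filter("555-123-4567", "[-()\\s]", ""));
        // prints "5551234567"
    }
}
```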
There are 10 tokenizer components available in total. Some illustrative examples include:
- WhitespaceTokenizerFactory: This simply splits text on whitespace. For instance, hello world is tokenized into hello and world.
- LetterTokenizerFactory: This is similar to WhitespaceTokenizerFactory in functionality, but this tokenizer also splits text on non-letter characters. The non-letter characters are discarded altogether; for example, please don't go is tokenized into please, don, t, and go.
- StandardTokenizerFactory: This is the default tokenizer that is automatically applied when you don't define a custom analyzer. It generally splits on whitespace, discarding extraneous characters. For instance, it's 25.5 degrees outside!!! becomes it's, 25.5, degrees, and outside.

By far the greatest variety in analyzer functionality comes through token filters, with Solr offering two dozen options for use alone or in combination. These are only a few of the more useful examples:
- StopFilterFactory: This filter simply throws away "stop words": extremely common words for which no one would ever want to perform a keyword query anyway. The list includes a, the, if, for, and, or, and so on (the Solr documentation presents the full list).
- PhoneticFilterFactory: When you use a major search engine, you have probably noticed that it can be very intelligent in dealing with your typos. One technique for doing this is to look for words that sound similar to the searched keyword, in case it was misspelled. For example, if you meant to search for morning, but misspelled it as mourning, the search would still match the intended term! This token filter provides that functionality by indexing phonetically similar strings along with the actual token. The filter requires a parameter named encoder, set to the name of a supported encoding algorithm ("DoubleMetaphone" is a sensible option).
- SnowballPorterFilterFactory: Stemming is a process in which tokens are broken down to their root form, to make it easier to match related words. Snowball and Porter refer to stemming algorithms. For instance, the words developer and development can both be broken down to the root stem develop. Therefore, Lucene can recognize a relationship between the two longer words (even though neither one is a substring of the other!) and can return matches on both. This filter takes one parameter, named language (for example, "English").

An analyzer definition assembles some combination of these components into a logical whole, which can then be referenced when indexing an entity or individual field. Custom analyzers may be defined in a static manner, or may be assembled dynamically based on some conditions at runtime.
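As a rough plain-Java illustration of the tokenization and token-filtering stages described above (standing in for the Lucene/Solr components, which do considerably more; all names here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AnalysisSketch {

    // Splits on whitespace only, as WhitespaceTokenizerFactory would.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // Splits on any non-letter character and discards it,
    // as LetterTokenizerFactory would.
    static List<String> letterTokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}]+"))
                     .filter(token -> !token.isEmpty())
                     .collect(Collectors.toList());
    }

    // Drops extremely common words, as StopFilterFactory would.
    // (A tiny stop list for illustration; the real list is longer.)
    static List<String> removeStopWords(List<String> tokens) {
        Set<String> stopWords = Set.of("a", "the", "if", "for", "and", "or");
        return tokens.stream()
                     .filter(token -> !stopWords.contains(token.toLowerCase()))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(letterTokenize("please don't go"));
        // prints [please, don, t, go]
        System.out.println(removeStopWords(whitespaceTokenize("the quick fox")));
        // prints [quick, fox]
    }
}
```

Chaining these steps in order (character filters, then the tokenizer, then token filters) mirrors the three-phase pipeline that a real analyzer definition assembles.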
Either approach for defining a custom analyzer begins with an @AnalyzerDef annotation on the relevant persistent class. In the chapter4 version of our VAPORware Marketplace application, let's define a custom analyzer to be used with the App entity's description field. It should strip out any HTML tags, and apply various token filters to reduce clutter and account for typos:
...
@AnalyzerDef(
   name="appAnalyzer",
   charFilters={
      @CharFilterDef(factory=HTMLStripCharFilterFactory.class)
   },
   tokenizer=@TokenizerDef(factory=StandardTokenizerFactory.class),
   filters={
      @TokenFilterDef(factory=StandardFilterFactory.class),
      @TokenFilterDef(factory=StopFilterFactory.class),
      @TokenFilterDef(factory=PhoneticFilterFactory.class, params={
         @Parameter(name="encoder", value="DoubleMetaphone")
      }),
      @TokenFilterDef(factory=SnowballPorterFilterFactory.class, params={
         @Parameter(name="language", value="English")
      })
   }
)
...
The @AnalyzerDef annotation must have a name element set, and as previously discussed, an analyzer must always include one and only one tokenizer. The charFilters and filters elements are optional. If set, they receive lists of one or more factory classes, for character filters and token filters respectively.
The @Analyzer annotation is used to select and apply a custom analyzer. This annotation may be placed on an individual field, or on the overall class where it will affect every field. In this case, we are only selecting our analyzer definition for the description field:
...
@Column(length = 1000)
@Field
@Analyzer(definition="appAnalyzer")
private String description;
...
It is possible to define multiple analyzers in a single class, by wrapping their @AnalyzerDef annotations within a plural @AnalyzerDefs:
...
@AnalyzerDefs({
@AnalyzerDef(name="stripHTMLAnalyzer", ...),
@AnalyzerDef(name="applyRegexAnalyzer", ...)
})
...
Obviously, where the @Analyzer annotation is later applied, its definition element has to match the appropriate @AnalyzerDef annotation's name element.
The chapter4 version of the VAPORware Marketplace application now strips HTML from the customer reviews. If a search includes the keyword span, there will not be a false positive match on reviews containing the <span> tag, for instance.
Snowball and phonetic filters are being applied to the app descriptions. The keyword mourning finds a match containing the word morning, and a search for development returns an app with developers in its description.
It is possible to wait until runtime to select a particular analyzer for a field. The most obvious scenario is an application supporting different languages, with analyzer definitions configured for each language. You would want to select the appropriate analyzer based on a language attribute for each object.
To support such a dynamic selection, an @AnalyzerDiscriminator annotation is added to a particular field or to the class as a whole. This code snippet uses the latter approach:
@AnalyzerDefs({
@AnalyzerDef(name="englishAnalyzer", ...),
@AnalyzerDef(name="frenchAnalyzer", ...)
})
@AnalyzerDiscriminator(impl=CustomerReviewDiscriminator.class)
public class CustomerReview {
...
@Field
private String language;
...
}
There are two analyzer definitions, one for English and the other for French, and the class CustomerReviewDiscriminator is declared responsible for deciding which to use. This class must implement the Discriminator interface, and its getAnalyzerDefinitionName method:
public class CustomerReviewDiscriminator implements Discriminator {
   public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
      if( entity == null || !(entity instanceof CustomerReview) ) {
         return null;
      }
      CustomerReview review = (CustomerReview) entity;
      if(review.getLanguage() == null) {
         return null;
      } else if(review.getLanguage().equals("en")) {
         return "englishAnalyzer";
      } else if(review.getLanguage().equals("fr")) {
         return "frenchAnalyzer";
      } else {
         return null;
      }
   }
}
If the @AnalyzerDiscriminator annotation is placed on a field, then that field's value for the current object is automatically passed as the first parameter to getAnalyzerDefinitionName. If the annotation is placed on the class itself, then a null value is passed instead. Either way, the second parameter is the current entity object.
In this case, the discriminator is applied at the class level, so we cast the second parameter to type CustomerReview, and return the name of the appropriate analyzer based on the object's language field. If the language is unknown, or if there are other issues, then the method simply returns null, telling Hibernate Search to fall back to the default analyzer.