Analysis

When a field is indexed by Lucene, it undergoes a parsing and conversion process called analysis. In Chapter 3, Performing Queries, we mentioned that the default analyzer tokenizes string fields, and that this behavior should be disabled if you plan to sort on that field.

However, much more is possible during analysis. Apache Solr components may be assembled in hundreds of combinations. They can manipulate text in various ways during indexing, and open the door to some really powerful search functionality.

In order to discuss the Solr components that are available, or how to assemble them into a custom analyzer definition, we must first understand the three phases of Lucene analysis:

  • Character filtering
  • Tokenization
  • Token filtering

Analysis begins by applying zero or more character filters, which strip or replace characters prior to any other processing. The filtered string then undergoes tokenization, splitting it into smaller tokens to make keyword searches more efficient. Finally, zero or more token filters remove or replace tokens before they are saved to the index.
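As a rough illustration of how these three phases chain together, here is a plain-Java sketch. It is not the actual Lucene/Solr implementation (those components are stream-based and far more sophisticated); the tag-stripping regex, whitespace split, and tiny stop-word list are simplified stand-ins for the real filters:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the three analysis phases; the real Lucene/Solr
// components are streaming and far more sophisticated.
public class AnalysisSketch {

    // Phase 1: character filtering, e.g. stripping HTML tags
    static String charFilter(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    // Phase 2: tokenization, e.g. splitting on whitespace
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Phase 3: token filtering, e.g. lowercasing and dropping stop words
    static List<String> tokenFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase();
            if (!lower.equals("the") && !lower.equals("a")) out.add(lower);
        }
        return out;
    }

    public static List<String> analyze(String input) {
        return tokenFilter(tokenize(charFilter(input)));
    }

    public static void main(String[] args) {
        // "<p>The Quick Fox</p>" ends up indexed as the tokens "quick" and "fox"
        System.out.println(analyze("<p>The Quick Fox</p>"));
    }
}
```

The point is simply the order of operations: character filters see the raw string, the tokenizer sees the filtered string, and token filters see individual tokens.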

Note

These components are provided by the Apache Solr project, and they number over three dozen in total. This book cannot dive deeply into every single one, but we can take a look at a few key examples of the three types and see how to apply them generally.

The full documentation for all of these Solr analyzer components may be found at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, with Javadocs at http://lucene.apache.org/solr/api-3_6_1.

Character filtering

When defining a custom analyzer, character filtering is an optional step. Should this step be desired, there are only three character filter types available:

  • MappingCharFilterFactory: This filter replaces characters (or sequences of characters) with specifically defined replacement text, for example, you might replace occurrences of 1 with one, 2 with two, and so on.

    The mappings between character(s) and replacement value(s) are stored in a resource file, located somewhere in the application's classpath. Each line of the file maps a source sequence to its replacement text.

    The classpath-relative location of this mappings file is passed to the MappingCharFilterFactory definition, as a parameter named mapping. The exact mechanism for passing this parameter will be illustrated shortly in the Defining and selecting analyzers section.

  • PatternReplaceCharFilterFactory: This filter applies a regular expression, passed via a parameter named pattern. Any matches are replaced with a string of static text passed via a replacement parameter.
  • HTMLStripCharFilterFactory: This extremely useful filter removes HTML tags, and replaces escape sequences with their usual text forms (for example, &gt; becomes >).
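The behavior of each filter type can be approximated in plain Java. This sketch is only an illustration of what each filter does to raw text; it is not the actual Solr implementation, and the real components handle many more edge cases (the HTML stripper below, for instance, only decodes a single entity):

```java
import java.util.Map;

// Illustrative approximations of the three character filter behaviors;
// the actual Solr implementations are far more thorough.
public class CharFilterSketch {

    // MappingCharFilterFactory-style: replace mapped sequences with their targets
    static String mapChars(String input, Map<String, String> mappings) {
        String out = input;
        for (Map.Entry<String, String> e : mappings.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // PatternReplaceCharFilterFactory-style: regex "pattern" plus static "replacement"
    static String patternReplace(String input, String pattern, String replacement) {
        return input.replaceAll(pattern, replacement);
    }

    // HTMLStripCharFilterFactory-style: drop tags, decode a common entity
    static String stripHtml(String input) {
        return input.replaceAll("<[^>]*>", "").replace("&gt;", ">");
    }

    public static void main(String[] args) {
        System.out.println(mapChars("1 apple, 2 pears", Map.of("1", "one", "2", "two")));
        System.out.println(patternReplace("a   b", "\\s+", " "));
        System.out.println(stripHtml("<b>x &gt; y</b>"));
    }
}
```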

Tokenization

Character and token filters are both optional when defining a custom analyzer, and you may combine multiple filters of both types. However, the tokenizer component is unique: an analyzer definition must contain exactly one.

There are 10 tokenizer components available in total. Some illustrative examples include:

  • WhitespaceTokenizerFactory: This simply splits text on whitespace. For instance, hello world is tokenized into hello and world.
  • LetterTokenizerFactory: This is similar to WhitespaceTokenizerFactory in functionality, but this tokenizer also splits text on non-letter characters. The non-letter characters are discarded altogether, for example, please don't go is tokenized into please, don, t, and go.
  • StandardTokenizerFactory: This is the default tokenizer that is automatically applied when you don't define a custom analyzer. It generally splits on whitespace, discarding extraneous characters. For instance, it's 25.5 degrees outside!!! becomes it's, 25.5, degrees, and outside.
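The first two behaviors can be approximated with ordinary string splitting. Again, this is only a plain-Java sketch, not the stream-based Solr tokenizers, which handle many more edge cases:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Rough plain-Java approximations of two tokenizer behaviors.
public class TokenizerSketch {

    // WhitespaceTokenizerFactory-style: split on whitespace only
    static List<String> whitespace(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // LetterTokenizerFactory-style: split on (and discard) non-letter characters
    static List<String> letters(String input) {
        return Arrays.stream(input.split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(whitespace("hello world"));    // hello, world
        System.out.println(letters("please don't go"));   // please, don, t, go
    }
}
```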

Tip

When in doubt, StandardTokenizerFactory is almost always the sensible choice.

Token filtering

By far the greatest variety in analyzer functionality comes through token filters, with Solr offering two dozen options for use alone or in combination. These are only a few of the more useful examples:

  • StopFilterFactory: This filter simply throws away "stop words", or extremely common words for which no one would ever want to perform a keyword query anyway. The list includes a, the, if, for, and, or, and so on (the Solr documentation presents the full list).
  • PhoneticFilterFactory: When you use a major search engine, you have probably noticed that it can be very intelligent in dealing with typos. One technique for doing this is to look for words that sound similar to the searched keyword, in case it was misspelled. For example, if you meant to search for morning, but misspelled it as mourning, the search would still match the intended term! This token filter provides that functionality by indexing phonetically similar strings along with the actual token. The filter requires a parameter named encoder, set to the name of a supported encoding algorithm ("DoubleMetaphone" is a sensible option).
  • SnowballPorterFilterFactory: Stemming is a process in which tokens are broken down into their root form, to make it easier to match related words. Snowball and Porter refer to stemming algorithms. For instance, the words developer and development can both be broken down to the root stem develop. Therefore, Lucene can recognize a relationship between the two longer words (even though neither one is a substring of the other!) and can return matches on both. This filter takes one parameter, named language (for example, "English").
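Stop-word removal and stemming can both be sketched in a few lines of plain Java. The suffix-stripping "stemmer" below is deliberately naive, just to show the idea of reducing developer and development to a shared root; the real Snowball/Porter algorithms are far more nuanced, and the stop-word list here is only a fragment:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Naive sketches of stop-word and stemming token filters; illustration only.
public class TokenFilterSketch {

    // A fragment of a typical stop-word list
    static final Set<String> STOP_WORDS = Set.of("a", "the", "if", "for", "and", "or");

    // StopFilterFactory-style: discard extremely common words
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    // Crude suffix stripping; real Snowball/Porter stemming is far more nuanced
    static String stem(String token) {
        for (String suffix : List.of("ment", "er", "ing")) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(List.of("the", "developer", "and", "development")));
        System.out.println(stem("developer"));    // develop
        System.out.println(stem("development"));  // develop
    }
}
```

Because both words stem to develop, a keyword search for either one can match documents containing the other.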

Defining and selecting analyzers

An analyzer definition assembles some combination of these components into a logical whole, which can then be referenced when indexing an entity or individual field. Custom analyzers may be defined in a static manner, or may be assembled dynamically based on some conditions at runtime.

Static analyzer selection

Either approach for defining a custom analyzer begins with an @AnalyzerDef annotation on the relevant persistent class. In the chapter4 version of our VAPORware Marketplace application, let's define a custom analyzer to be used with the App entity's description field. It should strip out any HTML tags, and apply various token filters to reduce clutter and account for typos:

...
@AnalyzerDef(
   name="appAnalyzer",
   charFilters={
      @CharFilterDef(factory=HTMLStripCharFilterFactory.class)
   },
   tokenizer=@TokenizerDef(factory=StandardTokenizerFactory.class),
   filters={
      @TokenFilterDef(factory=StandardFilterFactory.class),
      @TokenFilterDef(factory=StopFilterFactory.class),
      @TokenFilterDef(factory=PhoneticFilterFactory.class,
         params={
            @Parameter(name="encoder", value="DoubleMetaphone")
         }),
      @TokenFilterDef(factory=SnowballPorterFilterFactory.class,
         params={
            @Parameter(name="language", value="English")
         })
   }
)
...

The @AnalyzerDef annotation must have a name element set, and as previously discussed, an analyzer must always include one and only one tokenizer.

The charFilters and filters elements are optional. If set, they receive lists of one or more factory classes, for character filters and token filters respectively.

Tip

Be aware that character filters and token filters are applied in the order they are listed. In some cases, changes to the ordering can affect the final result.
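A contrived plain-Java example makes the ordering issue concrete. Suppose one filter lowercases tokens and another removes stop words from a lowercase list (these are stand-ins, not actual Solr classes): running stop-word removal first lets a capitalized stop word slip through.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Contrived illustration of filter ordering: stop-word removal before
// lowercasing misses the capitalized "The".
public class FilterOrderSketch {

    static final Set<String> STOP_WORDS = Set.of("the");

    static List<String> removeStops(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    static List<String> lowercase(List<String> tokens) {
        return tokens.stream()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = List.of("The", "Fox");
        System.out.println(lowercase(removeStops(input)));  // [the, fox]
        System.out.println(removeStops(lowercase(input)));  // [fox]
    }
}
```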

The @Analyzer annotation is used to select and apply a custom analyzer. This annotation may be placed on an individual field, or on the overall class where it will affect every field. In this case, we are only selecting our analyzer definition for the description field:

...
@Column(length = 1000)
@Field
@Analyzer(definition="appAnalyzer")
private String description;
...

It is possible to define multiple analyzers in a single class, by wrapping their @AnalyzerDef annotations within a plural @AnalyzerDefs:

...
@AnalyzerDefs({
   @AnalyzerDef(name="stripHTMLAnalyzer", ...),
   @AnalyzerDef(name="applyRegexAnalyzer", ...)
})
...

Obviously, where the @Analyzer annotation is later applied, its definition element has to match the appropriate @AnalyzerDef annotation's name element.

Note

The chapter4 version of the VAPORware Marketplace application now strips HTML from the customer reviews. If a search includes the keyword span, there will not be a false positive match on reviews containing the <span> tag, for instance.

Snowball and phonetic filters are being applied to the app descriptions. The keyword mourning finds a match containing the word morning, and a search for development returns an app with developers in its description.

Dynamic analyzer selection

It is possible to wait until runtime to select a particular analyzer for a field. The most obvious scenario is an application supporting different languages, with analyzer definitions configured for each language. You would want to select the appropriate analyzer based on a language attribute for each object.

To support such a dynamic selection, an @AnalyzerDiscriminator annotation is added to a particular field or to the class as a whole. This code snippet uses the latter approach:

@AnalyzerDefs({
   @AnalyzerDef(name="englishAnalyzer", ...),
   @AnalyzerDef(name="frenchAnalyzer", ...)
})
@AnalyzerDiscriminator(impl=CustomerReviewDiscriminator.class)
public class CustomerReview {
   ...
   @Field
   private String language;
   ...
}

There are two analyzer definitions, one for English and the other for French, and the class CustomerReviewDiscriminator is declared responsible for deciding which to use. This class must implement the Discriminator interface, and its getAnalyzerDefinitionName method:

public class CustomerReviewDiscriminator implements Discriminator {

   public String getAnalyzerDefinitionName(Object value, 
         Object entity, String field) {
      if( entity == null || !(entity instanceof CustomerReview) ) {
         return null;
      }
      CustomerReview review = (CustomerReview) entity;
      if(review.getLanguage() == null) {
         return null;
      } else if(review.getLanguage().equals("en")) {
         return "englishAnalyzer";
      } else if(review.getLanguage().equals("fr")) {
         return "frenchAnalyzer";
      } else {
         return null;
      }
   }

}

If the @AnalyzerDiscriminator annotation is placed on a field, then its value for the current object is automatically passed as the first parameter to getAnalyzerDefinitionName. If the annotation is placed on the class itself, then a null value is passed instead. The second parameter is the current entity object either way.

In this case, the discriminator is applied at the class level. So we cast that second parameter to type CustomerReview, and return the name of the appropriate analyzer based on the object's language field. If the language is unknown or if there are other issues, then the method simply returns null, telling Hibernate Search to fall back to the default analyzer.
