Time for action - using morphology for stemming

  1. Create the Sphinx configuration file /path/to/sphinx-stem.conf as follows:
    source items
    {
        type = mysql
        sql_host = localhost
        sql_user = root
        sql_pass =
        sql_db = sphinx_conf
        sql_query = SELECT id, title, content, created FROM items
        sql_attr_timestamp = created
    }

    index items
    {
        source = items
        path = /usr/local/sphinx/var/data/items-morph
        charset_type = utf-8
        morphology = stem_en
    }
    
  2. Run the indexer command:
    $ /usr/local/sphinx/bin/indexer -c /path/to/sphinx-stem.conf items
    
  3. Search for the word run using the command line search utility:
    $ /usr/local/sphinx/bin/search -c /path/to/sphinx-stem.conf run
    
  4. Search for the word running:
    $ /usr/local/sphinx/bin/search -c /path/to/sphinx-stem.conf running
    

What just happened?

We created a configuration file to test the morphology option with the value stem_en. We then indexed the data and ran a few searches. Let's try to understand the results.

First, we searched for the word run. None of the records in our items table contains the word run, yet we still got a document in the search results. This is because the document with ID 1 has the word runs in one of its fields, and runs is normalized to "run". Because we used the stem_en morphology pre-processor, English words are normalized to their base form, so "runs" becomes "run".

Similarly, in our second search, we searched for the word running. We again got document ID 1 in the result because the query word running is also normalized to "run" before the search is performed.

morphology

As we saw in the previous example, Sphinx supports applying morphology pre-processors to the indexed data. The morphology option is optional and its default value is empty, that is, no pre-processor is applied.

Sphinx comes with built-in English and Russian stemmers. Other built-in pre-processors are Soundex and Metaphone. The latter two replace words with phonetic codes, and words that sound similar get the same phonetic code. For example, if you use Soundex morphology and your indexed data contains the word "gosh", a search for the word "ghosh" will match "gosh". This is because the two words are phonetically similar and have the same phonetic code.
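
As a minimal sketch, a phonetic index could be declared by changing only the morphology value in the index section. The index name items_soundex and its data path are hypothetical, chosen for illustration; soundex is the value that selects the Soundex pre-processor, and metaphone selects Metaphone:

index items_soundex
{
    source = items
    path = /usr/local/sphinx/var/data/items-soundex
    charset_type = utf-8
    morphology = soundex
}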

Note

Multiple stemmers can be specified (comma separated) and they are applied in the order that they are listed. Processing stops if one of the stemmers actually modifies the word.
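
For instance, assuming an index that holds both English and Russian text, the two built-in stemmers could be listed as follows:

morphology = stem_en, stem_ru

Here stem_en is tried first, and stem_ru is applied only if stem_en left the word unchanged.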

One more option related to morphology is min_stemming_len.

min_stemming_len

This option lets us specify the minimum word length at which stemming is enabled. The default value is 1, which means every word is stemmed.

This option is particularly useful when stemming short words does not give you the desired results.
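
For example, setting the value to 4 (an arbitrary value chosen for illustration) would leave words shorter than four characters unstemmed, while longer words still pass through the stemmer:

morphology = stem_en
min_stemming_len = 4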

Wordforms

There may be occasions where you want to replace a word with an altogether different word when indexing and searching. For example, when someone searches for "walk", the record with "runs" should match, although the two words are quite different. This cannot be accomplished by stemming (morphology).

Wordforms come to the rescue in such a situation. Wordforms are dictionaries applied after the incoming text is tokenized by the charset_table rules. These dictionaries let you replace one word with another. An ideal use case is bringing different forms of a word to a single normal form. Wordforms are used during both indexing and searching.

The dictionary file for wordforms should be in a simple plain text format, with each line containing a source and a destination wordform. The file should use the same encoding as specified in charset_type.

Here's an example wordforms file:

walks > runs
walked > walk
play station 3 > ps3
playstation 3 > ps3

To specify the wordforms file, use the wordforms option in the index section of the configuration file:

wordforms = /path/to/wordforms.txt

Note

Wordforms are applied before stemming by the morphology pre-processor. If a word is modified by wordforms, the stemmers are not applied to it at all.
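
Putting the two options together, an index section that uses both a wordforms dictionary and the English stemmer might look like the following sketch. The index name items_wordforms and its data path are hypothetical; the wordforms path is the same placeholder used earlier:

index items_wordforms
{
    source = items
    path = /usr/local/sphinx/var/data/items-wordforms
    charset_type = utf-8
    wordforms = /path/to/wordforms.txt
    morphology = stem_en
}

With the example dictionary, walked is replaced by walk and is not stemmed further, while a word that is absent from the dictionary, such as running, is still stemmed to run.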
