Time for action - using morphology for stemming

Create the Sphinx configuration file /path/to/sphinx-stem.conf as follows:

source items
{
    type               = mysql
    sql_host           = localhost
    sql_user           = root
    sql_pass           =
    sql_db             = sphinx_conf
    sql_query          = SELECT id, title, content, created FROM items
    sql_attr_timestamp = created
}

index items
{
    source       = items
    path         = /usr/local/sphinx/var/data/items-morph
    charset_type = utf-8
    morphology   = stem_en
}

Run the indexer command:

$ /usr/local/sphinx/bin/indexer -c /path/to/sphinx-stem.conf items

Search for the word run using the command line search utility:

$ /usr/local/sphinx/bin/search -c /path/to/sphinx-stem.conf run

Search for the word running:

$ /usr/local/sphinx/bin/search -c /path/to/sphinx-stem.conf running

What just happened?

We created a configuration file to test the morphology option with the value stem_en. We then indexed the data and ran a few searches. Let's try to understand the results.

First, we searched for the word run. None of the records in our items table contains the word run, yet we still got a document in the search results. The reason is that document ID 1 has the word runs in one of its fields. Because we used the stem_en morphology pre-processor, all English words are normalized to their base form, so "runs" becomes "run" and matches the query.

Similarly, in our second search command, we searched for the word running. We again got document ID 1 in the result, because running is also normalized to "run" before the search is performed.
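The effect described above can be imitated with a minimal Python sketch. The `toy_stem` function is a deliberately crude, hypothetical stand-in for a real English stemmer (it is not the algorithm behind stem_en), and the sample documents are invented for illustration. What it shows is that both the indexed text and the query pass through the same normalization.

```python
def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a real stemmer."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling, for example "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def build_index(docs):
    """Map each stemmed token to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the indexed text, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


# Invented sample data: document 1 contains "runs" but never "run".
docs = {1: "He runs every morning", 2: "A quiet walk"}
index = build_index(docs)

print(search(index, "run"))      # [1]
print(search(index, "running"))  # [1]
```

Neither query string appears literally in document 1, yet both find it because all three forms reduce to the same stem.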

morphology

As we saw in the previous example, Sphinx supports applying morphology pre-processors to the indexed data. The morphology option is optional, and its default value is empty, which means no pre-processor is applied.

Sphinx comes with built-in English and Russian stemmers. Other built-in pre-processors are Soundex and Metaphone. The latter two replace words with phonetic codes, and words that sound alike get the same phonetic code. For example, if you use Soundex morphology and your indexed data contains the word "gosh", a search for the word "ghosh" will match "gosh", because the two words are phonetically similar and have the same phonetic code.
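To see why "gosh" and "ghosh" end up with the same code, here is a short Python sketch of the classic American Soundex scheme. It is an illustration only: the details of Sphinx's own Soundex implementation may differ, and the four-character padded output is the conventional format, assumed here.

```python
# Letter groups of classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character classic Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w are ignored and do not separate equal codes
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


print(soundex("gosh"))   # G200
print(soundex("ghosh"))  # G200
```

Both words keep their first letter, drop the vowels and "h", and encode "s" as 2, so an index keyed on these codes cannot tell them apart.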

Note

Multiple stemmers can be specified (comma separated) and they are applied in the order that they are listed. Processing stops if one of the stemmers actually modifies the word.
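The "stop at the first stemmer that modifies the word" rule can be sketched in Python as follows. The two stemmers, `strip_plural` and `strip_ing`, are hypothetical toy functions invented for this illustration and do not correspond to Sphinx's built-in stemmers.

```python
def strip_plural(word):
    """Hypothetical stemmer: drop a trailing "s" from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def strip_ing(word):
    """Hypothetical stemmer: drop a trailing "ing" from longer words."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def stem_chain(word, stemmers):
    """Apply stemmers in listed order; stop once one changes the word."""
    for stemmer in stemmers:
        stemmed = stemmer(word)
        if stemmed != word:
            return stemmed
    return word


chain = [strip_plural, strip_ing]
print(stem_chain("meetings", chain))  # meeting
print(stem_chain("jumping", chain))   # jump
print(stem_chain("tree", chain))      # tree
```

Note that "meetings" becomes "meeting" and not "meet": once the first stemmer has changed the word, the second one never sees it.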

One more option related to morphology is min_stemming_len.

min_stemming_len

This option lets us specify the minimum word length at which stemming is enabled. The default value is 1, which means every word is stemmed.

This option is particularly useful when stemming short words does not give you the desired results, because words shorter than the specified length are left unstemmed.
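A minimal Python sketch of the length threshold is shown below. The `naive_stem` function is a hypothetical one-rule stemmer used only for illustration. It makes the point that a short acronym such as "gps" is left intact once the threshold is raised above its length.

```python
def naive_stem(word):
    """Hypothetical stemmer: strip a single trailing "s"."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def stem_with_min_len(word, min_stemming_len=1):
    """Stem only words that are at least min_stemming_len characters long."""
    if len(word) < min_stemming_len:
        return word
    return naive_stem(word)


print(stem_with_min_len("gps"))                       # gp
print(stem_with_min_len("gps", min_stemming_len=4))   # gps
print(stem_with_min_len("runs", min_stemming_len=4))  # run
```

With the default of 1 the acronym is damaged, while a threshold of 4 protects it and still lets four-letter words such as "runs" be stemmed.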

Wordforms

There may be occasions where you want to replace a word with an altogether different word when indexing and searching. For example, when someone searches for "walk", the record with "runs" should match, although the two words are quite different. This cannot be accomplished by stemming (morphology).

Wordforms come to the rescue in such a situation. Wordforms are dictionaries applied after the incoming text has been tokenized by the charset_table rules. These dictionaries let you replace one word with another. An ideal use case is bringing different forms of a word to a single normal form. Wordforms are used both during indexing and searching.

The dictionary file for wordforms should be in a simple plain text format, with each line containing a source and a destination wordform separated by ">". The file should use the same encoding as specified in charset_type. Here's an example wordform file:

walks > runs
walked > walk
play station 3 > ps3
playstation 3 > ps3

To specify the wordform file in the index section of the configuration file you should use the wordforms option:

wordforms = /path/to/wordforms.txt

Note

Wordforms are applied before stemming by the morphology pre-processor. If a word is modified by wordforms, the stemmers are not applied to it at all.
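The ordering described in this note can be sketched in Python. This is an illustration under stated assumptions, not Sphinx's implementation: `parse_wordforms` and `normalize` are hypothetical helpers, the stemmer is a toy, and only single-word source forms are handled (multi-word entries such as "play station 3 > ps3" are left out of the sketch).

```python
WORDFORMS = """\
walks > runs
walked > walk
"""


def parse_wordforms(text):
    """Parse "source > destination" lines into a dictionary."""
    forms = {}
    for line in text.splitlines():
        if ">" not in line:
            continue
        source, dest = (part.strip() for part in line.split(">", 1))
        if source and dest:
            forms[source] = dest
    return forms


def strip_plural(word):
    """Hypothetical stemmer: drop a trailing "s" from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def normalize(token, forms, stemmer):
    """Apply wordforms first; stem only tokens that wordforms left alone."""
    token = token.lower()
    if token in forms:
        return forms[token]  # replaced by a wordform, so stemming is skipped
    return stemmer(token)


forms = parse_wordforms(WORDFORMS)
print(normalize("walks", forms, strip_plural))  # runs
print(normalize("jumps", forms, strip_plural))  # jump
```

The first result stays "runs" rather than being stemmed further, which mirrors the rule that a word modified by wordforms bypasses the stemmers.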
