/path/to/sphinx-stem.conf as follows:

source items
{
    type               = mysql
    sql_host           = localhost
    sql_user           = root
    sql_pass           =
    sql_db             = sphinx_conf
    sql_query          = SELECT id, title, content, created FROM items
    sql_attr_timestamp = created
}

index items
{
    source       = items
    path         = /usr/local/sphinx/var/data/items-morph
    charset_type = utf-8
    morphology   = stem_en
}
indexer command:

$ /usr/local/sphinx/bin/indexer -c /path/to/sphinx-stem.conf items
search utility:

$ /usr/local/sphinx/bin/search -c /path/to/sphinx-stem.conf run
$ /usr/local/sphinx/bin/search -c /path/to/sphinx-stem.conf running
We created a configuration file to test the morphology option with the value stem_en. We then indexed the data and ran a few searches. Let's try to understand the results.
First, we searched for the word run. None of the records in our items table contains the word run, yet we still got a document in the search results. The reason is that document ID 1 has the word runs in one of its fields, and when runs is normalized, it becomes "run". Because we used the stem_en morphology pre-processor, all English words are normalized to their base form, both at indexing time and at search time. Thus "runs" becomes "run".
Similarly, in our second search command, we searched for the word running. We again got document ID 1 in the result, because running is normalized to "run" before the search is performed, and "run" matches the indexed form of "runs".
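The matching behavior described above can be sketched in a few lines of Python. This is only an illustration of the principle, not how Sphinx is implemented: the lookup table TOY_STEMS is a hypothetical stand-in for the stem_en pre-processor, and the two documents are made-up sample data. The point is that the same normalization is applied to indexed words and to the query.

```python
from collections import defaultdict

# Hypothetical stand-in for stem_en: a tiny lookup table, not a real stemmer.
TOY_STEMS = {"runs": "run", "running": "run"}


def normalize(word):
    """Lowercase a word and reduce it to its (toy) base form."""
    word = word.lower()
    return TOY_STEMS.get(word, word)


def build_index(docs):
    """Map each normalized word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


def search(index, query):
    """Normalize the query the same way as the indexed words, then look it up."""
    return sorted(index.get(normalize(query), set()))


# Made-up sample data: only document 1 contains a form of "run".
docs = {1: "He runs every morning", 2: "A quiet evening"}
index = build_index(docs)

print(search(index, "run"))      # [1]
print(search(index, "running"))  # [1]
```

Neither query word appears literally in document 1, but both normalize to "run", which is also what "runs" was stored as.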
As we saw in the previous example, Sphinx supports applying morphology pre-processors to the indexed data. The morphology option is optional, and its default value is empty, that is, no pre-processor is applied.
Sphinx comes with built-in English and Russian stemmers. Other built-in pre-processors are Soundex and Metaphone. The latter two replace words with phonetic codes. Two different words get the same phonetic code if they sound similar. For example, if you use Soundex morphology, your indexed data contains the word "gosh", and someone searches for the word "ghosh", then the search will match "gosh". This is because the two words are phonetically similar and have the same phonetic code.
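To see why "gosh" and "ghosh" end up with the same code, here is a minimal sketch of the classic American Soundex algorithm in Python. It is written from the textbook description of Soundex and is not taken from the Sphinx source, so Sphinx's own implementation may differ in details such as code length.

```python
# Consonant groups of classic Soundex; vowels, h, w and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character classic Soundex code of an ASCII word."""
    word = "".join(ch for ch in word.lower() if "a" <= ch <= "z")
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]


print(soundex("gosh"))   # G200
print(soundex("ghosh"))  # G200
```

Both words reduce to G200: the first letter is kept, "h" is ignored, "o" carries no digit, and "s" maps to 2.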
One more option related to morphology is min_stemming_len. This option lets us specify the minimum word length at which stemming is enabled. The default value is 1, which means every word is stemmed. Raising it is particularly useful when stemming short words does not give you the desired results.
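The effect of a minimum stemming length can be sketched as a simple length check in front of the stemmer. In the Python snippet below, toy_stem is a hypothetical stand-in that only strips a trailing "s"; it is not the stemmer Sphinx uses, and the function names are made up for this illustration.

```python
def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strips one trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem_if_long_enough(word, min_stemming_len=1):
    """Stem only words whose length is at least min_stemming_len."""
    if len(word) < min_stemming_len:
        return word
    return toy_stem(word)


for word in ("runs", "gas"):
    print(word,
          stem_if_long_enough(word),                      # default: stem everything
          stem_if_long_enough(word, min_stemming_len=4))  # leave short words alone
# runs run run
# gas ga gas
```

With the default of 1 the short word "gas" is cut down to "ga", which is rarely what you want. With a minimum length of 4 it is left untouched, while "runs" is still stemmed.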
There may be occasions where you want to replace a word with an altogether different word when indexing and searching. For example, when someone searches for "walk", the record with "runs" should match, although the two words are quite different. This cannot be accomplished by stemming (morphology).
Wordforms come to the rescue in such a situation. Wordforms are dictionaries applied after the incoming text has been tokenized by the charset_table rules. These dictionaries let you replace one word with another. An ideal use case is bringing different forms of a word to a single normal form. Wordforms are used both during indexing and searching.
The dictionary file for wordforms should be in a simple plain text format, with each line containing a source and a destination wordform. The wordforms file should use the same encoding as specified in the charset_type option.
Here's an example wordform file:
walks > runs
walked > walk
play station 3 > ps3
playstation 3 > ps3
To specify the wordform file in the index section of the configuration file, you should use the wordforms option:
wordforms = /path/to/wordforms.txt
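To make the replacement step concrete, here is a rough Python sketch that parses a file in the "source > destination" format shown above and applies it to a list of tokens. The function names and the greedy longest-match strategy for multi-word sources are assumptions made for this illustration; they are not taken from Sphinx, which may handle these cases differently.

```python
def load_wordforms(text):
    """Parse 'source > destination' lines into a dict of normalized phrases."""
    forms = {}
    for line in text.splitlines():
        if ">" not in line:
            continue  # skip blank or malformed lines
        source, destination = line.split(">", 1)
        source = " ".join(source.lower().split())
        destination = " ".join(destination.lower().split())
        if source and destination:
            forms[source] = destination
    return forms


def apply_wordforms(tokens, forms):
    """Replace tokens (or runs of tokens) that match a source wordform."""
    max_len = max((len(key.split()) for key in forms), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n])
            if key in forms:
                out.append(forms[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no wordform matched at this position
            i += 1
    return out


WORDFORMS = """\
walks > runs
walked > walk
play station 3 > ps3
playstation 3 > ps3
"""

forms = load_wordforms(WORDFORMS)
print(apply_wordforms("he walks home".split(), forms))         # ['he', 'runs', 'home']
print(apply_wordforms("play station 3 games".split(), forms))  # ['ps3', 'games']
```

Because the same mapping is applied to both the indexed text and the query, a search for "walks" is rewritten to "runs" and therefore matches a record that contains "runs".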