Modify /usr/local/sphinx/etc/sphinx-distributed.conf on the primary server (192.168.1.1) and add a new index definition as shown:
    index master
    {
        type = distributed
        # Local index to be searched
        local = items
        # Remote agent (index) to be searched
        agent = 192.168.1.2:9312:items-2
    }
Add the searchd section to the same file:
    searchd
    {
        log = /usr/local/sphinx/var/log/searchd-distributed.log
        query_log = /usr/local/sphinx/var/log/query-distributed.log
        max_children = 30
        pid_file = /usr/local/sphinx/var/log/searchd-distributed.pid
    }
Start the searchd daemon on the primary server (make sure to stop any previous instance):
    $ /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx-distributed.conf
Start the searchd daemon on the second server (make sure to stop any previous instance):
    $ /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx-distributed-2.conf
We added a second index definition to the configuration file on the primary server. This index is used for distributed searching. We named it master and used the type option to define it as a distributed index.
The master index contains only references to other local and remote indexes. It cannot be indexed directly and is used only for searching. Instead, re-index the indexes that master references (in our case, the items index on the first server and the items-2 index on the second server).
To reference local indexes (indexes on the same machine or in the same configuration file), use the local option. To reference remote indexes, use the agent option. Multiple local and remote indexes can be referenced. For example:
    local = items
    local = items-delta
    agent = 192.168.1.2:9312:items-2,items-3
    agent = myhost:9313:items-4
We defined two local indexes and two remote agents (the first agent serves two indexes, the second serves one). The syntax for specifying a remote index over a TCP connection is:
    hostname:port:index1[,index2[,...]]
The syntax for specifying a local UNIX socket connection is:
    /var/run/searchd.sock:index4
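To make the two agent formats concrete, here is a minimal Python sketch that splits an agent value into its address and index list. The names AgentSpec and parse_agent are hypothetical helpers invented for this illustration; they are not part of Sphinx, and searchd's own parsing may differ in its details.

```python
from typing import NamedTuple, Optional


class AgentSpec(NamedTuple):
    """Parsed form of one `agent` value (hypothetical helper type)."""
    host: Optional[str]         # set for TCP agents
    port: Optional[int]         # set for TCP agents
    socket_path: Optional[str]  # set for UNIX-socket agents
    indexes: list


def parse_agent(value: str) -> AgentSpec:
    """Split an agent value into its address and its list of index names."""
    value = value.strip()
    # The index list is always the part after the last colon.
    address, _, index_part = value.rpartition(":")
    indexes = [name.strip() for name in index_part.split(",") if name.strip()]
    if not address or not indexes:
        raise ValueError(f"malformed agent value: {value!r}")
    if address.startswith("/"):
        # UNIX socket form: /path/to/socket:index1[,index2[,...]]
        return AgentSpec(None, None, address, indexes)
    # TCP form: hostname:port:index1[,index2[,...]]
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"malformed TCP agent value: {value!r}")
    return AgentSpec(host, int(port), None, indexes)
```

Calling parse_agent on the agent values from the example above yields one host, one port, and the list of index names for each agent, which is a convenient way to check a configuration before deploying it.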
We also added a searchd configuration section to both configuration files. To perform a distributed search, run the query against the master index as follows:
    <?php
    require_once('sphinxapi.php');
    $client = new SphinxClient();
    $client->SetServer('192.168.1.1', 9312);
    $client->SetConnectTimeout(1);
    $client->SetArrayResult(true);
    $results = $client->Query('search term', 'master');
When you send a query to searchd using the client API (as shown in the previous code snippet), the following will occur:
searchd connects to the configured remote agents.
searchd retrieves the remote agents' (indexes') search results.
As we just saw, scaling Sphinx horizontally is a breeze and even a beginner can do it.
What follows are a few more options that can be used to configure a distributed index.
The agent_blackhole option lets you issue queries to remote agents and then forget them. This is useful for debugging, since you can set up a separate searchd instance and forward the search queries to it from your production instance without interfering with the production work. The production server's searchd will try to connect to and query the blackhole agent, but it will not wait for the query to be processed or for any results. This option is optional, and multiple blackholes can be defined:
    agent_blackhole = debugserver:9312:debugindex1,debugindex2
The agent_connect_timeout option sets the remote agent's connection timeout in milliseconds. It is optional and its default value is 1000 ms (1 second):
    agent_connect_timeout = 2000
This option specifies how long searchd waits before it gives up connecting to a remote agent.
The agent_query_timeout option sets the remote agent's query timeout in milliseconds. It is optional and its default value is 3000 ms (3 seconds):
    agent_query_timeout = 5000
This option specifies how long searchd waits before it gives up querying a remote agent.
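The effect of a query timeout can be illustrated with a small Python simulation. This is only a sketch of the general idea (wait for each agent until a shared deadline, then move on without it); it does not reflect how searchd is implemented, and query_agents and the stand-in agent callables are hypothetical names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def query_agents(agents, query_timeout_ms=3000):
    """Run every agent callable in parallel and collect results until the deadline.

    `agents` maps a label to a zero-argument callable that stands in for a
    remote query. Agents that miss the deadline are listed in `timed_out`.
    """
    results, timed_out = {}, []
    pool = ThreadPoolExecutor(max_workers=max(1, len(agents)))
    try:
        futures = {name: pool.submit(fn) for name, fn in agents.items()}
        deadline = time.monotonic() + query_timeout_ms / 1000.0
        for name, future in futures.items():
            # All agents share one deadline, so only the remaining time is waited.
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = future.result(timeout=remaining)
            except FutureTimeout:
                timed_out.append(name)
    finally:
        # Do not block on slow agents; their threads finish in the background.
        pool.shutdown(wait=False)
    return results, timed_out
```

With one fast and one slow stand-in agent and a short timeout, only the fast agent's results are returned and the slow agent is reported as timed out, which mirrors what a caller sees when a remote agent does not answer in time.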
The example in the previous section used two different servers. The same example can be built on one server with a few modifications to the configuration files. All references to the second server (192.168.1.2) should be replaced with the primary server (192.168.1.1).
The other important change is the port on which searchd listens. The configuration file for the secondary server should use a different listening port than the primary server, and the agent option of the master index should reflect the same port.
The next set of options in the index section are charset related options. Let's take a look at them.
The charset_type option specifies the character encoding. Sphinx supports two widely used kinds of encoding: UTF-8 and Single Byte Character Sets (SBCS). charset_type is optional and its default value is sbcs. The other known value is utf-8.
The specified encoding is used while indexing the data, parsing the query, and generating the snippets:
    charset_type = utf-8
The charset_table option is one of the most important options in Sphinx's tokenizing process, which extracts keywords from the document text or query. It controls which characters are acceptable and whether case is folded.
There are more than a hundred thousand characters in Unicode (and 256 in sbcs), and charset_table holds the mapping for each of them. Each character is mapped to 0 by default, meaning that the character does not occur in keywords and is treated as a separator. charset_table is used to map characters either to themselves or to their lowercase letters so that they are treated as part of a keyword. charset_table can also map one character to an entirely different character.
When you specify the charset_type as sbcs, the charset_table used internally is:
    # 'sbcs' defaults for English and Russian
    charset_table = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
And when the utf-8 charset type is used, the table is:
    # 'utf-8' defaults for English and Russian
    charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
The default charset tables for sbcs and utf-8 can be overridden in the index section of the configuration file.
The format for specifying the charset_table is a comma-separated list of mappings (as shown in the previous snippet). You can declare a single character as valid, or map a character to its lowercase form or to another character. There are thousands of characters, and specifying each one individually would be tedious; it would also bloat the configuration file and make it unmanageable. To solve this, Sphinx lets you use syntax shortcuts that map a whole range of characters at once. The list is as follows:
a: declares a single character as allowed and maps it to itself.
A->a: declares A as allowed and maps it to the lowercase a. Lowercase a itself is not declared as allowed.
A..Z: declares all characters in the range (A to Z) as allowed and maps them to themselves. Use a..z for the lowercase range.
A..Z->a..z: declares all characters in the range A to Z as allowed and maps them to the lowercase a to z range. Again, lowercase characters are not declared as allowed by this syntax.
A..Z/2: declares the characters at odd positions in the range (A, C, E, and so on) as allowed and maps each to the character that follows it at an even position. In addition, it declares the characters at even positions as allowed and maps them to themselves. For example, A..Z/2 is equivalent to the following:
    A->B, B->B, C->D, D->D, E->F, F->F, ..., Y->Z, Z->Z
Unicode characters and 8-bit ASCII characters must be specified in the format U+xxx, where xxx is the hexadecimal codepoint number for the character.
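The shortcut rules above can be checked with a short Python sketch that expands a charset_table string into an explicit character mapping. It is an illustration written from the rules described here, not Sphinx's own parser; the function names are hypothetical, and characters absent from the resulting dictionary correspond to separators.

```python
def _codepoint(token: str) -> int:
    """Convert a single character or U+xxx notation into a codepoint number."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError(f"expected one character or U+xxx, got {token!r}")
    return ord(token)


def _range(spec: str):
    """Return (first, last) codepoints for 'X' or 'X..Y'."""
    start, sep, end = spec.partition("..")
    first = _codepoint(start)
    last = _codepoint(end) if sep else first
    if last < first:
        raise ValueError(f"descending range: {spec!r}")
    return first, last


def expand_charset_table(table: str) -> dict:
    """Expand a charset_table string into {allowed character: mapped character}."""
    mapping = {}
    for rule in (part.strip() for part in table.split(",")):
        if not rule:
            continue
        if rule.endswith("/2"):
            # Checkerboard range: each pair of characters maps to its second member.
            first, last = _range(rule[:-2])
            for code in range(first, last + 1):
                is_first_of_pair = (code - first) % 2 == 0 and code < last
                mapping[chr(code)] = chr(code + 1 if is_first_of_pair else code)
        elif "->" in rule:
            # Range-to-range (or character-to-character) mapping.
            source, target = rule.split("->", 1)
            s_first, s_last = _range(source)
            t_first, t_last = _range(target)
            if s_last - s_first != t_last - t_first:
                raise ValueError(f"range lengths differ: {rule!r}")
            for offset in range(s_last - s_first + 1):
                mapping[chr(s_first + offset)] = chr(t_first + offset)
        else:
            # Plain character or range: every character maps to itself.
            first, last = _range(rule)
            for code in range(first, last + 1):
                mapping[chr(code)] = chr(code)
    return mapping
```

Expanding the utf-8 default table shown earlier maps A to a and leaves the hyphen out of the mapping entirely, which is why a hyphen normally splits a word in two.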
On many occasions you may want to limit the type of data to be indexed either by words or length. You may also want to filter the incoming text before indexing. Sphinx provides many options to handle all these things. Let's take a look at some of these options.
You may want to skip a few words while indexing, so that they do not go into the index. Stopwords are meant for this purpose. A file containing all such words can be stored anywhere on the file system, and its path should be specified as the value of the stopwords option.
This option takes a list of file paths (space separated). The default value is empty.
You can specify multiple stopwords files and all of them will be loaded during indexing. The encoding of the stopwords file should match the encoding specified in charset_type. The file should be plain text. You can use the same separators as in the indexed data, because the file is tokenized according to the charset_table settings.
To specify stopwords in the configuration file:
    stopwords = /usr/local/sphinx/var/data/stopwords.txt
    stopwords = /home/stop-en.txt
And the contents of the stopwords file should be (for default charset settings):
    the
    into
    a
    with
Stopwords do affect keyword positions even though they are not indexed. For example, suppose "into" is a stopword, one document contains the phrase "put into hand", and another document contains "put hand". A search for the exact phrase "put hand" returns only the latter document, because the stopped word "into" still occupies a position between "put" and "hand" in the first document.
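The position behavior can be reproduced with a small Python model. This is a simplified sketch that assumes whitespace tokenization and counts positions over all words; the function names are hypothetical and the code only imitates the effect described above, not Sphinx's internals.

```python
def index_positions(text, stopwords):
    """Return [(position, word)] for the words that are kept.

    Positions are counted over all words, so a dropped stopword still
    leaves a gap in the numbering.
    """
    kept = []
    for position, word in enumerate(text.lower().split(), start=1):
        if word not in stopwords:
            kept.append((position, word))
    return kept


def matches_phrase(text, phrase, stopwords):
    """Return True if the phrase words occur at matching relative positions in text."""
    doc = index_positions(text, stopwords)
    query = index_positions(phrase, stopwords)
    if not query:
        return False
    where = {}
    for position, word in doc:
        where.setdefault(word, set()).add(position)
    first_pos, first_word = query[0]
    for start in where.get(first_word, ()):
        # Every later query word must sit at the same relative offset.
        if all(start + (pos - first_pos) in where.get(word, ())
               for pos, word in query[1:]):
            return True
    return False
```

With "into" as the only stopword, "put into hand" is stored with "put" at position 1 and "hand" at position 3, so the phrase "put hand" (positions 1 and 2) does not match it, while a document containing just "put hand" does.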
The min_word_len option specifies the minimum length a word must have to be considered for indexing. The default value is 1 (index everything).
For example, if min_word_len is 3, then the words "at" and "to" won't be indexed. However, "the", "with", and any other word of three or more characters will be indexed:
    min_word_len = 3
The ignore_chars option is used when you want to ignore certain characters. Let's take the hyphen (-) as an example. If you have the word "test-match" in your data, it would normally go into the index as two different words, "test" and "match". However, if you specify "-" in ignore_chars, then "test-match" will go into the index as one single word ("testmatch").
Example for the soft hyphen, whose codepoint is U+AD (the ordinary hyphen "-" is U+2D):
    ignore_chars = U+AD
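A minimal Python sketch shows the difference between a separator and an ignored character. It assumes a simplified tokenizer in which only ASCII letters, digits, and underscore are keyword characters; tokenize is a hypothetical helper for illustration, not a Sphinx function.

```python
import re


def tokenize(text, ignore_chars=""):
    """Split text into lowercase keywords, first deleting any ignored characters.

    Only ASCII letters, digits, and underscore count as keyword characters
    here; everything else acts as a separator.
    """
    for char in ignore_chars:
        text = text.replace(char, "")
    return re.findall(r"[0-9a-z_]+", text.lower())
```

Without any ignored characters, "test-match" splits at the hyphen; once the hyphen is ignored, the two halves are joined into a single keyword.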
The html_strip option is used to strip HTML markup from the incoming data before it gets indexed. It is optional and its default value is 0, that is, do not strip HTML. The only other value this option can hold is 1, that is, strip HTML.
This option strips only the tags and not the content within the tags. For practical purposes this works in a similar way to the strip_tags() PHP function:
    html_strip = 1
The html_index_attrs option is used to specify the list of HTML attributes whose values should be indexed when stripping HTML.
This is useful for tags such as <img> and <a>, where you may want to index the value of the alt and title attributes:
    html_index_attrs = img=alt,title; a=title;
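The combined effect of stripping tags while keeping selected attribute values can be approximated in Python with the standard html.parser module. This is only a sketch of the behavior described above; StripHtml and strip_html are hypothetical names, and Sphinx's own stripper may treat edge cases differently.

```python
from html.parser import HTMLParser


class StripHtml(HTMLParser):
    """Collect text content plus the values of selected tag attributes."""

    def __init__(self, index_attrs=None):
        super().__init__(convert_charrefs=True)
        # Example shape: {"img": {"alt", "title"}, "a": {"title"}}
        self.index_attrs = index_attrs or {}
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Keep the values of the attributes that were asked for on this tag.
        wanted = self.index_attrs.get(tag, ())
        for name, value in attrs:
            if name in wanted and value:
                self.pieces.append(value)

    def handle_data(self, data):
        # Keep the text between tags, skipping whitespace-only runs.
        if data.strip():
            self.pieces.append(data.strip())


def strip_html(markup, index_attrs=None):
    """Return the indexable text of markup as one space-joined string."""
    parser = StripHtml(index_attrs)
    parser.feed(markup)
    parser.close()
    return " ".join(parser.pieces)
```

Passing a dictionary equivalent to the html_index_attrs value above keeps the alt and title text of images and the title text of links, while all other markup is dropped.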
We often use a variant of the actual word while searching. For example, we search for "run" and we intend that the results should also contain those documents that match "runs", "running", or "ran". You must have seen this in action on search websites such as Google, Yahoo!, and so on. The same thing can be achieved in Sphinx quite easily using morphology and stemming.
Morphology is concerned with word forms and describes how words are formed. In morphology, stemming is the process of transforming words to their base (root) form, that is, reducing inflected words to their stem.