Time for action - adding distributed index configuration

  1. Modify /usr/local/sphinx/etc/sphinx-distributed.conf on the primary server (192.168.1.1) and add a new index definition as shown:
    index master
    {
    type = distributed
    # Local index to be searched
    local = items
    # Remote agent (index) to be searched
    agent = 192.168.1.2:9312:items-2
    }
    
  2. Modify the configuration files on both 192.168.1.1 and 192.168.1.2 servers, and add the searchd section as shown:
    searchd
    {
    log = /usr/local/sphinx/var/log/searchd-distributed.log
    query_log = /usr/local/sphinx/var/log/query-distributed.log
    max_children = 30
    pid_file = /usr/local/sphinx/var/log/searchd-distributed.pid
    }
    
  3. Start the searchd daemon on the primary server (make sure to stop any previous instance):
    $ /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx-distributed.conf
    
  4. Start the searchd daemon on the second server (make sure to stop any previous instance):
    $ /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx-distributed-2.conf
    

What just happened?

We added a second index definition to the configuration file on the primary server. This index will be used for distributed searching. We named this index master and used the type option to define it as a distributed index.

The master index contains only references to other local and remote indexes. It cannot be indexed directly and is used only for searching. Instead, re-index the indexes that master references (in our case, the items index on the first server and the items-2 index on the second server).

To reference local indexes (indexes on the same machine or in the same configuration file), the local option is used. To reference remote indexes, the agent option is used. Multiple local and remote indexes can be referenced. For example:

local = items
local = items-delta
agent = 192.168.1.2:9312:items-2,items-3
agent = myhost:9313:items-4

We defined two local indexes and two remote agents, which together serve three remote indexes. The syntax for specifying a remote index over a TCP connection is:

hostname:port:index1[,index2[,...]]

The syntax for specifying an agent over a local UNIX socket is:

/var/run/searchd.sock:index4
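
As a rough illustration of the two address forms, the following Python sketch splits an agent value into its parts. The function name parse_agent and the layout of the returned dictionary are hypothetical, chosen only for this example; this is not how searchd itself parses its configuration.

```python
def parse_agent(spec):
    """Split an agent value into its address and index list (illustrative only)."""
    if spec.startswith("/"):
        # UNIX socket form: /path/to/socket:index1[,index2[,...]]
        path, _, indexes = spec.rpartition(":")
        return {"socket": path, "indexes": indexes.split(",")}
    # TCP form: hostname:port:index1[,index2[,...]]
    host, port, indexes = spec.split(":", 2)
    return {"host": host, "port": int(port), "indexes": indexes.split(",")}


if __name__ == "__main__":
    print(parse_agent("192.168.1.2:9312:items-2,items-3"))
    print(parse_agent("/var/run/searchd.sock:index4"))
```

The sketch decides between the two forms by checking for a leading slash, which is an assumption that holds for the examples in this section.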

We also added a searchd configuration section to both configuration files. Now, to perform a distributed search, send the query to the master index as follows:

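<?php
require_once('sphinxapi.php');
$client = new SphinxClient();
// Connect to searchd on the primary server, which hosts the master index
$client->SetServer('192.168.1.1', 9312);
$client->SetConnectTimeout(1);
$client->SetArrayResult(true);
// Query the distributed index rather than an individual index
$results = $client->Query('search term', 'master');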

When you send a query to searchd using the client API (as shown in the previous code snippet), the following will occur:

  • searchd connects to the configured remote agents
  • It issues the search query to them
  • It searches all local indexes sequentially (while the remote agents are searching)
  • searchd retrieves the search results from the remote agents
  • It merges results from local and remote indexes and removes any duplicates
  • Finally, the merged results are sent to the client

When you get the results in your application, there is absolutely no difference between results returned by a normal index and a distributed index.
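
The sequence above can be modelled with a small Python sketch. It is a toy model under stated assumptions, not searchd's implementation: distributed_search is a hypothetical name, every searcher is a plain callable standing in for an index or agent, matches are dictionaries with id and weight keys, and duplicates are resolved by keeping the first match seen for a document ID.

```python
from concurrent.futures import ThreadPoolExecutor


def distributed_search(query, local_searchers, remote_searchers, limit=20):
    """Toy model of a distributed query; each searcher is callable(query) -> list of matches."""
    workers = max(1, len(remote_searchers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Contact every remote agent and issue the query to it.
        pending = [pool.submit(search, query) for search in remote_searchers]
        # Search the local indexes one after another while the agents work.
        matches = []
        for search in local_searchers:
            matches.extend(search(query))
        # Retrieve the results from the remote agents.
        for future in pending:
            matches.extend(future.result())
    # Merge and drop duplicate document IDs (assumption: first one seen wins).
    merged = {}
    for match in matches:
        merged.setdefault(match["id"], match)
    # Send the best matches back to the caller.
    ranked = sorted(merged.values(), key=lambda m: m["weight"], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    local = lambda q: [{"id": 1, "weight": 5}, {"id": 2, "weight": 3}]
    remote = lambda q: [{"id": 2, "weight": 9}, {"id": 3, "weight": 7}]
    print(distributed_search("search term", [local], [remote]))
```

The remote calls run in threads so that the local search overlaps with them, which mirrors the point that remote agents are searching while the local indexes are processed.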

As we just saw, scaling Sphinx horizontally is a breeze and even a beginner can do it.

What follows are a few more options that can be used to configure a distributed index.

agent_blackhole

This option lets you issue queries to remote agents and then forget them. This is useful for debugging since you can set up a separate searchd instance and forward the search queries to it from your production instance, without interfering with the production work. The production server's searchd will try to connect to and query the blackhole agent, but it will not wait for processing to finish or for results. This option is optional, and multiple blackhole agents can be defined:

agent_blackhole = debugserver:9312:debugindex1,debugindex2

agent_connect_timeout

The remote agent's connection timeout in milliseconds. This option is optional and its default value is 1000 ms (1 second):

agent_connect_timeout = 2000

This option specifies the time period before searchd should give up connecting to a remote agent.

agent_query_timeout

The remote agent's query timeout in milliseconds. This option is optional and its default value is 3000 ms (3 seconds):

agent_query_timeout = 5000

This option specifies the time period before searchd should give up querying a remote agent.

Distributed searching on single server

The example we saw in the previous section used two different servers. The same example can be built on one server with small modifications to the configuration files. All references to the second server (192.168.1.2) should be replaced with the primary server (192.168.1.1).

The other important change is the port on which searchd listens. The configuration file for the secondary instance should use a different listening port than the primary instance. The same port should be reflected in the agent option of the master index.

charset configuration

The next set of options in the index section are charset related options. Let's take a look at them.

charset_type

You can specify the character encoding type using this option. The two encodings that can be used with Sphinx are UTF-8 and Single Byte Character Set (SBCS). charset_type is optional and its default value is sbcs. The other known value is utf-8.

The specified encoding is used while indexing the data, parsing the query and generating the snippets:

charset_type = utf-8

charset_table

This is one of the most important options in Sphinx's tokenizing process, which extracts keywords from the document text or query. It controls which characters are accepted and whether their case is folded.

There are more than a hundred thousand characters in Unicode (and 256 in sbcs) and the charset table holds the mapping for each of those characters. Each character is mapped to 0 by default, that is, the character does not occur in keywords and it should be treated as a separator. charset_table is used to map all such characters to either themselves, or to their lower case letter so that they are treated as a part of the keyword. charset_table can also be used to map one character to an entirely different character.

When you specify the charset_type as sbcs then the charset_table being used (internally) is:

# 'sbcs' defaults for English and Russian
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF

And when utf-8 charset type is used then the table being used is:

# 'utf-8' defaults for English and Russian
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+410..U+42F->U+430..U+44F, U+430..U+44F

The default charset table for sbcs and utf-8 can be overridden in the index section of the configuration file.

The format for specifying the charset_table is a comma-separated list of mappings (as shown in the previous snippet). You can specify a single character as valid, or map one character to its lowercase character or to another character. There are thousands of characters and specifying each one of them in the table would be a tedious task. Further, it will bloat the configuration file and will be unmanageable. To solve this issue Sphinx lets you use syntax shortcuts that map a whole range of characters at once. The list is as follows:

  • a—declares a single character as allowed and maps it to itself.
  • A->a—declares A as allowed and maps it to the lowercase a. Lowercase a itself is not declared as allowed.
  • A..Z—declares all characters in the range (A to Z) as allowed and maps them to themselves. Use a..z for lowercase range.
  • A..Z->a..z—declares all characters in the range A to Z as allowed and maps them to lowercase a to z range. Again, lowercase characters are not declared as allowed by this syntax.
  • A..Z/2—declares odd characters in the range as allowed and maps them to the even characters. In addition, it declares even characters as allowed and maps them to themselves. Let's understand this with an example. A..Z/2 is equivalent to the following:
    A->B,B->B,C->D,D->D,E->F,F->F,....,Y->Z,Z->Z
    

    Note

    Unicode characters and 8-bit ASCII characters must be specified in the format U+xxx, where xxx is the hexadecimal codepoint number for the character.
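
To make the shortcut syntax concrete, here is a minimal Python sketch that expands a charset_table value into an explicit character map. The function expand_charset_table and its helpers are hypothetical names for this illustration; the code only models the shortcuts listed above (it assumes even-length ranges for the /2 form and does not handle every edge case), and it is not Sphinx's own parser.

```python
def _code_point(token):
    """Turn 'a' or 'U+410' into an integer code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    return ord(token)


def _bounds(token):
    """Return (first, last) code points for 'A' or 'A..Z'."""
    first, sep, last = token.partition("..")
    start = _code_point(first)
    end = _code_point(last) if sep else start
    return start, end


def expand_charset_table(table):
    """Expand a charset_table string into {source code point: target code point}."""
    mapping = {}
    for item in table.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith("/2"):
            # Checkerboard form: each pair of characters maps to the second one.
            start, end = _bounds(item[:-2])
            for cp in range(start, end + 1, 2):
                mapping[cp] = cp + 1
                mapping[cp + 1] = cp + 1
        elif "->" in item:
            # Mapping form: only the source side is declared as allowed.
            src, dst = item.split("->")
            (s_start, s_end), (d_start, d_end) = _bounds(src), _bounds(dst)
            if s_end - s_start != d_end - d_start:
                raise ValueError("range sizes differ in %r" % item)
            for offset in range(s_end - s_start + 1):
                mapping[s_start + offset] = d_start + offset
        else:
            # Plain character or range: maps to itself.
            start, end = _bounds(item)
            for cp in range(start, end + 1):
                mapping[cp] = cp
    return mapping


if __name__ == "__main__":
    table = expand_charset_table("0..9, A..Z->a..z, _, a..z")
    print(chr(table[ord("Q")]))      # target of the A..Z->a..z mapping for Q
    print(ord("-") in table)         # whether the hyphen is an allowed character
```

Any character missing from the resulting map corresponds to the default case described earlier, that is, it is treated as a separator.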

Data related options

On many occasions you may want to limit the type of data to be indexed either by words or length. You may also want to filter the incoming text before indexing. Sphinx provides many options to handle all these things. Let's take a look at some of these options.

stopwords

You may want to skip a few words while indexing, that is, those words should not go in the index. Stopwords are meant for this purpose. A file containing all such words can be stored anywhere on the file system and its path should be specified as the value of the stopwords option.

This option takes a list of file paths (space separated). The default value is empty.

You can specify multiple stopwords files and all of them will be loaded during indexing. The encoding of the stopwords file should match the encoding specified in the charset_type. The format of the file should be plain text. You can use the same separators as in the indexed data because data will be tokenized with respect to the charset_table settings.

To specify stopwords in the configuration file:

stopwords = /usr/local/sphinx/var/data/stopwords.txt
stopwords = /home/stop-en.txt

The contents of a stopwords file could look like this (for default charset settings):

the into a with

Stopwords do affect keyword positions even though they are not indexed. For example, if "into" is a stopword, one document contains the phrase "put into hand", and another document contains "put hand", then a search for the exact phrase "put hand" returns only the latter document, even though "into" in the first document is stopped.
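
The position effect can be sketched in a few lines of Python. This is a simplified model with hypothetical function names (index_positions and phrase_match): it splits on whitespace instead of using charset_table, and it assumes the searched phrase contains no stopwords itself.

```python
def index_positions(text, stopwords):
    """Map each kept keyword to its positions; stopwords still consume a position."""
    positions = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        if word not in stopwords:
            positions.setdefault(word, []).append(pos)
    return positions


def phrase_match(positions, phrase):
    """Return True if the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    return any(
        all(start + i in positions.get(word, []) for i, word in enumerate(words))
        for start in positions.get(words[0], [])
    )


if __name__ == "__main__":
    stop = {"into"}
    first = index_positions("put into hand", stop)   # "hand" keeps position 3
    second = index_positions("put hand", stop)       # "hand" is at position 2
    print(phrase_match(first, "put hand"), phrase_match(second, "put hand"))
```

Because "hand" stays at position 3 in the first document, it is not adjacent to "put", so the exact phrase does not match there.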

min_word_len

This option specifies the minimum length a word should have to be considered as a candidate for indexing. Default value is 1 (index everything).

For example, if min_word_len is 3, then the words "at" and "to" won't be indexed. However, "the", "with", and any other word of three or more characters will be indexed:

min_word_len = 3
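
The effect of this setting amounts to a simple length filter, as the following Python sketch shows. The function name filter_short_words is hypothetical, and the sketch only illustrates the rule stated above.

```python
def filter_short_words(words, min_word_len=1):
    """Keep only the words that are long enough to be indexed."""
    return [word for word in words if len(word) >= min_word_len]


if __name__ == "__main__":
    print(filter_short_words(["at", "to", "the", "with"], min_word_len=3))
```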

ignore_chars

This option is used when you want to ignore certain characters. Let's take an example of a hyphen (-) to understand this. If you have a word "test-match" in your data, then this would normally go as two different words, "test" and "match" in the index. However, if you specify "-" in ignore_chars, then "test-match" will go as one single word ("testmatch") in the index.

Note

The syntax for specifying ignore_chars is similar to charset_table but it only allows you to declare the characters and not to map them. In addition, ignored characters must not be present in the charset_table.

Example for the soft hyphen, whose code point is U+AD (the ordinary hyphen in "test-match" is U+2D):

ignore_chars = U+AD
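
The difference between a separator and an ignored character can be illustrated with a small Python tokenizer. This is a sketch under stated assumptions: tokenize is a hypothetical function, the allowed set stands in for charset_table, and lowercasing is done with str.lower() rather than through explicit mappings.

```python
def tokenize(text, allowed, ignore_chars=""):
    """Split text into keywords; ignored characters vanish, others separate words."""
    words, current = [], []
    for ch in text.lower():
        if ch in ignore_chars:
            continue  # dropped entirely, so the current word keeps growing
        if ch in allowed:
            current.append(ch)
        elif current:
            # Any other character acts as a separator.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789_")
    print(tokenize("test-match", allowed))
    print(tokenize("test-match", allowed, ignore_chars="-"))
```

The first call treats the hyphen as a separator, and the second removes it before the word boundary is decided.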

html_strip

This option is used to strip out HTML markup from the incoming data before it gets indexed. This option is optional and its default value is 0, that is, do not strip HTML. The only other value it can hold is 1, that is, strip HTML.

This option only strips the tags and not the content within the tags. For practical purposes this works in a similar way to the strip_tags() PHP function:

html_strip = 1

html_index_attrs

This option is used to specify the list of HTML attributes whose value should be indexed when stripping HTML.

This is useful for tags such as <img> and <a>, where you may want to index the value of the alt and title attributes:

html_index_attrs = img=alt,title; a=title;

html_remove_elements

This option is used to specify the list of HTML elements that should be completely removed from the data, that is, both tags and their content are removed.

This option is useful to strip out inline CSS and JavaScript if you are indexing HTML pages:

html_remove_elements = style, script
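
To see how the three HTML options interact, consider the following Python sketch built on the standard html.parser module. It only approximates the behavior described above and is not Sphinx's stripper; strip_html, its parameters, and the Stripper class are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class Stripper(HTMLParser):
    """Collect text, selected attribute values, and skip removed elements."""

    def __init__(self, index_attrs=None, remove_elements=()):
        super().__init__()
        self.index_attrs = index_attrs or {}
        self.remove_elements = set(remove_elements)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_elements:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Like html_index_attrs: keep the values of the listed attributes.
        for name, value in attrs:
            if value and name in self.index_attrs.get(tag, ()):
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.remove_elements and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Like html_strip: tags are dropped, but their text content is kept.
        if not self.skip_depth:
            self.chunks.append(data)


def strip_html(html, index_attrs=None, remove_elements=()):
    """Return the text that would remain for indexing (illustrative only)."""
    parser = Stripper(index_attrs, remove_elements)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = '<p>Hello <a href="/x" title="Home page">world</a></p><script>var x = 1;</script>'
    print(strip_html(page))
    print(strip_html(page, {"a": {"title"}}, remove_elements=("style", "script")))
```

With stripping alone, the script body remains in the text, which is why removing the style and script elements is useful when indexing whole pages.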

Word processing options

We often use a variant of the actual word while searching. For example, we search for "run" and we intend that the results should also contain those documents that match "runs", "running", or "ran". You must have seen this in action on search websites such as Google, Yahoo!, and so on. The same thing can be achieved in Sphinx quite easily using morphology and stemming.

Morphology is concerned with word forms and describes how words are formed. In morphology, stemming is the process of transforming words to their base (root) form, that is, reducing inflected words to their stem.

Morphology

Some pre-processors can be applied to the words being indexed to replace different forms of the same word with the base (normalized) form. Let's see how.

Note

The following exercise assumes that your data (items table) has the word "runs" in one or more records. Further, "run" and "running" are not present in the same record where "runs" is present.
