Data sources

The source of the data that is to be indexed is called a data source. The data can generally come from very different sources such as SQL databases, plain text files, HTML documents, web services, mailboxes, and so on.

Sphinx cannot directly connect to a data source and fetch the required data. For different sources Sphinx requires different code to prepare the data for indexing. The code that does this job is called as data source driver (or data source for brevity).

Sphinx is available with pre-built data source drivers for MySQL and PostgreSQL databases. These drivers can connect to the database using Sphinx's native C/C++ API to run queries and fetch the data. The retrieved data is then indexed and stored in the indexes.

Another driver called xmlpipe2 (and xmlpipe which is now deprecated) is shipped with Sphinx. This driver executes the specified command and reads the data from its stdout. The data is expected in a predefined XML structure. This data source is very useful when indexing data from non-conventional sources such as mailboxes, web service, plain text files, and so on. This data source can also be used to get the data from a database.

Data sources are defined in Sphinx configuration files and there can be multiple sources per index. Multiple sources are processed sequentially in the very same order in which they were specified in the index definition. The data (thus documents) from all the sources are merged as if they were coming from a single source.

In this book we will be going through the following data sources in detail:

  • MySQL data source
  • xmlpipe data source

How to define the data source?

Data sources are defined in the Sphinx configuration file (in our case /usr/local/sphinx/etc/sphinx.conf). The data source block looks something like:

source name
{
# Source options
type = mysql
…….
…….
}

We need to provide a name to each source followed by the source options. The type option specifies whether the data source is MySQL or PostgreSQL, or xmlpipe or xmlpipe2. We will be looking at the respective data source options, which are dependent on the type, later in the chapter.

SQL data sources

As mentioned earlier, Sphinx ships with two SQL data sources: MySQL and PostgreSQL. Let's see how we can use these data sources to create indexes that can later be searched using the searchd daemon.

Note

All our examples will use MySQL as the source database. However, there shouldn't be any major difference in defining a source for PostgreSQL.

Let's start by understanding how the MySQL data source worked in the example we saw in Chapter 2,Getting Started.

The data source configuration was as given next:

source src1
{
type = mysql
sql_host = localhost
sql_user = test
sql_pass =
sql_db = test
sql_port = 3306 # optional, default is 3306
sql_query = 
SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content 
FROM documents
sql_attr_uint = group_id
sql_attr_timestamp = date_added
sql_query_info = SELECT * FROM documents WHERE id=$id
}

In this case we named the data source as src1. We specified that we will be using MySQL as the source database with the help of the type option.

The next few options like sql_host, sql_user, sql_pass, sql_db and sql_port were used to define the connection parameters for connecting to the MySQL database.

Now let's understand what sql_query is used for. This option is used to specify the main SQL query that will fetch the data from the database. This is a mandatory option if you are using an SQL data source and there can only be one main query per source.

Select as many fields in the query as you want to be included in the index. However, document ID must be the very first field and it must be a unique unsigned positive integer. In this case the id field of the documents table will be treated as document id in the created index.

Note

Document ID should be the first field in the SQL query and it must be a unique unsigned non-zero, non-negative integer number. If you are using multiple sources for an index then the document IDs must be unique across all sources.

All the fields (except document ID) selected by the query are, by default, treated as full-text fields in the created index. If you want one or more fields to act as attributes in the index then you can do so with the help of the sql_attr_* option. In the previous example, we declared group_id to be an attribute of type unsigned integer and date_added to be an attribute of type timestamp. The following options can be used to declare different types of attributes:

  • sql_attr_unit: Unsigned integer attribute (32 bit)
  • sql_attr_bool: Boolean attribute
  • sql_attr_bigint: Signed integer attribute (64 bit)
  • sql_attr_timestamp: Unix timestamp attribute
  • sql_attr_str2ordinal: Ordinal string number attribute
  • sql_attr_float: Floating point attribute
  • sql_attr_multi: Multi-valued attribute (MVA)

As we had inserted four rows in the documents table, the query will retrieve all the four rows and create the index with id as document id, title and content as full-text fields, group_id as an unsigned integer attribute, and date_added as timestamp attribute.

When a search is performed on the index, the search term is matched against title and content fields, while the two attributes can be used for sorting and filtering.

The last option in the source configuration is sql_query_info and this is optional. This option only applies to MySQL source type. It is used by the CLI search utility to fetch and display document information. The $id macro is required and it expands to the queried document ID. By default the CLI search utility will display only the attributes stored in the index for the matched documents. If you want to display the actual text that was full-text indexed or any other information related to the matched document, the sql_query_info option comes in handy.

Creating Index using SQL data source (Blog)

Now let's take a look at how to create indexes using a MySQL data source. We will understand this with the help of a small blogging application.

We are going to assume that our blogging application has some frontend to manage the posts, authors, and categories. We will only deal with the database part of it.

We will assume that each blog post is written by one author and it can be assigned multiple categories. Also, the same category can be assigned to multiple blog posts.

Our aim is to create an index for blog posts that can be searched from within our application. We will create indexes in steps, and our first step would be to create a simple index with only full-text indexed fields and no attributes. Next we will try to add simple attributes so that search can be filtered based on authors and date. Lastly we will try our hand at multi-value attributes (MVAs) and see what configuration options to use in order to add them to the index.

Creating a simple index without any attributes

Let's create an index that would essentially work in a very similar way to the database table itself. It will have two full-text indexed fields: title and content.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset