The source of the data that is to be indexed is called a data source. The data can generally come from very different sources such as SQL databases, plain text files, HTML documents, web services, mailboxes, and so on.
Sphinx cannot directly connect to a data source and fetch the required data. For different sources Sphinx requires different code to prepare the data for indexing. The code that does this job is called as data source driver (or data source for brevity).
Sphinx is available with pre-built data source drivers for MySQL and PostgreSQL databases. These drivers can connect to the database using Sphinx's native C/C++ API to run queries and fetch the data. The retrieved data is then indexed and stored in the indexes.
Another driver called xmlpipe2 (and xmlpipe which is now deprecated) is shipped with Sphinx. This driver executes the specified command and reads the data from its stdout
. The data is expected in a predefined XML structure. This data source is very useful when indexing data from non-conventional sources such as mailboxes, web service, plain text files, and so on. This data source can also be used to get the data from a database.
Data sources are defined in Sphinx configuration files and there can be multiple sources per index. Multiple sources are processed sequentially in the very same order in which they were specified in the index definition. The data (thus documents) from all the sources are merged as if they were coming from a single source.
In this book we will be going through the following data sources in detail:
Data sources are defined in the Sphinx configuration file (in our case /usr/local/sphinx/etc/sphinx.conf)
. The data source block looks something like:
source name { # Source options type = mysql ……. ……. }
We need to provide a name to each source followed by the source options. The type option specifies whether the data source is MySQL or PostgreSQL, or xmlpipe or xmlpipe2. We will be looking at the respective data source options, which are dependent on the type
, later in the chapter.
As mentioned earlier, Sphinx ships with two SQL data sources: MySQL and PostgreSQL. Let's see how we can use these data sources to create indexes that can later be searched using the searchd daemon.
Let's start by understanding how the MySQL data source worked in the example we saw in Chapter 2,Getting Started.
The data source configuration was as given next:
source src1 { type = mysql sql_host = localhost sql_user = test sql_pass = sql_db = test sql_port = 3306 # optional, default is 3306 sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content FROM documents sql_attr_uint = group_id sql_attr_timestamp = date_added sql_query_info = SELECT * FROM documents WHERE id=$id }
In this case we named the data source as src1
. We specified that we will be using MySQL as the source database with the help of the type
option.
The next few options like sql_host, sql_user, sql_pass, sql_db
and sql_port
were used to define the connection parameters for connecting to the MySQL database.
Now let's understand what sql_query
is used for. This option is used to specify the main SQL query that will fetch the data from the database. This is a mandatory option if you are using an SQL data source and there can only be one main query per source.
Select as many fields in the query as you want to be included in the index. However, document ID must be the very first field and it must be a unique unsigned positive integer. In this case the id
field of the documents
table will be treated as document id in the created index.
All the fields (except document ID) selected by the query are, by default, treated as full-text fields in the created index. If you want one or more fields to act as attributes in the index then you can do so with the help of the sql_attr_*
option. In the previous example, we declared group_id
to be an attribute of type unsigned integer and date_added
to be an attribute of type timestamp. The following options can be used to declare different types of attributes:
sql_attr_unit:
Unsigned integer attribute (32 bit)sql_attr_bool:
Boolean attributesql_attr_bigint:
Signed integer attribute (64 bit)sql_attr_timestamp:
Unix timestamp attributesql_attr_str2ordinal:
Ordinal string number attributesql_attr_float:
Floating point attributesql_attr_multi:
Multi-valued attribute (MVA)As we had inserted four rows in the documents table, the query will retrieve all the four rows and create the index with id
as document id, title
and content
as full-text fields, group_id
as an unsigned integer attribute, and date_added
as timestamp attribute.
When a search is performed on the index, the search term is matched against title and content fields, while the two attributes can be used for sorting and filtering.
The last option in the source configuration is sql_query_info
and this is optional. This option only applies to MySQL source type. It is used by the CLI search utility to fetch and display document information. The $id macro is required and it expands to the queried document ID. By default the CLI search utility will display only the attributes stored in the index for the matched documents. If you want to display the actual text that was full-text indexed or any other information related to the matched document, the sql_query_info
option comes in handy.
Now let's take a look at how to create indexes using a MySQL data source. We will understand this with the help of a small blogging application.
We are going to assume that our blogging application has some frontend to manage the posts, authors, and categories. We will only deal with the database part of it.
We will assume that each blog post is written by one author and it can be assigned multiple categories. Also, the same category can be assigned to multiple blog posts.
Our aim is to create an index for blog posts that can be searched from within our application. We will create indexes in steps, and our first step would be to create a simple index with only full-text indexed fields and no attributes. Next we will try to add simple attributes so that search can be filtered based on authors and date. Lastly we will try our hand at multi-value attributes (MVAs) and see what configuration options to use in order to add them to the index.