Time for action - creating index (without attributes)

  1. Create a new Sphinx configuration file at /usr/local/sphinx/etc/sphinx-blog-xmlpipe2.conf with the following options:
    source blog
    {
    type = xmlpipe2
    xmlpipe_command = /usr/bin/php /home/abbas/sphinx/makeindex.php
    }
    index posts
    {
    source = blog
    path = /usr/local/sphinx/var/data/blog-xmlpipe2
    docinfo = extern
    charset_type = utf-8
    }
    indexer
    {
    mem_limit = 32M
    }
    
  2. Create the PHP script /home/abbas/sphinx/makeindex.php (this script can be anywhere on your machine).
    <?php
    // Database connection credentials
    $dsn = 'mysql:dbname=myblog;host=localhost';
    $user = 'root';
    $pass = '';
    // Instantiate the PDO (PHP 5 specific) class
    try {
        $dbh = new PDO($dsn, $user, $pass);
    } catch (PDOException $e) {
        // Report on stderr so the message does not end up in the XML stream, then stop
        fwrite(STDERR, 'Connection failed: ' . $e->getMessage() . "\n");
        exit(1);
    }
    // We will use PHP's inbuilt XMLWriter to create the xml structure
    $xmlwriter = new XMLWriter();
    $xmlwriter->openMemory();
    $xmlwriter->setIndent(true);
    $xmlwriter->startDocument('1.0', 'UTF-8');
    // Start the parent docset element
    $xmlwriter->startElement('sphinx:docset');
    // Start the element for schema definition
    $xmlwriter->startElement('sphinx:schema');
    // Start the element for title field
    $xmlwriter->startElement('sphinx:field');
    $xmlwriter->writeAttribute("name", "title");
    $xmlwriter->endElement(); // end field
    // Start the element for content field
    $xmlwriter->startElement('sphinx:field');
    $xmlwriter->writeAttribute("name", "content");
    $xmlwriter->endElement(); // end field
    $xmlwriter->endElement(); // end schema
    // Query to get all posts from the database
    $sql = "SELECT id, title, content FROM posts";
    // Run a loop and put the post data in XML
    foreach ($dbh->query($sql) as $post) {
        // Start the element for holding the actual document (post)
        $xmlwriter->startElement('sphinx:document');
        // Add the id attribute
        $xmlwriter->writeAttribute("id", $post['id']);
        // Set value for the title field
        $xmlwriter->startElement('title');
        $xmlwriter->text($post['title']);
        $xmlwriter->endElement(); // end title
        // Set value for the content field
        $xmlwriter->startElement('content');
        $xmlwriter->text($post['content']);
        $xmlwriter->endElement(); // end content
        $xmlwriter->endElement(); // end document
    }
    $xmlwriter->endElement(); // end docset
    // Output the xml
    print $xmlwriter->flush();
    ?>
    
  3. Run the indexer to create the index:
    $ /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx-blog-xmlpipe2.conf --all
    
  4. Test the index by searching for "programming":
    $ /usr/local/sphinx/bin/search --config /usr/local/sphinx/etc/sphinx-blog-xmlpipe2.conf programming
    

What just happened?

The xmlpipe2 data source requires the xmlpipe_command option, which specifies the command that Sphinx executes and that streams the XML to its stdout. In our case, we are using the PHP script /home/abbas/sphinx/makeindex.php to create the well-formed XML. This script is executed using the PHP CLI located at /usr/bin/php.

Note

You may put the PHP script anywhere on your file system. Just make sure to use the correct path in your configuration file.

Tip

To determine the path to PHP CLI, you can issue the following command:

$ which php

No other option is required in the data source configuration if we are specifying the schema in the XML itself.

Next we created the PHP script which outputs the well-formed XML. In the script we first connected to the database and retrieved all posts using the PHP 5 native PDO driver. We created the XML structure with the help of PHP's XMLWriter class.
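The script above builds the whole document set in memory and prints it at the end. For a large table it may be preferable to write the XML to standard output as it is generated. The following is only a minimal sketch of that idea, not the book's script: the function name streamDocset and the $flushEvery parameter are hypothetical, and $rows is assumed to be anything that can be iterated to yield rows with id, title, and content keys (for example, the result of $dbh->query($sql)).

```php
<?php
// Hypothetical variant of makeindex.php: writes the docset straight to
// standard output instead of accumulating it in memory.
// $rows: any iterable yielding arrays with 'id', 'title' and 'content' keys.
// $flushEvery: assumed tuning knob, flush the writer after this many documents.
function streamDocset($rows, $flushEvery = 100)
{
    $w = new XMLWriter();
    $w->openUri('php://output');          // stdout when run from the PHP CLI
    $w->setIndent(true);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('sphinx:docset');

    // In-stream schema: two full-text fields, as in the original script
    $w->startElement('sphinx:schema');
    foreach (array('title', 'content') as $field) {
        $w->startElement('sphinx:field');
        $w->writeAttribute('name', $field);
        $w->endElement();
    }
    $w->endElement(); // end schema

    $count = 0;
    foreach ($rows as $post) {
        $w->startElement('sphinx:document');
        $w->writeAttribute('id', (string) $post['id']);
        $w->writeElement('title', (string) $post['title']);
        $w->writeElement('content', (string) $post['content']);
        $w->endElement(); // end document
        if (++$count % $flushEvery === 0) {
            $w->flush(); // push the buffered XML out
        }
    }

    $w->endElement(); // end docset
    $w->endDocument();
    $w->flush();
    return $count; // number of documents written
}
```

Called as streamDocset($dbh->query($sql)), this should emit the same elements as the original script. Whether the lower memory use matters depends on the size of the posts table.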

Note

An explanation of how the PHP code works is beyond the scope of this book. Please refer to the PHP Manual (http://www.php.net/manual/) for more details.

The output (text truncated for brevity) of our PHP script looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="title"/>
<sphinx:field name="content"/>
</sphinx:schema>
<sphinx:document id="1">
<title>Electronics For You</title>
<content>EFY- Electronics For You is a magazine for people with a passion for Electronics and Technology...</content>
</sphinx:document>
<sphinx:document id="2">
<title>What is PHP?</title>
<content>PHP Hypertext Preprocessor...</content>
</sphinx:document>
<!-- ... remaining documents here ... -->
</sphinx:docset>

The XML structure is largely self-explanatory. We specified the schema (the fields and attributes to be added to the index) at the top using the <sphinx:schema> element. The schema must be declared before any document is parsed. Arbitrary fields and attributes are allowed, and they can occur in the stream in arbitrary order within each document.

Note

<sphinx:schema> is only allowed to occur as the very first sub-element in <sphinx:docset>. However, it is optional and can be omitted if the settings are defined in the configuration file.

If the schema is already declared in the configuration file then there's no need to declare it in the XML structure. In-stream schema definition takes precedence and if there is no in-stream definition, then settings from the configuration file will be used.

Any unknown XML tags, that is, tags declared neither as fields nor as attributes, are ignored and won't make it into the index.
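Because undeclared elements are dropped without any effect on the index, a misspelled field name can easily go unnoticed. A small pre-flight check along the following lines may help. It is only a sketch: the function name undeclaredKeys is hypothetical (it is not part of Sphinx), and rows are assumed to be associative arrays, for example fetched with PDO::FETCH_ASSOC.

```php
<?php
// Hypothetical helper: returns the keys of $row that are neither the document
// id nor a declared field, i.e. the elements that xmlpipe2 would ignore if
// they were written inside <sphinx:document>.
function undeclaredKeys(array $row, array $declaredFields)
{
    $known = array_merge(array('id'), $declaredFields);
    // array_diff keeps the original positions as keys, so reindex the result
    return array_values(array_diff(array_keys($row), $known));
}
```

For example, with title and content declared as fields, a row that also carries an author column would be reported as containing one undeclared key, author.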

The following are the XML elements (tags) used in the previous code snippet. They are recognized by xmlpipe2:

  • sphinx:docset: Mandatory top-level element. It contains the document set for xmlpipe2.
  • sphinx:schema: Optional, must occur as the first child of docset or never occur at all. It contains field and attribute declarations, and defines the document schema. It overrides the settings from the configuration file.
  • sphinx:field: Optional, child of sphinx:schema. It declares a full-text field and its only recognized attribute is name, which specifies the element name that should be treated as a full-text field in the subsequent documents.
  • sphinx:document: Mandatory element that holds the actual data to be indexed. It must be a child of the sphinx:docset element. This element can contain arbitrary sub-elements with field and attribute values to be indexed (as declared either in sphinx:schema or in the configuration file). Its only compulsory attribute is id, which must contain the unique integer document ID.
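The element hierarchy listed above can be sketched as a small function that builds a docset from plain arrays, which makes the structure easy to inspect without a database. This is an illustration only: the function name buildDocset and its parameters are hypothetical, and the check that every ID is a unique positive integer is an assumption derived from the id requirement described above.

```php
<?php
// Hypothetical helper (not part of Sphinx): builds an xmlpipe2 docset.
// $fields: names of the full-text fields to declare in the schema.
// $rows: arrays with an 'id' key plus one key per declared field.
function buildDocset(array $fields, array $rows)
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->setIndent(true);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('sphinx:docset');        // mandatory top-level element

    $w->startElement('sphinx:schema');        // first child of docset
    foreach ($fields as $name) {
        $w->startElement('sphinx:field');     // one declaration per field
        $w->writeAttribute('name', $name);
        $w->endElement();
    }
    $w->endElement(); // end schema

    $seen = array();
    foreach ($rows as $row) {
        $id = (int) $row['id'];
        // Assumption: reject IDs that are not positive or that repeat
        if ($id <= 0 || isset($seen[$id])) {
            throw new InvalidArgumentException('Invalid or duplicate document id: ' . $row['id']);
        }
        $seen[$id] = true;

        $w->startElement('sphinx:document');
        $w->writeAttribute('id', (string) $id);
        foreach ($fields as $name) {
            // Missing values are written as empty elements
            $value = isset($row[$name]) ? (string) $row[$name] : '';
            $w->writeElement($name, $value);
        }
        $w->endElement(); // end document
    }

    $w->endElement(); // end docset
    $w->endDocument();
    return $w->outputMemory();
}
```

A call such as buildDocset(array('title', 'content'), $rows) returns the XML as a string, which can then be printed by the script named in xmlpipe_command.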

Once the index has been created, we search it in the usual way using the command line utility.

Note

We don't get extra information, such as the title, in the search results output because that was specific to the SQL data source. There, sql_query_info was used to fetch the extra information, and that option cannot be used with the xmlpipe2 data source.
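If such details are needed, one option is to look up the matching document IDs in the original database yourself. The sketch below assumes the posts table used by the indexing script and a list of IDs already obtained from a search. The function name fetchTitles is hypothetical and is not something Sphinx provides.

```php
<?php
// Hypothetical helper: maps document IDs returned by a search to post titles.
// Assumes $dbh is a connected PDO handle and that the posts table has
// id and title columns, as in the indexing script.
function fetchTitles(PDO $dbh, array $ids)
{
    if (count($ids) === 0) {
        return array(); // nothing to look up
    }
    // One positional placeholder per ID keeps the query parameterized
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $dbh->prepare("SELECT id, title FROM posts WHERE id IN ($placeholders)");
    $stmt->execute(array_values($ids));
    // FETCH_KEY_PAIR returns an array of id => title
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```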

Now, let's see how to define attributes in sphinx:schema so that they are stored in our index as well.
