Time for action - create the index

  1. Create the Sphinx configuration file at /usr/local/sphinx/etc/feeds.conf with the following content:
    source feeds
    {
        type                   = xmlpipe2
        xmlpipe_command        = /usr/bin/php /path/to/webroot/feeds/makeindex.php
        xmlpipe_field          = title
        xmlpipe_field          = description
        xmlpipe_field          = author
        xmlpipe_attr_timestamp = pub_date
        xmlpipe_attr_multi     = category_id
    }

    index feed-items
    {
        source       = feeds
        path         = /usr/local/sphinx/var/data/feed-items
        charset_type = utf-8
    }

    indexer
    {
        mem_limit = 64M
    }
    
    
  2. Create the PHP script, /path/to/webroot/feeds/makeindex.php, to stream the XML required for indexing:
    <?php
    require('init.php');
    require('simplepie.inc');

    // Instantiate the SimplePie class
    // We will use SimplePie to parse the feed XML
    $feed = new SimplePie();
    // We don't want to cache feed items
    $feed->enable_cache(false);
    $feed->set_timeout(30);

    // We will use PHP's built-in XMLWriter to create the XML structure
    $xmlwriter = new XMLWriter();
    $xmlwriter->openMemory();
    $xmlwriter->setIndent(true);
    $xmlwriter->startDocument('1.0', 'UTF-8');
    // Start the parent docset element
    $xmlwriter->startElement('sphinx:docset');

    // Select all feeds from the database
    $query = "SELECT * FROM feeds";
    $feeds = $dbh->query($query);

    // Loop over all feeds and fetch their items
    foreach ($feeds as $row) {
        // Fetch the feed
        $feed->set_feed_url($row['url']);
        $feed->init();

        // Fetch all items of this feed
        foreach ($feed->get_items() as $item) {
            $id = $item->get_id(true);
            $query = "INSERT INTO items (title, guid, link, pub_date) VALUES (?, ?, ?, ?)";
            $stmt = $dbh->prepare($query);
            // Params to be bound in the SQL
            $params = array(
                $item->get_title(),
                $id,
                $item->get_permalink(),
                $item->get_date('Y-m-d H:i:s'),
            );
            $stmt->execute($params);

            // Start the element for holding the actual document (item)
            $xmlwriter->startElement('sphinx:document');
            // Add the id attribute which will be the id of the last
            // record inserted in the items table.
            $xmlwriter->writeAttribute("id", $dbh->lastInsertId());

            // Set value for the title field
            $xmlwriter->startElement('title');
            $xmlwriter->text($item->get_title());
            $xmlwriter->endElement(); // end title

            // Set value for the description field
            $xmlwriter->startElement('description');
            $xmlwriter->text($item->get_description());
            $xmlwriter->endElement(); // end description

            // Set value for the author field
            $xmlwriter->startElement('author');
            // If we have the author name then get it,
            // else it will be an empty string
            if ($item->get_author()) {
                $author = $item->get_author()->get_name();
            } else {
                $author = '';
            }
            $xmlwriter->text($author);
            $xmlwriter->endElement(); // end author

            // Set value for the pub_date attribute (Unix timestamp)
            $xmlwriter->startElement('pub_date');
            $xmlwriter->text($item->get_date('U'));
            $xmlwriter->endElement(); // end attribute

            // Get all categories of this item
            $categories = $item->get_categories();
            $catIds = array();
            // If we have categories then insert them in the database
            if ($categories) {
                foreach ($categories as $category) {
                    $catName = $category->get_label();
                    $stmt = $dbh->prepare(
                        "INSERT INTO categories (name) VALUES (?)");
                    $stmt->execute(array($catName));
                    $catIds[] = $dbh->lastInsertId();
                }
            }

            // Set value for the category_id attribute
            // Multiple category ids should be comma separated
            $xmlwriter->startElement('category_id');
            $xmlwriter->text(implode(',', $catIds));
            $xmlwriter->endElement(); // end attribute

            $xmlwriter->endElement(); // end document
        }
    }

    $xmlwriter->endElement(); // end docset
    // Output the XML
    print $xmlwriter->flush();
    
  3. Run the indexer command to create the index (as root):
    $ /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/feeds.conf feed-items
    
    
  4. Test the index from the command line:
    $ /usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/feeds.conf development
    
    

    Note

    You will get a different set of results depending on the feeds you added and the current items in those feeds.


What just happened?

As always, the first thing we did was to create the Sphinx configuration file. We defined the source, index, and indexer blocks with necessary options.

To index the feed items, we used the xmlpipe2 data source. We chose xmlpipe2 over an SQL data source because the data comes from an unconventional source (feeds), and we do not store the item descriptions in our database.

We defined the fields and attributes in the configuration file. The following fields and attributes will be created in the index:

  • title: Full-text field to hold the title of the feed item
  • description: Full-text field to hold the description of the feed item
  • author: Full-text field to hold the author name
  • pub_date: Timestamp attribute to hold the publish date of the feed item
  • category_id: MVA attribute to hold categories associated with a feed item

The XML is streamed by a PHP script located at /path/to/webroot/feeds/makeindex.php. The index is saved at /usr/local/sphinx/var/data/feed-items.

After that, we created the PHP script makeindex.php, which streams the XML required by Sphinx to index the feed data. We used SimplePie to fetch the feeds and parse them into PHP objects so that we can loop over the data and save it in our database.

In makeindex.php, we wrote code to fetch all feed URLs stored in the feeds database table and then fetch each feed one by one. For every item, we store the title, guid (a unique identifier that comes with the item in the feed XML), link, and publish date in the items table. We need this to show the item title and link on our search results page.

We are also storing the categories associated with the items. We will need the category names to build the drop-down on the search page (for filtering purposes).

Apart from storing the data in the database, we also created the XML structure and data, as required by Sphinx, to create the index. This XML is then flushed at the end of the script.
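To make the target format concrete, here is a minimal sketch that emits a single document using the same XMLWriter approach as makeindex.php. The function name buildSampleDocset and all of the item values are made-up placeholders for illustration, not real feed data, and writeElement() is used as a shorthand for the startElement()/text()/endElement() sequence.

```php
<?php
// Sketch only: build a docset containing one document with hard-coded
// sample values. Every value below is a made-up placeholder.
function buildSampleDocset(): string
{
    $xmlwriter = new XMLWriter();
    $xmlwriter->openMemory();
    $xmlwriter->setIndent(true);
    $xmlwriter->startDocument('1.0', 'UTF-8');
    $xmlwriter->startElement('sphinx:docset');

    // One document per feed item; the id must be unique within the index
    $xmlwriter->startElement('sphinx:document');
    $xmlwriter->writeAttribute('id', '1');

    // Full-text fields declared with xmlpipe_field
    $xmlwriter->writeElement('title', 'Sample item title');
    $xmlwriter->writeElement('description', 'Sample description & text');
    $xmlwriter->writeElement('author', 'Jane Doe');

    // Timestamp attribute: a Unix timestamp
    $xmlwriter->writeElement('pub_date', '1262304000');
    // MVA attribute: comma-separated category ids
    $xmlwriter->writeElement('category_id', implode(',', array(3, 7)));

    $xmlwriter->endElement(); // end document
    $xmlwriter->endElement(); // end docset
    $xmlwriter->endDocument();

    return $xmlwriter->outputMemory();
}

print buildSampleDocset();
```

Running this sketch from the command line is a quick way to see the shape of the stream that the indexer reads from xmlpipe_command. Note that XMLWriter escapes special characters such as & in the text for us.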

Everything looks fine, right? Well, there is one flaw in our PHP script that streams the XML. We are going to run the indexer once every day to fetch feed items. However, what if the same items are returned on consecutive runs? What if the same category name is used by different items? We certainly don't want to duplicate the data. Let's see how to rectify this.
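Taking the second question first: repeated category names can be avoided by looking a name up before inserting it. The following is a minimal sketch of that idea, not necessarily the approach used later in this chapter. It assumes the pdo_sqlite extension is available and uses an in-memory SQLite database in place of the real one; the helper name getCategoryId and the simplified categories table are hypothetical.

```php
<?php
// Sketch only: in-memory SQLite stands in for the real database, and the
// categories table is a simplified, assumed schema.
$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec('CREATE TABLE categories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE)');

// Hypothetical helper: return the id of the category with this name,
// inserting a new row only when the name is not stored yet.
function getCategoryId(PDO $dbh, string $name): int
{
    $stmt = $dbh->prepare('SELECT id FROM categories WHERE name = ?');
    $stmt->execute(array($name));
    $id = $stmt->fetchColumn();
    if ($id !== false) {
        return (int) $id;
    }

    $stmt = $dbh->prepare('INSERT INTO categories (name) VALUES (?)');
    $stmt->execute(array($name));
    return (int) $dbh->lastInsertId();
}

// The label "PHP" appears twice but is stored only once.
$catIds = array();
foreach (array('PHP', 'Search', 'PHP') as $label) {
    $catIds[] = getCategoryId($dbh, $label);
}
// Drop repeated ids before building the MVA value
$catIds = array_values(array_unique($catIds));

print implode(',', $catIds); // prints 1,2
```

In makeindex.php, a helper like this could replace the unconditional INSERT inside the categories loop, so that $catIds always refers to existing rows.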

Check for duplicate items

Let's modify the script that streams the XML so that it leaves out any item that has already been indexed. For this we will use the unique guid field: before considering an item for indexing, we check whether an item with the same guid already exists in the items table. Let's do it.
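One way such a guid check might look is sketched below. This is an illustration under stated assumptions rather than a definitive implementation: it assumes the pdo_sqlite extension and uses an in-memory SQLite database with a simplified items table, and the helper names itemExists and storeItemOnce, as well as the sample item, are hypothetical.

```php
<?php
// Sketch only: in-memory SQLite stands in for the real database, and the
// items table is a simplified, assumed schema.
$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec('CREATE TABLE items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT, guid TEXT, link TEXT, pub_date TEXT)');

// Hypothetical helper: true if an item with this guid is already stored.
function itemExists(PDO $dbh, string $guid): bool
{
    $stmt = $dbh->prepare('SELECT 1 FROM items WHERE guid = ? LIMIT 1');
    $stmt->execute(array($guid));
    return $stmt->fetchColumn() !== false;
}

// Hypothetical helper: insert the item unless its guid is already known.
// Returns the new row id, or null when the item was skipped.
function storeItemOnce(PDO $dbh, array $item): ?int
{
    if (itemExists($dbh, $item['guid'])) {
        return null;
    }
    $stmt = $dbh->prepare(
        'INSERT INTO items (title, guid, link, pub_date) VALUES (?, ?, ?, ?)');
    $stmt->execute(array(
        $item['title'],
        $item['guid'],
        $item['link'],
        $item['pub_date'],
    ));
    return (int) $dbh->lastInsertId();
}

// Made-up sample item, stored twice to simulate two consecutive runs
$item = array(
    'title'    => 'Sample item',
    'guid'     => 'abc123',
    'link'     => 'http://example.com/sample',
    'pub_date' => '2010-01-01 00:00:00',
);
$first  = storeItemOnce($dbh, $item); // new row id on the first run
$second = storeItemOnce($dbh, $item); // null on the second run
```

In the item loop of makeindex.php, a null result would be the signal to continue to the next item before any sphinx:document element is started, so that a skipped item is also left out of the XML.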
