/usr/local/sphinx/etc/feeds.conf
with the following content:

source feeds
{
  type = xmlpipe2
  xmlpipe_command = /usr/bin/php /path/to/webroot/feeds/makeindex.php
  xmlpipe_field = title
  xmlpipe_field = description
  xmlpipe_field = author
  xmlpipe_attr_timestamp = pub_date
  xmlpipe_attr_multi = category_id
}

index feed-items
{
  source = feeds
  path = /usr/local/sphinx/var/data/feed-items
  charset_type = utf-8
}

indexer
{
  mem_limit = 64M
}
/path/to/webroot/feeds/makeindex.php
, to stream the XML required for indexing:

<?php
require('init.php');
require('simplepie.inc');

// Instantiate the simplepie class
// We will use simplepie to parse the feed xml
$feed = new SimplePie();

// We don't want to cache feed items
$feed->enable_cache(false);
$feed->set_timeout(30);

// We will use PHP's inbuilt XMLWriter to create the xml structure
$xmlwriter = new XMLWriter();
$xmlwriter->openMemory();
$xmlwriter->setIndent(true);
$xmlwriter->startDocument('1.0', 'UTF-8');

// Start the parent docset element
$xmlwriter->startElement('sphinx:docset');

// Select all feeds from database
$query = "SELECT * FROM feeds";
$feeds = $dbh->query($query);

// Run a loop on all feeds and fetch the items
foreach ($feeds as $row) {
  // Fetch the feed
  $feed->set_feed_url($row['url']);
  $feed->init();

  // Fetch all items of this feed
  foreach ($feed->get_items() as $item) {
    $id = $item->get_id(true);

    $query = "INSERT INTO items (title, guid, link, pub_date)
              VALUES (?, ?, ?, ?)";
    $stmt = $dbh->prepare($query);

    // Params to be bound in the sql
    $params = array(
      $item->get_title(),
      $id,
      $item->get_permalink(),
      $item->get_date('Y-m-d H:i:s'),
    );
    $stmt->execute($params);

    // Start the element for holding the actual document (item)
    $xmlwriter->startElement('sphinx:document');

    // Add the id attribute which will be the id of the last
    // record inserted in the items table.
    $xmlwriter->writeAttribute("id", $dbh->lastInsertId());

    // Set value for the title field
    $xmlwriter->startElement('title');
    $xmlwriter->text($item->get_title());
    $xmlwriter->endElement(); // end title

    // Set value for the description field
    $xmlwriter->startElement('description');
    $xmlwriter->text($item->get_description());
    $xmlwriter->endElement(); // end description

    // Set value for the author field
    $xmlwriter->startElement('author');
    // If we have the author name then get it
    // else it will be empty string
    if ($item->get_author()) {
      $author = $item->get_author()->get_name();
    } else {
      $author = '';
    }
    $xmlwriter->text($author);
    $xmlwriter->endElement(); // end author

    // Set value for the pub_date attribute
    $xmlwriter->startElement('pub_date');
    $xmlwriter->text($item->get_date('U'));
    $xmlwriter->endElement(); // end attribute

    // Get all categories of this item
    $categories = $item->get_categories();
    $catIds = array();

    // If we have categories then insert them in database
    if ($categories) {
      // Insert the categories
      foreach ($categories as $category) {
        $catName = $category->get_label();
        $stmt = $dbh->prepare(
          "INSERT INTO categories (name) VALUES (?)");
        $stmt->execute(array($catName));
        $catIds[] = $dbh->lastInsertId();
      }
    }

    // Set value for the category_id attribute
    // Multiple category ids should be comma separated
    $xmlwriter->startElement('category_id');
    $xmlwriter->text(implode(',', $catIds));
    $xmlwriter->endElement(); // end attribute

    $xmlwriter->endElement(); // end document
  }
}

$xmlwriter->endElement(); // end docset

// Output the xml
print $xmlwriter->flush();
indexer
command to create the index (as root):

$ /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/feeds.conf feed-items
$ /usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/feeds.conf development
As always, the first thing we did was to create the Sphinx configuration file. We defined the source, index, and indexer blocks with the necessary options.
For indexing the feed items we used the xmlpipe2 data source. We chose xmlpipe2 over an SQL data source because the data comes from a non-conventional source (a feed), and we do not store the item descriptions in our database.
We defined the fields and attributes in the configuration file. The following fields and attributes will be created in the index:
title: Full-text field to hold the title of the feed item
description: Full-text field to hold the description of the feed item
author: Full-text field to hold the author name
pub_date: Timestamp attribute to hold the publish date of the feed item
category_id: MVA attribute to hold the categories associated with a feed item

The XML will be streamed by a PHP script at /path/to/webroot/feeds/makeindex.php, and the index will be saved at /usr/local/sphinx/var/data/feed-items.
After that, we created the PHP script makeindex.php, which streams the XML required by Sphinx to index the feed data. We used SimplePie to fetch the feeds and parse them into PHP objects so that we could loop over the data and save it in our database.
In makeindex.php, we wrote code to fetch all feed URLs stored in the feeds database table and then fetch each feed one by one. For every item we store the title, guid (a unique identifier that comes with the item in the feed XML), link, and publish date in the items table. We need this to show the item title and link on our search results page.
We also store the categories associated with the items. We will need the category names to build the drop-down on the search page (for filtering purposes).
Apart from storing the data in the database, we also created the XML structure and data, as required by Sphinx, to create the index. This XML is then flushed at the end of the script.
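To make the shape of that XML concrete, here is a minimal sketch that builds a docset containing a single document. The item values are hypothetical and hard-coded (in the real script they come from SimplePie and the database); only the element names match the fields and attributes declared in feeds.conf.

```php
<?php
// Hypothetical sample item; real values would come from SimplePie
// and from the id of the row inserted in the items table.
$item = array(
    'id'          => 1,
    'title'       => 'Sphinx & PHP',
    'description' => 'Indexing feeds with xmlpipe2',
    'author'      => 'Jane Doe',
    'pub_date'    => 1300000000,   // Unix timestamp
    'category_id' => array(3, 7),  // MVA values
);

$xmlwriter = new XMLWriter();
$xmlwriter->openMemory();
$xmlwriter->setIndent(true);
$xmlwriter->startDocument('1.0', 'UTF-8');

// Parent docset element
$xmlwriter->startElement('sphinx:docset');

// One document per feed item, identified by its id attribute
$xmlwriter->startElement('sphinx:document');
$xmlwriter->writeAttribute('id', (string) $item['id']);

// Full-text fields (XMLWriter escapes special characters such as &)
$xmlwriter->writeElement('title', $item['title']);
$xmlwriter->writeElement('description', $item['description']);
$xmlwriter->writeElement('author', $item['author']);

// Attributes: a timestamp and a comma-separated list for the MVA
$xmlwriter->writeElement('pub_date', (string) $item['pub_date']);
$xmlwriter->writeElement('category_id', implode(',', $item['category_id']));

$xmlwriter->endElement(); // end document
$xmlwriter->endElement(); // end docset

// Collect the buffered XML and print it
$xml = $xmlwriter->outputMemory();
print $xml;
```

Running this script from the command line prints the XML to standard output, which is how the command named in xmlpipe_command hands the stream to the indexer.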
Everything looks fine, right? Well, there is one flaw in our PHP script which streams the XML. We are going to run the indexer once every day to fetch feed items. However, what if the same items are returned on consecutive runs? What if the same category name is used by different items? We certainly don't want to duplicate the data. Let's see how to rectify this.
Let's modify the file which streams the XML so that it does not include an item that has already been indexed. For this we will use the unique guid field: before considering an item for indexing, we check whether an item with the same guid already exists in the items table. Let's do it.
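The guid check described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: itemExists() is a hypothetical helper name, an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for the real connection created in init.php, and the items table is reduced to the columns the check needs.

```php
<?php
// Hypothetical helper: returns true if an item with this guid
// is already stored in the items table.
function itemExists(PDO $dbh, $guid)
{
    $stmt = $dbh->prepare('SELECT COUNT(*) FROM items WHERE guid = ?');
    $stmt->execute(array($guid));
    return (int) $stmt->fetchColumn() > 0;
}

// Assumption: in-memory SQLite replaces the real database here.
$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec('CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, guid TEXT)');

// Simulated guids from two consecutive runs; 'guid-1' repeats.
$incoming = array('guid-1', 'guid-2', 'guid-1');
$indexed  = array();

foreach ($incoming as $guid) {
    if (itemExists($dbh, $guid)) {
        continue; // already stored and indexed, so skip it
    }
    $stmt = $dbh->prepare('INSERT INTO items (title, guid) VALUES (?, ?)');
    $stmt->execute(array('Title for ' . $guid, $guid));
    $indexed[] = $guid; // only new items would be written to the XML
}
```

The same pattern applies to category names: look the name up first and insert a new row only when it is missing.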