Integrating MongoDB for full text search with Elasticsearch

MongoDB has integrated text search features, as we saw in the previous recipe. However, there are several reasons why one might not use the Mongo text search feature and instead fall back to a conventional search engine such as Solr or Elasticsearch. The following are a few of them:

  • The text search feature became production ready in version 2.6. In version 2.4, it was introduced as a beta feature and was not suitable for production use cases.
  • Products like Solr and Elasticsearch are built on top of Lucene, which has proven itself in the search engine arena. Solr and Elasticsearch are pretty stable products too.
  • You might already have expertise with products like Solr and Elasticsearch and would like to use one of them as the full text search engine rather than MongoDB.
  • Your application might require a particular feature that is missing in MongoDB search, for example, facets.

Setting up a dedicated search engine does require additional effort to integrate it with a MongoDB instance. In this recipe, we will see how to integrate a MongoDB instance with the Elasticsearch search engine.

We will be using the mongo-connector for integration purposes. It is an open source project that is available at https://github.com/10gen-labs/mongo-connector.

Getting ready

Refer to the recipe Connecting to a single node using a Python client, in Chapter 1, Installing and Starting the Server, for installing and setting up the Python client. The pip tool is used to get the mongo-connector. However, if you are working on the Windows platform, the steps to install pip were not mentioned earlier; visit https://sites.google.com/site/pydatalog/python/pip-for-windows to get pip for Windows.
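
If you would like to confirm that the Python environment is in place before moving on, a minimal check such as the following can be used; it only assumes that PyMongo has been installed as described in the Chapter 1 recipe.

    # A quick sanity check: PyMongo should import cleanly and report its version.
    import pymongo
    print('PyMongo version: ' + pymongo.version)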

The prerequisites for starting the single instance are all we need for this recipe. We will, however, start the server as a one-node replica set for demonstration purposes in this recipe.

Download the file BlogEntries.json from the Packt site and keep it on your local drive ready to be imported.

Download Elasticsearch for your target platform from http://www.elasticsearch.org/overview/elkdownloads/. Extract the downloaded archive and, from the shell, go to the bin directory of the extracted contents.

We will get the mongo-connector source from GitHub.com and run it. A Git client is needed for this purpose. Download and install the Git client on your machine; visit http://git-scm.com/downloads and follow the instructions for installing Git on your target operating system. If you are not comfortable installing Git on your operating system, there is an alternative that lets you download the source as an archive.

Visit https://github.com/10gen-labs/mongo-connector. Here, we get an option to download the current source as an archive, which we can then extract on our local drive. The download option is available in the bottom-right corner of the page.

Note

Note that we can also install mongo-connector in a very easy way using pip as follows:

pip install mongo-connector

However, the version in PyPI is quite old and does not support many features, and thus using the latest version from the repository is recommended.

As in the previous recipe, where we saw text search in Mongo, we will use the same five documents to test our simple search. Download the BlogEntries.json file and keep it ready for import.

How to do it…

  1. At this point, it is assumed that Python, PyMongo, and pip are installed for your operating system platform. We will now get mongo-connector from source. If you have already installed the Git client, execute the following in the operating system shell. If you have decided to download the repository as an archive, you may skip this step. Go to the directory where you would like to clone the connector repository and execute the following:
    $ git clone https://github.com/10gen-labs/mongo-connector.git
    $ cd mongo-connector
    $ python setup.py install
    
  2. The preceding setup will also install the Elasticsearch client that will be used by this application.
  3. We will now start a single mongo instance but as a replica set. From the operating system console, execute the following:
    $  mongod --dbpath /data/mongo/db --replSet textSearch --smallfiles --oplogSize 50
    
  4. Start a mongo shell and connect to the started instance:
    $ mongo
    
  5. From the mongo shell initiate the replica set as follows:
    > rs.initiate()
    
  6. The replica set will be initiated in a few moments. Meanwhile, we can proceed with starting the Elasticsearch server instance.
  7. Go to the bin directory of the extracted Elasticsearch archive and execute the following from the command line:
    $ elasticsearch
    
  8. We won't be getting into the Elasticsearch settings, and we will start it in the default mode.
  9. Once started, enter the following URL in the browser: http://localhost:9200/_nodes/process?pretty.
  10. If we see a JSON document like the following, giving the process details, we have successfully started Elasticsearch. (A scripted version of this check, together with a check of the replica set initiated in step 5, is sketched after this list.)
    {
      "cluster_name" : "elasticsearch",
      "nodes" : {
        "p0gMLKzsT7CjwoPdrl-unA" : {
          "name" : "Zaladane",
          "transport_address" : "inet[/192.168.2.3:9300]",
          "host" : "Amol-PC",
          "ip" : "192.168.2.3",
          "version" : "1.0.1",
          "build" : "5c03844",
          "http_address" : "inet[/192.168.2.3:9200]",
          "process" : {
            "refresh_interval" : 1000,
            "id" : 5628,
            "max_file_descriptors" : -1,
            "mlockall" : false
          }
        }
      }
    }
    
  11. Once the Elasticsearch server and the mongo instance are up and running, and the necessary Python libraries are installed, we will start the connector that will sync the data between the started mongo instance and the Elasticsearch server. For the sake of this test, we will be using the user_blog collection in the test database. The field on which we would like text search implemented is blog_text.
  12. Start the mongo-connector from the operating system shell as follows. The command is executed with the mongo-connector directory as the current directory.
    $ python mongo_connector/connector.py -m localhost:27017 -t http://localhost:9200 -n test.user_blog --fields blog_text -d mongo_connector/doc_managers/elastic_doc_manager.py
    
  13. Import the BlogEntries.json file into the collection using the mongoimport utility as follows. The command is executed with the .json file present in the current directory.
    $ mongoimport -d test -c user_blog BlogEntries.json --drop
    
  14. Open a browser of your choice and enter the following URL in it: http://localhost:9200/_search?q=blog_text:facebook.
  15. You should see the search results returned by Elasticsearch as a JSON document in the browser, with the matching documents listed under hits.
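
If you prefer to verify steps 5 to 10 from a script rather than through the mongo shell and the browser, the following sketch performs both checks from Python. It assumes PyMongo is installed and that mongod and Elasticsearch are running on their default ports on localhost; the HTTP call uses only the standard library.

    # A scripted version of the manual checks: confirm that the one-node replica
    # set is initiated and that Elasticsearch answers on its default HTTP port.
    import json
    from pymongo import MongoClient

    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen           # Python 2

    # Replica set status: the single member should report itself as PRIMARY.
    client = MongoClient('localhost', 27017)
    status = client.admin.command('replSetGetStatus')
    print(status['set'] + ': ' + status['members'][0]['stateStr'])

    # Elasticsearch node information: the same document we checked in the browser.
    nodes = json.loads(urlopen('http://localhost:9200/_nodes/process').read().decode('utf-8'))
    print('Elasticsearch cluster: ' + nodes['cluster_name'])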

How it works…

Mongo-connector basically tails the oplog to find new updates, which it publishes to another endpoint. We used Elasticsearch in our case, but it could even be Solr. You may choose to write a custom DocManager that plugs into the connector. Refer to the wiki at https://github.com/10gen-labs/mongo-connector/wiki for more details; the readme at https://github.com/10gen-labs/mongo-connector also gives some detailed information.
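
To make the oplog-tailing idea concrete, here is a bare-bones sketch of such a tailing loop written with PyMongo; it is not the connector's actual code, and it assumes PyMongo 3 or newer (for the CursorType constant) and a replica set member on localhost with the default oplog collection local.oplog.rs.

    # A minimal oplog tail: read new entries from local.oplog.rs and hand them
    # to whatever external system needs them. mongo-connector and the
    # Elasticsearch river are built around this same idea.
    import time
    from pymongo import MongoClient, CursorType

    client = MongoClient('localhost', 27017)
    oplog = client.local['oplog.rs']

    # Start from the newest entry so that only operations from now on are seen.
    last = oplog.find().sort('$natural', -1).limit(1).next()
    ts = last['ts']

    while True:
        # A tailable, awaitable cursor blocks for a short while waiting for new
        # oplog entries instead of returning immediately at the end of the data.
        cursor = oplog.find({'ts': {'$gt': ts}, 'ns': 'test.user_blog'},
                            cursor_type=CursorType.TAILABLE_AWAIT)
        for entry in cursor:
            ts = entry['ts']
            # entry['op'] is 'i' (insert), 'u' (update), or 'd' (delete), and
            # entry['o'] holds the document; this is the point at which a
            # DocManager would forward the change to Elasticsearch.
            print(entry['op'], entry.get('o'))
        time.sleep(1)

The actual connector additionally records the last timestamp it has processed so that it can resume from where it left off after a restart.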

We gave the connector the -m, -t, -n, --fields, and -d options; what they mean is explained as follows:

-m: The URL of the MongoDB host to which the connector connects to get the data to be synchronized.

-t: The target URL of the system with which the data is to be synchronized, Elasticsearch in this case. The URL format depends on the target system. Should you choose to implement your own DocManager, the format will be one that your DocManager understands.

-n: The namespace that we would like to keep synchronized with the external system. The connector will only look for changes in these namespaces while tailing the oplog. The value is comma separated if more than one namespace is to be synchronized.

--fields: The fields from the document that will be sent to the external system. In our case, it doesn't make sense to index the entire document and waste resources; it is recommended to index just the fields for which you would like text search support. The _id identifier and the namespace of the source document are also present in the search results returned in the browser. The _id field can then be used to query the collection for the complete document.

-d: The document manager to be used; in our case, the Elasticsearch document manager.

For more supported options, refer to the readme of the connector's page on GitHub.

Once the insert is executed on the MongoDB server, the connector detects the newly added documents in the collection of its interest, user_blog, and starts sending the data to be indexed from the newly added documents to Elasticsearch. To confirm the addition, we execute a query in the browser to view the results.
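
The browser query from step 14 can also be issued from a script if you want to automate the confirmation. The following is a small sketch that uses only the Python standard library and assumes Elasticsearch is listening on its default port 9200, with the same blog_text:facebook query used in the recipe.

    # Query Elasticsearch for the synchronized documents, assuming the default
    # port 9200 and the same query string that was used in the browser.
    import json

    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen           # Python 2

    url = 'http://localhost:9200/_search?q=blog_text:facebook'
    result = json.loads(urlopen(url).read().decode('utf-8'))

    # Each hit carries the _id of the source document, which can be used to
    # fetch the complete document back from the user_blog collection in MongoDB.
    for hit in result['hits']['hits']:
        print(hit['_id'] + ': ' + hit['_source'].get('blog_text', '')[:60])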

Elasticsearch will complain if index names have uppercase characters in them. The mongo-connector doesn't take care of this, and thus the name of the collection has to be in lowercase. For example, the name userBlog will fail.

There's more…

We have not done any additional configuration on Elasticsearch, as that was not the objective of this recipe; we were more interested in integrating MongoDB and Elasticsearch. You will have to refer to the Elasticsearch documentation for more advanced configuration options. If integration with Elasticsearch is required, there is also a concept called rivers in Elasticsearch that can be used. Rivers are Elasticsearch's way of getting data from another data source. For MongoDB, the code for the river can be found at https://github.com/richardwilly98/elasticsearch-river-mongodb/. The readme in this repository has steps on how to set it up.

Earlier in this chapter, we saw the recipe Implementing triggers in Mongo using oplog, which showed how to implement trigger-like functionality using Mongo. This connector and the MongoDB river for Elasticsearch rely on the same logic for getting the data out of Mongo as and when it is needed.

See also
