Integrating MongoDB with Elasticsearch for full-text search

MongoDB has integrated text search features, as we saw in the previous recipe. However, there are several reasons why one might not use the Mongo text search feature and instead fall back on a conventional search engine such as Solr or Elasticsearch. The following are a few of them:

  • The text search feature only became production-ready in version 2.6; in version 2.4 it was introduced as a beta feature, not suitable for production use cases.
  • Products such as Solr and Elasticsearch are built on top of Lucene, which has proven itself in the search engine arena. Solr and Elasticsearch are pretty stable products too.
  • You might already have expertise in products such as Solr and Elasticsearch and would like to use one of them as a full-text search engine rather than MongoDB.
  • Your application might require a particular feature that is missing from MongoDB's text search.

Setting up a dedicated search engine does require additional effort to integrate it with a MongoDB instance. In this recipe, we will see how to integrate a MongoDB instance with the search engine Elasticsearch.

We will be using the Mongo connector for this integration. It is an open source project that is available at https://github.com/10gen-labs/mongo-connector.

Getting ready

Refer to the Installing PyMongo recipe in Chapter 3, Programming Language Drivers, to install and set up Python. The pip tool is used to get the Mongo connector. However, the steps to install pip on the Windows platform were not covered earlier; visit https://sites.google.com/site/pydatalog/python/pip-for-windows to get pip for Windows.

The prerequisites for starting a single instance are all we need for this recipe; however, we will start the server as a single-node replica set for demonstration purposes.

Download the BlogEntries.json file from the book's website and keep it on your local drive, ready to be imported.

Download Elasticsearch for your target platform from http://www.elasticsearch.org/overview/elkdownloads/. Extract the downloaded archive, and from the shell, go to the bin directory of the extracted contents.

We will be getting the mongo-connector source from github.com and running it, so a Git client is needed. Visit http://git-scm.com/downloads and follow the instructions to install Git on your target operating system. If you are not comfortable installing Git on your operating system, there is an alternative that lets you download the source as an archive.

Visit https://github.com/10gen-labs/mongo-connector. Here, in the bottom-right corner of the page, you will find an option that lets you download the current source as an archive, which we can then extract on our local drive.

Note

Note that we can also install mongo-connector very easily using pip, as follows:

pip install mongo-connector

However, the version on PyPI is quite old and lacks many features; thus, using the latest version from the repository is recommended.
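
Note that pip can also install straight from the Git repository, which gets you more recent code than the PyPI release; this assumes that Git and a reasonably recent version of pip are available on your machine:

pip install git+https://github.com/10gen-labs/mongo-connector.git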

Just like in the previous recipe, where we saw text search in Mongo, we will use the same five documents to test our simple search. Download the BlogEntries.json file and keep it ready for import.

How to do it…

  1. At this point, it is assumed that Python, PyMongo, and pip are installed for your operating system platform. We will now get mongo-connector from the source. If you have installed the Git client, execute the following commands from the operating system shell, in the directory where you would like to clone the connector repository. If you have downloaded the repository as an archive instead, extract it, skip the git clone command, and run the remaining commands from the extracted directory:
    $ git clone https://github.com/10gen-labs/mongo-connector.git
    $ cd mongo-connector
    $ python setup.py install
    
  2. The preceding setup will also install the Elasticsearch client that will be used by this application.
  3. We will now start a single Mongo instance, but as a replica set. From the operating system console, execute the following command:
    $ mongod --dbpath /data/mongo/db --replSet textSearch --smallfiles --oplogSize 50
    
  4. Start a Mongo shell and connect to the started instance as follows:
    $ mongo
    
  5. From the Mongo shell, initiate the replica set as follows:
    > rs.initiate()
    
  6. The replica set will be initiated in a few moments. Meanwhile, we can proceed to start the Elasticsearch server instance.
  7. Execute the following command from the command line after going to the bin directory of the extracted elasticsearch archive:
    $ elasticsearch
    
  8. We won't be getting into Elasticsearch settings and will start it in the default mode.
  9. Once started, enter http://localhost:9200/_nodes/process?pretty in the browser.
  10. If we see a JSON document, such as the following, giving the process details, we have successfully started Elasticsearch:
    {
      "cluster_name" : "elasticsearch",
      "nodes" : {
        "p0gMLKzsT7CjwoPdrl-unA" : {
          "name" : "Zaladane",
          "transport_address" : "inet[/192.168.2.3:9300]",
          "host" : "Amol-PC",
          "ip" : "192.168.2.3",
          "version" : "1.0.1",
          "build" : "5c03844",
          "http_address" : "inet[/192.168.2.3:9200]",
          "process" : {
            "refresh_interval" : 1000,
            "id" : 5628,
            "max_file_descriptors" : -1,
            "mlockall" : false
          }
        }
      }
    }
    
  11. Once the Elasticsearch server and Mongo instance are up and running, and the necessary Python libraries are installed, we will start the connector that will sync the data between the Mongo instance and the Elasticsearch server.

    For the sake of this test, we will be using the user_blog collection in the test database. The field on which we would like to have text search implemented is the blog_text field in the document.

  12. Start the Mongo connector from the operating system shell as follows. The following command was executed with the Mongo connector's directory as the current directory:
    $ python mongo_connector/connector.py -m localhost:27017 -t http://localhost:9200 -n test.user_blog --fields blog_text -d mongo_connector/doc_managers/elastic_doc_manager.py
    
  13. Import the BlogEntries.json file into the collection using the mongoimport utility as follows. The command is executed with the .json file present in the current directory:
    $ mongoimport -d test -c user_blog BlogEntries.json --drop
    
  14. Open a browser of your choice and enter http://localhost:9200/_search?q=blog_text:facebook in it.
  15. You should see the matching documents returned as a JSON response in the browser; each hit includes the _id of the source document and the synchronized blog_text field. A scripted version of this check is sketched after this list.

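Optionally, the checks in steps 13 to 15 can be scripted in Python. The following is a minimal sketch that uses PyMongo and the elasticsearch client (installed by the connector's setup) to count the imported documents and run the same search; it assumes the default ports used above, and the exact response structure may vary with the client and server versions:

    # verify_sync.py -- an illustrative check, assuming default hosts and ports
    from pymongo import MongoClient
    from elasticsearch import Elasticsearch

    # Confirm the blog entries were imported into MongoDB.
    mongo = MongoClient('localhost', 27017)
    print('documents in test.user_blog: %d' % mongo.test.user_blog.count())

    # Run the same query we issued from the browser.
    es = Elasticsearch()
    result = es.search(q='blog_text:facebook')
    for hit in result['hits']['hits']:
        # Each hit carries the source document's _id and the synchronized fields.
        print('%s %r' % (hit['_id'], hit.get('_source')))
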
How it works…

Basically, the Mongo connector tails the oplog to find new updates, which it publishes to another endpoint. We used Elasticsearch in our case, but it could just as well be Solr. You may choose to write a custom DocManager that plugs into the connector. For more details, visit https://github.com/10gen-labs/mongo-connector/wiki. The README at https://github.com/10gen-labs/mongo-connector gives some detailed information as well.
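
To give an idea of what writing a custom DocManager involves, the following is a rough Python skeleton. The method names are modeled on the doc managers shipped with the connector (upsert, remove, search, commit, get_last_doc, and stop); treat this purely as an illustration and check the project's wiki for the exact interface expected by your version of the connector:

    # my_doc_manager.py -- an illustrative skeleton, not a drop-in implementation
    class DocManager(object):
        """Pushes documents read from the oplog to some target system."""

        def __init__(self, url, unique_key='_id', **kwargs):
            self.url = url                # endpoint of the target system
            self.unique_key = unique_key  # field used as the document identifier

        def upsert(self, doc):
            """Insert or update a single document in the target system."""
            pass  # send doc, keyed by doc[self.unique_key]

        def remove(self, doc):
            """Delete the document from the target system."""
            pass

        def search(self, start_ts, end_ts):
            """Return documents modified between two oplog timestamps."""
            return []

        def commit(self):
            """Flush any buffered writes to the target system."""
            pass

        def get_last_doc(self):
            """Return the most recently synchronized document (used on restarts)."""
            return None

        def stop(self):
            """Release any resources held by this manager."""
            pass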

We gave the connector the -m, -t, -n, --fields, and -d options. Their meanings are as follows:

  • -m: The URL of the MongoDB host to which the connector connects to get the data to be synchronized.
  • -t: The target URL of the system with which the data is to be synchronized; Elasticsearch in this case. The URL format will depend on the target system. Should you choose to implement your own DocManager, the format will be one that your DocManager understands.
  • -n: The namespace that we would like to keep synchronized with the external system. The connector will look for changes only in these namespaces while tailing the oplog. The value is comma-separated if more than one namespace is to be synchronized.
  • --fields: The fields from the document that will be sent to the external system. In our case, it doesn't make sense to index the entire document and waste resources; it is recommended to index just the fields for which you would like text search support. The identifier _id field and the namespace of the source are also present in the result, as we can see in the browser's search results. The _id field can then be used to query the corresponding MongoDB collection.
  • -d: The document manager to be used; in our case, we have used Elasticsearch's document manager.

For more supported options, refer to the readme on the connector's GitHub page.
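
For instance, if we also wanted to keep a second collection synchronized along with an additional field, the same options take comma-separated values. The test.user_comments namespace and the comment_text field here are only hypothetical placeholders for illustration:

$ python mongo_connector/connector.py -m localhost:27017 -t http://localhost:9200 -n test.user_blog,test.user_comments --fields blog_text,comment_text -d mongo_connector/doc_managers/elastic_doc_manager.py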

Once the insert is executed on the MongoDB server, the connector detects the documents newly added to the collection it is interested in, that is, user_blog, and starts sending the data to be indexed from these documents to Elasticsearch. To confirm the addition, we executed a query in the browser and viewed the results.

Elasticsearch will complain about index names that contain uppercase characters, and the mongo connector doesn't take care of this. The name of the collection therefore has to be in lowercase; a name such as userBlog will cause the synchronization to fail.

There's more…

We have not done any additional configuration on Elasticsearch, as that was not the objective of this recipe; we were more interested in integrating MongoDB and Elasticsearch. You will have to refer to the Elasticsearch documentation for more advanced configuration options. If integration with Elasticsearch is required, Elasticsearch's concept of rivers can be used as well. Rivers are Elasticsearch's way of getting data from another data source. For MongoDB, the code for a river can be found at https://github.com/richardwilly98/elasticsearch-river-mongodb/. The README.md in this repository has steps on how to set it up.

Earlier in this chapter, in the recipe Implementing triggers in Mongo using oplog, we saw how to implement trigger-like functionality using Mongo. This connector and the MongoDB river for Elasticsearch rely on the same logic to get the data out of Mongo as and when it is needed.
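
To illustrate that underlying mechanism, here is a minimal PyMongo sketch that tails the oplog for changes to the test.user_blog namespace. It only demonstrates the idea and is not what the connector actually runs; it assumes a replica-set member on localhost:27017 and PyMongo 3.x (older versions pass tailable=True and await_data=True to find() instead of a cursor type):

    # tail_oplog.py -- a minimal sketch of the oplog-tailing idea
    import time
    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    oplog = client.local['oplog.rs']

    # Start from the newest entry so that only new operations are reported.
    last = oplog.find().sort('$natural', pymongo.DESCENDING).limit(1).next()
    ts = last['ts']

    while True:
        cursor = oplog.find({'ts': {'$gt': ts}, 'ns': 'test.user_blog'},
                            cursor_type=pymongo.CursorType.TAILABLE_AWAIT,
                            oplog_replay=True)
        for entry in cursor:
            ts = entry['ts']
            # 'op' is 'i' for inserts, 'u' for updates, 'd' for deletes;
            # 'o' holds the document or the change description.
            print('%s %r' % (entry['op'], entry.get('o')))
        time.sleep(1)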

See also
