MongoDB has integrated text search features, as we saw in the previous recipe. However, there are multiple reasons why one might not use the Mongo text search feature and instead fall back to a conventional search engine such as Solr or Elasticsearch.
Setting up a dedicated search engine does require additional effort to integrate it with a MongoDB instance. In this recipe, we will see how to integrate a MongoDB instance with the Elasticsearch search engine.
We will be using mongo-connector for the integration. It is an open source project available at https://github.com/10gen-labs/mongo-connector.
Refer to the recipe Connecting to a single node using a Python client, in Chapter 1, Installing and Starting the Server, for installing and setting up the Python client. The pip tool is used to get mongo-connector. However, if you are working on the Windows platform, the steps to install pip were not mentioned earlier; visit https://sites.google.com/site/pydatalog/python/pip-for-windows to get pip for Windows.
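As an alternative to building from source (shown later in this recipe), mongo-connector can usually be installed straight from PyPI with pip; this assumes the package is published there for your Python version:
$ pip install mongo-connector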
The prerequisites for starting a single instance are all we need for this recipe. We will, however, start the server as a one-node replica set for demonstration purposes in this recipe.
Download the file BlogEntries.json from the Packt site and keep it on your local drive, ready to be imported.
Download Elasticsearch for your target platform from http://www.elasticsearch.org/overview/elkdownloads/. Extract the downloaded archive and, from the shell, go to the bin directory of the extracted archive.
We will get the mongo-connector source from GitHub.com and run it. A Git client is needed for this purpose; visit http://git-scm.com/downloads and follow the instructions for installing Git on your target operating system. If you are not comfortable installing Git on your operating system, there is an alternative that lets you download the source as an archive.
Visit https://github.com/10gen-labs/mongo-connector. Here, we will get an option to download the current source as an archive, which we can then extract on our local drive. The following image shows the download option available in the bottom-right corner:
Similar to the previous recipe, where we saw text search in Mongo, we will use the same five documents to test our simple search. Download and keep the BlogEntries.json file.
$ git clone https://github.com/10gen-labs/mongo-connector.git
$ cd mongo-connector
$ python setup.py install
$ mongod --dbpath /data/mongo/db --replSet textSearch --smallfiles --oplogSize 50
$ mongo
> rs.initiate()
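It can take a few seconds for the initiated member to become the primary. You can verify the state of the replica set from the same mongo shell; the stateStr of the member should eventually read PRIMARY:
> rs.status()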
Start the elasticsearch server instance by executing the following from the bin directory of the extracted elasticsearch archive:
$ elasticsearch
To confirm that the server is up and running, point your browser to the following URL:
http://localhost:9200/_nodes/process?pretty
If the elasticsearch server is up, we should see a JSON response similar to the following:
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "p0gMLKzsT7CjwoPdrl-unA" : {
      "name" : "Zaladane",
      "transport_address" : "inet[/192.168.2.3:9300]",
      "host" : "Amol-PC",
      "ip" : "192.168.2.3",
      "version" : "1.0.1",
      "build" : "5c03844",
      "http_address" : "inet[/192.168.2.3:9200]",
      "process" : {
        "refresh_interval" : 1000,
        "id" : 5628,
        "max_file_descriptors" : -1,
        "mlockall" : false
      }
    }
  }
}
Now that the elasticsearch server and the mongo instance are up and running, and the necessary Python libraries are installed, we will start the connector that syncs the data between the running mongo instance and the elasticsearch server. For the sake of this test, we will be using the user_blog collection in the test database. The field on which we would like text search implemented is the blog_text field of the document:
$ python mongo_connector/connector.py -m localhost:27017 -t http://localhost:9200 -n test.user_blog --fields blog_text -d mongo_connector/doc_managers/elastic_doc_manager.py
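The connector creates an index in elasticsearch named after the namespace it syncs. As a sanity check, you can list the indices; this assumes elasticsearch 1.0 or newer, where the _cat API is available:
$ curl 'http://localhost:9200/_cat/indices?v'
An index named after the namespace (test.user_blog here) should appear in the listing once data starts flowing.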
Import the BlogEntries.json file into the collection using the mongoimport utility as follows. The command is executed with the .json file present in the current directory:
$ mongoimport -d test -c user_blog BlogEntries.json --drop
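Before looking at elasticsearch, it is worth confirming that the import itself worked. From the mongo shell, count the documents in the collection; with the sample file from the Packt site, this should return 5:
> use test
> db.user_blog.find().count()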
Now, from the browser, execute the following query against elasticsearch to search the indexed blog entries for the word facebook:
http://localhost:9200/_search?q=blog_text:facebook
Mongo-connector basically tails the oplog to find new updates, which it publishes to another endpoint. We used elasticsearch in our case, but it could even be Solr. You may choose to write a custom DocManager that plugs into the connector. Refer to the wiki at https://github.com/10gen-labs/mongo-connector/wiki for more details; the readme at https://github.com/10gen-labs/mongo-connector gives some detailed information too.
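To make the mechanism concrete, the following is a minimal sketch in Python of how tailing the oplog works, using pymongo (3.x assumed) against our one-node replica set on localhost:27017. This is not mongo-connector's actual code, just an illustration of the same idea; the test.user_blog namespace and the print call stand in for a real DocManager:
import pymongo

client = pymongo.MongoClient('localhost', 27017)
oplog = client.local['oplog.rs']  # the capped collection backing replication

# Start tailing after the most recent entry so only new operations are seen.
last = oplog.find().sort('$natural', pymongo.DESCENDING).limit(1).next()
ts = last['ts']

while True:
    # A tailable, awaiting cursor stays open and yields new oplog entries
    # as they are written, much like `tail -f` on a file.
    cursor = oplog.find({'ts': {'$gt': ts}},
                        cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        ts = entry['ts']
        # 'i' marks an insert; entry['o'] is the inserted document, which a
        # DocManager would transform and push to the target system.
        if entry['ns'] == 'test.user_blog' and entry['op'] == 'i':
            print(entry['o'])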
We gave the connector the options -m, -t, -n, --fields, and -d; what they mean is as follows:
-m: The host and port of the MongoDB instance whose oplog the connector tails.
-t: The target URL of the system to which the data is synced; our elasticsearch server, in this case.
-n: The namespace, as <database>.<collection>, that the connector watches and syncs.
--fields: The fields of the document that will be sent to the target system.
-d: The doc manager to use for the target system; elastic_doc_manager.py for elasticsearch.
For more supported options, refer to the readme on the connector's GitHub page.
Once the insert is executed on the MongoDB server, the connector detects the newly added documents in the collection of its interest, user_blog, and starts sending the data to be indexed from the new documents to elasticsearch. To confirm the addition, we execute a query in the browser to view the results.
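The same confirmation can be done from the command line; the pretty parameter just makes the JSON response readable:
$ curl 'http://localhost:9200/_search?q=blog_text:facebook&pretty'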
Elasticsearch will complain if the index names have uppercase characters in them. The mongo-connector doesn't take care of this, and thus the name of the collection has to be in lowercase. For example, the name userBlog will fail.
We have not done any additional configuration on elasticsearch, as that was not the objective of this recipe; we were more interested in integrating MongoDB and elasticsearch. Refer to the elasticsearch documentation for more advanced configuration options. Elasticsearch also has a concept called rivers, which can be used for this integration as well. Rivers are elasticsearch's way of getting data from another data source. The code for the MongoDB river can be found at https://github.com/richardwilly98/elasticsearch-river-mongodb/, and the readme in this repository has the steps to set it up.
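For illustration, a river is registered by putting a configuration document into the _river index. The following is a minimal sketch for the MongoDB river, assuming the plugin is already installed, using our test database and user_blog collection; refer to the repository's readme for the authoritative setup steps:
$ curl -XPUT 'http://localhost:9200/_river/mongodb/_meta' -d '{
  "type": "mongodb",
  "mongodb": { "db": "test", "collection": "user_blog" },
  "index": { "name": "test", "type": "user_blog" }
}'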
In this chapter, we saw a recipe, Implementing triggers in Mongo using oplog, on how to implement trigger-like functionality using Mongo. This connector, and the MongoDB river for elasticsearch, rely on the same logic to get the data out of Mongo as and when it is needed.