Implementing full text search in Mongo

Many of us (I won't be wrong to say all of us) use Google every day to search for content on the web. In short, the text that we type into the text box on Google's page is used to search the web pages it has indexed. The search results are then returned to us in an order determined by Google's PageRank algorithm. We might want similar functionality in our database that lets us search for some text content and get the corresponding search results. Note that this text search is not the same as finding the text as part of a sentence, which can easily be done using regex. It goes way beyond that and can be used to get results that contain the same word, a similar sounding word, a word with a similar base word, or even a synonym in the actual sentence.

Text indexes were introduced in MongoDB version 2.4. They let us create a text index on a particular field in the document and enable text search on the words in that field. In this recipe, we will import some documents and create a text index on them, which we will later query to retrieve the results.

Getting ready

A simple, single node is what we would need for the test. Refer to the recipe Installing single node MongoDB from Chapter 1, Installing and Starting the Server, for how to start the server. However, do not start the server yet. There would be an additional flag provided during the startup to enable text search. Download the file BlogEntries.json from the Packt site and keep it on your local drive ready to be imported.

How to do it…

  1. Start the MongoDB server listening on port 27017. Once the server is started, we will create the test data in a collection. With the file BlogEntries.json placed in the current directory, create the collection userBlog using mongoimport as follows:
    $ mongoimport -d test -c userBlog --drop BlogEntries.json
    
  2. Now, connect to the mongo process from a mongo shell by typing the following command from the operating system shell:
    $ mongo
    
  3. Once connected, get a feel for the documents in the userBlog collection as follows:
    > db.userBlog.findOne()
    
  4. The blog_text field is the one we are interested in; this is the field on which we will create a text search index.
  5. Create a text index on the field blog_text of the document as follows:
    > db.userBlog.ensureIndex({'blog_text':'text'})
    
  6. Now, execute the following search on the collection from the mongo shell:
    > db.userBlog.find({$text: {$search : 'plot zoo'}})
    

    Look at the results obtained.

  7. Execute another search as follows:
    > db.userBlog.find({$text: {$search : 'Zoo -plot'}})
    

How it works…

Let's now see how it all works. A text search is done using a mechanism called a reverse index (more commonly known as an inverted index). In simple terms, this is a mechanism where the sentences are broken up into words and those individual words then point back to the documents they belong to. The process is not straightforward though, so let's see what happens step by step at a high level:

  1. Consider the following input sentence, I played cricket yesterday. The first step is to break this sentence into tokens and they become [I, played, cricket, yesterday].
  2. Next, the stop words are removed from the tokens and we are left with a subset of them. Stop words are very common words that are not worth indexing, because they can affect the accuracy of the search when used in the search query. In this case, we will be left with the words [played, cricket, yesterday]. Stop words are language specific and will be different for different languages.
  3. Finally, these words are stemmed to their base words; in this case the result is [play, cricket, yesterday]. Stemming is the process of reducing a word to its root. For instance, the words play, playing, played, and plays all have the same root word, play. There are a lot of algorithms and frameworks available for stemming a word to its root form. Refer to the Wikipedia page http://en.wikipedia.org/wiki/Stemming for more information on stemming and the algorithms used for this purpose. Like the elimination of stop words, the stemming algorithm is language dependent. The examples given here are for the English language.
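
The three steps above can be sketched in plain JavaScript. This is only a toy model and not MongoDB's implementation: the stop word list and the suffix-stripping stem function are simplified assumptions made up for illustration, and analyze and buildIndex are hypothetical helper names. The second sample document is also invented for the example.

```javascript
// Toy model of the indexing pipeline; NOT MongoDB's actual implementation.
// Assumption: a tiny hand-picked English stop word list.
const STOP_WORDS = new Set(["i", "a", "an", "the", "is", "in", "at", "to"]);

// Step 1: break a sentence into lowercase tokens.
function tokenize(sentence) {
  return sentence.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Step 2: drop the stop words.
function removeStopWords(tokens) {
  return tokens.filter((token) => !STOP_WORDS.has(token));
}

// Step 3: naive stemmer (assumption): strip a few common English suffixes.
function stem(word) {
  for (const suffix of ["ing", "ed", "s"]) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 3) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Run all three steps on one sentence.
function analyze(sentence) {
  return removeStopWords(tokenize(sentence)).map(stem);
}

// Build the reverse index: stemmed word -> ids of the documents containing it.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const term of analyze(doc.blog_text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc._id);
    }
  }
  return index;
}

console.log(analyze("I played cricket yesterday"));
// [ 'play', 'cricket', 'yesterday' ]

const index = buildIndex([
  { _id: 1, blog_text: "I played cricket yesterday" },
  { _id: 2, blog_text: "She plays cricket" },
]);
console.log([...index.get("play")]); // [ 1, 2 ]
```

Because played and plays both stem to play, a search for any form of the word finds both documents through the same index entry.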

If we look at the index creation process, the index is created with db.userBlog.ensureIndex({'blog_text':'text'}). The key given in the JSON argument is the name of the field on which the text index is to be created, and the value is always text, denoting that the index to be created is a text index. Once the index is created, at a high level, the preceding three steps get executed on the content of the indexed field in each document and a reverse index is created. You can also choose to create a text index on more than one field. Suppose that we had two fields, blog_text1 and blog_text2; we can create the index as {'blog_text1': 'text', 'blog_text2':'text'}. The specification {'$**':'text'} creates a text index on all fields of the document that contain string data.

Finally, we executed the search operation by invoking the following: db.userBlog.find({$text: {$search : 'plot zoo'}}).

This command runs a text search on the userBlog collection with the search string plot zoo. It searches for the value plot or zoo in the text, in any order. If we look at the results, we see that two documents matched. Each matched document is assigned a score that tells us how relevant it is to the search; the higher the score, the more relevant the document. In our case, one of the documents had both the words plot and zoo in it, and thus got a higher score than the document that had only one of them.

To get the scores in the result, we need to modify the query a bit, as follows:

db.userBlog.find({$text:{$search:'plot zoo'}}, {score: { $meta: "textScore"}})

We now have an additional document provided in the find method that asks for the score calculated for the text match. The results still are not ordered in descending order of score. Let's see how to sort the results by score:

db.userBlog.find({$text:{$search:'plot zoo'}}, { score: { $meta: "textScore" }}).sort({score: { $meta: "textScore"}})

As we can see, the query is the same as before; we have only added the sort function, which sorts the results in descending order of score.

When the search is executed as {$text:{$search:'Zoo -plot'}}, it searches for all the documents that contain the word zoo and do not contain the word plot, thus we get only one result. The - sign is for negation and leaves out of the search result any document containing that word. However, do not expect to find all documents without the word plot by just giving -plot in the search.
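
The effect of the - sign can be modelled with a small sketch in plain JavaScript. It is a toy model rather than MongoDB's implementation: it assumes each document has already been reduced to a set of lowercase terms (stemming is left out), the three sample documents are invented, and parseSearch, matches, and search are hypothetical helper names.

```javascript
// Toy model of negation in a search string; NOT MongoDB's implementation.
// Split the search string into terms to look for and terms to exclude.
function parseSearch(searchString) {
  const include = [];
  const exclude = [];
  for (const word of searchString.toLowerCase().split(/\s+/).filter(Boolean)) {
    if (word.startsWith("-") && word.length > 1) {
      exclude.push(word.slice(1));
    } else {
      include.push(word);
    }
  }
  return { include, exclude };
}

// A document matches when it contains at least one included term
// and none of the excluded terms.
function matches(docTerms, { include, exclude }) {
  const hasIncluded = include.some((term) => docTerms.has(term));
  const hasExcluded = exclude.some((term) => docTerms.has(term));
  return hasIncluded && !hasExcluded;
}

// Hypothetical sample data: documents already reduced to their terms.
const sampleDocs = [
  { _id: 1, terms: new Set(["zoo", "plot", "visit"]) },
  { _id: 2, terms: new Set(["zoo", "animal"]) },
  { _id: 3, terms: new Set(["garden"]) },
];

function search(searchString) {
  const parsed = parseSearch(searchString);
  return sampleDocs.filter((doc) => matches(doc.terms, parsed)).map((doc) => doc._id);
}

console.log(search("plot zoo"));  // [ 1, 2 ]
console.log(search("Zoo -plot")); // [ 2 ]
console.log(search("-plot"));     // []
```

The last call shows why -plot on its own returns nothing: negation only filters documents that were already selected by some other term, so with no positive term there is nothing to filter.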

If we look at the contents returned as the result of the search, we see that each result contains the entire matched document. If we are not interested in the entire document but only in a few of its fields, we can use projection to get the desired fields. The following query, for instance, db.userBlog.find({$text: {$search : 'plot zoo'}},{_id:1}), finds the same documents in the userBlog collection containing the words zoo or plot, but the results will contain only the _id field of the matching documents.

If multiple fields are used to create the index, we may assign different weights to different fields in the document. For instance, suppose blog_text1 and blog_text2 are two fields of a collection. We can create an index where blog_text1 has a higher weight than blog_text2 as follows:

db.collection.ensureIndex(
  {
    blog_text1: "text", blog_text2: "text"
  },
  {
    weights: {
      blog_text1: 2,
      blog_text2: 1,
    },
    name: "MyCustomIndexName"
  }
)

This gives the content in blog_text1 twice as much weight as that in blog_text2. Thus, if a word is found in two documents but is present in the blog_text1 field of the first document and in the blog_text2 field of the second document, then the score of the first document will be higher than that of the second. Note that we have also provided the name of the index, MyCustomIndexName, using the name field.
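
To see the relative effect of the weights, here is a deliberately simplified sketch in plain JavaScript. The real score computed by MongoDB follows a more involved formula; the toyScore function below is a made-up stand-in that simply multiplies the number of occurrences in each field by that field's weight, and the two sample documents are invented.

```javascript
// Toy illustration of field weights; NOT MongoDB's real scoring formula.
const fieldWeights = { blog_text1: 2, blog_text2: 1 };

// Assumption: score = sum over fields of (occurrences of term * field weight).
function toyScore(doc, term) {
  let score = 0;
  for (const [field, weight] of Object.entries(fieldWeights)) {
    const words = (doc[field] || "").toLowerCase().split(/\W+/);
    const hits = words.filter((word) => word === term).length;
    score += hits * weight;
  }
  return score;
}

// Hypothetical documents: the same word appears in a different field of each.
const firstDoc = { blog_text1: "A day at the zoo", blog_text2: "Nothing here" };
const secondDoc = { blog_text1: "Nothing here", blog_text2: "A day at the zoo" };

console.log(toyScore(firstDoc, "zoo"));  // 2
console.log(toyScore(secondDoc, "zoo")); // 1
```

The absolute numbers are meaningless outside this sketch; what carries over is that a match in the more heavily weighted field ranks the document higher.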

The language used for the index in this case is English, which is the default. MongoDB supports various languages for text search. The language is important when indexing the content because it decides the stop words, and the stemming of words is language specific too.

Visit the link http://docs.mongodb.org/manual/reference/command/text/#text-search-languages for more details on the languages supported by Mongo for text search.

So, how do we choose the language while creating the index? By default, if nothing is provided, the index is created assuming the language is English. However, if we know the language is French, we create the index as follows:

db.userBlog.ensureIndex({'text':'text'}, {'default_language':'french'})

If we had created the index using the French language, the getIndexes method would return the following document:

[
  {
    "v" : 1,
    "key" : {
      "_id" : 1
    },
    "ns" : "test.userBlog",
    "name" : "_id_"
  },
  {
    "v" : 1,
    "key" : {
      "_fts" : "text",
      "_ftsx" : 1
    },
    "ns" : "test.userBlog",
    "name" : "text_text",
    "default_language" : "french",
    "weights" : {
      "text" : 1
    },
    "language_override" : "language",
    "textIndexVersion" : 1
  }
]

However, if the language differs from document to document, which is pretty common in scenarios like blogs, we have a way out. If we look at the preceding index definition, the value of the language_override field is language. This means that we can store the language of the content in a field named language on a per-document basis. In its absence, the language is assumed to be the default one, french in the preceding case. Thus, we can have the following:

{_id:1, language:'english', text: ….}  //Language is English
{_id:2, language:'german', text: ….}  //Language is German
{_id:3, text: ….}      //Language is the default one, French in this case
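
The rule for picking the language of each document can be written down as a short sketch in plain JavaScript. The resolveLanguage function is a hypothetical helper that mirrors the behavior described above; it is not a MongoDB API. The option values are taken from the index definition shown earlier, and the text values of the sample documents are placeholders.

```javascript
// Options copied from the index definition returned by getIndexes.
const indexOptions = { default_language: "french", language_override: "language" };

// Hypothetical helper: use the document's own language field when present,
// otherwise fall back to the index's default language.
function resolveLanguage(doc, options) {
  const override = doc[options.language_override];
  return override !== undefined ? override : options.default_language;
}

// Placeholder documents shaped like the ones above.
const blogDocs = [
  { _id: 1, language: "english", text: "placeholder" },
  { _id: 2, language: "german", text: "placeholder" },
  { _id: 3, text: "placeholder" },
];

for (const doc of blogDocs) {
  console.log(doc._id, resolveLanguage(doc, indexOptions));
}
// 1 english
// 2 german
// 3 french
```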

There's more…

To use MongoDB text search in production, you would need version 2.6 or higher. Integrating MongoDB with other systems like Solr and Elasticsearch is also an option. In the next recipe, we will see how to integrate Mongo with Elasticsearch using the mongo-connector.

See also
