Creating unique indexes on a collection and deleting the existing duplicate data automatically

In this recipe, we will look at creating unique indexes on a collection. As the name suggests, a unique index requires that the value of the indexed field be unique across the collection. But what if the collection already has data, and the field we want to index does not hold unique values in that data?

Obviously, the index cannot be created and the operation will fail. There is, however, a way to drop the duplicates and create the index. Curious how this can be achieved? Yes? Keep reading this recipe.

Getting ready

For this recipe, we will create a collection called userDetails. We will need the server to be up and running. Refer to the Single node installation of MongoDB recipe in Chapter 1, Installing and Starting the MongoDB Server, to learn how to start the server. Also, start the shell with the UniqueIndexData.js script loaded. This script will be available on the book's website for download. To find out how to start the shell with a script preloaded, refer to the Connecting to a single node from the Mongo shell with a preloaded JavaScript recipe in Chapter 1, Installing and Starting the MongoDB Server.

How to do it…

  1. Load the required data in the collection using the loadUserDetailsData method.
  2. Execute the following command on the Mongo shell:
    > loadUserDetailsData()
    
  3. See the count of the documents in the collection using the following query (it should be 100):
    > db.userDetails.count()
    
  4. Now, try to create a unique index on the login field on the userDetails collection:
    > db.userDetails.ensureIndex({login:1}, {unique:true})
    
  5. This will not be successful and something like the following error will be seen on the console:
    {
      "err" : "E11000 duplicate key error index: test.userDetails.$login_1  dup key: { : \"bander\" }",
      "code" : 11000,
      "n" : 0,
      "connectionId" : 6,
      "ok" : 1
    }
    
  6. Next, we will try to create an index on this collection by eliminating the duplicates:
    > db.userDetails.ensureIndex({login:1}, {unique:true, dropDups:true})
    
  7. This will succeed without errors. Find the count of documents in the collection again (take a note of the count and compare it with the count seen earlier, prior to index creation):
    > db.userDetails.count()
    
  8. Check whether the index is being used by viewing the plan of the query:
    > db.userDetails.find({login:'mtaylo'}).explain()
    

How it works…

We initially loaded our collection with 100 documents using the loadUserDetailsData function from the UniqueIndexData.js file. The function loops 100 times and loads the same data over and over again; thus, we got duplicate documents.
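The actual loader ships in UniqueIndexData.js on the book's website, so the following is only a hypothetical sketch of what such a loader might build: the placeholder login IDs and the seq field are assumptions (only bander and mtaylo appear in the recipe), but it reproduces the shape described above, namely 100 documents drawn from 20 distinct login IDs.

```javascript
// Hypothetical sketch of the data built by loadUserDetailsData() from
// UniqueIndexData.js. The real script is on the book's website; the login
// values below (other than 'bander' and 'mtaylo') are made up.
function buildUserDetailsData() {
  // 20 distinct login IDs.
  var logins = ['bander', 'mtaylo'];
  for (var j = 0; j < 18; j++) {
    logins.push('user' + j); // placeholder IDs
  }
  var docs = [];
  // Loop 100 times, cycling through the 20 IDs, so each login
  // appears five times -- i.e., the collection holds duplicates.
  for (var i = 0; i < 100; i++) {
    docs.push({ login: logins[i % logins.length], seq: i });
  }
  return docs;
}
// In the mongo shell, the loader would then insert each document, e.g.:
//   buildUserDetailsData().forEach(function (d) { db.userDetails.insert(d); });
```

After such a load, db.userDetails.count() returns 100, while only 20 distinct login values exist, which is exactly the situation that makes the plain unique index creation fail.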

We then try to create a unique index on the login field in the userDetails collection as follows:

> db.userDetails.ensureIndex({login:1}, {unique:true})

This creation fails and reports the duplicate key it first encountered during index creation; it is bander in this case. Can you guess why the error was first encountered for this particular user ID? It is not even the first ID we saw in the loaded data.

Tip

When specifying 1 during index creation, we mean that the values are ordered in ascending order. Try creating a unique index using {login:-1} and see whether the user ID for which the error is reported is different.

In such a scenario, we are left with two options:

  • Manually pick the data to be deleted or fixed, and ensure that the field on which the index is to be created holds unique data across the collection. This can be done manually or programmatically, but it is outside the scope of Mongo and is done by the end user on a case-by-case basis.
  • Alternatively, if we don't care much about the data because it is genuinely duplicated and we need to retain just one copy of it, Mongo provides a brilliant way to handle this. Apart from the regular {unique:true} option used to create a unique index, we provide an additional dropDups:true option (or dropDups:1, if you wish) that blindly deletes all the duplicate data it encounters during index creation. Note that there is no guarantee of which document will be retained and which will be deleted; just one will be retained. In this case, there are 20 unique login IDs. During unique index creation, if the value of a login ID is not already present in the index, it is added; when a login ID that is already present in the index is encountered, the corresponding document is deleted from the collection. This explains why we were left with just 20 documents in the userDetails collection.
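The dropDups behavior described above can be simulated in plain JavaScript: scan the documents in some order, keep the first document seen for each login value (the value "enters the index"), and discard the rest. This is only an illustrative sketch of the semantics, not Mongo's implementation; also note that the dropDups option itself was removed in MongoDB 3.0, so this recipe applies to older servers.

```javascript
// Plain-JavaScript simulation of {unique:true, dropDups:true} semantics:
// keep the first document seen per value of `field`, drop later duplicates.
function dropDupsOn(docs, field) {
  var seen = {};   // stands in for the unique index being built
  var kept = [];
  docs.forEach(function (doc) {
    var key = doc[field];
    if (!Object.prototype.hasOwnProperty.call(seen, key)) {
      seen[key] = true; // value enters the "index"
      kept.push(doc);   // this copy survives
    }
    // duplicates fall through here and are "deleted"
  });
  return kept;
}
```

For instance, 100 documents built from 20 login IDs collapse to 20 survivors. Which copy survives depends purely on the scan order, mirroring Mongo's "no guarantee which document is retained" behavior.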