Day 2: Indexing, Aggregating, Mapreduce

Increasing MongoDB’s query performance is the first item on today’s docket, followed by some more powerful and complex grouped queries. Finally, we’ll round out the day with some data analysis using mapreduce.

Indexing: When Fast Isn’t Fast Enough

One of Mongo’s useful built-in features is indexing for enhanced query performance—something, as you’ve seen, that’s not available on all NoSQL databases. MongoDB provides several of the best data structures for indexing, such as the classic B-tree, as well as other additions such as two-dimensional and spherical geospatial indexes.

For now, we’re going to do a little experiment to see the power of MongoDB’s B-tree index by populating a series of phone numbers with a random country prefix (feel free to replace this code with your own country code). Enter the following code into your console. It will generate 100,000 phone numbers between 1-800-555-0000 and 1-800-565-0000 (this may take a while).

 populatePhones = function(area, start, stop) {
   for (var i = start; i < stop; i++) {
     var country = 1 + ((Math.random() * 8) << 0);
     var num = (country * 1e10) + (area * 1e7) + i;
     var fullNumber = "+" + country + " " + area + "-" + i;
     db.phones.insert({
       _id: num,
       components: {
         country: country,
         area: area,
         prefix: (i * 1e-4) << 0,
         number: i
       },
       display: fullNumber
     });
     print("Inserted number " + fullNumber);
   }
   print("Done!");
 }

Run the function with a three-digit area code (like 800) and a range of seven-digit numbers (5,550,000 to 5,650,000—please verify your zeros when typing).

 > populatePhones(800, 5550000, 5650000) // This could take a minute
 > db.phones.find().limit(2)
 { "_id" : 18005550000, "components" : { "country" : 1, "area" : 800,
   "prefix" : 555, "number" : 5550000 }, "display" : "+1 800-5550000" }
 { "_id" : 88005550001, "components" : { "country" : 8, "area" : 800,
   "prefix" : 555, "number" : 5550001 }, "display" : "+8 800-5550001" }
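The arithmetic inside populatePhones is easy to check outside of Mongo. This sketch, in plain JavaScript with the random country pinned to 1 for illustration, shows how the _id and the prefix component are derived for a single number:

```javascript
// How populatePhones derives the stored fields for one number.
// (country is normally random; we pin it to 1 here for illustration.)
function phoneDoc(country, area, i) {
  return {
    _id: (country * 1e10) + (area * 1e7) + i,  // 1, 800, 5550000 -> 18005550000
    components: {
      country: country,
      area: area,
      prefix: (i * 1e-4) << 0,  // integer division by 10,000: 5550000 -> 555
      number: i
    },
    display: "+" + country + " " + area + "-" + i
  };
}

var doc = phoneDoc(1, 800, 5550000);
// doc._id === 18005550000, doc.components.prefix === 555
```

The `<< 0` trick simply truncates a floating-point result to an integer, which is why the prefix works out to the first three digits of the seven-digit number.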

Whenever a new collection is created, Mongo automatically creates an index by the _id. These indexes can be found in the system.indexes collection. The following query shows all indexes in the database:

 > db.getCollectionNames().forEach(function(collection) {
     print("Indexes for the " + collection + " collection:");
     printjson(db[collection].getIndexes());
   });

Most queries will include more fields than just the _id, so we need to make indexes on those fields.

We’re going to create a B-tree index on the display field. But first, let’s verify that the index will improve speed. To do this, we’ll first check a query without an index. The explain method is used to output details of a given operation.

 > db.phones.find({ display: "+1 800-5650001" }).
     explain("executionStats").executionStats
 {
   "executionTimeMillis" : 52,
   "executionStages" : {
     "executionTimeMillisEstimate" : 58
   }
 }

Your output will differ from ours, and only a few fields from the output are shown here, but note the executionTimeMillis field—the number of milliseconds the query took to complete—which will likely be in the double digits.

We create an index by calling ensureIndex(fields,options) on the collection. The fields parameter is an object containing the fields to be indexed against. The options parameter describes the type of index to make. In this case, we’re building a unique index on display that should just drop duplicate entries.

 > db.phones.ensureIndex(
     { display : 1 },
     { unique : true, dropDups : true }
   )

Now try find again, and check explain to see whether the situation improves.

 > db.phones.find({ display: "+1 800-5650001" }).
     explain("executionStats").executionStats
 {
   "executionTimeMillis" : 0,
   "executionStages" : {
     "executionTimeMillisEstimate" : 0
   }
 }

The executionTimeMillis changed from 52 to 0—an infinite improvement (52 / 0)! Just kidding, but the query is now orders of magnitude faster. Mongo is no longer doing a full collection scan but instead walking the B-tree to retrieve the value. Importantly, the number of scanned documents dropped from 109999 to 1, since the query has become a single unique lookup.

explain is a useful function, but you’ll use it only when testing specific query calls. If you need to profile in a normal test or production environment, you’ll need the system profiler.

Let’s set the profiling level to 2 (level 2 stores all queries; level 1 stores only slow queries that take longer than 100 milliseconds) and then run find as normal.

 > db.setProfilingLevel(2)
 > db.phones.find({ display : "+1 800-5650001" })

This will create a new object in the system.profile collection, which you can read as any other table to get information about the query, such as a timestamp for when it took place and performance information (such as executionTimeMillisEstimate as shown). You can fetch documents from that collection like any other:

 > db.system.profile.find()

This will return a list of objects representing past queries. This query, for example, would return stats about execution times from the first query in the list:

 > db.system.profile.find()[0].execStats
 {
   "stage" : "EOF",
   "nReturned" : 0,
   "executionTimeMillisEstimate" : 0,
   "works" : 0,
   "advanced" : 0,
   "needTime" : 0,
   "needYield" : 0,
   "saveState" : 0,
   "restoreState" : 0,
   "isEOF" : 1,
   "invalidates" : 0
 }

Like yesterday’s nested queries, Mongo can build your index on nested values. If you wanted to index on all area codes, use the dot-notated field representation: components.area. In production, you should always build indexes in the background using the { background : 1 } option.

 > db.phones.ensureIndex({ "components.area" : 1 }, { background : 1 })
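The dot-notated string works because Mongo resolves each path segment against the nested document in turn. A toy resolver in plain JavaScript (a conceptual model, not Mongo's actual implementation) makes the idea concrete:

```javascript
// Toy resolver for dot-notated field paths like "components.area".
// This mimics how Mongo addresses nested values; it is not Mongo's code.
function resolvePath(doc, path) {
  var parts = path.split(".");
  var value = doc;
  for (var i = 0; i < parts.length; i++) {
    if (value == null) return undefined;  // missing intermediate field
    value = value[parts[i]];
  }
  return value;
}

var phone = { components: { country: 1, area: 800, prefix: 555 } };
// resolvePath(phone, "components.area") === 800
```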

If we list all of the indexes on our phones collection, the new one should appear last. The first index is always created automatically for quick lookup by _id; the other two we added ourselves.

 > db.phones.getIndexes()
 [
   {
     "v" : 2,
     "key" : {
       "_id" : 1
     },
     "name" : "_id_",
     "ns" : "book.phones"
   },
   {
     "v" : 2,
     "unique" : true,
     "key" : {
       "display" : 1
     },
     "name" : "display_1",
     "ns" : "book.phones"
   },
   {
     "v" : 2,
     "key" : {
       "components.area" : 1
     },
     "name" : "components.area_1",
     "ns" : "book.phones",
     "background" : 1
   }
 ]

Our book.phones indexes have rounded out quite nicely.

We should close this section by noting that creating an index on a large collection can be slow and resource-intensive. Indexes simply “cost” more in Mongo than in a relational database like Postgres due to Mongo’s schemaless nature. You should always consider the impact of building an index: create indexes at off-peak times, run index creation in the background, and run it manually rather than relying on automated index creation. There are plenty more indexing tricks and tips online, but these are the basics that may come in handy the most often.

Aggregated Queries

MongoDB includes a handful of single-purpose aggregators: count provides the number of documents included in a result set (which we saw earlier), distinct collects the result set into an array of unique results, and aggregate returns documents according to a logic that you provide.

The queries we investigated yesterday were useful for basic data extraction, but any post-processing would be up to you to handle. For example, say you wanted to count the phone numbers greater than 5599999 or provide nuanced data about phone number distribution in different countries—in other words, to produce aggregate results using many documents. As in PostgreSQL, count is the most basic aggregator. It takes a query and returns a number (of matching documents).

 > db.phones.count({ 'components.number' : { $gt : 5599999 } })
 50000

The distinct method returns each matching value (not a full document) where one or more exists. We can get the distinct component numbers that are less than 5,550,005 in this way:

 > db.phones.distinct('components.number',
     { 'components.number' : { $lt : 5550005 } })
 [ 5550000, 5550001, 5550002, 5550003, 5550004 ]
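Conceptually, both aggregators boil down to a single pass over the matching documents. This plain-JavaScript sketch (over a tiny hypothetical in-memory array, not the Mongo implementation) mirrors the two calls above:

```javascript
// In-memory analogues of count and distinct over an array of documents.
var docs = [
  { components: { number: 5550000 } },
  { components: { number: 5550001 } },
  { components: { number: 5550001 } },  // duplicate value
  { components: { number: 5600000 } }
];

// count: how many documents match the predicate
function countWhere(docs, pred) {
  return docs.filter(pred).length;
}

// distinct: the unique matching values, not full documents
function distinctWhere(docs, field, pred) {
  var seen = {};
  docs.filter(pred).forEach(function(d) { seen[d.components[field]] = true; });
  return Object.keys(seen).map(Number).sort(function(a, b) { return a - b; });
}

countWhere(docs, function(d) { return d.components.number > 5599999; });
// -> 1
distinctWhere(docs, "number", function(d) { return d.components.number < 5600000; });
// -> [5550000, 5550001]
```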

The aggregate method is more complex but also much more powerful. It enables you to specify a pipeline-style logic consisting of stages such as: $match filters that return specific sets of documents; $group functions that group based on some attribute; a $sort logic that orders the documents by a sort key; and many others.[30]

You can chain together as many stages as you’d like, mixing and matching at will. Think of aggregate as a combination of WHERE, GROUP BY, and ORDER BY clauses in SQL. The analogy isn’t perfect, but the aggregation API does a lot of the same things.

Let’s load some city data into Mongo. There’s an included mongoCities100000.js file containing insert statements for data about nearly 100,000 cities. Here’s how you can execute that file in the Mongo shell:

 > load('mongoCities100000.js')
 > db.cities.count()
 99838

Here’s an example document for a city:

 {
   "_id" : ObjectId("5913ec4c059c950f9b799895"),
   "name" : "Sant Julià de Lòria",
   "country" : "AD",
   "timezone" : "Europe/Andorra",
   "population" : 8022,
   "location" : {
     "longitude" : 42.46372,
     "latitude" : 1.49129
   }
 }

We could use aggregate to, for example, find the average population for all cities in the Europe/London timezone. To do so, we could $match all documents where timezone equals Europe/London and then add a $group stage that produces one document with an _id field whose value is averagePopulation and an avgPop field holding the average across all matched population values:

 > db.cities.aggregate([
     {
       $match: {
         'timezone': {
           $eq: 'Europe/London'
         }
       }
     },
     {
       $group: {
         _id: 'averagePopulation',
         avgPop: {
           $avg: '$population'
         }
       }
     }
   ])
 { "_id" : "averagePopulation", "avgPop" : 23226.22149712092 }
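Under the hood, each stage consumes the previous stage's output. This plain-JavaScript sketch reproduces the $match/$group logic above over a tiny hypothetical array (a conceptual model, not Mongo internals):

```javascript
// A $match followed by a $group with $avg, written as plain array operations.
var cities = [
  { name: "London",     timezone: "Europe/London",  population: 7556900 },
  { name: "Birmingham", timezone: "Europe/London",  population: 984333 },
  { name: "Andorra",    timezone: "Europe/Andorra", population: 8022 }
];

// $match: keep only the Europe/London documents
var matched = cities.filter(function(c) { return c.timezone === "Europe/London"; });

// $group with $avg: one result document per _id value
var avgPop = matched.reduce(function(sum, c) { return sum + c.population; }, 0) /
             matched.length;
var result = { _id: "averagePopulation", avgPop: avgPop };
// result.avgPop === (7556900 + 984333) / 2
```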

We could also match all documents in that same timezone, sort them in descending order by population, and then $project documents that only contain the population field:

 > db.cities.aggregate([
     {
       // same $match stage as in the previous aggregation operation
     },
     {
       $sort: {
         population: -1
       }
     },
     {
       $project: {
         _id: 0,
         name: 1,
         population: 1
       }
     }
   ])

You should see results like this:

 { "name" : "City of London", "population" : 7556900 }
 { "name" : "London", "population" : 7556900 }
 { "name" : "Birmingham", "population" : 984333 }
 // many others

Experiment with it a bit—try combining some of the stage types we’ve already covered in new ways—and then delete the collection when you’re done, as we’ll add the same data back into the database using a different method on Day 3.

 > db.cities.drop()

This provides a very small taste of Mongo’s aggregation capabilities. The possibilities are really endless, and we encourage you to explore other stage types. Be forewarned that aggregations can be quite slow if you add a lot of stages and/or perform them on very large collections. There are limits to how well Mongo, as a schemaless database, can optimize these sorts of operations. But if you’re careful to keep your collections reasonably sized and, even better, structure your data to not require bold transformations to get the outputs you want, then aggregate can be a powerful and even speedy tool.

Server-Side Commands

In addition to evaluating JavaScript functions, Mongo includes several pre-built commands, most of which are executed on the server, although some can be executed only under the admin database (which you can access by entering use admin). The top command, for example, will output access details about all collections on the server.

 > use admin
 > db.runCommand("top")
 {
   "totals" : {
     "note" : "all times in microseconds",
     "admin.system.roles" : {
       "total" : {
         "time" : 3666,
         "count" : 1
       },
       // etc
     }
   }
 }

You can also list all commands that are currently available (let’s switch back to the book database first because the admin database provides a different set of commands):

 > use book
 > db.listCommands()

When you run listCommands, you may notice a lot of commands we’ve used already. In fact, you can execute many common operations through the runCommand method, such as finding documents in a collection. However, you may notice slightly different output.

 > db.runCommand({ "find" : "someCollection" })
 {
   "cursor" : {
     "id" : NumberLong(0),
     "ns" : "book.someCollection",
     "firstBatch" : [ ]
   },
   "ok" : 1
 }

Here, we see that this function returns an object containing a cursor and an ok field. That’s because db.phones.find is a wrapper function created for our convenience by the shell’s JavaScript interface, whereas runCommand is an operation executed on the server. Remember that we can play detective on how a function such as find works by leaving off the calling parentheses.

 > db.phones.find
 function (query, fields, limit, skip, batchSize, options) {
   var cursor = new DBQuery(this._mongo,
     // other query parameters
     options || this.getQueryOptions());

   // some cursor-building logic

   return cursor;
 }

So what about the DBQuery object? How much more can we find out about it?

 > DBQuery
 function DBQuery() {
   [native code]
 }

Okay, looks like Mongo isn’t going to reveal too much about that. No matter: this way of diving into function definitions is a great way to both explore Mongo conceptually and to get a better sense of what’s happening inside of your queries and operations.

Mapreduce (and Finalize)

Mapreduce operations are designed for performing computations over large datasets. Every mapreduce operation is split into two basic steps. First, a map step performs a series of filtering and/or sorting operations, winnowing the original dataset down into some subset. Then, a reduce step performs some kind of operation on that subset. An example mapreduce operation would be finding all baseball players in Major League history with the first name Dave (the map step) and then finding the cumulative batting average for all of those Daves (the reduce step).

In MongoDB, the map step involves creating a mapper function that calls an emit function. The benefit of this approach is that you can emit more than once per document. The reduce function accepts a single key and a list of all the values that were emitted to that key. Finally, Mongo provides an optional third step called finalize, which is executed only once per key after the reducers are run. This allows you to perform any final calculations or cleanup you may need.
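It may help to see the whole map/emit/reduce/finalize cycle in miniature first. This sketch is a simplified plain-JavaScript model of the flow (not Mongo's engine—for one thing, Mongo's emit is a global that the mapper calls with `this` bound to the document):

```javascript
// A miniature mapreduce: map may emit multiple times per document,
// reduce folds all values emitted to a key, finalize runs once per key.
function miniMapReduce(docs, map, reduce, finalize) {
  var buckets = {};
  docs.forEach(function(doc) {
    map(doc, function emit(key, value) {
      var k = JSON.stringify(key);  // keys may be objects, so serialize them
      (buckets[k] = buckets[k] || []).push(value);
    });
  });
  var out = {};
  Object.keys(buckets).forEach(function(k) {
    var reduced = reduce(JSON.parse(k), buckets[k]);
    out[k] = finalize ? finalize(JSON.parse(k), reduced) : reduced;
  });
  return out;
}

// Example: count words by first letter, emitting once per document.
var words = [{ w: "dave" }, { w: "dan" }, { w: "mike" }];
var counts = miniMapReduce(words,
  function(doc, emit) { emit(doc.w[0], { count: 1 }); },
  function(key, values) {
    return { count: values.reduce(function(t, v) { return t + v.count; }, 0) };
  });
// counts['"d"'].count === 2, counts['"m"'].count === 1
```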

Because we already know the basics of mapreduce, we’ll skip the intro wading-pool example and go right to the high-dive. Let’s generate a report that counts all phone numbers that contain the same digits for each country. First, we’ll store a helper function that extracts an array of all distinct numbers (understanding how this helper works is not imperative to understanding the overall mapreduce).

 distinctDigits = function(phone) {
   var number = phone.components.number + '',
       seen = [],
       result = [],
       i = number.length;

   while (i--) {
     seen[+number[i]] = 1;
   }

   for (var i = 0; i < 10; i++) {
     if (seen[i]) {
       result[result.length] = i;
     }
   }

   return result;
 }

 db.system.js.save({ _id: 'distinctDigits', value: distinctDigits })

Load the file in the mongo command line. If the file exists in the same directory you launched mongo from, you need only the filename; otherwise, a full path is required.

 > load('distinctDigits.js')
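Because distinctDigits is plain JavaScript, you can sanity-check it before wiring it into the mapper (the function is repeated here so the example is self-contained):

```javascript
// distinctDigits as defined above: returns the sorted array of
// distinct digits appearing in a phone document's number component.
var distinctDigits = function(phone) {
  var number = phone.components.number + '',
      seen = [],
      result = [],
      i = number.length;
  while (i--) {
    seen[+number[i]] = 1;
  }
  for (var i = 0; i < 10; i++) {
    if (seen[i]) {
      result[result.length] = i;
    }
  }
  return result;
};

distinctDigits({ components: { number: 5551234 } });  // -> [1, 2, 3, 4, 5]
distinctDigits({ components: { number: 5550000 } });  // -> [0, 5]
```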

Now we can get to work on the mapper. As with any mapreduce function, deciding what fields to map by is a crucial decision because it dictates the aggregated values that you return. Because our report is finding distinct numbers, the array of distinct values is one field. But because we also need to query by country, that is another field. We add both values as a compound key: {digits : X, country : Y}.

Our goal is to simply count these values, so we emit the value 1 (each document represents one item to count). The reducer’s job is to sum all those 1s together.

 map = function() {
   var digits = distinctDigits(this);
   emit({
     digits: digits,
     country: this.components.country
   }, {
     count : 1
   });
 }
 reduce = function(key, values) {
   var total = 0;
   for (var i = 0; i < values.length; i++) {
     total += values[i].count;
   }
   return { count : total };
 }
 results = db.runCommand({
   mapReduce: 'phones',
   map: map,
   reduce: reduce,
   out: 'phones.report'
 })

Because we set the collection name via the out parameter (out: 'phones.report'), you can query the results like any other collection. It’s a materialized view that you can see in the show tables list.

 > db.phones.report.find({ '_id.country' : 8 })
 {
   "_id" : { "digits" : [ 0, 1, 2, 3, 4, 5, 6 ], "country" : 8 },
   "value" : { "count" : 19 }
 }
 {
   "_id" : { "digits" : [ 0, 1, 2, 3, 5 ], "country" : 8 },
   "value" : { "count" : 3 }
 }
 {
   "_id" : { "digits" : [ 0, 1, 2, 3, 5, 6 ], "country" : 8 },
   "value" : { "count" : 48 }
 }
 {
   "_id" : { "digits" : [ 0, 1, 2, 3, 5, 6, 7 ], "country" : 8 },
   "value" : { "count" : 12 }
 }
 has more

Type it to continue iterating through the results. Note that the unique emitted keys are under the field _id, and all of the data returned from the reducers is under the field value.

If you prefer that the mapreduce operation just output the results rather than write them to a collection, you can set the out value to { inline : 1 }, but bear in mind that there is a limit to the size of a result you can output. As of Mongo 2.0, that limit is 16 MB.

Reducers can have either mapped (emitted) results or other reducer results as inputs. Why would the output of one reducer feed into the input of another if they are mapped to the same key? Think of how this would look if run on separate servers, as shown in the figure that follows.

[Figure: a mapreduce operation run across multiple servers (images/mongo-mapreduce.png)]

Each server must run its own map and reduce functions and then push those results to the service that initiated the call, which gathers and merges them. Classic divide and conquer. If we had renamed the output of the reducer to total instead of count, we would have needed to handle both cases in the loop, as shown here:

 reduce = function(key, values) {
   var total = 0;
   for (var i = 0; i < values.length; i++) {
     var data = values[i];
     if ('total' in data) {
       total += data.total;
     } else {
       total += data.count;
     }
   }
   return { total : total };
 }
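You can verify that this total-aware reducer is safe to re-run over its own output with a quick plain-JavaScript simulation (hypothetical values, outside Mongo):

```javascript
// The total-aware reducer from above, exercised on a mix of freshly
// mapped values ({count: 1}) and already-reduced values ({total: n}).
var reduce = function(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    var data = values[i];
    if ('total' in data) {
      total += data.total;   // output of an earlier reduce pass
    } else {
      total += data.count;   // value straight from the mapper
    }
  }
  return { total : total };
};

// First pass on one "server"; second pass merges its output with fresh emits.
var partial = reduce('key', [{ count: 1 }, { count: 1 }]);          // { total: 2 }
var merged  = reduce('key', [partial, { count: 1 }, { count: 1 }]); // { total: 4 }
```

This re-reducibility is exactly the property Mongo relies on when it merges partial results from separate servers.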

However, Mongo predicted that you might need to perform some final changes, such as renaming a field or some other calculations. If you really need the output field to be total, you can implement a finalize function, which works the same way as the finalize function under group.

Day 2 Wrap-Up

On Day 2, we’ve expanded our query power by including several aggregate queries: count, distinct, and topped off by aggregate. To speed up the response time of these queries, we used MongoDB’s indexing options. When more power is required, the ever-present mapReduce is available.

Day 2 Homework

Find

  1. Find a shortcut for admin commands.
  2. Find the online documentation for queries and cursors.
  3. Find the MongoDB documentation for mapreduce.
  4. Through the JavaScript interface, investigate the code for three collection functions: help, findOne, and stats.

Do

  1. Implement a finalize method to output the count as the total.

  2. Install a Mongo driver for a language of your choice, and connect to the database. Populate a collection through it, and index one of the fields.
