Chapter 5. Constructing queries

This chapter covers

  • Querying an e-commerce data model
  • The MongoDB query language in detail
  • Query selectors and options

MongoDB doesn’t use SQL. It instead features its own JSON-like query language. You’ve explored this language throughout the book, but now let’s turn to some meatier, real-world examples. You’ll revisit the e-commerce data model introduced in the previous chapter and present a variety of queries against it. Among the queries you’ll practice are _id lookups, ranges, ordering, and projections. This chapter then surveys the MongoDB query language as a whole, looking at each available query operator in detail.

Keep in mind as you’re reading this chapter that MongoDB’s query language and aggregation functions (which chapter 6 covers) are still works in progress, and refinements are being added with each release. As it stands, mastering queries and aggregations in MongoDB isn’t so much a matter of mapping out every nook as it is finding the best ways to accomplish everyday tasks. Through the examples in this chapter, you’ll learn the clearest routes to take. By the end of the chapter, you should have a good intuitive understanding of queries in MongoDB, and you’ll be ready to apply these tools to the design of application schemas.

5.1. E-commerce queries

This section continues our exploration of the e-commerce data model sketched out in the previous chapter. You’ve defined a document structure for products, categories, users, orders, and product reviews. Now, with that structure in mind, we’ll show you how you might query these entities in a typical e-commerce application. Some of these queries are simple. For instance, _id lookups shouldn’t be a mystery at this point. But we’ll also show you a few more sophisticated patterns, including querying for and displaying a category hierarchy, as well as providing filtered views of product listings. In addition, we’ll keep efficiency in mind by looking at possible indexes for some of these queries.

5.1.1. Products, categories, and reviews

Most e-commerce applications provide at least two basic views of products and categories. First is the product home page, which highlights a given product, displays reviews, and gives some sense of the product’s categories. Second is the product listing page, which allows users to browse the category hierarchy and view thumbnails of all the products within a selected category. Let’s begin with the product home page, in many ways the simpler of the two.

Imagine that your product page URLs are keyed on a product slug (you learned about these user-friendly permalinks in chapter 4). In that case, you can get all the data you need for your product page with the following three queries:

product = db.products.findOne({'slug': 'wheel-barrow-9092'})
db.categories.findOne({'_id': product['main_cat_id']})
db.reviews.find({'product_id': product['_id']})

The first query finds the product with the slug wheel-barrow-9092. Once you have your product, you query for its category information with a simple _id query on the categories collection. Finally, you issue another simple lookup that gets all the reviews associated with the product.

findOne vs. find queries

You’ll notice that the first two queries use the findOne method but the last uses find instead. All of the MongoDB drivers provide these two methods, so it’s worth reviewing the difference between them. As discussed in chapter 3, find returns a cursor object, whereas findOne returns a document. The findOne method is similar to the following, though a cursor is returned even when you apply a limit:

db.products.find({'slug': 'wheel-barrow-9092'}).limit(1)

If you’re expecting a single document, findOne will return that document if it exists. If you need to return multiple documents, use find. You’ll then need to iterate over that cursor somewhere in your application.

If your findOne query matches multiple items in the database, it’ll return the first item in the natural sort order of the documents in the collection. In most cases (but not always) this is the same order that the documents were inserted into the collection, and for capped collections, it’s always the case. If you expect multiple result documents, you should almost always use a find query or explicitly sort the results.

Now look again at the product page queries. See anything unsettling? If the query for reviews seems a bit liberal, you’re right. This query says to return all reviews for the given product, but doing so wouldn’t be prudent in cases where a product had hundreds of reviews.

Skip, limit, and sort query options

Most applications paginate reviews, and for enabling this MongoDB provides skip and limit options. You can use these options to paginate the review document like this:

db.reviews.find({'product_id': product['_id']}).skip(0).limit(12)

Notice how you set these options by calling the skip and limit methods on the returned value of find. This can be confusing because it’s a different pattern than you usually see, even in other MongoDB drivers. They appear to be called after the query, but the sorting and limiting parameters are sent with the query and handled by the MongoDB server. This syntax pattern, called method chaining, is intended to make it easier to build queries.

You also want to display reviews in a consistent order, which means you have to sort your query results. If you want to sort by the number of helpful votes received by each review, you can specify that easily:

db.reviews.find({'product_id': product['_id']}).
                   sort({'helpful_votes': -1}).
                   limit(12)

In short, this query tells MongoDB to return the first 12 reviews sorted by the total number of helpful votes in descending order. Now, with skip, limit, and sort in place, you need to decide whether to paginate in the first place. For this, you can issue a count query. You then use the results of the count in combination with the page of reviews you want. Your queries for the product page are complete:

page_number = 1
product  = db.products.findOne({'slug': 'wheel-barrow-9092'})
category = db.categories.findOne({'_id': product['main_cat_id']})
reviews_count = db.reviews.count({'product_id': product['_id']})
reviews = db.reviews.find({'product_id': product['_id']}).
                           skip((page_number - 1) * 12).
                           limit(12).
                           sort({'helpful_votes': -1})

The order in which you call skip, limit, and sort in the JavaScript shell doesn’t matter.

These lookups should use indexes. You’ve already seen that slugs should have a unique index on them because they serve as alternate primary keys, and you know that all _id fields will automatically have a unique index for standard collections. But it’s also important that you have an index on any fields acting as references. In this case, that would include the user_id and product_id fields on the reviews collection.

Product listing page

With the queries for the product home pages in place, you can now turn to the product listing page. Such a page will display a given category with a browsable listing of products. Links to parent and sibling categories will also appear on the page.

A product listing page is defined by its category; thus, requests for the page will use the category’s slug:

page_number = 1
category = db.categories.findOne({'slug': 'gardening-tools'})
siblings = db.categories.find({'parent_id': category['_id']})
products = db.products.find({'category_id': category['_id']})
                           .skip((page_number - 1) * 12)
                           .limit(12)
                           .sort({'helpful_votes': -1})

Siblings are any other categories with the same parent ID, so the query for siblings is straightforward. Because products all contain an array of category IDs, the query for finding all products in a given category is also trivial. You can imagine providing alternative sort methods (by name, price, and so forth). For those cases, you change the sort field.

It’s important to consider whether these sorts will be efficient. You may choose to rely on your index to handle sorting for you, but as you add more sort options, the number of indexes grows, and the cost of maintaining those indexes may not be reasonable because each index makes writes slightly more expensive. We’ll discuss this further in chapter 8, but start thinking about these trade-offs now.

The product listing page has a base case, where you’re viewing the root-level categories but no products. A query against the categories collection for a null parent ID is all that’s required to get these root-level categories:

categories = db.categories.find({'parent_id': null})

5.1.2. Users and orders

The queries in the previous section were generally limited to _id lookups and sorts. In looking at users and orders, we’ll dig deeper because you’ll want to generate basic reports on orders. The example queries search documents that look like those from chapter 4, listing 4.1 (products) and listing 4.4 (users).

Let’s start with something simpler: user authentication. Users log in to the application by providing a username and password. Thus, you’d expect to see the following query frequently:

db.users.findOne({
    'username': 'kbanker',
    'hashed_password': 'bd1cfa194c3a603e7186780824b04419'})

If the user exists and the password is correct, you’ll get back an entire user document; otherwise, the query will return nothing. This query is acceptable, but why should you return the entire user document if you only want to check that the user exists? You can limit the fields returned using a projection:

db.users.findOne({
  'username': 'kbanker',
  'hashed_password': 'bd1cfa194c3a603e7186780824b04419'},
  {'_id': 1})

In the JavaScript shell you do a projection by passing in an additional argument: a hash of the fields you want with their values set to 1. We discuss projections more in section 5.2.2. If you’re already familiar with SQL and RDBMS, this is the difference between SELECT * and SELECT ID. The response now consists exclusively of the document’s _id field:

{ "_id": ObjectId("4c4b1476238d3b4dd5000001") }
Partial match queries in users

You might want to query the users collection in a few other ways, such as searching by name. Often you’ll want to perform a lookup on a single field, such as last_name:

db.users.find({'last_name': 'Banker'})

This approach works, but there are limits to searching for an exact match. For one, you might not know how to spell a given user’s name. In this case, you’ll want some way of querying for a partial match. Suppose you know that the user’s last name starts with Ba. MongoDB allows you to query using regular expressions:

db.users.find({'last_name': /^Ba/})

The regular expression /^Ba/ can be read as “the beginning of the line followed by a B followed by an a.” A prefix search like this one can take advantage of an index, but not all regular expression queries can use an index.

Querying specific ranges

When it comes to marketing to your users, you’ll most likely want to target ranges of users. For instance, if you want to query for all users residing in Upper Manhattan, you issue this range query on a user’s zip code:

db.users.find({'addresses.zip': {'$gt': 10019, '$lt': 10040}})

Recall that each user document contains an array of one or more addresses. This query will match a user document if any zip code among those addresses falls within the given range. You can use the $gte (greater than) and $lt (less than) operators to define this range. To make this query efficient, you’ll want an index defined on addresses.zip.

You’ll see more examples of querying this data in the next chapter, and later on, you’ll learn how to get insight from the data using MongoDB’s aggregation functions. But with this introduction under your belt, we’ll now look at MongoDB’s query language in some depth, explaining the syntax in general and each operator in particular.

5.2. MongoDB’s query language

It’s time we explore MongoDB’s query language in all its glory. We’ve already walked through some real-world query examples; this section is intended as a more comprehensive reference of MongoDB query capabilities. If you’re learning about MongoDB queries for the first time, it may be easier to skim this section and revisit it when you need to write more advanced queries for your application.

5.2.1. Query criteria and selectors

Query criteria allow you to use one or more query selectors to specify the query’s results. MongoDB gives you many possible selectors. This section provides an overview.

Selector matching

The simplest way to specify a query is with a selector whose key-value pairs literally match against the document you’re looking for. Here are a couple of examples:

db.users.find({'last_name': "Banker"})
db.users.find({'first_name': "Smith", birth_year: 1975})

The second query reads, “Find me all users such that the first_name is Smith and was born in 1975.” Note that whenever you pass more than one key-value pair, both must match; the query conditions function as a Boolean AND. If you want to express a Boolean OR, see the upcoming section on Boolean operators.

In MongoDB all text string matches are case sensitive. If you need to perform case-insensitive matches, consider using a regex term (explained later in this chapter, when we discuss the use of the i regex flag) or investigate the use of text search introduced in chapter 9.

Ranges

You frequently need to query for documents the values of which span a certain range. In most languages, you use <, <=, >, and >=. With MongoDB, you get the analogous set of operators $lt, $lte, $gt, and $gte. You’ve used these operators throughout the book, and they work as you’d expect. Table 5.1 shows the range query operators most commonly used in MongoDB.

Table 5.1. Summary of range query operators

Operator

Description

$lt Less than
$gt Greater than
$lte Less than or equal
$gte Greater than or equal

Beginners sometimes struggle with combining these operators. A common mistake is to repeat the search key:

db.users.find({'birth_year': {'$gte': 1985}, 'birth_year': {'$lte': 2015}})

The aforementioned query only takes into account the last condition. You can properly express this query as follows:

db.users.find({'birth_year': {'$gte': 1985, '$lte': 2015}})

You should also know how these work with different data types. Range queries will match values only if they have the same type as the value to be compared against.[1] For example, suppose you have a collection with the following documents:

1

Note that the numeric types—integer, long integer, and double—have type equivalence for these queries.

{ "_id" : ObjectId("4caf82011b0978483ea29ada"), "value" : 97 }
{ "_id" : ObjectId("4caf82031b0978483ea29adb"), "value" : 98 }
{ "_id" : ObjectId("4caf82051b0978483ea29adc"), "value" : 99 }
{ "_id" : ObjectId("4caf820d1b0978483ea29ade"), "value" : "a" }
{ "_id" : ObjectId("4caf820f1b0978483ea29adf"), "value" : "b" }
{ "_id" : ObjectId("4caf82101b0978483ea29ae0"), "value" : "c" }

You then issue the following query:

db.items.find({'value': {'$gte': 97}})

You may think that this query should return all six documents because the strings are numerically equivalent to the integers 97, 98, and 99. But this isn’t the case. As MongoDB is schemaless, this query returns the integer results only because the criteria supplied was an integer itself. If you want the string results, you must query with a string instead:

db.items.find({'value': {'$gte': "a"}})

You won’t need to worry about this type restriction as long as you never store multiple types for the same key within the same collection. This is a good general practice, and you should abide by it.

Set operators

Three query operators—$in, $all, and $nin—take a list of one or more values as their predicate, so these are called set operators. $in returns a document if any of the given values matches the search key. You might use this operator to return all products belonging to some discrete set of categories. Table 5.2 shows these set query operators.

Table 5.2. Summary of set operators

Operator

Description

$in Matches if any of the arguments are in the referenced set
$all Matches if all of the arguments are in the referenced set and is used in documents that contain arrays
$nin Matches if none of the arguments are in the referenced set

If the following list of category IDs

[
  ObjectId("6a5b1476238d3b4dd5000048"),
  ObjectId("6a5b1476238d3b4dd5000051"),
  ObjectId("6a5b1476238d3b4dd5000057")
]

corresponds to the lawnmowers, hand tools, and work clothing categories, the query to find all products belonging to these categories looks like this:

db.products.find({
    'main_cat_id': {
      '$in': [
        ObjectId("6a5b1476238d3b4dd5000048"),
        ObjectId("6a5b1476238d3b4dd5000051"),
        ObjectId("6a5b1476238d3b4dd5000057")
      ]
    }
  })

Another way of thinking about the $in operator is as a kind of Boolean inclusive OR against a single attribute. Expressed this way, the previous query might be read, “Find me all products of which the category is lawnmowers or hand tools or work clothing.” Note that if you need a Boolean OR over multiple attributes, you’ll want to use the $or operator, described in the next section:

  • $in is frequently used with lists of IDs.
  • $nin (not in)) returns a document only when none of the given elements matches. You might use $nin to find all products that are neither black nor blue:
    db.products.find({'details.color': {'$nin': ["black", "blue"]}})
  • $all matches if every given element matches the search key. If you wanted to find all products tagged as gift and garden, $all would be a good choice:
    db.products.find({'tags': {'$all': ["gift", "garden"]}})

Naturally, this query makes sense only if the tags attribute stores an array of terms, like this:

{
  'name': "Bird Feeder",
  'tags': [ "gift", "birds", "garden" ]
}

Selectivity is the ability of a query to narrow results using the index. The problem is that both $ne and $nin operators aren’t selective. Therefore, when using the set operators, keep in mind that $in and $all can take advantage of indexes, but $nin can’t and thus requires a collection scan. If you use $nin, try to use it in combination with another query term that does use an index. Better yet, find a different way to express the query. You may, for instance, be able to store an attribute whose presence indicates a condition equivalent to your $nin query. For example, if you commonly issue a query for {timeframe: {$nin: ['morning', 'afternoon']}}, you may be able to express this more directly as {timeframe: 'evening'}.

Boolean operators

MongoDB’s Boolean operators include $ne, $not, $or, $and, $nor, and $exists. Table 5.3 summarizes the Boolean operators.

Table 5.3. Summary of Boolean operators

Operator

Description

$ne Matches if the argument is not equal to the element
$not Inverts the result of a match
$or Matches if any of the supplied set of query terms is true
$nor Matches if none of the supplied set of query terms are true
$and Matches if all of the supplied set of query terms are true
$exists Matches if the element exists in the document.

$ne, the not equal to operator, works as you’d expect. In practice, it’s best used in combination with at least one other operator; otherwise, it’s likely to be inefficient because it can’t take advantage of indexes. For example, you might use $ne to find all products manufactured by Acme that aren’t tagged with gardening:

db.products.find({'details.manufacturer': 'Acme', tags: {$ne: "gardening"} })

$ne works on keys pointing to single values and to arrays, as shown in the example where you match against the tags array.

Whereas $ne matches the negation of a specified value, $not negates the result of another MongoDB operator or regular expression query. Most query operators already have a negated form ($in and $nin, $gt and $lte, and so on); $not is useful because it includes documents that aren’t evaluated by the given expression. Consider the following example:

db.users.find({'age': {'$not': {'$lte': 30}}})

As you’d expect, this query returns documents where age is greater than 30. It also returns documents with no age field, which makes it distinct from using the $gt operator in this case.

$or expresses the logical disjunction of two values for two different keys. This is an important point: if the possible values are scoped to the same key, use $in instead. Trivially, finding all products that are either blue or green looks like this:

db.products.find({'details.color': {$in: ['blue', 'Green']}})

But finding all products that are either blue or made by Acme requires $or:

db.products.find({
    '$or': [
      {'details.color': 'blue'},
      {'details.manufacturer': 'Acme'}
    ]
  })

$or takes an array of query selectors, where each selector can be arbitrarily complex and may itself contain other query operators. $nor works much the same as $or but is logically true only when none of its query selectors are true.

Like $or, the $and operator takes an array of query selectors. Because MongoDB interprets all query selectors containing more than one key by ANDing the conditions, you should use $and only when you can’t express an AND in a simpler way. For example, suppose you want to find all products that are tagged with gift or holiday and either gardening or landscaping. The only way to express this query is with the conjunction of two $in queries:

db.products.find({
    $and: [
      {
        tags: {$in: ['gift', 'holiday']}
      },
      {
        tags: {$in: ['gardening', 'landscaping']}
      }
    ]
  })
Querying for a document with a specific key

The final operator we’ll discuss in this section is $exists. This operator is necessary because collections don’t enforce a fixed schema, so you occasionally need a way to query for documents containing a particular key. Recall that you’d planned to use each product’s details attribute to store custom fields. You might, for instance, store a color field inside the details attribute. But if only a subset of all products specify a set of colors, then you can query for the ones that don’t like this:

db.products.find({'details.color': {$exists: false}})

The opposite query is also possible:

db.products.find({'details.color': {$exists: true}})

Here you’re checking whether the field exists in a document at all. Even if a field exists, it can still be set to null. Depending on your data and query, you may want to filter those values as well.

Matching subdocuments

Some of the entities in this book’s e-commerce data model have keys that point to a single embedded object. The product’s details attribute is one good example. Here’s part of the relevant document, expressed as JSON:

{
  _id:  ObjectId("4c4b1476238d3b4dd5003981"),
  slug: "wheel-barrow-9092",
  sku:  "9092",
  details: {
    model_num: 4039283402,
    manufacturer: "Acme",
    manufacturer_id: 432,
    color:  "Green"
  }
}

You can query such objects by separating the relevant keys with a . (dot). For instance, if you want to find all products manufactured by Acme, you can use this query:

db.products.find({'details.manufacturer': "Acme"});

Such queries can be specified arbitrarily deep. Supposing you had the following slightly modified representation

{
  _id: ObjectId("4c4b1476238d3b4dd5003981"),
  slug: "wheel-barrow-9092",
  sku: "9092",
  details: {
    model_num: 4039283402,
    manufacturer: {
      name: "Acme",
      id: 432
    },
    color:  "Green"
  }
}

the key in the query selector would contain two dots:

db.products.find({'details.manufacturer.id': 432});

But in addition to matching against an individual subdocument attribute, you can match an object as a whole. For example, imagine you’re using MongoDB to store stock market positions. To save space, you forgo the standard object ID and replace it with a compound key consisting of a stock symbol and a timestamp. Here’s how a representative document might look:[2]

2

In a potential high-throughput scenario, you’d want to limit document size as much as possible. You could accomplish this in part by using short key names. Thus you might use the key name o in place of open.

{
  _id: {
    sym: 'GOOG',
    date: 20101005
  }
  open:  40.23,
  high:  45.50,
  low:   38.81,
  close: 41.22
}

You could then find the summary of GOOG for October 5, 2010 with the following _id query:

db.ticks.find({'_id': {'sym': 'GOOG', 'date': 20101005}})

It’s important to realize that a query matching an entire object like this will perform a strict byte-by-byte comparison, which means that the order of the keys matters. The following query isn’t equivalent and won’t match the sample document:

db.ticks.find({'_id': {'date': 20101005, 'sym': 'GOOG'}})

Though the order of keys will be preserved in JSON documents entered via the shell, this isn’t necessarily true for other programming languages, and it’s safer to assume that order won’t be preserved. For example, hashes in Ruby 1.8 aren’t order-preserving. To preserve key order in Ruby 1.8, you must use an object of class BSON::OrderedHash instead:

doc = BSON::OrderedHash.new
doc['sym']  = 'GOOG'
doc['date'] = 20101005
@ticks.find(doc)

Be sure to check whether the language you’re using supports ordered dictionaries; if not, the language’s MongoDB driver will always provide an ordered alternative.

Arrays

Arrays give the document model much of its power. As you’ve seen in the e-commerce example, arrays are used to store lists of strings, object IDs, and even other documents. Arrays afford rich yet comprehensible documents; it stands to reason that MongoDB would let you query and index the array type with ease. And it’s true: the simplest array queries look like queries on any other document type, as you can see in table 5.4.

Table 5.4. Summary of array operators

Operator

Description

$elemMatch Matches if all supplied terms are in the same subdocument
$size Matches if the size of the array subdocument is the same as the supplied literal value

Let’s look at these arrays in action. Take product tags again. These tags are represented as a simple list of strings:

{
  _id:  ObjectId("4c4b1476238d3b4dd5003981"),
  slug: "wheel-barrow-9092",
  sku:  "9092",
  tags: ["tools", "equipment", "soil"]
}

Querying for products with the tag "soil" is trivial and uses the same syntax as querying a single document value:

db.products.find({tags: "soil"})

Importantly, this query can take advantage of an index on the tags field. If you build the required index and run your query with explain(), you’ll see that a B-tree cursor[3] is used:

3

The WiredTiger storage engine can support additional data structures for indexing. You can find more information about it in chapter 10.

db.products.ensureIndex({tags: 1})
db.products.find({tags: "soil"}).explain()

When you need more control over your array queries, you can use dot notation to query for a value at a particular position within the array. Here’s how you’d restrict the previous query to the first of a product’s tags:

db.products.find({'tags.0': "soil"})

It might not make much sense to query tags in this way, but imagine you’re dealing with user addresses. You might represent these with an array of subdocuments:

{
  _id: ObjectId("4c4b1476238d3b4dd5000001")
  username: "kbanker",
  addresses: [
    {
      name:    "home",
      street:  "588 5th Street",
      city:    "Brooklyn",
      state:   "NY",
      zip:     11215
    },
    {
      name:   "work",
      street: "1 E. 23rd Street",
      city:   "New York",
      state   "NY",
      zip     10010
    },
  ]
}

You might stipulate that the zeroth element of the array always be the user’s primary shipping address. Thus, to find all users whose primary shipping address is in New York, you could again specify the zeroth position and combine that with a dot to target the state field:

db.users.find({'addresses.0.state': "NY"})

As you can easily understand, you shouldn’t stipulate when writing production code.

You can just as easily omit the position and specify a field alone. The following query will return a user document if any of the addresses in the list is in New York:

db.users.find({'addresses.state': "NY"})

As before, you’ll want to index this dotted field:

db.users.ensureIndex({'addresses.state': 1})

Note that you use the same dot notation regardless of whether a field points to a subdocument or to an array of subdocuments. The dot notation is powerful, and the consistency is reassuring. But ambiguity can arise when querying against more than one attribute within an array of subobjects. For example, suppose you want to fetch a list of all users whose home address is in New York. Can you think of a way to express this query?

db.users.find({'addresses.name': 'home', 'addresses.state': 'NY'})

The problem with this query is that the field references aren’t restricted to a single address. In other words, this query will match as long as one of the addresses is designated as “home” and one is in New York, but what you want is for both attributes to apply to the same address. Fortunately, there’s a query operator for this. To restrict multiple conditions to the same subdocument, you use the $elemMatch operator. You can properly satisfy the query like this:

db.users.find({
    'addresses': {
      '$elemMatch': {
        'name': 'home',
        'state': 'NY'
      }
    }
  })

Logically, use $elemMatch only when you need to match two or more attributes in a subdocument.

Querying for an array by size

The only array operator left to discuss is the $size operator. This operator allows you to query for an array by its size. For example, if you want to find all users with exactly three addresses, you can use the $size operator like this:

db.users.find({'addresses': {$size: 3}})

As of this writing, the $size operator doesn’t use an index and is limited to exact matches (you can’t specify a range of sizes).[4] Therefore, if you need to perform queries based on the size of an array, you should cache the size in its own attribute within the document and update it manually as the array changes. For instance, you might consider adding an address_length field to your user document. You could then build an index on this field and issue all the range and exact match queries you require. A possible solution is to use the aggregation framework, which is described in chapter 6.

4

See https://jira.mongodb.org/browse/SERVER-478 for updates on this issue.

JavaScript query operators

If you can’t express your query with the tools described thus far, you may need to write some JavaScript. You can use the special $where operator to pass a JavaScript expression to any query, as summarized here:

  • $where Execute some arbitrary JavaScript to select a document

Within a JavaScript context, the keyword this refers to the current document. Let’s take a contrived example:

db.reviews.find({
    '$where': "function() { return this.helpful_votes > 3; }"
  })

There’s also an abbreviated form for simple expressions like this one:

db.reviews.find({'$where': "this.helpful_votes > 3"})

This query works, but you’d never want to use it because you can easily express it using other query operators. The problem is that JavaScript expressions can’t use an index, and they incur substantial overhead because they must be evaluated within a JavaScript interpreter context and are single-threaded. For these reasons, you should issue JavaScript queries only when you can’t express your query using other query operators. If you do need JavaScript, try to combine the JavaScript expression with at least one other query operator. The other query operator will pare down the result set, reducing the number of documents that must be loaded into a JavaScript context. Let’s look at a quick example to see how this might make sense.

Imagine that for each user, you’ve calculated a rating reliability factor. This is essentially an integer that, when multiplied by the user’s rating, results in a more normalized rating. Also suppose that you want to query a particular user’s reviews and return only a normalized rating greater than 3. Here’s how that query would look:

db.reviews.find({
    'user_id': ObjectId("4c4b1476238d3b4dd5000001"),
    '$where': "(this.rating * .92) > 3"
  })

This query meets both recommendations: it uses a standard query on a presumably indexed user_id field, and it employs a JavaScript expression that’s absolutely beyond the capabilities of other query operators. Keep in mind that sometimes using the Aggregation framework can make your life easier.

In addition to recognizing the attendant performance penalties, it’s good to be aware of the possibility of JavaScript injection attacks. An injection attack becomes possible whenever a user is allowed to enter code directly into a JavaScript query. An example is when the user submits a web form and the values are used directly in this sort of query. If the user sets the values of attribute or value, this query is unsafe:

@users.find({'$where' => "this.#{attribute} == #{value}"})

In this case, the values of attribute and value are inserted into a string, which is then evaluated to JavaScript. This approach is dangerous because users could potentially include JavaScript code in the values they send, giving them access to other data in the collection. This would result in a serious security breach if a malicious user was able to see data about other users. In general, you should always assume that your users can and will send you malicious data and plan accordingly.

Regular expressions

You saw near the beginning of the chapter that you can use a regular expression within a query. In that example, you used a prefix expression, /^Ba/, to find last names beginning with Ba, and I pointed out that this query would use an index. In fact, much more is possible. MongoDB is compiled with Perl Compatible Regular Expressions (PCRE; http://mng.bz/hxmh), which supports a huge gamut of regular expressions. The $regex operator is summarized here:

  • $regex Match the element against the supplied regex term

With the exception of the prefix-style query just described, regular expressions queries can’t use an index and take longer to execute than most selectors. We recommend using them sparingly. Here’s how you might query a given user’s reviews for text containing the words best or worst. Note that you use the i regex flag[5] to indicate case-insensitivity:

5

The case-insensitive option will always prevent an index from being used to serve the query, even in the case of a prefix match.

db.reviews.find({
    'user_id': ObjectId("4c4b1476238d3b4dd5000001"),
    'text': /best|worst/i
  })

Using the case-insensitive flag has penalties; it excludes the use of any indexes, which in MongoDB are always case-sensitive. If you want to use case-insensitive searches on large numbers of documents, you should use the new text search capability supplied in version 2.4 or later, or integrate an external text search engine. See chapter 9 for an explanation of MongoDB’s searching capabilities.

If the language you’re using has a native regex type, you can use a native regex object to perform the query. You can express an identical query in Ruby like this:

@reviews.find({
    :user_id => BSON::ObjectId("4c4b1476238d3b4dd5000001"),
    :text => /best|worst/i
  })

Even though the regex is defined locally, it’s evaluated on the MongoDB server.

If you’re querying from an environment that doesn’t support a native regex type, you can use the special $regex and $options operators. Using these operators from the shell, you can express the query in yet another way:

db.reviews.find({
    'user_id': ObjectId("4c4b1476238d3b4dd5000001"),
    'text': {
      '$regex': "best|worst",
      '$options': "i"}
  })

MongoDB is a case-sensitive system, and when using a regex, unless you use the /i modifier (that is, /best|worst/i), the search will have to exactly match the case of the fields being searched. But one caveat is that if you do use /i, it will disable the use of indexes. If you want to do indexed case-insensitive search of the contents of string fields in documents, consider either storing a duplicate field with the contents forced to lowercase specifically for searching or using MongoDB’s text search capabilities, which can be combined with other queries and does provide an indexed case-insensitive search.

Miscellaneous query operators

Two more query operators aren’t easily categorized and thus deserve their own section. The first is $mod, which allows you to query documents matching a given modulo operation, and the second is $type, which matches values by their BSON type. Both are detailed in table 5.5.

Table 5.5. Summary of miscellaneous operators

Operator

Description

$mod [(quotient),(result)] Matches if the element matches the result when divided by the quotient
$type Matches if the element type matches a specified BSON type
$text Allows you to performs a text search on the content of the fields indexed with a text index

For instance, $mod allows you to find all order subtotals that are evenly divisible by 3 using the following query:

db.orders.find({subtotal: {$mod: [3, 0]}})

You can see that the $mod operator takes an array having two values. The first is the divisor and the second is the expected remainder. This query technically reads, “Find all documents with subtotals that return a remainder of 0 when divided by 3.” This is a contrived example, but it demonstrates the idea. If you end up using the $mod operator, keep in mind that it won’t use an index.

The second miscellaneous operator, $type, matches values by their BSON type. I don’t recommend storing multiple types for the same field within a collection, but if the situation ever arises, you have a query operator that lets you test against type.

Table 5.6 shows the type numbers associated with each element type used in MongoDB. The example shown is how a member of that type would appear in the JavaScript console. For example, other MongoDB drivers may have a different way of storing the equivalent of an ISODate object.

Table 5.6. BSON types

BSON type

$type number

Example

Double 1 123.456
String (UTF-8) 2 “Now is the time”
Object 3 { name:"Tim",age:"myob" }
Array 4 [123,2345,"string"]
Binary 5 BinData(2,"DgAAAEltIHNvbWUgYmluYXJ5")
ObjectId 7 ObjectId("4e1bdda65025ea6601560b50")
Boolean 8 true
Date 9 ISODate("2011-02-24T21:26:00Z")
Null 10 null
Regex 11 /test/i
JavaScript 13 function() {return false;}
Symbol 14 Not used; deprecated in the standard
Scoped JavaScript 15 function (){return false;}
32-bit integer 16 10
Timestamp 17 { "t" : 1371429067,
"i" : 0
}
64-bit integer 18 NumberLong(10)
Maxkey 127 {"$maxKey": 1}
Minkey 255 { "$minKey" : 1}
Maxkey 128 {"maxkey" : { "$maxKey" : 1 }}

There are a couple of elements in table 5.6 worth mentioning. maxkey and minkey are used to insert a “virtual” value that’s the same as the maximum or minimum value in the index. This means that it can be used to force the document to be sorted out as the first or last item when using a sort index. Gone are the days of adding a field with “aardvark” in it to the collection to force a document to sort to the front. Most of the language drivers have a means for adding a minkey or maxkey type.

Scoped JavaScript and JavaScript look identical in the table, but this is only because the console doesn’t display the scope, which is a dictionary of key-value pairs supplied with the JavaScript code fragment. Scope means the context under which the function is executed. In other words, the function will be able to see the variables defined in the scope dictionary and use them during execution.

Finally, the symbol type has no representation. That’s because in most languages it’s not used—it’s only used where the language has a distinct type for “keys.” For instance, in Ruby there’s a difference between "foo" and :foo—the latter is a symbol. The Ruby driver will store any key as a symbol.

BSON symbol types

As far as querying is concerned, the MongoDB server will treat a BSON symbol type in the same way it treats a string; it’s only when the document is retrieved that a distinct symbol type mapping to the language key type is done. Note also that the symbol type is deprecated in the latest BSON spec (http://bsonspec.org) and may disappear at any moment. Regardless of the language you write your data with, you’ll be able to retrieve it in any other language with a BSON implementation.

5.2.2. Query options

All queries require a query selector. Even if empty, the query selector essentially defines the query. But when issuing a query, you have a variety of query options to choose from that allow you to further constrain the result set. Let’s look at those options next.

Projections

You can use a projection to select a subset of fields to return from each document in a query result set. Especially in cases where you have large documents, using a projection will minimize the costs of network latency and deserialization. The only operator, $slice, is summarized here:

  • $slice Select a subset of a document to be returned

Projections are most commonly defined as a set of fields to return:

db.users.find({}, {'username': 1})

This query returns user documents excluding all but two fields: the username and the _id field, which is a special case and always included by default.

In some situations you may want to specify fields to exclude instead. For instance, this book’s user document contains shipping addresses and payment methods, but you don’t usually need these. To exclude them, add those fields to the projection with a value of 0:

db.users.find({}, {'addresses': 0, 'payment_methods': 0})

In your projection you should either do inclusions or exclusions, though the _id field is a special case. You can exclude the _id field in the same way, by setting the value to 0 in the projection document.

In addition to including and excluding fields, you can return a range of values stored in an array. For example, you might want to store product reviews within the product document itself. In this case, you’d still want to be able to paginate those reviews, and for that you could use the $slice operator. To return the first 12 reviews, or the last 5, you’d use $slice like this:

db.products.find({}, {'reviews': {$slice: 12}})
db.products.find({}, {'reviews': {$slice: -5}})

$slice can also take a two-element array the values of which represent numbers to skip and limit, respectively. Here’s how to skip the first 24 reviews and limit the number of reviews to 12:

db.products.find({}, {'reviews': {$slice: [24, 12]}})

Finally, note that using $slice won’t prevent other fields from being returned. If you want to limit the other fields in the document, you must do so explicitly. For example, here’s how you can modify the previous query to return only the reviews and the review rating :

db.products.find({}, {'reviews': {'$slice': [24, 12]}, 'reviews.rating': 1})
Sorting

As we touched on early in this chapter, you can sort any query result by one or more fields in ascending or descending order. A simple sort of reviews by rating, descending from highest to lowest, looks like this:

db.reviews.find({}).sort({'rating': -1})

Naturally, it might be more useful to sort by helpfulness and then by rating:

db.reviews.find({}).sort({'helpful_votes':-1, 'rating': -1})

In compound sorts like this, the order does matter. As noted elsewhere, JSON entered via the shell is ordered. Because Ruby hashes aren’t ordered, you indicate sort order in Ruby with an array of arrays, which is ordered:

@reviews.find({}).sort([['helpful_votes', -1], ['rating', -1]])

The way you specify sorts in MongoDB is straightforward; understanding how indices can help improve sorting speeds is critical to using them well. We’ll get to that in chapter 8, but feel free to skip ahead if you’re using sorts heavily now.

Skip and limit

There’s nothing mysterious about the semantics of skip and limit. These query options should always work as you expect. But you should beware of passing large values (say, values greater than 10,000) for skip because serving such queries requires scanning over a number of documents equal to the skip value. For example, imagine you’re paginating a million documents sorted by date, descending, with 10 results per page. This means that the query to display the 50,000th page will include a skip value of 500,000, which is incredibly inefficient. A better strategy is to omit the skip altogether and instead add a range condition to the query that indicates where the next result set begins. Thus, this query

db.docs.find({}).skip(500000).limit(10).sort({date: -1})

becomes this:

previous_page_date = new Date(2013, 05, 05)
db.docs.find({'date': {'$gt': previous_page_date}}).limit(10).sort({'date': -1})

This second query will scan far fewer items than the first. The only potential problem is that if date isn’t unique for each document, the same document may be displayed more than once. There are many strategies for dealing with this, but the solutions are left as exercises for the reader.

There’s another set of query types that you can perform on MongoDB data: geospatial queries, which are used to index and retrieve geographical or geometric data and are typically used for mapping and location-aware applications.

5.3. Summary

Queries make up a critical corner of the MongoDB interface. Once you’ve skimmed this chapter’s material, you’re encouraged to put the query mechanisms to the test. If you’re ever unsure of how a particular combination of query operators will serve you, the shell is always a ready test bed.

MongoDB also supports query modifiers that are meta-operators that let you modify the output or behavior of a query. You can find more about them at http://docs.mongodb.org/manual/reference/operator/query-modifier/.

You’ll use MongoDB queries consistently from now on, and the next two chapters are a good reminder of that. You’ll tackle aggregation, document updates, and deletes. Because queries play a key role in most updates, you can look forward to yet more exploration of the query language.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset