Using parent-child relationships

In the previous section, we discussed indexing nested documents along with their parent. Even though nested documents are stored as separate documents in the index, we can't change a single nested document on its own (unless we use the update API). ElasticSearch also supports a real parent-child relationship, and we will look at it in the following sections.

Mappings and indexing

Let's use our previous example with the clothing store. This time, we would like to be able to update sizes and colors without reindexing the whole document after each change. To do that, we will use the parent-child functionality of ElasticSearch.

Creating parent mappings

The only field we now need in our parent document is the name. To create the mapping for the cloth type in the shop index, we would run the following command:

curl -XPUT 'localhost:9200/shop/cloth/_mapping' -d '{
 "cloth" : {
  "properties" : {
   "name" : {"type" : "string", "store" : "yes", "index" : "analyzed"}
  }
 }
}'

Creating child mappings

Now let's create the child mappings. In order to do that, we will need to add the _parent property with the name of the parent type, which is cloth in our case. So, the command that creates the variation type would look like the following:

curl -XPUT 'localhost:9200/shop/variation/_mapping' -d '{
 "variation" : {
  "_parent" : { "type" : "cloth" },
  "properties" : {
   "size" : {"type" : "string", "index" : "not_analyzed"},
   "color" : {"type" : "string", "index" : "not_analyzed"}
  }
 }
}'

And that's all. You don't need to specify which field will be used to connect a child document to the parent because, by default, ElasticSearch will use the unique identifier for that, and if you remember from the previous chapters, that information is present in the index by default.

Parent document

Now let's index our parent document. It's very simple; we just run the usual indexing command, for example:

curl -XPOST 'localhost:9200/shop/cloth/1' -d '{
 "name" : "Test document"
}'

If you look at the above command, you'll notice that our document will be given the identifier of 1.

Child documents

In order to index child documents, we need to provide information about the parent document with the use of the parent request parameter and set that parameter value to the identifier of the parent document. So in order to index two child documents to our parent document, we would need to run the following commands:

curl -XPOST 'localhost:9200/shop/variation/1000?parent=1' -d '{
 "color" : "red",
 "size" : "XXL"
}'

And:

curl -XPOST 'localhost:9200/shop/variation/1001?parent=1' -d '{
 "color" : "black",
 "size" : "XL"
}'

And that's all. We've indexed two additional documents, which are of a new type, but we've specified that our documents have a parent.
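
The pattern shared by the two commands above can be sketched in Python. This is only an illustration: the helper name child_index_request is hypothetical, it merely builds the pieces of the HTTP request and does not contact a cluster, and the host, index, and type names are the ones used in this section's examples.

```python
import json
from urllib.parse import quote, urlencode


def child_index_request(child_id, parent_id, doc,
                        host="localhost:9200", index="shop",
                        child_type="variation"):
    """Build (method, url, body) for indexing a child under a given parent.

    Hypothetical helper: nothing is sent over the network.
    """
    # The parent request parameter carries the identifier of the parent document.
    query_string = urlencode({"parent": parent_id})
    url = "http://{}/{}/{}/{}?{}".format(
        host, index, child_type, quote(str(child_id)), query_string)
    return "POST", url, json.dumps(doc)


method, url, body = child_index_request(
    1000, 1, {"color": "red", "size": "XXL"})
print(method, url)
print(body)
```

Changing a single variation later means sending only that small child document again with the same parent parameter; the parent document is not touched.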

Querying

We've indexed our data and now we need to use appropriate queries in order to match documents with the data stored in their children. However, please note that when running queries against parents, child documents won't be returned and vice versa.

Querying for data in the child documents

So if we would like to get clothes that are of XXL size and in red, we would run the following query:

{
 "query" : {
  "has_child" : {
   "type" : "variation",
   "query" : {
    "bool" : {
     "must" : [
      { "term" : { "size" : "XXL" } }, 
      { "term" : { "color" : "red" } }
     ]
    }
   }
  }
 }
}

The query is quite simple: it is of the has_child type, which tells ElasticSearch that we want to search in the child documents. To specify which type of child documents we are interested in, we set the type property to the name of the child type. Then we have a standard bool query, which we've already discussed. The results contain the parent documents (the clothes) that have at least one matching child.
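
If the query body is assembled in application code rather than written by hand, it might look like the following Python sketch. The function name has_child_query is an assumption made for this illustration; the structure it returns is the same as the query shown above.

```python
import json


def has_child_query(child_type, terms):
    """Return a has_child query body requiring every given term on the child.

    Hypothetical helper; terms is a list of (field, value) pairs.
    """
    must = [{"term": {field: value}} for field, value in terms]
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"bool": {"must": must}},
            }
        }
    }


body = has_child_query("variation", [("size", "XXL"), ("color", "red")])
print(json.dumps(body, indent=1))
```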

The top children query

There is one additional query that returns parent documents but is run against child documents: the top_children query. It runs the child query against a limited number of child documents. Let's look at the following query:

{
 "query" : {
  "top_children" : {
   "type" : "variation",
   "query" : {
    "term" : { "size" : "XXL" }
   },
   "score" : "max",
   "factor" : 10, 
   "incremental_factor" : 2
  }
 }
}

The preceding query will first be run against a total of 100 child documents (factor multiplied by the default size of 10). If 10 parent documents are found (because the default size parameter is equal to 10), then those will be returned and the query execution will end. However, if fewer parents are found and there are still child documents that were not queried, then another 20 documents will be queried (the incremental_factor parameter multiplied by the result's size), and so on, until the requested number of parent documents is found or there are no child documents left to query.
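
The widening procedure described above can be simulated with a few lines of Python. This sketch follows the prose of this section only; it is not taken from the ElasticSearch source code, and the real implementation may widen the search differently. The function name and the input format (a list with the parent identifier of each matching child, best match first) are assumptions made for the illustration.

```python
def top_children(child_hits, size=10, factor=10, incremental_factor=2):
    """Simulate the top_children widening loop as described in the text.

    child_hits: parent IDs of the matching child documents, best match first.
    Returns (parents, number_of_child_documents_examined).
    """
    window = factor * size  # first pass: 10 * 10 = 100 child documents
    while True:
        examined = child_hits[:window]
        # Distinct parents in order of first appearance.
        parents = list(dict.fromkeys(examined))
        if len(parents) >= size or window >= len(child_hits):
            return parents[:size], len(examined)
        # Not enough parents yet: look at 2 * 10 = 20 more child documents.
        window += incremental_factor * size


# 150 matching children: the first 100 belong to only 8 parents,
# and the next 20 introduce parents 8 and 9.
hits = [i % 8 for i in range(100)] + [8 + i % 2 for i in range(20)] + [10] * 30
parents, examined = top_children(hits)
print(parents, examined)
```

In this made-up data set, the first pass over 100 child documents finds only 8 parents, so the window grows to 120 child documents, at which point 10 parents are available and the loop stops.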

The top_children query lets us specify how the score should be calculated by using the score parameter, which can be set to max, sum, or avg. Because ElasticSearch wraps the top_children query in the custom filters score query, please refer to that query to see what the values mean (it was discussed in the Custom score query and Custom filters score query sections in Chapter 2, Searching Your Data).

Querying for data in the parent documents

If we would like to return child documents that match the given data in the parent document, we should use the has_parent query. It is similar to the has_child query, but instead of the type property, we specify the parent_type parameter with the value of the parent document type. For example, the following query will return both the child documents we've indexed. Note the lowercase test in the term query: the name field is analyzed, so Test document is indexed as lowercase terms, while the term query itself is not analyzed.

{
 "query" : {
  "has_parent" : {
   "parent_type" : "cloth",
   "query" : {
    "term" : { "name" : "test" }
   }
  }
 }
}

Parent-child relationship and filtering

Parent-child queries can also be used as filters. There are has_child and has_parent filters that have the same functionality as the queries with the corresponding names. In fact, ElasticSearch wraps those filters in the constant score query to allow them to be used as queries.
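
As a sketch of the filter form, the following Python snippet places a has_child filter inside a filtered query. It assumes the filtered query available in the ElasticSearch versions this book covers; the helper name has_child_filtered is hypothetical and introduced only for this example.

```python
import json


def has_child_filtered(child_type, child_query):
    """Return a filtered query whose filter part is a has_child filter.

    Hypothetical helper; assumes the filtered query of the ElasticSearch
    versions discussed in this book.
    """
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "has_child": {"type": child_type, "query": child_query}
                },
            }
        }
    }


filtered_body = has_child_filtered("variation", {"term": {"size": "XXL"}})
print(json.dumps(filtered_body, indent=1))
```

Because a filter does not influence scoring, this form suits cases where we only care whether a matching child exists.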

Performance considerations

When using the ElasticSearch parent-child functionality, you have to be aware of its performance impact. The first thing to remember is that the parent and its child documents need to be stored in the same shard for the queries to work. If a single parent has a very high number of child documents, you may end up with shards that hold noticeably different numbers of documents. Because of that, query performance can be lower on one of the nodes, which slows down the entire query. Also, please remember that parent-child queries will be slower than queries run against documents that have no relationship between them.
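
The same-shard requirement follows from routing: a child document is routed by the identifier of its parent instead of its own identifier. The Python sketch below illustrates the idea under loud assumptions: the CRC32 hash is only a stand-in (ElasticSearch uses its own hash function), the shard count of 5 is assumed, and the function name shard_for is hypothetical.

```python
import zlib


def shard_for(doc_id, parent=None, number_of_shards=5):
    """Pick a shard for a document (illustrative stand-in, not the real hash).

    A child is routed by its parent's identifier, so it always lands
    in the same shard as the parent.
    """
    routing_value = parent if parent is not None else doc_id
    checksum = zlib.crc32(str(routing_value).encode("utf-8"))
    return checksum % number_of_shards


parent_shard = shard_for("1")
child_shards = [shard_for(child_id, parent="1") for child_id in ("1000", "1001")]
print(parent_shard, child_shards)
```

Every child of parent 1 ends up in the shard of parent 1, which is also why a parent with a very large number of children makes its shard larger than the others.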

The second very important thing is that when running queries such as has_child, ElasticSearch needs to preload and cache document identifiers. Those identifiers are stored in memory, so you have to be sure that you have given ElasticSearch enough memory to hold them. Otherwise, you can expect OutOfMemory exceptions to be thrown and your nodes, or even the whole cluster, to stop being operational.

Finally, as we mentioned, the first query will preload and cache the document identifiers, and that takes time. To improve the performance of the first queries that use the parent-child relationship, the Warmer API can be used. You can find more information about how to add warming queries to ElasticSearch in the Warming up section of Chapter 8, Dealing with Problems.
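
As a rough sketch, a warming query for our example could be registered with a request like the one built below. It assumes the index warmers endpoint of the ElasticSearch versions this book covers (a PUT to the _warmer path of the index); the warmer name variation_warmer and the helper warmer_request are made up for this illustration, and nothing is sent to a cluster.

```python
import json


def warmer_request(index, warmer_name, query_body, host="localhost:9200"):
    """Build (method, url, body) for registering a warming query.

    Hypothetical helper; assumes the _warmer endpoint of the ElasticSearch
    versions discussed in this book.
    """
    url = "http://{}/{}/_warmer/{}".format(host, index, warmer_name)
    return "PUT", url, json.dumps(query_body)


# A has_child query that touches the parent-child data, so that the
# identifiers are loaded before the first real query arrives.
warm_query = {
    "query": {
        "has_child": {"type": "variation", "query": {"match_all": {}}}
    }
}
w_method, w_url, w_body = warmer_request("shop", "variation_warmer", warm_query)
print(w_method, w_url)
print(w_body)
```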
