In the previous section, we discussed the ability to index nested documents along with their parent. Even though nested documents are indexed as separate documents in the index, we can't change a single nested document on its own (unless we use the update API). ElasticSearch also allows us to have a real parent-child relationship, and we will look at it in the following sections.
Let's use our previous example with the clothing store. This time, we would like to be able to update sizes and colors without reindexing the whole document after each change. To do that, we will use the parent-child functionality of ElasticSearch.
So now, the only field we need in our parent document is the name. To create our mapping in the shop index, we would run the following command:
curl -XPUT 'localhost:9200/shop/cloth/_mapping' -d '{ "cloth" : { "properties" : { "name" : {"type" : "string", "store" : "yes", "index" : "analyzed"} } } }'
Now let's create the child mappings. To do that, we need to add the _parent property with the name of the parent type, which is cloth in our case. So, the command that creates the variation type would look like the following:
curl -XPUT 'localhost:9200/shop/variation/_mapping' -d '{ "variation" : { "_parent" : { "type" : "cloth" }, "properties" : { "size" : {"type" : "string", "index" : "not_analyzed"}, "color" : {"type" : "string", "index" : "not_analyzed"} } } }'
And that's all. You don't need to specify which field will be used to connect a child document to the parent because, by default, ElasticSearch will use the unique identifier for that, and if you remember from the previous chapters, that information is present in the index by default.
Now let's index our parent document. It's very simple; we just run the usual indexing command, such as the following:
curl -XPOST 'localhost:9200/shop/cloth/1' -d '{ "name" : "Test document" }'
If you look at the preceding command, you'll notice that our document will be given the identifier of 1.
In order to index child documents, we need to provide information about the parent document by using the parent request parameter and setting its value to the identifier of the parent document. So, in order to index two child documents for our parent document, we would run the following commands:
curl -XPOST 'localhost:9200/shop/variation/1000?parent=1' -d '{ "color" : "red", "size" : "XXL" }'
And:
curl -XPOST 'localhost:9200/shop/variation/1001?parent=1' -d '{ "color" : "black", "size" : "XL" }'
And that's all. We've indexed two additional documents, which are of a new type, but we've specified that our documents have a parent.
We've indexed our data and now we need to use appropriate queries in order to match documents with the data stored in their children. However, please note that when running queries against parents, child documents won't be returned and vice versa.
So if we would like to get clothes that are of XXL size and in red, we would run the following query:
{ "query" : { "has_child" : { "type" : "variation", "query" : { "bool" : { "must" : [ { "term" : { "size" : "XXL" } }, { "term" : { "color" : "red" } } ] } } } } }
The query is quite simple: it is of the has_child type, which tells ElasticSearch that we want to search in the child documents. In order to specify which type of child documents we are interested in, we set the type property to the name of the child type. Then we have a standard bool query, which we've already discussed.
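To make the matching semantics concrete, here is a minimal in-memory sketch in Python. It does not reflect how ElasticSearch executes the query internally; the has_child function and the second parent with its children are assumptions introduced only for illustration. What it shows is that both term conditions must be satisfied by the same child document for the parent to be returned.

```python
# Illustrative model only: plain dictionaries stand in for the index.
parents = {
    "1": {"name": "Test document"},
    "2": {"name": "Another document"},  # hypothetical, added for contrast
}

# Each child records the identifier of its parent.
children = [
    {"parent": "1", "color": "red", "size": "XXL"},
    {"parent": "1", "color": "black", "size": "XL"},
    {"parent": "2", "color": "red", "size": "XL"},     # hypothetical
    {"parent": "2", "color": "black", "size": "XXL"},  # hypothetical
]


def has_child(parents, children, must):
    """Return ids of parents that have at least one child matching every term in `must`."""
    matching = set()
    for child in children:
        # All conditions are checked against one and the same child document.
        if all(child.get(field) == value for field, value in must.items()):
            matching.add(child["parent"])
    return sorted(pid for pid in parents if pid in matching)


result = has_child(parents, children, {"size": "XXL", "color": "red"})
print(result)
```

Parent 2 has a red child and an XXL child, but no single child that is both, so under this model only parent 1 is returned.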
There is one additional query that returns parent documents but is run against child documents: the top_children query. That query can be used to run against a specified number of child documents. Let's look at the following query:
{ "query" : { "top_children" : { "type" : "variation", "query" : { "term" : { "size" : "XXL" } }, "score" : "max", "factor" : 10, "incremental_factor" : 2 } } }
The preceding query will first be run against a total of 100 child documents (factor multiplied by the default size of 10). If 10 parent documents are found (because the default size parameter is equal to 10), they will be returned and the query execution will end. However, if fewer parents are returned and there are still child documents that were not queried, another 20 documents will be queried (the incremental_factor parameter multiplied by the result's size), and so on until the requested number of parent documents is found or there are no child documents left to query.
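The expansion loop described above can be sketched in Python as follows. This is a simplified simulation based only on the description in this section, not ElasticSearch's actual implementation, and the exact expansion rule inside ElasticSearch may differ. The function name simulate_top_children and the sample data are hypothetical; the input is assumed to be a list of parent identifiers, one per matching child, already ordered by child score.

```python
def simulate_top_children(child_parent_ids, size=10, factor=10, incremental_factor=2):
    """Simulate how many child documents are examined to collect `size` parents.

    Returns a tuple: (list of distinct parent ids found, number of children examined).
    """
    examined = 0
    window = factor * size  # first round: factor * size child documents
    parents = []
    seen = set()
    while True:
        batch = child_parent_ids[examined:examined + window]
        for parent_id in batch:
            if parent_id not in seen:
                seen.add(parent_id)
                parents.append(parent_id)
        examined += len(batch)
        # Stop when enough parents were found or no children are left.
        if len(parents) >= size or examined >= len(child_parent_ids):
            break
        window = incremental_factor * size  # later rounds: incremental_factor * size
    return parents[:size], examined


# Hypothetical data: the first 100 children belong to only 5 parents,
# the remaining 50 children each belong to a different parent.
sample = [i % 5 for i in range(100)] + list(range(5, 55))
found, examined = simulate_top_children(sample)
print(found, examined)
```

With this sample, the first round of 100 children yields only 5 parents, so one more round of 20 children is needed before 10 parents are available.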
The top_children query lets us specify how the score should be calculated by using the score parameter, which accepts the max, sum, or avg values. Because ElasticSearch wraps the top_children query in the custom filters score query, please refer to that query in order to see what the values mean (this query has been discussed in the Custom score query and Custom filters score query sections in Chapter 2, Searching Your Data).
If we would like to return child documents that match the given data in the parent document, we should use the has_parent query. It is similar to the has_child query; however, instead of the type property, we specify the parent_type parameter with the value of the parent document type. For example, the following query will return both the child documents we've indexed:
{ "query" : { "has_parent" : { "parent_type" : "cloth", "query" : { "term" : { "name" : "test" } } } } }
If you would like to use parent-child queries as filters, you can do so. There are has_child and has_parent filters that have the same functionality as the queries with the corresponding names. In fact, ElasticSearch wraps those filters in the constant score query to allow them to be used as queries.
When using the ElasticSearch parent-child functionality, one has to be aware of the performance impact that it has. The first thing you need to remember is that the parent and the child documents need to be stored in the same shard in order for the queries to work. If you happen to have a high number of child documents for a single parent, you may end up with shards not having a similar number of documents. Because of that, your query performance can be lower on one of the nodes resulting in entire queries being slower. Also please remember that the parent-child queries will be slower than those run against documents that don't have a relationship between them.
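The same-shard requirement can be pictured with a small Python sketch. It assumes that a document's shard is chosen by hashing a routing value and that a child document is routed by its parent's identifier rather than its own. The zlib.crc32 hash, the shard count of 5, and the helper names are stand-ins chosen for illustration; ElasticSearch uses its own internal hash function.

```python
import zlib
from collections import Counter

NUM_SHARDS = 5  # assumed number of primary shards


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from a routing value (crc32 is only a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def parent_shard(parent_id):
    # A parent document is routed by its own identifier.
    return shard_for(parent_id)


def child_shard(child_id, parent_id):
    # A child document is routed by the parent's identifier, not its own,
    # so it always lands on the same shard as its parent.
    return shard_for(parent_id)


shard_counts = Counter()
shard_counts[parent_shard("1")] += 1
for child_id in ("1000", "1001"):
    shard_counts[child_shard(child_id, "1")] += 1

print(dict(shard_counts))
```

Because every child follows its parent, a parent with a very large number of children makes its shard grow faster than the others, which is the imbalance described above.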
The second very important thing is that when running queries such as has_child, ElasticSearch needs to preload and cache document identifiers. Those identifiers are stored in memory, so you have to be sure that you have given ElasticSearch enough memory to hold them. Otherwise, you can expect OutOfMemory exceptions to be thrown and your nodes, or even the whole cluster, to stop being operational.
Finally, as we mentioned, the first query will preload and cache the document identifiers, and that takes time. In order to improve the performance of the first queries that use the parent-child relationship, the Warmer API can be used. You can find more information about how to add warming queries to ElasticSearch in the Warming up section of Chapter 8, Dealing with Problems.