Nested objects look similar to plain objects but they differ in mapping and the way they are stored internally in Elasticsearch.
We will work with the same Twitter data but this time we will index it in a nested structure. We will have a user as our root object and every user can have multiple tweets as nested documents. Indexing this kind of data without using nested mapping will lead to problems, as shown in the following example:
PUT /twitter/tweet/1
{
  "user": { "screen_name": "d_bharvi", "followers_count": "2000", "created_at": "2012-06-05" },
  "tweets": [
    { "id": "121223221", "text": "understanding nested relationships", "created_at": "2015-09-05" },
    { "id": "121223222", "text": "NoSQL databases are awesome", "created_at": "2015-06-05" }
  ]
}
PUT /twitter/tweet/2
{
  "user": { "screen_name": "d_bharvi", "followers_count": "2000", "created_at": "2012-06-05" },
  "tweets": [
    { "id": "121223223", "text": "understanding nested relationships", "created_at": "2015-09-05" },
    { "id": "121223224", "text": "NoSQL databases are awesome", "created_at": "2015-09-05" }
  ]
}
Now, if we want to query all the tweets that are about NoSQL and have been created on 2015-09-05, we would use the following code:
GET twitter/tweet/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "tweets.text": "NoSQL" } },
        { "term": { "tweets.created_at": "2015-09-05" } }
      ]
    }
  }
}
The preceding query returns both documents in the response, even though only the second document contains a NoSQL tweet created on 2015-09-05. The reason is that Elasticsearch internally stores objects in the following way:
{tweets.id : ["121223221","121223222","121223223","121223224"], tweets.text : ["understanding nested relationships",........], tweets.created_at : ["2015-09-05","2015-06-05","2015-09-05","2015-09-05"]}
All the fields of the tweet objects are flattened into arrays, which loses the association between each tweet's text and its creation date. Because of this, the previous query returned the wrong results.
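The effect of flattening can be imitated in plain Python. The following is only an illustrative sketch, not Elasticsearch code: the helper names flatten and flat_match are hypothetical, the word matching is a crude stand-in for a match query, and the data is the first document from the example above taken on its own.

```python
# Illustrative simulation only; Elasticsearch does not expose these helpers.
doc1_tweets = [
    {"id": "121223221", "text": "understanding nested relationships",
     "created_at": "2015-09-05"},
    {"id": "121223222", "text": "NoSQL databases are awesome",
     "created_at": "2015-06-05"},
]


def flatten(objects, prefix):
    """Collapse a list of objects into one multi-valued list per field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(prefix + "." + key, []).append(value)
    return flat


def flat_match(flat, word, date):
    """Check each condition against the whole field array independently."""
    text_hit = any(word.lower() in text.lower().split()
                   for text in flat["tweets.text"])
    date_hit = date in flat["tweets.created_at"]
    return text_hit and date_hit


flat_doc1 = flatten(doc1_tweets, "tweets")
print(flat_doc1["tweets.created_at"])  # ['2015-09-05', '2015-06-05']
# True, although no single tweet in this document satisfies both conditions
print(flat_match(flat_doc1, "NoSQL", "2015-09-05"))
```

The NoSQL tweet supplies the text match and the other tweet supplies the date match, so the document as a whole is reported as a hit.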
The mapping for nested objects can be defined in the following way:
PUT twitter_nested/users/_mapping
{
"properties": {
"user": {
"type": "object",
"properties": {
"screen_name": {
"type": "string"
},
"followers_count": {
"type": "integer"
},
"created_at": {
"type": "date"
}
}
},
"tweets": {
"type": "nested",
"properties": {
"id": {
"type": "string"
},
"text": {
"type": "string"
},
"created_at": {
"type": "date"
}
}
}
}
}
In the previous mapping, user is a simple object field but the tweets field is defined as a nested type object, which contains id, text, and created_at as its properties.
You can use the same JSON documents that we used in the previous section to index users and their tweets, as indexing nested fields is similar to indexing object fields and does not require any extra effort in the code. However, Elasticsearch considers all the nested documents as separate documents and stores them internally in the following format, which preserves the relationships between tweet texts and dates:
{tweets.id : "121223221", tweets.text : "understanding nested relationships", tweets.created_at : "2015-09-05"}
{tweets.id : "121223222", tweets.text : "NoSQL databases are awesome", tweets.created_at : "2015-06-05"}
{tweets.id : "121223223", tweets.text : "understanding nested relationships", tweets.created_at : "2015-09-05"}
{tweets.id : "121223224", tweets.text : "NoSQL databases are awesome", tweets.created_at : "2015-09-05"}
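To see why keeping each tweet as a separate unit fixes the earlier problem, here is a small plain-Python sketch. It is only an approximation of the idea, not how Elasticsearch is implemented: nested_match is a hypothetical helper, the word matching is a crude stand-in for a match query, and the data mirrors the two example documents.

```python
# Illustrative simulation only: every tweet is evaluated as its own unit.
docs = {
    1: {"tweets": [
        {"id": "121223221", "text": "understanding nested relationships",
         "created_at": "2015-09-05"},
        {"id": "121223222", "text": "NoSQL databases are awesome",
         "created_at": "2015-06-05"},
    ]},
    2: {"tweets": [
        {"id": "121223223", "text": "understanding nested relationships",
         "created_at": "2015-09-05"},
        {"id": "121223224", "text": "NoSQL databases are awesome",
         "created_at": "2015-09-05"},
    ]},
}


def nested_match(doc, word, date):
    """A root document matches only if one single tweet meets both conditions."""
    return any(
        word.lower() in tweet["text"].lower().split()
        and tweet["created_at"] == date
        for tweet in doc["tweets"]
    )


matches = [doc_id for doc_id, doc in docs.items()
           if nested_match(doc, "NoSQL", "2015-09-05")]
print(matches)  # [2]
```

Only the second document is returned, because it is the only one in which the same tweet carries both the NoSQL text and the 2015-09-05 date.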
To query a nested field, Elasticsearch offers a nested query, which has the following syntax:
"query": { "nested": { "path": "path_to_nested_doc", "query": {} } }
Let's understand the nested query syntax:
The query parameter wraps all the queries inside it.
The nested parameter tells Elasticsearch that this query is of the nested type.
The path parameter specifies the path of the nested field.
The query object contains all the queries supported by Elasticsearch.
Now let's run the nested query to search all the tweets that are about NoSQL and have been created on 2015-09-05.
Python example
# es is assumed to be an elasticsearch.Elasticsearch client created earlier
query = {
    "query": {
        "nested": {
            "path": "tweets",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"tweets.text": "NoSQL"}},
                        {"term": {"tweets.created_at": "2015-09-05"}}
                    ]
                }
            }
        }
    }
}
res = es.search(index='twitter_nested', doc_type='users', body=query)
Java example
SearchResponse response = client.prepareSearch("twitter_nested")
.setTypes("users")
.setQuery(QueryBuilders
.nestedQuery(nestedField, QueryBuilders
.boolQuery()
.must(QueryBuilders
.matchQuery("tweets.text", "Nosql Databases"))
.must(QueryBuilders
.termQuery("tweets.created_at", "2015-09-05"))))
.execute().actionGet();
The response object contains the output returned from Elasticsearch, which will have one matching document in the response this time.
Nested aggregations allow you to perform aggregations on nested fields. There are two types of nested aggregations available in Elasticsearch. The first one (nested aggregation) allows you to aggregate the nested fields, whereas the second one (reverse nested aggregation) allows you to aggregate the fields that fall outside the nested scope.
A nested aggregation allows you to perform all the aggregations on the fields inside a nested object. The syntax is as follows:
{ "aggs": { "NAME": { "nested": { "path": "path_to_nested_field" }, "aggs": {} } } }
The syntax of a nested aggregation is similar to the other aggregations but here we need to specify the path of the topmost nested field, as we learnt to do in the nested queries. Once the path is specified, you can perform any aggregation on the nested documents using the inner aggs object. Let's see an example of how to do it:
Python example
# es is assumed to be an elasticsearch.Elasticsearch client created earlier
query = {
    "aggs": {
        "NESTED_DOCS": {
            "nested": {"path": "tweets"},
            "aggs": {
                "TWEET_TIMELINE": {
                    "date_histogram": {"field": "tweets.created_at", "interval": "day"}
                }
            }
        }
    }
}
res = es.search(index='twitter_nested', doc_type='users', body=query, size=0)
The preceding aggregation query creates a nested aggregation bucket, which in turn contains the date histogram of tweets (the number of tweets created per day). Please note that we can combine a nested aggregation with full-text search queries in the same way as we saw in Chapter 4, Aggregations for Analytics.
Java example
The following example requires this extra import in your code:
org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval
You can build the aggregation in the following way:
SearchResponse response = client.prepareSearch("twitter_nested")
    .setTypes("users")
    .addAggregation(AggregationBuilders.nested("NESTED_DOCS")
        .path(nestedField)
        .subAggregation(AggregationBuilders
            .dateHistogram("TWEET_TIMELINE")
            .field("tweets.created_at")
            .interval(DateHistogramInterval.DAY)))
    .setSize(0).execute().actionGet();
The output for the preceding query will look like the following:
"aggregations" : { "NESTED_DOCS" : { "doc_count" : 2, "TWEET_TIMELINE" : { "buckets" : [ { "key_as_string" : "2015-09-05T00:00:00.000Z", "key" : 1441411200000, "doc_count" : 2 } ] } } }
In the output, NESTED_DOCS is the name of our nested aggregation. It shows doc_count as 2 because our document was composed using an array of two nested tweet documents. The single TWEET_TIMELINE bucket also shows two documents because both tweets in that document were created on the same day.
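The counting behind this output can be reproduced with a short plain-Python sketch. It is a simplified model, not the Elasticsearch implementation: nested_date_histogram is a hypothetical helper, and the sample data assumes a single indexed user document whose two tweets share the same date, which is what the output above implies.

```python
from collections import Counter

# One root document with two nested tweets (an assumption based on the output).
docs = [
    {
        "user": {"screen_name": "d_bharvi"},
        "tweets": [
            {"id": "121223223", "text": "understanding nested relationships",
             "created_at": "2015-09-05"},
            {"id": "121223224", "text": "NoSQL databases are awesome",
             "created_at": "2015-09-05"},
        ],
    }
]


def nested_date_histogram(docs, path, date_field):
    """Count nested documents overall and per day, ignoring root documents."""
    nested_docs = [child for doc in docs for child in doc[path]]
    per_day = Counter(child[date_field] for child in nested_docs)
    return {"doc_count": len(nested_docs), "buckets": dict(sorted(per_day.items()))}


result = nested_date_histogram(docs, "tweets", "created_at")
print(result)  # {'doc_count': 2, 'buckets': {'2015-09-05': 2}}
```

Both counts refer to nested tweet documents, which is why they are 2 even though only one root document is involved.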
A nested aggregation has the limitation that it can only access the fields within the nested scope. Reverse nested aggregations overcome this limitation and allow you to look beyond the nested scope and go back to the root document or other nested documents.
For example, we can find all the unique users who have tweeted in a particular date range with the following reverse nested aggregation:
Python example
# es is assumed to be an elasticsearch.Elasticsearch client created earlier
query = {
    "aggs": {
        "NESTED_DOCS": {
            "nested": {"path": "tweets"},
            "aggs": {
                "TWEET_TIMELINE": {
                    "date_histogram": {"field": "tweets.created_at", "interval": "day"},
                    "aggs": {
                        "USERS": {
                            "reverse_nested": {},
                            "aggs": {
                                "UNIQUE_USERS": {"cardinality": {"field": "user.screen_name"}}
                            }
                        }
                    }
                }
            }
        }
    }
}
resp = es.search(index='twitter_nested', doc_type='users', body=query, size=0)
Java example
SearchResponse response = client.prepareSearch(indexName).setTypes(docType)
    .addAggregation(AggregationBuilders.nested("NESTED_DOCS")
        .path(nestedField)
        .subAggregation(AggregationBuilders.dateHistogram("TWEET_TIMELINE")
            .field("tweets.created_at").interval(DateHistogramInterval.DAY)
            .subAggregation(AggregationBuilders.reverseNested("USERS")
                .subAggregation(AggregationBuilders.cardinality("UNIQUE_USERS")
                    .field("user.screen_name")))))
    .setSize(0).execute().actionGet();
The output for the preceding aggregation will be as follows:
{ "aggregations": { "NESTED_DOCS": { "doc_count": 2, "TWEET_TIMELINE": { "buckets": [ { "key_as_string": "2015-09-05T00:00:00.000Z", "key": 1441411200000, "doc_count": 2, "USERS": { "doc_count": 1, "UNIQUE_USERS": { "value": 1 } } } ] } } } }
The preceding output shows the nested docs count as 2, whereas the USERS key specifies that only one root document exists in the given time range. UNIQUE_USERS shows the cardinality aggregation output, that is, the number of unique users among the root documents in that bucket.
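The step back from nested tweets to their root documents can also be modelled in plain Python. This is a rough sketch under stated assumptions, not the Elasticsearch implementation: reverse_nested_histogram is a hypothetical helper, the sample data assumes a single user document with two tweets on the same day, and the unique-user count uses an exact set, whereas the real cardinality aggregation returns an approximate count.

```python
# One root document with two nested tweets (an assumption based on the output).
docs = [
    {
        "user": {"screen_name": "d_bharvi"},
        "tweets": [
            {"id": "121223223", "created_at": "2015-09-05"},
            {"id": "121223224", "created_at": "2015-09-05"},
        ],
    }
]


def reverse_nested_histogram(docs):
    """Bucket tweets per day, then join back to the root documents they belong to."""
    buckets = {}
    for root_id, doc in enumerate(docs):
        for tweet in doc["tweets"]:
            bucket = buckets.setdefault(
                tweet["created_at"],
                {"doc_count": 0, "roots": set(), "users": set()},
            )
            bucket["doc_count"] += 1                           # nested tweet documents
            bucket["roots"].add(root_id)                       # distinct root documents
            bucket["users"].add(doc["user"]["screen_name"])    # exact unique users
    return {
        day: {
            "doc_count": b["doc_count"],
            "USERS": len(b["roots"]),
            "UNIQUE_USERS": len(b["users"]),
        }
        for day, b in sorted(buckets.items())
    }


summary = reverse_nested_histogram(docs)
print(summary)
# {'2015-09-05': {'doc_count': 2, 'USERS': 1, 'UNIQUE_USERS': 1}}
```

The bucket counts two tweets but only one root document and one user, which mirrors the difference between doc_count and the USERS and UNIQUE_USERS values in the output.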