Relations between documents

As Elasticsearch gains attention, it is no longer used only as a search engine. It is also seen as a data analysis solution and sometimes as a primary data store. Having a single data store that enables fast and efficient full text searching often seems like a good idea: we can not only store documents, but also search them and analyze their contents, bringing meaning to the data. This is usually more than we could expect from traditional SQL databases. However, if you have any experience with SQL databases, you soon realize the necessity of modeling relationships between documents when dealing with Elasticsearch. Unfortunately, this is not easy, and many of the habits and good practices from relational databases won't work in the world of the inverted index that Elasticsearch uses. You should already be familiar with how Elasticsearch handles relationships, because we mentioned nested objects and the parent–child functionality in our Elasticsearch Server Second Edition book, but let's go through the available possibilities and look closer at the traps connected with them.

The object type

Elasticsearch tries to interfere as little as possible when modeling your data and turning it into an inverted index. Unlike relational databases, Elasticsearch can index structured objects naturally. This means that if you have any JSON document, you can index it without problems and Elasticsearch adapts to it. Let's look at the following document:

{
   "title": "Title",
   "quantity": 100,
   "edition": {
      "isbn": "1234567890",
      "circulation": 50000
   }
}

As you can see, the preceding document has two simple properties and an inner object (the edition one) with additional properties of its own. The mapping for our example is simple and looks as follows (it is also stored in the relations.json file provided with the book):

{
  "book" : {
    "properties" : {
      "title" : {"type": "string" },
      "quantity" : {"type": "integer" },
      "edition" : {
        "type" : "object",
        "properties" : {
          "isbn" : {"type" : "string", "index" : "not_analyzed" },
          "circulation" : {"type" : "integer" }
        }
      }
    }
  }
}

Unfortunately, this works only when the inner object is connected to its parent with a one-to-one relation. If you add a second object, for example, like the following:

{
   "title": "Title",
   "quantity": 100,
   "edition": [
      {
         "isbn": "1234567890",
         "circulation": 50000
      },
      {
         "isbn": "9876543210",
         "circulation": 2000
      }
   ]
}

Elasticsearch will flatten it. To Elasticsearch, the preceding document will look more or less like the following one (of course, the _source field will still look like the preceding document):

{
   "title": "Title",
   "quantity": 100,
   "edition": {
      "isbn": [ "1234567890", "9876543210" ],
      "circulation": [ 50000, 2000 ]
   }
}

This is not exactly what we want, and such a representation will cause problems when you search for books containing editions with a given ISBN number and a given circulation. Cross-matches will happen: Elasticsearch will return books that contain an edition with the given ISBN and an edition with the given circulation, even when these are two different editions.
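The flattening described above can be modeled in a few lines of Python. This is only a sketch of the idea, not Elasticsearch code: the flatten and matches_all_terms helpers are hypothetical names, and the full field paths (edition.isbn, edition.circulation) are used for clarity.

```python
from collections import defaultdict


def flatten(doc):
    """Collect every leaf value under its dotted field path.

    Array boundaries are lost on purpose, which mirrors how the
    object type turns an array of objects into multi-valued fields.
    """
    fields = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            for key, inner in value.items():
                walk(inner, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                walk(item, path)
        else:
            fields[path].append(value)

    walk(doc, "")
    return dict(fields)


def matches_all_terms(flat, terms):
    """Mimic a bool/must of term queries against the flattened fields."""
    return all(value in flat.get(field, []) for field, value in terms.items())


book = {
    "title": "Title",
    "quantity": 100,
    "edition": [
        {"isbn": "1234567890", "circulation": 50000},
        {"isbn": "9876543210", "circulation": 2000},
    ],
}

flat = flatten(book)
print(flat["edition.isbn"])         # ['1234567890', '9876543210']
print(flat["edition.circulation"])  # [50000, 2000]

# ISBN of the first edition combined with the circulation of the second one.
cross_match = matches_all_terms(
    flat, {"edition.isbn": "1234567890", "edition.circulation": 2000}
)
print(cross_match)  # True: the information about which values belong together is gone
```

Because each field only remembers the set of its values, the match succeeds even though no single edition has both values.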

We can test this by indexing our document with the following command:

curl -XPOST 'localhost:9200/object/doc/1' -d '{
 "title": "Title",
 "quantity": 100,
 "edition": [
  {
   "isbn": "1234567890",
   "circulation": 50000
  },
  {
   "isbn": "9876543210",
   "circulation": 2000
  }
 ]
}'

Now, if we run a simple query to return documents with the isbn field equal to 1234567890 and the circulation field equal to 2000, we shouldn't get any documents, because no single edition has both values. Let's test that by running the following query:

curl -XGET 'localhost:9200/object/_search?pretty' -d '{
 "fields" : [ "_id", "title" ],
 "query" : {
  "bool" : {
   "must" : [
    {
     "term" : {
      "isbn" : "1234567890"
     }
    },
    {
     "term" : {
      "circulation" : 2000
     }
    }
   ]
  }
 }
}'

The result returned by Elasticsearch is as follows:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0122644,
    "hits" : [ {
      "_index" : "object",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0122644,
      "fields" : {
        "title" : [ "Title" ]
      }
    } ]
  }
}

These cross-matches can be avoided by rearranging the mapping and the document so that the source document looks like the following:

{
   "title": "Title",
   "quantity": 100,
   "edition": {
      "isbn": [ "1234567890", "9876543210" ],
      "circulation_1234567890": 50000,
      "circulation_9876543210": 2000
   }
}

Now you can use a query similar to the preceding one that respects the relationship between the fields, at the cost of more complex query building. The main problem is that the mappings would have to contain information about all the possible values of the fields, which is not something we would like to go for when there are more than a couple of possible values. On the other hand, this still does not allow us to create more complicated queries, such as all books with a circulation of more than 10,000 and an ISBN number starting with 23. In such cases, a better solution is to use nested objects.
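To make the cost of this workaround more concrete, here is a minimal Python sketch. The rearrange, build_query, and model_match helpers are hypothetical names introduced for illustration, and the use of full field paths in the query is an assumption. The point is that the field name itself now has to be computed from the ISBN when the query is built.

```python
import json


def rearrange(book):
    """Fold each edition's circulation into a field named after its ISBN."""
    editions = book["edition"]
    edition = {"isbn": [e["isbn"] for e in editions]}
    for e in editions:
        edition[f"circulation_{e['isbn']}"] = e["circulation"]
    return {**book, "edition": edition}


def build_query(isbn, circulation):
    """Build a bool query whose second field name depends on the ISBN value."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"edition.isbn": isbn}},
                    {"term": {f"edition.circulation_{isbn}": circulation}},
                ]
            }
        }
    }


def model_match(doc, isbn, circulation):
    """Rough model of what the query above checks on a rearranged document."""
    edition = doc["edition"]
    return isbn in edition["isbn"] and edition.get(f"circulation_{isbn}") == circulation


book = {
    "title": "Title",
    "quantity": 100,
    "edition": [
        {"isbn": "1234567890", "circulation": 50000},
        {"isbn": "9876543210", "circulation": 2000},
    ],
}

rearranged = rearrange(book)
print(json.dumps(build_query("1234567890", 2000), indent=1))
print(model_match(rearranged, "1234567890", 2000))  # False: no cross-match any more
print(model_match(rearranged, "9876543210", 2000))  # True: values of the same edition
```

Every new ISBN adds a new field, which is exactly why the mappings grow with the number of possible values.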

To summarize, the object type is handy only for the simplest cases, when problems with cross-field searching do not exist: for example, when you don't want to search inside inner objects or you only need to search on one of the fields without matching on the others.

The nested documents

From the mapping point of view, the definition of a nested document differs only in the use of the nested type instead of object (which Elasticsearch uses by default when guessing types). For example, let's modify our previous example so that it uses nested documents:

{
  "book" : {
    "properties" : {
      "title" : {"type": "string" },
      "quantity" : {"type": "integer" },
      "edition" : {
        "type" : "nested",
        "properties" : {
          "isbn" : {"type" : "string", "index" : "not_analyzed" },
          "circulation" : {"type" : "integer" }
        }
      }
    }
  }
}

When we use nested documents, Elasticsearch in fact creates one document for the main object (we could call it the parent, but that can cause confusion when talking about the parent–child functionality) and additional documents for the inner objects. During normal queries, these additional documents are automatically filtered out and are neither searched nor displayed. This is called a block join in Apache Lucene (you can read more about Apache Lucene block join queries in a blog post written by Lucene committer Mike McCandless, available at http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html). For performance reasons, Lucene keeps these documents together with the main document, in the same segment block.

This is why nested documents have to be indexed at the same time as the main document. Because both sides of the relation are prepared before they are stored in the index and both are indexed at the same time, some people refer to nested objects as an index-time join. This strong connection between documents is not a big problem when the documents are small and the data is easily available from the main data store. But what if the documents are quite big, one part of the relationship changes a lot, and reindexing the other part is not an option? And what if a nested document belongs to more than one main document? These problems do not exist in the parent–child functionality.

If we went back to our example, changed our index to use nested objects, and changed our query to use the nested query, no documents would be returned, because no single nested document matches such a query.
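The nested variant of the earlier query could look roughly like the body built by the following Python sketch. It assumes the nested query with its path and query properties and uses full field paths. The helper names are hypothetical, and the name of the index that uses the nested mapping is not given in the text, so none is shown. The second function is only a model of the matching rule: every edition is checked on its own.

```python
import json


def nested_edition_query(isbn, circulation):
    """Build a nested query: both terms must match inside the same edition."""
    return {
        "query": {
            "nested": {
                "path": "edition",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"edition.isbn": isbn}},
                            {"term": {"edition.circulation": circulation}},
                        ]
                    }
                },
            }
        }
    }


def matching_editions(book, isbn, circulation):
    """Model of nested matching: each inner object is evaluated separately."""
    return [
        e for e in book["edition"]
        if e["isbn"] == isbn and e["circulation"] == circulation
    ]


book = {
    "title": "Title",
    "quantity": 100,
    "edition": [
        {"isbn": "1234567890", "circulation": 50000},
        {"isbn": "9876543210", "circulation": 2000},
    ],
}

body = nested_edition_query("1234567890", 2000)
print(json.dumps(body, indent=1))

print(matching_editions(book, "1234567890", 2000))  # []: values from different editions
print(matching_editions(book, "9876543210", 2000))  # one edition matches
```

The printed body can be passed to the _search endpoint with curl -d in the same way as the earlier queries.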

Parent–child relationship

When talking about the parent–child functionality, we have to start with its main advantage: true separation between documents, so that each part of the relation can be indexed independently. The first cost of this advantage is more complicated, and thus slower, queries. Elasticsearch provides special query and filter types that allow us to use this relation, which is why it is sometimes called a query-time join. The second disadvantage, which is more significant, shows up in bigger applications and multi-node Elasticsearch setups. Let's see how the parent–child relationship works in an Elasticsearch cluster that contains multiple nodes.

Note

Please note that, unlike nested documents, child documents can be queried without the context of the parent document.
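As an illustration of the special query types mentioned above, the following Python sketch builds has_child and has_parent query bodies. It is a sketch under assumptions: the helper names are hypothetical, the book and edition type names are borrowed from the mappings used in this chapter, and the queries assume that both types can be reached by the same search request.

```python
import json


def books_having_edition(isbn):
    """has_child: return parent (book) documents that have a matching child."""
    return {
        "query": {
            "has_child": {
                "type": "edition",
                "query": {"term": {"isbn": isbn}},
            }
        }
    }


def editions_of_books_titled(text):
    """has_parent: return child (edition) documents whose parent matches."""
    return {
        "query": {
            "has_parent": {
                "parent_type": "book",
                "query": {"match": {"title": text}},
            }
        }
    }


child_query = books_having_edition("no1")
parent_query = editions_of_books_titled("Doc")

print(json.dumps(child_query, indent=1))
print(json.dumps(parent_query, indent=1))
```

In both cases the join is resolved when the query runs, which is where the additional query cost comes from.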

Parent–child relationship in the cluster

To better show the problem, let's create two indices: the rel_pch_m index holding the parent documents and the rel_pch_s index holding the child documents:

curl -XPUT localhost:9200/rel_pch_m -d '{ "settings" : {  "number_of_replicas" : 0  } }'
curl -XPUT localhost:9200/rel_pch_s -d '{ "settings" : {  "number_of_replicas" : 0 } }'

Our mappings for the rel_pch_m index are simple and they can be sent to Elasticsearch by using the following command:

curl -XPOST 'localhost:9200/rel_pch_m/book/_mapping?pretty' -d '{
  "book" : {
    "properties" : {
      "title" : { "type": "string" },
      "quantity" : { "type": "integer" }
    }
  }
}'

The mappings for the rel_pch_s index are simple as well, but we have to inform Elasticsearch what type of documents should be treated as parents. We can use the following command to send the mappings for the second index to Elasticsearch:

curl -XPOST 'localhost:9200/rel_pch_s/edition/_mapping?pretty' -d '{
  "edition" : {
    "_parent" : {
      "type" : "book"
    },
    "properties" : {
      "isbn" : { "type" : "string", "index" : "not_analyzed" },
      "circulation" : { "type" : "integer" }
    }
  }
}'

The last step is to import data into these indices. We generated about 10,000 records; an example document looks as follows:

{"index": {"_index": "rel_pch_m", "_type": "book", "_id": "1"}}
{"title" : "Doc no 1", "quantity" : 101}
{"index": {"_index": "rel_pch_s", "_type": "edition", "_id": "1",  "_parent": "1"}}
{"isbn" : "no1", "circulation" : 501}

Note

If you are curious and want to experiment, you will find the simple bash script create_relation_indices.sh used to generate the example data.

The assumption is simple: we have 10,000 documents of each type (book and edition). The key is the _parent field. In our example, it is always set to 1, so we have 10,000 books, but all 10,000 editions belong to one particular book. This example is rather extreme, but it lets us point out an important thing.
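Data of this shape could be produced by a short generator such as the following Python sketch. It is not the create_relation_indices.sh script. The bulk_lines name is hypothetical, and the way the title, quantity, isbn, and circulation values grow with the identifier is an assumption extrapolated from the single example document shown earlier.

```python
import json


def bulk_lines(count, parent_id="1"):
    """Yield bulk-format lines: one book and one edition per identifier.

    Every edition points at the same parent, as in the extreme example.
    """
    for i in range(1, count + 1):
        yield json.dumps(
            {"index": {"_index": "rel_pch_m", "_type": "book", "_id": str(i)}}
        )
        yield json.dumps({"title": f"Doc no {i}", "quantity": 100 + i})
        yield json.dumps(
            {
                "index": {
                    "_index": "rel_pch_s",
                    "_type": "edition",
                    "_id": str(i),
                    "_parent": parent_id,
                }
            }
        )
        yield json.dumps({"isbn": f"no{i}", "circulation": 500 + i})


lines = list(bulk_lines(3))
print("\n".join(lines[:4]))
```

Written to a file with a trailing newline, such output has the shape expected by the _bulk endpoint.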

Note

For visualization, we have used the ElasticHQ plugin available at http://www.elastichq.org/.

First let's look at the parent part of the relation and the index storing the parent documents, as shown in the following screenshot:

[Screenshot: Parent–child relationship in the cluster (shards of the rel_pch_m index)]

As we can see, the five shards of the index are located on three different nodes. Every shard has more or less the same number of documents. This is what we would expect—Elasticsearch used hashing to calculate the shard on which documents should be placed.

Now, let's look at the second index, which contains our child documents, as shown in the following screenshot:

[Screenshot: Parent–child relationship in the cluster (shards of the rel_pch_s index)]

The situation is different. We still have five shards, but four of them are empty and the last one contains all 10,000 documents! Something is not right: all the documents we indexed are located in one particular shard. This is because Elasticsearch always puts documents with the same parent in the same shard (in other words, the routing parameter value for child documents is always equal to the parent parameter value). Our example shows that when some parent documents have substantially more children than others, we can end up with uneven shards, which may cause performance and storage issues. For example, some shards may sit idle while others are constantly overloaded.
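The effect can be reproduced with a toy model of hash-based routing. The following Python sketch is an assumption-laden illustration: crc32 is only a stand-in for whatever hash function Elasticsearch actually uses, and shard_for is a hypothetical helper. What matters is that the shard is derived from the routing value, and that for child documents the routing value is the parent identifier.

```python
import zlib
from collections import Counter

NUMBER_OF_SHARDS = 5


def shard_for(routing, shards=NUMBER_OF_SHARDS):
    """Pick a shard from the routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % shards


# Parents: the routing value defaults to the document's own identifier.
parent_shards = Counter(shard_for(str(i)) for i in range(1, 10001))

# Children: the routing value is the parent identifier, always "1" here.
child_shards = Counter(shard_for("1") for _ in range(10000))

print("parents per shard: ", dict(parent_shards))
print("children per shard:", dict(child_shards))
```

The parent counts are spread over several shards, while the child counter has a single entry holding all 10,000 documents.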

A few words about alternatives

As we have seen, handling relations between documents can cause various problems in Elasticsearch. Of course, this is not only the case with Elasticsearch: full text search solutions are built for searching and data analysis, not for modeling relationships between data. If this is a big problem for your application, and the full text capability is not a core part of it, you may consider using an SQL database that supports full text searching to some extent. Of course, these solutions won't be as flexible and fast as Elasticsearch, but that is the price we pay if we need full relationship support. In most other cases, however, changing the data architecture and eliminating relations by de-normalization will be sufficient.
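One possible form of such de-normalization, sketched in Python under the assumption that one flat document per edition suits the application, is to repeat the book-level fields in every edition document. The denormalize helper is a hypothetical name.

```python
def denormalize(book):
    """Emit one flat document per edition, repeating the book-level fields."""
    shared = {key: value for key, value in book.items() if key != "edition"}
    return [
        {**shared, "isbn": e["isbn"], "circulation": e["circulation"]}
        for e in book["edition"]
    ]


book = {
    "title": "Title",
    "quantity": 100,
    "edition": [
        {"isbn": "1234567890", "circulation": 50000},
        {"isbn": "9876543210", "circulation": 2000},
    ],
}

flat_docs = denormalize(book)
for doc in flat_docs:
    print(doc)
```

Each resulting document keeps an ISBN together with its own circulation, so ordinary term queries cannot produce cross-matches. The price is duplicated book data that has to be updated in every copy.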
