Batch indexing to speed up your indexing process

In the first chapter, we saw how to index a single document into ElasticSearch. Now it's time to show how to index many documents in a more convenient and efficient way than doing it one by one.

Some of the information in this chapter should not be new to us. We've already used it when preparing test data in the previous parts of this book, but now we'll summarize this knowledge.

How to prepare data

ElasticSearch allows us to merge many operations into one packet and send it in a single request. In this way, we can mix three operations: adding or replacing a document in the index (index), removing a document from the index (delete), and adding a new document only when no document with the same identifier already exists in the index (create). The format of the request was chosen for processing efficiency. It assumes that every line of the request contains a JSON object with the operation description, followed by a second line with a JSON object containing the document data for this operation. The exception to this rule is the delete operation, which, for obvious reasons, doesn't have the second line. Let's look at the example data:

{ "index": { "_index": "addr", "_type": "contact", "_id": 1 }}
{ "name": "Fyodor Dostoevsky", "country": "RU" }
{ "create": { "_index": "addr", "_type": "contact", "_id": 2 }}
{ "name": "Erich Maria Remarque", "country": "DE" }
{ "create": { "_index": "addr", "_type": "contact", "_id": 2 }}
{ "name": "Joseph Heller", country: "US" }
{ "delete": { "_index": "addr", "_type": "contact", "_id": 4 }}
{ "delete": { "_index": "addr", "_type": "contact", "_id": 1 }}

It is very important that every document and every action description is placed on a single line. This means that the documents cannot be pretty-printed. There is a default limitation on the size of a bulk indexing request, which is set to 100 megabytes and can be changed by specifying the http.max_content_length property in the ElasticSearch configuration file. This lets us avoid possible request timeouts and memory problems when dealing with overly large requests.
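
For example, if we expect to send larger bulk requests, we could raise this limit in the elasticsearch.yml configuration file (the 500mb value shown here is only an illustrative assumption, not a recommendation):

http.max_content_length: 500mb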

Note

Please note that with a single batch indexing file, we can load the data into many indices and that documents can have different types.
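
For example, the following fragment of a bulk file indexes one document into our addr index and another into a hypothetical books index with a different type (the books index and its fields are made up for illustration):

{ "index": { "_index": "addr", "_type": "contact", "_id": 5 }}
{ "name": "Anton Chekhov", "country": "RU" }
{ "index": { "_index": "books", "_type": "book", "_id": 1 }}
{ "title": "Catch-22" }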

Indexing the data

In order to execute the bulk request, ElasticSearch provides the _bulk endpoint. This can be used as /_bulk, with an index name as /index_name/_bulk, or even with a type as /index_name/type_name/_bulk. The second and third forms define default values for the index name and the type name, so we can omit them in the operation description lines of our data.
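
For example, when sending a request to the /addr/contact/_bulk endpoint, the operation description lines can be reduced to the identifier only:

{ "index": { "_id": 1 }}
{ "name": "Fyodor Dostoevsky", "country": "RU" }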

If we have our example data in the documents.json file, execution would look as follows:

curl -XPOST 'localhost:9200/_bulk?pretty' --data-binary @documents.json

The ?pretty parameter is of course not necessary; we've used it only to make the result of the command easier to analyze. What is important in this case is using curl with the --data-binary parameter instead of -d. This is because the standard -d parameter ignores newline characters, which, as we said earlier, are essential for ElasticSearch to parse the commands. Now let's look at the result returned by ElasticSearch:

{
  "took" : 113,
  "items" : [ {
    "index" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "1",
      "_version" : 1,
      "ok" : true
    }
  }, {
    "create" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "2",
      "_version" : 1,
      "ok" : true
    }
  }, {
    "create" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "2",
      "error" : "DocumentAlreadyExistsException[[addr][3] [contact][2]: document already exists]"
    }
  }, {
    "delete" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "4",
      "_version" : 1,
      "ok" : true
    }
  }, {
    "delete" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "1",
      "_version" : 2,
      "ok" : true
    }
  } ]
}

As we can see, every result is a part of the items array. Let's briefly compare these results with our input data. The first two commands, namely, index and create, were executed without any problems. The third operation failed because we wanted to create a record with an identifier that already existed in the index. The next two operations were deletions. Both succeeded. Note that the first of them tried to delete a nonexistent document; as you can see, this wasn't a problem.

Is it possible to do it quicker?

Bulk operations are fast, but if you are wondering whether there is an even quicker way of indexing, you can take a look at bulk operations over the User Datagram Protocol (UDP). Note that UDP doesn't guarantee that no data will be lost during communication with the ElasticSearch server, so it is useful only in cases where performance is critical and more important than accuracy.
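
As a minimal sketch, assuming we run an ElasticSearch version that ships the UDP bulk service and have enabled it in the configuration file (9700 is the default port of that service; the exact settings may differ between versions):

bulk.udp.enabled: true

Our bulk file could then be sent with netcat, for example:

cat documents.json | nc -w 0 -u localhost 9700

Because UDP is fire-and-forget, ElasticSearch sends no response for such requests, so there is no way to check the result of each operation.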
