In the first chapter, we saw how to index a single document in ElasticSearch. Now it's time to show how to index many documents in a more convenient and efficient way than doing it one by one.
Some of the information in this chapter should not be new to us. We've already used it when preparing test data in the previous parts of this book, but now we'll summarize this knowledge.
ElasticSearch allows us to merge many operations into a single request. We can mix three operations: adding or replacing existing documents in the index (index), removing documents from the index (delete), and adding new documents to the index when no document with the given identifier already exists (create). The format of the request was chosen for processing efficiency and assumes that every line of the request contains a JSON object with the operation description, followed by a second line with a JSON object containing the document data for this operation. The exception to this rule is the delete operation, which, for obvious reasons, doesn't have the second line. Let's look at the example data:
{ "index": { "_index": "addr", "_type": "contact", "_id": 1 }} { "name": "Fyodor Dostoevsky", "country": "RU" } { "create": { "_index": "addr", "_type": "contact", "_id": 2 }} { "name": "Erich Maria Remarque", "country": "DE" } { "create": { "_index": "addr", "_type": "contact", "_id": 2 }} { "name": "Joseph Heller", country: "US" } { "delete": { "_index": "addr", "_type": "contact", "_id": 4 }} { "delete": { "_index": "addr", "_type": "contact", "_id": 1 }}
It is very important that every document or action description is placed on a single line, which means that documents cannot be pretty-printed. There is a default limit on the size of a bulk request, set to 100 megabytes, which can be changed with the http.max_content_length property in the ElasticSearch configuration file. This lets us avoid possible request timeouts and memory problems when dealing with overly large requests.
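When a data set is larger than the configured limit, it has to be split into several smaller bulk requests. A rough sketch of such splitting follows; `split_bulk` is an illustrative name, and the 100 MB default mirrors the http.max_content_length value mentioned above:

```python
def split_bulk(operations, max_bytes=100 * 1024 * 1024):
    """Group serialized bulk operations into batches whose total size
    stays below max_bytes. Each string in `operations` holds an action
    line plus, where present, its document line, so an action is never
    separated from its document."""
    batches, current, size = [], [], 0
    for op in operations:
        op_size = len(op.encode("utf-8")) + 1  # +1 for the trailing newline
        if current and size + op_size > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(op)
        size += op_size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then be sent as a separate request to the _bulk endpoint.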
In order to execute a bulk request, ElasticSearch provides the _bulk endpoint. This can be used as /_bulk, with an index name as /index_name/_bulk, or even with a type as /index_name/type_name/_bulk. The second and third forms define default values for the index name and type name, which can then be omitted from the operation description lines in the data.
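For example, when posting our data to /addr/contact/_bulk, the index and type can be left out of the action lines, leaving only the document identifiers (this shortened fragment is equivalent to the corresponding lines of the full example above):

```
{ "index": { "_id": 1 }}
{ "name": "Fyodor Dostoevsky", "country": "RU" }
{ "delete": { "_id": 4 }}
```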
If we have our example data in the documents.json file, execution would look as follows:
curl -XPOST 'localhost:9200/_bulk?pretty' --data-binary @documents.json
The ?pretty parameter is, of course, not necessary; we've used it only to make the result of this command easier to analyze. What is important in this case is using curl with the --data-binary parameter instead of -d. The standard -d parameter ignores newline characters, which, as we said earlier, are significant for ElasticSearch's parsing of the commands. Now let's look at the result returned by ElasticSearch:
{ "took" : 113, "items" : [ { "index" : { "_index" : "addr", "_type" : "contact", "_id" : "1", "_version" : 1, "ok" : true } }, { "create" : { "_index" : "addr", "_type" : "contact", "_id" : "2", "_version" : 1, "ok" : true } }, { "create" : { "_index" : "addr", "_type" : "contact", "_id" : "2", "error" : "DocumentAlreadyExistsException[[addr][3] [contact][2]: document already exists]" } }, { "delete" : { "_index" : "addr", "_type" : "contact", "_id" : "4", "_version" : 1, "ok" : true } }, { "delete" : { "_index" : "addr", "_type" : "contact", "_id" : "1", "_version" : 2, "ok" : true } } ] }
As we can see, every result is a part of the items array. Let's briefly compare these results with our input data. The first two operations, index and create, were executed without any problems. The third operation failed because we wanted to create a record with an identifier that already existed in the index. The next two operations were deletions, and both succeeded. Note that the first of them tried to delete a nonexistent document; as you can see, this wasn't a problem.
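Because individual operations can fail while the request as a whole succeeds, it is worth checking every entry in the items array. A sketch of such a check, assuming the response shape shown above (the `failed_items` helper is illustrative):

```python
def failed_items(response):
    """Collect the operations from a parsed bulk response that
    reported an error instead of a successful result."""
    failures = []
    for item in response["items"]:
        # each item is a one-key dict, e.g. {"index": {...}} or {"create": {...}}
        op, result = next(iter(item.items()))
        if "error" in result:
            failures.append((op, result["_id"], result["error"]))
    return failures
```

Applied to the response above, this would report only the second create operation, whose identifier already existed in the index.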
Bulk operations are fast, but if you are wondering whether there is an even quicker way of indexing, you can take a look at the User Datagram Protocol (UDP) bulk operations. Note that UDP doesn't guarantee that data won't be lost during communication with the ElasticSearch server, so it is useful only in cases where performance is critical and more important than making sure that every document was indexed.