The previous chapters gave us a good look at what ElasticSearch is capable of, both in terms of indexing and querying, but their coverage was far from complete. One thing we would like to discuss in more detail is the set of ElasticSearch functionalities that are not used every day, but can make our life easier when it comes to data handling.
Each of the following field types should be defined at the appropriate type level. So, if you recall the sample mappings for our small library from Chapter 2, Searching Your Data, we would add any of the following fields under the book type mappings.
As you recall, each document indexed in ElasticSearch has its own identifier and type. ElasticSearch uses two internal fields to hold document identifiers.
The first one is the _uid field, which is the unique identifier of the document in the index and is composed of the document's identifier and the document type. This means that documents of different types that are indexed into the same index can have the same document identifier, yet ElasticSearch will still be able to distinguish them. This field doesn't require any additional settings and is always indexed, but it's good to know that it exists.
The second field holding an identifier is the _id field. This field stores the actual identifier set during index time. In order to enable the indexing of the _id field (and storing it, if needed), we need to add the _id field definition just like any other property in our mappings (although, as said before, it should be placed in the body of the type definition).
So, our sample book type definition will look like the following:
{
  "book" : {
    "_id" : {
      "index" : "not_analyzed",
      "store" : "no"
    },
    "properties" : {
      .
      .
      .
    }
  }
}
As you can see in the previous example, we said that we want our _id field to be indexed but not analyzed, and that we don't want to store it.
In addition to specifying an identifier at indexing time, we can specify that we want it to be fetched from one of the fields of the indexed document (although this will be slightly slower because of the additional parsing needed). In order to do that, we need to set the path property to the name of the field we want to use as the identifier value provider. For example, if we have the book_id field in our documents and we would like to use it as the value for the _id field, we could change the previous mappings to something like the following:
{
  "book" : {
    "_id" : {
      "path" : "book_id"
    },
    "properties" : {
      .
      .
      .
    }
  }
}
One last point to remember is that even when the _id field is disabled, all the functionalities requiring the document's unique identifier will still work because they will use the _uid field instead.
To say it one more time: each document in ElasticSearch is described by at least an identifier and a type and, if we want, we may include the type name as the internal _type field of our indices. By default, the _type field is indexed but not stored. If we would like to store that field, we will have to change our mappings file to one like the following:
{
  "book" : {
    "_type" : {
      "store" : "yes"
    },
    "properties" : {
      .
      .
      .
    }
  }
}
We can also change the _type field in such a way that it will not be indexed, but then some queries, such as term queries and filters that rely on the _type field, will not work.
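As a minimal sketch of the non-indexed variant, and assuming the _type object accepts the same index property that the _id object used earlier, the mapping could look like the following. The properties section is omitted here only to keep the fragment short and valid JSON.

```json
{
  "book" : {
    "_type" : {
      "index" : "no"
    }
  }
}
```

Before applying such a change, it is worth checking whether any of your queries or filters depend on the document type.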
The _all field allows us to create a field into which the contents of other fields are copied. This kind of field may be useful when we want to implement a simple search feature and search all the data (or only the fields we copy to the _all field) without having to think about field names. By default, the _all field is enabled and contains the data from all the fields of the document. In order to exclude a certain field from the _all field, one should use the include_in_all property, which was discussed in Chapter 2, Searching Your Data.
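To illustrate the exclusion, the following sketch keeps the _all field enabled but leaves one field out of it. The notes field is a hypothetical name used only for this example; it is not part of the library mappings.

```json
{
  "book" : {
    "properties" : {
      "notes" : {
        "type" : "string",
        "include_in_all" : false
      }
    }
  }
}
```

With a mapping like this, the contents of the notes field would still be searchable through the field itself, just not through the _all field.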
In order to completely turn off the _all field functionality (our index will be smaller without the _all field), we will modify our mappings file to one looking like the following:
{
  "book" : {
    "_all" : {
      "enabled" : false
    },
    "properties" : {
      .
      .
      .
    }
  }
}
In addition to the enabled property, the _all field supports the following ones:
store
term_vector
analyzer
index_analyzer
search_analyzer
For information about these properties please refer to Chapter 2, Searching Your Data.
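As a rough sketch of how these properties combine, the following mapping keeps the _all field enabled, stores it, and records term vectors for it. The chosen values are assumptions made for illustration, not recommendations.

```json
{
  "book" : {
    "_all" : {
      "enabled" : true,
      "store" : "yes",
      "term_vector" : "with_positions_offsets"
    }
  }
}
```

Keep in mind that storing the _all field and keeping term vectors for it both increase the size of the index.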
The _source field allows us to store the original JSON document that was sent to ElasticSearch during indexing. By default, the _source field is turned on because some of the ElasticSearch functionalities depend on it, for example, the partial update feature that was already described in Chapter 1, Getting Started with ElasticSearch Cluster. In addition to that, the _source field can be used as the source of data for the highlighting functionality if a field is not stored. But if we don't need such functionality, we can disable the _source field because it causes some storage overhead. In order to do that, we need to set the enabled property of the _source object to false, as shown in the following code:
{
  "book" : {
    "_source" : {
      "enabled" : false
    },
    "properties" : {
      .
      .
      .
    }
  }
}
Because the _source field causes some storage overhead, we may choose to compress the information stored in that field. In order to do that, we set the compress parameter to true. Although this will shrink the index, it will make the operations performed on the _source field a bit more CPU-intensive. However, ElasticSearch allows us to decide when to compress the _source field. Using the compress_threshold property, we can control how big the _source field's content needs to be in order for ElasticSearch to compress it. This property accepts a byte size value (for example, 100b or 10kb).
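Putting the two properties together, a mapping along the following lines would compress the _source field only when its content is big enough to reach the configured threshold. The 10kb value is an arbitrary choice for this sketch, and the properties section is omitted for brevity.

```json
{
  "book" : {
    "_source" : {
      "compress" : true,
      "compress_threshold" : "10kb"
    }
  }
}
```

A threshold like this is a way to avoid spending CPU time on small documents, where compression saves little space.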
As you may suspect, the _boost field allows us to set a default boost value for all the documents of a certain type. You may wonder, why increase the boost value of a document? If some of your documents are more important than others, you can increase their boost value in order for ElasticSearch to know that they are more valuable. Imagine that we would like our book documents to have a higher boost than all the other types of documents in the index. To achieve that for every single document, we can use the _boost field. So, if we would like all our book documents to have a boost value of 10.0, we can modify our mappings to something like the following:
{
  "book" : {
    "_boost" : {
      "name" : "_boost",
      "null_value" : 10.0
    },
    "properties" : {
      .
      .
      .
    }
  }
}
This mapping change says that if we don't add an additional field named _boost to the documents sent for indexing, the null_value value will be used as the boost. If we do add such a field, its value will be used instead of the default one.
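For example, with the _boost mapping shown above in place, a document sent for indexing could override the default boost as in the following sketch. The title field and both values are made up for illustration.

```json
{
  "title" : "Example Book",
  "_boost" : 20.0
}
```

A document indexed without the _boost field would instead get the default boost of 10.0 taken from null_value.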
ElasticSearch allows us to store information about the index that the document is indexed in. We can do that by using the internal _index field. Imagine that we create daily indices, use aliasing, and are interested in knowing in which daily index the returned document is stored. In such a case, the _index field can be useful.
By default, the indexing of the _index field is disabled. In order to enable it, we need to set the enabled property of the _index object to true, for example:
{
  "book" : {
    "_index" : {
      "enabled" : true
    },
    "properties" : {
      .
      .
      .
    }
  }
}
The _size field, which is disabled by default, allows us to automatically index the original, uncompressed size of the _source field alongside the documents. If we would like to enable the _size field, we need to add the _size object and set its enabled property to true. In addition to that, we can also set the _size field to be stored by using the usual store property. So, if we would like our mapping to include the _size field and also want to store it, we have to change our mappings to something like the following:
{
  "book" : {
    "_size" : {
      "enabled" : true,
      "store" : "yes"
    },
    "properties" : {
      .
      .
      .
    }
  }
}
The _timestamp field, which is disabled by default, allows us to store information about when the document was indexed. Enabling that functionality is as simple as adding the _timestamp section to our mappings and setting the enabled property to true, for example:
{
  "book" : {
    "_timestamp" : {
      "enabled" : true
    },
    "properties" : {
      .
      .
      .
    }
  }
}
By default, the _timestamp field is not stored; it is indexed, but not analyzed. You can change those two parameters (store and index) to match your needs. In addition to that, the _timestamp field behaves just like a normal date field, so we can change its format just like we do with the usual date-based fields. In order to change the format, we need to specify the format property with the desired format.
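As a small sketch of these settings, the following mapping enables the _timestamp field, stores it, and changes its date format. The format value is only an example, and the properties section is omitted for brevity.

```json
{
  "book" : {
    "_timestamp" : {
      "enabled" : true,
      "store" : "yes",
      "format" : "YYYY-MM-dd"
    }
  }
}
```

Storing the field makes sense only if we want to have the timestamp returned with the search results.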
One more thing: instead of automatically creating the _timestamp field during document indexing, we can add the path property and set it to the name of the field from which the date should be taken. So, if we would like our _timestamp field to be based on the year field, we need to modify our mappings to something like the following:
{
  "book" : {
    "_timestamp" : {
      "enabled" : true,
      "path" : "year",
      "format" : "YYYY"
    },
    "properties" : {
      .
      .
      .
    }
  }
}
As you may notice, we also modified the format of the _timestamp field in order to match the values stored in the year field.
The _ttl field stands for time to live, a functionality that allows us to define the lifetime of a document, after which it will be automatically deleted. As you may expect, the _ttl field is disabled by default, and to enable it we need to add the _ttl JSON object with its enabled property set to true, just like in the following example:
{
  "book" : {
    "_ttl" : {
      "enabled" : true
    },
    "properties" : {
      .
      .
      .
    }
  }
}
If you need to provide a default expiration time for documents, just add the default property to the _ttl field definition with the desired expiration time. For example, to have our documents deleted after 30 days, we would set the following parameters:
{
  "book" : {
    "_ttl" : {
      "enabled" : true,
      "default" : "30d"
    },
    "properties" : {
      .
      .
      .
    }
  }
}
By default, the _ttl field is stored and indexed, but not analyzed. You can change those two parameters, but remember that this field needs to remain not analyzed in order for the functionality to work.