Extending your index structure with additional internal information

The previous chapters gave us a good look at what ElasticSearch is capable of, both in terms of indexing and querying, but their coverage was far from complete. In this section we would like to discuss in more detail the ElasticSearch functionalities that are not used every day, but can make our life easier when it comes to data handling.

Note

Each of the following fields should be defined at the appropriate type level. So if you recall the sample mappings for our small library from Chapter 2, Searching Your Data, we would add any of the following fields under the book type mappings.

The identifier field

As you recall, each document indexed in ElasticSearch has its own identifier and type. In ElasticSearch there are two types of internal identifiers for the documents.

The first one is the _uid field, which is the unique identifier of the document in the index and is composed of the document's identifier and the document type. This basically means that documents of different types that are indexed into the same index can have the same document identifier yet ElasticSearch will be able to distinguish them. This field doesn't require any additional settings; it is always indexed, but it's good to know that it exists.

The second field holding an identifier is the _id field. This field stores the actual identifier set during index time. In order to enable the indexing of the _id field (and, if needed, storing it), we need to add the _id field definition to our mappings. Unlike ordinary properties, it goes directly in the body of the type definition, not inside the properties section.

So, our sample book type definition will look like the following:

{
 "book" : {
  "_id" : {
   "index": "not_analyzed", 
   "store" : "no"
  },
  "properties" : {
  .
  .
  .
  }
 }
}

As you can see, in the previous example, we said that we want our _id field to be indexed, but not analyzed and we don't want to store it.

In addition to specifying an ID during indexing time, we can specify that we want it to be fetched from one of the fields of the indexed documents (although this will be slightly slower because of the additional parsing needed). In order to do that we need to specify the path property with the name of the field we want to use as the identifier value provider. For example, if we have the book_id field in our index and we would like to use it as the value for the _id field, we could change the previous mappings to something like the following:

{
 "book" : {
  "_id" : {
   "path": "book_id"
  },
  "properties" : {
  .
  .
  .
  }
 }
}

One last point to remember is that even when disabling the _id field, all the functionalities requiring the document's unique identifier will still work because they will be using the _uid field instead.

The _type field

Let's say it one more time: each document in ElasticSearch is described by at least an identifier and a type, and if we want, we may include the type name as the internal _type field of our indices. By default the _type field is indexed, but not stored. If we would like to store that field, we will have to change our mappings file to one like the following:

{
 "book" : {
  "_type" : {
   "store" : "yes"
  },
  "properties" : {
  .
  .
  .
  }
 }
}

We can also change the _type field in such a way that it will not be indexed, but then some queries that rely on it, like term queries and filters on the _type field, will not work.
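
As a minimal sketch of that second option, the following mapping turns off the indexing of the _type field. It assumes that the index property accepts the value no for _type in the same way as it does for regular fields; the title property is only a hypothetical placeholder for the rest of the book mappings.

```json
{
 "book" : {
  "_type" : {
   "index" : "no"
  },
  "properties" : {
   "title" : { "type" : "string" }
  }
 }
}
```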

The _all field

The _all field allows us to create a field where the contents of other fields will be copied as well. This kind of field may be useful when we want to implement a simple search feature and search all the data (or only the fields we copy to the _all field), but we don't want to think about field names and things like that. By default the _all field is enabled and it contains all the data from all the fields from the index. In order to exclude a certain field from the _all field, one should use the include_in_all property, which was discussed in Chapter 2, Searching Your Data.
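
To illustrate the include_in_all property, a mapping that keeps a single field out of _all could look roughly like the following sketch. The title and isbn properties are hypothetical placeholders and not necessarily the fields used in the Chapter 2 library mappings; only isbn is excluded here, so title would still be copied to _all.

```json
{
 "book" : {
  "properties" : {
   "title" : { "type" : "string" },
   "isbn" : { "type" : "string", "include_in_all" : false }
  }
 }
}
```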

In order to completely turn off the _all field functionality (our index will be smaller without the _all field) we will modify our mappings file to one looking like the following:

{
 "book" : {
  "_all" : {
   "enabled" : false
  },
  "properties" : {
  .
  .
  .
  }
 }
}

In addition to the enabled property, the _all field supports the following ones:

  • store
  • term_vector
  • analyzer
  • index_analyzer
  • search_analyzer

For information about these properties please refer to Chapter 2, Searching Your Data.

The _source field

The _source field allows us to store the original JSON document that was sent to ElasticSearch during indexing. By default the _source field is turned on because some of the ElasticSearch functionalities depend on it, for example, the partial update feature that was already described in Chapter 1, Getting Started with ElasticSearch Cluster. In addition to that, the _source field can be used as the source of data for the highlighting functionality if a field is not stored. But if we don't need such functionality, we can disable the _source field because it causes some storage overhead. In order to do that, we will need to set the enabled property of the _source object to false, for example, as shown in the following code:

{
 "book" : {
  "_source" : {
   "enabled" : false
  },
  "properties" : {
  .
  .
  .
  }
 }
}

Because the _source field causes some storage overhead we may choose to compress information stored in that field. In order to do that, we would have to set the compress parameter to true. Although this will shrink the index, it will make the operations made on the _source field a bit more CPU-intensive. However, ElasticSearch allows us to decide when to compress the _source field. Using the compress_threshold property, we can control how big the _source field's content needs to be in order for ElasticSearch to compress it. This property accepts a size value in bytes (for example, 100b, 10kb).
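
A sketch of such a configuration might look like the following. It assumes that both properties are placed directly in the _source object, next to where enabled would go; the 10kb threshold and the title property are illustrative values only.

```json
{
 "book" : {
  "_source" : {
   "compress" : true,
   "compress_threshold" : "10kb"
  },
  "properties" : {
   "title" : { "type" : "string" }
  }
 }
}
```

With settings like these, only documents whose _source content is larger than the threshold should be compressed, so small documents avoid the extra CPU cost.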

The _boost field

As you may suspect, the _boost field allows us to set a default boost value for all the documents of a certain type. You may wonder, why increase the boost value of a document? If some of your documents are more important than others, you can increase their boost value in order for ElasticSearch to know that they are more valuable. Imagine that we would like our book documents to have a higher boost value than all the other types of documents in the index. To achieve that for every single document we can use the _boost field. So if we would like all our book documents to have the boost value 10.0, we can modify our mappings to something like the following:

{
 "book" : {
  "_boost" : {
   "name" : "_boost",
   "null_value" : 10.0
  },
  "properties" : {
  .
  .
  .
  }
 }
}

This mapping change says that if we don't add a field named _boost to the documents sent for indexing, the null_value value will be used as the boost. If we do add such a field, its value will be used instead of the default one.
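
For illustration, a document that overrides the default could look like the following sketch. The title field and both values are made up for this example; the only part that matters is the _boost field, whose name matches the name property from the previous mapping.

```json
{
 "title" : "Example book",
 "_boost" : 20.0
}
```

Under the previous mapping, a document like this one would be expected to get the boost value 20.0, while a document sent without the _boost field would fall back to 10.0.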

The _index field

ElasticSearch allows us to store information about the index that the document is indexed in. We can do that by using the internal _index field. Imagine that we create daily indices, use aliasing, and are interested to know in which daily index the returned document is stored. In such a case the _index field can be useful.

By default, the indexing of the _index field is disabled. In order to enable it, we need to set the enabled property of the _index object to true, for example:

{
 "book" : {
  "_index" : {
   "enabled" : true
  },
  "properties" : {
  .
  .
  .
  }
 }
}

The _size field

The _size field, which is disabled by default, allows us to automatically index the original, uncompressed size of the _source field and store it along with the documents. If we would like to enable the _size field, we need to add the _size object and set its enabled property to true. In addition to that, we can also set the _size field to be stored by using the usual store property. So, if we would like our mapping to include the _size field and also want to store it, we have to change our mappings to something like the following:

{
 "book" : {
  "_size" : {
   "enabled": true, 
   "store" : "yes"
  },
  "properties" : {
  .
  .
  .
  }
 }
}

The _timestamp field

The _timestamp field, which is disabled by default, allows us to store information about when the document was indexed. Enabling that functionality is as simple as adding the _timestamp section to our mappings and setting the enabled property to true, for example:

{
 "book" : {
  "_timestamp" : {
   "enabled" : true
  },
  "properties" : {
  .
  .
  .
  }
 }
}

By default, the _timestamp field is not stored; it is indexed, but not analyzed. You can change those two parameters (store and index) to match your needs. In addition to that, the _timestamp field is just like the normal date field, so we can change its format just like we do with the usual date-based fields. In order to change the format, we need to specify the format property with the desired format.
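
As a rough sketch, a mapping that stores the _timestamp field and sets an explicit date format could look like the following. The YYYY-MM-dd pattern is just an example value, the store property is assumed to behave as it does for the other internal fields shown in this section, and the title property is a hypothetical placeholder.

```json
{
 "book" : {
  "_timestamp" : {
   "enabled" : true,
   "store" : "yes",
   "format" : "YYYY-MM-dd"
  },
  "properties" : {
   "title" : { "type" : "string" }
  }
 }
}
```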

One more thing, instead of automatically creating the _timestamp field during document indexation, we can add the path property and set it to the name of the field from which the date should be taken. So if we would like our _timestamp field to be based on the year field, we need to modify our mappings to something like the following:

{
 "book" : {
  "_timestamp" : {
   "enabled" : true,
   "path" : "year",
   "format" : "YYYY"
  },
  "properties" : {
  .
  .
  .
  }
 }
}

As you may notice, we also modify the format of the _timestamp field in order to match the values stored in the year field.

Note

If you use the _timestamp field and you let ElasticSearch create it automatically, the value of that field will be set to the time of indexation of that document. Please note that when using the partial document update functionality the _timestamp field will also be updated.

The _ttl field

The _ttl field stands for time to live, that is, a functionality that allows us to define a life period of a document after which it will be automatically deleted. As you may expect, by default the _ttl field is disabled and to enable it we need to add the _ttl JSON object with its enabled property set to true, just like in the following example:

{
 "book" : {
  "_ttl" : {
   "enabled" : true
  },
  "properties" : {
  .
  .
  .
  }
 }
}

If you need to provide the default expiration time for documents, just add the default property to the _ttl field definition with the desired expiration time. For example, to have our documents deleted after 30 days, we would set the following parameters:

{
 "book" : {
  "_ttl" : {
   "enabled" : true,
   "default" : "30d"
  },
  "properties" : {
  .
  .
  .
  }
 }
}

By default, the _ttl field is stored and indexed, but not analyzed. You can change the store and index parameters, but remember that this field needs to be indexed as not analyzed in order for the functionality to work.
