Elasticsearch mapping

We have seen in the previous chapter how an index can have one or more types and each type has its own mapping.

Mappings are like database schemas: they describe the fields or properties that the documents of a type may have, the data type of each field (such as string, integer, or date), and how these fields should be indexed and stored by Lucene.

One more thing to consider is that, unlike a database, you cannot have two fields with the same name but different data types in the same index; otherwise, you will break doc_values, and sorting and searching on that field are also broken. For example, create an index called myIndex and index a document with a valid field that contains an integer value inside the type1 document type:

curl -XPOST localhost:9200/myIndex/type1/1 -d '{"valid":5}'

Now, index another document inside type2 in the same index with the valid field. This time the valid field contains a string value:

curl -XPOST localhost:9200/myIndex/type2/1 -d '{"valid":"40"}'

In this scenario, sorting and aggregations on the valid field are broken because both values are indexed into the same valid field within a single Lucene index!
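
To see the breakage, run a sort on the valid field across the whole index. The following is a minimal sketch; depending on the Elasticsearch version, such a request either fails or returns inconsistent results because the two type mappings conflict:

curl -XPOST 'localhost:9200/myIndex/_search' -d '{
  "sort": [
    {"valid": {"order": "asc"}}
  ]
}'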

Document metadata fields

When a document is indexed into Elasticsearch, there are several metadata fields maintained by Elasticsearch for that document. The following are the most important metadata fields you need to know in order to control your index structure:

  • _id: This is a unique identifier for the document. It can be auto-generated, set explicitly while indexing, or configured in the mapping to be extracted automatically from a field.
  • _source: This is a special field generated by Elasticsearch that stores the actual JSON document. Whenever we execute a search request, the _source field is returned by default. It is enabled by default, but can be disabled using the following configuration while creating a mapping:
    PUT index_name/_mapping/doc_type
          {"_source":{"enabled":false}}

    Note

    Be careful while disabling the _source field, as many features stop working without it. For example, highlighting depends on the _source field. With _source disabled, documents can only be searched, not returned; they cannot be re-indexed and cannot be updated.

  • _all: When a document is indexed, the values from all of its fields are indexed separately as well as in a special field called _all. Elasticsearch does this by default to allow searching the content of a document without specifying a field name. It comes with an extra storage cost and should be disabled if searches are always made against specific field names. To disable it completely, use the following configuration in your mapping file:
    PUT index_name/_mapping/doc_type
          {"_all": { "enabled": true  }}

    However, there are some cases where you want only certain fields, rather than all of them, to be included in _all. You can achieve this by setting the include_in_all parameter to false for the fields to be excluded:

    PUT index_name/_mapping/doc_type
          {
              "_all": {
              "enabled": true
              },
              "properties": {
                "first_name": {
                "type": "string",
                "include_in_all": false
                },
                "last_name": {
                "type": "string"
                }
              }
          }

    In the preceding example, only the last_name field will be included in the _all field.

  • _ttl: There are some cases when you want documents to be deleted from the index automatically, for example, logs. The _ttl (time to live) field lets you specify when a document should be deleted automatically. By default, it is disabled and can be enabled using the following configuration:
        PUT index_name/_mapping/doc_type
        {
          "_ttl": {
            "enabled": true,
            "default": "1w"
            }
        }

    Inside the default field, you can use time units such as m (minutes), d (days), w (weeks), M (months), and ms (milliseconds). The default is milliseconds.

    Note

    Please note that the _ttl field has been deprecated since the Elasticsearch 2.0.0-beta2 release and might be removed in upcoming versions. Elasticsearch will provide a new replacement for this field in future versions.

  • dynamic: There are some scenarios in which you want to restrict dynamic fields from being indexed and allow only the fields that you have explicitly defined in the mapping. This can be done by setting the dynamic property to strict, in the following way (a short sketch of this behavior follows after this list):
        PUT index_name/_mapping/doc_type
        {
            "dynamic": "strict",
            "properties": {
              "first_name": {
              "type": "string"
              },
              "last_name": {
              "type": "string"
              }
            }
        }
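
The following is a minimal sketch of this strict behavior, assuming the preceding mapping has been applied to index_name. The first document contains only mapped fields and is indexed normally, while the second one carries an unmapped middle_name field (a hypothetical field used only for illustration) and is rejected by Elasticsearch; the exact exception name varies across versions:

# Accepted: only mapped fields are present
curl -XPOST 'localhost:9200/index_name/doc_type/1' -d '{"first_name":"John","last_name":"Smith"}'

# Rejected: middle_name is not defined in the strict mapping
curl -XPOST 'localhost:9200/index_name/doc_type/2' -d '{"first_name":"Jane","middle_name":"K","last_name":"Doe"}'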

Data types and index analysis options

Lucene provides several options to configure each and every field separately depending on the use case. These options differ slightly based on the data type of a field.

Configuring data types

Data types in Elasticsearch are divided into two groups:

  • Core types: These include string, number, date, boolean, and binary
  • Complex data types: These include arrays, objects, multi fields, geo points, geo shapes, nested, attachment, and IP

    Note

    Since Elasticsearch understands JSON, all the data types supported by JSON are also supported in Elasticsearch, along with some extra data types such as geopoint and attachment.

The following are the common attributes for the core data types:

  • index: The value can be analyzed, no, or not_analyzed. If set to analyzed, the text of that field is analyzed using the specified analyzer. If set to no, the values of that field are not indexed and thus are not searchable. If set to not_analyzed, the values are indexed as they are; for example, Elasticsearch Essentials will be indexed as a single term and thus only exact matches can be made while querying.
  • store: This takes yes or no as its value (the default is no, but _source is an exception). Apart from indexing the values, Lucene also has an option to store the data, which comes in handy when you want to extract the data of that field. However, since Elasticsearch stores the whole document inside the _source field, it is usually not required to store individual fields in Lucene.
  • boost: This defaults to 1. It specifies the importance of the field within the document.
  • null_value: Using this attribute, you can set a default value to be indexed if a document contains a null value for that field. The default behavior is to omit fields that contain null.

    Note

    One should be careful while configuring default values for null. The default value should always be of the type corresponding to the data type configured for that field, and it also should not be a real value that might appear in some other document.
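
A minimal sketch that combines these common attributes in a single mapping might look as follows; the field names and the -1 sentinel used for null_value are purely illustrative:

{
  "title": {
    "type": "string",
    "index": "analyzed",
    "store": "yes",
    "boost": 2
  },
  "rating": {
    "type": "integer",
    "null_value": -1
  }
}

Here, -1 is chosen as a value that should never occur as a real rating, in line with the preceding note.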

Let's start with the configuration of the core as well as complex data types.

String

In addition to the common attributes, the following attributes can also be set for string-based fields:

  • term_vector: This property defines whether Lucene term vectors should be calculated for the field or not. The values can be no (the default), yes, with_offsets, with_positions, and with_positions_offsets.

    Note

    A term vector is the list of terms in a document along with their number of occurrences in that document. Term vectors are mainly used for highlighting and More Like This (searching for similar documents) queries. A very nice blog post on term vectors has been written by Adrien Grand, which can be read here: http://blog.jpountz.net/post/41301889664/putting-term-vectors-on-a-diet.

  • omit_norms: This takes values as true or false. The default value is false. When this attribute is set to true, it disables the Lucene norms calculation for that field (and thus you can't use index-time boosting).
  • analyzer: The name of a globally defined analyzer for the index that is used for both indexing and searching. It defaults to the standard analyzer, but can be overridden, as we will see in the upcoming section.
  • index_analyzer: The name of the analyzer used for indexing. This is not required if the analyzer attribute is set.
  • search_analyzer: The name of the analyzer used for searching. This is not required if the analyzer attribute is set.
  • ignore_above: This specifies the maximum character count for the field. If the value is longer than the specified limit, it won't be indexed. This setting is mainly used for not_analyzed fields. Lucene has a term byte-length limit of 32,766, which means a single term cannot contain more than 10,922 characters (one UTF-8 character occupies at most 3 bytes).

An example mapping for two string fields, contents and author_name, is as follows:

{
  "contents": {
    "type": "string",
    "store": "yes",
    "index": "analyzed",
    "include_in_all": false,
    "analyzer": "simple"
  },
  "author_name": {
    "type": "string",
    "index": "not_analyzed",
    "ignore_above": 50
  }
}

Number

The number data types are byte, short, integer, long, float, and double. Fields that contain numeric values need to be configured with the appropriate data type. Please go through the storage requirements of each numeric type before deciding which one you should actually use; if a field does not contain very large values, choosing long instead of integer is a waste of space.

An example of configuring numeric fields is shown here:

{"price":{"type":"float"},"age":{"type":"integer"}}

Date

Working with dates usually comes with some extra challenges because there are so many date formats available, and you need to decide on the correct format while creating the mapping. Date fields usually take two parameters: type and format. However, you can use the other analysis options too.

Elasticsearch provides a list of formats to choose from depending on the date format of your data. You can visit the following URL to learn more about it: http://www.elasticsearch.org/guide/reference/mapping/date-format.html.

The following is an example of configuring date fields:

{
  "creation_time": {
    "type": "date",
    "format": "YYYY-MM-dd"
  },
  "updation_time": {
    "type": "date",
    "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd"
  },
  "indexing_time": {
    "type": "date",
    "format": "date_optional_time"
  }
}

Please note the different date formats used for the different date fields in the preceding mapping. The updation_time field contains a special format with the || operator, which specifies that it will accept both the yyyy/MM/dd HH:mm:ss and yyyy/MM/dd date formats. Elasticsearch uses date_optional_time as the default date parsing format, which is an ISO datetime parser.
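
As an illustration, a document such as the following would be parsed correctly by the preceding mapping (the events index and event type names are hypothetical):

curl -XPOST 'localhost:9200/events/event/1' -d '{
  "creation_time": "2015-09-20",
  "updation_time": "2015/09/21 13:07:32",
  "indexing_time": "2015-09-22T10:15:30"
}'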

Boolean

While indexing data, a Boolean type field can contain only two values: true or false, and it can be configured in a mapping in the following way:

{"is_verified":{"type":"boolean"}}

Arrays

By default, all the fields in Lucene, and thus in Elasticsearch, are multivalued, which means they can store multiple values. To send such fields for indexing to Elasticsearch, we use the JSON array syntax, in which values are nested within opening and closing square brackets []. The following considerations need to be taken care of while working with array data types (a short indexing example follows this list):

  • All the values of an array must be of the same data type.
  • If the data type of a field is not explicitly defined in a mapping, then the data type of the first value inside the array is used as the type of that field.
  • The order of the elements is not maintained inside the index, so do not get upset if you do not find the desired results while querying. However, this order is maintained inside the _source field, so when you return the data after querying, you get the same JSON as you had indexed.
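
For example, a document with a multivalued tags field can be indexed as follows (the blog index and post type are hypothetical); no special array configuration is required in the mapping, as tags is simply mapped as a string field:

curl -XPOST 'localhost:9200/blog/post/1' -d '{
  "title": "Elasticsearch mapping",
  "tags": ["search", "mapping", "lucene"]
}'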

Objects

JSON documents are hierarchical in nature, which allows them to define inner objects. Elasticsearch completely understands the nature of these inner objects and can map them easily by providing query support for their inner fields.

Note

Once a field is declared as an object type, you can't put any other type of data into it. If you try to do so, Elasticsearch will throw an exception.

The following mapping defines a features object field, which itself contains an inner sub_features object:

{
  "features": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string"
      },
      "sub_features": {
        "dynamic": false,
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "description": {
            "type": "string"
          }
        }
      }
    }
  }
}

If you look carefully at the preceding mapping, there is a features root object field that contains two properties: name and sub_features. Further, sub_features, which is an inner object, also contains two properties, name and description, but it has an extra setting: dynamic: false. Setting this property to false for an object changes the dynamic behavior of Elasticsearch, and you cannot index any other fields inside that object apart from the ones declared in the mapping. Therefore, you can index more fields in the future inside the features object, but not inside the sub_features object.
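
A document that fits the preceding mapping could look like the following sketch (the products index and product type are hypothetical). The extra release_year field inside features would be mapped dynamically, while an extra field placed inside sub_features would simply be ignored because of dynamic: false:

curl -XPOST 'localhost:9200/products/product/1' -d '{
  "features": {
    "name": "display",
    "release_year": 2015,
    "sub_features": {
      "name": "resolution",
      "description": "full HD"
    }
  }
}'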

Indexing the same field in different ways

If you need to index the same field in different ways, the following is how to create a mapping for it. You can define as many sub-fields with the fields parameter as you want:

{
  "name": {
    "type": "string",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

With the preceding mapping, you just need to index data into the name field. Elasticsearch will index that data into the name field using the standard analyzer, which can be used for a full text search, and into the name.raw field without analyzing the tokens, which can be used for exact term matching. You do not have to send data to the name.raw field explicitly.
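
As a rough sketch, a full text query can then target name while an exact match targets name.raw (the index name and query values here are only illustrative):

# Full text search on the analyzed field
curl -XPOST 'localhost:9200/index_name/_search' -d '{
  "query": {"match": {"name": "elasticsearch essentials"}}
}'

# Exact term matching on the not_analyzed sub-field
curl -XPOST 'localhost:9200/index_name/_search' -d '{
  "query": {"term": {"name.raw": "Elasticsearch Essentials"}}
}'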

Note

Please note that this option is only available for core data types and not for the objects.

Putting mappings in an index

There are two ways of putting mappings inside an index:

  • Using a POST request at the time of index creation:
    curl -XPOST 'localhost:9200/index_name' -d '{
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "type1": {
          "_all": {
            "enabled": false
          },
          "properties": {
            "field1": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "type2": {
          "properties": {
            "field2": {
              "type": "string",
              "index": "analyzed",
              "analyzer":"keyword"
            }
          }
        }
      }
    }'
  • Using a PUT request with the _mapping API. The index must exist before creating a mapping in this way:
    curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d '{
      "_all": {
        "enabled": false
      },
      "properties": {
        "field1": {
          "type": "integer"
        }
      }
    }'

    The mappings for the fields are enclosed inside the properties object, while all the metadata fields will appear outside the properties object.

    Note

    It is highly recommended to use the same configuration for the same field names across different types and indexes in a cluster. For instance, the data types and analysis options must be the same; otherwise, you may see unexpected results.

Viewing mappings

Mappings can be viewed easily with the _mapping API:

  • To view the mappings of all the types in an index, use the following command: curl -XGET 'localhost:9200/index_name/_mapping?pretty'
  • To view the mapping of a single type, use the following command: curl -XGET 'localhost:9200/index_name/type_name/_mapping?pretty'

Updating mappings

If you want to add mapping for some new fields in the mapping of an existing type, or create a mapping for a new type, you can do it later using the same _mapping API.

For example, to add a new field to our existing type, we only need to specify the mapping for the new field in the following way:

curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d '{
  "properties": {
    "new_field_name": {
      "type": "integer"
    }
  }
}'

Please note that the mapping definition of an existing field cannot be changed.
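
For instance, since field1 was mapped as an integer in the earlier example, a request such as the following sketch, which tries to change it to a string, will be rejected with a mapping merge error rather than applied (the exact error message depends on the Elasticsearch version):

curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d '{
  "properties": {
    "field1": {
      "type": "string"
    }
  }
}'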

Tip

Dealing with long JSON data to be sent in the request body

While creating indexes with settings, custom analyzers, and mappings, you must have noticed that all the JSON configurations are passed using -d, which stands for data and is used to send a request body. While creating settings and mappings, the JSON often becomes so large that it is difficult to handle on the command line with curl. The easy solution is to create a file with the .json extension and provide the path of that file while working with those settings or mappings. The following are example commands:

curl -XPUT 'localhost:9200/index_name/_settings' -d @path/setting.json
curl -XPUT 'localhost:9200/index_name/index_type/_mapping' -d @path/mapping.json
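
For example, a hypothetical mapping.json file used with the second command could contain the same JSON body that would otherwise be passed inline:

{
  "properties": {
    "field1": {
      "type": "integer"
    },
    "field2": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}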