Indexing data that is not flat

Not all data is flat like the examples we have used in the book until now. Most of the data you will encounter will have some structure, with objects nested inside the root JSON object. Of course, if we are building the system that Elasticsearch will be a part of and we control all of its pieces, we can design a structure that is convenient for Elasticsearch. But even in such cases, flat data is not always an option. Thankfully, Elasticsearch allows us to index data that is not flat, and this section shows how to do that.

Data

Let's assume that we have the following data (we store it in the file called structured_data.json):

{
  "author" : {
    "name" : {
      "firstName" : "Fyodor",
      "lastName" : "Dostoevsky"
    }
  },
  "isbn" : "123456789",
  "englishTitle" : "Crime and Punishment",
  "year" : 1866,
  "characters" : [
    {
      "name" : "Raskolnikov"
    },
    {
      "name" : "Sofia"
    }
  ],
  "copies" : 0
}

As you can see, the data is not flat: it contains arrays and nested objects. Using only what we have learned so far, we would have to flatten the data before creating mappings for it. However, as we already said, Elasticsearch allows some degree of structure, so we can create mappings that work for the preceding example as it is.

Objects

The preceding example data is a structured JSON document. Its root object has some simple properties, such as englishTitle, isbn, year, and copies. These will be indexed as normal fields in the index and we already know how to deal with them (we discussed that in the Mappings configuration section of Chapter 2, Indexing Your Data). In addition, it has the characters array and the author object. The author object has another object nested within it, the name object, which has two properties: firstName and lastName. So, as you can see, objects can be nested inside each other at multiple levels.

Arrays

We have already used array type data, but we didn't talk about it. By default, all the fields in Lucene, and thus in Elasticsearch, are multivalued, which means that they can store multiple values. To send such a field to Elasticsearch for indexing, we use the JSON array type, which encloses its values in opening and closing square brackets, []. As you can see in the preceding example, we used the array type for the characters of our book.

Mappings

Let's now look at what our mappings would look like for the book object we showed earlier. We already said that we don't need anything special to index arrays. So, in our case, to index the characters data we need to add a field definition similar to the following one:

"characters" : {
 "properties" : {
  "name" : {"type" : "string"}
 }
}

Nothing strange! We just nest the properties section inside the array's name (which is characters in our case) and define the fields there. As a result of the preceding mappings, we will get the characters.name multivalued field in the index.

We do a similar thing for our author object. We give the section the same name that the object has in the data. The author object also has the name object nested in it, so we do the same again and nest another object definition inside it. So, our mappings for the author field would look as follows:

"author" : {
 "properties" : {
  "name" : {
   "properties" : {
    "firstName" : {"type" : "string"},
    "lastName" : {"type" : "string"}
   }
  }
 }
}

The firstName and lastName fields appear in the index as author.name.firstName and author.name.lastName.
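
The naming scheme described above can be illustrated with a short Python sketch. It is only a model of how dotted field names and multivalued fields arise from nested objects and arrays, not a description of what Elasticsearch or Lucene do internally. The flatten function is a hypothetical helper, and the document is a trimmed copy of our sample data.

```python
import json


def flatten(node, prefix=""):
    """Collect the leaf values of a nested JSON value under dotted field names."""
    fields = {}
    if isinstance(node, dict):
        # Each object level adds one segment to the field name.
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            for name, values in flatten(value, path).items():
                fields.setdefault(name, []).extend(values)
    elif isinstance(node, list):
        # Arrays add no segment of their own: every element contributes
        # values to the same field names, which makes them multivalued.
        for item in node:
            for name, values in flatten(item, prefix).items():
                fields.setdefault(name, []).extend(values)
    else:
        fields.setdefault(prefix, []).append(node)
    return fields


book = json.loads("""
{
  "author": {"name": {"firstName": "Fyodor", "lastName": "Dostoevsky"}},
  "englishTitle": "Crime and Punishment",
  "characters": [{"name": "Raskolnikov"}, {"name": "Sofia"}]
}
""")

fields = flatten(book)
for name in sorted(fields):
    print(name, fields[name])
```

In this model, the two character names end up together under the single characters.name key, while the author names are reachable only through their full paths.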

The rest of the fields are simple core types, so we'll skip them here, as they were already discussed in the Mappings configuration section of Chapter 2, Indexing Your Data.

Final mappings

So our final mappings file, which we've called structured_mapping.json, looks like the following:

{
 "book" : {
  "properties" : {
   "author" : {
    "type" : "object",
    "properties" : {
     "name" : {
      "type" : "object",
      "properties" : {
       "firstName" : {"type" : "string"},
       "lastName" : {"type" : "string"}
      }
     }
    }
   },
   "isbn" : {"type" : "string"},
   "englishTitle" : {"type" : "string"},
   "year" : {"type" : "integer"},
   "characters" : {
    "properties" : {
     "name" : {"type" : "string"}
    }
   },
   "copies" : {"type" : "integer"}
  }
 }
}

Sending the mappings to Elasticsearch

Now that we have our mappings done, we would like to test whether all the work we did actually works. This time we will use a slightly different technique for creating an index and adding the mappings. First, let's create the library index with the following command (you need to delete the library index first if you already have it):

curl -XPUT 'localhost:9200/library'

Now, let's send our mappings for the book type:

curl -XPUT 'localhost:9200/library/book/_mapping' -d @structured_mapping.json

Now we can index our example data:

curl -XPOST 'localhost:9200/library/book/1' -d @structured_data.json
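
If you prefer to script these three steps, the following Python sketch builds the same requests using only the standard library. The build_requests and send helpers are hypothetical names, the base URL assumes a local node on the default port, and the Content-Type header is an addition that the preceding curl commands do not send. The paths mirror the commands above, so they apply to the same Elasticsearch version.

```python
import urllib.request

# Assumption: a local Elasticsearch node listening on the default port.
BASE_URL = "http://localhost:9200"


def build_requests(mapping_body: bytes, document_body: bytes):
    """Return the three requests from the text, in order, without sending them."""
    headers = {"Content-Type": "application/json"}
    return [
        # Create the library index.
        urllib.request.Request(f"{BASE_URL}/library", method="PUT"),
        # Send the mappings for the book type.
        urllib.request.Request(
            f"{BASE_URL}/library/book/_mapping",
            data=mapping_body,
            headers=headers,
            method="PUT",
        ),
        # Index the example document with the identifier 1.
        urllib.request.Request(
            f"{BASE_URL}/library/book/1",
            data=document_body,
            headers=headers,
            method="POST",
        ),
    ]


def send(request):
    """Send one request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

To use it, read structured_mapping.json and structured_data.json as bytes, pass them to build_requests, and call send on each returned request in order. Building the requests separately from sending them makes it possible to inspect them without a running node.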

To be or not to be dynamic

As we already know, Elasticsearch is schema-less, which means that it can index data without mappings being created upfront. When a new field is encountered in the data, Elasticsearch performs a mapping update in the background: it tries to guess the field type and adds the field to the mappings. This dynamic behavior is turned on by default, but there may be situations where you want to turn it off for some parts of your index. To do that, add the dynamic property to the given object and set it to false. The property goes on the same level of nesting as the type property of the object that shouldn't be dynamic. For example, if we want our author and name objects to not be dynamic, we should modify the relevant part of the mappings file so that it looks as follows:

"author" : {
 "type" : "object",
 "dynamic" : false,
 "properties" : {
  "name" : {
   "type" : "object",
   "dynamic" : false,
   "properties" : {
    "firstName" : {"type" : "string", "index" : "analyzed"},
    "lastName" : {"type" : "string", "index" : "analyzed"}
   }
  }
 }
}

However, remember that in order to add new fields to such objects, we would have to update the mappings ourselves. Until we do, fields that are not defined in the mappings are ignored during indexing.

Note

You can also turn off the dynamic mappings functionality by adding the index.mapper.dynamic property to your elasticsearch.yml configuration file and setting it to false.

Disabling object indexing

There is one additional thing that we would like to mention when it comes to handling objects: we can disable the indexing of a particular object by setting its enabled property to false. There may be various reasons for that, such as not wanting a field to be indexed or not wanting a whole JSON object to be indexed. For example, if we want to omit an object called information, nested in the name object of our author, the author object definition would look as follows:

"author" : {
 "type" : "object",
 "properties" : {
  "name" : {
   "type" : "object",
   "dynamic" : false,
   "properties" : {
    "firstName" : {"type" : "string", "index" : "analyzed"},
    "lastName" : {"type" : "string", "index" : "analyzed"},
    "information" : {"type" : "object", "enabled" : false}
   }
  }
 }
}

The dynamic parameter can also be set to strict. In that case, new fields won't be added to the mappings when they appear, and indexing a document that contains such fields will fail.
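
To compare the effect of the dynamic and enabled settings side by side, here is a simplified Python model. It is a sketch that assumes a nested object inherits the dynamic setting of its parent unless it sets its own; it does not reproduce Elasticsearch's actual implementation. The indexed_fields and author_properties helpers, as well as the middleName field and the contents of information, are hypothetical and serve only as illustration.

```python
def indexed_fields(doc, properties, dynamic=True, prefix=""):
    """Return the dotted names of the fields that would be indexed.

    dynamic is True (accept unmapped fields), False (skip them),
    or "strict" (reject the whole document).
    """
    names = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        mapping = properties.get(key)
        if mapping is None:
            if dynamic == "strict":
                raise ValueError(f"unmapped field {path!r} in a strict object")
            if dynamic is False:
                continue  # unmapped field is skipped, the document is still accepted
            mapping = {}  # dynamic behavior: accept the new field as it is
        if mapping.get("enabled") is False:
            continue  # the whole object is left out of the index
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                names += indexed_fields(
                    item,
                    mapping.get("properties", {}),
                    mapping.get("dynamic", dynamic),
                    path,
                )
            else:
                names.append(path)
    return sorted(set(names))


def author_properties(name_dynamic):
    """Build the author mappings from the text with a chosen dynamic setting for name."""
    return {
        "author": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "object",
                    "dynamic": name_dynamic,
                    "properties": {
                        "firstName": {"type": "string"},
                        "lastName": {"type": "string"},
                        "information": {"type": "object", "enabled": False},
                    },
                }
            },
        }
    }


doc = {
    "author": {
        "name": {
            "firstName": "Fyodor",
            "lastName": "Dostoevsky",
            "middleName": "Mikhailovich",   # hypothetical unmapped field
            "information": {"born": 1821},  # hypothetical disabled object
        }
    }
}

print(indexed_fields(doc, author_properties(False)))
try:
    indexed_fields(doc, author_properties("strict"))
except ValueError as error:
    print("rejected:", error)
```

With dynamic set to false, the model keeps only author.name.firstName and author.name.lastName: the unmapped middleName is skipped and the disabled information object is left out. With strict, the same document is rejected because of middleName.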
