So far we have learned how to install, configure, and query our ElasticSearch cluster, and we have prepared some more sophisticated mappings. We have also used aliasing to make querying easier and routing to control where data is placed. In this chapter, we will extend our knowledge of ElasticSearch by looking at how to index data that is not flat, how to handle geographical data, and how to deal with files. We will also learn how to distinguish which text fragments were matched and how to implement the commonly used autocomplete feature. By the end of this chapter you will learn:
Not all data is as flat as the data we have been using since Chapter 2, Searching Your Data. Of course, if we are building a system that ElasticSearch will be a part of, we can create a structure that is convenient for ElasticSearch. However, the data doesn't need to be flat; it can be more object-oriented. Let's see how to create mappings that use fully structured JSON objects.
Let's assume we have the following data (we store it in a file called structured_data.json):
{
  "book" : {
    "author" : {
      "name" : {
        "firstName" : "Fyodor",
        "lastName" : "Dostoevsky"
      }
    },
    "isbn" : "123456789",
    "englishTitle" : "Crime and Punishment",
    "originalTitle" : "Преступлéние и наказáние",
    "year" : 1886,
    "characters" : [
      {
        "name" : "Raskolnikov"
      },
      {
        "name" : "Sofia"
      }
    ],
    "copies" : 0
  }
}
As you can see, the data is not flat; it contains arrays and nested objects, so we can't use the mappings we used previously. However, we can create mappings that are able to handle such data.
The previous example data shows a structured JSON file. As you can see, the root object in our file is book. The root object is a special one that allows us to define additional properties. The book root object has some simple properties, such as englishTitle, originalTitle, and so on. Those will be indexed as normal fields in the index. In addition, it has the characters array, which we will discuss in the next paragraph. For now, let's focus on author. As you can see, author is an object that has another object nested within it: the name object, which has two properties, firstName and lastName.
We have already used array type data, but we didn't talk about it. By default, all fields in Lucene, and thus in ElasticSearch, are multivalued, which means that they can store multiple values. In order to send such fields to ElasticSearch for indexing, we use the JSON array type, which is enclosed in square brackets ([]). As you can see in the previous example, we used the array type for our book's characters property.
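To make the notion of a multivalued field concrete, the following Python sketch (illustrative only; this is not ElasticSearch code) shows how the characters array from our example data ends up as two values of a single characters.name field:

```python
import json

# The characters part of our example document.
doc = json.loads("""
{
  "characters" : [
    { "name" : "Raskolnikov" },
    { "name" : "Sofia" }
  ]
}
""")

# Because fields are multivalued, indexing this array produces a single
# field, characters.name, that holds both values.
names = [character["name"] for character in doc["characters"]]
print(names)  # ['Raskolnikov', 'Sofia']
```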
So, what do we have to do to index data such as that shown previously? To index arrays we don't need to do anything special; we just specify the properties for such fields inside the array name. So, in our case, in order to index the characters data present in our example, we need to add mappings such as the following:
"characters" : {
  "properties" : {
    "name" : {"type" : "string", "store" : "yes"}
  }
}
Nothing strange here; we just nest the properties section inside the array's name (which is characters in our case) and define the fields there. As a result of this mapping, we get the multivalued characters.name field in the index.
We perform similar steps for our author object. We call the section by the same name as appears in the data, but in addition to the properties section, we also tell ElasticSearch that it should expect an object by adding the type property with the value object. The author object in turn has the name object nested inside it, so we do the same thing and just nest another object within it. So, our mappings for that part would look like the following code:
"author" : {
  "type" : "object",
  "properties" : {
    "name" : {
      "type" : "object",
      "properties" : {
        "firstName" : {"type" : "string", "store" : "yes"},
        "lastName" : {"type" : "string", "store" : "yes"}
      }
    }
  }
}
The firstName and lastName fields will appear in the index as author.name.firstName and author.name.lastName. We will check whether that is true in just a second.
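To see why those dotted names appear, here is a small Python sketch of the general idea (an illustration of the flattening concept, not ElasticSearch's actual implementation) that turns a nested object into dot-separated field paths:

```python
# Flatten nested dictionaries into dot-separated field paths, roughly the
# way nested JSON objects become field names in the Lucene index.
def flatten(obj, prefix=""):
    fields = {}
    for key, value in obj.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            fields.update(flatten(value, path))
        else:
            fields[path] = value
    return fields

author = {"author": {"name": {"firstName": "Fyodor", "lastName": "Dostoevsky"}}}
print(flatten(author))
# {'author.name.firstName': 'Fyodor', 'author.name.lastName': 'Dostoevsky'}
```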
The rest of the fields are simple core types, so I'll skip them here; they were already covered in the Schema mapping section of Chapter 1, Getting Started with ElasticSearch Cluster.
So, our final mappings file, which we've called structured_mapping.json, looks like the following:
{
  "book" : {
    "properties" : {
      "author" : {
        "type" : "object",
        "properties" : {
          "name" : {
            "type" : "object",
            "properties" : {
              "firstName" : {"type" : "string", "store" : "yes"},
              "lastName" : {"type" : "string", "store" : "yes"}
            }
          }
        }
      },
      "isbn" : {"type" : "string", "store" : "yes"},
      "englishTitle" : {"type" : "string", "store" : "yes"},
      "originalTitle" : {"type" : "string", "store" : "yes"},
      "year" : {"type" : "integer", "store" : "yes"},
      "characters" : {
        "properties" : {
          "name" : {"type" : "string", "store" : "yes"}
        }
      },
      "copies" : {"type" : "integer", "store" : "yes"}
    }
  }
}
As we already know, ElasticSearch is schemaless, which means that it can index data without the mappings being created first (although we should create them if we want to control the index structure). This dynamic behavior is turned on by default, but there may be situations where you want to turn it off for some parts of your index. In order to do that, add the dynamic property, set to false, at the same nesting level as the type property of the object that shouldn't be dynamic. For example, if we would like our author and name objects not to be dynamic, we would modify the relevant parts of the mappings file so that they look like the following code:
"author" : {
  "type" : "object",
  "dynamic" : false,
  "properties" : {
    "name" : {
      "type" : "object",
      "dynamic" : false,
      "properties" : {
        "firstName" : {"type" : "string", "store" : "yes"},
        "lastName" : {"type" : "string", "store" : "yes"}
      }
    }
  }
}
However, please remember that in order to add new fields to such objects, we have to update the mappings first.
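For example, if we later decided to index a hypothetical middleName field (an invented field for illustration, not part of our example data) inside the non-dynamic name object, we would first have to send updated mappings along these lines:

```json
"name" : {
  "type" : "object",
  "dynamic" : false,
  "properties" : {
    "firstName" : {"type" : "string", "store" : "yes"},
    "lastName" : {"type" : "string", "store" : "yes"},
    "middleName" : {"type" : "string", "store" : "yes"}
  }
}
```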
The last thing I would like to do is test whether all the work we've done actually works. This time we will use a slightly different technique for creating an index and adding the mappings. First, let's create the library index with the following command:
curl -XPUT 'localhost:9200/library'
Now, let's send our mappings for the book type:
curl -XPUT 'localhost:9200/library/book/_mapping' -d @structured_mapping.json
Now we can index our example data:
curl -XPOST 'localhost:9200/library/book/1' -d @structured_data.json
If we want to see how our data was indexed, we can run a query such as the following:
curl -XGET 'localhost:9200/library/book/_search?q=*:*&fields=*&pretty=true'
It will return the following data:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 1.0,
      "fields" : {
        "copies" : 0,
        "characters.name" : [ "Raskolnikov", "Sofia" ],
        "englishTitle" : "Crime and Punishment",
        "author.name.lastName" : "Dostoevsky",
        "isbn" : "123456789",
        "originalTitle" : "Преступлéние и наказáние",
        "year" : 1886,
        "author.name.firstName" : "Fyodor"
      }
    } ]
  }
}
As you can see, all the fields from arrays and object types were indexed properly. Note, for example, that the author.name.firstName field is present, because ElasticSearch flattened the nested data.
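If you want to process such a response programmatically, the flattened field names can be read straight from the fields section of each hit. The following Python sketch parses a trimmed copy of the response shown above:

```python
import json

# A trimmed copy of the search response shown above.
response = json.loads("""
{
  "hits" : {
    "total" : 1,
    "hits" : [ {
      "_id" : "1",
      "fields" : {
        "characters.name" : [ "Raskolnikov", "Sofia" ],
        "author.name.firstName" : "Fyodor",
        "author.name.lastName" : "Dostoevsky"
      }
    } ]
  }
}
""")

# The flattened field names are ordinary keys in the fields object of a hit.
fields = response["hits"]["hits"][0]["fields"]
print(fields["author.name.firstName"])  # Fyodor
print(fields["characters.name"])        # ['Raskolnikov', 'Sofia']
```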