Chapter 2. Indexing Your Data

In the previous chapter, we learned what full text search is and how Apache Lucene fits into that picture. We were introduced to the basic concepts of Elasticsearch and became familiar with its top-level architecture. We used the REST API to index, update, delete, and of course retrieve our data. We searched our data with the simple URI query and used versioning, which gave us optimistic locking functionality. By the end of this chapter, you will have learned the following topics:

  • Basic information about Elasticsearch indexing
  • Adjusting Elasticsearch schema-less behavior
  • Creating your own mappings
  • Using out-of-the-box analyzers
  • Configuring your own analyzers
  • Indexing data in batches
  • Adding additional internal information to indices
  • Segment merging
  • Routing

Elasticsearch indexing

So far, we have our Elasticsearch cluster up and running. We know how to use the Elasticsearch REST API to index our data, how to retrieve it, and how to remove data that we no longer need. We've also learned how to search our data by using the URI request search and the Apache Lucene query language. However, until now we've used Elasticsearch functionality that allowed us not to care about indices, shards, and data structure. This may be unfamiliar if you are coming from the world of SQL databases, where you need the database and the tables with all the columns created upfront; in general, you need to describe the data structure before you can put data into the database. Elasticsearch is schema-less and by default creates indices automatically, so we can just install it and index data without any preparation. However, this is usually not the best approach in production environments, where you want to control how your data is analyzed. Because of that, we will start by showing you how to manage your indices and then guide you through the world of mappings in Elasticsearch.

Shards and replicas

In Chapter 1, Getting Started with Elasticsearch Cluster, we told you that indices in Elasticsearch are built from one or more shards. Each of those shards contains part of the document set and each shard is a separate Lucene index. In addition to that, each shard can have replicas – physical copies of the primary shard itself. When we create an index, we can tell Elasticsearch how many shards it should be built from.

Note

The default number of shards that Elasticsearch uses is 5 and each index will also contain a single replica. The default configuration can be changed by setting the index.number_of_shards and index.number_of_replicas properties in the elasticsearch.yml configuration file.
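As an illustrative sketch, changing those defaults could look as follows in elasticsearch.yml (the values here are only examples, not recommendations):

```
index.number_of_shards: 2
index.number_of_replicas: 0
```

With this configuration, every automatically or manually created index that doesn't override these settings would get two primary shards and no replicas.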

When the defaults are used, our Elasticsearch index will be built of five Apache Lucene indices, each with one replica. So, with five shards and one replica, each shard gets its own copy, and the total number of shards in the cluster is 10.

Dividing indices in such a way allows us to spread the shards across the cluster. The nice thing is that all the shards will be automatically distributed throughout the cluster. If we have a single node, Elasticsearch will put the five primary shards on that node and leave the replicas unassigned, because Elasticsearch doesn't assign a shard and its replica to the same node. The reason is simple: if that node crashed, we would lose both the primary source of the data and all its copies. So, if you have one Elasticsearch node, don't worry about replicas not being assigned; it is expected. Of course, when you have enough nodes for Elasticsearch to assign all the replicas (in addition to the primary shards), having them unassigned is not good and you should look for the probable causes of that situation.

The thing to remember is that shards and replicas are not free. First of all, each replica needs additional disk space, exactly the same amount of space that the primary shard needs. So if we have three replicas for our index, we will actually need four times the space: if our primary shards weigh 100GB in total, with three replicas we would need 400GB, that is, an additional 100GB for each replica. However, this is not the only cost. Each replica is a Lucene index on its own and Elasticsearch needs some memory to handle it. The more shards in the cluster, the more memory is used. And finally, having replicas means that indexing has to be done on each replica, in addition to the indexing on the primary shard. There is a notion of shadow replicas, which can copy the whole binary index, but in most cases each replica will do its own indexing. The good thing about replicas is that Elasticsearch will try to spread query and get requests evenly between the shards and their replicas, which means that we can use them to scale our cluster horizontally.

So to sum up the conclusions:

  • Having more shards in the index allows us to spread the index across more servers and parallelize the indexing operations, and thus achieve better indexing throughput.
  • Depending on your deployment, having more shards may increase query throughput and lower query latency, especially in environments that don't have a large number of queries per second.
  • Querying more shards may be slower than querying a single shard, because Elasticsearch needs to retrieve the data from multiple servers and combine it together in memory before returning the final query results.
  • Having more replicas results in a more resilient cluster, because when a primary shard is not available, one of its copies will take over that role. Basically, having a single replica allows us to lose one copy of a shard and still serve the complete data. Having two replicas allows us to lose two copies of the shard and still serve the complete data.
  • The higher the replica count, the higher the query throughput the cluster will have, because each replica can serve the data it holds independently of all the others.
  • A higher number of shards (both primary and replicas) will result in more memory needed by Elasticsearch.

Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. We will talk about most of them later in the book.

So, how many shards and replicas should we have for our indices? That depends. We believe that the defaults are quite good, but nothing can replace a good test. Note that the number of replicas is less critical, because you can adjust it on a live cluster after index creation; you can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and re-index your data.
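For example, the replica count of an existing index can be changed on a live cluster through the update settings API. A minimal sketch, assuming a blog index already exists on a local cluster, could look as follows:

```shell
# Increase the number of replicas of the blog index to 2 on a running cluster.
# No re-indexing is needed; Elasticsearch will allocate the new replica copies
# as soon as enough nodes are available.
curl -XPUT 'http://localhost:9200/blog/_settings' -d '{
  "index" : {
    "number_of_replicas" : 2
  }
}'
```

Remember that this only works for replicas; an analogous request trying to change number_of_shards on an existing index will be rejected.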

Write consistency

Elasticsearch allows us to control the write consistency to prevent writes from happening when they should not. By default, an Elasticsearch indexing operation is successful when the write succeeds on a quorum of the active shards, meaning 50% of the active shards plus one. We can control this behavior by adding action.write_consistency to our elasticsearch.yml file or by adding the consistency parameter to our index request. The mentioned properties can take the following values:

  • quorum: The default value, requiring 50% plus 1 active shards to be successful for the index operation to succeed
  • one: Requires only a single active shard to be successful for the index operation to succeed
  • all: Requires all the active shards to be successful for the index operation to succeed
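The per-request variant could be used as in the following sketch, which reuses the blog index from our earlier examples:

```shell
# Index a document, requiring only a single active shard to acknowledge
# the write instead of the default quorum.
curl -XPUT 'http://localhost:9200/blog/article/2?consistency=one' -d '{
  "title" : "Another article"
}'
```

This can be handy on small clusters where replicas are temporarily unassigned and the quorum requirement would otherwise cause the write to fail.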

Creating indices

When we were indexing our documents in Chapter 1, Getting Started with Elasticsearch Cluster, we didn't care about index creation at all. We assumed that Elasticsearch will do everything for us and actually it was true; we just used the following command:

curl -XPUT 'http://localhost:9200/blog/article/1' -d '{"title": "New version of Elasticsearch released!", "content": "Version 1.0 released today!", "tags": ["announce", "elasticsearch", "release"] }'

This is just fine. If such an index does not exist, Elasticsearch automatically creates it for us. However, there are times when we want to create indices ourselves, for various reasons. Maybe we would like to have control over which indices are created to avoid errors, or maybe we have some non-default settings that we would like to use when creating a particular index. The reasons may differ, but it's good to know that we can create indices without indexing documents.

The simplest way to create an index is to run a PUT HTTP request with the name of the index we want to create. For example, to create an index called blog, we could use the following command:

curl -XPUT http://localhost:9200/blog/

We just told Elasticsearch that we want to create the index with the name blog. If everything goes right, you will see the following response from Elasticsearch:

{"acknowledged":true}

Altering automatic index creation

We already mentioned that automatic index creation is not the best idea in some cases. For example, a simple typo during index creation can lead to creating hundreds of unused indices and make cluster state information larger than it should be, putting more pressure on Elasticsearch and the underlying JVM. Because of that, we can turn off automatic index creation by adding a simple property to the elasticsearch.yml configuration file:

action.auto_create_index: false

Let's stop for a while and discuss the action.auto_create_index property, because it allows us to do more complicated things than just allowing (when set to true) or disabling (when set to false) automatic index creation. It also lets us use patterns that specify which index names are allowed to be created automatically and which are disallowed. For example, let's assume that we would like to allow automatic creation of indices whose names start with logs and disallow all the others. To do that, we would set the action.auto_create_index property as follows:

action.auto_create_index: +logs*,-*

Now if we would like to create an index called logs_2015-10-01, we would succeed. To create such an index, we would use the following command:

curl -XPUT http://localhost:9200/logs_2015-10-01/log/1 -d '{"message": "Test log message" }'

Elasticsearch would respond with:

{
  "_index" : "logs_2015-10-01",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

However, suppose we now try to index a document into the blog index using the following command:

curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "Test article title" }'

Elasticsearch would respond with an error similar to the following one:

{
  "error" : {
    "root_cause" : [ {
      "type" : "index_not_found_exception",
      "reason" : "no such index",
      "resource.type" : "index_expression",
      "resource.id" : "blog",
      "index" : "blog"
    } ],
    "type" : "index_not_found_exception",
    "reason" : "no such index",
    "resource.type" : "index_expression",
    "resource.id" : "blog",
    "index" : "blog"
  },
  "status" : 404
}

One thing to remember is that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first one that matches, so if we made -* the first pattern, the +logs* pattern wouldn't be used at all.
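To illustrate, the following configuration (an illustrative fragment) would disable automatic creation for every index, including the ones starting with logs, because the -* pattern matches every index name first:

```
action.auto_create_index: -*,+logs*
```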

Settings for a newly created index

Manual index creation is also necessary when we want to pass non-default configuration options during index creation; for example, the initial number of shards and replicas. We can do that by including a JSON payload with the settings as the PUT HTTP request body. For example, if we would like to tell Elasticsearch that our blog index should only have a single shard and two replicas initially, the following command could be used:

curl -XPUT http://localhost:9200/blog/ -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 2
    }
}'

The preceding command will result in the creation of the blog index with one shard and two replicas, making a total of three physical Lucene indices – called shards as we already know. Of course there are a lot more settings that we can use, but what we did is enough for now and we will learn about the rest throughout the book.
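To check that the settings were actually applied, we can ask Elasticsearch for the index settings; a sketch, assuming the blog index created above, could look as follows:

```shell
# Retrieve the settings of the blog index in a human-readable form;
# the response should include number_of_shards and number_of_replicas.
curl -XGET 'http://localhost:9200/blog/_settings?pretty'
```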

Index deletion

Of course, similar to how we handled documents, Elasticsearch allows us to delete indices as well. Deleting an index is very similar to creating it, but instead of using the PUT HTTP method, we use the DELETE one. For example, if we would like to delete our previously created blog index, we would run the following command:

curl -XDELETE http://localhost:9200/blog

The response will be the same as the one we saw earlier when we created an index and should look as follows:

{"acknowledged":true}

Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. Even though Elasticsearch is schema-less, there are a lot of situations where we would like to manually create the schema, to avoid any problems with the index structure.
