Chapter 11. Scaling by Example

In the previous chapter, we discussed Elasticsearch administration. We started with backups and how to perform them using the available API. We monitored the health and state of our clusters and nodes, and we learned how to control shard rebalancing. We controlled shard and replica allocation and used the human-friendly Cat API to get information about the cluster and nodes. We saw how to use warmers to speed up potentially heavy queries, and we used index aliasing to manage our indices more easily. By the end of this chapter, you will have learned about the following topics:

  • Hardware preparations for running Elasticsearch
  • Tuning a single Elasticsearch node
  • Preparing highly available and fault tolerant clusters
  • Expanding Elasticsearch vertically
  • Preparing Elasticsearch for high query and indexing throughput
  • Monitoring Elasticsearch

Hardware

One of the first decisions that we need to make when starting any serious software project is the set of choices related to hardware. And believe us, this is not only a very important choice, but also one of the most difficult ones. The decisions are often made at an early project stage, when only the basic architecture is known and we don't have precise information regarding the queries, the data load, and so on. The project architect has to balance caution against the projected cost of the whole solution. Too often it is an intersection of experience and clairvoyance, which can lead to either great or terrible results.

Physical servers or a cloud

Let's start with a decision: cloud, virtual, or physical machines. Nowadays, these are all valid options, but it was not always the case. Some time ago, the only options were to buy new servers for each part of the environment or to share resources with other applications on the same machine. The second option makes perfect sense as it is more cost-effective, but it introduces risk. Problems with one application, especially when they are hardware related, will result in problems for the other applications. You can imagine one of your applications using most of the I/O subsystem of the physical machine and all the other applications struggling with lots of I/O waits and performance problems because of that. Virtualization promises application separation and a more convenient way of managing resources, but you are still limited by the underlying hardware. Any unexpected traffic spike could be a problem and affect service availability. Imagine that your ecommerce site suddenly gains a massive number of customers. Instead of being glad that the spike appeared and that you have more potential customers, you search for a place where you can buy additional hardware that will be delivered as soon as possible.

Cloud computing, on the other hand, means a more flexible cost model. We can easily add new machines whenever we need them. We can add them temporarily when we expect a greater load (for example, before Christmas for an ecommerce site) and pay only for the processing power we actually use. It is just a few clicks in the admin panel. Even more, we can also set up automatic scaling, that is, new virtual machines can appear automatically when we need them, and cloud-based software can shut them down when we do not need them anymore. The cloud has many advantages, such as a lower initial cost, the ability to easily grow your business, and insensitivity to temporary fluctuations in resource requirements, but it also has several flaws. The costs of cloud servers rise faster than those of physical machines. Also, mass storage, although practically unlimited, has worse characteristics (the number of operations per second) than that of physical servers. This is sometimes a great problem for us, especially with disk-intensive software such as Elasticsearch.

In practice, as usual, the choice can be hard, but going through a few points can help you with your decision:

  • Business requirements may point directly to your own servers; for example, some procedures related to financial or medical data automatically exclude cloud solutions hosted by third-party vendors
  • For proof of concept and low/medium load services, the cloud can be a good choice because of simplicity, scalability, and low cost
  • Solutions with strong I/O subsystem requirements will probably work better on bare metal machines, where you have greater influence over what storage type is available to you
  • When the traffic can greatly change within a short time, the cloud is a perfect place for you

For the purpose of further discussion, let's assume that we want to buy our own servers. We are in the computer store now, so let's buy something!

CPU

In most cases, this is the least important part. You can choose any modern CPU model, but you should know that more cores mean a higher number of concurrent queries and indexing threads. That will let you index data faster, especially with complicated analysis and lots of merges.

RAM memory

More gigabytes of RAM are always better than fewer gigabytes of RAM. Memory is necessary, especially for aggregations and sorting. This is less of a problem now, with Elasticsearch 2.0 and doc values, but complicated queries with lots of aggregations still require memory to process the data. Memory is also used for indexing buffers and can lead to indexing speed improvements, because more data can be buffered in memory and thus the disks are used less frequently. If you try to use more memory than is available, the operating system will use the hard disks as temporary space (it starts swapping), and you should avoid this at all costs. Note that you should never try to force Elasticsearch to use as much memory as possible. The first reason is the Java garbage collector – less memory is more GC friendly. The second reason is that the unused memory is actually used by the operating system for buffers and the disk cache. In fact, when your index fits in this space, all data is read from these caches and not from the disks directly. This can drastically improve performance. By default, Elasticsearch and the I/O subsystem share the same I/O cache, which is another reason to leave even more memory to the operating system itself.

In practice, 8GB is the lower bound for memory. This does not mean that Elasticsearch will never work with less, but for most situations and data-intensive applications it is the reasonable minimum. On the other hand, more than 64GB is rarely needed. Instead, think about scaling the system horizontally rather than assigning such amounts of memory to a single Elasticsearch node.
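
As a rough sketch of what this means in practice, the heap of an Elasticsearch 1.x/2.x node is usually set through the ES_HEAP_SIZE environment variable before starting the node. The 8g value below is only an example and should match your hardware; keep the heap at no more than about half of the available RAM so that the rest is left for the operating system caches.

    # A sketch only – 8g is an example value; keep the heap at about half of the
    # available RAM at most, leaving the rest to the operating system file cache.
    # ES_HEAP_SIZE sets both -Xms and -Xmx in the Elasticsearch startup scripts.
    export ES_HEAP_SIZE=8g
    ./bin/elasticsearch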

Mass storage

We said that we are in a good situation when the whole index fits into memory. In practice, this can be difficult to achieve, so good and fast disks are very important. It is even more important if one of the requirements is high indexing throughput. In such a case, you may consider fast SSD disks. Unfortunately, these disks are expensive if your data volume is big. You can improve the situation by avoiding RAID (see https://en.wikipedia.org/wiki/RAID), except for RAID 0. In most cases, when you handle fault tolerance by having multiple servers, an additional level of security at the RAID level is unnecessary. The last thing is to avoid external storage, such as network attached storage (NAS) or NFS volumes. The network latency in such cases always kills all the advantages of these solutions.
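
If you prefer to skip RAID 0 entirely, Elasticsearch can also spread shard data over several local disks on its own: the path.data property accepts a list of directories. The following is a minimal sketch of such a configuration; the mount points are example names only.

    # A sketch only – /mnt/disk1 and /mnt/disk2 are example mount points.
    # Listing several local data paths in elasticsearch.yml lets Elasticsearch
    # spread shard data over the disks without building a RAID 0 array.
    echo 'path.data: ["/mnt/disk1", "/mnt/disk2"]' >> config/elasticsearch.yml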

The network

When you use an Elasticsearch cluster, each node opens several connections to the other nodes for various purposes. When you index, the data is forwarded to the appropriate shards and replicas. When you query for data, the node used for querying can run multiple partial queries against the other nodes and compose the reply from the data fetched from them. This is why you should make sure that your network is not the bottleneck. In practice, use one network for all the servers in the cluster and avoid solutions in which the nodes of the cluster are spread between data centers.
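
As an illustration, each node can be bound to the interface of the dedicated cluster network and pointed at its neighbors over that same network. The addresses below are example values only, not a recommendation for your environment.

    # A sketch only – the IP addresses below are example values.
    # Bind the node to the dedicated cluster network and discover the other
    # nodes over that same network (appended to config/elasticsearch.yml).
    echo 'network.host: 192.168.1.10' >> config/elasticsearch.yml
    echo 'discovery.zen.ping.unicast.hosts: ["192.168.1.11", "192.168.1.12"]' >> config/elasticsearch.yml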

How many servers

The answer is always the same: it depends. It depends on many factors: the number of requests per second, the data volume, the complexity of the queries, the usage of aggregations and sorting, the number of new documents per unit of time, how fast new data should be available for searching (the refresh time), the average document size, and the analyzers used. In practice, the handiest answer is: test it and approximate.

The one thing that is often underestimated is data security. When you think about fault tolerance and availability, you should start with three servers. Why? We talked about the split brain situation in the Master election configuration section of Chapter 9, Elasticsearch Cluster in Detail. Starting with three servers, we are able to handle a single Elasticsearch node failure without taking down the whole cluster.
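
As a sketch of what this looks like in practice, with three master-eligible nodes the discovery.zen.minimum_master_nodes property should be set to 2, so that a master can only be elected when a majority of the nodes see each other. The setting is dynamic, so it can also be applied at runtime through the cluster settings API:

    # A sketch only: with three master-eligible nodes, require a majority of two
    # before a master can be elected, which protects against the split brain scenario.
    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "discovery.zen.minimum_master_nodes": 2
      }
    }'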

Cost cutting

You did some tests, carefully considered the planned functionalities, estimated the volumes and load, and went to the project owner with an architecture draft. "It's too expensive," he said, and asked you to think about the servers once again. What can we do?

Let's think about server roles and try to introduce some differences between them. If one of the requirements is indexing massive amounts of time-based data (logs, perhaps), a possible approach is to have two groups of servers: hot nodes, where new data arrives, and cold nodes, where older data is moved. Thanks to this approach, the hot nodes may have faster but smaller disks (that is, solid state drives), in contrast to the cold nodes, where fast disks are not so important but space is. You can also divide your architecture into several groups, such as master servers (less powerful, with relatively small disks), data nodes (bigger disks), and query aggregator nodes (more RAM), as the sketch after this paragraph illustrates. We will talk about this in the following sections.
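
The following sketch shows one way of expressing such roles in Elasticsearch 2.x. The box_type attribute and the logs-2016.01.04 index are example names of our own choosing, not built-in ones; the node tags are combined with shard allocation filtering to pin an index to the hot or cold group.

    # A sketch only – box_type and logs-2016.01.04 are example names of our choosing.
    # In config/elasticsearch.yml on a hot node (fast SSD disks):
    #   node.box_type: hot
    # In config/elasticsearch.yml on a cold node (big but slower disks):
    #   node.box_type: cold
    # A dedicated master node could additionally set: node.master: true, node.data: false
    # A query aggregator (client) node: node.master: false, node.data: false

    # Keep a fresh, heavily indexed index on the hot nodes:
    curl -XPUT 'localhost:9200/logs-2016.01.04/_settings' -d '{
      "index.routing.allocation.require.box_type": "hot"
    }'

    # When the index gets old, move it to the cold nodes:
    curl -XPUT 'localhost:9200/logs-2016.01.04/_settings' -d '{
      "index.routing.allocation.require.box_type": "cold"
    }'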
