Federated search

Sometimes, having data in a single cluster is not enough. Imagine a situation where you have multiple locations where you need to index and search your data; for example, local company divisions that have their own clusters for their own data. The main center of your company would also like to search the data, not in each location separately but in all of them at once. Of course, your search application can connect to all these clusters and merge the results manually, but since Elasticsearch 1.0, it is also possible to use the so-called tribe node, which works as a federated Elasticsearch client and can provide access to more than a single Elasticsearch cluster. What the tribe node does is fetch the cluster states from all the connected clusters and merge them into one global cluster state available on the tribe node. In this section, we will take a look at tribe nodes and how to configure and use them.

Note

Remember that the described functionality was introduced in Elasticsearch 1.0 and is still marked as experimental. It can be changed or even removed in future versions of Elasticsearch.

The test clusters

For the purpose of showing you how tribe nodes work, we will create two clusters that hold data. The first cluster is named mastering_one (as you remember, to set the cluster name, you need to specify the cluster.name property in the elasticsearch.yml file) and the second cluster is named mastering_two. To keep things as simple as possible, each of the clusters contains only a single Elasticsearch node. The node in the cluster named mastering_one is available at the 192.168.56.10 IP address and the node in the cluster named mastering_two is available at the 192.168.56.40 IP address.
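As an illustration, the elasticsearch.yml file of the node in the first cluster could contain the following entries, with the second cluster configured analogously (a minimal sketch; the node.name and network.host values are assumptions consistent with the node names shown later in the tribe node logs and the IP addresses used in this section):

cluster.name: mastering_one
node.name: mastering_one_node_1
network.host: 192.168.56.10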

The following documents were indexed into the first cluster:

curl -XPOST '192.168.56.10:9200/index_one/doc/1' -d '{"name" : "Test document 1 cluster 1"}'
curl -XPOST '192.168.56.10:9200/index_one/doc/2' -d '{"name" : "Test document 2 cluster 1"}'

For the second cluster, the following data was indexed:

curl -XPOST '192.168.56.40:9200/index_two/doc/1' -d '{"name" : "Test document 1 cluster 2"}'
curl -XPOST '192.168.56.40:9200/index_two/doc/2' -d '{"name" : "Test document 2 cluster 2"}'

Creating the tribe node

Now, let's try to create a simple tribe node that will use multicast discovery by default. To do this, we need a new Elasticsearch node. We also need to provide a configuration for this node that specifies which clusters our tribe node should connect to; in our case, these are the two clusters we created earlier. To configure our tribe node, we need the following configuration in the elasticsearch.yml file:

tribe.mastering_one.cluster.name: mastering_one
tribe.mastering_two.cluster.name: mastering_two

All the configuration properties for the tribe node are prefixed with the tribe prefix. In the preceding configuration, we told Elasticsearch that we will have two tribes: one named mastering_one and the second named mastering_two. These are arbitrary names that are used to distinguish the clusters that are a part of the tribe cluster.
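With these two lines placed in the tribe node's elasticsearch.yml file, the node is started just like any regular Elasticsearch 1.x node. If you keep a dedicated configuration file for the tribe node, you can point to it with the es.config system property (a sketch; the path below is only an example):

bin/elasticsearch -Des.config=/path/to/tribe/elasticsearch.yml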

Our tribe node will run on a server with the 192.168.56.50 IP address. After starting it, Elasticsearch will try to use the default multicast discovery to find the mastering_one and mastering_two clusters and connect to them. You should see the following in the logs of the tribe node:

[2014-10-30 17:28:04,377][INFO ][cluster.service          ] [Feron] added {[mastering_one_node_1][mGF6HHoORQGYkVTzuPd4Jw][ragnar][inet[/192.168.56.10:9300]]{tribe.name=mastering_one},}, reason: cluster event from mastering_one, zen-disco-receive(from master [[mastering_one_node_1][mGF6HHoORQGYkVTzuPd4Jw][ragnar][inet[/192.168.56.10:9300]]])
[2014-10-30 17:28:08,288][INFO ][cluster.service          ] [Feron] added {[mastering_two_node_1][ZqvDAsY1RmylH46hqCTEnw][ragnar][inet[/192.168.56.40:9300]]{tribe.name=mastering_two},}, reason: cluster event from mastering_two, zen-disco-receive(from master [[mastering_two_node_1][ZqvDAsY1RmylH46hqCTEnw][ragnar][inet[/192.168.56.40:9300]]])

As we can see, our tribe node joins two clusters together.
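If you want to verify which nodes are visible from the tribe node, one simple way is to ask the _cat API on it (a sketch; the exact column output depends on the Elasticsearch version). The list should contain the tribe node itself along with the nodes of both connected clusters:

curl -XGET '192.168.56.50:9200/_cat/nodes?v'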

Using the unicast discovery for tribes

Of course, multicast discovery is not the only way to connect multiple clusters together using the tribe node; we can also use unicast discovery if needed. For example, to change our tribe node configuration to use unicast, we would change the elasticsearch.yml file to look as follows:

tribe.mastering_one.cluster.name: mastering_one
tribe.mastering_one.discovery.zen.ping.multicast.enabled: false
tribe.mastering_one.discovery.zen.ping.unicast.hosts: ["192.168.56.10:9300"]
tribe.mastering_two.cluster.name: mastering_two
tribe.mastering_two.discovery.zen.ping.multicast.enabled: false
tribe.mastering_two.discovery.zen.ping.unicast.hosts: ["192.168.56.40:9300"]

As you can see, for each tribe cluster, we disabled multicast discovery and specified the unicast hosts. Also note what we already mentioned: each property for the tribe node is prefixed with the tribe prefix.

Reading data with the tribe node

We said in the beginning that the tribe node fetches the cluster state from all the connected clusters and merges it into a single cluster state. This is done in order to enable read and write operations on all the clusters when using the tribe node. Because the cluster state is merged, almost all operations work in the same way as they would on a single cluster, for example, searching.

Let's try to run a single query against our tribe node now to see what we can expect. To do this, we use the following command:

curl -XGET '192.168.56.50:9200/_search?pretty'

The results of the preceding query look as follows:

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index_two",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name" : "Test document 1 cluster 2"}
    }, {
      "_index" : "index_one",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name" : "Test document 2 cluster 1"}
    }, {
      "_index" : "index_two",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name" : "Test document 2 cluster 2"}
    }, {
      "_index" : "index_one",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name" : "Test document 1 cluster 1"}
    } ]
  }
}

As you can see, we have documents coming from both clusters; that's right, our tribe node was able to automatically get data from all the connected tribes and return the relevant results. We can, of course, do the same with more sophisticated queries; we can use the percolation functionality, suggesters, and so on.
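For example, a simple URI search limited to our two indices (a sketch; any query type supported by the connected clusters should behave the same way through the tribe node) would again return matches from both clusters:

curl -XGET '192.168.56.50:9200/index_one,index_two/_search?q=name:document&pretty'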

Master-level read operations

Read operations that require the master to be present, such as reading the cluster state or cluster health, will be performed on the tribe cluster. For example, let's look at what cluster health returns for our tribe node. We can check this by running the following command:

curl -XGET '192.168.56.50:9200/_cluster/health?pretty'

The results of the preceding command will be similar to the following one:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10
}

As you can see, our tribe node reported five nodes to be present: a single data node for each of the connected clusters, the tribe node itself, and two internal nodes that the tribe node uses to provide connectivity to the connected clusters. This is why there are five nodes and not three.
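Other master-level read operations behave in the same way; for example, reading the merged cluster state through the tribe node (a sketch) should return the metadata of the indices from both connected clusters:

curl -XGET '192.168.56.50:9200/_cluster/state?pretty'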

Writing data with the tribe node

We talked about querying and master-level read operations, so it is time to write some data to Elasticsearch using the tribe node. We won't say much; instead of talking about indexing, let's just try to index an additional document into one of the indices present on the connected clusters. We can do this by running the following command:

curl -XPOST '192.168.56.50:9200/index_one/doc/3' -d '{"name" : "Test document 3 cluster 1"}'

The execution of the preceding command will result in the following response:

{"_index":"index_one","_type":"doc","_id":"3","_version":1,"created":true}

As we can see, the document has been created and, what's more, it was indexed in the proper cluster. The tribe node just did its work by forwarding the request internally to the proper cluster. All the write operations that don't require the cluster state to change, such as indexing, will be properly executed using the tribe node.
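If we want to confirm where the document ended up, we can fetch it back either through the tribe node or directly from the first cluster (a sketch using the standard get API):

curl -XGET '192.168.56.50:9200/index_one/doc/3?pretty'
curl -XGET '192.168.56.10:9200/index_one/doc/3?pretty'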

Master-level write operations

Master-level write operations can't be executed on the tribe node; for example, we won't be able to create a new index using it. Such operations fail when executed on the tribe node, because there is no global master present. We can test this easily by running the following command:

curl -XPOST '192.168.56.50:9200/index_three'

The preceding command will return the following error after about 30 seconds of waiting:

{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}

As we can see, the index was not created. We should run the master-level write commands on the clusters that are a part of the tribe.
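For example, to create the index_three index, we would send the request directly to one of the connected clusters (a sketch); once created there, the index becomes visible through the tribe node's merged cluster state:

curl -XPOST '192.168.56.10:9200/index_three'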

Handling indices conflicts

One of the things that the tribe node can't handle properly is indices with the same name present in multiple connected clusters. By default, the Elasticsearch tribe node will choose one and only one index with a given name. So, if all your clusters have an index with the same name, only a single one of them will be chosen.

Let's test this by creating the index called test_conflicts on the mastering_one cluster and the same index on the mastering_two cluster. We can do this by running the following commands:

curl -XPOST '192.168.56.10:9200/test_conflicts'
curl -XPOST '192.168.56.40:9200/test_conflicts'

In addition to this, let's index two documents—one to each cluster. We do this by running the following commands:

curl -XPOST '192.168.56.10:9200/test_conflicts/doc/11' -d '{"name" : "Test conflict cluster 1"}'
curl -XPOST '192.168.56.40:9200/test_conflicts/doc/21' -d '{"name" : "Test conflict cluster 2"}'

Now, let's go back to our tribe node and run a simple search command:

curl -XGET '192.168.56.50:9200/test_conflicts/_search?pretty'

The output of the command will be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test_conflicts",
      "_type" : "doc",
      "_id" : "11",
      "_score" : 1.0,
      "_source":{"name" : "Test conflict cluster 1"}
    } ]
  }
}

As you can see, we only got a single document in the result. This is because the Elasticsearch tribe node can't handle indices with the same name coming from different clusters and will choose only one of them. This is quite dangerous, because we can't be sure which cluster's index will be chosen.

The good thing is that we can control this behavior by specifying the tribe.on_conflict property in elasticsearch.yml (introduced in Elasticsearch 1.2.0). We can set it to one of the following values:

  • any: This is the default value that results in Elasticsearch choosing one of the indices from the connected tribe clusters.
  • drop: Elasticsearch will ignore the index and won't include it in the global cluster state. This means that the index won't be visible when using the tribe node (both for write and read operations) but will still be present on the connected clusters themselves.
  • prefer_TRIBE_NAME: Elasticsearch allows us to choose the tribe cluster from which the indices should be taken. For example, if we set this property to prefer_mastering_one, it would mean that Elasticsearch will load the conflicting indices from the cluster named mastering_one, as shown in the sketch after this list.
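For example, to always take conflicting indices from our first cluster, we could add the following line to the tribe node's elasticsearch.yml file (a minimal sketch of the prefer_TRIBE_NAME variant):

tribe.on_conflict: prefer_mastering_one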

Blocking write operations

The tribe node can also be configured to block all write operations and all the metadata change requests. To block all the write operations, we need to set the tribe.blocks.write property to true. To disallow metadata change requests, we need to set the tribe.blocks.metadata property to true. By default, these properties are set to false, which means that write and metadata altering operations are allowed. Disallowing these operations can be useful when our tribe node should only be used for searching and nothing else.
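For example, a tribe node used purely for searching could have both properties set in its elasticsearch.yml file, as in the following sketch:

tribe.blocks.write: true
tribe.blocks.metadata: true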

In addition to this, Elasticsearch 1.2.0 introduced the ability to block write operations on defined indices. We do this by using the tribe.blocks.indices.write property and setting its value to the names of the indices we want to block. For example, if we want our tribe node to block write operations on all the indices starting with test and production, we set the following property in the elasticsearch.yml file of the tribe node:

tribe.blocks.indices.write: test*, production*