Sharding Lucene indexes

Just as you can balance your application load across multiple nodes in a cluster, you can also split up your Lucene indexes, through a process called sharding. You might consider sharding for performance reasons if your indexes grow very large, as write and optimize operations take longer against one large index file than against smaller shards.

Sharding may offer additional benefits if your entities lend themselves to partitioning (for example, by language, geography, and so on). Performance may improve if you can predictably steer queries toward the appropriate shard. Also, it sometimes makes lawyers happy when you can store "sensitive" data at a physically different location.

Even though its dataset is very small, this chapter's version of the VAPORware Marketplace application now splits its App index into two shards. The relevant line in hibernate.cfg.xml looks similar to the following:

...
<property
   name="hibernate.search.default.sharding_strategy.nbr_of_shards">
      2
</property>
...

As with all of the other Hibernate Search properties that include the substring default, this is a global setting. It can be made index-specific by replacing default with an index name (for example, hibernate.search.App.sharding_strategy.nbr_of_shards).

Note

This exact line appears in both hibernate.cfg.xml (used by our "master" node) and hibernate-slave.cfg.xml (used by our "slave" node). When running in a clustered environment, your sharding configuration should match across all of the nodes.

When an index is split into multiple shards, each shard is named with the normal index name followed by a number (starting from zero), for example, com.packtpub.hibernatesearch.domain.App.0 instead of just com.packtpub.hibernatesearch.domain.App. The following screenshot shows the Lucene directory structure of our two-node cluster, while it is up and running with both nodes configured for two shards:


An example of sharded Lucene indexes running in a cluster (note the numbering of each App entity directory)

Just as the shards are numbered on the filesystem, they can be separately configured by number in hibernate.cfg.xml. For example, if you want to store the shards at different locations, you might set properties as follows:

...
<property name="hibernate.search.App.0.indexBase">
   target/lucenceIndexMasterCopy/EnglishApps
</property>
<property name="hibernate.search.App.1.indexBase">
   target/lucenceIndexMasterCopy/FrenchApps
</property>
...

When a Lucene write operation is performed for an entity, or when a search query needs to read from an entity's index, a sharding strategy determines which shard or shards to use.

If you are sharding simply to reduce the file size, then the default strategy (implemented by org.hibernate.search.store.impl.IdHashShardingStrategy) is perfectly fine. It uses each entity's ID to calculate a unique hash code, and distributes the entities among the shards in a roughly even manner. Because the hashing calculation is reproducible, the strategy is able to direct future updates for an entity towards the appropriate shard.
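The following is a minimal sketch of that idea, purely for illustration (it is not the library's actual implementation): a reproducible hash of the entity's ID, taken modulo the number of shards, always maps the same entity to the same shard.

public class HashRoutingSketch {

   // Mask off the sign bit so the result is always a valid, non-negative
   // shard index, even for IDs whose hash code happens to be negative
   static int pickShard(String idInString, int numberOfShards) {
      return (idInString.hashCode() & 0x7fffffff) % numberOfShards;
   }

   public static void main(String[] args) {
      // With two shards, a given ID maps to the same shard on every run,
      // so later updates and deletes find the entity where it was written
      System.out.println(pickShard("1", 2));
      System.out.println(pickShard("2", 2));
   }
}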

To create your own custom sharding strategy with more exotic logic, you can subclass IdHashShardingStrategy and tweak it as needed. Alternatively, you can start from scratch with a new class implementing the org.hibernate.search.store.IndexShardingStrategy interface, perhaps referring to the source code of IdHashShardingStrategy for guidance.
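To make this concrete, here is a minimal sketch of a language-based strategy, in keeping with the English/French shard layout shown earlier. It assumes the Hibernate Search 4.x version of the IndexShardingStrategy interface, and it assumes a hypothetical language field being indexed on the App entity:

import java.io.Serializable;
import java.util.Properties;

import org.apache.lucene.document.Document;
import org.hibernate.search.filter.FullTextFilterImplementor;
import org.hibernate.search.indexes.spi.IndexManager;
import org.hibernate.search.store.IndexShardingStrategy;

public class LanguageShardingStrategy implements IndexShardingStrategy {

   private IndexManager[] indexManagers;

   public void initialize(Properties properties,
         IndexManager[] indexManagers) {
      // Hibernate Search passes in one IndexManager per configured shard
      this.indexManagers = indexManagers;
   }

   public IndexManager[] getIndexManagersForAllShards() {
      return indexManagers;
   }

   public IndexManager getIndexManagerForAddition(Class<?> entity,
         Serializable id, String idInString, Document document) {
      // Route writes by a hypothetical "language" field: English apps go
      // to shard 0, everything else to shard 1
      String language = document.get("language");
      return "en".equals(language) ? indexManagers[0] : indexManagers[1];
   }

   public IndexManager[] getIndexManagersForDeletion(Class<?> entity,
         Serializable id, String idInString) {
      // The language is not known at deletion time, so target every shard
      return indexManagers;
   }

   public IndexManager[] getIndexManagersForQuery(
         FullTextFilterImplementor[] fullTextFilters) {
      // Without a filter to narrow things down, search across all shards
      return indexManagers;
   }
}

You would then activate the custom strategy by pointing the hibernate.search.default.sharding_strategy property (or its App-specific equivalent) at this fully-qualified class name in hibernate.cfg.xml.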
