Just as you can balance your application load across multiple nodes in a cluster, you may also split up your Lucene indexes through a process called sharding. You might consider sharding for performance reasons if your indexes grow to a very large size, as larger index files take longer to index and optimize than smaller shards.
Sharding may offer additional benefits if your entities lend themselves to partitioning (for example, by language, geography, and so on). Performance may be improved if you can predictably steer queries toward the specific appropriate shard. Also, it sometimes makes lawyers happy when you can store "sensitive" data at a physically different location.
Even though its dataset is very small, this chapter's version of the VAPORware Marketplace application now splits its App index into two shards. The relevant line in hibernate.cfg.xml looks similar to the following:
...
<property name="hibernate.search.default.sharding_strategy.nbr_of_shards">
   2
</property>
...
As with all of the other Hibernate Search properties that include the substring default, this is a global setting. It can be made index-specific by replacing default with an index name (for example, App).
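For instance, an index-specific version of the same setting might look like this (a sketch assuming the index name App used throughout this chapter):

```xml
...
<property name="hibernate.search.App.sharding_strategy.nbr_of_shards">
   2
</property>
...
```

With this form, only the App index is split into two shards; any other indexes keep their default (unsharded) configuration.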
When an index is split into multiple shards, each shard includes the normal index name followed by a number (starting with zero), for example, com.packtpub.hibernatesearch.domain.App.0 instead of just com.packtpub.hibernatesearch.domain.App. This screenshot shows the Lucene directory structure of our two-node cluster, while it is up and running with both nodes configured for two shards:
Just as the shards are numbered on the filesystem, they can be separately configured by number in hibernate.cfg.xml. For example, if you want to store the shards at different locations, you might set properties as follows:
...
<property name="hibernate.search.App.0.indexBase">
   target/luceneIndexMasterCopy/EnglishApps
</property>
<property name="hibernate.search.App.1.indexBase">
   target/luceneIndexMasterCopy/FrenchApps
</property>
...
When a Lucene write operation is performed for an entity, or when a search query needs to read from an entity's index, a sharding strategy determines which shard to use.
If you are sharding simply to reduce the file size, then the default strategy (implemented by org.hibernate.search.store.impl.IdHashShardingStrategy) is perfectly fine. It uses each entity's ID to calculate a hash code, and distributes the entities among the shards in a roughly even manner. Because the hashing calculation is reproducible, the strategy is able to direct future updates for an entity towards the appropriate shard.
To create your own custom sharding strategy with more exotic logic, you can create a new subclass inheriting from IdHashShardingStrategy, and tweak it as needed. Alternatively, you can start completely from scratch with a new class implementing the org.hibernate.search.store.IndexShardingStrategy interface, perhaps referring to the source code of IdHashShardingStrategy for guidance.