Scaling Neo4j applications

In the Neo4j world, a large dataset is one that is substantially larger than the main memory of the system. With such a dataset, Neo4j cannot cache the entire database in memory, and the blazingly fast traversals that a fully cached graph provides give way to disk operations. The early recommendation was to scale the system vertically, adding more RAM or switching to solid-state drives, whose seek times are far lower than those of spinning disks. While SSDs considerably improve performance, even the fastest SSD cannot replace RAM, which remains the limiting factor.

To service huge workloads and manage large datasets in Neo4j, partitioning a graph across multiple physical instances is a complex way to scale graph data. In versions up to 1.2, scaling was considered the weak point of this graph data store, but the introduction of Neo4j High Availability (HA) brought significant insight into handling large datasets and designing for scalability and availability.

One significant pattern built on Neo4j HA is cache sharding, which is used to maintain high performance with datasets that exceed the available main memory of a single system. Cache sharding is not the traditional sharding that most databases implement today, because the complete dataset is present on each instance of the database. Instead, the workload is partitioned among the database instances so as to increase the chance that a given request hits a warm cache; believe it or not, warm caches in Neo4j give extremely high performance.

Neo4j HA addresses several concerns, the following being the most prominent:

  • It implements a fault-tolerant architecture for the database, in which you can configure multiple Neo4j slave instances to be exact replicas of a single Neo4j master database. Hence, an end user application running on Neo4j remains fully operational in the face of hardware failure.
  • It provides a read-mostly, horizontally scaling architecture that lets the system handle far higher read traffic than a single Neo4j instance could, since every instance contains the complete graph dataset.

In other words, cache sharding refers to injecting some logic into the load balancer of a Neo4j high availability cluster so that queries are directed to a specific node in the cluster based on a rule or property (such as sessions or start values). Implemented properly, each node in the cluster ends up holding a distinct part of the total graph in its object cache, so the required traversals can be served from memory.

The architecture for this solution is represented in the following figure. Instead of typical graph sharding, we thus reduce the problem to one of consistent routing, a technique that has been in use with web farms for ages.

[Figure: Cache sharding architecture — a load balancer performing consistent routing across the Neo4j HA cluster instances]

The logic used to perform the routing varies with the domain of the data. At times, you can route on specific characteristics of the data, such as label names or indexes; at other times, sticky sessions are good enough. One simple technique is sticky routing: the database instance that serves a user's first request also serves all of that user's subsequent requests, with a high probability of processing them from a warm cache. You can also use domain knowledge of the requests to route; for example, with geographical or location-specific data, you route requests pertaining to a particular location to the Neo4j instances that hold that location's data in their warm cache. Either way, we raise the likelihood of nodes and relationships already being cached, making them faster to access and process. A minimal routing sketch follows.
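
To make this concrete, here is a minimal sticky-routing sketch in Java. The class and instance names are hypothetical and not part of the Neo4j distribution; the idea is simply that a stable hash of a user or session key deterministically maps every request from that user to the same cluster instance, maximizing warm-cache hits:

    import java.util.List;

    // Minimal sticky-routing sketch: the same user key always maps to the
    // same Neo4j instance, so repeat requests hit a warm cache.
    public class CacheShardingRouter {

        private final List<String> instanceUris; // for example, the HA members' endpoints

        public CacheShardingRouter(List<String> instanceUris) {
            this.instanceUris = instanceUris;
        }

        // Route by a stable, non-negative hash of the session/user key.
        public String instanceFor(String userKey) {
            int bucket = Math.floorMod(userKey.hashCode(), instanceUris.size());
            return instanceUris.get(bucket);
        }
    }

In a real deployment, this logic would live in the load balancer itself (for instance, as a hashing rule keyed on a session cookie), but the principle is identical: a deterministic key-to-instance mapping.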

Apart from reading from the databases, running multiple servers to harness their caching capabilities also requires keeping the data in sync between those servers. This is where Neo4j HA comes into play. Effectively, deploying Neo4j HA forms a multimaster cluster: a write to any instance is propagated by the HA protocol. When the elected master in the cluster is written to, the data is first persisted there, thanks to its ACID nature, and the modification is then eventually transferred to the slaves through the HA protocol in order to maintain consistency.

[Figure: A write to the elected master is persisted locally and then propagated to the slaves through the HA protocol]

If a cluster slave node processes a write operation, it updates the elected master within the same transaction, so the results are initially persisted on both. The other slaves are then updated from the master through the HA protocol.

[Figure: A write to a slave is applied to the master in the same transaction, then propagated to the remaining slaves]

With this pattern, a Neo4j HA cluster acts as a high-performance database for efficient read-write operations. Combined with a good routing strategy, it lets applications perform in-memory, blazingly fast traversals.
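
For reference, here is a minimal configuration sketch for a three-instance HA cluster. The property names are from the Neo4j 1.9/2.x HA era and vary between versions, and the hosts and ports are placeholders:

    # conf/neo4j.properties (unique per instance; shown for instance 1)
    ha.server_id=1
    ha.initial_hosts=10.0.0.1:5001,10.0.0.2:5001,10.0.0.3:5001

    # conf/neo4j-server.properties
    org.neo4j.server.database.mode=HA

Each instance receives a unique ha.server_id, and the cluster elects the master from among the instances listed in ha.initial_hosts.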

The following is a summary of the scaling techniques to apply, depending on the data at hand and the cost strategy:

Type 1:

Dataset size: Order of tens of GBs

Strategy: Scale a single machine vertically with more RAM

Reasoning: A typical rack server holds RAM on the order of 128 GB. Since Neo4j loves RAM for data caching, when O(dataset) ≈ O(memory), all of the data can be cached in memory, and operations on it run at extremely high speed.

Weaknesses: Clustering is needed for availability; disk performance limits write scalability.

Type 2:

Dataset size: Order of hundreds of GBs

Strategy: Cache sharding techniques

Reasoning: The data is too big to fit in the RAM of a single machine, but small enough to replicate onto the disks of many machines. Cache sharding increases the chances of hitting a warm cache and thus provides high performance. Cache misses are not critical, however, and can be mitigated by using SSDs in place of spinning disks.

Weaknesses: You need a router/load balancer in the Neo4j infrastructure for consistent routing.

Type 3:

Dataset size: TBs and more

Strategy: Sharding based on domain-specific knowledge

Reasoning: With datasets this large, which cannot be replicated in full across instances, sharding provides a way out. However, since there is no algorithm (yet) to arbitrarily shard graph data, we depend on knowledge of the domain to decide which node to allocate to which machine or instance (see the sketch after this summary).

Weaknesses: It is difficult to classify or relate all domains for sharding purposes.
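
As a sketch of the Type 3 approach (all names here are hypothetical), a geography-oriented application might pin each region's subgraph to a fixed instance, relying on the domain attribute rather than a generic hash:

    import java.util.Map;

    // Hypothetical domain-based shard selection: records are assigned to
    // instances by a domain attribute (geographic region) rather than by a
    // generic hash, keeping related subgraphs co-located on one machine.
    public class RegionShardMapper {

        private final Map<String, String> regionToInstance;

        public RegionShardMapper(Map<String, String> regionToInstance) {
            this.regionToInstance = regionToInstance;
        }

        public String shardFor(String region) {
            String instance = regionToInstance.get(region);
            if (instance == null) {
                throw new IllegalArgumentException("No shard for region: " + region);
            }
            return instance;
        }
    }

For example, new RegionShardMapper(Map.of("EU", "neo4j-eu:7474", "US", "neo4j-us:7474")) sends all European traffic, and hence all traversals over the European subgraph, to a single machine.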

In most scenarios that developers face, these heuristics are a good choice. It is simple to estimate the scale of your data and plan a suitable Neo4j deployment accordingly. The summary also offers a glimpse into the future of connected data: while most enterprises dealing with connected data are in the tens-to-hundreds-of-GB range today, at the current rate of growth there will soon be a need for greater data-crunching techniques.
