Chapter 4. Neo4j for High-volume Applications

The amount of data created each year is growing exponentially, a pattern that is likely to continue for quite some time. As data grows more complex, it becomes increasingly challenging to extract valuable insights from it. However, volume and complexity are not the only issues: there is also a rise in semi-structured and highly interconnected data. Several major tech firms such as Facebook, Google, and Twitter have adopted the graph approach to tackle complexity in the big data arena. Analyzing trends and patterns in collected raw data has gained popularity. From professional networking websites such as LinkedIn to the small specialized social media applications that crop up each day, many have a graph-processing layer at the core of their applications. The graph-oriented approach has led many industries to build scalable systems for managing information.

A multitude of next-generation databases that provide better performance and support for semi-structured data form the backbone of the big data revolution today. These technologies not only simplify the analysis, storage, and management of high volumes of data, they also scale up and scale out remarkably well. Graph databases are key players: they make it convenient to house a web of information in your application that can be traversed through labeled relationships. Graph problems exist all around us, from managing access rights and permissions in security systems to retracing where you put your keys. For graphs ranging from simple ones to complex social networks, a graph database can provide more natural storage and rapid querying.

In this chapter, we will look at the use of graphs and the Neo4j database in scenarios that handle large volumes of data, including:

  • Graph processing
  • Use of graphs in big data
  • Transaction management
  • The graphalgo package of Neo4j
  • Introducing Spring Data Neo4j

Graph processing

Graph processing is an exciting development for those in the graph database space, since it reinforces the utility of graphs as a computational model as well as a storage system. However, graph processing is easily confused with graph databases because they share a common data model, although each addresses fundamentally different scenarios. Some graph-processing platforms, such as Pregel, developed by Google, achieve high computational throughput by adopting the Bulk Synchronous Parallel (BSP) model from the domain of parallel computing. This model partitions the graph across multiple machines and uses the data local to each vertex for computation. Local information is exchanged during the synchronization step. This model is used as an alternative to traditional map-reduce operations for processing large interconnected datasets for business insights, although high latency is a concern in this case.
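The superstep-and-synchronize cycle described above can be sketched in a few lines of plain Python. This is only an illustration of the general BSP idea on a single machine, not Pregel's actual API: the function name `bsp_max_value`, the `num_workers` parameter, and the round-robin partitioning are all assumptions made for the example. Each vertex repeatedly adopts the largest value it has heard of and forwards it to its neighbors until no messages remain.

```python
from collections import defaultdict


def bsp_max_value(edges, values, num_workers=2):
    """Propagate the maximum vertex value through each connected component.

    edges  -- iterable of (u, v) pairs, treated as undirected
    values -- dict mapping every vertex to its initial numeric value
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Assign vertices to hypothetical workers (round-robin, deterministic).
    partitions = defaultdict(list)
    for index, vertex in enumerate(sorted(values)):
        partitions[index % num_workers].append(vertex)

    state = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = defaultdict(list)
    for vertex, value in state.items():
        for neighbor in neighbors[vertex]:
            inbox[neighbor].append(value)

    # Later supersteps: compute on local data, then exchange messages.
    while inbox:
        outbox = defaultdict(list)
        for vertices in partitions.values():
            for vertex in vertices:
                messages = inbox.get(vertex)
                if not messages:
                    continue  # no incoming messages: vertex stays inactive
                best = max(messages)
                if best > state[vertex]:
                    state[vertex] = best
                    for neighbor in neighbors[vertex]:
                        outbox[neighbor].append(best)
        # Synchronization barrier: messages become visible only now.
        inbox = outbox

    return state


if __name__ == "__main__":
    chain = [("a", "b"), ("b", "c"), ("c", "d")]
    print(bsp_max_value(chain, {"a": 3, "b": 6, "c": 2, "d": 1}))
```

In this sketch the computation stops when a superstep produces no messages, which mirrors the idea of all vertices voting to halt. The latency concern is visible in the loop structure: information travels only one hop per superstep, so a value may need many synchronization rounds to cross a large graph.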

For enterprise scenarios, a popular batch-processing platform for large volumes of data is Hadoop. Like Pregel, Hadoop is a high-throughput, high-latency system that optimizes computational throughput for extremely large datasets by processing them in parallel and outside the database. However, Hadoop is built for general-purpose computation; although you can use it to process graphs, the system and its components are not optimized for graph-oriented operations.
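To make the general-purpose nature of this style concrete, here is a minimal sketch of a map-reduce computation over a graph, written in plain Python rather than against Hadoop's real API. The function names (`map_edge`, `shuffle`, `reduce_degree`, `vertex_degrees`) are hypothetical and chosen for this example. It computes the degree of every vertex from an edge list.

```python
from collections import defaultdict
from itertools import chain


def map_edge(edge):
    """Map phase: emit one (vertex, 1) pair for each endpoint of an edge."""
    u, v = edge
    return [(u, 1), (v, 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_degree(vertex, counts):
    """Reduce phase: sum the counts collected for one vertex."""
    return vertex, sum(counts)


def vertex_degrees(edges):
    """Run the three phases in sequence over an edge list."""
    mapped = chain.from_iterable(map_edge(edge) for edge in edges)
    grouped = shuffle(mapped)
    return dict(reduce_degree(vertex, counts) for vertex, counts in grouped.items())


if __name__ == "__main__":
    print(vertex_degrees([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```

Note that each edge is handled as an independent record, with no notion of adjacency between records. A question that spans several hops would typically need one such pass per hop, which is one way to see why this model is not tuned for graph-oriented work.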

What the two platforms have in common is that they efficiently handle Online Analytical Processing (OLAP) for analytics, rather than simply dealing with transactions. This contrasts with the principles of Neo4j and other graph databases, which prioritize optimizing storage and queries for Online Transaction Processing (OLTP), similar to relational databases, but implement a more powerful, simple, and expressive underlying data model. This can be visualized in the following diagram:

[Diagram: Graph processing]

As depicted in the preceding diagram, Pregel is strictly an OLAP graph-processing tool. Hadoop is a completely general-purpose OLAP system, but it sits closer to the OLTP axis because several current extensions make near real-time processing possible with Hadoop. Relational databases are mostly OLTP systems that can be adapted for systems that require OLAP processing. Neo4j is designed solely for graph data and primarily targets OLTP scenarios, although it can also be used for OLAP because of its native graph model and high read capability.
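The difference between the two workload styles can be illustrated with a small in-memory adjacency map. This is a rough sketch in plain Python, not Neo4j's API, and the sample data and function names are invented for the example. The first function is OLTP-like, since it starts from one vertex and touches only its neighborhood. The second is OLAP-like, since it has to scan every vertex.

```python
# Hypothetical social graph: each person maps to the people they know.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave", "erin"],
    "dave": ["bob", "carol"],
    "erin": ["carol"],
}


def friends_of_friends(graph, start):
    """OLTP-style query: a short local traversal from a single vertex."""
    direct = set(graph.get(start, ()))
    found = set()
    for friend in direct:
        for candidate in graph.get(friend, ()):
            if candidate != start and candidate not in direct:
                found.add(candidate)
    return found


def average_degree(graph):
    """OLAP-style query: an aggregate that visits the whole graph."""
    if not graph:
        return 0.0
    return sum(len(adjacent) for adjacent in graph.values()) / len(graph)


if __name__ == "__main__":
    print(sorted(friends_of_friends(GRAPH, "alice")))
    print(average_degree(GRAPH))
```

The cost of the first query depends on the size of the neighborhood around the starting vertex, while the cost of the second grows with the size of the entire graph.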
