Big data and graphs

Graph data analysis is a prime technique for extracting information from very large datasets by assessing the similarity or associativity of data points. The need for such techniques arose when social networks started gaining popularity and rapidly expanded their user bases, but today, graph analysis has a much broader scope of application.

Since graph processing has caught up in the race to crunch data, big data platforms and communities have been adapting to the demands of graph problems with frameworks such as Apache Giraph (http://giraph.apache.org/) and MapReduce extensions such as Pregel (goo.gl/hW3L40), Surfer, and GBASE (http://goo.gl/3QkB46); as a result, addressing graph-processing problems is becoming simpler.
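To make the vertex-centric model behind frameworks such as Pregel and Giraph concrete, here is a minimal single-machine sketch of bulk-synchronous graph processing, using single-source shortest paths as the example. The function name `pregel_sssp` and the edge-list input format are illustrative assumptions, not part of any framework's API; a real Pregel implementation would distribute vertices across workers and exchange messages over the network.

```python
from collections import defaultdict

def pregel_sssp(edges, source):
    """Single-source shortest paths in the vertex-centric, bulk-synchronous
    style popularized by Pregel: in each superstep, vertices that received
    messages update their state and send candidate distances to neighbors;
    computation halts when no messages remain in flight."""
    graph = defaultdict(list)
    vertices = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        vertices.update((u, v))

    dist = {v: float("inf") for v in vertices}
    dist[source] = 0

    # Superstep 0: only the source is active; it messages its neighbors.
    messages = defaultdict(lambda: float("inf"))
    for v, w in graph[source]:
        messages[v] = min(messages[v], w)

    while messages:  # run supersteps until no vertex receives mail
        next_messages = defaultdict(lambda: float("inf"))
        for v, candidate in messages.items():
            if candidate < dist[v]:        # message improves the vertex value
                dist[v] = candidate        # "compute" phase: update state
                for nbr, w in graph[v]:    # "send" phase: message neighbors
                    next_messages[nbr] = min(next_messages[nbr],
                                             candidate + w)
        messages = next_messages
    return dist
```

Each `while` iteration corresponds to one superstep with a global barrier between them, which is exactly the batch-friendly structure that makes this model a good fit for Hadoop-style clusters, and also the source of the latency discussed next.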

Hadoop is a large-scale, distributed batch-processing framework that, unlike graph databases, operates at high latencies. If you implement graph processing on a Hadoop-based system, data locality leads to more efficient batch execution and therefore higher throughput; latency, however, remains the drawback. Graph processing through Hadoop batch jobs is therefore not feasible for OLTP applications, which require latencies on the order of milliseconds (as compared to seconds in Hadoop). Instead, it finds more applications operating on static data in the OLAP domain: you can use it for report generation from static data stored in warehouses, especially if the data is carefully laid out. To make such a system efficient, the data needs to be denormalized within the HBase data store, which widens the cognitive gap between the data as stored and the way it is represented for graph processing.

Neo4j, however, avoids these drawbacks. If you use Neo4j for graph processing, you do not need to denormalize the data or set up any specialized infrastructure. Neo4j works seamlessly for OLTP and uses the same database (most often, a read-only replica in sync with the master) for OLAP, should you need it. The main advantage here is low latency, even for large read queries and under heavy online loads.

Hadoop-based, batch-oriented graph processing is beneficial in scenarios where you read or process data externally to the database rather than manipulating it in place. For efficient processing, the data needs to be carefully placed in HBase, with no mutation occurring in the course of the processing. Neo4j, on the other hand, supports mutating the graph in place, which is an essential feature for running analytics on real-time web data.
