Introduction

Graph analysis is far more common in everyday life than we might think. To take the most familiar example, when we ask a Global Positioning System (GPS) navigation device to find the shortest route to a destination, it uses a graph-processing algorithm.

Let's start by understanding graphs. A graph is a set of vertices in which some pairs of vertices are connected by edges. When each edge has a direction, pointing from one vertex to another, the graph is called a directed graph or digraph.

GraphX is the Spark API for graph processing. It provides an abstraction built on top of RDDs called the resilient distributed property graph. The property graph is a directed multigraph with properties attached to each vertex and edge.

There are two types of graphs: directed graphs (digraphs) and undirected graphs. Directed graphs have edges that run in one direction; for example, from vertex A to vertex B. The Twitter follower relationship is a good example of a digraph. If John is David's Twitter follower, it does not mean that David is John's follower. Facebook friendship, on the other hand, is a good example of an undirected graph. If John is David's Facebook friend, David is also John's Facebook friend.
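The difference between the two kinds of graph can be sketched in a few lines of plain Python. This is only an illustration of the idea, not GraphX code; the function names `is_follower` and `are_friends` are made up for this example. A directed edge is stored as an ordered pair, while an undirected edge is stored as an unordered pair.

```python
# Directed graph: an edge (src, dst) holds in one direction only.
# John follows David, so the edge runs from John to David.
follows = {("John", "David")}

# Undirected graph: each edge is a frozenset, so the order of the
# two endpoints does not matter.
friends = {frozenset(("John", "David"))}


def is_follower(a, b):
    """Return True if a follows b (directed lookup)."""
    return (a, b) in follows


def are_friends(a, b):
    """Return True if a and b are friends (undirected lookup)."""
    return frozenset((a, b)) in friends
```

Here `is_follower("John", "David")` is true while `is_follower("David", "John")` is false, whereas `are_friends` gives the same answer whichever way round the names are passed.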

A multigraph is a graph that is allowed to have multiple edges between the same pair of vertices (also called parallel edges). Since every edge in GraphX has properties, each edge has its own identity.
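To make the idea of a directed property multigraph concrete, here is a minimal sketch in plain Python. It is not the GraphX API; the `Edge` class, the `edges_between` helper, and the sample vertex IDs and properties are all assumptions chosen for illustration. The point it shows is that two edges can share the same source and destination and still be told apart by their properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed edge carrying its own property (hypothetical type)."""
    src: int
    dst: int
    attr: str


# Vertex properties keyed by vertex ID.
vertices = {1: "John", 2: "David"}

# Two parallel edges run from vertex 1 to vertex 2; they differ only
# in the property attached to them.
edges = [
    Edge(1, 2, "follows"),
    Edge(1, 2, "mentions"),
    Edge(2, 1, "mentions"),
]


def edges_between(src, dst):
    """Return every edge that runs from src to dst, in insertion order."""
    return [e for e in edges if e.src == src and e.dst == dst]
```

Calling `edges_between(1, 2)` returns both parallel edges, while `edges_between(2, 1)` returns only the single edge that runs in the opposite direction.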

Traditionally, for distributed graph processing, there have been two types of systems:

  • Data parallel
  • Graph parallel

GraphX aims to combine the two in a single system. The GraphX API enables users to view the same data both as graphs and as collections (RDDs), without data movement.
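The idea of looking at one dataset in two ways can be sketched in plain Python. This is a rough analogy only: ordinary lists stand in for distributed collections, and the sample data and the `followed_by` helper are assumptions made for the example. The same edge list is first aggregated like any other collection of records and then traversed as a graph, with no copy made in between.

```python
from collections import Counter

# One dataset: vertex records and edge records (hypothetical sample data).
vertices = [(1, "John"), (2, "David"), (3, "Maria")]
edges = [(1, 2, "follows"), (3, 2, "follows"), (2, 3, "follows")]

# Collection view: treat the edges as plain records and aggregate them.
# Here we count how many edges arrive at each vertex.
in_degree = Counter(dst for _, dst, _ in edges)

# Graph view: follow the structure, here the outgoing edges of one vertex.
names = dict(vertices)


def followed_by(vertex_id):
    """Return the names of the vertices that vertex_id points to."""
    return [names[dst] for src, dst, _ in edges if src == vertex_id]
```

With this data, `in_degree[2]` is 2 because two edges point at David, and `followed_by(1)` returns `["David"]`.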
