Chapter 2. Building and Exploring Graphs

This chapter teaches us how to represent various types of networks and complex systems as property graphs in Spark and GraphX. Before we can describe the behavior and analyze the inner structure of these systems, we first need to map their components to vertices (or nodes) and the interactions between the individual components to edges (or links). Building on what we learned in the previous chapter, we will delve into the details of how graphs are stored and represented in GraphX. In addition, this chapter introduces the language of graph theory and the basic characteristics of graphs. Throughout this chapter, we will use real-world datasets that we will map to different types of graphs. The examples include an e-mail communication network, a food flavor network, and social ego networks. On completing this chapter, you will understand how to:

  • Load data and build Spark graphs in many ways
  • Use the join operator to mix external data into existing graphs
  • Build bipartite graphs and multigraphs
  • Explore graphs and compute their basic statistics

Network datasets

In the previous chapter, we constructed a small social network as a toy example. From this chapter onwards, we are going to work with real-world datasets drawn from various applications. In fact, graphs can be used to represent any complex system, because a graph describes the interactions between the components of the system. Despite the diversity in form, size, nature, and granularity of different systems, graph theory provides a common language and a set of tools for representing and analyzing them.

Note

In brief, a graph consists of a set of vertices connected by a set of edges. Each edge represents the relationship between a pair of connected vertices. In this book, we will sometimes use the less technical terms nodes (or network nodes) to refer to vertices, and links to refer to edges. Note that Spark supports multigraphs, that is, multiple edges are permitted between any pair of nodes.
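To make the definition concrete, here is a minimal sketch in plain Python, independent of Spark and GraphX. The vertex IDs, the edge list, and the helper name `parallel_edge_counts` are hypothetical and chosen only for illustration. It stores a graph as a set of vertices plus a list of edges, and shows what makes it a multigraph: the same pair of nodes is linked more than once.

```python
from collections import Counter

# Hypothetical toy data: three vertices and a list of directed edges.
# The pair (1, 2) appears twice, which is what makes this a multigraph.
vertices = {1, 2, 3}
edges = [(1, 2), (1, 2), (2, 3)]

def parallel_edge_counts(edge_list):
    """Count how many edges connect each ordered (source, destination) pair."""
    return Counter(edge_list)

counts = parallel_edge_counts(edges)

# A graph is a multigraph if at least one pair is connected by several edges.
is_multigraph = any(n > 1 for n in counts.values())

print(counts[(1, 2)], is_multigraph)
```

Keeping the edges in a list rather than a set is the design choice that matters here: a set would silently collapse the two parallel edges into one.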

Let's get a preview of the networks that we are going to build in this chapter.

The communication network

The first type of communication network that we will encounter is an e-mail communication graph. A history of the e-mails exchanged within an organization can be mapped to a communication graph in order to understand the informal structure behind the organization. Such graphs can also be used to identify the influential people, or hubs, of the organization, who are not necessarily the highest-ranked ones. The e-mail communication network is a canonical example of a directed graph, as each e-mail links a source node to a destination node. We will use the Enron Corpus, which is a database of e-mails generated by 158 employees of the Enron Corporation. It is one of the few large collections of corporate e-mails that are open to the public on the web. The Enron Corpus is particularly interesting, as it captures the communication that occurred inside the company before the scandal that led to its bankruptcy. The original dataset was released by William Cohen at CMU and can be downloaded from https://www.cs.cmu.edu/~./enron/. A detailed description of the complete dataset is given by Klimt and Yang, 2004. A cleaner version of the dataset, which we use here, is provided by Leskovec et al., 2009, and can be obtained from https://snap.stanford.edu/data/email-Enron.html.
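The idea of a directed e-mail graph and its hubs can be sketched in a few lines of plain Python, without Spark. The sketch assumes an edge-list layout in which comment lines start with `#` and every other line holds a source ID and a destination ID separated by whitespace; check the downloaded file before relying on that assumption. The sample data and the function names `parse_edge_list` and `degrees` are hypothetical.

```python
from collections import Counter

# Hypothetical sample in an assumed edge-list layout: '#' lines are comments,
# every other line is "source destination" (sender and recipient IDs).
SAMPLE = """\
# FromNodeId ToNodeId
0 1
0 2
1 2
3 2
"""

def parse_edge_list(text):
    """Turn edge-list text into a list of (source, destination) integer pairs."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges

def degrees(edges):
    """Return (in-degree, out-degree) counters for a directed edge list."""
    in_deg = Counter(dst for _, dst in edges)
    out_deg = Counter(src for src, _ in edges)
    return in_deg, out_deg

edges = parse_edge_list(SAMPLE)
in_deg, out_deg = degrees(edges)

# The node that receives e-mail from the most links is one simple notion of a hub.
hub, hub_in_degree = in_deg.most_common(1)[0]
print(hub, hub_in_degree)
```

Because the graph is directed, in-degree and out-degree are counted separately: a person who receives many e-mails is not necessarily one who sends many.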

Flavor networks

Another example, which we borrow from the culinary world, is the ingredient-compound network introduced by Ahn et al., 2011. It is a bipartite graph in the sense that the nodes are divided into two disjoint sets: the ingredient nodes and the compound nodes. Each link connects an ingredient to a compound when the chemical compound is present in the food ingredient. From the ingredient-compound network, it is also possible to create what is called a flavor network. Instead of connecting food ingredients to compounds, the flavor network links two ingredients whenever they share at least one chemical compound.
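The step from the bipartite ingredient-compound network to the flavor network can be illustrated with a small plain-Python sketch. This is not the GraphX construction used later in the book; the ingredient names, compound labels, and the function name `flavor_network` are hypothetical toy values, not taken from the Ahn et al. dataset.

```python
from itertools import combinations

# Hypothetical bipartite edges: (ingredient, compound) means the compound
# is present in the ingredient.
ingredient_compound_edges = [
    ("tomato", "c1"), ("tomato", "c2"),
    ("basil", "c2"), ("basil", "c3"),
    ("garlic", "c4"),
]

def flavor_network(bipartite_edges):
    """Link two ingredients when they share a compound.

    Returns a dict mapping each (ingredient_a, ingredient_b) pair, in
    alphabetical order, to the number of compounds the pair shares.
    """
    # Group the ingredients that contain each compound.
    by_compound = {}
    for ingredient, compound in bipartite_edges:
        by_compound.setdefault(compound, set()).add(ingredient)

    # Every pair of ingredients inside one group shares that compound.
    shared = {}
    for ingredients in by_compound.values():
        for a, b in combinations(sorted(ingredients), 2):
            shared[(a, b)] = shared.get((a, b), 0) + 1
    return shared

links = flavor_network(ingredient_compound_edges)
print(links)
```

In this toy input, only tomato and basil share a compound (`c2`), so they form the single link of the flavor network, while garlic remains isolated.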

We will build the ingredient-compound network in this chapter, and in Chapter 4, Transforming and Shaping Up Graphs to Your Needs, we will construct the flavor network from the ingredient-compound network. Analyzing such graphs is fascinating because they help us understand more about food pairing and food culture. The flavor network can also help food scientists or amateur cooks create new recipes. The datasets that we will use consist of ingredient-compound data and the recipes collected from http://www.epicurious.com/, allrecipes.com, and http://www.menupan.com/. The datasets are available at http://yongyeol.com/2011/12/15/paper-flavor-network.html.

Social ego networks

The last dataset that we will explore in this chapter is a collection of social ego networks from Google+. The data was collected by McAuley and Leskovec (2012) from users who had manually shared their social circles using the share circle feature. The dataset includes the user profiles, their circles, and their ego networks, and can be downloaded from Stanford's SNAP project website at http://snap.stanford.edu/data/egonets-Gplus.html.

Note

These datasets are not provided with the Spark installation. They must first be downloaded from their source websites and copied into the $SPARK_HOME/data folder. When a dataset is available in different sizes, we use the smaller version to quickly demonstrate the concepts taught in this book.
