List of Figures

Chapter 1. Two important technologies: Spark and graphs

Figure 1.1. Big Data is data that is too big to fit on a single machine. Hadoop and Spark are technologies that distribute Big Data across a cluster of nodes. Spark is faster than Hadoop alone because it distributes data across the RAM in the cluster instead of the disks.

Figure 1.2. Three data blocks distributed with replication factor 2 across a Hadoop Distributed File System (HDFS)

Figure 1.3. MapReduce is the processing paradigm used by both Hadoop and Spark. Shown is a MapReduce operation to count the number of times “error” appears in a server log. The Map is (normally) a one-to-one operation that produces one transformed data item for each source data item. The Reduce is a many-to-one operation that summarizes the Map outputs.
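
In code, the figure’s operation might look like the following minimal Spark sketch (the log file path is hypothetical; sc is the SparkContext the Spark shell provides):

    val lines = sc.textFile("hdfs:///logs/server.log")                   // hypothetical path
    val flags = lines.map(line => if (line.contains("error")) 1 else 0)  // Map: one output per input line
    val errorCount = flags.reduce(_ + _)                                 // Reduce: many-to-one sum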

Figure 1.4. Spark provides RDDs that can be viewed as distributed in-memory arrays.
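
For instance, a minimal sketch of creating and using such an RDD in the Spark shell:

    val rdd = sc.parallelize(Array(10, 20, 30, 40))  // elements spread across the cluster's RAM
    rdd.map(_ * 2).collect                           // Array(20, 40, 60, 80)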

Figure 1.5. If Charles shares his status with friends of friends, determining who could see his status would be cumbersome if you only had tables or arrays to work with.

Figure 1.6. The links between web pages can be represented as a graph. The structure of the graph provides information about the relative authority, or ranking, of each page.

Figure 1.7. Different types of data that can be represented by graphs

Figure 1.8. The Spark stack. Some components, including GraphX, come with Spark. Others, such as HDFS and YARN, are part of Apache Hadoop. In the last category, Tachyon comes from AMPLab at the University of California, Berkeley. Most of MLlib stands alone on top of Spark Core, but a couple of its algorithms make use of GraphX under the covers.

Figure 1.9. Both vertices and edges in a property graph contain additional attributes.

Figure 1.10. A graph with a high-degree vertex

Figure 1.11. GraphX facilitates data access for either graph-parallel or data-parallel operations.

Figure 1.12. Various possible GraphX data flows. Because GraphX’s capabilities for reading graph data files are so limited, data files usually have to be massaged and transformed using the Spark Core API into the graph format that GraphX uses. The output of a GraphX algorithm can be another graph, a number, some subgraphs, or a machine learning model.

Figure 1.13. Data flows that involve using MLlib, the Spark machine learning component. A couple of algorithms use GraphX behind the scenes, but GraphX can also be used alongside any MLlib algorithm.

Figure 1.14. The conventional and by far most common way for GraphX to store its data is to HDFS or some other distributed storage system (a). Some users, however, harness the power of a full-fledged graph database and realize the best of both worlds: transactions in a graph database and fast processing in GraphX (b).

Chapter 2. GraphX quick start

Figure 2.1. Creating a graph object from a file in edge-list format
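
A minimal sketch of the operation in figure 2.1, assuming a hypothetical file of whitespace-separated "sourceId destinationId" pairs:

    import org.apache.spark.graphx.GraphLoader
    val graph = GraphLoader.edgeListFile(sc, "edges.txt")  // hypothetical path
    graph.vertices.count                                   // number of distinct vertex IDs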

Figure 2.2. Defining an anonymous function
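
For example, an anonymous function of the kind figure 2.2 illustrates:

    val double = (x: Int) => x * 2                      // anonymous function bound to a name
    sc.parallelize(Array(1, 2, 3)).map(double).collect  // Array(2, 4, 6)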

Figure 2.3. Running pageRank() creates a new graph where the vertex attributes are the PageRank values.
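
A sketch of the call (0.001 is an illustrative convergence tolerance, not a value the figure mandates):

    val ranks = graph.pageRank(0.001)  // a new Graph whose vertex attributes are PageRank values
    ranks.vertices.collect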

Chapter 3. Some fundamentals

Figure 3.1. Hadoop configured with replication factor 3 and Spark configured with replication factor 2.

Figure 3.2. From one iteration of an algorithm to the next, Spark avoids the six steps of serialize, compress, write to disk, read from disk, decompress, and deserialize.

Figure 3.3. A simple RDD constructed from an array and transformed by map and reduce operations

Figure 3.4. Spark requires two major pieces to be present: a distributed file system and a cluster manager. There are options for each.

Figure 3.5. Terminology for a standalone cluster. Terminology for YARN and Mesos clusters varies slightly.

Figure 3.6. Spark has to do a shuffle between a map and a reduce, and as of Spark 1.6, this shuffle is always written to and read from disk.

Figure 3.7. A simple RDD[String] converted to PairRDD[(String, String)]

Figure 3.8. zip() combines two RDDs so that both sets of data can be available in a subsequent (not shown) single map() operation.
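
A minimal zip() sketch (both RDDs must have the same number of elements and partitions; the values here are illustrative):

    val a = sc.parallelize(Array(1, 2, 3))
    val b = sc.parallelize(Array("x", "y", "z"))
    a.zip(b).map { case (n, s) => s * n }.collect  // both datasets in one map(): Array(x, yy, zzz)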

Figure 3.9. All graphs in GraphX are inherently directed graphs, but some of GraphX’s built-in algorithms ignore edge direction, effectively treating the graph as undirected. You can do the same if you need undirected graphs.

Figure 3.10. A cyclic graph is one that has a cycle. In a cyclic graph, your algorithm could end up following edges forever if you’re not careful with your terminating condition.

Figure 3.11. A completely unlabeled graph is usually not useful. Normally at least the vertices are labeled. GraphX’s basic GraphLoader.edgeListFile() supports labeled vertices but only unlabeled edges.

Figure 3.12. Simple graphs are undirected with no parallel edges or loops. Multigraphs have parallel edges, and pseudographs also have loops.

Figure 3.13. Bipartite graphs frequently arise in social network analysis, either in group membership as shown here, or for separating groups of individuals, such as males and females on a heterosexual dating website. In a bipartite graph, all edges go from one set to another. Non-bipartite graphs cannot be so divided, and any attempt to divide them into two sets will end up with at least one edge fully contained in one of the two sets.

Figure 3.14. Without properties, RDF graphs get unwieldy, in particular when it comes to edge properties. GraphX supports property graphs, which can contain vertex properties and edge properties without adding a bunch of extra vertices to the base graph.

Figure 3.15. A graph and its equivalent adjacency matrix. Notice that an adjacency matrix doesn’t have a place to put edge properties.

Chapter 4. GraphX basics

Figure 4.1. A GraphX Graph object is composed of two RDDs: one for the vertices and one for the edges.

Figure 4.2. UML diagram of GraphX’s Graph and its dependencies. Note that GraphX defines VertexId to be a type synonym of a 64-bit Long.
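
The synonym the caption mentions is declared in the org.apache.spark.graphx package object:

    type VertexId = Long  // 64-bit vertex identifier used throughout GraphX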

Figure 4.3. Example graph to be constructed and used throughout this chapter

Figure 4.4. The triplets() method is a convenience function that allows easy access to both edge and vertex attributes.
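
A sketch of triplets() in use; each EdgeTriplet exposes the source vertex attribute, the edge attribute, and the destination vertex attribute together:

    graph.triplets.collect.foreach { t =>
      println(s"${t.srcAttr} --[${t.attr}]--> ${t.dstAttr}")
    }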

Figure 4.5. In invoking mapTriplets(), we opted to change the edge type compared to figure 4.3. Whereas the edge type was String in figure 4.3, the resulting graph’s edge type is the Tuple2 (String, Boolean).

Figure 4.6. The generated .gexf file loaded into Gephi

Figure 4.7. A generated gridGraph() visualized in Gephi. There is no randomness to a grid graph; the grid is always complete.

Figure 4.8. A generated starGraph() visualized in Gephi. Like gridGraph(), it is not random.

Figure 4.9. A generated logNormalGraph() visualized in Gephi. The only constraint in logNormalGraph() is the out-degree of each vertex, and otherwise there are no restrictions on where edges are placed. Some edges have the same source and destination vertices. And some pairs of vertices have multiple parallel edges, represented in the Gephi visualization (which does not render parallel edges directly) by darker edges and larger arrowheads.

Figure 4.10. rmatGraph’s recursive quadrant subdivision leaves some relative loner vertices and also produces some groups of vertices with a high number of interconnections.

Figure 4.11. Recursive quadrants from the 2004 paper “R-MAT: A Recursive Model for Graph Mining” by Chakrabarti et al. GraphX uses the hard-coded probabilities a=0.45, b=0.15, c=0.15, d=0.25.

Figure 4.12. BSP allows processing in parallel within a superstep. Messages are then generated and delivered for the next superstep. The next superstep does not begin until the current superstep is completed and all messages are delivered to the next superstep. This is illustrated by the synchronization barriers.

Figure 4.13. Messages for a vertex from the previous superstep are processed by the mergeMsg and vprog functions to update the vertex data. The vertex then sends messages to be delivered in the next superstep.

Figure 4.14. Flow of GraphX Pregel API
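
A sketch of the Pregel API shape the figure describes: a shortest-hops computation, assuming a hypothetical Graph[Double, _] whose source vertex attribute is 0.0 and all other vertex attributes are Double.PositiveInfinity:

    val hops = graph.pregel(Double.PositiveInfinity)(
      (id, dist, msg) => math.min(dist, msg),  // vprog: merge the incoming message into the vertex
      t => if (t.srcAttr + 1.0 < t.dstAttr)    // sendMsg: offer a shorter hop count to the destination
             Iterator((t.dstId, t.srcAttr + 1.0))
           else Iterator.empty,
      (a, b) => math.min(a, b)                 // mergeMsg: combine messages bound for one vertex
    )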

Chapter 5. Built-in algorithms

Figure 5.1. PageRank iteration. This particular iteration is also a steady and final state of the algorithm because after the redistribution of PageRank among the vertices, each vertex ends up with the same PageRank value it started with.

Figure 5.2. The companion object Graph contains an implicit (automatic) conversion from Graph to its corresponding GraphOps, so all of the operations available in GraphOps can be invoked as if those operations were declared in the Graph class. GraphOps contains a pageRank() method, but it calls run() from the PageRank singleton object, passing in the graph as the first parameter.
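
A sketch of the two equivalent routes the caption describes (the parameter values are illustrative):

    import org.apache.spark.graphx.lib.PageRank
    val viaGraphOps  = graph.pageRank(0.001)    // via the implicit Graph-to-GraphOps conversion
    val viaSingleton = PageRank.run(graph, 20)  // calling the PageRank singleton object directly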

Figure 5.3. The real-world graph of web pages is said to resemble a bow-tie. With a damping factor of 1.0 (resetProb of 0.0), all the PageRanks would get trapped in the “OUT” vertices, and all the “IN” vertices would be left with a PageRank of 0.

Figure 5.4. When counting triangles, GraphX doesn’t care about edge direction. These are both triangles, even though the triangle on the left forms a cycle and the triangle on the right has a dead end.

Figure 5.5. ShortestPaths finds the number of hops (shown in the squares) from every vertex to a particular specified vertex (in this case, vertex #3). It does not take into account any edge weights (which might, for example, represent distances on a map), and it only returns the number of hops, not any routes on how to achieve that shortest distance.

Figure 5.6. This one graph of 7 vertices and 5 edges has three Connected Components. If this graph data were from a social network, each Connected Component would be considered a clique in the everyday social sense of the word.

Figure 5.7. The egonets folder contains 111 egonet files, one for each user.

Figure 5.8. The egonet for user 3077 forms three connected components: two single-vertex components (vertices 3084 and 3087) and one large component (all the other vertices).

Figure 5.9. A folder path is passed to wholeTextFiles to generate a PairRDD with the file path as the key for each element and the file contents as the value. This is both a performant and a convenient way to read a directory of text files, resulting in all the data being in a single RDD.
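
A minimal sketch of the call (the folder path is hypothetical):

    val egonets = sc.wholeTextFiles("egonets")  // RDD[(String, String)]: (file path, file contents)
    egonets.mapValues(_.length).collect         // for example, content length per file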

Figure 5.10. In Strongly Connected Components, every vertex is reachable from every other vertex in the component. Within a Strongly Connected Component, no vertex can act as a dead end.

Figure 5.11. The LPA algorithm often doesn’t converge. Step 5 is the same as step 3, meaning that steps 3 and 4 keep repeating forever.

Chapter 6. Other useful graph algorithms

Figure 6.1. Example graph data and distances from vertex A after having been run through Dijkstra’s algorithm. Given a graph with edge weights on the left, Dijkstra’s algorithm annotates each vertex with a “shortest distance from vertex A.” Graph data credit: the graph data comes from the Wikipedia article on Kruskal’s algorithm (which, incidentally, is implemented in the last section of this chapter); its contributor released it into the public domain.

Figure 6.2. The greedy approach to the Travelling Salesman problem is the simplest, but it doesn’t always hit all the vertices. In this example, it neglected to hit vertex G.

Figure 6.3. A Minimum Spanning Tree is a tree (a graph with no cycles) that covers every vertex of an undirected graph and has minimum total weight (the sum of the edge weights).

Figure 6.4. Iteration steps of Kruskal’s algorithm to find a Minimum Spanning Tree. In each iteration, the whole graph is searched for the unused edge of lowest weight. But there’s a catch: that edge can’t form a cycle (as a tree is what is being sought).

Figure 6.5. The data type after the leftOuterJoin() of listing 6.6.

Figure 6.6. Words are extracted from the training corpus and processed by the Word2Vec algorithm to produce an n-dimensional vector for each word—here, n is 300. Semantically similar words, such as Michelin and Firestone, are close together.

Figure 6.7. A section of the animal taxonomy Minimum Spanning Tree showing connections between birds.

Chapter 7. Machine learning

Figure 7.1. Unsupervised learning clusters similar data together, but doesn’t know how to attach any labels.

Figure 7.2. Recommending movies. What is the estimate for how Pat will rate Pride and Prejudice? (Edge labels represent ratings of one to five stars, and vertex numbers are vertex IDs we’ll use later instead of the text names.)

Figure 7.3. Sparse matrix representation of the graph represented in figure 7.2. The matrix is called sparse because not every matrix position has a number. In our tiny example, there are only four positions missing numbers (including the one with the question mark), but in a typical large example of, say, a million users and a hundred thousand items, almost all the positions would be empty. Recommender systems, including SVDPlusPlus, often internally use the matrix representation of the graph.

Figure 7.4. Although in our example we don’t get genre information with our data, latent variables can automatically infer genre or something close to it. The algorithm doesn’t know the actual label “Romance,” but infers that The Princess Bride and Pride and Prejudice are similar in some unspecified way. We have suggested human-appropriate labels in this illustration, but there are no such labels, human-applied or otherwise, for latent variables. It is somewhat of a mystery how algorithms pick latent variables; it could be something other than what we would call genre—it could be something like “a Harrison Ford movie,” a quirky sub-genre like “steampunk comedy,” or an age-related issue like “rough language.”

Figure 7.5. Latent Dirichlet Allocation. The topics are the latent variables and are determined automatically by the algorithm. The names of those topics shown in the thin strips are human-inferred and human-applied; the algorithm has no inherent capability to name the topics. Each document expresses each latent variable (topic) to a varying degree.

Figure 7.6. Bag-of-words representation of a document.

Figure 7.7. Not every word in the corpus vocabulary is used in every document. A graph representation, which stores only the word-document pairs that actually occur, is computationally efficient.

Figure 7.8. Web spam training data. Each vertex represents a web page, with the number of times the spam words free and earn occur, and a human-based determination of whether the page is spam. By augmenting the spam word data with PageRank data, link farms can be detected to assist in spam determination.

Figure 7.9. Test dataset

Figure 7.10. The original image (shown in black and white in this book) is on the left, and the segmented image is on the right. The low resolution allows the example to run on a single 4 GB machine. The number of clusters has to be set in advance; here, the default of two was used.

Figure 7.11. The pixels are in a three-dimensional vector space—each vector has three numeric components (red intensity, green intensity, and blue intensity)—and the clustering algorithm finds the two clusters.

Figure 7.12. Starting condition: a bunch of points in two-dimensional space, almost all of them unlabeled, with the exception of two labeled points

Figure 7.13. After both the K-Nearest Neighbors graph construction algorithm and the semi-supervised learning label propagation algorithm have run

Figure 7.14. Distributed K-Nearest Neighbor graph construction. Divide the space into grid cells and perform brute-force K-Nearest Neighbor graph construction within each cell. To avoid missing edges that would cross a cell boundary, vary the grid, run again, and take the union of the two edge sets.

Figure 7.15. Result of executing both the approximate distributed K-Nearest Neighbors algorithm and the semi-supervised learning label propagation algorithm from the next section. This graph has only about two-thirds as many edges. Compared to the exact result shown in figure 7.13, one vertex, number 13, was misclassified, and another vertex, number 12, wasn’t classified at all. But the benefit of the approximate algorithm is that it is distributed and can be run on large graphs.

Figure 7.16. The graph that results from applying our distributed, approximate, unsupervised learning algorithm. Now there is structure, but unlabeled vertices remain unlabeled.

Figure 7.17. Iterations of semi-supervised learning label propagation applied to the perfect K-Nearest Neighbors example

Chapter 8. The missing algorithms

Figure 8.1. If you use GraphX’s subgraph() to keep only the is-friends-with edges, it leaves straggler vertices like Diane (on the right).
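
A sketch of the subgraph() call the caption refers to, assuming String edge attributes as in the chapter’s example; because only an edge predicate is supplied, vertices like Diane survive with no edges attached:

    val friendsOnly = graph.subgraph(epred = t => t.attr == "is-friends-with")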

Figure 8.2. Merging two graphs together based on commonly named vertices is a useful operation that isn’t built into GraphX (as of Spark 1.6).

Figure 8.3. An example of an RDF triple

Figure 8.4. Data flow representation of listing 8.5. First the Subject (source) vertex’s string name is translated to a vertex ID, and then the Object (destination) vertex’s is. Finally an Edge() is constructed from these two vertex IDs plus the original edge String attribute.

Figure 8.5. To find one particular data item by key in an RDD is a two-step process. First, the Spark driver uses the RDD’s HashPartitioner to identify which worker/partition the data item is in, and then the worker node finds the data item in the RDD partition that it is on. The difference between an IndexedRDD and a conventional RDD is in this second step. For an IndexedRDD, the worker node uses a search trie, and for a conventional RDD, the worker node performs a linear scan. Note that even though search tries are a kind of graph, IndexedRDD uses a custom, efficient, non-distributed implementation and doesn’t use distributed GraphX Graphs to implement them.

Figure 8.6. The subgraph formed by the three vertices <France>, <Countries>, and <aircraft> is said to be isomorphic to the subgraph formed by the three vertices <Canada>, <Countries>, and <aircraft>. Not only are the edges in the same locations, but the edge attributes are the same. From this, could we infer that Canada also exports chemicals?

Figure 8.7. Data flow to find the most likely possible export from Canada that wasn’t in Wikipedia when YAGO3 took a snapshot of it

Figure 8.8. Example of computing the global clustering coefficient. There are three closed triplets associated with the one triangle: one closed triplet associated with vertex 1, one with vertex 2, and one with vertex 3. There are two open triplets associated with vertex 3, namely 1-3-4 and 2-3-4.
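
For reference, the arithmetic the caption implies, using the standard definition of the global clustering coefficient:

    C = closed triplets / all triplets = 3 / (3 + 2) = 0.6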

Chapter 9. Performance and monitoring

Figure 9.1. A job involving filter and map operations only requires a single stage. Each operation is performed in isolation on each element of the initial RDD.

Figure 9.2. Operations such as groupByKey that require data to be examined across partitions cause a shuffle to take place across stage boundaries.

Figure 9.3. The driver program creates the SparkContext that negotiates with the cluster manager to create executors on the worker nodes. The SparkContext contains a job scheduler (not shown) that distributes tasks to the executors.

Figure 9.4. The Jobs tab lists all the actions that have been (or are being) executed in the application. The screenshot shows information about timings and duration for two jobs that have been run. The columns are described in more detail in table 9.1.

Figure 9.5. The Jobs tab now shows one job.

Figure 9.6. Stages list for the reduce job. Note the tab strip still has “Jobs” highlighted.

Figure 9.7. The Jobs timeline: a second executor is started at around 14:22. Four jobs ran on a single executor; one more job ran after the second executor had been started.

Figure 9.8. The Stages timeline shows you which stages Spark can run in parallel.

Figure 9.9. Stage details event timeline showing the timeline of task execution. Notice that only two tasks are ever executed in parallel.

Figure 9.10. Stage details event timeline over four cores—parallelism has increased to use all the cores.

Figure 9.11. Stage details event timeline over six cores.

Figure 9.12. DAG visualization for an example Spark job. Notice that RDDs that have been cached have their stage marked “skipped”—Spark will read the RDD data from a cache rather than recalculate the RDD.

Figure 9.13. The Spark Environment tab (accessed from the Environment tab at the top of the browser screen)

Figure 9.14. Spark history server UI containing two complete Spark applications

Figure 9.15. 31 GB memory allocated to each executor in the cluster

Figure 9.16. Multiple 63 GB executors on each machine in the cluster

Figure 9.17. All edges have been loaded into a single partition, so only one of the four executors will process the resulting graph.

Figure 9.18. The four different PartitionStrategy options showing how an example graph might be spread across four partitions
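
A sketch of applying one of the four strategies (EdgePartition2D is just one option):

    import org.apache.spark.graphx.PartitionStrategy
    val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)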

Chapter 10. Other languages and tools

Figure 10.1. Zeppelin is like the Spark REPL except that visualizations can be displayed inline.

Figure 10.2. Spark Job Server maintains the SparkContext, which in turn has its associated RDD references and allows multiple applications to share the same SparkContext and share the same RDDs. A graph of reference data can be loaded up into a Spark Job Server job, and multiple applications can “query” this reference graph. Even providing a way to update the graph is not out of the question.

Figure 10.3. Because GraphFrames is based on DataFrames rather than RDDs, it’s much faster than GraphX due to the Catalyst and Tungsten performance layers built into Spark SQL. Catalyst, the query optimizer, and Tungsten, the direct memory manager that bypasses the JVM, can be considered turbocharger add-ons. GraphX has no way to optimize join()s, for example, and must go through the JVM for all memory operations.

Figure 10.4. Whereas the fundamental graph type in GraphX is Graph, in GraphFrames it’s GraphFrame. The parameterized type system isn’t used in GraphFrames—rather, there’s a convention (enforced at runtime) whereby columns in the DataFrames are expected to have particular names.

Figure 10.5. Example myGraph from listing 4.1, shown here again

Figure 10.6. The graph fragment represented by the Cypher syntax (u)-[e1]->(v); (v)-[e2]->(w). This will find all graph fragments that match this structure—specifically, where the destination vertex of the first edge matches the source vertex of the second edge.

Figure 10.7. A potentially missing edge from Wikipedia, found using Cypher

Appendix B. Gephi visualization software

Figure B.1. Gephi’s dockable window layout

Figure B.2. The three windows to choose from the Window drop-down menu

Figure B.3. Available layout algorithms from the drop-down list inside the Layout window. The ones we haven’t labeled as Incremental are all first-class layout algorithms that perform a complete layout from scratch. The incremental ones nudge around an already-laid-out graph.

Figure B.4. Key adjustment so that small graphs don’t end up as a tiny, scrunched-up bunch

Figure B.5. Key Preview Settings window settings
