Table of Contents

Copyright

Brief Table of Contents

Table of Contents

Preface

Acknowledgments

About this Book

About the Cover Illustration

Part 1. Spark and graphs

Chapter 1. Two important technologies: Spark and graphs

1.1. Spark: the step beyond Hadoop MapReduce

1.1.1. The elusive definition of Big Data

1.1.2. Hadoop: the world before Spark

1.1.3. Spark: in-memory MapReduce processing

1.2. Graphs: finding meaning from relationships

1.2.1. Uses of graphs

1.2.2. Types of graph data

1.2.3. Plain RDBMS inadequate for graphs

1.3. Putting them together for lightning fast graph processing: Spark GraphX

1.3.1. Property graph: adding richness

1.3.2. Graph partitioning: graphs meet Big Data

1.3.3. GraphX lets you choose: graph parallel or data parallel

1.3.4. Various ways GraphX fits into a processing flow

1.3.5. GraphX vs. other systems

1.3.6. Storing the graphs: distributed file storage vs. graph database

1.4. Summary

Chapter 2. GraphX quick start

2.1. Getting set up and getting data

2.2. Interactive GraphX querying using the Spark Shell

2.3. PageRank example

2.4. Summary

Chapter 3. Some fundamentals

3.1. Scala, the native language of Spark

3.1.1. Scala’s philosophy: conciseness and expressiveness

3.1.2. Functional programming

3.1.3. Inferred typing

3.1.4. Class declaration

3.1.5. Map and reduce

3.1.6. Everything is a function

3.1.7. Java interoperability

3.2. Spark

3.2.1. Distributed in-memory data: RDDs

3.2.2. Laziness

3.2.3. Cluster requirements and terminology

3.2.4. Serialization

3.2.5. Common RDD operations

3.2.6. Hello World with Spark and sbt

3.3. Graph terminology

3.3.1. Basics

3.3.2. RDF graphs vs. property graphs

3.3.3. Adjacency matrix

3.3.4. Graph querying systems

3.4. Summary

Part 2. Connecting vertices

Chapter 4. GraphX Basics

4.1. Vertex and edge classes

4.2. Mapping operations

4.2.1. Simple graph transformation

4.2.2. Map/Reduce

4.2.3. Iterated Map/Reduce

4.3. Serialization/deserialization

4.3.1. Reading/writing binary format

4.3.2. JSON format

4.3.3. GEXF format for Gephi visualization software

4.4. Graph generation

4.4.1. Deterministic graphs

4.4.2. Random graphs

4.5. Pregel API

4.6. Summary

Chapter 5. Built-in algorithms

5.1. Seek out authoritative nodes: PageRank

5.1.1. PageRank algorithm explained

5.1.2. Invoking PageRank in GraphX

5.1.3. Personalized PageRank

5.2. Measuring connectedness: Triangle Count

5.2.1. Uses of Triangle Count

5.2.2. Slashdot friends and foes example

5.3. Find the fewest hops: ShortestPaths

5.4. Finding isolated populations: Connected Components

5.4.1. Predicting social circles

5.5. Reciprocated love only, please: Strongly Connected Components

5.6. Community detection: LabelPropagation

5.7. Summary

Chapter 6. Other useful graph algorithms

6.1. Your own GPS: Shortest Paths with Weights

6.2. Travelling Salesman: greedy algorithm

6.3. Route utilities: Minimum Spanning Trees

6.3.1. Deriving taxonomies with Word2Vec and Minimum Spanning Trees

6.4. Summary

Chapter 7. Machine learning

7.1. Supervised, unsupervised, and semi-supervised learning

7.2. Recommend a movie: SVDPlusPlus

7.2.1. Explanation of the Koren formula

7.3. Using GraphX With MLlib

7.3.1. Determine topics: Latent Dirichlet Allocation

7.3.2. Detect spam: LogisticRegressionWithSGD

7.3.3. Image segmentation (for computer vision) using Power Iteration Clustering

7.4. Poor man’s training data: graph-based semi-supervised learning

7.4.1. K-Nearest Neighbors graph construction

7.4.2. Semi-supervised learning label propagation

7.5. Summary

Part 3. Over the arc

Chapter 8. The missing algorithms

8.1. Missing basic graph operations

8.1.1. Common sense subgraphs

8.1.2. Merge two graphs

8.2. Reading RDF graph files

8.2.1. Matching vertices and constructing the graph

8.2.2. Improving performance with IndexedRDD, the RDD HashMap

8.3. Poor man’s graph isomorphism: finding missing Wikipedia infobox items

8.4. Global clustering coefficient: compare connectedness

8.5. Summary

Chapter 9. Performance and monitoring

9.1. Monitoring your Spark application

9.1.1. How Spark runs your application

9.1.2. Understanding your application runtime with Spark monitoring

9.1.3. History server

9.2. Configuring Spark

9.2.1. Utilizing all CPU cores

9.3. Spark performance tuning

9.3.1. Speeding up Spark with caching and persistence

9.3.2. Checkpointing

9.3.3. Reducing memory pressure with serialization

9.4. Graph partitioning

9.5. Summary

Chapter 10. Other languages and tools

10.1. Using languages other than Scala with GraphX

10.1.1. Using GraphX with Java 7

10.1.2. Using GraphX with Java 8

10.1.3. Whether GraphX may gain Python or R bindings in the future

10.2. Another visualization tool: Apache Zeppelin plus d3.js

10.3. Almost a database: Spark Job Server

10.3.1. Example: Query Slashdot friends degree of separation

10.3.2. More on using Spark Job Server

10.4. Using SQL with Spark graphs with GraphFrames

10.4.1. Getting GraphFrames, plus GraphX interoperability

10.4.2. Using SQL for convenience and performance

10.4.3. Searching for vertices with the Cypher subset

10.4.4. Slightly more complex isomorphic searching on YAGO

10.5. Summary

Appendix A. Installing Spark

A.1. On a local virtual machine: CDH QuickStart VM

A.1.1. VirtualBox tweaks

A.2. Onto your laptop and Hadoopless: Linux or OS X

A.2.1. On a custom local virtual machine

A.3. In the cloud: Amazon Web Services

Appendix B. Gephi visualization software

B.1. Laying out your environment

B.2. Basic recipe

B.3. Key settings

B.3.1. Layout window

B.3.2. Preview Settings window

Appendix C. Resources: where to go for more

C.1. Spark

Apache mailing lists

Databricks forums

Conference and meetup videos

Jira

Twitter

spark-packages.org

AMPLab

Google Scholar Alerts

Author blogs

C.2. Scala

C.3. Graphs

Appendix D. List of Scala tips in this book

Chapter 2

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Index

List of Figures

List of Tables

List of Listings
