www.PacktPub.com
Preface
  What this book covers
  What you need for this book
  Who this book is for
  Sections
    Getting ready
    How to do it...
    How it works...
    There's more...
    See also
  Conventions
  Reader feedback
  Customer support
    Downloading the color images of this book
    Errata
    Piracy
    Questions
Getting Started with Apache Spark
  Introduction
  Leveraging Databricks Cloud
    How to do it...
    How it works...
      Cluster
      Notebook
      Table
      Library
  Deploying Spark using Amazon EMR
    What it represents is much bigger than what it looks
    EMR's architecture
    How to do it...
    How it works...
      EC2 instance types
        T2 - Free Tier Burstable (EBS only)
        M4 - General purpose (EBS only)
        C4 - Compute optimized
        X1 - Memory optimized
        R4 - Memory optimized
        P2 - General purpose GPU
        I3 - Storage optimized
        D2 - Storage optimized
  Installing Spark from binaries
    Getting ready
    How to do it...
  Building the Spark source code with Maven
    Getting ready
    How to do it...
  Launching Spark on Amazon EC2
    Getting ready
    How to do it...
    See also
  Deploying Spark on a cluster in standalone mode
    Getting ready
    How to do it...
    How it works...
    See also
  Deploying Spark on a cluster with Mesos
    How to do it...
  Deploying Spark on a cluster with YARN
    Getting ready
    How to do it...
    How it works...
  Understanding SparkContext and SparkSession
    SparkContext
    SparkSession
  Understanding resilient distributed dataset - RDD
    How to do it...
Developing Applications with Spark
  Introduction
  Exploring the Spark shell
    How to do it...
    There's more...
  Developing a Spark application in Eclipse with Maven
    Getting ready
    How to do it...
  Developing a Spark application in Eclipse with SBT
    How to do it...
  Developing a Spark application in IntelliJ IDEA with Maven
    How to do it...
  Developing a Spark application in IntelliJ IDEA with SBT
    How to do it...
  Developing applications using the Zeppelin notebook
    How to do it...
  Setting up Kerberos to do authentication
    How to do it...
    There's more...
  Enabling Kerberos authentication for Spark
    How to do it...
    There's more...
      Securing data at rest
      Securing data in transit
Spark SQL
  Understanding the evolution of schema awareness
    Getting ready
      DataFrames
      Datasets
      Schema-aware file formats
  Understanding the Catalyst optimizer
    Analysis
    Logical plan optimization
    Physical planning
    Code generation
  Inferring schema using case classes
    How to do it...
    There's more...
  Programmatically specifying the schema
    How to do it...
    How it works...
  Understanding the Parquet format
    How to do it...
    How it works...
      Partitioning
      Predicate pushdown
      Parquet Hive interoperability
  Loading and saving data using the JSON format
    How to do it...
    How it works...
  Loading and saving data from relational databases
    Getting ready
    How to do it...
  Loading and saving data from an arbitrary source
    How to do it...
    There's more...
  Understanding joins
    Getting ready
    How to do it...
    How it works...
      Shuffle hash join
      Broadcast hash join
      The cartesian join
    There's more...
  Analyzing nested structures
    Getting ready
    How to do it...
Working with External Data Sources
  Introduction
  Loading data from the local filesystem
    How to do it...
  Loading data from HDFS
    How to do it...
  Loading data from Amazon S3
    How to do it...
  Loading data from Apache Cassandra
    How to do it...
    How it works
      CAP Theorem
      Cassandra partitions
      Consistency levels
Spark Streaming
  Introduction
    Classic Spark Streaming
    Structured Streaming
  WordCount using Structured Streaming
    How to do it...
  Taking a closer look at Structured Streaming
    How to do it...
    There's more...
  Streaming Twitter data
    How to do it...
  Streaming using Kafka
    Getting ready
    How to do it...
  Understanding streaming challenges
    Late arriving/out-of-order data
    Maintaining the state in between batches
    Message delivery reliability
    Streaming is not an island
Getting Started with Machine Learning
  Introduction
  Creating vectors
    Getting ready
    How to do it...
    How it works...
  Calculating correlation
    Getting ready
    How to do it...
  Understanding feature engineering
    Feature selection
      Quality of features
      Number of features
    Feature scaling
    Feature extraction
      TF-IDF
        Term frequency
        Inverse document frequency
    How to do it...
  Understanding Spark ML
    Getting ready
    How to do it...
  Understanding hyperparameter tuning
    How to do it...
Supervised Learning with MLlib - Regression
  Introduction
  Using linear regression
    Getting ready
    How to do it...
    There's more...
  Understanding the cost function
    There's more...
  Doing linear regression with lasso
    Bias versus variance
    How to do it...
  Doing ridge regression
Supervised Learning with MLlib - Classification
  Introduction
  Doing classification using logistic regression
    Getting ready
    How to do it...
    There's more...
      What is ROC?
  Doing binary classification using SVM
    Getting ready
    How to do it...
  Doing classification using decision trees
    Getting ready
    How to do it...
    How it works...
    There's more...
  Doing classification using random forest
    Getting ready
    How to do it...
  Doing classification using gradient boosted trees
    Getting ready
    How to do it...
  Doing classification with Naïve Bayes
    Getting ready
    How to do it...
Unsupervised Learning
  Introduction
  Clustering using k-means
    Getting ready
    How to do it...
  Dimensionality reduction with principal component analysis
    Getting ready
    How to do it...
  Dimensionality reduction with singular value decomposition
    Getting ready
    How to do it...
Recommendations Using Collaborative Filtering
  Introduction
  Collaborative filtering using explicit feedback
    Getting ready
    How to do it...
    Adding my recommendations and then testing predictions
    There's more...
  Collaborative filtering using implicit feedback
    How to do it...
Graph Processing Using GraphX and GraphFrames
  Introduction
  Fundamental operations on graphs
    Getting ready
    How to do it...
  Using PageRank
    Getting ready
    How to do it...
  Finding connected components
    Getting ready
    How to do it...
  Performing neighborhood aggregation
    Getting ready
    How to do it...
  Understanding GraphFrames
    How to do it...
Optimizations and Performance Tuning
  Optimizing memory
    How to do it...
    How it works...
      Garbage collection
        Mark and sweep
        G1
      Spark memory allocation
  Leveraging speculation
    How to do it...
  Optimizing joins
    How to do it...
  Using compression to improve performance
    How to do it...
  Using serialization to improve performance
    How to do it...
    There's more...
  Optimizing the level of parallelism
    How to do it...
  Understanding project Tungsten
    How to do it...
    How it works...
      Tungsten phase 1
        Bypassing GC
        Cache conscious computation
        Code generation for expression evaluation
      Tungsten phase 2
        Wholesale code generation
        In-memory columnar format