Preface
  What this book covers
  What you need for this book
  Who this book is for
  Conventions
  Reader feedback
  Customer support
  Downloading the example code
  Downloading the color images of this book
  Errata
  Piracy
  Questions

Introduction to Scala
  History and purposes of Scala
  Platforms and editors
  Installing and setting up Scala
  Installing Java
  Windows
  Mac OS
  Using Homebrew installer
  Installing manually
  Linux
  Scala: the scalable language
  Scala is object-oriented
  Scala is functional
  Scala is statically typed
  Scala runs on the JVM
  Scala can execute Java code
  Scala can do concurrent and synchronized processing
  Scala for Java programmers
  All types are objects
  Type inference
  Scala REPL
  Nested functions
  Import statements
  Operators as methods
  Methods and parameter lists
  Methods inside methods
  Constructor in Scala
  Objects instead of static methods
  Traits
  Scala for the beginners
  Your first line of code
  I'm the hello world program, explain me well!
  Run Scala interactively!
  Compile it!
  Execute it with Scala command
  Summary

Object-Oriented Scala
  Variables in Scala
  Reference versus value immutability
  Data types in Scala
  Variable initialization
  Type annotations
  Type ascription
  Lazy val
  Methods, classes, and objects in Scala
  Methods in Scala
  The return in Scala
  Classes in Scala
  Objects in Scala
  Singleton and companion objects
  Companion objects
  Comparing and contrasting: val and final
  Access and visibility
  Constructors
  Traits in Scala
  A trait syntax
  Extending traits
  Abstract classes
  Abstract classes and the override keyword
  Case classes in Scala
  Packages and package objects
  Java interoperability
  Pattern matching
  Implicit in Scala
  Generic in Scala
  Defining a generic class
  SBT and other build systems
  Build with SBT
  Maven with Eclipse
  Gradle with Eclipse
  Summary

Functional Programming Concepts
  Introduction to functional programming
  Advantages of functional programming
  Functional Scala for the data scientists
  Why FP and Scala for learning Spark?
  Why Spark?
  Scala and the Spark programming model
  Scala and the Spark ecosystem
  Pure functions and higher-order functions
  Pure functions
  Anonymous functions
  Higher-order functions
  Function as a return value
  Using higher-order functions
  Error handling in functional Scala
  Failure and exceptions in Scala
  Throwing exceptions
  Catching exception using try and catch
  Finally
  Creating an Either
  Future
  Run one task, but block
  Functional programming and data mutability
  Summary

Collection APIs
  Scala collection APIs
  Types and hierarchies
  Traversable
  Iterable
  Seq, LinearSeq, and IndexedSeq
  Mutable and immutable
  Arrays
  Lists
  Sets
  Tuples
  Maps
  Option
  Exists
  Forall
  Filter
  Map
  Take
  GroupBy
  Init
  Drop
  TakeWhile
  DropWhile
  FlatMap
  Performance characteristics
  Performance characteristics of collection objects
  Memory usage by collection objects
  Java interoperability
  Using Scala implicits
  Implicit conversions in Scala
  Summary

Tackle Big Data – Spark Comes to the Party
  Introduction to data analytics
  Inside the data analytics process
  Introduction to big data
  4 Vs of big data
  Variety of Data
  Velocity of Data
  Volume of Data
  Veracity of Data
  Distributed computing using Apache Hadoop
  Hadoop Distributed File System (HDFS)
  HDFS High Availability
  HDFS Federation
  HDFS Snapshot
  HDFS Read
  HDFS Write
  MapReduce framework
  Here comes Apache Spark
  Spark core
  Spark SQL
  Spark streaming
  Spark GraphX
  Spark ML
  PySpark
  SparkR
  Summary

Start Working with Spark – REPL and RDDs
  Dig deeper into Apache Spark
  Apache Spark installation
  Spark standalone
  Spark on YARN
  YARN client mode
  YARN cluster mode
  Spark on Mesos
  Introduction to RDDs
  RDD Creation
  Parallelizing a collection
  Reading data from an external source
  Transformation of an existing RDD
  Streaming API
  Using the Spark shell
  Actions and Transformations
  Transformations
  General transformations
  Math/Statistical transformations
  Set theory/relational transformations
  Data structure-based transformations
  map function
  flatMap function
  filter function
  coalesce
  repartition
  Actions
  reduce
  count
  collect
  Caching
  Loading and saving data
  Loading data
  textFile
  wholeTextFiles
  Load from a JDBC Datasource
  Saving RDD
  Summary

Special RDD Operations
  Types of RDDs
  Pair RDD
  DoubleRDD
  SequenceFileRDD
  CoGroupedRDD
  ShuffledRDD
  UnionRDD
  HadoopRDD
  NewHadoopRDD
  Aggregations
  groupByKey
  reduceByKey
  aggregateByKey
  combineByKey
  Comparison of groupByKey, reduceByKey, combineByKey, and aggregateByKey
  Partitioning and shuffling
  Partitioners
  HashPartitioner
  RangePartitioner
  Shuffling
  Narrow Dependencies
  Wide Dependencies
  Broadcast variables
  Creating broadcast variables
  Cleaning broadcast variables
  Destroying broadcast variables
  Accumulators
  Summary

Introduce a Little Structure - Spark SQL
  Spark SQL and DataFrames
  DataFrame API and SQL API
  Pivots
  Filters
  User-Defined Functions (UDFs)
  Schema structure of data
  Implicit schema
  Explicit schema
  Encoders
  Loading and saving datasets
  Loading datasets
  Saving datasets
  Aggregations
  Aggregate functions
  Count
  First
  Last
  approx_count_distinct
  Min
  Max
  Average
  Sum
  Kurtosis
  Skewness
  Variance
  Standard deviation
  Covariance
  groupBy
  Rollup
  Cube
  Window functions
  ntiles
  Joins
  Inner workings of join
  Shuffle join
  Broadcast join
  Join types
  Inner join
  Left outer join
  Right outer join
  Outer join
  Left anti join
  Left semi join
  Cross join
  Performance implications of join
  Summary

Stream Me Up, Scotty - Spark Streaming
  A Brief introduction to streaming
  At least once processing
  At most once processing
  Exactly once processing
  Spark Streaming
  StreamingContext
  Creating StreamingContext
  Starting StreamingContext
  Stopping StreamingContext
  Input streams
  receiverStream
  socketTextStream
  rawSocketStream
  fileStream
  textFileStream
  binaryRecordsStream
  queueStream
  textFileStream example
  twitterStream example
  Discretized streams
  Transformations
  Window operations
  Stateful/stateless transformations
  Stateless transformations
  Stateful transformations
  Checkpointing
  Metadata checkpointing
  Data checkpointing
  Driver failure recovery
  Interoperability with streaming platforms (Apache Kafka)
  Receiver-based approach
  Direct stream
  Structured streaming
  Structured streaming
  Handling Event-time and late data
  Fault tolerance semantics
  Summary

Everything is Connected - GraphX
  A brief introduction to graph theory
  GraphX
  VertexRDD and EdgeRDD
  VertexRDD
  EdgeRDD
  Graph operators
  Filter
  MapValues
  aggregateMessages
  TriangleCounting
  Pregel API
  ConnectedComponents
  Traveling salesman problem
  ShortestPaths
  PageRank
  Summary

Learning Machine Learning - Spark MLlib and Spark ML
  Introduction to machine learning
  Typical machine learning workflow
  Machine learning tasks
  Supervised learning
  Unsupervised learning
  Reinforcement learning
  Recommender system
  Semisupervised learning
  Spark machine learning APIs
  Spark machine learning libraries
  Spark MLlib
  Spark ML
  Spark MLlib or Spark ML?
  Feature extraction and transformation
  CountVectorizer
  Tokenizer
  StopWordsRemover
  StringIndexer
  OneHotEncoder
  Spark ML pipelines
  Dataset abstraction
  Creating a simple pipeline
  Unsupervised machine learning
  Dimensionality reduction
  PCA
  Using PCA
  Regression Analysis - a practical use of PCA
  Dataset collection and exploration
  What is regression analysis?
  Binary and multiclass classification
  Performance metrics
  Binary classification using logistic regression
  Breast cancer prediction using logistic regression of Spark ML
  Dataset collection
  Developing the pipeline using Spark ML
  Multiclass classification using logistic regression
  Improving classification accuracy using random forests
  Classifying MNIST dataset using random forest
  Summary

My Name is Bayes, Naive Bayes
  Multinomial classification
  Transformation to binary
  Classification using One-Vs-The-Rest approach
  Exploration and preparation of the OCR dataset
  Hierarchical classification
  Extension from binary
  Bayesian inference
  An overview of Bayesian inference
  What is inference?
  How does it work?
  Naive Bayes
  An overview of Bayes' theorem
  My name is Bayes, Naive Bayes
  Building a scalable classifier with NB
  Tune me up!
  The decision trees
  Advantages and disadvantages of using DTs
  Decision tree versus Naive Bayes
  Building a scalable classifier with DT algorithm
  Summary

Time to Put Some Order - Cluster Your Data with Spark MLlib
  Unsupervised learning
  Unsupervised learning example
  Clustering techniques
  Unsupervised learning and the clustering
  Hierarchical clustering
  Centroid-based clustering
  Distribution-based clustering
  Centroid-based clustering (CC)
  Challenges in CC algorithm
  How does K-means algorithm work?
  An example of clustering using K-means of Spark MLlib
  Hierarchical clustering (HC)
  An overview of HC algorithm and challenges
  Bisecting K-means with Spark MLlib
  Bisecting K-means clustering of the neighborhood using Spark MLlib
  Distribution-based clustering (DC)
  Challenges in DC algorithm
  How does a Gaussian mixture model work?
  An example of clustering using GMM with Spark MLlib
  Determining number of clusters
  A comparative analysis between clustering algorithms
  Submitting Spark job for cluster analysis
  Summary

Text Analytics Using Spark ML
  Understanding text analytics
  Text analytics
  Sentiment analysis
  Topic modeling
  TF-IDF (term frequency - inverse document frequency)
  Named entity recognition (NER)
  Event extraction
  Transformers and Estimators
  Standard Transformer
  Estimator Transformer
  Tokenization
  StopWordsRemover
  NGrams
  TF-IDF
  HashingTF
  Inverse Document Frequency (IDF)
  Word2Vec
  CountVectorizer
  Topic modeling using LDA
  Implementing text classification
  Summary

Spark Tuning
  Monitoring Spark jobs
  Spark web interface
  Jobs
  Stages
  Storage
  Environment
  Executors
  SQL
  Visualizing Spark application using web UI
  Observing the running and completed Spark jobs
  Debugging Spark applications using logs
  Logging with log4j with Spark
  Spark configuration
  Spark properties
  Environmental variables
  Logging
  Common mistakes in Spark app development
  Application failure
  Slow jobs or unresponsiveness
  Optimization techniques
  Data serialization
  Memory tuning
  Memory usage and management
  Tuning the data structures
  Serialized RDD storage
  Garbage collection tuning
  Level of parallelism
  Broadcasting
  Data locality
  Summary

Time to Go to ClusterLand - Deploying Spark on a Cluster
  Spark architecture in a cluster
  Spark ecosystem in brief
  Cluster design
  Cluster management
  Pseudocluster mode (aka Spark local)
  Standalone
  Apache YARN
  Apache Mesos
  Cloud-based deployments
  Deploying the Spark application on a cluster
  Submitting Spark jobs
  Running Spark jobs locally and in standalone
  Hadoop YARN
  Configuring a single-node YARN cluster
  Step 1: Downloading Apache Hadoop
  Step 2: Setting the JAVA_HOME
  Step 3: Creating users and groups
  Step 4: Creating data and log directories
  Step 5: Configuring core-site.xml
  Step 6: Configuring hdfs-site.xml
  Step 7: Configuring mapred-site.xml
  Step 8: Configuring yarn-site.xml
  Step 9: Setting Java heap space
  Step 10: Formatting HDFS
  Step 11: Starting the HDFS
  Step 12: Starting YARN
  Step 13: Verifying on the web UI
  Submitting Spark jobs on YARN cluster
  Advance job submissions in a YARN cluster
  Apache Mesos
  Client mode
  Cluster mode
  Deploying on AWS
  Step 1: Key pair and access key configuration
  Step 2: Configuring Spark cluster on EC2
  Step 3: Running Spark jobs on the AWS cluster
  Step 4: Pausing, restarting, and terminating the Spark cluster
  Summary

Testing and Debugging Spark
  Testing in a distributed environment
  Distributed environment
  Issues in a distributed system
  Challenges of software testing in a distributed environment
  Testing Spark applications
  Testing Scala methods
  Unit testing
  Testing Spark applications
  Method 1: Using Scala JUnit test
  Method 2: Testing Scala code using FunSuite
  Method 3: Making life easier with Spark testing base
  Configuring Hadoop runtime on Windows
  Debugging Spark applications
  Logging with log4j with Spark recap
  Debugging the Spark application
  Debugging Spark application on Eclipse as Scala debug
  Debugging Spark jobs running as local and standalone mode
  Debugging Spark applications on YARN or Mesos cluster
  Debugging Spark application using SBT
  Summary

PySpark and SparkR
  Introduction to PySpark
  Installation and configuration
  By setting SPARK_HOME
  Using Python shell
  By setting PySpark on Python IDEs
  Getting started with PySpark
  Working with DataFrames and RDDs
  Reading a dataset in Libsvm format
  Reading a CSV file
  Reading and manipulating raw text files
  Writing UDF on PySpark
  Let's do some analytics with k-means clustering
  Introduction to SparkR
  Why SparkR?
  Installing and getting started
  Getting started
  Using external data source APIs
  Data manipulation
  Querying SparkR DataFrame
  Visualizing your data on RStudio
  Summary