by Denny Lee, Tomasz Drabas
Learning PySpark
Index
A
action /
Internal workings of an RDD
actions
reference link /
Resilient Distributed Dataset
about /
Actions
.take(...) method /
The .take(...) method
.collect() method /
The .collect(...) method
.reduce(...) method /
The .reduce(...) method
.count() method /
The .count(...) method
.saveAsTextFile(...) method /
The .saveAsTextFile(...) method
.foreach(...) method /
The .foreach(...) method
airportCodes
URL /
Preparing your flights dataset
airport ranking
determining, with PageRank /
Determining airport ranking using PageRank
Airports D3 visualization
URL /
Visualizing flights using D3
Apache Spark
about /
What is Apache Spark?
reference link /
What is Apache Spark?
URL, for issues /
What is the Spark Streaming application data flow?
Apache Spark 2.0 Architecture
about /
Spark 2.0 architecture
references /
Spark 2.0 architecture
Datasets, unifying with DataFrames /
Unifying Datasets and DataFrames
SparkSession /
Introducing SparkSession
Project Tungsten 2 /
Tungsten phase 2
Structured Streaming /
Structured Streaming
continuous applications /
Continuous applications
Apache Spark APIs
about /
Spark Jobs and APIs
Resilient Distributed Dataset (RDD) /
Resilient Distributed Dataset
Datasets /
Datasets
Catalyst Optimizer /
Catalyst Optimizer
Apache Spark Jobs
about /
Spark Jobs and APIs
execution process /
Execution process
DataFrames /
DataFrames
Project Tungsten /
Project Tungsten
application, deploying
about /
Deploying the app programmatically
SparkSession, configuring /
Configuring your SparkSession
SparkSession, creating /
Creating SparkSession
code, modularizing /
Modularizing code
job, submitting /
Submitting a job
execution, monitoring /
Monitoring execution
associative /
The .reduce(...) method
B
bcolz format
URL /
Working with files
birth data
URL /
Overview of the package, Predicting the chances of infant survival with ML
loading /
Loading and transforming the data
transforming /
Loading and transforming the data
knowledge, obtaining /
Getting to know your data
descriptive statistics, calculating /
Descriptive statistics
correlations, calculating /
Correlations
statistical test, executing /
Statistical testing
final dataset, creating /
Creating the final dataset
RDD, creating of LabeledPoints /
Creating an RDD of LabeledPoints
splitting, into training and testing /
Splitting into training and testing
Blaze
installing /
Installing Blaze
block-wise reducing operations
about /
Blockwise reducing operations example
DataFrame, building of vectors /
Building a DataFrame of vectors
DataFrame, analyzing /
Analysing the DataFrame
elementwise sum, computing of vectors /
Computing elementwise sum and min of all vectors
elementwise min, computing of vectors /
Computing elementwise sum and min of all vectors
Breadth-first search (BFS)
about /
Using Breadth-First Search
using /
Using Breadth-First Search
Bureau of Transportation Statistics (BTS) /
Preparing your flights dataset
C
.collect() method /
The .collect(...) method
.count() method /
The .count(...) method
Catalyst Optimizer
about /
Catalyst Optimizer, Catalyst Optimizer refresh
references /
Catalyst Optimizer
reference link /
Catalyst Optimizer refresh
Chi-square
URL /
Transformer
classification
about /
Classification, Classification
LogisticRegression /
Classification
DecisionTreeClassifier /
Classification
GBTClassifier /
Classification
RandomForestClassifier /
Classification
NaiveBayes /
Classification
MultilayerPerceptronClassifier /
Classification
OneVsRest /
Classification
clustering
about /
Clustering, Clustering
BisectingKMeans /
Clustering
KMeans /
Clustering
GaussianMixture /
Clustering
LDA /
Clustering
clusters, searching in births dataset /
Finding clusters in the births dataset
topic mining /
Topic mining
code, application deploying
module structure /
Structure of the module
distance, calculating between two points /
Calculating the distance between two points
distance units, converting /
Converting distance units
.egg, building /
Building an egg
user defined functions, in Spark /
User defined functions in Spark
Code Generation /
Tungsten phase 2
command line, parameters
master /
Command line parameters
deploy mode /
Command line parameters
name /
Command line parameters
py-files /
Command line parameters
files /
Command line parameters
conf /
Command line parameters
properties-file /
Command line parameters
driver-memory /
Command line parameters
executor-memory /
Command line parameters
help /
Command line parameters
verbose /
Command line parameters
version /
Command line parameters
kill /
Command line parameters
supervise /
Command line parameters
status /
Command line parameters
commutative /
The .reduce(...) method
constants
used, for matrix multiplication /
Matrix multiplication using constants
continuous applications /
Continuous applications
continuous variables
discretizing /
Discretizing continuous variables
standardizing /
Standardizing continuous variables
correlations
calculating /
Correlations
Cost-based Optimizer framework
references /
Catalyst Optimizer refresh
D
.distinct() transformation /
The .distinct(...) transformation
DAG scheduler
reference link /
Execution process
data abstraction
about /
Abstracting data
NumPy arrays, working with /
Working with NumPy arrays
pandas' DataFrame, working with /
Working with pandas' DataFrame
files, working with /
Working with files
databases, working with /
Working with databases
databases
working, with /
Working with databases
relational databases, interacting with /
Interacting with relational databases
MongoDB databases, interacting with /
Interacting with the MongoDB database
databricks/tensorframes GitHub repository
URL /
TensorFrames – quick start
Databricks Community Edition
reference link /
DataFrame scenario – on-time flight performance
URL /
Installing GraphFrames
Databricks Jobs /
Databricks Jobs
DataFrame
reference link /
NLP - related feature extractors
DataFrame API
querying, with /
Querying with the DataFrame API
number of rows /
Number of rows
filter statements, executing /
Running filter statements
DataFrames /
DataFrames
PySpark, speeding up with /
Speeding up PySpark with DataFrames
reference link /
Speeding up PySpark with DataFrames, Creating a temporary table
creating /
Creating DataFrames, Creating a DataFrame
custom JSON data, creating /
Generating our own JSON data
temporary table, creating /
Creating a temporary table
simple queries, executing /
Simple DataFrame queries
DataFrame API query, using /
DataFrame API query
SQL query, writing /
SQL query
DataFrames, relating with Tungsten
URL /
Speeding up PySpark with DataFrames
data lineage
URL /
Resilient Distributed Dataset
data operations
about /
Data operations
columns, accessing /
Accessing columns
symbolic transformations /
Symbolic transformations
operations, performing on columns /
Operations on columns
data, reducing /
Reducing data
joins /
Joins
dataset
about /
Getting familiar with your data
URL /
Getting familiar with your data
descriptive statistics, calculating /
Descriptive statistics
correlations, calculating /
Correlations
Datasets /
Datasets
unifying, with DataFrames /
Unifying Datasets and DataFrames
Deep Learning
about /
What is Deep Learning?
need for /
The need for neural networks and Deep Learning
reference link /
The need for neural networks and Deep Learning, Introducing TensorFrames
feature engineering /
What is feature engineering?
data and algorithm, bridging /
Bridging the data and algorithm
departureDelays.csv
URL /
Preparing your flights dataset
descriptive statistics
calculating /
Descriptive statistics
distributed computing
advances /
The need for neural networks and Deep Learning
availability /
The need for neural networks and Deep Learning
Distributed File System (HDFS) /
Creating DataFrames
DStreams
used, for Spark Streaming /
Simple streaming application using DStreams
about /
Loading and transforming the data
URL /
Loading and transforming the data
duplicates
checking for /
Duplicates
E
edges /
Preparing your flights dataset
estimators
about /
Estimators
classification /
Classification
regression /
Regression
clustering /
Clustering
F
.filter(...) transformation /
The .filter(...) transformation
.flatMap(...) transformation /
The .flatMap(...) transformation
.foreach(...) method /
The .foreach(...) method
.format(...)
URL /
Interacting with relational databases
Faster Stateful Stream Processing
URL /
A quick primer on global aggregations
feature engineering
about /
The need for neural networks and Deep Learning, What is feature engineering?
feature extraction
about /
Feature extraction, What is feature engineering?
NLP related feature extractors /
NLP - related feature extractors
continuous variables, discretizing /
Discretizing continuous variables
continuous variables, standardizing /
Standardizing continuous variables
reference link /
What is feature engineering?
feature learning
about /
The need for neural networks and Deep Learning
features
about /
The need for neural networks and Deep Learning
restaurant recommendations /
The need for neural networks and Deep Learning
references /
The need for neural networks and Deep Learning
Handwritten Digit recognition /
The need for neural networks and Deep Learning
Image Processing /
The need for neural networks and Deep Learning
feature selection
about /
What is feature engineering?
flights dataset
preparing /
Preparing your flights dataset
functions
URL /
Duplicates
G
global aggregations
about /
A quick primer on global aggregations
URL /
A quick primer on global aggregations
Gradient Boosted Trees /
Classification
graph
building /
Building the graph
GraphFrames
about /
Introducing GraphFrames
references /
Introducing GraphFrames
installing /
Installing GraphFrames
library, creating /
Creating a library
graph queries
executing /
Executing simple queries
number of airports, determining /
Determining the number of airports and trips
number of trips, determining /
Determining the number of airports and trips
longest delay, determining in dataset /
Determining the longest delay in this dataset
number of delayed flights, versus on-time flights, determining /
Determining the number of delayed versus on-time/early flights
flight delays, determining /
What flights departing Seattle are most likely to have significant delays?
states, determining for significant delays from SEA /
What states tend to have significant delays departing from Seattle?
grid search
about /
Grid search
H
Hartsfield-Jackson Atlanta International Airport (ATL)
references /
Determining airport ranking using PageRank
Haversine formula
URL /
Modularizing code
histograms
about /
Histograms
hyperparameters
reference link /
Introducing TensorFrames
I
Incremental Execution Plan
about /
Introducing Structured Streaming
infant survival
predicting /
Predicting infant survival
predicting, with logistic regression in MLlib /
Logistic regression in MLlib
most predictable features, selecting /
Selecting only the most predictable features
predicting, with random forest in MLlib /
Random forest in MLlib
infant survival prediction, with ML
performing /
Predicting the chances of infant survival with ML
data, loading /
Loading the data
transformers, creating /
Creating transformers
estimator, creating /
Creating an estimator
pipeline, creating /
Creating a pipeline
model, fitting /
Fitting the model
performance, evaluating /
Evaluating the performance of the model
model, saving /
Saving the model
International Air Transport Association (IATA) code /
Joining flight performance and airports
Inverse Document Frequency (IDF) /
Transformer
J
joins
about /
Joins
L
.leftOuterJoin(...) transformation /
The .leftOuterJoin(...) transformation
L1-Norm
reference link /
Descriptive statistics
L2-Norm
reference link /
Descriptive statistics
Lambda expressions
about /
Lambda expressions
reference link /
Lambda expressions, The .map(...) transformation
Latent Dirichlet Allocation
about /
Topic mining
learning PySpark
URL /
DataFrame scenario – on-time flight performance
Limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) /
Logistic regression in MLlib
URL /
Logistic regression in MLlib
lines DStream /
A quick primer on global aggregations
local mode, versus cluster mode
URL /
Global versus local scope
logistic regression
used, for predicting infant survival /
Logistic regression in MLlib
M
.map(...) transformation /
The .map(...) transformation
matrix multiplication
with constants /
Matrix multiplication using constants
with placeholders /
Matrix multiplication using placeholders, Running another model
Maven repository
URL /
Creating a library
Meetup Streaming API
URL /
What is Spark Streaming?
missing observations
checking /
Missing observations
MLlib
overview /
Overview of the package
data preparation /
Overview of the package
machine learning algorithms /
Overview of the package
utilities /
Overview of the package
infant survival, predicting with logistic regression /
Logistic regression in MLlib
infant survival, predicting with random forest /
Random forest in MLlib
ML package
overview /
Overview of the package
Transformer class /
Transformer
estimators /
Estimators
Pipeline /
Pipeline
features /
Other features of PySpark ML in action
feature extraction /
Feature extraction
classification /
Classification
clustering /
Clustering
regression /
Regression
MongoDB
URL /
Working with databases
MongoDB database
interacting, with /
Interacting with the MongoDB database
Mortality dataset
URL /
Creating RDDs
motifs
about /
Understanding motifs
N
neural networks
reference link /
What is Deep Learning?
need for /
The need for neural networks and Deep Learning
NLP related feature extractors
about /
NLP - related feature extractors
NumPy arrays
working with /
Working with NumPy arrays
O
odo
URL /
Working with databases
on-time flight performance
references /
Preparing your flights dataset
on-time flight performance, use cases
about /
DataFrame scenario – on-time flight performance
source datasets, preparing /
Preparing the source datasets
flight performance, joining /
Joining flight performance and airports
airports, joining /
Joining flight performance and airports
data, visualizing /
Visualizing our flight-performance data
on time flight performance
popular non-stop flights, determining /
Determining the most popular non-stop flights
reference link /
Determining the most popular non-stop flights, Visualizing flights using D3
flights, visualizing with D3 /
Visualizing flights using D3
outliers
checking /
Outliers
P
PageRank
airport ranking, determining /
Determining airport ranking using PageRank
reference link /
Determining airport ranking using PageRank
pandas' DataFrame
working, with /
Working with pandas' DataFrame
parameter hyper-tuning
about /
Parameter hyper-tuning
grid search /
Grid search
train-validation splitting /
Train-validation splitting
pip
installing /
Installing Pip
Pipeline
about /
Pipeline
placeholders
used, for matrix multiplication /
Matrix multiplication using placeholders, Running another model
Polyglot persistence
about /
Polyglot persistence
references /
Polyglot persistence
Population vs. Price Linear Regression Job /
Databricks Jobs
PostgreSQL
URL /
Working with databases
principal component analysis (PCA)
about /
What is feature engineering?
URL /
What is feature engineering?
project management committee (PMC)
about /
Why do we need Spark Streaming?
Project Tungsten
about /
Project Tungsten, Catalyst Optimizer refresh
references /
Project Tungsten, Tungsten phase 2
improvements /
Tungsten phase 2
reference link /
Catalyst Optimizer refresh
Project Tungsten 2
about /
Tungsten phase 2
improvements /
Tungsten phase 2
pseudo-algorithm
URL /
Clustering
PySpark
speeding up, with DataFrames /
Speeding up PySpark with DataFrames
pyspark.sql.DataFrame
URL /
Visualizing our flight-performance data
pyspark.sql.functions
URL /
Visualizing our flight-performance data
pyspark.sql.types
URL /
Descriptive statistics
PySpark performance, improving
URL /
Python to RDD communications
Python
communicating, to RDD /
Python to RDD communications
Python Dataset
URL /
Spark Dataset API
R
.reduce(...) method /
The .reduce(...) method
.repartition(...) transformation /
The .repartition(...) transformation
random forest
used, for predicting infant survival /
Random forest in MLlib
Receiver-Operating Characteristic (ROC)
about /
Logistic regression in MLlib
URL /
Logistic regression in MLlib
record schema
URL /
Creating RDDs
regression
about /
Regression, Regression
AFTSurvivalRegression /
Regression
DecisionTreeRegressor /
Regression
GBTRegressor /
Regression
GeneralizedLinearRegression /
Regression
IsotonicRegression /
Regression
LinearRegression /
Regression
RandomForestRegressor /
Regression
Regular Expressions
reference link /
Lambda expressions
Relational Database Management System (RDBMS) /
Polyglot persistence
relational database management system (RDBMS)
about /
Catalyst Optimizer refresh
relational databases
interacting, with /
Interacting with relational databases
Resilient Distributed Datasets (RDDs) /
Resilient Distributed Dataset
internal functions /
Internal workings of an RDD
creating /
Creating RDDs
schema /
Schema
files, reading from /
Reading from files
Lambda expressions /
Lambda expressions
global scope, versus local scope /
Global versus local scope
communicating, to Python /
Python to RDD communications
interoperating, with /
Interoperating with RDDs
schema, inferring with reflection /
Inferring the schema using reflection
schema, specifying programmatically /
Programmatically specifying the schema
Row object /
SQL query
S
.sample(...) transformation /
The .sample(...) transformation
.saveAsTextFile(...) method /
The .saveAsTextFile(...) method
S3 FileStream Wordcount (Databricks notebook)
URL /
Simple streaming application using DStreams
setup.py files
URL /
Structure of the module
spark-submit command
about /
The spark-submit command
URL /
The spark-submit command
command line parameters /
Command line parameters
Spark Dataset API
about /
Spark Dataset API
Spark Packages
URL /
Introducing TensorFrames
Spark performance
URL, for Scala vs Python /
User defined functions in Spark
spark rdd
reference link, for removing elements /
Lambda expressions
SparkSession
configuring /
Configuring your SparkSession
creating /
Creating SparkSession
about /
Introducing SparkSession
Spark Streaming
about /
What is Spark Streaming?
URL /
What is Spark Streaming?, A quick primer on global aggregations
reference link /
What is Spark Streaming?
need for /
Why do we need Spark Streaming?
references /
Why do we need Spark Streaming?
use cases /
Why do we need Spark Streaming?
application data flow /
What is the Spark Streaming application data flow?
DStreams, using /
Simple streaming application using DStreams
Spark Streaming, use cases
Streaming ETL /
Why do we need Spark Streaming?
triggers /
Why do we need Spark Streaming?
data enrichment /
Why do we need Spark Streaming?
complex sessions /
Why do we need Spark Streaming?
continuous learning /
Why do we need Spark Streaming?
SQL
querying, with /
Querying with SQL
number of rows /
Number of rows
filter statement, executing with where clause /
Running filter statements using the where Clauses
references /
Running filter statements using the where Clauses
Stateful Network Wordcount Python
URL /
A quick primer on global aggregations
Stateful Streaming
URL /
A quick primer on global aggregations
statistical model
reference link /
Descriptive statistics
stochastic gradient descent (SGD) /
Logistic regression in MLlib
Structured Streaming
about /
Structured Streaming, Introducing Structured Streaming
reference link /
Structured Streaming
URL /
Introducing Structured Streaming
Structuring Spark
URL /
Catalyst Optimizer refresh
T
.take(...) method /
The .take(...) method
TensorFlow
about /
What is TensorFlow?
URL /
What is TensorFlow?, Installing TensorFlow
pip, installing /
Installing Pip
installing /
Installing TensorFlow
matrix multiplication, with constants /
Matrix multiplication using constants
matrix multiplication, with placeholders /
Matrix multiplication using placeholders, Running another model
references /
Discussion
constant, adding /
Using TensorFlow to add a constant to an existing column
tensor graph, executing /
Executing the Tensor graph
TensorFrames
about /
Introducing TensorFrames
TensorFlow, utilizing with data /
Introducing TensorFrames
optimal hyperparameters, determining via parallel training /
Introducing TensorFrames
using /
TensorFrames – quick start
reference link /
TensorFrames – quick start, Using TensorFlow to add a constant to an existing column
configuration /
Configuration and setup
setup /
Configuration and setup
Spark cluster, launching /
Launching a Spark cluster
library, creating /
Creating a TensorFrames library
TensorFlow, installing on cluster /
Installing TensorFlow on your cluster
constant, adding with TensorFlow /
Using TensorFlow to add a constant to an existing column
block-wise reducing operations /
Blockwise reducing operations example
tf.reduce_min
about /
Computing elementwise sum and min of all vectors
URL /
Computing elementwise sum and min of all vectors
tf.reduce_sum
about /
Computing elementwise sum and min of all vectors
URL /
Computing elementwise sum and min of all vectors
topic mining
about /
Topic mining
top transfer airports
determining /
Determining the top transfer airports
Traffic Violations
URL /
Working with files
train-validation splitting
about /
Train-validation splitting
transform /
Internal workings of an RDD
transformations
reference link /
Resilient Distributed Dataset
about /
Transformations
URL, for methods /
Transformations
.map(...) /
The .map(...) transformation
.filter(...) /
The .filter(...) transformation
.flatMap(...) /
The .flatMap(...) transformation
.distinct() /
The .distinct(...) transformation
.sample(...) /
The .sample(...) transformation
.leftOuterJoin(...) /
The .leftOuterJoin(...) transformation
.repartition(...) /
The .repartition(...) transformation
Transformer class
about /
Transformer
Binarizer /
Transformer
ChiSqSelector /
Transformer
CountVectorizer /
Transformer
DCT /
Transformer
ElementwiseProduct /
Transformer
HashingTF /
Transformer
IDF /
Transformer
IndexToString /
Transformer
MaxAbsScaler /
Transformer
MinMaxScaler /
Transformer
NGram /
Transformer
Normalizer /
Transformer
OneHotEncoder /
Transformer
PCA /
Transformer
PolynomialExpansion /
Transformer
QuantileDiscretizer /
Transformer
RegexTokenizer /
Transformer
RFormula /
Transformer
SQLTransformer /
Transformer
StandardScaler /
Transformer
StopWordsRemover /
Transformer
StringIndexer /
Transformer
Tokenizer /
Transformer
VectorAssembler /
Transformer
VectorIndexer /
Transformer
VectorSlicer /
Transformer
Word2Vec /
Transformer
U
Uniform Resource Identifier (URI)
about /
Interacting with relational databases
V
vertex degrees
about /
Understanding vertex degrees
vertices /
Preparing your flights dataset
visualization
about /
Visualization
histograms /
Histograms
used, for interacting between features /
Interactions between features
VS14MORT.txt file
URL /
Creating RDDs
W
where clause
filter statement, executing /
Running filter statements using the where Clauses
Word count
URL /
A quick primer on global aggregations
Y
YARN cluster
queue parameter /
Command line parameters
num-executors parameter /
Command line parameters