by Joseph Babcock, Ashish Kumar
Python: Advanced Predictive Analytics
Table of Contents
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Getting Started with Predictive Modelling
Introducing predictive modelling
Scope of predictive modelling
Ensemble of statistical algorithms
Statistical tools
Historical data
Mathematical function
Business context
Knowledge matrix for predictive modelling
Task matrix for predictive modelling
Applications and examples of predictive modelling
LinkedIn's "People also viewed" feature
What it does?
How is it done?
Correct targeting of online ads
How is it done?
Santa Cruz predictive policing
How is it done?
Determining the activity of a smartphone user using accelerometer data
How is it done?
Sport and fantasy leagues
How was it done?
Python and its packages – download and installation
Anaconda
Standalone Python
Installing a Python package
Installing pip
Installing Python packages with pip
Python and its packages for predictive modelling
IDEs for Python
Summary
2. Data Cleaning
Reading the data – variations and examples
Data frames
Delimiters
Various methods of importing data in Python
Case 1 – reading a dataset using the read_csv method
The read_csv method
Use cases of the read_csv method
Passing the directory address and filename as variables
Reading a .txt dataset with a comma delimiter
Specifying the column names of a dataset from a list
Case 2 – reading a dataset using the open method of Python
Reading a dataset line by line
Changing the delimiter of a dataset
Case 3 – reading data from a URL
Case 4 – miscellaneous cases
Reading from an .xls or .xlsx file
Writing to a CSV or Excel file
Basics – summary, dimensions, and structure
Handling missing values
Checking for missing values
What constitutes missing data?
How missing values are generated and propagated
Treating missing values
Deletion
Imputation
Creating dummy variables
Visualizing a dataset by basic plotting
Scatter plots
Histograms
Boxplots
Summary
3. Data Wrangling
Subsetting a dataset
Selecting columns
Selecting rows
Selecting a combination of rows and columns
Creating new columns
Generating random numbers and their usage
Various methods for generating random numbers
Seeding a random number
Generating random numbers following probability distributions
Probability density function
Cumulative density function
Uniform distribution
Normal distribution
Using the Monte-Carlo simulation to find the value of pi
Geometry and mathematics behind the calculation of pi
Generating a dummy data frame
Grouping the data – aggregation, filtering, and transformation
Aggregation
Filtering
Transformation
Miscellaneous operations
Random sampling – splitting a dataset in training and testing datasets
Method 1 – using the Customer Churn Model
Method 2 – using sklearn
Method 3 – using the shuffle function
Concatenating and appending data
Merging/joining datasets
Inner Join
Left Join
Right Join
An example of the Inner Join
An example of the Left Join
An example of the Right Join
Summary of Joins in terms of their length
Summary
4. Statistical Concepts for Predictive Modelling
Random sampling and the central limit theorem
Hypothesis testing
Null versus alternate hypothesis
Z-statistic and t-statistic
Confidence intervals, significance levels, and p-values
Different kinds of hypothesis test
A step-by-step guide to do a hypothesis test
An example of a hypothesis test
Chi-square tests
Correlation
Summary
5. Linear Regression with Python
Understanding the maths behind linear regression
Linear regression using simulated data
Fitting a linear regression model and checking its efficacy
Finding the optimum value of variable coefficients
Making sense of result parameters
p-values
F-statistics
Residual Standard Error
Implementing linear regression with Python
Linear regression using the statsmodel library
Multiple linear regression
Multi-collinearity
Variance Inflation Factor
Model validation
Training and testing data split
Summary of models
Linear regression with scikit-learn
Feature selection with scikit-learn
Handling other issues in linear regression
Handling categorical variables
Transforming a variable to fit non-linear relations
Handling outliers
Other considerations and assumptions for linear regression
Summary
6. Logistic Regression with Python
Linear regression versus logistic regression
Understanding the math behind logistic regression
Contingency tables
Conditional probability
Odds ratio
Moving on to logistic regression from linear regression
Estimation using the Maximum Likelihood Method
Likelihood function:
Log likelihood function:
Building the logistic regression model from scratch
Making sense of logistic regression parameters
Wald test
Likelihood Ratio Test statistic
Chi-square test
Implementing logistic regression with Python
Processing the data
Data exploration
Data visualization
Creating dummy variables for categorical variables
Feature selection
Implementing the model
Model validation and evaluation
Cross validation
Model validation
The ROC curve
Confusion matrix
Summary
7. Clustering with Python
Introduction to clustering – what, why, and how?
What is clustering?
How is clustering used?
Why do we do clustering?
Mathematics behind clustering
Distances between two observations
Euclidean distance
Manhattan distance
Minkowski distance
The distance matrix
Normalizing the distances
Linkage methods
Single linkage
Complete linkage
Average linkage
Centroid linkage
Ward's method
Hierarchical clustering
K-means clustering
Implementing clustering using Python
Importing and exploring the dataset
Normalizing the values in the dataset
Hierarchical clustering using scikit-learn
K-Means clustering using scikit-learn
Interpreting the cluster
Fine-tuning the clustering
The elbow method
Silhouette Coefficient
Summary
8. Trees and Random Forests with Python
Introducing decision trees
A decision tree
Understanding the mathematics behind decision trees
Homogeneity
Entropy
Information gain
ID3 algorithm to create a decision tree
Gini index
Reduction in Variance
Pruning a tree
Handling a continuous numerical variable
Handling a missing value of an attribute
Implementing a decision tree with scikit-learn
Visualizing the tree
Cross-validating and pruning the decision tree
Understanding and implementing regression trees
Regression tree algorithm
Implementing a regression tree using Python
Understanding and implementing random forests
The random forest algorithm
Implementing a random forest using Python
Why do random forests work?
Important parameters for random forests
Summary
9. Best Practices for Predictive Modelling
Best practices for coding
Commenting the codes
Defining functions for substantial individual tasks
Example 1
Example 2
Example 3
Avoid hard-coding of variables as much as possible
Version control
Using standard libraries, methods, and formulas
Best practices for data handling
Best practices for algorithms
Best practices for statistics
Best practices for business contexts
Summary
A. A List of Links
2. Module 2
1. From Data to Decisions – Getting Started with Analytic Applications
Designing an advanced analytic solution
Data layer: warehouses, lakes, and streams
Modeling layer
Deployment layer
Reporting layer
Case study: sentiment analysis of social media feeds
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Case study: targeted e-mail campaigns
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Summary
2. Exploratory Data Analysis and Visualization in Python
Exploring categorical and numerical data in IPython
Installing IPython notebook
The notebook interface
Loading and inspecting data
Basic manipulations – grouping, filtering, mapping, and pivoting
Charting with Matplotlib
Time series analysis
Cleaning and converting
Time series diagnostics
Joining signals and correlation
Working with geospatial data
Loading geospatial data
Working in the cloud
Introduction to PySpark
Creating the SparkContext
Creating an RDD
Creating a Spark DataFrame
Summary
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
Similarity and distance metrics
Numerical distance metrics
Correlation similarity metrics and time series
Similarity metrics for categorical data
K-means clustering
Affinity propagation – automatically choosing cluster numbers
k-medoids
Agglomerative clustering
Where agglomerative clustering fails
Streaming clustering in Spark
Summary
4. Connecting the Dots with Models – Regression Methods
Linear regression
Data preparation
Model fitting and evaluation
Statistical significance of regression outputs
Generalized estimating equations
Mixed effects models
Time series data
Generalized linear models
Applying regularization to linear models
Tree methods
Decision trees
Random forest
Scaling out with PySpark – predicting year of song release
Summary
5. Putting Data in its Place – Classification Methods and Analysis
Logistic regression
Multiclass logistic classifiers: multinomial regression
Formatting a dataset for classification problems
Learning pointwise updates with stochastic gradient descent
Jointly optimizing all parameters with second-order methods
Fitting the model
Evaluating classification models
Strategies for improving classification models
Separating Nonlinear boundaries with Support vector machines
Fitting an SVM to the census data
Boosting – combining small models to improve accuracy
Gradient boosted decision trees
Comparing classification methods
Case study: fitting classifier models in PySpark
Summary
6. Words and Pixels – Working with Unstructured Data
Working with textual data
Cleaning textual data
Extracting features from textual data
Using dimensionality reduction to simplify datasets
Principal component analysis
Latent Dirichlet Allocation
Using dimensionality reduction in predictive modeling
Images
Cleaning image data
Thresholding images to highlight objects
Dimensionality reduction for image analysis
Case Study: Training a Recommender System in PySpark
Summary
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
Learning patterns with neural networks
A network of one – the perceptron
Combining perceptrons – a single-layer neural network
Parameter fitting with back-propagation
Discriminative versus generative models
Vanishing gradients and explaining away
Pretraining belief networks
Using dropout to regularize networks
Convolutional networks and rectified units
Compressing Data with autoencoder networks
Optimizing the learning rate
The TensorFlow library and digit recognition
The MNIST data
Constructing the network
Summary
8. Sharing Models with Prediction Services
The architecture of a prediction service
Clients and making requests
The GET requests
The POST request
The HEAD request
The PUT request
The DELETE request
Server – the web traffic controller
Application – the engine of the predictive services
Persisting information with database systems
Case study – logistic regression service
Setting up the database
The web server
The web application
The flow of a prediction service – training a model
On-demand and bulk prediction
Summary
9. Reporting and Testing – Iterating on Analytic Systems
Checking the health of models with diagnostics
Evaluating changes in model performance
Changes in feature importance
Changes in unsupervised model performance
Iterating on models through A/B testing
Experimental allocation – assigning customers to experiments
Deciding a sample size
Multiple hypothesis testing
Guidelines for communication
Translate terms to business values
Visualizing results
Case Study: building a reporting service
The report server
The report application
The visualization layer
Summary
Bibliography
Index