Python: Advanced Predictive Analytics

Table of Contents

Credits

Preface

What this learning path covers

What you need for this learning path

Who this learning path is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Module 1

1. Getting Started with Predictive Modelling

Introducing predictive modelling

Scope of predictive modelling

Ensemble of statistical algorithms

Statistical tools

Historical data

Mathematical function

Business context

Knowledge matrix for predictive modelling

Task matrix for predictive modelling

Applications and examples of predictive modelling

LinkedIn's "People also viewed" feature

What does it do?

How is it done?

Correct targeting of online ads

How is it done?

Santa Cruz predictive policing

How is it done?

Determining the activity of a smartphone user using accelerometer data

How is it done?

Sport and fantasy leagues

How was it done?

Python and its packages – download and installation

Anaconda

Standalone Python

Installing a Python package

Installing pip

Installing Python packages with pip

Python and its packages for predictive modelling

IDEs for Python

Summary

2. Data Cleaning

Reading the data – variations and examples

Data frames

Delimiters

Various methods of importing data in Python

Case 1 – reading a dataset using the read_csv method

The read_csv method

Use cases of the read_csv method

Passing the directory address and filename as variables

Reading a .txt dataset with a comma delimiter

Specifying the column names of a dataset from a list

Case 2 – reading a dataset using the open method of Python

Reading a dataset line by line

Changing the delimiter of a dataset

Case 3 – reading data from a URL

Case 4 – miscellaneous cases

Reading from an .xls or .xlsx file

Writing to a CSV or Excel file

Basics – summary, dimensions, and structure

Handling missing values

Checking for missing values

What constitutes missing data?

How missing values are generated and propagated

Treating missing values

Deletion

Imputation

Creating dummy variables

Visualizing a dataset by basic plotting

Scatter plots

Histograms

Boxplots

Summary

3. Data Wrangling

Subsetting a dataset

Selecting columns

Selecting rows

Selecting a combination of rows and columns

Creating new columns

Generating random numbers and their usage

Various methods for generating random numbers

Seeding a random number

Generating random numbers following probability distributions

Probability density function

Cumulative distribution function

Uniform distribution

Normal distribution

Using the Monte-Carlo simulation to find the value of pi

Geometry and mathematics behind the calculation of pi

Generating a dummy data frame

Grouping the data – aggregation, filtering, and transformation

Aggregation

Filtering

Transformation

Miscellaneous operations

Random sampling – splitting a dataset in training and testing datasets

Method 1 – using the Customer Churn Model

Method 2 – using sklearn

Method 3 – using the shuffle function

Concatenating and appending data

Merging/joining datasets

Inner Join

Left Join

Right Join

An example of the Inner Join

An example of the Left Join

An example of the Right Join

Summary of Joins in terms of their length

Summary

4. Statistical Concepts for Predictive Modelling

Random sampling and the central limit theorem

Hypothesis testing

Null versus alternate hypothesis

Z-statistic and t-statistic

Confidence intervals, significance levels, and p-values

Different kinds of hypothesis test

A step-by-step guide to do a hypothesis test

An example of a hypothesis test

Chi-square tests

Correlation

Summary

5. Linear Regression with Python

Understanding the maths behind linear regression

Linear regression using simulated data

Fitting a linear regression model and checking its efficacy

Finding the optimum value of variable coefficients

Making sense of result parameters

p-values

F-statistics

Residual Standard Error

Implementing linear regression with Python

Linear regression using the statsmodels library

Multiple linear regression

Multi-collinearity

Variance Inflation Factor

Model validation

Training and testing data split

Summary of models

Linear regression with scikit-learn

Feature selection with scikit-learn

Handling other issues in linear regression

Handling categorical variables

Transforming a variable to fit non-linear relations

Handling outliers

Other considerations and assumptions for linear regression

Summary

6. Logistic Regression with Python

Linear regression versus logistic regression

Understanding the math behind logistic regression

Contingency tables

Conditional probability

Odds ratio

Moving on to logistic regression from linear regression

Estimation using the Maximum Likelihood Method

Likelihood function:

Log likelihood function:

Building the logistic regression model from scratch

Making sense of logistic regression parameters

Wald test

Likelihood Ratio Test statistic

Chi-square test

Implementing logistic regression with Python

Processing the data

Data exploration

Data visualization

Creating dummy variables for categorical variables

Feature selection

Implementing the model

Model validation and evaluation

Cross validation

Model validation

The ROC curve

Confusion matrix

Summary

7. Clustering with Python

Introduction to clustering – what, why, and how?

What is clustering?

How is clustering used?

Why do we do clustering?

Mathematics behind clustering

Distances between two observations

Euclidean distance

Manhattan distance

Minkowski distance

The distance matrix

Normalizing the distances

Linkage methods

Single linkage

Complete linkage

Average linkage

Centroid linkage

Ward's method

Hierarchical clustering

K-means clustering

Implementing clustering using Python

Importing and exploring the dataset

Normalizing the values in the dataset

Hierarchical clustering using scikit-learn

K-means clustering using scikit-learn

Interpreting the cluster

Fine-tuning the clustering

The elbow method

Silhouette Coefficient

Summary

8. Trees and Random Forests with Python

Introducing decision trees

A decision tree

Understanding the mathematics behind decision trees

Homogeneity

Entropy

Information gain

ID3 algorithm to create a decision tree

Gini index

Reduction in Variance

Pruning a tree

Handling a continuous numerical variable

Handling a missing value of an attribute

Implementing a decision tree with scikit-learn

Visualizing the tree

Cross-validating and pruning the decision tree

Understanding and implementing regression trees

Regression tree algorithm

Implementing a regression tree using Python

Understanding and implementing random forests

The random forest algorithm

Implementing a random forest using Python

Why do random forests work?

Important parameters for random forests

Summary

9. Best Practices for Predictive Modelling

Best practices for coding

Commenting the code

Defining functions for substantial individual tasks

Example 1

Example 2

Example 3

Avoid hard-coding of variables as much as possible

Version control

Using standard libraries, methods, and formulas

Best practices for data handling

Best practices for algorithms

Best practices for statistics

Best practices for business contexts

Summary

A. A List of Links

2. Module 2

1. From Data to Decisions – Getting Started with Analytic Applications

Designing an advanced analytic solution

Data layer: warehouses, lakes, and streams

Modeling layer

Deployment layer

Reporting layer

Case study: sentiment analysis of social media feeds

Data input and transformation

Sanity checking

Model development

Scoring

Visualization and reporting

Case study: targeted e-mail campaigns

Data input and transformation

Sanity checking

Model development

Scoring

Visualization and reporting

Summary

2. Exploratory Data Analysis and Visualization in Python

Exploring categorical and numerical data in IPython

Installing IPython notebook

The notebook interface

Loading and inspecting data

Basic manipulations – grouping, filtering, mapping, and pivoting

Charting with Matplotlib

Time series analysis

Cleaning and converting

Time series diagnostics

Joining signals and correlation

Working with geospatial data

Loading geospatial data

Working in the cloud

Introduction to PySpark

Creating the SparkContext

Creating an RDD

Creating a Spark DataFrame

Summary

3. Finding Patterns in the Noise – Clustering and Unsupervised Learning

Similarity and distance metrics

Numerical distance metrics

Correlation similarity metrics and time series

Similarity metrics for categorical data

K-means clustering

Affinity propagation – automatically choosing cluster numbers

k-medoids

Agglomerative clustering

Where agglomerative clustering fails

Streaming clustering in Spark

Summary

4. Connecting the Dots with Models – Regression Methods

Linear regression

Data preparation

Model fitting and evaluation

Statistical significance of regression outputs

Generalized estimating equations

Mixed effects models

Time series data

Generalized linear models

Applying regularization to linear models

Tree methods

Decision trees

Random forest

Scaling out with PySpark – predicting year of song release

Summary

5. Putting Data in its Place – Classification Methods and Analysis

Logistic regression

Multiclass logistic classifiers: multinomial regression

Formatting a dataset for classification problems

Learning pointwise updates with stochastic gradient descent

Jointly optimizing all parameters with second-order methods

Fitting the model

Evaluating classification models

Strategies for improving classification models

Separating nonlinear boundaries with support vector machines

Fitting an SVM to the census data

Boosting – combining small models to improve accuracy

Gradient boosted decision trees

Comparing classification methods

Case study: fitting classifier models in PySpark

Summary

6. Words and Pixels – Working with Unstructured Data

Working with textual data

Cleaning textual data

Extracting features from textual data

Using dimensionality reduction to simplify datasets

Principal component analysis

Latent Dirichlet Allocation

Using dimensionality reduction in predictive modeling

Images

Cleaning image data

Thresholding images to highlight objects

Dimensionality reduction for image analysis

Case study: training a recommender system in PySpark

Summary

7. Learning from the Bottom Up – Deep Networks and Unsupervised Features

Learning patterns with neural networks

A network of one – the perceptron

Combining perceptrons – a single-layer neural network

Parameter fitting with back-propagation

Discriminative versus generative models

Vanishing gradients and explaining away

Pretraining belief networks

Using dropout to regularize networks

Convolutional networks and rectified units

Compressing data with autoencoder networks

Optimizing the learning rate

The TensorFlow library and digit recognition

The MNIST data

Constructing the network

Summary

8. Sharing Models with Prediction Services

The architecture of a prediction service

Clients and making requests

The GET requests

The POST request

The HEAD request

The PUT request

The DELETE request

Server – the web traffic controller

Application – the engine of the predictive services

Persisting information with database systems

Case study – logistic regression service

Setting up the database

The web server

The web application

The flow of a prediction service – training a model

On-demand and bulk prediction

Summary

9. Reporting and Testing – Iterating on Analytic Systems

Checking the health of models with diagnostics

Evaluating changes in model performance

Changes in feature importance

Changes in unsupervised model performance

Iterating on models through A/B testing

Experimental allocation – assigning customers to experiments

Deciding a sample size

Multiple hypothesis testing

Guidelines for communication

Translate terms to business values

Visualizing results

Case study: building a reporting service

The report server

The report application

The visualization layer

Summary

Bibliography

Index
