Index
A
- A/B testing, Testing Production Systems
- accuracy, Evaluating the Model, Relation to accuracy
- acknowledgments, From Andreas
- adjusted rand index (ARI), Evaluating clustering with ground truth
- agglomerative clustering
- algorithm chains and pipelines, Algorithm Chains and Pipelines-Summary and Outlook
- algorithm parameter, Estimating complexity in neural networks
- algorithms (see also models; problem solving)
- evaluating, Generalization, Overfitting, and Underfitting
- minimal code to apply to algorithm, Summary and Outlook
- sample datasets, Some Sample Datasets-Some Sample Datasets
- scaling
- MinMaxScaler, Preprocessing data for SVMs, Applying Data Transformations-The Effect of Preprocessing on Supervised Learning, DBSCAN, Interactions and Polynomials, Building Pipelines, Grid-Searching Which Model To Use
- Normalizer, Different Kinds of Preprocessing
- RobustScaler, Different Kinds of Preprocessing
- StandardScaler, Tuning neural networks, Different Kinds of Preprocessing, Scaling Training and Test Data the Same Way, Applying PCA to the cancer dataset for visualization, Eigenfaces for feature extraction, DBSCAN-Evaluating clustering without ground truth, Convenient Pipeline Creation with make_pipeline-Grid-Searching Which Model To Use
- supervised, classification
- decision trees, Decision Trees-Strengths, weaknesses, and parameters
- gradient boosting, Gradient boosted regression trees (gradient boosting machines)-Gradient boosted regression trees (gradient boosting machines), Uncertainty Estimates from Classifiers, Uncertainty in Multiclass Classification
- k-nearest neighbors, k-Nearest Neighbors-Strengths, weaknesses, and parameters
- kernelized support vector machines, Kernelized Support Vector Machines-Strengths, weaknesses, and parameters
- linear SVMs, Linear models for classification
- logistic regression, Linear models for classification
- naive Bayes, Naive Bayes Classifiers-Strengths, weaknesses, and parameters
- neural networks, Neural Networks (Deep Learning)-Estimating complexity in neural networks
- random forests, Building random forests-Strengths, weaknesses, and parameters
- supervised, regression
- decision trees, Decision Trees-Strengths, weaknesses, and parameters
- gradient boosting, Gradient boosted regression trees (gradient boosting machines)-Gradient boosted regression trees (gradient boosting machines)
- k-nearest neighbors, k-neighbors regression
- Lasso, Lasso-Lasso
- linear regression (OLS), Linear regression (aka ordinary least squares), Binning, Discretization, Linear Models, and Trees-Interactions and Polynomials
- neural networks, Neural Networks (Deep Learning)-Estimating complexity in neural networks
- random forests, Building random forests-Strengths, weaknesses, and parameters
- Ridge, Ridge regression-Lasso, Strengths, weaknesses, and parameters, Tuning neural networks, Interactions and Polynomials, Univariate Nonlinear Transformations, Using Pipelines in Grid Searches, Grid-Searching Preprocessing Steps and Model Parameters-Grid-Searching Preprocessing Steps and Model Parameters
- unsupervised, clustering
- unsupervised, manifold learning
- unsupervised, signal decomposition
- alpha parameter in linear models, Ridge regression
- Anaconda, Installing scikit-learn
- analysis of variance (ANOVA), Univariate Statistics
- area under the curve (AUC), Receiver operating characteristics (ROC) and AUC-Receiver operating characteristics (ROC) and AUC
- attributions, Using Code Examples
- average precision, Precision-recall curves and ROC curves
B
- bag-of-words representation
- BernoulliNB, Naive Bayes Classifiers
- bigrams, Bag-of-Words with More Than One Word (n-Grams)
- binary classification, Classification and Regression, Linear models for classification, Metrics for Binary Classification-Receiver operating characteristics (ROC) and AUC
- binning, Applying PCA to the cancer dataset for visualization, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- bootstrap samples, Building random forests
- Boston Housing dataset, Some Sample Datasets
- boundary points, DBSCAN
- Bunch objects, Some Sample Datasets
- business metric, Keep the End Goal in Mind, Approaching a Machine Learning Problem
C
- C parameter in SVC, Tuning SVM parameters
- calibration, Taking uncertainty into account
- cancer dataset, Some Sample Datasets
- categorical features
- categorical variables (see categorical features)
- chaining (see algorithm chains and pipelines)
- class labels, Classification and Regression
- classification problems
- classifiers
- DecisionTreeClassifier, Controlling complexity of decision trees, Imbalanced datasets
- DecisionTreeRegressor, Controlling complexity of decision trees, Feature importance in trees
- KNeighborsClassifier, Building Your First Model: k-Nearest Neighbors-Summary and Outlook, k-Neighbors classification-k-neighbors regression
- KNeighborsRegressor, k-neighbors regression-Linear models for regression
- LinearSVC, Linear models for classification-Linear models for classification, Linear models for multiclass classification, Strengths, weaknesses, and parameters, Naive Bayes Classifiers
- LogisticRegression, Linear models for classification-Linear models for classification, Strengths, weaknesses, and parameters, Summary and Outlook, Cross-Validation in scikit-learn, Imbalanced datasets, Accessing Attributes in a Pipeline inside GridSearchCV, Bag-of-Words for Movie Reviews-Advanced Tokenization, Stemming, and Lemmatization
- MLPClassifier, The neural network model-Estimating complexity in neural networks
- naive Bayes, Naive Bayes Classifiers-Strengths, weaknesses, and parameters
- SVC, Linear models for classification, Tuning SVM parameters, Applying Data Transformations, The Effect of Preprocessing on Supervised Learning, Grid Search, Analyzing the result of cross-validation-Search over spaces that are not grids, Nested cross-validation, Algorithm Chains and Pipelines-Using Pipelines in Grid Searches, Convenient Pipeline Creation with make_pipeline-Grid-Searching Which Model To Use
- uncertainty estimates from, Uncertainty Estimates from Classifiers-Summary and Outlook
- cluster centers, k-Means Clustering
- clustering algorithms
- agglomerative clustering, Agglomerative Clustering-Hierarchical clustering and dendrograms
- applications for, Types of Unsupervised Learning
- comparing on faces dataset, Comparing algorithms on the faces dataset-Analyzing the faces dataset with agglomerative clustering
- DBSCAN, DBSCAN-DBSCAN
- evaluating with ground truth, Comparing and Evaluating Clustering Algorithms-Evaluating clustering without ground truth
- evaluating without ground truth, Evaluating clustering without ground truth-Evaluating clustering without ground truth
- goals of, Clustering
- k-means clustering, k-Means Clustering-Vector quantization, or seeing k-means as decomposition
- summary of, Summary of Clustering Methods
- code examples
- coef_ attribute, Linear regression (aka ordinary least squares), Ridge regression
- comments and questions, How to Contact Us
- competitions, Honing Your Skills
- conflation, Advanced Tokenization, Stemming, and Lemmatization
- confusion matrices, Confusion matrices-Precision, recall, and f-score
- context, Bag-of-Words with More Than One Word (n-Grams)
- continuous features, Representing Data and Engineering Features, Numbers Can Encode Categoricals
- core samples/core points, DBSCAN
- corpus, Types of Data Represented as Strings
- cos function, Univariate Nonlinear Transformations
- CountVectorizer, Bag-of-Words for Movie Reviews
- cross-validation
- analyzing results of, Analyzing the result of cross-validation-Analyzing the result of cross-validation
- benefits of, Benefits of Cross-Validation
- cross-validation splitters, More control over cross-validation
- grid search and, Grid Search with Cross-Validation
- in scikit-learn, Cross-Validation in scikit-learn
- leave-one-out cross-validation, Leave-one-out cross-validation
- nested, Nested cross-validation
- parallelizing with grid search, Parallelizing cross-validation and grid search
- principle of, Cross-Validation
- purpose of, Benefits of Cross-Validation
- shuffle-split cross-validation, Shuffle-split cross-validation
- stratified k-fold, Stratified k-Fold Cross-Validation and Other Strategies-Stratified k-Fold Cross-Validation and Other Strategies
- with groups, Cross-validation with groups
- cross_val_score function, Benefits of Cross-Validation, Parameter Selection with Preprocessing
D
- data points, defined, Problems Machine Learning Can Solve
- data representation, Representing Data and Engineering Features-Summary and Outlook (see also feature extraction/feature engineering; text data)
- automatic feature selection, Automatic Feature Selection-Iterative Feature Selection
- binning and, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- categorical features, Categorical Variables-Binning, Discretization, Linear Models, and Trees
- effect on model performance, Representing Data and Engineering Features
- integer features, Numbers Can Encode Categoricals
- model complexity vs. dataset size, Relation of Model Complexity to Dataset Size
- overview of, Summary and Outlook
- table analogy, Problems Machine Learning Can Solve
- in training vs. test sets, Checking string-encoded categorical data
- understanding your data, Knowing Your Task and Knowing Your Data
- univariate nonlinear transformations, Univariate Nonlinear Transformations-Univariate Nonlinear Transformations
- data transformations, Applying Data Transformations
- data-driven research, Introduction
- DBSCAN
- decision boundaries, Analyzing KNeighborsClassifier, Linear models for classification
- decision function, The Decision Function
- decision trees
- analyzing, Analyzing decision trees
- building, Building decision trees
- controlling complexity of, Controlling complexity of decision trees
- data representation and, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- feature importance in, Feature importance in trees
- if/else structure of, Decision Trees
- parameters, Strengths, weaknesses, and parameters
- vs. random forests, Random forests
- strengths and weaknesses, Strengths, weaknesses, and parameters
- decision_function, Taking uncertainty into account
- deep learning (see neural networks)
- dendrograms, Hierarchical clustering and dendrograms
- dense regions, DBSCAN
- dimensionality reduction, Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF)
- discrete features, Representing Data and Engineering Features
- discretization, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- distributed computing, Other Machine Learning Frameworks and Packages
- document clustering, Topic Modeling and Document Clustering
- documents, defined, Types of Data Represented as Strings
- dual_coef_ attribute, Understanding SVMs
F
- f(x)=y formula, Measuring Success: Training and Testing Data
- facial recognition, Eigenfaces for feature extraction, Applying NMF to face images
- factor analysis (FA), Applying NMF to face images
- false positive rate (FPR), Receiver operating characteristics (ROC) and AUC
- false positive/false negative errors, Kinds of errors
- feature extraction/feature engineering, Representing Data and Engineering Features-Summary and Outlook (see also data representation; text data)
- augmenting data with, Representing Data and Engineering Features
- automatic feature selection, Automatic Feature Selection-Iterative Feature Selection
- categorical features, Categorical Variables-Binning, Discretization, Linear Models, and Trees
- continuous vs. discrete features, Representing Data and Engineering Features
- defined, Problems Machine Learning Can Solve, Some Sample Datasets, Representing Data and Engineering Features
- interaction features, Interactions and Polynomials-Interactions and Polynomials
- with non-negative matrix factorization, Non-Negative Matrix Factorization (NMF)
- overview of, Summary and Outlook
- polynomial features, Interactions and Polynomials-Interactions and Polynomials
- with principal component analysis, Eigenfaces for feature extraction
- univariate nonlinear transformations, Univariate Nonlinear Transformations-Univariate Nonlinear Transformations
- using expert knowledge, Utilizing Expert Knowledge-Summary and Outlook
- feature importance, Feature importance in trees
- features, defined, Problems Machine Learning Can Solve
- feature_names attribute, Some Sample Datasets
- feed-forward neural networks, Neural Networks (Deep Learning)
- fit method, Building Your First Model: k-Nearest Neighbors, Strengths, weaknesses, and parameters, Estimating complexity in neural networks, Applying Data Transformations
- fit_transform method, Scaling Training and Test Data the Same Way
- floating-point numbers, Classification and Regression
- folds, Cross-Validation
- forge dataset, Some Sample Datasets
- frameworks, Other Machine Learning Frameworks and Packages
- free string data, Types of Data Represented as Strings
- freeform text data, Types of Data Represented as Strings
G
- gamma parameter, Tuning SVM parameters
- Gaussian kernels of SVC, The kernel trick, Tuning SVM parameters
- GaussianNB, Naive Bayes Classifiers
- generalization
- get_dummies function, Numbers Can Encode Categoricals
- get_support method of feature selection, Univariate Statistics
- gradient boosted regression trees
- for feature selection, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- learning_rate parameter, Gradient boosted regression trees (gradient boosting machines)
- parameters, Strengths, weaknesses, and parameters
- vs. random forests, Gradient boosted regression trees (gradient boosting machines)
- strengths and weaknesses, Strengths, weaknesses, and parameters
- training set accuracy, Gradient boosted regression trees (gradient boosting machines)
- graphviz module, Analyzing decision trees
- grid search
- accessing pipeline attributes, Accessing Attributes in a Pipeline inside GridSearchCV
- alternate strategies for, Using different cross-validation strategies with grid search
- avoiding overfitting, The Danger of Overfitting the Parameters and the Validation Set
- model selection with, Grid-Searching Which Model To Use
- nested cross-validation, Nested cross-validation
- parallelizing with cross-validation, Parallelizing cross-validation and grid search
- pipeline preprocessing, Grid-Searching Preprocessing Steps and Model Parameters
- searching non-grid spaces, Search over spaces that are not grids
- simple example of, Simple Grid Search
- tuning parameters with, Grid Search
- using pipelines in, Using Pipelines in Grid Searches-Using Pipelines in Grid Searches
- with cross-validation, Grid Search with Cross-Validation
- GridSearchCV
H
- handcoded rules, disadvantages of, Why Machine Learning?
- heat maps, Applying PCA to the cancer dataset for visualization
- hidden layers, The neural network model
- hidden units, The neural network model
- hierarchical clustering, Hierarchical clustering and dendrograms
- high recall, Receiver operating characteristics (ROC) and AUC
- high-dimensional datasets, Some Sample Datasets
- histograms, Applying PCA to the cancer dataset for visualization
- hit rate, Precision, recall, and f-score
- hold-out sets, Measuring Success: Training and Testing Data
- human involvement/oversight, Humans in the Loop
I
- imbalanced datasets, Imbalanced datasets
- independent component analysis (ICA), Applying NMF to face images
- inference, Probabilistic Modeling, Inference, and Probabilistic Programming
- information leakage, Using Pipelines in Grid Searches
- information retrieval (IR), Types of Data Represented as Strings
- integer features, Numbers Can Encode Categoricals
- "intelligent" applications, Why Machine Learning?
- interactions, Some Sample Datasets, Interactions and Polynomials-Interactions and Polynomials
- intercept_ attribute, Linear regression (aka ordinary least squares)
- iris classification application
- iterative feature selection, Iterative Feature Selection
K
- k-fold cross-validation, Cross-Validation
- k-means clustering
- applying with scikit-learn, k-Means Clustering
- vs. classification, k-Means Clustering
- cluster centers, k-Means Clustering
- complex datasets, Vector quantization, or seeing k-means as decomposition
- evaluating and comparing, Evaluating clustering with ground truth
- example of, k-Means Clustering
- failures of, Failure cases of k-means
- strengths and weaknesses, Vector quantization, or seeing k-means as decomposition
- vector quantization with, Vector quantization, or seeing k-means as decomposition
- k-nearest neighbors (k-NN)
- Kaggle, Honing Your Skills
- kernelized support vector machines (SVMs)
- kernel trick, The kernel trick
- linear models and nonlinear features, Linear models and nonlinear features
- vs. linear support vector machines, Kernelized Support Vector Machines
- mathematics of, Kernelized Support Vector Machines
- parameters, Strengths, weaknesses, and parameters
- predictions with, Understanding SVMs
- preprocessing data for, Preprocessing data for SVMs
- strengths and weaknesses, Strengths, weaknesses, and parameters
- tuning SVM parameters, Tuning SVM parameters
- understanding, Understanding SVMs
- knn object, Building Your First Model: k-Nearest Neighbors
L
- L1 regularization, Lasso
- L2 regularization, Ridge regression, Linear models for classification, Strengths, weaknesses, and parameters
- Lasso model, Lasso
- Latent Dirichlet Allocation (LDA), Latent Dirichlet Allocation-Latent Dirichlet Allocation
- leaves, Decision Trees
- leakage, Using Pipelines in Grid Searches
- learn from the past approach, Utilizing Expert Knowledge
- learning_rate parameter, Gradient boosted regression trees (gradient boosting machines)
- leave-one-out cross-validation, Leave-one-out cross-validation
- lemmatization, Advanced Tokenization, Stemming, and Lemmatization-Advanced Tokenization, Stemming, and Lemmatization
- linear functions, Linear models for classification
- linear models
- classification, Linear models for classification
- data representation and, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- vs. k-nearest neighbors, Linear models for regression
- Lasso, Lasso
- linear SVMs, Linear models for classification
- logistic regression, Linear models for classification
- multiclass classification, Linear models for multiclass classification
- ordinary least squares, Linear regression (aka ordinary least squares)
- parameters, Strengths, weaknesses, and parameters
- predictions with, Linear Models
- regression, Linear models for regression
- ridge regression, Ridge regression
- strengths and weaknesses, Strengths, weaknesses, and parameters
- linear regression, Linear regression (aka ordinary least squares), Interactions and Polynomials-Interactions and Polynomials
- linear support vector machines (SVMs), Linear models for classification
- linkage arrays, Hierarchical clustering and dendrograms
- live testing, Testing Production Systems
- log function, Univariate Nonlinear Transformations
- loss functions, Linear models for classification
- low-dimensional datasets, Some Sample Datasets
M
- machine learning
- algorithm chains and pipelines, Algorithm Chains and Pipelines-Summary and Outlook
- applications for, Why Machine Learning?-Knowing Your Task and Knowing Your Data
- approach to problem solving, Approaching a Machine Learning Problem-Honing Your Skills
- benefits of Python for, Why Python?
- building your own systems, Preface
- data representation, Representing Data and Engineering Features-Summary and Outlook
- examples of, Introduction, A First Application: Classifying Iris Species-Evaluating the Model
- mathematics of, Who Should Read This Book
- model evaluation and improvement, Model Evaluation and Improvement-Summary and Outlook
- preprocessing and scaling, Preprocessing and Scaling-The Effect of Preprocessing on Supervised Learning
- prerequisites to learning, Who Should Read This Book
- resources, Online Resources, Where to Go from Here-Honing Your Skills
- scikit-learn and, scikit-learn-Versions Used in this Book
- supervised learning, Supervised Learning-Summary and Outlook
- understanding your data, Knowing Your Task and Knowing Your Data
- unsupervised learning, Unsupervised Learning and Preprocessing-Summary and Outlook
- working with text data, Working with Text Data-Summary and Outlook
- make_pipeline function
- manifold learning algorithms
- mathematical functions for feature transformations, Univariate Nonlinear Transformations
- matplotlib, matplotlib
- max_features parameter, Building random forests
- meta-estimators for trees and forests, Grid Search with Cross-Validation
- method chaining, Strengths, weaknesses, and parameters
- metrics (see evaluation metrics and scoring)
- mglearn, mglearn
- MLlib, Other Machine Learning Frameworks and Packages
- model-based feature selection, Model-Based Feature Selection
- models (see also algorithms)
- calibrated, Taking uncertainty into account
- capable of generalization, Generalization, Overfitting, and Underfitting
- coefficients with text data, Investigating Model Coefficients-Advanced Tokenization, Stemming, and Lemmatization
- complexity vs. dataset size, Relation of Model Complexity to Dataset Size
- cross-validation of, Cross-Validation-Cross-validation with groups
- effect of data representation choices on, Representing Data and Engineering Features
- evaluation and improvement, Model Evaluation and Improvement-Model Evaluation and Improvement
- evaluation metrics and scoring, Evaluation Metrics and Scoring-Summary and Outlook
- iris classification application, A First Application: Classifying Iris Species-Evaluating the Model
- overfitting vs. underfitting, Generalization, Overfitting, and Underfitting
- pipeline preprocessing and, Grid-Searching Preprocessing Steps and Model Parameters
- selecting, Using Evaluation Metrics in Model Selection
- selecting with grid search, Grid-Searching Which Model To Use
- theory behind, Theory
- tuning parameters with grid search, Grid Search
- movie reviews, Example Application: Sentiment Analysis of Movie Reviews
- multiclass classification
- multilayer perceptrons (MLPs), Neural Networks (Deep Learning)
- MultinomialNB, Naive Bayes Classifiers
N
- n-grams, Bag-of-Words with More Than One Word (n-Grams)
- naive Bayes classifiers
- natural language processing (NLP), Types of Data Represented as Strings, Summary and Outlook
- negative class, Classification and Regression
- nested cross-validation, Nested cross-validation
- Netflix prize challenge, Ranking, Recommender Systems, and Other Kinds of Learning
- neural networks (deep learning)
- non-negative matrix factorization (NMF)
- normalization, Advanced Tokenization, Stemming, and Lemmatization
- normalized mutual information (NMI), Evaluating clustering with ground truth
- NumPy (Numeric Python) library, NumPy
O
- offline evaluation, Testing Production Systems
- one-hot-encoding, One-Hot-Encoding (Dummy Variables)-Checking string-encoded categorical data
- one-out-of-N encoding, One-Hot-Encoding (Dummy Variables)-Checking string-encoded categorical data
- one-vs.-rest approach, Linear models for multiclass classification
- online resources, Online Resources
- online testing, Testing Production Systems
- OpenML platform, Honing Your Skills
- operating points, Precision-recall curves and ROC curves
- ordinary least squares (OLS), Linear models for regression
- out-of-core learning, Scaling to Larger Datasets
- outlier detection, Analyzing the faces dataset with DBSCAN
- overfitting, Generalization, Overfitting, and Underfitting, The Danger of Overfitting the Parameters and the Validation Set
P
- pair plots, First Things First: Look at Your Data
- pandas
- parallelization over a cluster, Scaling to Larger Datasets
- permissions, Using Code Examples
- pipelines (see algorithm chains and pipelines)
- polynomial features, Interactions and Polynomials-Interactions and Polynomials
- polynomial kernels, The kernel trick
- polynomial regression, Interactions and Polynomials
- positive class, Classification and Regression
- POSIX time, Utilizing Expert Knowledge
- pre- and post-pruning, Controlling complexity of decision trees
- precision, Precision, recall, and f-score, Humans in the Loop
- precision-recall curves, Precision-recall curves and ROC curves-Precision-recall curves and ROC curves
- predict for the future approach, Utilizing Expert Knowledge
- predict method, Making Predictions, k-Neighbors classification, Strengths, weaknesses, and parameters, Grid Search with Cross-Validation
- predict_proba function, Predicting Probabilities, Taking uncertainty into account
- preprocessing, Preprocessing and Scaling-The Effect of Preprocessing on Supervised Learning
- principal component analysis (PCA)
- probabilistic modeling, Probabilistic Modeling, Inference, and Probabilistic Programming
- probabilistic programming, Probabilistic Modeling, Inference, and Probabilistic Programming
- problem solving
- production systems
- pruning for decision trees, Controlling complexity of decision trees
- pseudorandom number generators, Measuring Success: Training and Testing Data
- pure leaves, Building decision trees
- PyMC language, Probabilistic Modeling, Inference, and Probabilistic Programming
- Python
R
- R language, Other Machine Learning Frameworks and Packages
- radial basis function (RBF) kernel, The kernel trick
- random forests
- analyzing, Analyzing random forests
- building, Building random forests
- data representation and, Binning, Discretization, Linear Models, and Trees-Binning, Discretization, Linear Models, and Trees
- vs. decision trees, Random forests
- vs. gradient boosted regression trees, Gradient boosted regression trees (gradient boosting machines)
- parameters, Strengths, weaknesses, and parameters
- predictions with, Building random forests
- randomization in, Random forests
- strengths and weaknesses, Strengths, weaknesses, and parameters
- random_state parameter, Measuring Success: Training and Testing Data
- ranking, Ranking, Recommender Systems, and Other Kinds of Learning
- real numbers, Classification and Regression
- recall, Precision, recall, and f-score
- receiver operating characteristics (ROC) curves, Receiver operating characteristics (ROC) and AUC-Receiver operating characteristics (ROC) and AUC
- recommender systems, Ranking, Recommender Systems, and Other Kinds of Learning
- rectified linear unit (relu), The neural network model
- rectifying nonlinearity, The neural network model
- recurrent neural networks (RNNs), Summary and Outlook
- recursive feature elimination (RFE), Iterative Feature Selection
- regression
- regression problems
- Boston Housing dataset, Some Sample Datasets
- vs. classification problems, Classification and Regression
- evaluation metrics and scoring, Regression Metrics
- examples of, Classification and Regression
- goals for, Classification and Regression
- k-nearest neighbors, k-neighbors regression
- Lasso, Lasso
- linear models, Linear models for regression
- ridge regression, Ridge regression
- wave dataset illustration, Some Sample Datasets
- regularization
- rescaling
- resources, Online Resources
- ridge regression, Ridge regression
- robustness-based clustering, Evaluating clustering without ground truth
- roots, Building decision trees
S
- samples, defined, Problems Machine Learning Can Solve
- scaling, Preprocessing and Scaling-The Effect of Preprocessing on Supervised Learning
- scatter plots, First Things First: Look at Your Data
- scikit-learn
- alternate frameworks, Other Machine Learning Frameworks and Packages
- benefits of, scikit-learn
- Bunch objects, Some Sample Datasets
- cancer dataset, Some Sample Datasets
- core code for, Summary and Outlook
- data and labels in, Measuring Success: Training and Testing Data
- documentation, scikit-learn
- feature_names attribute, Some Sample Datasets
- fit method, Building Your First Model: k-Nearest Neighbors, Strengths, weaknesses, and parameters, Estimating complexity in neural networks, Applying Data Transformations
- fit_transform method, Scaling Training and Test Data the Same Way
- installing, Installing scikit-learn
- knn object, Building Your First Model: k-Nearest Neighbors
- libraries and tools, Essential Libraries and Tools-mglearn
- predict method, Making Predictions, k-Neighbors classification, Strengths, weaknesses, and parameters
- Python 2 vs. Python 3, Python 2 Versus Python 3
- random_state parameter, Measuring Success: Training and Testing Data
- scaling mechanisms in, The Effect of Preprocessing on Supervised Learning
- score method, Evaluating the Model, k-Neighbors classification, k-neighbors regression
- transform method, Applying Data Transformations
- user guide, scikit-learn
- versions used, Versions Used in this Book
- scikit-learn classes and functions
- accuracy_score, Evaluating clustering with ground truth
- adjusted_rand_score, Evaluating clustering with ground truth
- AgglomerativeClustering, Agglomerative Clustering, Evaluating clustering with ground truth, Analyzing the faces dataset with agglomerative clustering-Analyzing the faces dataset with agglomerative clustering
- average_precision_score, Precision-recall curves and ROC curves
- BaseEstimator, Building Your Own Estimator
- classification_report, Precision, recall, and f-score-Taking uncertainty into account, Metrics for Multiclass Classification
- confusion_matrix, Confusion matrices-Regression Metrics
- CountVectorizer, Applying Bag-of-Words to a Toy Dataset-Summary and Outlook
- cross_val_score, Cross-Validation in scikit-learn, More control over cross-validation, Using Evaluation Metrics in Model Selection, Parameter Selection with Preprocessing, Building Your Own Estimator
- DBSCAN, DBSCAN-DBSCAN
- DecisionTreeClassifier, Controlling complexity of decision trees, Imbalanced datasets
- DecisionTreeRegressor, Controlling complexity of decision trees, Feature importance in trees
- DummyClassifier, Imbalanced datasets
- ElasticNet class, Lasso
- ENGLISH_STOP_WORDS, Stopwords
- Estimator, Building Your First Model: k-Nearest Neighbors
- export_graphviz, Analyzing decision trees
- f1_score, Precision, recall, and f-score, Precision-recall curves and ROC curves
- fetch_lfw_people, Eigenfaces for feature extraction
- f_regression, Univariate Statistics, Using Pipelines in Grid Searches
- GradientBoostingClassifier, Gradient boosted regression trees (gradient boosting machines)-Gradient boosted regression trees (gradient boosting machines), Uncertainty Estimates from Classifiers, Uncertainty in Multiclass Classification
- GridSearchCV, Grid Search with Cross-Validation, Using Evaluation Metrics in Model Selection-Using Evaluation Metrics in Model Selection, Algorithm Chains and Pipelines-Using Pipelines in Grid Searches, Accessing Attributes in a Pipeline inside GridSearchCV-Grid-Searching Which Model To Use, Building Your Own Estimator
- GroupKFold, Cross-validation with groups
- KFold, More control over cross-validation, Cross-validation with groups
- KMeans, Failure cases of k-means-Vector quantization, or seeing k-means as decomposition
- KNeighborsClassifier, Building Your First Model: k-Nearest Neighbors-Summary and Outlook, k-Neighbors classification-k-neighbors regression
- KNeighborsRegressor, k-neighbors regression-Linear models for regression
- Lasso, Lasso-Lasso
- LatentDirichletAllocation, Latent Dirichlet Allocation
- LeaveOneOut, Leave-one-out cross-validation
- LinearRegression, Linear regression (aka ordinary least squares)-Linear models for classification, Feature importance in trees, Utilizing Expert Knowledge
- LinearSVC, Linear models for classification-Linear models for classification, Linear models for multiclass classification, Strengths, weaknesses, and parameters, Naive Bayes Classifiers
- load_boston, Some Sample Datasets, Interactions and Polynomials, Grid-Searching Preprocessing Steps and Model Parameters
- load_breast_cancer, Some Sample Datasets, Analyzing KNeighborsClassifier, Linear models for classification, Controlling complexity of decision trees, Applying Data Transformations, Applying PCA to the cancer dataset for visualization, Univariate Statistics, Algorithm Chains and Pipelines
- load_digits, Manifold Learning with t-SNE, Imbalanced datasets
- load_iris, Meet the Data, Uncertainty in Multiclass Classification, Cross-Validation in scikit-learn
- LogisticRegression, Linear models for classification-Linear models for classification, Strengths, weaknesses, and parameters, Summary and Outlook, Cross-Validation in scikit-learn, Imbalanced datasets, Accessing Attributes in a Pipeline inside GridSearchCV, Bag-of-Words for Movie Reviews-Advanced Tokenization, Stemming, and Lemmatization
- make_blobs, Linear models and nonlinear features, Uncertainty Estimates from Classifiers, Scaling Training and Test Data the Same Way, Failure cases of k-means-Agglomerative Clustering, DBSCAN, Taking uncertainty into account
- make_circles, Uncertainty Estimates from Classifiers
- make_moons, Analyzing random forests, Tuning neural networks, Failure cases of k-means, DBSCAN-Evaluating clustering without ground truth
- make_pipeline, Convenient Pipeline Creation with make_pipeline-Grid-Searching Preprocessing Steps and Model Parameters
- MinMaxScaler, Preprocessing data for SVMs, Different Kinds of Preprocessing, Applying Data Transformations-The Effect of Preprocessing on Supervised Learning, DBSCAN, Interactions and Polynomials, Building Pipelines, Using Pipelines in Grid Searches, Grid-Searching Which Model To Use
- MLPClassifier, The neural network model-Estimating complexity in neural networks
- NMF, Dimensionality Reduction, Feature Extraction, and Manifold Learning, Applying NMF to face images-Applying NMF to face images, Vector quantization, or seeing k-means as decomposition-Agglomerative Clustering, Latent Dirichlet Allocation
- Normalizer, Different Kinds of Preprocessing
- OneHotEncoder, Numbers Can Encode Categoricals, Utilizing Expert Knowledge
- ParameterGrid, Nested cross-validation
- PCA, Principal Component Analysis (PCA)-Manifold Learning with t-SNE, Vector quantization, or seeing k-means as decomposition, Comparing algorithms on the faces dataset-Analyzing the faces dataset with agglomerative clustering, The General Pipeline Interface-Accessing Step Attributes, Latent Dirichlet Allocation
- Pipeline, Algorithm Chains and Pipelines-Grid-Searching Which Model To Use, Summary and Outlook
- PolynomialFeatures, Interactions and Polynomials-Interactions and Polynomials, Utilizing Expert Knowledge, Grid-Searching Preprocessing Steps and Model Parameters
- precision_recall_curve, Precision-recall curves and ROC curves-Precision-recall curves and ROC curves
- RandomForestClassifier, Building random forests-Analyzing random forests, Model-Based Feature Selection, Precision-recall curves and ROC curves, Grid-Searching Which Model To Use
- RandomForestRegressor, Building random forests, Interactions and Polynomials, Model-Based Feature Selection
- RFE, Iterative Feature Selection-Iterative Feature Selection
- Ridge, Ridge regression, Strengths, weaknesses, and parameters, Tuning neural networks, Interactions and Polynomials, Univariate Nonlinear Transformations, Using Pipelines in Grid Searches, Grid-Searching Preprocessing Steps and Model Parameters-Grid-Searching Preprocessing Steps and Model Parameters
- RobustScaler, Different Kinds of Preprocessing
- roc_auc_score, Receiver operating characteristics (ROC) and AUC-Using Evaluation Metrics in Model Selection
- roc_curve, Receiver operating characteristics (ROC) and AUC-Receiver operating characteristics (ROC) and AUC
- SCORERS, Using Evaluation Metrics in Model Selection
- SelectFromModel, Model-Based Feature Selection
- SelectPercentile, Univariate Statistics, Using Pipelines in Grid Searches
- ShuffleSplit, Shuffle-split cross-validation
- silhouette_score, Evaluating clustering without ground truth
- StandardScaler, Tuning neural networks, Different Kinds of Preprocessing, Scaling Training and Test Data the Same Way, Applying PCA to the cancer dataset for visualization, Eigenfaces for feature extraction, DBSCAN-Evaluating clustering without ground truth, Convenient Pipeline Creation with make_pipeline-Grid-Searching Which Model To Use
- StratifiedKFold, Cross-validation with groups, Nested cross-validation
- StratifiedShuffleSplit, Shuffle-split cross-validation, Advanced Tokenization, Stemming, and Lemmatization
- SVC, Linear models for classification, Tuning SVM parameters, Applying Data Transformations, The Effect of Preprocessing on Supervised Learning, Grid Search-Grid Search with Cross-Validation, Analyzing the result of cross-validation-Search over spaces that are not grids, Algorithm Chains and Pipelines-Using Pipelines in Grid Searches, Convenient Pipeline Creation with make_pipeline-Grid-Searching Which Model To Use
- SVR, Kernelized Support Vector Machines, Interactions and Polynomials
- TfidfVectorizer, Rescaling the Data with tf–idf-Summary and Outlook
- train_test_split, Measuring Success: Training and Testing Data-First Things First: Look at Your Data, Model Evaluation and Improvement, Taking uncertainty into account, Precision-recall curves and ROC curves
- TransformerMixin, Testing Production Systems
- TSNE, Manifold Learning with t-SNE
- SciPy, SciPy
- score method, Evaluating the Model, k-Neighbors classification, k-neighbors regression, Grid Search with Cross-Validation, Building Pipelines
- sensitivity, Precision, recall, and f-score
- sentiment analysis example, Example Application: Sentiment Analysis of Movie Reviews
- shapes, defined, Meet the Data
- shuffle-split cross-validation, Shuffle-split cross-validation
- sin function, Univariate Nonlinear Transformations
- soft voting strategy, Building random forests
- Spark computing environment, Other Machine Learning Frameworks and Packages
- sparse coding (dictionary learning), Applying NMF to face images
- splits, Cross-Validation
- Stan language, Probabilistic Modeling, Inference, and Probabilistic Programming
- statsmodels package, Other Machine Learning Frameworks and Packages
- stemming, Advanced Tokenization, Stemming, and Lemmatization-Advanced Tokenization, Stemming, and Lemmatization
- stopwords, Stopwords
- stratified k-fold cross-validation, Stratified k-Fold Cross-Validation and Other Strategies-Stratified k-Fold Cross-Validation and Other Strategies
- string-encoded categorical data, Checking string-encoded categorical data
- supervised learning, Supervised Learning-Summary and Outlook (see also classification problems; regression problems)
- algorithms for
- decision trees, Decision Trees-Strengths, weaknesses, and parameters
- ensembles of decision trees, Ensembles of Decision Trees-Strengths, weaknesses, and parameters
- k-nearest neighbors, k-Nearest Neighbors-Strengths, weaknesses, and parameters
- kernelized support vector machines, Kernelized Support Vector Machines-Strengths, weaknesses, and parameters
- linear models, Linear Models-Strengths, weaknesses, and parameters
- naive Bayes classifiers, Naive Bayes Classifiers
- neural networks (deep learning), Neural Networks (Deep Learning)-Estimating complexity in neural networks
- overview of, Problems Machine Learning Can Solve
- data representation, Problems Machine Learning Can Solve
- examples of, Problems Machine Learning Can Solve
- generalization, Generalization, Overfitting, and Underfitting
- goals for, Supervised Learning
- model complexity vs. dataset size, Relation of Model Complexity to Dataset Size
- overfitting vs. underfitting, Generalization, Overfitting, and Underfitting
- overview of, Summary and Outlook
- sample datasets, Some Sample Datasets-Some Sample Datasets
- uncertainty estimates, Uncertainty Estimates from Classifiers-Summary and Outlook
- support vectors, Understanding SVMs
- synthetic datasets, Some Sample Datasets
T
- t-SNE algorithm (see manifold learning algorithms)
- tangens hyperbolicus (tanh), The neural network model
- term frequency–inverse document frequency (tf–idf), Rescaling the Data with tf–idf-Advanced Tokenization, Stemming, and Lemmatization
- terminal nodes, Decision Trees
- test data/test sets
- text data, Working with Text Data-Summary and Outlook
- time series predictions, Ranking, Recommender Systems, and Other Kinds of Learning
- tokenization, Representing Text Data as a Bag of Words, Advanced Tokenization, Stemming, and Lemmatization-Advanced Tokenization, Stemming, and Lemmatization
- top nodes, Building decision trees
- topic modeling, with LDA, Topic Modeling and Document Clustering-Latent Dirichlet Allocation
- training data, Measuring Success: Training and Testing Data
- train_test_split function, Benefits of Cross-Validation
- transform method, Applying Data Transformations, The General Pipeline Interface, Bag-of-Words for Movie Reviews
- transformations
- tree module, Analyzing decision trees
- trigrams, Bag-of-Words with More Than One Word (n-Grams)
- true positive rate (TPR), Precision, recall, and f-score, Receiver operating characteristics (ROC) and AUC
- true positives/true negatives, Confusion matrices
- typographical conventions, Conventions Used in This Book