Index
A
absolute error measures 226, 227, 244
absolute penalty fit method 211
accuracy characteristic of prediction model 231
activation function, neural network 203–204
Adjusted R2 75, 76
affinity grouping 252
agglomerative algorithm, clustering 154–163
AI (artificial intelligence) 250
AIC/AICc (Akaike information criterion) approach 76, 184, 195–196
Analyze command
Fit Line 65
Fit Model 67–69, 95
Fit Stepwise 74
Fit Y by X 31–33, 65, 83–84
ANOVA (analysis of variance)
one-way/one-factor 83–96
process 82–83
two-way/two-factor 97–102
area under the curve (AUC) 235, 236, 237
artificial intelligence (AI) 250
association rules 252
association task, predictive analytics 254
AUC (area under the curve) 235, 236, 237
average linkage method, distance between clusters 154–155
Axis Titles, Histogram Data Analysis 23
B
BA (business analytics) 3–5, 250
bar chart 59–61
Bayesian information criterion (BIC) 76
bell-shaped distribution 18
BI (business intelligence) 3–5
bias term, neural network 203
BII (business information intelligence) 3–5
binary dependent variable 104, 221–222, 230–237
binary vs. multiway splits, decision tree 181
bivariate analysis 6, 31–36, 124–133
BMI (business modeling intelligence) 3–5
boosting option, neural network predictability 210, 216–217
BSI (business statistical intelligence) 3–5
bubble plot 53–55
business analytics (BA) 3–5, 250
business information intelligence (BII) 3–5
business intelligence (BI) 3–5
business modeling intelligence (BMI) 3–5
business statistics intelligence (BSI) 3–5
C
categorical variables
See also ANOVA
deciding on statistical technique 26, 28–29
decision tree 180, 181–192
graphs 45–46, 50–51, 52
neural network 208, 212–213
regression 76–82
tables 42
causality 39
central limit theorem (CLT) 18–24
centroid method, distance between clusters 154–155, 164–166
chaining in single linkage method, distance between clusters 154
chi-square test of independence 185
churn analysis 122–133
classification task, predictive analytics 254
classification tree 181–192, 237, 239–240
cleaning data for practical study 8
CLT (central limit theorem) 18–24
cluster analysis
credit card user example 152–153
definition 152
hierarchical clustering 154–163, 177
k-means clustering 154, 164–177
regression, using clusters in 164
Cluster command 156–157, 159
Clustering History, Cluster command 159
clustering task, predictive analytics 254
coefficient of determination (RSquare or R2) 66
Color Clusters, Cluster command 157
Column Contributions, decision tree 198
complete linkage method, distance between clusters 154–155
confusion matrix
binary dependent variable model comparison 230–231
bivariate analysis contingency table 35
logistic regression 120, 132
neural network 222–223
Connect Thru Missing, Overlay Plot command 166–167
constant variance assumption 105
contingency table 35
See also confusion matrix
continuous variables
See also ANOVA
deciding on statistical technique 26, 28
decision tree 180, 183, 192–199
logistic regression 129–130, 132
model comparison 226–230, 244
neural network 208
regression 76–77
contour graphs 55–56
cornerstone of statistics, CLT theorem 20
correlation coefficient 227
correlation matrix
logistic regression 125
multiple regression 67, 68
PCA 136–138, 142, 148–149
criterion function, decision tree 184–185
CRM (customer relationship management) 250
cross-validation, neural network 207–208
customer relationship management (CRM) 250
D
data, role of 2–3
data discovery 9, 11
See also graphs
See also tables
Data Filter, Graph Builder 56–61
data mining 4, 251, 254–255
See also predictive analytics
data warehouse 2
decision trees
classification tree 182–192, 237, 239–240
credit risk example 180–182
definition 180
pros and cons of using 124
regression tree 192–199
dendrogram 154, 157, 158, 159–160, 164
dependence, multivariate analysis framework 11
See also specific techniques
differences, testing for
one-way ANOVA 90–96
dimension reduction, PCA 136, 142–144
directed (supervised) predictive analytics techniques 252, 253, 254
“dirty data,” problem of 6
discovery, multivariate analysis framework 9, 11
See also graphs
See also tables
discovery task, predictive analytics 254
Distribution command 30, 85–87, 125, 175
drop zones, Graph Builder 46–48
dummy variables 76–77, 79–82, 212
Dunnett’s test 95
Durbin-Watson test 73
dynamic histogram, Excel 23
dynamic linking feature, JMP 58
E
Effect Likelihood Ratio Tests 129
eigenvalue analysis 141–144, 145, 147
eigenvalues-greater-than-1 method, PCA 144
elbow discovery method 144, 167
enterprise resource planning (ERP) 2
equal replication design, ANOVA with 97–102
error table 230
See also confusion matrix
estimation task, predictive analytics 254
Excel, Microsoft
measuring continuous variables 228–229
opening files in JMP 28
PivotTable 40–42
random sample generation 20–24
reasons for using 10–11
Exclude/Unexclude option, data table 147
F
factor analysis vs. PCA 140–141
See also Principal Component Analysis
factor loadings 145
false positive rate (FPR), prediction model 231–232
features, neural network 204–205
filtering data 56–61, 236–237
Fit Line, Analyze command 65
Fit Model, Analyze command 67–69, 95
Fit Stepwise, Analyze command 74
Fit Y by X, Analyze command 31–33, 65, 83–84
fitting to the model
ANOVA 83–84, 95
clusters 164
G2 (goodness-of-fit) statistic, decision tree 184, 185–190
neural networks 206, 211, 215, 220
regression 65, 67–69, 71, 74, 122, 128
statistics review 31–33
train-validate-test paradigm for 240–246
Formula command 77–79
FPR (false positive rate), prediction model 231–232
fraud detection 250
frequency distribution, Excel Data Analysis Tool 22–23
F-test 65, 71–72, 83
G
G2 (goodness-of-fit) statistic, decision tree 184, 185–190
Gaussian radial basis function 204
gradient boosting, neural network 210
Graph Builder 45–61
graphs
bar chart 59–61
bubble plot 53–55
contours 55–56
Graph Builder dialog box 45–48
line graphs 55–56
scatterplot matrix 48–51, 123, 126
trellis chart 51–53, 55–56, 58
Group X drop zone 46–47
Group Y drop zone 46–47
H
hidden layer, neural network 205, 208–210
hierarchical clustering 154–163, 177
high-variance procedure, decision tree as 198–199
Histogram, Excel Data Analysis Tool 21–22
holdback validation, neural network 206, 215
homoscedasticity assumption 105
Hsu’s MCB (multiple comparison with best) 94–95
hyperbolic tangent (tanh) 204
hypothesis testing 24–26
I
“include all variables” approach, logistic regression 123, 124
indicator variables 76–77, 79–82, 212
input layer, neural network 202–203
in-sample and out-of-sample data sets, measures to compare 82, 228–229, 244
interaction terms, introducing 130–132
interdependence, multivariate analysis framework 11
See also cluster analysis
See also Principal Component Analysis
J
JMP
See SAS JMP statistical software application
Johnson Sb transformation 211
Johnson Su transformation 211
K
k-fold cross-validation, neural network 207–208
k-means clustering 154, 164–177
L
Lack of Fit test, logistic regression 122, 128
Leaf Report, decision tree 198
learning rate for algorithm 210
least squares criterion 206, 211
least squares differences (LSD) 94
Levene test, ANOVA 89, 90, 102
lift chart 237–240
line graphs 55–56
Linear Probability Model (LPM) 105
linear regression
See also logistic regression
definition 65
LPM 105
multiple 67–76
simple 64–66
sum of squared residuals 128
linearity of logit, checking 132
loading plots 139, 145–146, 148–149
log odds of 0/1 convention 113
logistic function 106–112
logistic regression
bivariate method 124
decision tree method 124
lift curve 237–240
logistic function 106–112
LPM 105
odds ratios 109–111, 113–122
ROC curve 235–237
statistical study example 122–133
stepwise method 124
logit transformation 107
LogWorth statistic, decision tree 185, 186, 187–190
low- vs. high-variance procedures 198–199
LPM (Linear Probability Model) 105
LSD (least squares differences) 94
LSMeans Plot command 95, 97
M
Make into Data Table, ROC curve 236
Mark Clusters, Cluster command 157
market basket analysis 252
mean absolute error (MAE) measure 226, 244
mean square error (MSE) measure 226, 244
means comparison tests, ANOVA 90–95
Means/ANOVA command 88–89
model comparison
binary dependent variable 230–237
continuous dependent variable 226–230, 244
introduction 225
lift chart 237–240
training-validation-test paradigm 240–246
Model Launch command, neural network 216
Mosaic plot 34–35
Move Up, Value Ordering 116–117
MSE (mean square error) measure 226, 244
multicollinearity of independent variables 73–74
multiple regression 67–76
Multivariate command 67, 142
multivariate data analysis
and data sets 37–39
as prerequisite to predictive modeling 249–250
commonality for practical statistical study 7
framework 9, 11
multiway splits in decision tree 181, 185
N
neural networks
basic process 202–206
data preparation 212–213
fitting options for the model 206, 211, 215, 220
hidden layer structure 205, 208–210
prediction example 213–223
purpose and application 201
validation methods 206–208, 215–216
New Columns command 100
no penalty fit option 211
nominal data 26
nonlinear transformation 74, 204
normal (bell-shaped) distribution 18
Normal Quantile Plot, Distribution command 85–87, 125
Number of Models, neural network 216
Number of Tours, neural network model 216, 217
O
odds ratios, logistic regression 109–111, 113–122
Odds Ratios command 116, 118
one-sample hypothesis testing 24–25
one-way/one-factor ANOVA 83–96
online analytical processing (OLAP) 40–45
optimal classification, ROC curves 233–235, 236
ordinal data 26
outliers, scrubbing data of 212, 219
out-of-sample and in-sample data sets, measures to compare 82, 228–229, 244
output layer, neural network 202–203
overfitting the model/data
clusters 164
decision trees 191
neural network 206–211, 216, 218
train-validation-test paradigm to avoid 240–246
Overlay drop zone 46–47
Overlay Plot command 166–167
P
Pairwise Correlations, Multivariate command 142
parallel coordinate plots, k-means clustering 172–173
Parameter Estimates, Odds Ratios command 118
parsimony, principle of 74, 123
partition initial output, decision tree 183–184, 193
PCA
See Principal Component Analysis
penalty fit method 211, 215, 220
PivotTable, Excel 40–42
Plot Residual by Predicted 72–73, 218–219
PPAR (plan, perform, analyze, reflect) cycle 9–11
practical statistical study 7, 8–9
prediction task, predictive analytics 254
predictive analytics
availability of courses 7
definition 4, 252
framework 252–253
goal 253–254
model development and evaluation phase 255–256
multivariate data analysis role in 249–250
phases 254–256
specific applications 5
tasks of discovery 254
vs. statistics 254–255
predictive modeling
See predictive analytics
Principal Component Analysis (PCA)
dimension reduction 136, 142–144
eigenvalue analysis of weights 141–142
example 135–140
structure of data, insights into 145–149
vs. factor analysis 140–141
probabilities
estimating for logistic regression 119–120
relationship to odds 112
probability formula, saving 119
proportion of variation method, PCA 144, 148
pruning variables in decision tree 191, 195–196
p-values, hypothesis testing 25–26
R
random sample 14, 20–24
Range Odds Ratios, Odds Ratios command 116
Receiver Operating Characteristic (ROC) curve 191–192, 232–237
regression
See also logistic regression
categorical variables 76–82
clusters 164
continuous variables 76–77
fitting to the model 65, 67–69, 71, 74, 122, 128
linear 64–76, 105, 128
multiple 67–76
purposes 64
simple 64–66
stepwise 74–75, 124, 241–243
regression tree 192–199
relative absolute error 227
relative squared error 226
Remove Fit, neural network 215
repeated measures ANOVA 82
representative sample 14
residuals
ANOVA 85, 87
linear regression 128
multiple regression 72–73
neural network 218–219
return on investment (ROI) from data collection 2–3
robust fit method 211
ROC (Receiver Operating Characteristic) curve 191–192, 232–237
root mean square error (RMSE/se) measure 75, 76, 140, 192, 226
RSquare or R2 (coefficient of determination) 66
S
sampling
in-sample and out-of-sample data sets 82, 228–229, 244
one-sample hypothesis testing 24–25
principles 14–15, 18–20
random sample generation 20–24
SAS JMP statistical software application
See also specific screen options and commands
as used in book 10, 11
deciding on best statistical technique 28–36
features to support predictive analytics 58, 254
opening files in Excel 28
saturated model, logistic regression 122
scales for standardizing data, neural network 212
scatterplot matrix 48–51, 123, 126
score plot 139, 145
scree plot
hierarchical clustering 160
PCA 142–143, 145, 146, 147
se (RMSE) 75, 76, 140, 192, 226
Selection button, copying output 44
SEMMA approach 256
sensitivity component of prediction model 231
Show Split Count, Display Options 188
Show Split Prob, Display Options 185
simple regression 64–66
single linkage method, distance between clusters 154–155
sorting data
Graph Builder 59–60
PCA 142, 145
specificity component of prediction model 231
Split command, decision tree variables 185–186
squared penalty fit method 211, 220
squaring distances, k-means clustering 173–174
SSBG (sum of squares between groups) 82, 83
SSE (sum of squares within groups [or error]) 82, 83, 166–167, 175
standard error 19
standardized beta coefficient (Std Beta) 69, 71
statistical assumptions, testing for
one-way ANOVA 85–89
statistics coursework
central limit theorem 18–24
coverage and real-world limitations 5–7
effective vs. ineffective approaches 26–36
one-sample hypothesis testing and p-values 24–26
sampling principles 14–15, 18–20
statistics as inexact science 14, 15–16
Z score/value 17, 24–25
statistics vs. predictive analytics 254–255
Std Beta, Fit Model command 69, 71
stepwise regression 74–75, 124, 241–243
Subset option, Table in Graph Builder 58–59
sum of squares within groups (or error) (SSE) 82, 83, 166–167, 175
sum of squares between groups (SSBG) 82, 83
Summary Statistics, Distribution command 175
supervised (directed) predictive analytics techniques 252, 253, 254
T
tables 40–45
Tabulate command 42–45
testing for differences, one-way ANOVA 90–96
testing statistical assumptions, one-way ANOVA 85–89
Tests that the Variances are Equal report 85
time series, Durbin-Watson test 73
total sum of squares (TSS) 82, 83
train-validate-test paradigm for model evaluation 240–246
Transform Covariates, neural network 212
trellis chart 51–53, 55–58
true positive rate (TPR) component of prediction model 232
TSS (total sum of squares) 82, 83
t-test 65, 71–72, 93
Tukey HSD test 93, 95
Tukey-Kramer HSD test 93, 95
2R (representative and random) sample 14, 16
two-way/two-factor ANOVA 97–102
U
unequal replication design, ANOVA with 97
Unequal Variances test, ANOVA 85, 86, 89
Unit Odds Ratios, Odds Ratios command 116
univariate analysis 6
unsupervised (undirected) predictive analytics techniques 252, 253, 254
V
validation
logistic regression 132
neural network 206–208, 215–216
train-validate-test paradigm 240–246
Validation variable 208
Value Ordering, Column properties 116–117
variables
See also categorical variables
See also continuous variables
automatic assignments for neural network 214
binary dependent variable 104, 221–222, 230–237
decision tree 182–191, 194–196
dummy 76–77, 79–82, 212
model building 123–124
multicollinearity 73–74
neural network 208
reclassifying 113, 123
weighting 141–142, 211, 215
variance inflation factor (VIF) 73, 74
W
Ward’s method, distance between clusters 154
weak classifier, boosting option 210
weight decay penalty fit method 211
weighting of variables 141–142, 211, 215
Whole Model Test 121–122, 127–128
within-sample variability 82
without replication design, ANOVA 97
Wrap drop zone 46–47
Z
Z score/value 17, 24–25