A
Apriori algorithm 359
absolute error measures 286–287
absolute penalty fit method 254
accuracy characteristic of prediction model 291
activation function, neural network 247–248
adaptive estimation 184
Adjusted R2 99–100
affinity grouping 368
agglomerative algorithm, clustering 187
AI (artificial intelligence) 365
AIC/AICc (Akaike information criterion) approach 100–101, 227
analysis, k-nearest neighbors algorithm 235–237
Analyze command
Fit Line 88
Fit Model 89–92, 119–121
Fit Stepwise 99
Fit Y by X 27–32, 88, 108
ANOVA (analysis of variance)
one-way/one-factor 108–121
process 107–108
two-way/two-factor 122–128
area under the curve (AUC) 294, 296
artificial intelligence (AI) 365
association analysis 355–356, 360
association rules 357
association task, predictive analytics 369
AUC (area under the curve) 294, 296
average linkage method, distance between clusters 188
Axis Titles, Histogram Data Analysis 19
B
BA (business analytics) 2–4, 365
bagged trees 269
bar chart 84–86
Bartlett Test 164
Bayesian information criterion (BIC) 100–101
bell-shaped distribution 15–16
BI (business intelligence) 2–4
bias term, neural network 247
bias-variance tradeoff 174–175
binary dependent variable 132, 263–265, 290–296
binary vs. multiway splits, decision tree 213–214
bivariate analysis 4, 27–34, 148, 150–152
boosted trees
about 267–268, 275
performing boosting 275–278
performing for regression trees 278
boosting option
neural network predictability 253–254, 259–260
performing at default settings 280
bootstrap forest
about 267–268
bagged trees 269
performing 269–273
performing for regression trees 274
bubble plot 73–78
business analytics (BA) 2–4, 365
business intelligence (BI) 2–4
C
categorical variables
See also ANOVA
deciding on statistical technique 22–23
decision tree 213, 214–224
graphs 64, 69, 71, 72–73
neural network 252, 255–256
regression 101–107
causality 59
central limit theorem (CLT) 15–20
centroid method, distance between clusters 188, 197
chaining in single linkage method, distance between clusters 187–188
chi-square test of independence 216
churn analysis 147–158
classification task, predictive analytics 369
classification tree 214–224, 297–299
cleaning data for practical study 6–7
CLT (central limit theorem) 15–20
cluster analysis
about 342–347
credit card user example 186–187
definition 185
hierarchical clustering 189–191, 208
k-means clustering 187, 197–208
regression, using clusters in 196–197
Clustering History, Cluster command 192
clustering task, predictive analytics 369
coefficient of determination (RSquare or R2) 88
Color Clusters, Cluster command 191
Column Contributions, decision tree 229–230
complete linkage method, distance between clusters 189–191
confidence 357–358, 358–360
confusion matrix
binary dependent variable model comparison 290–291
logistic regression 145–146, 157–158
neural network 264–265
Connect Thru Missing, Overlay Plot command 198–200
contingency table 136–138
See also confusion matrix
continuous variables
See also ANOVA
deciding on statistical technique 22–23
decision tree 213, 224–230
logistic regression 154–155, 157
model comparison 286–290
neural network 252
contour graphs 74–78
conventional correction methods 45–48
conventional estimate values methods 45, 46–47
cornerstone of statistics, CLT as 16
correlation coefficient 287
correlation matrix
multiple regression 90–91
PCA 160–161, 164–165
criterion function, decision tree 216
CRM (customer relationship management) 366
cross-validation, neural network 251–252
customer relationship management (CRM) 366
D
data discovery 7, 9–10
See also graphs
See also tables
Data Filter, Graph Builder 64–68
data mining 4, 365, 370–371
See also predictive analytics
data sets
about 34–35
conventional correction methods 45–48
document term matrix and larger 332–337
error detection 35–38
JMP approach 48–54
missing values 43–44
outlier detection 39–54
recommended initial steps for 54
data warehouse 2
decision trees
classification tree 214–224, 297–299
credit risk example 213
definition 213
pros and cons of using 149
regression tree 224–230
dendrogram 187, 191, 192–193, 195
dependence, multivariate analysis framework 9–10
See also specific techniques
differences, testing for one-way ANOVA 115–121
dimension reduction, PCA 159, 165–167
directed (supervised) predictive analytics techniques 365, 367–369, 369–370
dirty data 33
discovery, multivariate analysis framework 9–10
See also graphs
See also tables
discovery task, predictive analytics 369
discrete dependent variable 315–317
discrete variables 154–155
Distribution command 25–26, 35, 39, 149
document term matrix
about 321
developing 321–323
larger data sets and 332–337
drop zones, Graph Builder 64–68
dummy variables 47, 255, 279–280
Dunnett's test 119
Durbin-Watson test 98
dynamic histogram, Excel 19–20
dynamic linking feature, JMP 83–84
E
Effect Likelihood Ratio Tests 153–154
eigenvalue analysis 163–165, 170
eigenvalues-greater-than-1 method, PCA 167
elastic net
about 173, 182
results of 183
technique of 182
vs. LASSO 183–184
elbow discovery method 166
enterprise resource planning (ERP) 2
equal replication design, ANOVA with 122–128
error detection 35–38
error table 290–291
See also confusion matrix
estimation task, predictive analytics 369
Excel, Microsoft
measuring continuous variables 288–290
opening files in JMP 23–24
PivotTable 59–61
Exclude/Unexclude option, data table 170
expectation step 47
F
factor analysis vs. PCA
See Principal Component Analysis
factor loadings 167
false positive rate (FPR), prediction model 291
filtering data 295–296
Fit Line, Analyze command 88
Fit Model, Analyze command 89–92, 119–121
Fit Stepwise, Analyze command 99
Fit Y by X, Analyze command 27–32, 88, 108
fitting to the model
ANOVA 108–109, 119–121
G2 (goodness-of-fit) statistic, decision tree 216, 218–221
neural networks 254
regression 88–89, 89–92, 99, 147, 152
statistics review 27–32
train-validate-test paradigm for 299–317
FPR (false positive rate), prediction model 291
fraud detection 366
frequency distribution, Excel Data Analysis Tool 17–20
F-test 88, 96–97, 107
G
G2 (goodness-of-fit) statistic, decision tree 216, 218–221
Gaussian radial basis function 248
generate complete data sets stage 48
gradient boosting, neural network 253–254
Graph Builder 64–86
graphs
bar chart 84–86
bubble plot 73–78
contours 74–78
Graph Builder dialog box 64–68
line graphs 74–78
scatterplot matrix 68–71, 150
trellis chart 71–73, 80–83
Group X drop zone 65–66
Group Y drop zone 65–66
H
hidden layer, neural network 249, 252–254
hierarchical clustering 189–191, 208
high-variance procedure, decision tree as 282
Histogram, Excel Data Analysis Tool 17–20
holdback validation, neural network 250–251
Hsu's MCB (multiple comparison with the best) 118–119
hyperbolic tangent (tanh) 247
hypothesis testing 20–21
I
impute estimates stage 48
“include all variables” approach, logistic regression 148, 149
indicator variables 50–52, 255, 321
input layer, neural network 246–247
in-sample and out-of-sample data sets, measures to compare 288–290, 302–303
interaction terms, introducing 155–157
interdependence, multivariate analysis framework 9–10
See also cluster analysis
See also Principal Component Analysis
iterative clustering 206–207
J
JMP
See SAS JMP statistical software application
Johnson Sb transformation 254
Johnson Su transformation 254
K
k-fold cross-validation, neural network 251–252
k-means clustering 187, 197–208
k-nearest neighbors algorithm
about 231–232
analysis 235–237
example 232–234
for multiclass problems 237–239
limitations and drawbacks of 242–243
regression models 239–242
standardizing units of measurement 234–235
ties with 234
L
Lack of Fit test, logistic regression 147, 153
Latent Semantic Analysis (LSA) 338–341
Leaf Report, decision tree 229–230
learning rate for algorithm 254
least absolute shrinkage and selection operator (LASSO)
about 173, 179–180, 307–314
results of 180–182
technique of 180
vs. elastic net 183–184
vs. ridge regression 180
least squares criterion 254
least significant difference (LSD) 118
Levene test, ANOVA 114, 115, 128
lift 358–360
lift chart 296–299
line graphs 74–78
Linear Probability Model (LPM) 132–133
linear regression
See also logistic regression
definition 88
k-nearest neighbors algorithm 239
LPM 132–133
multiple 89–101
simple 88–89
sum of squared residuals 153
linearity of logit, checking 157–158
listwise deletion method 45–46
loading plots 163, 167–171
log odds of 0/1 convention 139
logistic function 133–135
logistic regression
bivariate method 148
decision tree method 149
lift curve 296–299
logistic function 133–135
LPM 132
odds ratios 136–147
predictive techniques and 349–353
ROC curve 291–294
statistical study example 147–158
stepwise method 148
logit transformation 134–135
LogWorth statistic, decision tree 216, 217, 218–219
low- vs. high-variance procedures 282
LSA (Latent Semantic Analysis) 338–341
LSD (least significant difference) 118
LSMeans Plot command 120
M
maximum likelihood methods 47–48
Make into Data Table, ROC curve 295
MAR (missing at random) 44–45
Mark Clusters, Cluster command 191
market basket analysis
association analysis 355–356, 360
association rules 357
confidence 357–358, 358–360
examples 356
introduction 355
lift 358–360
support 357
maximization step 47–48
MCAR (missing completely at random) 44
mean, substitution for 46
mean absolute error (MAE) measure 286–287, 302–303
mean square error (MSE) measure 286, 302–303
means comparison tests, ANOVA 117–121
median, substitution for 46
missing at random (MAR) 44–45
missing completely at random (MCAR) 44
missing data mechanism 44–45
missing not at random (MNAR) 45
Missing Value Clustering tool 52
Missing Value Snapshot tool 52
missing values 43–44
MNAR (missing not at random) 45
mode, substitution for 46
model comparison
binary dependent variable 290–296
continuous dependent variable 286–290, 305
introduction 285–286
lift chart 296–299
train-validate-test paradigm 299–317
model-based methods 45, 47–48
Mosaic plot 30
Move Up, Value Ordering 142–143
MSE (mean square error) measure 286, 302–303
multiclass problems, k-nearest neighbors algorithm for 237–239
multicollinearity of independent variables 98
multiple imputation methods 48
multiple regression 89–101
Multivariate command 90, 164–165
multivariate data analysis
and data sets 57–59
as prerequisite to predictive modeling 365–366
framework 9–10
multivariate normal imputation 48
multivariate singular value decomposition (SVD) imputation 48
multivariate techniques
cluster analysis 342–347
Latent Semantic Analysis (LSA) 338–341
text mining and 321
topic analysis 342
multiway splits in decision tree 213–214, 216
N
neural networks
basic process 246–250
data preparation 255–256
fitting options for the model 254
hidden layer structure 249, 252–254
prediction example 256–265
purpose and application 245–246
validation methods 250–252, 260–265
New Columns command 126
no penalty fit option 254
nominal data 22
nonlinear transformation 98, 247
normal (bell-shaped) distribution 15–16
Normal Quantile Plot, Distribution command 111–112, 149
Number of Models, neural network 253
Number of Tours, neural network model 262, 263
O
odds ratios, logistic regression 136–147
Odds Ratios command 142, 144
one-sample hypothesis testing 20–21
one-way/one-factor ANOVA 108–121
online analytical processing (OLAP) 59–64
optimal classification, ROC curves 294, 295
order of operations, text mining and 331–332
ordinal data 22
outlier detection 39–54
outliers
defined 40
scrubbing data of 255
out-of-sample and in-sample data sets, measures to compare 288–290, 302–303
output layer, neural network 246–247
overfitting the model/data
about 303–305
neural network 250–254
train-validate-test paradigm to avoid 299–317
Overlap drop zone 65–66
Overlay Plot command 198–200
P
Pairwise Correlations, Multivariate command 164–165
parallel coordinate plots, k-means clustering 204–206
Parameter Estimates, Odds Ratios command 142
parsimony, principle of 99, 148
Partition command 235–237
partition initial output, decision tree 215–216, 224–225
PCA
See Principal Component Analysis
penalty fit method 254
phrasing stage, of text mining 329–330
PivotTable, Excel 59–61
Plot Residual by Predicted 97–98, 261
PPAR (plan, perform, analyze, reflect) cycle 7–8
practical statistical study 6–9
prediction task, predictive analytics 369
predictions, making 236–237, 242
predictive analytics
about 347–348
defined 4
definition 4, 365
framework 367–369
goal 369–370
logistic regressions 349–353
model development and evaluation phase 371–372
multivariate data analysis role in 365–366
phases 369–370
primary analysis 348–349
tasks of discovery 369–370
text mining and 321
vs. statistics 370–371
predictive modeling
See predictive analytics
primary analysis 348–349
Principal Component Analysis (PCA)
dimension reduction 159, 165–167
eigenvalue analysis of weights 163–164
example 159–163
structure of data, insights into 167–171
probabilities
estimating for logistic regression 145
relationship to odds 138
probability formula, saving 145
proportion of variation method, PCA 167, 170
pruning variables in decision tree 222–223
p-values, hypothesis testing 21
R
random forests
See bootstrap forest
random sample 12–13
Range Odds Ratios, Odds Ratios command 142
Receiver Operating Characteristic (ROC) curve 223–224, 291–294
regression
See also logistic regression
categorical variables 101–107
clusters 196–197
fitting to the model 88–89, 89–92, 147, 153
linear 88–107, 132–133, 153
multiple 89–101
purposes 88
simple 88–89
stepwise 98–101, 148, 299–302
regression imputation 47
regression models, k-nearest neighbors algorithm 239–242
regression trees
about 224–230
performing Boosted Trees for 278
performing Bootstrap Forest for 274
relative absolute error 287
relative squared error 287
repeated measures ANOVA 107
representative sample 12–13
residuals
linear regression 153
multiple regression 97–98
return on investment (ROI) from data collection 2
ridge regression
about 173
JMP and 176–179
techniques and limitations of 175–176
vs. LASSO 180
robust fit method 254
ROC (Receiver Operating Characteristic) curve 223–224, 291–294
root mean square error (RMSE/se) measure 100–101, 163, 240–242, 286–287
RSquare or R2 (coefficient of determination) 88
S
sampling
in-sample and out-of-sample data sets 288–290, 302–303
one-sample hypothesis testing 20–21
principles 13–14, 15–16
SAS JMP statistical software application
See also specific screen options and commands
as used in book 8–9, 9–10
deciding on best statistical technique 23–32
features to support predictive analytics 369–370
opening Excel files 23–24
saturated model, logistic regression 147
scales for standardizing data, neural network 255
scatterplot matrix 68–71, 90–91, 148, 150, 160–161
score plot 162–163
scree plot
hierarchical clustering 193
PCA 166, 167, 168
se (RMSE) 100–101, 163, 240–242, 286–287
Selection button, copying output 63
SEMMA approach 371–372
sensitivity component of prediction model 291
Show Split Count, Display Options 219
Show Split Prob, Display Options 217
simple regression 88–89
single linkage method, distance between clusters 188
singular value decomposition (SVD) 338–341
sorting data, Graph Builder 84–86
specificity component of prediction model 291
Split command, decision tree variables 218–221
squared penalty fit method 254
SSBG (sum of squares between groups) 107
SSE (sum of squares within groups [or error]) 107, 198–200, 206–207, 240–242
standard error 16
standardized beta coefficient (Std Beta) 95–96
standardized units of measurement 234–235
statistical assumptions, testing for one-way ANOVA 110–114
statistics coursework
central limit theorem 15–20
coverage and real-world limitations 4–6
effective vs. ineffective approaches 22–32
one-sample hypothesis testing and p-values 20–21
sampling principles 13–14, 15–16
statistics as inexact science 13–14
Z score/value 14, 20–21
statistics vs. predictive analytics 370–371
Std Beta, Fit Model command 95–96
stemming 324
stepwise regression 98–101, 149, 299–302
stop words 330–331
Subset option, Table in Graph Builder 83–84
sum of squares between groups (SSBG) 107
sum of squares within groups (or error) (SSE) 107, 198–200, 206–207, 240–242
supervised (directed) predictive analytics techniques 367–369, 369–370
support 357
T
tables 59–64
Tabulate command 62–64
terming stage, of text mining 330–331
terms
adding frequent phrases to list of 335–336
defined 326
grouping 334–335
identifying dominant 346–347
parsing 336–337
testing for differences, one-way ANOVA 115–121
testing statistical assumptions, one-way ANOVA 110–114
Tests that the Variances are Equal report 110
Text Explorer dialog box 324–325
text mining
introduction 319–320
phrasing stage 329–330
terming stage 330–331
tokenizing stage 323–329
unstructured data 320–321
ties, in k-nearest neighbors algorithm 234
time series, Durbin-Watson test 98
tokenizing stage, of text mining 323–329
topic analysis 342
total sum of squares (TSS) 107
train-validate-test paradigm for model evaluation 299–317
Transform Covariates, neural network 255
trellis chart 71–73, 80–83
true positive rate (TPR) component of prediction model 291
TSS (total sum of squares) 107
t-test 8, 97, 118
Tukey HSD test 118, 120
Tukey-Kramer HSD test 118, 120
2R (representative and random) sample 13–14
two-way/two-factor ANOVA 122–128
U
unequal replication design, ANOVA with 122
Unequal Variances test, ANOVA 110–111, 115
Unit Odds Ratios, Odds Ratios command 142
univariate analysis 4
unstructured data 320–321
unsupervised (undirected) predictive analytics techniques 367–369, 369–370
V
validation
boosted trees 278–283
logistic regression 157–158
neural network 250–252, 260–265
train-validate-test paradigm 299–317
Value Ordering, Column properties 142–143
variable removal method 45, 46
variables
See also categorical variables
See also continuous variables
binary dependent variable 132, 263–265, 290–296
decision tree 214–224, 225–228
dummy 47, 255, 279–280
in data sets 34–35
model building 147–150
multicollinearity 98
neural network 252
reclassifying 148
weighting 163–164, 254
variance inflation factor (VIF) 98
W
Ward's method, distance between clusters 187–188
weak classifier, boosting option 253
weight decay penalty fit method 254
weighting of variables 163–164, 254
Welch's Test 110–111, 114, 115
Whole Model Test 146–147, 151–152
within-sample variability 107
without replication design, ANOVA 122
word clouds 331, 333–334
Wrap drop zone 65–66
Z
Z score/value 14, 20–21