Index

A

Apriori algorithm 359

absolute error measures 286–287

absolute penalty fit method 254

accuracy characteristic of prediction model 291

activation function, neural network 247–248

adaptive estimation 184

Adjusted R2 99–100

affinity grouping 368

agglomerative algorithm, clustering 187

AI (artificial intelligence) 365

AIC/AICc (Akaike information criterion) approach 100–101, 227

analysis, k-nearest neighbors algorithm 235–237

Analyze command

Fit Line 88

Fit Model 89–92, 119–121

Fit Stepwise 99

Fit Y by X 27–32, 88, 108

ANOVA (analysis of variance)

one-way/one-factor 108–121

process 107–108

two-way/two-factor 122–128

area under the curve (AUC) 294, 296

artificial intelligence (AI) 365

association analysis 355–356, 360

association rules 357

association task, predictive analytics 369

AUC (area under the curve) 294, 296

average linkage method, distance between clusters 188

Axis Titles, Histogram Data Analysis 19

B

BA (business analytics) 2–4, 365

bagged trees 269

bar chart 84–86

Bartlett Test 164

Bayesian information criterion (BIC) 100–101

bell-shaped distribution 15–16

BI (business intelligence) 2–4

bias term, neural network 247

bias-variance tradeoff 174–175

binary dependent variable 132, 263–265, 290–296

binary vs. multiway splits, decision tree 213–214

bivariate analysis 4, 27–34, 148, 150–152

boosted trees

about 267–268, 275

performing boosting 275–278

performing for regression trees 278

boosting option

neural network predictability 253–254, 259–260

performing at default settings 280

bootstrap forest

about 267–268

bagged trees 269

performing 269–273

performing for regression trees 274

bubble plot 73–78

business analytics (BA) 2–4, 365

business intelligence (BI) 2–4

C

categorical variables

See also ANOVA

deciding on statistical technique 22–23

decision tree 213, 214–224

graphs 64, 69, 71, 72–73

neural network 252, 255–256

regression 101–107

causality 59

central limit theorem (CLT) 15–20

centroid method, distance between clusters 188, 197

chaining in single linkage method, distance between clusters 187–188

chi-square test of independence 216

churn analysis 147–158

classification task, predictive analytics 369

classification tree 214–224, 297–299

cleaning data for practical study 6–7

CLT (central limit theorem) 15–20

cluster analysis

about 342–347

credit card user example 186–187

definition 185

hierarchical clustering 189–191, 208

k-means clustering 187, 197–208

regression, using clusters in 196–197

Clustering History, Cluster command 192

clustering task, predictive analytics 369

coefficient of determination (RSquare or R2) 88

Color Clusters, Cluster command 191

Column Contributions, decision tree 229–230

complete linkage method, distance between clusters 189–191

confidence 357–358, 358–360

confusion matrix

binary dependent variable model comparison 290–291

logistic regression 145–146, 157–158

neural network 264–265

Connect Thru Missing, Overlay Plot command 198–200

contingency table 136–138

See also confusion matrix

continuous variables

See also ANOVA

deciding on statistical technique 22–23

decision tree 213, 224–230

logistic regression 154–155, 157

model comparison 286–290

neural network 252

contour graphs 74–78

conventional correction methods 45–48

conventional estimate values methods 45, 46–47

cornerstone of statistics, CLT 16

correlation coefficient 287

correlation matrix

multiple regression 90–91

PCA 160–161, 164–165

criterion function, decision tree 216

CRM (customer relationship management) 366

cross-validation, neural network 251–252

customer relationship management (CRM) 366

D

data discovery 7, 9–10

See also graphs

See also tables

Data Filter, Graph Builder 64–68

data mining 4, 365, 370–371

See also predictive analytics

data sets

about 34–35

conventional correction methods 45–48

document term matrix and larger 332–337

error detection 35–38

JMP approach 48–54

missing values 43–44

outlier detection 39–54

recommended initial steps for 54

data warehouse 2

decision trees

classification tree 214–224, 297–299

credit risk example 213

definition 213

pros and cons of using 149

regression tree 224–230

dendrogram 187, 191, 192–193, 195

dependence, multivariate analysis framework 9–10

See also specific techniques

differences, testing for one-way ANOVA 115–121

dimension reduction, PCA 159, 165–167

directed (supervised) predictive analytics techniques 365, 367–369, 369–370

dirty data 33

discovery, multivariate analysis framework 9–10

See also graphs

See also tables

discovery task, predictive analytics 369

discrete dependent variable 315–317

discrete variables 154–155

Distribution command 25–26, 35, 39, 149

document term matrix

about 321

developing 321–323

larger data sets and 332–337

drop zones, Graph Builder 64–68

dummy variables 47, 255, 279–280

Dunnett's test 119

Durbin-Watson test 98

dynamic histogram, Excel 19–20

dynamic linking feature, JMP 83–84

E

Effect Likelihood Ratio Tests 153–154

eigenvalue analysis 163–165, 170

eigenvalues-greater-than-1 method, PCA 167

elastic net

about 173, 182

results of 183

technique of 182

vs. LASSO 183–184

elbow discovery method 166

enterprise resource planning (ERP) 2

equal replication design, ANOVA with 122–128

error detection 35–38

error table 290–291

See also confusion matrix

estimation task, predictive analytics 369

Excel, Microsoft

measuring continuous variables 288–290

opening files in JMP 23–24

PivotTable 59–61

Exclude/Unexclude option, data table 170

expectation step 47

F

factor analysis vs. PCA

See Principal Component Analysis

factor loadings 167

false positive rate (FPR), prediction model 291

filtering data 295–296

Fit Line, Analyze command 88

Fit Model, Analyze command 89–92, 119–121

Fit Stepwise, Analyze command 99

Fit Y by X, Analyze command 27–32, 88, 108

fitting to the model

ANOVA 108–109, 119–121

G2 (goodness-of-fit) statistic, decision tree 216, 218–221

neural networks 254

regression 88–89, 89–92, 99, 147, 152

statistics review 27–32

train-validate-test paradigm for 299–317

FPR (false positive rate), prediction model 291

fraud detection 366

frequency distribution, Excel Data Analysis Tool 17–20

F-test 88, 96–97, 107

G

G2 (goodness-of-fit) statistic, decision tree 216, 218–221

Gaussian radial basis function 248

generate complete data sets stage 48

gradient boosting, neural network 253–254

Graph Builder 64–86

graphs

bar chart 84–86

bubble plot 73–78

contours 74–78

Graph Builder dialog box 64–68

line graphs 74–78

scatterplot matrix 68–71, 150

trellis chart 71–73, 80–83

Group X drop zone 65–66

Group Y drop zone 65–66

H

hidden layer, neural network 249, 252–254

hierarchical clustering 189–191, 208

high-variance procedure, decision tree as 282

Histogram, Excel Data Analysis Tool 17–20

holdback validation, neural network 250–251

Hsu's MCB (multiple comparison with best) 118–119

hyperbolic tangent (tanh) 247

hypothesis testing 20–21

I

impute estimates stage 48

“include all variables” approach, logistic regression 148, 149

indicator variables 50–52, 255, 321

input layer, neural network 246–247

in-sample and out-of-sample data sets, measures to compare 288–290, 302–303

interactions terms, introducing 155–157

interdependence, multivariate analysis framework 9–10

See also cluster analysis

See also Principal Component Analysis

iterative clustering 206–207

J

JMP

See SAS JMP statistical software application

Johnson Sb transformation 254

Johnson Su transformation 254

K

k-fold cross-validation, neural network 251–252

k-means clustering 187, 197–208

k-nearest neighbors algorithm

about 231–232

analysis 235–237

example 232–234

for multiclass problems 237–239

limitations and drawbacks of 242–243

regression models 239–242

standardizing units of measurement 234–235

ties with 234

L

Lack of Fit test, logistic regression 147, 153

Latent Semantic Analysis (LSA) 338–341

Leaf Report, decision tree 229–230

learning rate for algorithm 254

least absolute shrinkage and selection operator (LASSO)

about 173, 179–180, 307–314

results of 180–182

technique of 180

vs. elastic net 183–184

vs. Ridge Regression 180

least significant difference (LSD) 118

least squares criterion 254

Levene test, ANOVA 114, 115, 128

lift 358–360

lift chart 296–299

line graphs 74–78

Linear Probability Model (LPM) 132–133

linear regression

See also logistic regression

definition 88

k-nearest neighbors algorithm 239

LPM 132–133

multiple 89–101

simple 88–89

sum of squared residuals 153

linearity of logit, checking 157–158

listwise deletion method 45–46

loading plots 163, 167–171

log odds of 0/1 convention 139

logistic function 133–135

logistic regression

bivariate method 148

decision tree method 149

lift curve 296–299

logistic function 133–135

LPM 132

odds ratios 136–147

predictive techniques and 349–353

ROC curve 291–294

statistical study example 147–158

stepwise method 148

logit transformation 134–135

LogWorth statistic, decision tree 216, 217, 218–219

low- vs. high-variance procedures 282

LSA (Latent Semantic Analysis) 338–341

LSD (least significant difference) 118

LSMeans Plot command 120

M

maximum likelihood methods 47–48

Make into Data Table, ROC curve 295

MAR (missing at random) 44–45

Mark Clusters, Cluster command 191

market basket analysis

association analysis 355–356, 360

association rules 357

confidence 357–358, 358–360

examples 356

introduction 355

lift 358–360

support 357

maximization step 47–48

MCAR (missing completely at random) 44

mean, substitution for 46

mean absolute error (MAE) measure 286–287, 302–303

mean square error (MSE) measure 286, 302–303

means comparison tests, ANOVA 117–121

median, substitution for 46

missing at random (MAR) 44–45

missing completely at random (MCAR) 44

missing data mechanism 44–45

missing not at random (MNAR) 45

Missing Value Clustering tool 52

Missing Value Snapshot tool 52

missing values 43–44

MNAR (missing not at random) 45

mode, substitution for 46

model comparison

binary dependent variable 290–296

continuous dependent variable 286–290, 305

introduction 285–286

lift chart 296–299

training-validation-test paradigm 299–317

model-based methods 45, 47–48

Mosaic plot 30

Move Up, Value Ordering 142–143

MSE (mean square error) measure 286, 302–303

multiclass problems, k-nearest neighbors algorithm for 237–239

multicollinearity of independent variables 98

multiple imputation methods 48

multiple regression 89–101

Multivariate command 90, 164–165

multivariate data analysis

and data sets 57–59

as prerequisite to predictive modeling 365–366

framework 9–10

multivariate normal imputation 48

multivariate singular value decomposition (SVD) imputation 48

multivariate techniques

cluster analysis 342–347

Latent Semantic Analysis (LSA) 338–341

text mining and 321

topic analysis 342

multiway splits in decision tree 213–214, 216

N

neural networks

basic process 246–250

data preparation 255–256

fitting options for the model 254

hidden layer structure 249, 252–254

prediction example 256–265

purpose and application 245–246

validation methods 250–252, 260–265

New Columns command 126

no penalty fit option 254

nominal data 22

nonlinear transformation 98, 247

normal (bell-shaped) distribution 15–16

Normal Quantile Plot, Distribution command 111–112, 149

Number of Models, neural network 253

Number of Tours, neural network model 262, 263

O

odds ratios, logistic regression 136–147

Odds Ratios command 142, 144

one-sample hypothesis testing 20–21

one-way/one-factor ANOVA 108–121

online analytical processing (OLAP) 59–64

optimal classification, ROC curves 294, 295

order of operations, text mining and 331–332

ordinal data 22

outlier detection 39–54

outliers

defined 40

scrubbing data of 255

out-of-sample and in-sample data sets, measures to compare 288–290, 302–303

output layer, neural network 246–247

overfitting the model/data

about 303–305

neural network 250–254

train-validation-test paradigm to avoid 299–317

Overlap drop zone 65–66

Overlay Plot command 198–200

P

Pairwise Correlations, Multivariate command 164–165

parallel coordinate plots, k-means clustering 204–206

Parameter Estimates, Odds Ratios command 142

parsimony, principle of 99, 148

Partition command 235–237

partition initial output, decision tree 215–216, 224–225

PCA

See Principal Component Analysis

penalty fit method 254

phrasing stage, of text mining 329–330

PivotTable, Excel 59–61

Plot Residual by Predicted 97–98, 261

PPAR (plan, perform, analyze, reflect) cycle 7–8

practical statistical study 6–9

prediction task, predictive analytics 369

predictions, making 236–237, 242

predictive analytics

about 347–348

definition 4, 365

framework 367–369

goal 369–370

logistic regression 349–353

model development and evaluation phase 371–372

multivariate data analysis role in 365–366

phases 369–370

primary analysis 348–349

tasks of discovery 369–370

text mining and 321

vs. statistics 370–371

predictive modeling

See predictive analytics

primary analysis 348–349

Principal Component Analysis (PCA)

dimension reduction 159, 165–167

eigenvalue analysis of weights 163–164

example 159–163

structure of data, insights into 167–171

probabilities

estimating for logistic regression 145

relationship to odds 138

probability formula, saving 145

proportion of variation method, PCA 167, 170

pruning variables in decision tree 222–223

p-values, hypothesis testing 21

R

random forests

See bootstrap forest

random sample 12–13

Range Odds Ratios, Odds Ratios command 142

Receiver Operating Characteristic (ROC) curve 223–224, 291–294

regression

See also logistic regression

categorical variables 101–107

clusters 196–197

fitting to the model 88–89, 89–92, 147, 153

linear 88–107, 132–133, 153

multiple 89–101

purposes 88

simple 88–89

stepwise 98–101, 148, 299–302

regression imputation 47

regression models, k-nearest neighbors algorithm 239–242

regression trees

about 224–230

performing Boosted Trees for 278

performing Bootstrap Forest for 274

relative absolute error 287

relative squared error 287

repeated measures ANOVA 107

representative sample 12–13

residuals

linear regression 153

multiple regression 97–98

return on investment (ROI) from data collection 2

ridge regression

about 173

JMP and 176–179

techniques and limitations of 175–176

vs. LASSO 180

robust fit method 254

ROC (Receiver Operating Characteristic) curve 223–224, 291–294

root mean square error (RMSE/se) measure 100–101, 163, 240–242, 286–287

RSquare or R2 (coefficient of determination) 88

S

sampling

in-sample and out-of-sample data sets 288–290, 302–303

one-sample hypothesis testing 20–21

principles 13–14, 15–16

SAS JMP statistical software application

See also specific screen options and commands

as used in book 8–10

deciding on best statistical technique 23–32

features to support predictive analytics 369–370

opening Excel files 23–24

saturated model, logistic regression 147

scales for standardizing data, neural network 255

scatterplot matrix 68–71, 90–91, 148, 150, 160–161

score plot 162–163

scree plot

hierarchical clustering 193

PCA 166, 167, 168

se (RMSE) 100–101, 163, 240–242, 286–287

Selection button, copying output 63

SEMMA approach 371–372

sensitivity component of prediction model 291

Show Split Count, Display Options 219

Show Split Prob, Display Options 217

simple regression 88–89

single linkage method, distance between clusters 188

singular value decomposition (SVD) 338–341

sorting data, Graph Builder 84–86

specificity component of prediction model 291

Split command, decision tree variables 218–221

squared penalty fit method 254

SSBG (sum of squares between groups) 107

SSE (sum of squares error) 107, 198–200, 206–207, 240–242

standard error 16

standardized beta coefficient (Std Beta) 95–96

standardized units of measurement 234–235

statistical assumptions, testing for one-way ANOVA 110–114

statistics coursework

central limit theorem 15–20

coverage and real-world limitations 4–6

effective vs. ineffective approaches 22–32

one-sample hypothesis testing and p-values 20–21

sampling principles 13–14, 15–16

statistics as inexact science 13–14

Z score/value 14, 20–21

statistics vs. predictive analytics 370–371

Std Beta, Fit Model command 95–96

stemming 324

stepwise regression 98–101, 149, 299–302

stop words 330–331

Subset option, Table in Graph Builder 83–84

sum of squares between groups (SSBG) 107

sum of squares error (SSE) 107, 198–200, 206–207, 240–242

supervised (directed) predictive analytics techniques 367–369, 369–370

support 357

T

tables 59–64

Tabulate command 62–64

terming stage, of text mining 330–331

terms

adding frequent phrases to list of 335–336

defined 326

grouping 334–335

identifying dominant 346–347

parsing 336–337

testing for differences, one-way ANOVA 115–121

testing statistical assumptions, one-way ANOVA 110–114

Tests that the Variances are Equal report 110

Text Explorer dialog box 324–325

text mining

introduction 319–320

phrasing stage 329–330

terming stage 330–331

tokenizing stage 323–329

unstructured data 320–321

ties, in k-nearest neighbors algorithm 234

time series, Durbin-Watson test 98

tokenizing stage, of text mining 323–329

topic analysis 342

total sum of squares (TSS) 107

train-validate-test paradigm for model evaluation 299–317

Transform Covariates, neural network 255

trellis chart 71–73, 80–83

true positive rate (TPR) component of prediction model 291

TSS (total sum of squares) 107

t-test 8, 97, 118

Tukey HSD test 118, 120

Tukey-Kramer HSD test 118, 120

2R (representative and random) sample 13–14

two-way/two-factor ANOVA 122–128

U

unequal replication design, ANOVA with 122

Unequal Variances test, ANOVA 110–111, 115

Unit Odds Ratios, Odds Ratios command 142

univariate analysis 4

unstructured data 320–321

unsupervised (undirected) predictive analytics techniques 367–369, 369–370

V

validation

boosted trees 278–283

logistic regression 157–158

neural network 250–252, 260–265

train-validate-test paradigm 299–317

Value Ordering, Column properties 142–143

variable removal method 45, 46

variables

See also categorical variables

See also continuous variables

binary dependent variable 132, 263–265, 290–296

decision tree 214–224, 225–228

dummy 47, 255, 279–280

in data sets 34–35

model building 147–150

multicollinearity 98

neural network 252

reclassifying 148

weighting 163–164, 254

variance inflation factor (VIF) 98

W

Ward's method, distance between clusters 187–188

weak classifier, boosting option 253

weight decay penalty fit method 254

weighting of variables 163–164, 254

Welch's Test 110–111, 114, 115

Whole Model Test 146–147, 151–152

within-sample variability 107

without replication design, ANOVA 122

word clouds 331, 333–334

Wrap drop zone 65–66

Z

Z score/value 14, 20–21
