Index
A
Actual by Predicted Plot report 89, 95, 162, 168–169, 171–172, 197, 213
actual values, compared with predicted values 96–99, 170–172, 179–180
AGPT, transforming 148–150
analysis, performing 79–81, 156–158, 175–176
Andersson, M. 222
applied statistics 1
Arlot, S. 67
assessing models using test set 133–136
B
Baking Bread That People Like example
about 183–184
combined model 215–219
data 184–186
first stage model 187–202
second stage model 202–215
Belsley, D.A. 22
Bias-Variance Tradeoff in PLS
about 250, 261
examples 250–253
motivation 254–255
results and discussion 257–261
simulation study 255–257
BigClassCVDemo.jmp 67–68
bivariate distributions 205, 207
Blue Ridge Ecoregion, PLS models for 173–180
bradykinin potentiating activity 76, 103
Bread.jmp 185–186, 203, 215–219
C
Cars example 11–15
CarsSmall.jmp 11–15
Cauchy-Schwarz Inequality 233
Celisse, A. 67
centering
data 3–4
example of 28–31
in PLS 37–38
Chakrapani, C. 184
Chaloud, D.J. 140, 142, 155, 167, 179, 181, 182
Chatfield, C. 2
Chong, I.G. 267
choosing number of factors 65–71
Coefficient Plots 131–132, 181, 212
column vector 12–15
combined model, for Baking Bread That People Like example 215–219
Compare_NIPALS_VIP_and_VIPStar.jsl script 264
ComparePLS1andPLS2.jsl script 262–263
comparing
actual values to predicted values 96–99, 170–172, 179–180
residuals 99–101
stage one models for Baking Bread That People Like example 200–202
variable selection error rates 273–280
VIP values 270–273
confidence ellipses
Scatterplot Matrix 188
X Score Scatterplot Matrix 87
correlation
effect of among predictors 18–23
for Ys and Xs in PLS 38
with factors, loadings as measures of 235–236, 243
structure of Xs 264–266
Correlations report 189
covariance 32, 232–233
creating
formula 148
plots of individual spectra 111–112
stacked data tables 109–111
subsets 173–174
test set indicator columns 107–108
cross validation 18, 65–66, 66–67, 246–248
See also k-fold cross validation
See also leave-one-out cross validation
D
data
Baking Bread That People Like example 184–186
centering 3–4
contextual nature of 2
diversity of 2
imputing missing 146–147
initial visualization 77–79
performance on 96–104
Predicting Biological Activity example 76–79
Predicting Octane Rating of Gasoline example 106–108
reviewing 174–175
scaling 3–4
transforming 3–4
viewing 108–116
Water Quality in Savannah River Basin example 140–141
data filter 46–48, 57, 113–116, 124–125, 257
de Jong, S. 72, 237
deflation 225
design matrix 14
diagnostics
Predicting Biological Activity example 95–96
Predicting Octane Rating of Gasoline example 121–125
pruned PLS model for Savannah River basin 168–169
Diagnostics Plots 88–89, 122, 160–162, 193–194, 210–211
differences, by ecoregion 150–155
Dijkstra, T.K. 71
dimensionality reduction 34–36
DimensionalityReduction.jsl script 31–34, 34–36
dimensions, of matrices 15
Distance Plots 96, 123
distances, to X and Y models 242
distributions, Water Quality in Savannah River Basin example 147–148
Draper, N.R. 15
E
eigenvalues 26–27, 222–224, 233
eigenvectors 27, 223–224
Eriksson, L. 4, 28, 89, 96, 163
examples
See also specific examples
Bias-Variance Tradeoff in PLS 250–253
Cars 11–15
centering 28–31
of PLS analysis 4–10
prediction 59–64
scaling 28–31
scores and loadings 54–55
excluding test set 116–117
expected value 19
exploratory data analysis, in multivariate studies 31–34
extracting factors 50–51
F
factor loadings 52, 224
factors
choosing number of 65–71
determining number of, using cross validation 246–248
extracting 50–51
3-D scatterplots for one 46
"fat matrices" 106
Fidell, L.S. 48
fingerprint 42
first principal component 26, 36
Fit Model Launch Window 5–6
Fit Model Platform 15
fitting PLS models 117–118, 120, 136–137, 190–195, 207–208
formula editor window 217
Friedman, J. 23, 39, 54, 250
G
Gasoline.jmp 106–116
generative Gaussian process 252
H
Hastie, T. 23, 39, 54, 250
Hellberg, S. 76, 77
Heuver, G. 76, 77, 79, 103
holdout cross validation
method 66
set 247
Höskuldsson, A. 232
I
identity matrix 16
imputing missing data 146–147
initial data visualization
Predicting Biological Activity example 77–79
Water Quality in Savannah River Basin example 144–145
initial reports 118–120
inner relation regression 52
inputs 1
inverse of a square matrix 16
J
JMP customizations for SIMPLS 241
JMP Pro
Fit Model launch 5–6
KFold validation 7
Validation Methods 69–70
Johansson, E. 4, 28, 89, 96, 163
Jun, C.H. 267
K
Kalivas, J.H. 106
Kettaneh-Wold, N. 4, 28, 89, 96, 163
k-fold cross validation 66, 67–68
KFold Cross Validation report 119, 167–168, 208, 209
Kourti, T. 126
L
leave-one-out cross validation 66–67, 190–192, 208–209
Leave-One-Out report 190–192, 208–209
left singular vectors 222
linked subset 174
loading matrix 52, 83–85
loading plots 27, 85–86, 127–128
loadings
as measures of correlation with factors 235–236, 243
PLS 50–59
properties of 232
LoWarp.jmp 28–31
lurking variables 103
M
MacGregor, J.F. 126
Make Model Using Selection 92, 93, 130, 164, 166, 178, 212
Make Model Using VIP 92, 130, 136, 195
Mason, R.L. 126
Mateos-Aparicio, G. 71
matrices
dimensions of 15
"fat matrices" 106
identity 16
loading 52, 83–85
scatterplot 83–87, 125–127
singular value decomposition of 222–223
matrix algebra 222
maximization of covariance 232–233
mean squared prediction error (MSPE) 257–261
Microsoft Research 2
missing response values, Water Quality in Savannah River Basin example 145–146
Missing Value Imputation report 147, 159–160
MLR
See multiple linear regression (MLR)
Model Coefficients report 193, 210
Model Comparison Summary report 119, 158–159, 191–192, 208, 209
model fitting
for Baking Bread That People Like example 195–197, 197–199, 212–215
PLS model for Blue Ridge Ecoregion 178–180
pruned PLS model for Savannah River Basin example 166–168
modeling 1–2
models
assessing using test set 133–136
fitting 117–118, 120, 136–137, 207–208
in terms of X scores 52, 228–229, 241
testing 9–10
for X and Y 52–53, 228–229, 241
MSPE (mean squared prediction error) 257–261
multicollinearity 18–23
Multicollinearity.jsl script 18–23, 263–264
multiple linear regression (MLR)
Cars example 11–15
effect of correlation among predictors 18–23
estimating coefficients 15–16
overfitting 16–18
underfitting 16–18
multivariate studies, exploratory data analysis in 31–34
multivariate technique, PLS as a 38–39
N
Nash, M.S. 140, 142, 155, 167, 179, 181, 182
NIPALS algorithm
about 71–72, 226–228
computational results 228–231
extracting factors 50
models in terms of X scores 52
models in terms of Xs 53
notation 225–226
one-factor model 60–63
properties of 231–237
two-factor model 63–64
NIPALS Fit report 159–160, 176–178
NIPALS Fit with 1 Factors report 158–163, 191–192, 193, 196
NIPALS Fit with 2 Factors report 208, 209, 247–248
NIPALS Fit with 3 Factors report 7–8, 120
noise 14
Nomikos, P. 126
nonlinear iterative partial least squares algorithm
See NIPALS algorithm
notation
for NIPALS algorithm 225–226
for SIMPLS algorithm 238–240
number of factors 246–248
O
O'Mahony, M. 184
opening formula editor window 217
optimization criterion, SIMPLS algorithm 237
outputs 1
overfitting 16–18
P
parameters 15
partial least squares (PLS)
See also variable selection
about 1–2
algorithms 224–225
analysis example 4–10
as a multivariate technique 38–39
centering in 37–38
compared with PCA 49–50
how it works 45–49
loadings 50–59
models 155–181
models for Blue Ridge Ecoregion 173–180
models for predicting biological activity 79–96
models for predicting octane ratings of gasoline 116–138
overview 72–73
reasons for using 39–45
report 44, 81–82, 158–159, 191–192, 208–212
scaling in 37–38
scores 50–59
in today's world 2–3
variable reduction in 89–90
Partial Least Squares Model Launch window 7
Partial Least Squares report 158–159, 191–192, 208–212
PCA
See principal components analysis (PCA)
PCA platform 27
PCR (Principal Components Regression) 39, 223–224
Penta.jmp 76–77
Percent Variation Explained for X Effects 230, 242
Percent Variation Explained for Y Responses 230, 242
Percent Variation Explained report 121, 137, 192, 209
Pérez-Enciso, M. 89
performing analysis 79–81, 96–104, 156–158, 175–176
plots
construction for individual spectra 111–112
diagnostics 88–89, 122
loading 27, 85–86, 127–128
variable importance 90–93
PLS
See partial least squares (PLS)
PLS platform 69–71
PLS procedure 77
PLS Report 81–82
PLS1 models 222
PLS2 models 222
PLSGeometry.jsl script 45–49
PLS_PCA.jsl script 49–50
PLSScoresAndLoadings.jmp 54–55
PLSvsTrueModel.jmp 59–60
PolyRegr.jsl script 16–18
PolyRegr2.jsl script 250–253
Predicted Residual Sum of Squares (PRESS) statistic 246–248
predicted values, compared with actual values 96–99, 170–172, 179–180
Predicting Biological Activity example
about 75–76
data 76–79
first PLS model 79–93
performance on data from second study 96–104
pruned PLS model 93–96
Predicting Octane Rating of Gasoline example
about 106
data 106–108
first PLS model 116–120
pruned model 136–138
second PLS model 120–136
viewing data 108–116
prediction
example using simulation 59–64
formulas, saving 8–9, 60–64, 169–170
Prediction Profiler 201–202, 214–215, 218–219
predictors, effect of correlation among 18–23, 59–64
PRESS (Predicted Residual Sum of Squares) statistic 246–248
principal components 224
principal components analysis (PCA)
about 25–27, 223–224
compared with PLS 49–50
dimensionality reduction via 34–36
Principal Components Regression (PCR) 39, 223–224
Profiler
comparing via the 201–202
viewing 213–215, 218–219
projection method 48
projection to latent structures 48
properties
of loadings 232
of NIPALS algorithm 231–237
of scores 232
of SIMPLS algorithm 237–238
shared by NIPALS and SIMPLS 53–54
R
regression
inner relation in PLS 52
stepwise 263
regression coefficients 15, 130, 234–235
regression parameters 12
regularization techniques 23
reports
Actual by Predicted Plot 89, 95, 162, 168–169, 171–172, 197, 213
Coefficient Plots 131–132, 181, 212
Diagnostics Plots 88–89, 122, 160–162, 193–194, 210–211
Distance Plots 96, 123
initial 118–120
KFold Cross Validation 119, 167–168, 208, 209
Leave-One-Out 190–192, 208–209
Loading Plots 83–86, 127–128
Missing Value Imputation 147, 159–160
Model Coefficients 193, 210
Model Comparison Summary 119, 158–159, 191–192, 208, 209
NIPALS Fit with 1 Factors 158–163, 191–192, 193, 196
NIPALS Fit with 2 Factors 208, 209, 247–248
NIPALS Fit with 3 Factors 7–8, 120
Partial Least Squares (PLS) 44, 81–82, 158–159, 191–192, 208–212
Percent Variation Explained 121, 137, 192, 209
Profiler 201–202, 213–215, 218–219
Residual by Predicted Plot 89, 95, 122, 168, 197
Score Scatterplot Matrices 86–87, 125–127
SIMPLS Fit with 2 Factors 82–83
Stepwise Regression Control 198–199
T Square Plot 123, 160–161
Variable Importance Plot 44–45, 90–91, 129, 131–132, 165
VIP vs Coefficients Plots 91–93, 130–132, 136, 163–166, 177–178, 194–195, 212
X-Y Scores Plots 82–83, 120–121, 159, 192, 196, 209
Residual by Predicted Plot report 89, 95, 122, 168, 197
residuals
about 14–15, 34
comparing 99–101
right singular vectors 223
RMSE (Root Mean Square Error) 17–18, 67–68
Root Mean PRESS (Predicted Residual Sum of Squares) statistic 69, 119–120, 167, 192, 209, 246–249
Root Mean Square Error (RMSE) 17–18, 67–68
Rose, David 184
S
SAS/STAT 9.3 User's Guide 77, 248
saving prediction formulas 8–9, 96, 133, 169–170, 248
scaling
data 3–4
example of 28–31
in PLS 37–38
scatterplot matrices
loading matrix 83–86
scoring 86–87, 125–127
score vectors 224
scores
PLS (partial least squares) 50–59
properties of 232
Score Scatterplot Matrices report 86–87, 125–127
second principal component 26
Sensory Evaluation of Food: Statistical Methods and Procedures (O'Mahony) 184
SIMPLS algorithm
about 71–72, 237, 240–246
extracting factors 50
fits 64
implications for 237–238
models in terms of Xs 53
notation 238–240
optimization criterion 237
SIMPLS Fit report 82–83
simulation studies
about 249–250
Bias-Variance Tradeoff in PLS 250–261
overfitting 16–18
underfitting 16–18
using PLS for variable selection 263–280
Utility Script to Compare PLS1 and PLS2 261–263
singular value decomposition of a matrix 222–223
singular values 222–223
Sjöström, M. 76, 77
Smith, H. 15
Solubility.jmp 25–27, 34
Spearheads.jmp 4–5, 66
spectra
combined 113–116
constructing plots of individual 111–112
individual 112–113
spectral decomposition, relationship to singular value decomposition 223
SpectralData.jsl script 40–45
SS(YModel) 242
stacked data tables, creating 109–111
stage one MLR model, for Baking Bread That People Like example 197–200
stage one pruned model, for Baking Bread That People Like example 195–197
stage two MLR model, for Baking Bread That People Like example 212–215
stage two PLS model, for Baking Bread That People Like example 207–208
Standardize X option 246
statistical models 1–2
Statistically Inspired Modification of the PLS Method
See SIMPLS algorithm
Statistics in Market Research (Chakrapani) 184
stepwise regression 189, 263
Stepwise Regression Control report 198–199
stratified sample, creating 155–156
subsets, creating 173–174
sum of squares
for contribution of factor f to X model 230
for factor f to Y model 229
for X 242
for Y 242
T
T Square Plot 123, 160–161
Tabachnick, B.G. 48
Tenenhaus, M. 89
test set
about 5
assessing models using 133–136
creating indicator columns 107–108
creating stratified sample 155–156
excluding 116–117
testing models 9–10
3-D scatterplots, for one factor 46
Tibshirani, R. 23, 39, 54, 250
Tobias, R.D. 38
Tracy, N.D. 126
training set 5, 65
transforming
creating a column formula 148–149
through a launch window 148–150
weights 236–237
transforming data 3–4
transpose 16
Trygg, J. 4, 28, 89, 96, 163
U
Ufkes, J.G.R. 76, 77, 79, 103
underfitting 16–18
univariate distributions 204
Utility Script to Compare PLS1 and PLS2 261–263
V
validation
k-fold cross validation 66, 67–68, 119, 167–168, 208, 209
leave-one-out cross validation 66–67, 190–192, 208–209
in PLS platform 69–71, 246–249
validation set 65
van den Wollenberg, A.L. 39
van der Meer, C. 76, 77, 79, 103
van der Voet, H. 69, 119
van der Voet tests 69–70, 137, 167–168, 248
Variable Importance for the Projection (VIP) statistic
See VIP (Variable Importance for the Projection) statistic
Variable Importance Plot report 44–45, 90–91, 129, 131–132, 165
variable selection
about 64, 189, 263–264, 280
comparing error rates in simulation study 273–280
computation of result measures for simulation study 268–270
results of simulation study 270–280
simulation 267–268
structure of simulation study 264–267
variables
comparing selection error rates 273–280
lurking 103
reduction in PLS 89–90
relationships between 187–188
visualizing two at a time 152–154
variance, bias toward X directions with high variance 234
viewing
data 108–116
Profiler 201–202, 213–215, 218–219
VIPs for spectral data 131
VIP (Variable Importance for the Projection) statistic
about 129–133
comparing values 270–273
for ith predictor 230–231, 243–244
variable reduction in PLS 89
viewing for spectral data 131
VIP vs Coefficients Plots report 91–93, 130–132, 136, 163–166, 177–178, 194–195, 212
VIP* 231, 244–245, 268–273, 280
Visser, B.J. 76, 77, 79, 103
visualizing
data 77–79
two variables at a time 152–154
Ys and Xs 202–207
W
Water Quality in Savannah River Basin example
about 140–141
data 141–155
defined 140
first PLS model 155–166
pruned PLS model 166–172
WaterQuality2.jmp 155–156
WaterQuality2_Train.jmp 156–158
WaterQuality_BlueRidge.jmp 174–175
WaterQuality.jmp 140–141
WaterQuality_PRESSCalc.jmp 247–248
weights, transforming 236–237
Wikström, C. 28
Wold, H. 71
Wold, S. 4, 28, 71, 76, 77, 89, 96, 129, 163, 231
Wynne, H.J. 76, 77, 103
X
X
Active, in simulation 267
correlation structure of, in simulation 264–266
models for 52–53, 241
models in terms of scores 50–52
properties of weights 232
sums of squares for 242
X-Y Scores Plots report 82–83, 120–121, 159, 192, 196, 209
Y
Young, J.C. 126
Y
models for 52–53, 241
sums of squares for 242
Symbols
* matrix multiplication 12–13
β column vector of regression parameters 12–13, 15–16
ε column vector of errors 12–13, 15–16
Σ correlation matrix 38–39