Application

What Does Bootstrapping Look Like?

Now that we are familiar with the syntax, let us review an example of applying bootstrap resampling. Remember that EFA is a complex procedure with many estimated quantities: it can have many communalities, eigenvalues, and factor loadings. So although the computation does not take terribly long on modern computers (it took our computer about 2 minutes to run 2000 bootstrap analyses and save them to a data file), the part of the process that takes the longest is examining all the different parameters one wants to examine.
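Before walking through the SAS syntax again, it may help to see the resampling idea itself in miniature. The sketch below (ours, in Python, with made-up numbers, not the book's engineering data) draws 2000 resamples of the same size as the original sample, with replacement, which is the same logic PROC SURVEYSELECT applies with method=URS and samprate=1.

```python
# Illustrative sketch only (hypothetical data): unrestricted random
# sampling (URS) draws each resample with replacement, so some cases
# appear multiple times in a replicate and others not at all.
import random

def bootstrap_replicates(data, n_reps, seed=1):
    """Return n_reps resamples, each the same size as `data`,
    drawn with replacement from it."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_reps)]

# A tiny stand-in sample; in the chapter, each replicate would instead be
# passed through PROC FACTOR and its eigenvalues, loadings, and
# communalities saved per replicate.
sample = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7]
reps = bootstrap_replicates(sample, n_reps=2000)
```

Each of the 2000 replicates is then analyzed identically, and the distribution of each statistic across replicates becomes the basis for the bootstrap CIs.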
The analyses in Chapter 3 indicated that the engineering data had two small, clear factors. The sample was relatively small, however, so we could ask whether the two-factor model is likely to replicate. With such a small sample, we would be hesitant to split it and perform replication analyses as in Chapter 6; bootstrap analysis provides another means of addressing this question. Other questions might concern whether the estimated communalities and factor loadings fall within a reasonable range for each of the variables. Let’s take things one step at a time. To perform this bootstrap analysis, we used syntax similar to that reviewed above.
Eigenvalues and variance. First, we examined the initial eigenvalues that were extracted to evaluate the replicability of the factor structure via the Kaiser Criterion (the eigenvalue > 1 rule, which is the simplest criterion to apply in this analysis). Based on our previous analyses, we expect two factors to be extracted. Thus, by examining the first three factors extracted, we can explore the likelihood that this basic structure would replicate in a similar sample. The syntax used to perform these first few steps is presented below. Notice that we also request additional ODS tables (ObliqueRotFactPat and FinalCommun) for use in our subsequent analyses.
**Conduct bootstrap resampling from the data set;
proc surveyselect data = engdata  method = URS  samprate = 1  outhits 
   out = outboot_eng (compress=binary)  seed = 1  rep = 2000;
run;

**Run model on bootstrap estimates;
ods output Eigenvalues=boot_eigen (compress=binary) 
      ObliqueRotFactPat=boot_loadings (compress=binary) 
      FinalCommun=boot_commun (compress=binary); 
proc factor data = outboot_eng  nfactors = 2  method = prinit  
      priors = SMC rotate = OBLIMIN;
   by replicate;
   var EngProb: INTERESTeng: ;
run;
ods output close;

**Estimate CI for Eigenvalues and corresponding variance;
proc sort data=boot_eigen nodupkey; by Number replicate; run;
proc univariate data=boot_eigen;
   by Number;
   var Eigenvalue Proportion;
   Histogram;
   output out=ci_eigen pctlpts=2.5, 97.5 mean=Eigenvalue_mean 
      Proportion_mean std=Eigenvalue_std Proportion_std 
      pctlpre=Eigenvalue_ci Proportion_ci;
run;
Figure 7.4 displays a histogram of the bootstrap distribution of the first eigenvalue extracted. The average initial eigenvalue across the resamples is 7.4, but it ranges from 6.1 to 8.8. We compute the 95% CI directly from this distribution by identifying the eigenvalues that fall at the 2.5th and 97.5th percentiles, yielding estimates of 6.7 and 8.2. Each of the remaining CIs is calculated in a similar manner. Table 7.2 displays a summary of our results for the first three eigenvalues extracted (taken from the data set ci_eigen, output by the UNIVARIATE procedure).
Figure 7.4 Distribution of first eigenvalue extracted over 2000 bootstrap analyses
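The percentile computation that PROC UNIVARIATE performs here is simple enough to sketch directly. The following Python fragment is our illustration, not SAS code; the rule shown is the nearest-rank percentile definition, one of several that SAS supports via PCTLDEF=.

```python
# Percentile-method bootstrap CI: sort the bootstrap estimates and read
# off the values at the 2.5th and 97.5th percentiles (what PROC
# UNIVARIATE's pctlpts=2.5, 97.5 request does).
def percentile_ci(estimates, lower=2.5, upper=97.5):
    s = sorted(estimates)
    n = len(s)
    def pctl(p):
        # nearest-rank percentile; SAS offers several definitions (PCTLDEF=)
        k = max(0, min(n - 1, round(p / 100 * n) - 1))
        return s[k]
    return pctl(lower), pctl(upper)
```

For example, percentile_ci(list(range(1, 1001))) returns (25, 975), the values at the 2.5th and 97.5th percentiles of the integers 1 through 1000.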
Table 7.2 Bootstrap results for the first three eigenvalues extracted

           Mean Eigen   95% CI          Mean % Variance   95% CI
Factor 1   7.41         (6.67, 8.20)    68.90%            (63.83%, 74.10%)
Factor 2   3.27         (2.75, 3.80)    30.46%            (25.30%, 35.43%)
Factor 3   0.24         (0.16, 0.33)    2.20%             (1.44%, 3.09%)
Note: Iterated PAF extraction with oblimin rotation was used.
The first three eigenvalues are congruent with our initial analysis in Chapter 3. The third eigenvalue never comes close to 1.0 in any bootstrapped data set. In addition, the 95% CI for the variance indicates that the third eigenvalue accounts for only between 1.4% and 3.1% of the total variance. The two-factor solution is therefore strongly supported, and we can conclude it is reasonable to expect this factor structure to be found in a similar data set.
Communalities and factor loadings. Next, we examined the relative stability of the shared item-level variance and the partitioning of the variance between factors. The syntax to estimate the CI of the communalities and the factor loadings from the boot_commun and the boot_loadings data sets output by the ODS system is presented below.
**Estimate CI for Communalities and Pattern Loading Matrix;
*transpose communality so can use BY processing in proc univariate;
proc transpose data=boot_commun  out=boot_commun_t  prefix=communality 
      name=Variable;
   by replicate;
run;
proc sort data=boot_commun_t; by Variable Replicate; run;
proc univariate data=boot_commun_t;
   by Variable;
   var Communality1;
   output out=ci_commun pctlpts=2.5, 97.5 mean=mean std=std pctlpre=ci;
run;

**Re-Order results on factors using the alignFactors macro we built;
*Include alignFactors macro syntax for use below;
filename align 'C:\alignFactors_macro.sas';
%include align;

*Run Macro and estimate CI;
%alignFactors(orig_loadings,boot_loadings,aligned_loadings,summary);	
proc sort data=aligned_loadings; by VariableN; run;
proc univariate data=aligned_loadings;
   by VariableN;
   var Factor1;
   output out=ci1_loadings pctlpts=2.5, 97.5 mean=mean std=std pctlpre=ci;
run;
proc univariate data=aligned_loadings;
   by VariableN;
   var Factor2;
   output out=ci2_loadings pctlpts=2.5, 97.5 mean=mean std=std pctlpre=ci;
run;
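A note on why the alignFactors macro is needed: across bootstrap resamples, nothing forces "Factor 1" in one replicate to correspond to "Factor 1" in another. Factor order can swap, and an entire loading column can come out with its signs reversed. The sketch below is a hypothetical Python rendering of that alignment idea, not the macro's actual code: each bootstrap loading column is matched to the original column it is most congruent with, then sign-flipped to agree.

```python
# Align a bootstrap replicate's loading columns with the original
# solution's columns (illustrative sketch; matrices are stored as one
# list per factor, i.e., factors x variables).
def align(orig, boot):
    """Reorder and sign-flip boot's factor columns to match orig's."""
    used, aligned = set(), [None] * len(orig)
    for b in boot:
        # congruence (cosine similarity) of this boot column with each
        # original column
        sims = []
        for o in orig:
            dot = sum(x * y for x, y in zip(o, b))
            norm = (sum(x * x for x in o) * sum(y * y for y in b)) ** 0.5
            sims.append(dot / norm if norm else 0.0)
        # best unclaimed original factor, judged by absolute congruence
        best = max((i for i in range(len(orig)) if i not in used),
                   key=lambda i: abs(sims[i]))
        used.add(best)
        # flip the column's sign if it points the opposite way
        sign = 1.0 if sims[best] >= 0 else -1.0
        aligned[best] = [sign * x for x in b]
    return aligned
```

Without this step, averaging "Factor 1" loadings over 2000 replicates would mix loadings that belong to different factors, producing meaningless CIs.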
In Table 7.3, we present the communalities and pattern matrix factor loadings, along with their corresponding bootstrap CIs. In general, the communalities are relatively strong: they consistently range from 0.67 to 0.84, and their 95% confidence intervals are reasonably narrow. These results lead us to expect that a replication in a comparable sample would extract similarly strong communalities.
Table 7.3 Bootstrap results for communalities and factor loadings

           Communalities          Factor 1 Pattern       Factor 2 Pattern
           ---------------------  ---------------------  ---------------------
Var:       Coeff.  95% CI         Coeff.  95% CI         Coeff.  95% CI
EngProb1   0.73    (.66, .79)     0.86    (.81, .90)     -0.02   (-.06, .04)
EngProb2   0.67    (.58, .75)     0.84    (.77, .90)     -0.07   (-.13, -.01)
EngProb3   0.77    (.70, .83)     0.88    (.84, .91)     -0.01   (-.05, .04)
EngProb4   0.81    (.75, .86)     0.91    (.87, .94)     -0.03   (-.06, .02)
EngProb5   0.80    (.74, .85)     0.89    (.84, .92)     0.02    (-.02, .07)
EngProb6   0.77    (.70, .83)     0.87    (.82, .90)     0.02    (-.03, .07)
EngProb7   0.78    (.71, .83)     0.87    (.82, .91)     0.03    (-.02, .09)
EngProb8   0.67    (.59, .74)     0.79    (.72, .84)     0.07    (.01, .14)
INTeng1    0.67    (.56, .77)     0.04    (-.02, .11)    0.80    (.71, .87)
INTeng2    0.83    (.76, .89)     -0.02   (-.07, .02)    0.92    (.87, .96)
INTeng3    0.84    (.79, .89)     -0.01   (-.05, .02)    0.92    (.89, .95)
INTeng4    0.82    (.68, .91)     0.00    (-.04, .04)    0.90    (.82, .96)
INTeng5    0.80    (.72, .86)     -0.01   (-.05, .04)    0.90    (.85, .93)
INTeng6    0.75    (.66, .83)     0.01    (-.03, .06)    0.86    (.81, .91)
Note: Iterated PAF extraction with oblimin rotation was used. 95% CIs in parentheses. Factor loadings highlighted are those expected to load on the factor.
The factor loading results show similar trends. Every variable loads strongly on its intended factor, and the corresponding CIs are relatively narrow. These results suggest that our estimates are likely to replicate. Overall, these results strongly support our solution as one that is likely to generalize.
Interpretation of CI. In general, two things are taken into account when interpreting the CIs around these statistics: the relative magnitude of the statistics captured in the CI and the relative range covered by the CI. This differs from the more commonplace use of CIs for null hypothesis testing. When evaluating null hypotheses, we often look to see whether our CIs contain zero, which would indicate there is not a significant effect or difference. In the world of EFA, however, there are very few tests of statistical significance, and just above zero is often not good enough. An eigenvalue of 0.5 is frequently not interesting, but a factor loading of 0.5 might at least pique our interest. Similarly, factor loadings of 0.1 or 0.2 are often just as uninteresting as loadings of zero. Thus, we are instead interested in the relative magnitude of the effects captured within the CI. In addition, if the range of magnitudes covered is broad, this could suggest a large amount of sampling error, meaning our results might not reflect those in the population.

Generalizability and Heywood Cases

As we have reviewed above, bootstrap resampling helps us answer questions of reliability and generalizability. In this example, we will examine the results from a relatively small sample of the Marsh SDQ data and compare them to our larger data set. We will treat the analysis of approximately 16,000 students as the “gold standard” or population factor structure, and we will see whether bootstrapped CIs from our small sample can recover the “population” parameters with reasonable accuracy. A random sample of 300 cases was selected and then subjected to bootstrap resampling and analysis (with maximum likelihood extraction and direct oblimin rotation).
Imagining that this small sample was our only information about this scale, we started by performing an EFA on this sample only. The syntax to select the subsample and run the basic EFA is presented below.
**Select subsample from SDQ data;
proc surveyselect data=sdqdata  method=srs  n=300  out=sdqdata_ss1
      seed=39302;
run;
**Look at factor structure in subsample;
proc factor data=sdqdata_ss1  nfactors=3  method=ml  rotate=OBLIMIN;
   var Math: Par: Eng:;
run;
Three eigenvalues exceeded 1.0, which was corroborated by a MAP analysis that also recommended extracting three factors. Thus, if this were our only sample, theory, the Kaiser Criterion, and MAP analysis would all lead us to extract three factors.
Heywood cases. We then performed bootstrap resampling and examined the eigenvalues and pattern matrix factor loadings to assess the relative stability of our solution. Again, we drew 2000 resamples and replicated our analysis in each. The syntax used to draw the resamples and run the EFA on each is presented below.
**Conduct bootstrap resampling from the data set;
proc surveyselect data=sdqdata_ss1  method=URS  samprate=1  outhits 
      out=outboot_sdq (compress=binary)  seed=1  rep=2000;
run;

**Notice occurrence of Heywood cases;
ods output Eigenvalues=boot_eigen
      ObliqueRotFactPat=boot_loadings; 
proc factor data=outboot_sdq  nfactors=3  method=ml  rotate=OBLIMIN ;
   by replicate;
   var Math: Par: Eng:;
run;
ods output close;
However, after we ran PROC FACTOR to replicate our analyses, the following error message appeared in our log in three different places.
Figure 7.5 Error in Log
The above message indicates that some of the communality estimates in three resamples exceed 1. Logically, this should not happen. Remember that communalities represent the amount of shared variance that can be extracted from an item; a communality greater than 1 is equivalent to more than 100% of the variance! Although it is not logical, it is occasionally possible because of some mathematical peculiarities of the common factor model. A communality estimate that exceeds 1 is called an ultra-Heywood case, and a communality estimate equal to exactly 1 is called a Heywood case (Heywood, 1931). Either case can be problematic and can indicate that our model might be misspecified (e.g., wrong number of factors) or that we might have insufficient cases to produce stable estimates.
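To make the definition concrete: in an orthogonal solution, a variable's communality is the sum of its squared loadings, so screening replicates for Heywood and ultra-Heywood cases amounts to a simple scan. The sketch below is our illustration, not SAS's internal check; note that with oblique rotations the communality computation also involves the factor correlations, which this simplified version ignores.

```python
# Classify a variable's communality (orthogonal case: sum of squared
# loadings) as ok, Heywood (= 1), or ultra-Heywood (> 1).
def classify_communality(loadings, tol=1e-8):
    h2 = sum(l * l for l in loadings)
    if h2 > 1 + tol:
        return "ultra-Heywood"   # more than 100% of the variable's variance
    if abs(h2 - 1) <= tol:
        return "Heywood"         # exactly 100% of the variable's variance
    return "ok"
```

For example, loadings of .6 and .8 give h2 = 1.0, a Heywood case, while loadings of .9 and .6 give h2 = 1.17, an ultra-Heywood case.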
For the purpose of bootstrapping CI, this message is not extremely concerning. It simply indicates that some of our resamples have potentially biased data—particular cases were oversampled, leading to a biased or more uniform data set. So, do not worry. This message is not a death sentence for our bootstrap analysis. However, if we got this error message for our original sample or for the majority of our resamples, this could indicate there are substantial problems with our original data (e.g., bias, size).
There are two ways we can proceed with the current analysis: 1) change our estimation method; or 2) allow SAS to solve solutions with Heywood or ultra-Heywood cases. The first option is the preferred course of action. The maximum likelihood method of extraction is known to be particularly susceptible to Heywood cases (Brown, Hendrix, Hedges, & Smith, 2012). By switching to a method that is more robust for non-normal data (i.e., ULS or iterated PAF), we might be able to get better estimates from the data and avoid the problematic Heywood cases. The second option entails using the HEYWOOD or ULTRAHEYWOOD option on the FACTOR statement to permit communality estimates that are equal to or greater than 1. This option should be used with care because it does not fix any problems—it just allows us to estimate results for potentially problematic data.
Eigenvalues and variance. We have proceeded with the current example by estimating eigenvalues using both options. The syntax is provided below.
**Solution to Heywood Cases #1: Change the method;
*Look at original sample and subsample solution with revised method;
proc factor data=sdqdata  nfactors=3  method=uls  rotate=OBLIMIN;
   var Math: Par: Eng:;
run;
proc factor data=sdqdata_ss1 nfactors=3 method=uls rotate=OBLIMIN ;
   var Math: Par: Eng:;
run;
*Bootstrap results;
ods output Eigenvalues=boot_eigen
      ObliqueRotFactPat=boot_loadings; 
proc factor data=outboot_sdq nfactors=3 method=uls rotate=OBLIMIN;
   by replicate;
   var Math: Par: Eng:;
run;
ods output close;
*Estimate CI for Eigenvalues and Corresponding Variance;
proc sort data=boot_eigen nodupkey; 
   by Number Replicate; 
run;
proc univariate data=boot_eigen;
   by Number;
   var Eigenvalue Proportion;
   Histogram;
   output out=ci_eigen pctlpts=2.5, 97.5 mean=Eigenvalue_mean 
      Proportion_mean std=Eigenvalue_std Proportion_std 
      pctlpre=Eigenvalue_ci Proportion_ci;
run;
*Estimate CI for Pattern Loading Matrix;
%alignFactors(orig_loadings,boot_loadings,aligned_loadings,summary);	
proc sort data=aligned_loadings; by VariableN; run;
proc univariate data=aligned_loadings;
   by VariableN;
   var Factor1;
   output out=ci1_loadings pctlpts=2.5, 97.5 mean=mean std=std 
      pctlpre=ci;
run;
proc univariate data=aligned_loadings;
   by VariableN;
   var Factor2;
   output out=ci2_loadings pctlpts=2.5, 97.5 mean=mean std=std 
      pctlpre=ci;
run;
proc univariate data=aligned_loadings;
   by VariableN;
   var Factor3;
   output out=ci3_loadings pctlpts=2.5, 97.5 mean=mean std=std 
      pctlpre=ci;
run;
**Solution to Heywood Cases #2: Use ULTRAHEYWOOD option to allow SAS to  
  solve;
ods output Eigenvalues=boot_eigen
      ObliqueRotFactPat=boot_loadings; 
proc factor data=outboot_sdq nfactors=3 method=ml ULTRAHEYWOOD 
      rotate=OBLIMIN ;
   by replicate;
   var Math: Par: Eng:;
run;
ods output close;
*Estimate CI for Eigenvalues and Corresponding Variance;
proc sort data=boot_eigen nodupkey; 
   by Number Replicate; 
run;
proc univariate data=boot_eigen;
   by Number;
   var Eigenvalue Proportion;
   Histogram;
   output out=ci_eigen pctlpts=2.5, 97.5 mean=Eigenvalue_mean 
      Proportion_mean std=Eigenvalue_std Proportion_std 
      pctlpre=Eigenvalue_ci Proportion_ci;
run;
Table 7.4 shows the results for the eigenvalue estimates after changing our extraction method to ULS, and Table 7.5 shows the results after using the ULTRAHEYWOOD option with ML extraction. Please note that the estimates produced by ML extraction are larger because ML uses a weighting process; the relative difference in eigenvalue magnitude should therefore be ignored when comparing these tables.
Table 7.4 Results for the first four eigenvalues using ULS extraction

         Original     Reduced                       Original   Reduced
         Sample       Sample                        Sample     Sample
Factor   eigenvalue   eigenvalue   95% CI           variance   variance   95% CI
1        3.62         3.68         (3.31, 4.21)     52.1%      50.0%      (43.9%, 54.6%)
2        2.16         2.47         (2.11, 2.86)     31.0%      33.6%      (27.9%, 37.2%)
3        1.73         1.53         (1.23, 1.88)     24.9%      20.8%      (16.1%, 24.3%)
4        0.36         0.43         (0.26, 0.72)     5.2%       5.9%       (3.4%, 9.5%)
Table 7.5 Results for the first four eigenvalues using ML extraction and ULTRAHEYWOOD option

         Original     Reduced                       Original   Reduced
         Sample       Sample                        Sample     Sample
Factor   eigenvalue   eigenvalue   95% CI           variance   variance   95% CI
1        3.62         3.68         (3.31, 4.21)     52.1%      50.0%      (43.9%, 54.6%)
2        2.16         2.47         (2.11, 2.86)     31.0%      33.6%      (27.9%, 37.2%)
3        1.73         1.53         (1.23, 1.88)     24.9%      20.8%      (16.1%, 24.3%)
4        0.36         0.43         (0.26, 0.72)     5.2%       5.9%       (3.4%, 9.5%)
Both methods, ML and ULS extraction, identify three factors to be extracted from the original sample and the reduced sample using the Kaiser Criterion (eigenvalue > 1). The 95% CI around the ULS estimates of the third and fourth eigenvalues indicates there is a clear delineation and the factor structure of the reduced sample is likely to replicate. Notice the original sample eigenvalues are also well within the 95% CI produced from our reduced sample—this is a nice check that our results do indeed generalize!
The 95% CI around the ML estimates tells us a slightly different story. These estimates appear to have a tail to the right and the CI for the fourth eigenvalue contains 1. The tail suggests that our reduced sample is likely suffering from some degree of non-normality and ML is having a difficult time producing stable estimates. The inclusion of 1 in the CI for the fourth eigenvalue means this factor structure might not replicate very well. In some instances, four factors might be extracted instead of three. Finally, when we compare the results from our reduced sample to those from our original sample, we can see that one of the eigenvalues from our original sample is not within the 95% CI produced from the reduced sample. Altogether, we can clearly see there are problems with ML extraction in this reduced sample. The 95% CIs are useful in helping us to detect the problems.
Factor loadings. We continue by examining the pattern matrix factor loadings output using ULS extraction. Before the CIs were estimated, the factor loadings were run through the %alignFactors macro presented above to clean them (i.e., take absolute values and align results). The results are presented in Table 7.6.
Table 7.6 Pattern matrix loadings using ULS extraction and oblimin rotation

        Original Sample          Reduced Sample
        Factor Loadings          Factor Loadings
Var:    1      2      3          1                   2                   3
Math1   .90    -.05   -.03       .91 (.86, .95)      .01 (-.06, .08)     -.09 (-.16, -.02)
Math2   .86    .00    .01        .85 (.79, .90)      .01 (-.05, .09)     .06 (-.01, .14)
Math3   .89    -.01   .02        .88 (.81, .93)      -.01 (-.07, .05)    .00 (-.08, .07)
Math4   -.59   -.06   .01        -.66 (-.75, -.56)   .01 (-.08, .09)     -.02 (-.11, .07)
Par1    .00    .71    .02        -.03 (-.10, .05)    .68 (.55, .78)      .04 (-.06, .16)
Par2    .05    -.69   .04        -.03 (-.11, .04)    -.72 (-.82, -.60)   .05 (-.06, .12)
Par3    .03    .82    -.01       -.02 (-.08, .05)    .96 (.89, 1.00)     -.08 (-.14, .01)
Par4    -.04   -.60   -.11       -.05 (-.16, .05)    -.57 (-.69, -.43)   -.12 (-.26, .01)
Par5    .02    .74    -.04       -.02 (-.09, .05)    .78 (.70, .84)      .00 (-.08, .08)
Eng1    .00    .02    .78        .03 (-.05, .11)     .01 (-.06, .11)     .75 (.65, .85)
Eng2    -.01   -.10   .84        -.04 (-.11, .03)    -.07 (-.14, .02)    .78 (.67, .86)
Eng3    .06    -.03   .85        -.02 (-.09, .06)    .00 (-.07, .08)     .81 (.71, .89)
Eng4    .04    -.12   -.61       -.03 (-.11, .05)    -.09 (-.22, .01)    -.63 (-.74, -.48)
Note: Factor loadings highlighted are those expected to load on the factor.
The majority of the bootstrapped CIs are relatively narrow. They are somewhat wider than those in the previous example, but still narrow enough to suggest that the relative magnitude and structure of the loadings should replicate. To test the accuracy of the bootstrapped CIs, we can compare them to the original sample estimates. Only one variable (Par3) had its primary loading fall outside the CI. Overall, not bad! We were able to get a pretty good idea of what was occurring in our “population” based on the results from only 300 cases. Together, these results reinforce the idea that with a reasonable sample, appropriate methods (e.g., extraction, rotation), and appropriate bootstrapping methodology (e.g., attention to factor alignment and relative magnitude), we can gain valuable insight into the population parameters.
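The accuracy check described in this paragraph, counting how many original-sample ("population") estimates land inside the reduced-sample bootstrap CIs, can be sketched in a few lines. This is our illustration; the two loadings and intervals shown are a hypothetical pair, not Table 7.6's full set.

```python
# Count how many "population" (original-sample) estimates fall inside
# the bootstrap CIs computed from the reduced sample.
def coverage(originals, cis):
    hits = sum(1 for est, (lo, hi) in zip(originals, cis) if lo <= est <= hi)
    return hits, len(originals)

# Hypothetical example: the first estimate is covered by its CI,
# the second is not.
hits, total = coverage([0.90, 0.82], [(0.86, 0.95), (0.89, 1.00)])
```

A high coverage count is the kind of evidence used above to argue that the small-sample solution generalizes to the larger data set.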