Missing Data

What Is Missing or Incomplete Data?

Missing data is an issue in exploratory factor analysis because EFA will analyze only complete cases, and thus any case with missing data will be deleted. This can reduce sample size, causing estimates to be more volatile. If the missingness is completely at random, then your estimates should be unbiased. However, it is unusual for missing data to be missing completely at random. Thus, unless you deal with the missing data in some appropriate manner, it is likely that the missing data is biasing the results in addition to reducing sample size. In SAS, we can see how many cases are missing a response by adding the MISSING option to the TABLES statement of PROC FREQ (e.g., tables variable-names / missing;).
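As a minimal sketch, suppose the survey responses are stored in a data set called survey (the data set and item names here are hypothetical placeholders):

   proc freq data=survey;
      tables item1 item2 item3 / missing;  /* include missing responses in each table */
   run;

The MISSING option counts missing responses as a level of each table and includes them in the percentage calculations, rather than setting them aside, so the extent of the missingness is harder to overlook.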
If any data on any variable from any participant is not present, the researcher is dealing with missing or incomplete data. In many types of research, there can be legitimate missing data. This can come in many forms, for many reasons. Most commonly, legitimate missing data is an absence of data when it is appropriate for there to be an absence. Imagine you are filling out a survey that asks you whether you are married, and if so, how happy you are with your marriage. If you say you are not married, it is legitimate for you to skip the follow-up question on how happy you are with your marriage. If a survey asks you whether you voted in the last election, and if so, how much research you did about the candidates before voting, it is legitimate to skip the second part if you did not vote in the last election.
Legitimately missing data can be dealt with in different ways. One common way is to use analyses that do not require complete data (or that can deal effectively with incomplete data). These include things like hierarchical linear modeling (HLM; Raudenbush & Bryk, 2002) or item response theory. Another common way of dealing with legitimately missing data is adjusting the denominator. Again taking the example of the marriage survey, we could eliminate non-married individuals from the particular analysis looking at happiness with marriage, but would leave non-married respondents in the analysis when looking at issues relating to being married versus not being married. Thus, instead of asking a slightly silly question of the data—“How happy are individuals with their marriage, even unmarried people?”—we can ask two more refined questions: “What are the predictors of whether someone is currently married?” and “Of those who are currently married, how happy are they on average with their marriage?” It makes no sense to include non-married individuals in the data on how happy someone is with marriage.
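To make the idea of adjusting the denominator concrete, here is a hedged sketch in SAS; the data set survey and the variables married and marr_happy are hypothetical placeholders:

   proc means data=survey n mean std;
      where married = 1;   /* restrict this analysis to currently married respondents */
      var marr_happy;      /* reported happiness with marriage */
   run;

The WHERE statement excludes unmarried respondents from this one analysis only; they remain in the data set and are still available for analyses comparing married and unmarried respondents.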
Illegitimately missing data are also common in all types of research. Sensors fail or become miscalibrated, leaving researchers without data until the sensor is replaced or recalibrated. Research participants choose to skip questions on surveys that the researchers expect everyone to answer. Participants drop out of studies before they are complete. Somewhat ironically, missing data can even be caused by data cleaning (if you delete outlying values).
Few authors seem to explicitly deal with the issue of missing data, despite its obvious potential to substantially skew the results (Cole, 2008). For example, in a recent survey of highly regarded journals from the American Psychological Association, the first author and his students found that just over one-third (38.89%) of authors discussed the issue of missing data in their articles. What of the remaining 61% who fail to report anything relating to missing data? Do they have complete data (rare in the social sciences, but possible for some authors)? Do they have complete data only because they removed all subjects with any missing data (undesirable, and potentially biasing the results, as we discuss below)? Did they deal effectively with the missing data but fail to report it (less likely, but possible)? Or did they allow the statistical software to treat the missing data via whatever the default method is, which most often leads to deletion of subjects with missing data? If this survey is representative of researchers across the sciences, we have cause for concern. Of those researchers who did report something to do with missing data, most reported having used the classic methods of listwise deletion (complete case analysis) or mean substitution, neither of which is a best practice (Schafer & Graham, 2002). In only a few cases did researchers report doing anything constructive with the missing data, such as estimation or imputation.

Dealing with Missing Data

Regression and multiple imputation have emerged as two more progressive methods of dealing with missing data, particularly in cases like factor analysis where there are other closely correlated variables with valid data. Regression imputation (also referred to as simple imputation) creates a regression equation to predict missing values based on variables with valid data. After each missing value is replaced, analysis can continue as planned. A popular type of regression imputation used in factor analysis is the expectation-maximization (EM) algorithm. This algorithm uses an iterative maximum likelihood process to identify the best estimates for each missing value. The EM algorithm is available through the MI procedure in SAS (SAS Institute Inc., 2015). We will review an example of this procedure in the next section. The EM algorithm, along with other forms of regression imputation, has been shown to be superior to mean substitution or complete case analysis, particularly when data is not missing completely at random (Graham, 2009).
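As a brief sketch of what this can look like (the data set survey, the items item1-item20, and the seed are hypothetical placeholders), PROC MI with NIMPUTE=0 runs the EM algorithm without performing multiple imputation, and the OUT= option on the EM statement saves a completed data set in which each missing value has been replaced by its EM estimate:

   proc mi data=survey nimpute=0 seed=20150623;
      em out=survey_em;   /* completed data set with EM estimates in place of missing values */
      var item1-item20;
   run;

   proc factor data=survey_em method=ml rotate=promax;
      var item1-item20;   /* the EFA then proceeds on the completed data */
   run;

The factor analysis itself is unchanged; it simply runs on the completed data set rather than on complete cases only.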
Multiple imputation takes the concept of imputation a step further by providing multiple estimates for each missing value. A variety of advanced techniques—e.g., EM/maximum likelihood estimation, propensity score estimation, or Markov chain Monte Carlo (MCMC) simulation—are used to provide the estimates and create multiple versions of the same data set (sort of a statistician’s view of the classic science fiction scenario of alternate realities or parallel universes). These parallel data sets can then be analyzed via standard methods and the results combined to produce estimates and confidence intervals that are often more robust than those from simple imputation or the previously mentioned methods of dealing with missing values (Schafer, 1997, 1999). However, little attention has been given to methods for combining the results of factor analyses across imputed data sets. Because most factor analysis models do not include readily available estimates of standard error, the more commonplace method of aggregating results through PROC MIANALYZE is not available. Researchers have recommended several alternatives, including averaging component loadings after correcting for the alignment problem (reviewed in Chapter 7), conducting the analysis on the average correlation or covariance matrix produced by the imputed data sets, or using a generalized Procrustes analysis to pool loadings (Ginkel & Kroonenberg, 2014).
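A hedged sketch of this workflow in SAS (again, the data set and item names are hypothetical placeholders) generates five imputed data sets and factor-analyzes each one, leaving the pooling of loadings to one of the strategies just described:

   proc mi data=survey nimpute=5 seed=20150623 out=survey_mi;
      mcmc;               /* MCMC, the default method for arbitrary missing patterns */
      var item1-item20;
   run;

   proc factor data=survey_mi method=ml rotate=promax;
      by _Imputation_;    /* PROC MI stacks the imputations, indexed by _Imputation_ */
      var item1-item20;
   run;

Because the loadings from the five analyses lack the standard errors that PROC MIANALYZE expects, they must be combined by hand, for example by averaging the imputed correlation matrices or using generalized Procrustes analysis.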
Since the goal of EFA is primarily exploration, the most important thing is simply to recognize the role of missing data. We do not believe that the more complex multiple imputation methods are necessary in the context of EFA, especially because they can be programmatically challenging. Instead, explore your data and try to understand the impact of any missingness. If you have concerns about missing data, try using a regression imputation method such as the EM algorithm.