Missing data is an issue in exploratory factor analysis because most software performing EFA analyzes only complete cases by default, and thus any case with missing data is deleted. This can reduce sample size, making estimates more volatile.
If data are missing completely at random, then your estimates should be unbiased. However, data are rarely missing completely at random (MCAR).
Thus, unless you deal with the missing data in some appropriate manner, it is likely that the missing data is not only reducing your sample size but also biasing your results. In SAS, we can see how many cases are missing a response by adding the MISSING option to the TABLES statement of PROC FREQ (e.g., tables variable-names / missing;).
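Outside SAS, the same check is easy to sketch. Below is a minimal Python analogue (the variable names and survey responses are invented for illustration) that counts missing responses per variable and identifies the complete cases that a default EFA would retain:

```python
# Count missing responses per variable, analogous to adding the
# MISSING option in PROC FREQ. The records below are invented;
# None marks a missing response.
responses = [
    {"married": "yes", "happiness": 4},
    {"married": "no",  "happiness": None},  # legitimate skip
    {"married": "yes", "happiness": None},  # illegitimate skip
    {"married": None,  "happiness": None},
]

variables = ["married", "happiness"]
missing_counts = {
    var: sum(1 for case in responses if case[var] is None)
    for var in variables
}
print(missing_counts)  # {'married': 1, 'happiness': 3}

# Complete cases have no missing value on any variable; these are
# the only cases an analysis using listwise deletion would keep.
complete_cases = [
    case for case in responses
    if all(case[var] is not None for var in variables)
]
print(len(complete_cases))  # 1
```

Note how quickly listwise deletion shrinks the sample here: four respondents become one complete case.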
If any data on any variable from
any participant is not present, the researcher is dealing with missing
or incomplete data. In many types of research, there can be legitimate
missing data. This can come in many forms, for
many reasons. Most commonly, legitimate missing data is an absence
of data when it is appropriate for there to be an absence. Imagine
you are filling out a survey that asks you whether you are married,
and if so, how happy you are with your marriage. If you say you are
not married, it is legitimate for you to skip the follow-up question
on how happy you are with your marriage. If a survey asks you whether
you voted in the last election, and if so, how much research you did
about the candidates before voting, it is legitimate to skip the second
part if you did not vote in the last election.
Legitimately missing data can be dealt with in different ways. One common approach is to use analyses that do not require (or can deal effectively with) incomplete data, such as hierarchical linear modeling (HLM; Raudenbush & Bryk, 2002) or item response theory. Another common approach is to adjust the denominator.
Again taking the example of the marriage survey, we could eliminate
non-married individuals from the particular analysis looking at happiness
with marriage, but would leave non-married respondents in the analysis
when looking at issues relating to being married versus not being
married. Thus, instead of asking a slightly silly question of the
data—“How happy are individuals with their marriage,
even unmarried people?”—we can ask two more refined
questions: “What are the predictors of whether someone is currently
married?” and “Of those who are currently married, how
happy are they on average with their marriage?” In this case,
it makes no sense to include non-married individuals in the data on
how happy someone is with marriage.
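In code, adjusting the denominator amounts to filtering to the relevant subset before computing each statistic. A minimal sketch, using invented respondents and assuming a 1-5 happiness scale:

```python
from statistics import mean

# Invented survey records; happiness is legitimately missing (None)
# for unmarried respondents, who skipped the follow-up question.
respondents = [
    {"married": True,  "happiness": 5},
    {"married": True,  "happiness": 3},
    {"married": False, "happiness": None},
    {"married": True,  "happiness": 4},
    {"married": False, "happiness": None},
]

# Question 1: what proportion of the full sample is married?
# Denominator: everyone.
prop_married = sum(r["married"] for r in respondents) / len(respondents)

# Question 2: of those who are married, how happy are they on average?
# Denominator: married respondents only.
married_happiness = [r["happiness"] for r in respondents if r["married"]]
avg_happiness = mean(married_happiness)

print(prop_married)   # 0.6
print(avg_happiness)  # 4
```

Each question gets its own denominator, so the unmarried respondents' legitimately missing happiness scores never enter the happiness analysis.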
Illegitimately
missing data are also common in all types of research.
Sensors fail or become miscalibrated, leaving researchers without
data until that sensor is replaced or recalibrated. Research participants
choose to skip questions on surveys that the researchers expect everyone
to answer. Participants drop out of studies before they are complete.
Missing data also, somewhat ironically, can be caused by data cleaning
(if you delete outlying values).
Few authors seem to
explicitly deal with the issue of missing data, despite its obvious
potential to substantially skew the results (Cole, 2008). For example,
in a recent survey of highly regarded journals from the American Psychological
Association, the first author and his students found that just over
one-third (38.89%) of authors discussed the issue of missing data
in their articles. Do the remaining 61% who fail to report anything relating to missing data have complete data (rare in the social sciences, but possible for some authors), did they create "complete" data by removing all subjects with any missing data (undesirable, and potentially biasing the results, as we discuss below), did they deal effectively with the missing data but fail to report it (less likely, but possible), or did they allow the statistical software to treat the missing data via its default method, which most often means deletion of subjects with missing data? If this survey is representative of
researchers across the sciences, we have cause for concern. Of those
researchers who did report something to do with missing data, most
reported having used the classic methods of listwise deletion (complete case analysis) or mean substitution, neither of which is a best practice (Schafer & Graham, 2002). In only a few cases did researchers
report doing anything constructive with the missing data, such as
estimation or imputation.
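To see one reason mean substitution is not a best practice, consider a toy sketch with invented scores: replacing missing values with the observed mean leaves the mean unchanged but artificially shrinks the variance, which in turn attenuates correlations and factor loadings.

```python
from statistics import mean, pvariance

# Invented scores on one variable; None marks missing values.
scores = [2.0, 4.0, None, 6.0, 8.0, None]

observed = [x for x in scores if x is not None]
m = mean(observed)

# Mean substitution: fill each missing value with the observed mean.
imputed = [x if x is not None else m for x in scores]

print(mean(observed), mean(imputed))        # 5.0 5.0 (mean is unchanged)
print(pvariance(observed))                  # 5.0
print(pvariance(imputed))                   # about 3.33 (down from 5.0)
```

The imputed values sit exactly at the mean, contributing zero deviation, so every mean-substituted case pulls the variance estimate toward zero.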