Outliers

Two Types of Outliers in EFA

There are two different types of outliers in factor analysis.[2] The first type of outlier is a case (or value) that does not belong. There are many reasons why cases might not belong: the case could be from a different population, could be the result of a data recording or entry error, could have resulted from motivated misresponding, or could represent a small subgroup with a different factor structure that has not been recognized. Whatever the reason, having these “illegitimate” cases in the data does not serve any useful purpose. When you identify such a case, you should check the original data, if available, to determine whether the case suffers from a data entry or keying error. It might be possible to fix the data manually. Alternatively, you could remove the offending value and then use missing-data techniques (described below) to replace it with a reasonable estimate of what it might have been.
The second type of outlier in EFA is a variable that does not belong. In this context, a variable can be considered an outlier when it loads on its own factor as a single-item factor: the item does not load with any other items, and no other items load on the factor it defines. Such a variable should be removed from the analysis (or the scale should be reworked to more fully represent that dimension of the latent construct if it is a legitimate facet of the construct).
Many outliers can be identified through careful review of data and analyses. The out-of-bounds value (a subtype of the first outlier) can usually be identified through frequency distributions of the variables to be analyzed. Anything off the scale or out of plausible range should be reviewed. PROC FREQ can be used to produce such tables. This type of examination should always be the first step for researchers. The second type of outlier, the variable that does not belong, is even easier to identify. This is the variable in your analysis that does not play well with others and instead takes a factor for itself. This variable should be removed, and the analysis should be rerun.
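For example, a quick scan for out-of-range values might look something like the following sketch; the data set name SURVEY and the items ITEM1-ITEM20 are hypothetical placeholders for your own variables.

  /* A minimal sketch: request a frequency table for each item to be
     factor analyzed. Any value outside the intended response scale
     warrants review. */
  proc freq data=survey;
     tables item1-item20 / missing;
  run;

The MISSING option simply includes missing values in each table, so unusual amounts of missing data become visible in the same pass.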
The individual case that is an outlier (the other subtype of the first outlier) is trickier to identify in EFA. Visual examination of the data can help one identify odd patterns for review and potential elimination (like the participant who answers “3” to every item). Other patterns, like random responding or motivated misresponding, are more difficult to identify. These patterns can be classified into response sets that individuals use to mask or filter their true beliefs or responses. Each is associated with a different motive and outcome, and only some might be detected. We discuss these response sets and the challenges associated with them in the following section.
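As one concrete illustration, the invariant responder mentioned above (the participant who answers “3” to every item) can be flagged with a simple check of within-person variability. The data set and item names below are hypothetical, and a zero standard deviation is only a prompt for review, not proof of a response set.

  /* A minimal sketch: flag respondents who give the same answer to every
     item (zero within-person variability). SURVEY and ITEM1-ITEM20 are
     hypothetical names. */
  data flagged;
     set survey;
     sd_items = std(of item1-item20);      /* within-person standard deviation    */
     straight_line = (sd_items = 0);       /* 1 = identical response to all items */
  run;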

Response Sets and Unexpected Patterns in the Data

Response sets[3] (such as random responding) are strategies that individuals use (consciously or otherwise) when responding to educational or psychological tests or scales. These response sets range on a continuum from unbiased retrieval (where individuals use direct, unbiased recall of factual information in memory to answer questions) to generative strategies (where individuals create responses not based on factual recall because of inability or unwillingness to produce relevant information from memory; see Meier, 1994, p. 43). Response sets have been discussed in the measurement and research methodology literature for over seventy years (Cronbach, 1942; Goodfellow, 1940; Lorge, 1937), and some (e.g., Cronbach, 1950) argue that response sets are ubiquitous, found in almost every population on almost every type of test or assessment. In fact, early researchers identified response sets on assessments as diverse as the Strong Interest Inventory (Strong, 1927); tests of clerical aptitude, word meanings, temperament, and spelling; and judgments of proportion in color mixtures, Seashore pitch, and pleasantness of stimuli. (See summary in Cronbach, 1950, Table 1.)
Response sets can be damaging to factor analysis and to the quality of measurement in research. Much of the research we as scientists perform relies upon the goodwill of research participants (students, teachers, participants in organizational interventions, minimally compensated volunteers, etc.) with little incentive to expend effort in providing data to researchers. If we are not careful, participants with lower motivation to perform at their maximum level might increase the error variance in our data, masking real effects of our research. In the context of this book, random and motivated misresponding can have deleterious effects such as masking a clear factor structure or attenuating factor loadings and communalities.
Here are some examples of response sets that are commonly discussed in the literature:
Random responding is a response set where individuals respond with little pattern or thought (Cronbach, 1950). This behavior, which completely negates the usefulness of responses, adds substantial error variance to analyses. Meier (1994) and others suggest this might be motivated by lack of preparation, reactivity to observation, lack of motivation to cooperate with the testing, disinterest, or fatigue (Berry et al., 1992; Wise, 2006). Random responding is a particular concern in this context because it can mask the effects of interventions, biasing results toward null hypotheses, smaller effect sizes, and much larger confidence intervals than would be the case with valid data.
Dissimulation and malingering. Dissimulation refers to a response set where respondents falsify answers in an attempt to be seen in a more negative or more positive light than honest answers would provide. Malingering is a response set where individuals falsify and exaggerate answers to appear weaker or more medically or psychologically symptomatic than honest answers would indicate. Individuals are often motivated by the goal of receiving services they would not otherwise be entitled to (e.g., an attention deficit or learning disabilities evaluation; Kane, 2008; see also Rogers, 1997) or of avoiding an outcome they might otherwise receive (such as a harsher prison sentence; see, e.g., Ray, 2009; Rogers, 1997). These response sets are more common on psychological scales where the goal of the question is readily apparent (e.g., “Do you have suicidal thoughts?”; see also Kuncel & Borneman, 2007). Clearly, these response sets have substantial costs to society, but researchers should also be vigilant for them because motivated responding of this kind can dramatically skew research results.
Social desirability is related to malingering and dissimulation in that it involves altering responses in systematic ways to achieve a desired goal—in this case, to conform to social norms or to “look good” to the examiner. (See, e.g., Nunnally & Bernstein, 1994.) Many scales in psychological research have attempted to account for this long-discussed response set (Crowne & Marlowe, 1964), yet it remains a real and troubling aspect of research in the social sciences that might not have a clear answer, but that can have clear effects for important research (e.g., surveys of risky behavior, compliance in medical trials, etc.).
Acquiescence and criticality are response patterns in which individuals are more likely to agree with (acquiescence) or disagree with (criticality) questionnaire items in general, regardless of the nature of the item (e.g., Messick, 1991; Murphy & Davidshofer, 1988).
Response styles peculiar to educational testing are also discussed in the literature. While the response styles above can be present in educational data, other biases peculiar to tests of academic mastery (often multiple choice) include: (a) response bias for particular columns (e.g., A or D) on multiple-choice items, (b) bias for or against guessing when uncertain of the correct answer, and (c) rapid guessing (Bovaird, 2003), which is a form of random responding discussed above. As mentioned above, random responding (rapid guessing) is undesirable because it introduces substantial error into the data, which can suppress researchers' ability to detect real differences between groups, change over time, and the effect or effects of interventions.
As we mentioned above, random responding can be particularly problematic to research. The majority of the other response sets bias results to a degree, but there is still some pattern that likely reflects the individual’s level of a particular construct or that at least reflects societal norms. Random responding contradicts expected patterns. Thus, an individual case of random responding can introduce more error into an analysis than most other response sets. We will spend the rest of this section discussing these tricky outliers.

How Common Is Random Responding?

Random responding does not appear to be a rare or isolated behavior. In one study conducted among a sample of university sophomores, 26% of students were identified (by inappropriately short response times) as having engaged in random responding (Wise, 2006). Furthermore, only 25.5% of the responses among the random responders were correct, compared to 72% of the responses among the nonrandom responders. This is exactly what we would expect and serves to validate the selection of random responders. In another study, the incidence of random responding on the Minnesota Multiphasic Personality Inventory (MMPI-2) was reported to be 60% among college students, 32% in the general adult population, and 53% among applicants to a police training program (Berry et al., 1992). In this case, responses identified as random were more likely to appear near the end of this lengthy assessment, suggesting that the random responding probably resulted from fatigue or lack of motivation.
Another study conducted by the first author found similar trends. Osborne & Blanchard (2011) found that about 40% of 560 students involved in a study designed to assess the effects of an educational intervention were engaging in motivated misresponding—in this case, probably random responding. The students were identified by two different criteria discussed in the next section: Rasch outfit measures and performance on a random responding scale. To confirm this classification, the authors demonstrated that random responders received substantially lower test scores than other students and also showed much less change over time (before versus after the intervention) compared to other students. Additional analyses were conducted to further validate the accuracy of the random responder categorization; please see Osborne & Blanchard (2011) for more information.

Detection of Random Responding

There is a well-developed literature on how to detect many different types of response sets. Examples of detection methods include adding particular types of items to detect social desirability, altering instructions to respondents in particular ways, creating equally desirable items that are worded positively and negatively, and, for more methodologically sophisticated researchers, using item response theory (IRT) to explicitly estimate a guessing (random response) parameter. Meier (1994; see also Rogers, 1997) provides a succinct summary of some of the more common issues and recommendations around response set detection and avoidance. Three particular methods are useful in detecting random responding:
Creation of a simple random responding scale. For researchers not familiar with IRT methodology, it is still possible to be highly effective in detecting random responding on multiple-choice educational tests (and often on psychological tests using Likert-type response scales as well). In general, a simple random responding scale involves creating items in such a way that 100% or 0% of the respondent population should respond in a particular way, leaving responses that deviate from that expected response suspect. There are several ways to do this, depending on the type of scale in question. For a multiple-choice educational test, one method is to have one or more choices that are illegitimate responses.[4] This is most appropriate when students are using a separate answer sheet, such as the machine-scored answer sheet used in this study and described below.
A variation of this is to have questions scattered throughout the test that 100% of respondents should answer in a particular way if they are reading the questions (Beach, 1989). These can be content items that should not be missed (e.g., 2 + 2 = __), behavioral/attitudinal questions (e.g., I weave the fabric for all my clothes), nonsense items (e.g., there are 30 days in February), or targeted multiple-choice test items such as the following (a brief scoring sketch appears after the example):
How do you spell ‘forensics’?
  1. fornsis
  2. forensics
  3. phorensicks
  4. forensix
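Scoring such a scale is straightforward. The sketch below assumes a hypothetical data set SURVEY containing five such validity items (VAL1-VAL5) with hypothetical keyed answers; a respondent who misses several of them becomes a candidate for closer inspection or removal.

  /* A minimal sketch: count deviations from the expected answers on items
     that essentially everyone should answer the same way. The data set,
     item names, keyed answers, and cutoff are all hypothetical. */
  data rr_scale;
     set survey;
     array val{5} val1-val5;                  /* validity items           */
     array key{5} _temporary_ (2 4 1 3 2);    /* expected (keyed) answers */
     rr_score = 0;
     do i = 1 to 5;
        if val{i} ne key{i} then rr_score = rr_score + 1;
     end;
     drop i;
     suspect = (rr_score >= 2);               /* flag cases above a cutoff */
  run;

The cutoff of two or more unexpected answers is arbitrary here; in practice it should reflect the number of validity items used and how conservative the researcher wishes to be.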
Item response theory. Item response theory can be used to create person-fit indices that can be helpful in identifying random responders (Meijer, 2003). The idea behind this approach is to quantitatively group individuals by their patterns of responding and then use these groupings to identify individuals who deviate from the expected pattern. This can point toward groups using particular response sets, such as random responding. It is also possible to estimate a “guessing parameter” (using the three-parameter logistic model) and then account for it in analyses, as mentioned above. In SAS, the NLMIXED, GLIMMIX, and IRT procedures (SAS Institute Inc., 2015) can be used to estimate the guessing parameter. However, as far as we know, none of these procedures outputs a readily available person-fit estimate. Instead, person-fit estimates must be hard-coded or estimated with different software (e.g., R, IRTPRO, BILOG-MG).
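For instance, a three-parameter logistic (3PL) model, which includes the guessing parameter, might be requested from PROC IRT along the following lines; the data set TESTDATA and items ITEM1-ITEM30 are hypothetical, and the exact statements and options available depend on your SAS/STAT release.

  /* A minimal sketch: fit a 3PL model, which estimates a guessing
     parameter for each item. TESTDATA and ITEM1-ITEM30 are hypothetical. */
  proc irt data=testdata;
     var item1-item30;
     model item1-item30 / resfunc=threep;   /* three-parameter logistic model */
  run;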
Interested readers should consult references such as Edelen & Reeve (2007; see also Hambleton, Swaminathan, & Rogers, 1991; Wilson, 2005) for a more thorough discussion of this topic. Unfortunately, IRT does have some drawbacks. It generally requires large samples (e.g., N ≥ 500) and significant training and resources, and although it identifies individuals who do not fit the general response pattern, it does not necessarily show what the response set, if any, is. Thus, although IRT is useful in many instances, we cannot use it for our study.
Rasch measurement approaches. Rasch measurement models are another class of modern measurement tools with applications for identifying response sets. Briefly, Rasch analyses produce two fit statistics of particular interest to this application: infit and outfit, both of which summarize squared standardized residuals for individuals. Large infit statistics can reflect unexpected patterns of observations by individuals (usually interpreted as items misperforming for the individuals being assessed). By contrast, large outfit mean squares can reflect unexpected observations by persons on items, which can result from haphazard or random responding; thus, large outfit values indicate an issue that deserves exploration. In SAS, the NLMIXED, GLIMMIX, and IRT procedures can also be used to conduct a Rasch analysis. However, as with the IRT person-fit measures, none of these procedures currently outputs estimates of infit and outfit. Thus, individuals will need to hard-code these estimates (as sketched below) or produce them with different software (e.g., WINSTEPS, R).
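As a rough illustration of that hard-coding: once standardized person-by-item residuals are available from a fitted Rasch model, a person-level outfit mean square is simply the mean of the squared standardized residuals across items. The long-format data set RESID and its variables PERSON and Z below are hypothetical.

  /* A minimal sketch: compute person-level outfit mean squares from a
     hypothetical long-format data set RESID with one row per person-item
     response and a standardized residual Z. */
  proc sql;
     create table person_fit as
     select person,
            mean(z*z) as outfit_msq   /* unweighted mean squared residual */
     from resid
     group by person;
  quit;

Outfit mean squares have an expected value near 1.0; values well above that (greater than about 2.0 is a common rule of thumb) mark cases worth examining as possible random responders.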
Again, the challenge is interpreting the cause (response set or missing knowledge) of the substantial outfit values. Interested readers are encouraged to explore Bond & Fox (2001) or Smith & Smith (2004) for a more thorough discussion of this topic.
No matter the method, we assert that it is imperative for educational researchers to include mechanisms for identifying random responding in their research, as random responding from research participants is a threat to the validity of educational research results. Best practices in response bias detection are worthy of more research and discussion, given the implications for the quality of the field of educational research.