Cronbach’s alpha (Cronbach, 1951) is one of the
most widely reported indicators of scale reliability in the social
sciences. It has some conveniences over other measures of reliability,
and it has some drawbacks as well. There are also many misconceptions
about the appropriate use of alpha. In this section we will review
the strengths, weaknesses, uses, and misuses of alpha. However, let
us start by reviewing the original goal for alpha.
Prior to Cronbach’s
seminal work in this area, the reliability of a scale in a particular
sample was evaluated
through methods such as test-retest correlations. This type of reliability
is still discussed today in psychometrics textbooks, but it has serious
drawbacks. These can include the difficulty of convening the same
group of individuals to retake instruments, memory effects, and attenuation
that are due to real change between administrations. Another serious
drawback is particular to constructs (e.g., mood, content knowledge)
that are expected to change over time. Thus, as Cronbach himself put
it, test-retest reliability is generally best considered as an index
of stability rather than reliability per
se.
The split-half reliability
estimate was also developed early in the 20th century. To perform
this evaluation, items are divided into two groups (most commonly,
even- and odd-numbered items) and scored. Those two scores are then
compared as a proxy for an immediate test-retest correlation. This,
too, has drawbacks—the number of items is halved, there is
some doubt as to whether the two groups of items are parallel, and
different splits of items can yield different coefficients. The Spearman-Brown
correction was developed to help correct for the reduction in item
number and to give a coefficient that is intended to be similar to
the test-retest coefficient. As Cronbach (1951) pointed out, this
coefficient is best characterized as an indicator of equivalence between
two forms, much as today we also talk about parallel forms.
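The split-half procedure and the Spearman-Brown correction can be sketched as follows. This is a minimal illustration with simulated data; the sample size, item count, and noise level are our own assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 respondents x 8 items measuring one construct
# (each item is a common true score plus independent item-level noise).
true_score = rng.normal(size=(200, 1))
items = true_score + rng.normal(size=(200, 8))

# Split items into odd- and even-numbered halves and score each half.
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

# Correlation between the two half-test scores: a proxy for an
# immediate test-retest correlation, but on tests of half the length.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown correction for the halved test length:
# r_full = 2r / (1 + r), projecting the reliability of the full test.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```

Note that choosing a different split of the items (say, first half versus second half) would generally change `r_half`, which is exactly the arbitrariness that motivated alpha.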
The Kuder-Richardson Formula
20 (KR-20) and Cronbach’s alpha were developed to address some
of the concerns over other forms of reliability, particularly split-half
reliability. KR-20 was developed first as a method to compute reliability
for items scored dichotomously, as either “0” or “1”,
as they often are for academic tests or personality inventories (such
as the Geriatric Depression Scale that we use as an example earlier
in the book). Afterward, the mechanics behind KR-20 were further refined
to create alpha, a measure of reliability that is more general than
KR-20 and applicable to items with dichotomous or continuous scales.
Both measures will yield the same estimate from dichotomous data,
but only alpha can estimate reliability for a non-dichotomous scale.
Thus, alpha has emerged as the most general and preferred indicator
of reliability in modern statistical methodology. Others have further
extended the use of alpha through development of a test to determine
whether alpha is the same across two samples (Feldt, 1980), and methods
to estimate confidence intervals for alpha (see Barnette, 2005).
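The equivalence of KR-20 and alpha on dichotomous data can be illustrated with a small simulation. The data below are hypothetical; the formulas are the standard ones, with item variances computed as population variances so that p(1 - p) matches exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dichotomous (0/1) responses: 300 people x 10 items,
# generated by thresholding a latent ability plus noise.
ability = rng.normal(size=(300, 1))
scores = (ability + rng.normal(size=(300, 10)) > 0).astype(float)

k = scores.shape[1]
total = scores.sum(axis=1)

def cronbach_alpha(x):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=0).sum()
    total_var = x.sum(axis=1).var(ddof=0)
    return k / (k - 1) * (1 - item_var / total_var)

# KR-20 replaces the item variances with p*(1-p) for each item;
# for 0/1 data these are identical, so KR-20 equals alpha exactly.
p = scores.mean(axis=0)
kr20 = k / (k - 1) * (1 - (p * (1 - p)).sum() / total.var(ddof=0))

alpha = cronbach_alpha(scores)
print(round(alpha, 4), round(kr20, 4))
```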
The
correct interpretation of alpha. Cronbach (1951) himself
wrote and provided proofs for several assertions about alpha:
- Alpha is n/(n - 1) times the ratio of inter-item covariance to total variance; in other words, a direct assessment of the proportion of variance in the measure that is shared rather than error (unexplained) variance.
- Alpha is the average of all possible split-half coefficients for a given test.
- Alpha is the coefficient of equivalence between two tests composed of items randomly sampled (without replacement) from a universe of items with the same mean covariance as the test or scale in question.
- Alpha is a lower-bound estimate of the coefficient of precision (accuracy of the test with these particular items) and of the coefficient of equivalence (simultaneous administration of two tests with matching items).
- Alpha is a lower bound on the proportion of test variance that is due to all common factors among the items.
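The first two of these assertions can be checked numerically. The sketch below (simulated data; the sample size and item count are our own assumptions) computes alpha three ways: by the standard variance formula, by the covariance form, and as the average over all equal splits, each scored with the Flanagan-Rulon split-half formula r = 2(1 - (V_a + V_b)/V_total) rather than the Spearman-Brown corrected correlation:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Hypothetical data: 250 people x 6 items (k kept small so that every
# possible equal split can be enumerated).
data = rng.normal(size=(250, 1)) + rng.normal(size=(250, 6))

k = data.shape[1]
total_var = data.sum(axis=1).var(ddof=0)

# Standard formula: alpha = k/(k-1) * (1 - sum of item variances / total).
alpha = k / (k - 1) * (1 - data.var(axis=0, ddof=0).sum() / total_var)

# Covariance form: k/(k-1) times the ratio of the summed inter-item
# covariances (off-diagonal of the covariance matrix) to total variance.
cov = np.cov(data, rowvar=False, ddof=0)
alpha_cov = k / (k - 1) * ((cov.sum() - np.trace(cov)) / total_var)

# Average of all equal splits, each scored with the Flanagan-Rulon
# split-half formula r = 2 * (1 - (V_a + V_b) / V_total).
halves = []
for idx in combinations(range(k), k // 2):
    a = data[:, list(idx)].sum(axis=1)
    b = data[:, [i for i in range(k) if i not in idx]].sum(axis=1)
    halves.append(2 * (1 - (a.var(ddof=0) + b.var(ddof=0)) / total_var))
mean_split = np.mean(halves)

print(round(alpha, 6), round(alpha_cov, 6), round(mean_split, 6))
```

All three quantities agree to floating-point precision, which is an algebraic identity rather than a feature of this particular sample.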
As Nunnally
& Bernstein (1994, p. 235) distill from all this, alpha is an
expected correlation between one test and an alternative form of the
test containing the same number of items. The square root of alpha
is also, as they point out, the correlation between the score on a
scale and errorless “true scores.” Let us unpack this
for a moment.
What
alpha is not. Alpha is not a
measure of unidimensionality (an indicator that a scale is measuring
a single construct rather than multiple related constructs) as is
often thought (Cortina, 1993; Schmitt, 1996). Unidimensionality is
an important assumption of alpha: when a scale is multidimensional,
alpha will be underestimated unless it is computed separately for
each dimension. High values of alpha, however, are not necessarily
indicators of unidimensionality (e.g., Cortina, 1993; Schmitt, 1996).
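A brief simulation makes the point concrete (the two-factor structure and noise level are hypothetical choices of ours): items from two uncorrelated constructs, pooled into one scale, can still produce a high alpha:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Two UNCORRELATED latent constructs, five items each.
f1 = rng.normal(size=(n, 1))
f2 = rng.normal(size=(n, 1))
noise = rng.normal(scale=0.5, size=(n, 10))
items = np.hstack([np.tile(f1, (1, 5)), np.tile(f2, (1, 5))]) + noise

k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=0).sum()
                       / items.sum(axis=1).var(ddof=0))

# The two subscale scores correlate near zero, yet alpha for the
# combined 10-item "scale" is high, despite clear two-dimensionality.
r_subscales = np.corrcoef(items[:, :5].sum(axis=1),
                          items[:, 5:].sum(axis=1))[0, 1]
print(round(alpha, 3), round(r_subscales, 3))
```

A high alpha here reflects the substantial inter-item covariance within each factor, not a single underlying construct.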
Also, as we mentioned
before, alpha is not a characteristic
of the instrument, but rather it is a characteristic of the sample
in which the instrument was used. A biased, unrepresentative, or small
sample could produce a very different estimate than a large, representative
sample. Furthermore, the estimate from one large, supposedly representative
sample can differ from that of another, and the results in one population
can certainly differ from those in another. This is why we place such
an emphasis on replication. Replication is necessary to support the
reliability of an instrument. In addition, the reliability of an established
instrument must be re-established when using the instrument with a
new population.
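The sample dependence of alpha can itself be illustrated by simulation (the parameters and the range-restriction scenario are hypothetical): the same instrument yields a markedly lower alpha in a subsample whose true-score variance has been restricted:

```python
import numpy as np

rng = np.random.default_rng(4)

# One population, one 6-item instrument (true score plus item noise).
true = rng.normal(size=(5000, 1))
items = true + rng.normal(size=(5000, 6))

def alpha(x):
    """Standard Cronbach's alpha from an (n x k) response matrix."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=0).sum()
                          / x.sum(axis=1).var(ddof=0))

# Full sample vs. a range-restricted subsample (top third on the
# total score, e.g., a clinical cutoff): restricting true-score
# variance lowers the reliability estimate for the same instrument.
full = alpha(items)
top = items[items.sum(axis=1) > np.quantile(items.sum(axis=1), 2 / 3)]
restricted = alpha(top)
print(round(full, 3), round(restricted, 3))
```

Neither number is "the" reliability of the instrument; each describes the instrument's behavior in that particular sample, which is why replication across samples and populations matters.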