Chapter 5

Designing a Criterion-Referenced Test

Sharon A. Shrock and William C. Coscarelli

In This Chapter

This chapter describes the process of creating valid and defensible criterion-referenced tests—tests that measure the performance of an individual against a standard of competency. This chapter will enable you to

  • distinguish between norm-referenced and criterion-referenced tests
  • describe the basic steps in designing a criterion-referenced test
  • use the decision table to estimate the number of items to include on a criterion-referenced test
  • use the Angoff technique to set the cut-off score for a criterion-referenced test.
 

A Rationale and Model for Designing a Criterion-Referenced Test

Two types of tests are in common use today—norm-referenced tests (NRTs) and criterion-referenced tests (CRTs). These tests are constructed to provide two very different types of information. Norm-referenced tests are created to compare test takers against one another, whereas criterion-referenced tests are designed to measure a test taker’s performance against a performance standard. NRTs are appropriately used to make selection decisions. For example, one might use an NRT to select the strongest applicants for admission to veterinary school; a valid NRT would be helpful in making those decisions, because there are typically many more applicants to vet schools than there are available openings. These tests are commonly given in schools, but are also used within personnel departments to improve hiring decisions. CRTs, however, are widely used in professional certification exams and are typically more appropriate in a training and performance improvement environment, when an organization needs to verify that workers can accomplish important job tasks. In terms of Kirkpatrick’s (1994) four evaluation levels, CRTs are appropriately administered following training to see if trainees have met the objectives of instruction—Level 2 evaluation.

A Criterion-Referenced Test Development Model

Even though CRTs are informative and useful to organizations, the procedures for constructing these tests have only recently been disseminated beyond professional psychometricians to training and performance improvement specialists. Figure 5-1 displays the model developed by Shrock and Coscarelli (2000, 2007), which provides an overview of the steps for creating CRTs that are both valid and defensible.

Because most training professionals, even those with graduate school degrees in instructional design, have not been schooled in measurement theory and practice, they typically write tests similar to the teacher-made tests they took so frequently during their own educational experiences. The majority of those tests will have been “topic-based” tests, neither designed to reliably separate test-taker scores from one another, nor grounded in specific competencies that would support a criterion-referenced interpretation of scores (Shrock and Coscarelli, 2007). Such invalid assessments are poor, even dangerous, examples to follow in today’s high-stakes world of competition and litigation.

The Value of Criterion-Referenced Testing

This chapter introduces some of the most essential guidelines to follow to make testing work for you and your organization. In particular, this chapter provides guidance for establishing content validity, determining test length, and setting the passing score for CRTs. Indeed, valid and defensible testing is doable in most organizations and the rewards are many: employees made more productive through improved knowledge and skill; a training and development staff guided by data to maximum effectiveness; a performance improvement function armed with the solid knowledge required to pinpoint problems through evaluation at Levels 3, 4, and 5—transfer, results, and ROI (Phillips, 1996); and an organization that succeeds for both its employees and its customers.

Ensuring That Tests Provide Valid and Defensible Information

Test validity and test defensibility both have roots in a single concept: job relatedness. In fact, the term job relatedness is a technical and legal phrase. Simply put, a test that does not measure what people actually know and do to perform a job successfully cannot be expected to produce valid information about who can do that job. Furthermore, administering tests that are not verifiably job related to make substantive decisions about test takers (for example, hiring or promotion decisions) is likely to be found illegal if challenged.

Content Validity

Although there are many different kinds of test validity (for example, face validity, predictive validity, concurrent validity, as well as others), the most important one for CRTs to possess is content validity. In the context of testing employees, the content of the test is valid if it asks test takers to demonstrate that they have job-related knowledge or can perform job-related tasks. One must understand an important legal qualification at this point. The test cannot demand that the employee demonstrate greater knowledge or more proficient performance of tasks than that required to do the job successfully. It is the role of an NRT to determine who among the test takers has the greatest knowledge and/or performs the job tasks the best. The goal of the CRT is to distinguish between those who are competent to perform the job and those who are not. Notice how the CRT development process will require the organization to engage in a very healthy examination of what employees actually must do on their jobs and how well they must do it.

Procedure for Establishing Content Validity. The process of formally establishing content validity is straightforward and conceptually simple. Job analysis performed by subject matter experts (SMEs) is the basis of the competency statements upon which the test items are written. Following item creation, SMEs must then examine the items and verify that they match important job knowledge and/or tasks. SMEs performing this review should also report any items that are unclear, have implausible incorrect answers among multiple-choice options, require a knowledge of vocabulary or terminology greater than that required by the job, or contain other suspected flaws. Because SMEs are often new to test creation, it is a good idea for the test designer to have some other knowledgeable colleague read over the test, looking for item flaws, such as cues to the correct answers.

Importance of Test Documentation. Legal defensibility of the test is likely to be important if performance on the test has substantial consequences for test takers, if the test measures mastery of competencies that protect health and safety, or if the test fulfills compliance obligations in a regulated field. To maintain legal defensibility of the test, it is critical to document the test-creation process. Because of content validation’s centrality among legal criteria, documentation of the content-validation procedure and reviewers is essential. Not only are the names of those SMEs who validated the content of the exam important elements of the test documentation, but also their credentials to serve in that role. Be certain to complete all test documentation contemporaneously, as each step in creating the test is taken. Trying to reconstruct test documentation after the test has been challenged is not likely to be successful.

Test Length

Determining how many items to include on a test is one of the most challenging decisions in creating the test. There is no simple guideline or rule that easily answers this question. We know that test length is positively related to test reliability—the consistency of test scores over repeated administrations or across different forms of the test. A longer test allows more accurate assessment than a shorter test, assuming all of the items are of equal quality. (This is another way of saying a bigger collection of invalid items doesn’t help.) However, longer tests are more expensive to create and, obviously, more expensive to administer, assuming that test-taking time is time lost to instruction or lost on the job.
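One standard way to quantify the length-reliability relationship, though it is not part of this chapter’s procedure, is the Spearman-Brown prophecy formula. The short sketch below uses assumed reliability values to show how projected reliability grows, with diminishing returns, as a test is lengthened.

# Illustrative sketch (not from this chapter): the Spearman-Brown prophecy
# formula projects the reliability of a test whose length is multiplied by k,
# assuming the added items are comparable in quality to the existing ones.

def spearman_brown(current_reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor."""
    r, k = current_reliability, length_factor
    return (k * r) / (1 + (k - 1) * r)

# Assumed starting point: a 10-item test with reliability 0.60.
for k in (1, 2, 3, 4):
    print(f"{10 * k:>2} items -> projected reliability {spearman_brown(0.60, k):.2f}")
# Gains shrink as the test grows: 0.60, 0.75, 0.82, 0.86.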

Problems with Uniform Test Length. Some decision makers, especially those in senior management, seek to simplify the test length decision by dictating a uniform test length throughout an organization, that is, mandating that all tests administered will have a specific number of items, usually 50 or 60. This practice is generally not wise. Organizations that use tests strategically will determine carefully the business consequences of testing errors when deciding on the necessary quality of the test—one component of which is test length. Remember there are two types of CRT errors—erroneously classifying someone who has not acquired the competencies assessed by the test as a “master” of them or judging as a “nonmaster” someone who is truly a master performer of the competencies. Therefore, tests that measure critical competencies and/or have serious consequences for test takers should be the longest tests the organization administers.

Content Domain Size and Test Length. Another important factor in determining test length is what psychometricians often call content domain size. The objective from which the items have been written describes the content domain, and theoretically, its size refers to the total number of possible items that could be written from that objective. Obviously, for most objectives, that could be a very large number. An actual count of the possible items, however, is not the intent of considering domain size. Rather, most test writers can imagine that the possible sizes of different objective content domains vary considerably. For example, compare the content domain sizes of the following objectives:

  • From memory, state the number of elements represented in the periodic table.
  • Without reference material, multiply any two-digit number by any two-digit number.
  • Given an instrument panel with all values represented, determine the cause of a loss of altitude in a Boeing 747.

Without attempting to count the possibilities that could be written to measure any of these objectives, most test writers can see that the first objective represents an extremely small content domain; the second one represents a much larger, but finite domain; and the third one represents a large domain with less certain boundaries. Extending this comparison to item numbers, objective one can be measured with one item, whereas objectives two and three are going to require considerably more items. In addition, objective three will require creating multiple items to assess the most common and the most dangerous causes of an airliner dropping in altitude. Criticality of the objective is a key governor of how thoroughly mastery of a large domain should be explored by test items.

Relatedness of Objectives and Test Length. An additional factor that affects test length is how closely related the objectives assessed by the test are to one another. If one objective represents a competency that is a prerequisite to performance of another objective, test items that measure the latter objective will simultaneously measure the former. For example, items that measure competency in long division also measure competencies in multiplication and subtraction. The point for test length is that if responses to items that measure a given objective are likely to match responses to items that measure a second objective (the responses are positively correlated), fewer items will likely suffice to assess these two objectives than would be required to assess two unrelated objectives (whose responses are likely uncorrelated). Essentially, the test writer can take advantage of the dual assessment properties of some items.

Problems with “Weighting” Test Items. Research on the relationship between test length and test reliability indicates that, for each objective measured, the gain in reliability attributable to each item added to the test starts to decline after the fourth item; reliability continues to increase as items are added, but at a diminishing rate (Hambleton, Mills, and Simon, 1983). Even though this finding is useful, most test writers recognize that, among the objectives assessed by a test, some objectives are likely to be more critical than others. The more critical objectives warrant more test items to measure them. In fact, this practice is the legitimate way to “weight” test scores; rather than weighting important items by multiplying the points awarded for correct answers to them, it is far better to weight important competencies by including more test items that measure them. The additional items improve the test’s quality, whereas multiplying item scores simply multiplies the effect on total test scores of any error associated with the multiplied item.
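To make the weighting point concrete, the hypothetical sketch below (with invented item counts and weights) contrasts the two approaches: multiplying the points awarded for one critical item versus adding more items that measure the same competency. A single flawed response, such as a lucky guess or a careless slip, shifts the total score by the full multiplier in the first case but by only one point in the second.

# Hypothetical comparison of two ways to "weight" a critical competency.
# Approach A multiplies the points for a single item; Approach B adds more
# items that measure the same competency. Any error attached to one response
# is multiplied in A but contributes only one point in B.

def total_score(responses, weights):
    """Weighted sum of 0/1 item responses."""
    return sum(r * w for r, w in zip(responses, weights))

# Approach A: five items, the critical item worth 3 points (maximum score 7).
weights_a = [3, 1, 1, 1, 1]
clean_a   = [1, 1, 1, 1, 1]   # every answer correct
slip_a    = [0, 1, 1, 1, 1]   # one careless error on the critical item

# Approach B: the critical competency measured by three 1-point items (maximum 7).
weights_b = [1, 1, 1, 1, 1, 1, 1]
clean_b   = [1, 1, 1, 1, 1, 1, 1]
slip_b    = [0, 1, 1, 1, 1, 1, 1]  # the same careless error touches one item

print(total_score(clean_a, weights_a) - total_score(slip_a, weights_a))  # 3-point swing
print(total_score(clean_b, weights_b) - total_score(slip_b, weights_b))  # 1-point swing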

A Decision Table for Test Length Determination. Table 5-1 was created to help test writers decide how many items to include on the completed test for each objective it assesses (Shrock and Coscarelli, 2007). Read from left to right, the table indicates in the last column an estimated number of test items per objective following decisions about criticality of the objective, content domain size, and relatedness to other objectives. Notice that the estimated total test length would then be the sum of the numbers of items included for each objective measured by the test.

It should be noted that Table 5-1 provides only estimates based on the factors discussed above that bear on the test length decision; a brief sketch showing the table used as a lookup appears after it below. The final decision regarding test length is always a matter of professional judgment, and the opinions of SMEs can be essential in making the criticality, content domain size, and objectives relatedness decisions.

Setting the Test’s Mastery Cut-Off Score

Unlike NRTs, where the meaning of a test score is a statement (typically a percentile) reflecting where that score falls in relation to the scores obtained by others in a given comparison group, the score on a CRT has meaning only in relation to (above or below) a predetermined cut-off, that is, “passing” score. Typically, a test taker’s raw score on a CRT is not reported; instead the test taker is reported as either a master or a nonmaster based on his or her CRT performance. Therefore, CRTs require that this cut-off score be established.

Table 5-1. Decision Table for Test Length Determination


If the Objective Is:

Criticality of Objective    Content Domain Size    Relatedness to Other Objectives    Number of Test Items
Critical                    From Large Domain      Unrelated                          10–20
Critical                    From Large Domain      Related                            10
Critical                    From Small Domain      Unrelated                          5–10
Critical                    From Small Domain      Related                            5
Not Critical                From Large Domain      Unrelated                          6
Not Critical                From Large Domain      Related                            4
Not Critical                From Small Domain      Unrelated                          2
Not Critical                From Small Domain      Related                            1
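For test developers who maintain item banks programmatically, Table 5-1 can be treated as a simple lookup, as in the hypothetical sketch below. The item counts come straight from the table; the dictionary, function, and variable names are inventions for illustration, and the final decision on length remains a matter of professional judgment.

# Hypothetical encoding of Table 5-1 as a lookup keyed on the three decisions.
# Values are (low, high) estimates of items per objective; single values from
# the table are stored as equal low and high bounds.

ITEMS_PER_OBJECTIVE = {
    # (critical?, large domain?, related to other objectives?): (low, high)
    (True,  True,  False): (10, 20),
    (True,  True,  True):  (10, 10),
    (True,  False, False): (5, 10),
    (True,  False, True):  (5, 5),
    (False, True,  False): (6, 6),
    (False, True,  True):  (4, 4),
    (False, False, False): (2, 2),
    (False, False, True):  (1, 1),
}

def estimated_items(critical: bool, large_domain: bool, related: bool) -> tuple:
    """Return the (low, high) estimate of test items for one objective."""
    return ITEMS_PER_OBJECTIVE[(critical, large_domain, related)]

# Estimated total test length is the sum of the estimates across objectives.
objectives = [(True, True, False), (True, False, True), (False, False, True)]
low = sum(estimated_items(*o)[0] for o in objectives)    # 10 + 5 + 1 = 16
high = sum(estimated_items(*o)[1] for o in objectives)   # 20 + 5 + 1 = 26
print(f"Estimated test length: {low} to {high} items")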

It is not uncommon for an organization to establish the cut-off score centrally, that is, to create a single cut-off score for all tests administered within the organization. Such cut-off scores might even be determined by a single manager and communicated as a directive to all test developers and administrators (not to mention test takers) affected. Too often, these singular cut-off scores reflect the school experiences of a handful of managers, that is, 80 percent was a B grade in high school, so 80 percent becomes the mandated passing score throughout an organization. CRT cut-off scores determined in this manner are likely to be indefensible if challenged through the courts.

The Angoff Method of Cut-Off Score Determination

There are several different recognized, defensible ways to set the CRT cut-off score. The technique presented here is one that belongs to a class of techniques called conjectural methods. The Angoff method (named for the psychometrician who devised it) is probably the technique used most often to set the passing scores for professional certification exams. Its popularity is no doubt largely due to its logistical feasibility, and its wide application probably strengthens its defensibility in the face of a challenge.

Steps in Using the Angoff Method. The Angoff technique relies on the SMEs’ ability to estimate, for each item, the likelihood that a “minimally competent” test taker (one who is just competent enough to do the assessed job tasks) will answer it correctly. The steps in implementing the Angoff method follow; a brief computational sketch appears after the list.

1. Begin by carefully selecting SMEs who are totally familiar with the competencies that the test assesses and who are also knowledgeable about the skill sets of persons who succeed and those who do not succeed in performing those competencies on the job.

2. If at all possible, bring the SMEs together at a common location where they can work both independently and together without interruptions.

3. Give each SME the items proposed for the test (or for inclusion in the item bank for the test), including the proposed correct answer to each item.

4. Separate the SMEs and ask them to read each item and, for each item individually, estimate the probability (from 0 to 1.0) that the “minimally competent” performer will answer that item correctly. Clarify that their estimates can be any value from 0 (indicating that virtually no minimally competent performer will get the item correct) to 1.0 (indicating that virtually any minimally competent performer will answer correctly); estimates need not be in 10- or 25-percent increments, that is, they can be .67 or .43—whatever probability best reflects the SME’s judgment. You will have to stress that the “minimally competent” performer is not an incompetent performer; rather, the minimally competent person is a just-passing master of the competencies reflected in the items.

5. Bring the SMEs back together and examine their estimates for each item. Ask them to discuss and reconcile any estimates that differ by more than .10 (10 percentage points).

6. Add up the probability estimates for each SME.

7. Add all of the totals and divide by the number of SMEs—average the total estimates across all SMEs.

8. The average estimate is the proposed cut-off score for the test. This cut-off score can be adjusted higher or lower depending on whether the organization is especially concerned about preventing nonmasters from assuming the assessed job position or whether the greater concern is failing to recognize true masters.
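The arithmetic in steps 6 through 8 can be captured in a few lines. The sketch below uses invented estimates from three hypothetical SMEs rating a five-item test; the result is a raw-score cut-off, which can be divided by the number of items if a percentage standard is preferred.

# Hypothetical Angoff computation. Each SME estimates, for every item, the
# probability (0 to 1.0) that a minimally competent performer answers it
# correctly; summing per SME and averaging across SMEs yields the cut-off.

sme_estimates = {
    "SME 1": [0.90, 0.75, 0.60, 0.85, 0.70],
    "SME 2": [0.85, 0.80, 0.55, 0.90, 0.65],
    "SME 3": [0.95, 0.70, 0.60, 0.80, 0.75],
}

# Step 6: add up the probability estimates for each SME.
sme_totals = {name: sum(probs) for name, probs in sme_estimates.items()}

# Step 7: average the totals across all SMEs.
cut_off_raw = sum(sme_totals.values()) / len(sme_totals)

num_items = len(next(iter(sme_estimates.values())))
print(f"Proposed cut-off: {cut_off_raw:.2f} of {num_items} items "
      f"({cut_off_raw / num_items:.0%})")
# Step 8: adjust this value up or down depending on which classification error
# (passing a true nonmaster or failing a true master) most concerns the organization.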

Knowledge Check: A Microtest to Practice Setting a Cut-Off Score

For this practice exercise, assume that you are an SME trying to establish a cut-off score for a test composed of the following five items. The test has been created to measure competence in creating CRTs using some of the information from this chapter. Read each item, choose the best answer, and note the correct answer appearing below the last choice. Then estimate the probability that the minimally competent test designer will answer the item correctly. (A real test to measure this competency would require far more items than appear here; this microtest is just for practice with setting a cut-off score.) Check your answers in the appendix.

Probability

_____ 1. Nearly every high-school graduating class selects a valedictorian and salutatorian. What kind of decision do these choices represent?

a. Norm-referenced

b. Criterion-referenced

c. Domain-referenced

d. None of the above

(Answer is a.)

_____ 2. Ann goes bowling. Every time she rolls the ball, it goes into the gutter. In testing terms, Ann’s performance might best be described as

a. both reliable and valid

b. neither reliable nor valid

c. valid, but not reliable

d. reliable, but not valid

(Answer is d.)

_____ 3. According to the research on the relationship between test length and test reliability, a test that measures seven objectives should have a total of how many items?

a. 7

b. 14

c. 21

d. 28

(Answer is d.)

_____ 4. Which of the following is most likely a criterion-referenced test?

a. A fourth-grade teacher-made geography test

b. A state driver’s license test

c. A spelling bee contest

d. A standardized college admission exam

(Answer is b.)

_____ 5. Which of the following objectives would probably be represented by the largest number of questions on a test?

a. Given the relevant numerical values, calculate gross national product for the United States in 2008.

b. Given a decade from history and a choice among national political leaders, identify the most influential leaders of the specified time period.

c. Given access to survey data regarding food preferences, religious beliefs, national origin, age, and gender, generate strategies for weight loss among those who responded.

(Answer is c.)

_____ Cut-Off Score (sum of estimates)
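As a worked example with invented estimates (not the “right” ones), suppose your five probability estimates were .90, .80, .70, .95, and .60. The cut-off score would be their sum, .90 + .80 + .70 + .95 + .60 = 3.95, so a test taker would need to answer at least 4 of the 5 items correctly (a standard of 79 percent) to be classified a master.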

Guidelines for Creating Criterion-Referenced Tests

As you might imagine from looking at the 13 steps in the Model for Criterion-Referenced Test Development (figure 5-1), there are quite a few decisions and processes you will have to attend to in creating a valid CRT. However, in this section we have tried to distill the most pressing CRT issues into eight guidelines that should serve as important mileposts for your efforts.

1. Clarify the purpose of any test you are asked to create. If the purpose is to rank order test takers or choose the best performers, use a norm-referenced test. Use a criterion-referenced test when you want to make a master/nonmaster decision about individual test takers based upon their performances of specific competencies.

2. Be certain that your CRT assesses only job-related competencies; its content validity and defensibility rely on a careful job analysis.

3. Document the content validation process and all subsequent steps taken in creating the test; the defensibility of the test relies heavily on contemporaneous documentation.

4. Because test reliability is related to test length, make strategic decisions regarding the criticality of testing consequences before determining test length; the more critical the competencies assessed, the longer the test should be.

5. Consider the content domain size implied by the competencies assessed as well as the relatedness of the assessed competencies to one another in determining test length. Include more items on the test for critical competencies, those that require mastery of a large content domain, and those that are unrelated to other competencies assessed by the test.

6. Use a defensible, recognized technique to set the cut-off score that separates masters from nonmasters; the Angoff technique is perhaps the easiest one logistically for most organizations to use.

7. Be certain that the SMEs chosen to participate in the Angoff cut-off score determination understand the concept of “minimal competence.” Allow the SMEs to practice estimating probabilities for several items and share their opinions to help them reach consensus about the concept of “minimal competence” for the assessed job competencies. Remind SMEs that the estimated probability for any item should not be below the chance probability of answering the item correctly. For example, the minimum probability estimate for a four-choice multiple-choice item should be .25.

8. Adjust the cut-off score up or down depending on which error in test results—false positives or false negatives—the organization most wants to minimize.

About the Authors

Sharon A. Shrock, PhD, graduated from Indiana University in 1979 with a PhD in instructional systems technology. She joined the faculty at Virginia Tech before moving to Southern Illinois University Carbondale, where she is currently professor and coordinator for the instructional design and technology programs within the Department of Curriculum and Instruction. She specializes in instructional design and program evaluation and has been an evaluation consultant to international corporations, school districts, and federal instructional programs. She is the former co-director of the Hewlett-Packard World Wide Test Development Center. Shrock is first author (with William C. Coscarelli) of Criterion-Referenced Test Development: Technical and Legal Guidelines for Corporate Training (2007), now in its third edition, and has written extensively in the field of testing and evaluation. In 1991, she won the Outstanding Book Award for Criterion-Referenced Test Development from both AECT’s Division of Instructional Development and the National Society for Performance and Instruction. She can be reached at [email protected].

William C. Coscarelli, PhD, graduated from Indiana University in 1977 with a PhD in instructional systems. He joined the faculty at Southern Illinois University Carbondale, where he is professor emeritus in the instructional design program of the Department of Curriculum and Instruction. He also served as the co-director of the Hewlett-Packard World Wide Test Development Center. Coscarelli is a former president of the International Society for Performance Improvement (ISPI), an international association dedicated to improving performance in the workplace. He was ISPI’s first vice president of publications and is the recipient of ISPI’s Distinguished Service Award. He is author of the Decision-Making Style Inventory (2007), coauthor (with Gregory White) of The Guided Design Guidebook (1986), and second author (with Sharon Shrock) of Criterion-Referenced Test Development: Technical and Legal Guidelines for Corporate Training (2007). He has made more than one hundred presentations in his career and written more than 60 articles. He can be reached at [email protected].

References

Hambleton, R. K., C. N. Mills, and R. Simon. (1983). Determining the Lengths for Criterion-Referenced Tests. Journal of Educational Measurement 20(1): 27–38.

Kirkpatrick, D. L. (1994). Evaluating Training Programs: The Four Levels. San Francisco: Berrett-Koehler.

Phillips, J. J. (1996, April). Measuring ROI: The Fifth Level of Evaluation. Technical and Skills Training. www.astd.org/virtual_community/comm_evaluation/phillips/pdf. Retrieved November 9, 2009.

Shrock, S. A. and W. C. Coscarelli. (2000). Criterion-Referenced Test Development: Technical and Legal Guidelines for Corporate Training and Certification. Washington, DC: International Society for Performance Improvement.

Shrock, S. A. and W. C. Coscarelli. (2007). Criterion-Referenced Test Development: Technical and Legal Guidelines for Corporate Training. San Francisco: Pfeiffer.

Additional Reading

Cizek, G. J. and M. B. Bunch. (2007). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks, CA: Sage Publications.

Downing, S. M. and T. M. Haladyna, eds. (2006). Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum.

Hambleton, R. K. (1999). Criterion-Referenced Testing Principles, Technical Advances, and Evaluation Guidelines. In T. Gutkin and C. Reynolds, eds., The Handbook of School Psychology, 3rd ed. New York: Wiley, 409–433.
