CHAPTER SIXTEEN
OUTCOME ASSESSMENT AND PROGRAM EVALUATION

John Clayton Thomas

Nonprofit organizations need to know how effectively they are performing their jobs. Are their programs achieving the desired results? How could programs be modified to improve those results? Because the goals of nonprofit programs are often subjective and not readily observable, the answers to these questions may be far from obvious.

These questions have grown in urgency in recent years as a consequence of new external pressures. As perhaps the watershed event, “the Government Performance and Results Act of 1993…placed a renewed emphasis on accountability in federal agencies and nonprofit organizations receiving federal support” (Stone, Bigelow, and Crittenden, 1999, p. 415). More specific to nonprofit organizations, funders increasingly demand evidence of program effectiveness, as exemplified by the United Way of America's outcome measurement initiative of recent decades and its more recent Outcome Measurement Resource Network (see, for example, the Outcome Measurement Resource Network pages on the United Way of America website). As Ebrahim describes in Chapter Four of this volume, there is today a major press for nonprofit organizations to be highly accountable in all sorts of ways. Yet research on contemporary practice indicates that many nonprofit agencies still perform relatively little assessment of program performance (Carman, 2007; Morley, Hatry, and Cowan, 2002).

To address these pressures, nonprofit organizations need, at a minimum, to engage in systematic outcome assessment—that is, regular measurement and monitoring of how well their programs are performing relative to the desired outcomes. (The terms outcome assessment and performance assessment will be used interchangeably in this chapter.) Nonprofit executives may want, in addition, to employ the techniques of program evaluation in order to determine the specific role their programs played in producing any observed beneficial changes. Used appropriately, outcome assessment and program evaluation inform a wide range of decisions about whether and how programs should be continued in the future and satisfy funder requirements.

This chapter introduces the techniques of outcome assessment and program evaluation as they might be employed by nonprofit organizations. These techniques are not designed for more general evaluations of organizational effectiveness, which, as Herman and Renz (2008, pp. 411–412) have observed, seldom “could be legitimately considered to equal” program effectiveness (see Chapter Ten in this volume for a detailed discussion of organizational effectiveness). The emphasis here is on how these tools can be useful to organization executives by providing information that speaks to decisions those executives must make. To make that case, we will first provide a step-by-step description of how to conduct outcome assessment, before turning to how that assessment can be incorporated into more advanced program evaluation processes.

Planning the Process for Outcome Assessment

For outcome assessment to have maximum value, the process for that assessment must be well planned and executed. The first step in that regard is for the organization's leaders to be committed to the effort. Ideally, an initiative of this kind will begin with the chief executive of the nonprofit organization but, in any event, the chief executive and the organization's board should understand and support the initiative. Support includes recognizing and accepting that outcome assessment could uncover unwelcome truths about program performance. There is no point in taking the time to develop and obtain performance data if those in charge are not committed to using the data.

Assuming this support is assured, outcome assessment should be undertaken on a program-by-program basis. For a nonprofit organization with multiple programs, that guideline means that each program will require separate outcome assessment planning. A specific individual should be assigned primary responsibility for that planning for each program, preferably as part of a small team. (The discussion that follows will use the term decision makers to encompass both possibilities.)

Regardless of whether a team is used, the planning process should entail extensive involvement of staff and perhaps even clients who are involved with the program. That involvement serves at least two functions. First, it assists in information gathering. Since those who are involved with a program know it from the inside, they can provide valuable intelligence on the program's desired outcomes and possible means for measuring their achievement. Second, involvement can build ownership in the outcome assessment process. If program staff and clients have the opportunity to speak to how the program will be assessed, they are more likely to buy into the eventual results of the assessment.

Finally, the process should also be linked to the organization's information technology. Improved information technology, by facilitating the recording and analysis of performance data, is a major factor underlying the recent push for better performance assessment in both the nonprofit and public sectors. Building a strong performance assessment system requires that the system be planned in conjunction with the organization's technology.

Defining Program Goals

Outcome assessment is a goals-based process; programs are assessed relative to the goals they are designed to achieve. Defining those goals can prove to be a difficult task since the project leader or team must define and differentiate several types of goals while navigating the often-difficult politics of goal definition. This section first explains several goal types and then discusses how to define them in a political context.

Types of Goals

A first type of goal refers to the ultimate desired program impact. United Way of America, on its outcome measurement website, defines outcome goals as “benefits or changes for individuals or populations during or after participating in program activities.” Here we prefer a broader definition of outcome goals as the final intended consequences of a program for its clients and/or society. An outcome goal has value in and of itself, not as a means to some other end, and is usually people-oriented because most public and nonprofit programs are designed ultimately to help people. This broader definition probably fits better with the more recent United Way interest in measuring the success of programs based on both “client-level outcomes” and “achievement of measurable community outcomes” (Minich et al., 2006, p. 183).

Activity goals, by contrast, refer to the internal mechanics of a program, the desired substance and level of activities within the program. These specify the actual work of the program, such as the number of clients a program hopes to serve. How the staff of a program spend their time—or are supposed to spend their time—is the stuff of activity goals.

The distinction between outcome and activity goals can be illustrated through a hypothetical employment-counseling program. An activity goal for this program might be “to provide regular employment counseling to clients,” with an outcome goal being “to increase independence of clients from public assistance.” The activity goal refers to the work of the program, the outcome goal to what the work is designed to achieve. As this example also suggests, outcome goals tend to be more abstract, conceptual, and long term; activity goals are more concrete, operational, and immediate.

Understanding the distinction is crucial if outcome assessments are to resist pressures to evaluate program success in terms of activity rather than outcome goals. Program staff often push in that direction for several reasons. First, activity goals are more visible to them; they can more readily see the results of their day-to-day work than what that work is designed to achieve sometime in the future. Second, activity goals tend to be more measurable; it is easier to measure the “regularity” of counseling than “independence from welfare.” Finally, activity goals are usually easier to achieve. Police working in a crime prevention program, for example, can be much more confident of achieving an activity goal of “increasing patrols” than an outcome goal of “reducing crime.”

Outcome assessment planning often can sidestep pressures of this kind by including both types of goals in the goal definition. As a practical matter, both outcome and activity goals must be examined in most outcome assessments in order to know how different parts of a program link to eventual program outcomes.

Sometimes a program will have so many activity goals that attempting to define, much less measure, all of them as part of the outcome assessment system would be impractical. A good guideline in such cases is to define activity goals only for key junctures in the program, that is, only at the major points in the program sequence where information is or might be wanted (see also Savaya and Waysman, 2005, p. 97).

Falling between activity and outcome goals are bridging goals, so named because they are expected to connect activities to outcomes (Weiss, 1972, pp. 48–49). Bridging goals, like outcome goals, relate to intended consequences of a program for society, but bridging goals are an expected route to the final intended consequences, rather than being final ends in and of themselves. In an advertising campaign designed to reduce smoking, for example, a bridging goal between advertising (activity) and reduced smoking (outcome) might be “increased awareness of the risks of smoking.” That increased awareness would be a consequence of the program for society but, instead of being the final intended consequence, it is only a bridge from activity to outcome.

Bridging goals can be important for outcome assessment systems for a variety of reasons. For one thing, because they are often essential linkages in a program's theory of change—the hypothesized process by which program inputs lead to outcomes—their achievement may be a prerequisite to demonstrating that program activities have produced the desired outcomes. Thus, to confirm that a program works, it may be necessary to establish first that the bridging goal is achieved before any change on the outcome goal would even be relevant. Bridging goals also may provide a means to obtain an early reading on whether a program is working. Effects may be observable on a bridging goal when it is still too early to see any impact on final outcome goals.

Outcome assessment systems may also occasionally incorporate side effects. Side effects, like outcome and bridging goals, are also consequences of a program for society, but unintended consequences. They represent possible results other than the program's goals. For example, a neighborhood crime prevention program might displace crime to an adjacent neighborhood, producing the side effect of increased crime there. A side effect can also be positive, as when a neighborhood street cleanup program induces residents to spruce up their yards and homes, too.

Given the potential for any given program to have a wide range of side effects, where should an assessment draw the line? An outcome assessment for a nonprofit agency program should incorporate side effects only to the extent that the chief executive, staff, and/or other key stakeholders view specific possible side effects as important program aspects. Is there an interest in examining a possible negative side effect, perhaps with an eye to changing the program so as to reduce or eliminate that result? Agency decision makers must make that judgment based on whatever data they believe are necessary for a full outcome assessment. In most cases, given a principal interest in activity and outcome goals, the executive may not want to spare limited resources to monitor possible side effects, too. On occasion, though, possible side effects may loom as so important that they must be addressed.

Whatever the type of goal, its definition should satisfy several criteria:

  1. Each goal should contain only one idea. A goal statement that contains two ideas (for example, “increase independence from welfare through employment counseling”) should be divided into two parts, with each idea expressed as a distinct goal.
  2. Each goal should be distinct from every other goal. If goals overlap, they may express the same idea and so should be differentiated.
  3. Goals should employ action verbs (for example, “increase, improve, reduce”), avoiding the passive voice.

Goal definitions can be derived from two principal sources: (1) program documentation, including initial policy statements, program descriptions, and the like, and (2) the personnel of the program, including program staff, the organization's executive, and possibly other key stakeholders such as clients. These personnel should always be asked to react to draft goals before they are finalized.

The Politics of Goals Definition

Understanding the different types of goals and where to find them may be the easy part of goal definition. The difficult part can be articulating those definitions in a manner that satisfies all important stakeholders. To do that may require navigating the perilous politics of goal definition.

As a first difficulty, many programs begin without clearly defined goals. Initial program development focuses on where money should be spent to the neglect of defining what the program is expected to achieve. Second, as programs adapt to their environments, goals sometimes change and, in the process, depart from the program's original intent. “Policy drift” can result wherein programs move away from that original intent, and once-distinct goals become fuzzy or inconsistent (for an example, see Kress, Springer, and Koehler, 1980).

More difficulties can arise when planning for outcome assessment begins. The commonly perceived threat from any kind of assessment may prompt some program staff or other stakeholders, when they are asked, to be evasive about goals. Or, those staff or other stakeholders from their different vantage points inside and outside the program may simply have different views, resulting in conflicting opinions about a program's goals.

A variety of techniques is available to cope with these problems. Fuzzy or inconsistent goals may be accommodated by including all of the different possible goals in a comprehensive goals statement. If some perspectives appear too contradictory to fit in the same statement, a goals clarification process might be initiated (for an illustration, see Kress, Springer, and Koehler, 1980). Working with staff and stakeholders to clarify the goals of a program could be the most important contribution of an outcome assessment planning process since it may build a cohesiveness previously lacking in the program.

Disagreement over goals can also sometimes be sidestepped as irrelevant. Patton (2008, pp. 238–241) recommends asking stakeholders what they see as the important issues or questions about the program. These issues, because they represent areas where information might be used, should be the focus of most eventual data analysis anyway. And, there may be more agreement about issues than about goals. Decision makers might then be able to express these issues in terms of the types of goals outlined earlier.

The agency's chief executive can play any of several roles in the definition of program goals. At a minimum, the executive should oversee the entire process to ensure the necessary participation, lending the authority of her or his position as necessary. Ideally, the executive should review proposed goals as they are defined, both for clarity and for conformity to the agency's overall focus. Finally, if conflicts over goals arise, the executive may need to intervene to achieve resolution.

The Impact or Logic Model

As part of the process of goal definition, a program's various goals should be combined into a visual impact or logic model—an abstracted model of how the various goals are expected to link to produce the desired outcomes (see Savaya and Waysman, 2005; McLaughlin and Jordan, 2015; W. K. Kellogg Foundation, 2004). Such a model should have several characteristics. First, it should be an abstraction, removed from reality but representing reality, just as the goals are. Second, the model should simplify matters, reducing substantially the detail of reality. Third, as the “logic” component, the model should make explicit the significant relationships among its elements, showing for example how activity goals are expected to progress to outcome goals. Fourth, the model may involve formulation of hypotheses—the suggestion of possible relationships not previously made explicit in program documents or by program actors. Indeed, a principal benefit of model development often lies in how program stakeholders are prompted to articulate hypotheses they had not previously recognized. Exhibit 16.1 shows an impact model for a hypothetical nonprofit training program.

The model links the various goals from the initial activity goals through the bridging goals to the ultimate outcome goal. As the model illustrates, bridging goals sometimes fall between two activity goals, but still serve as links in the chain from activity goals to outcome goals. This model may be atypical in that the goals follow a single line of expected causality, whereas the more common model may fork at one or more points (as, for example, if different types of executives received different kinds of training). Should staff and/or stakeholders disagree about the likely impact model, decision makers must determine whether the disagreement is sufficiently important to require resolution before further outcome assessment planning can proceed.
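
For readers who maintain program data electronically, the chain of goals in a simple logic model can also be represented as a small data structure. The following Python sketch is purely illustrative: the goal wording and the Goal record are hypothetical, standing in for whatever goals a particular program, such as the hypothetical training program of Exhibit 16.1, might define.

```python
from dataclasses import dataclass

@dataclass
class Goal:
    name: str         # short statement of the goal
    goal_type: str    # "activity", "bridging", or "outcome"

# A hypothetical single-chain logic model for a nonprofit training program.
# Each goal in the list is expected to lead to the next one.
logic_model = [
    Goal("Recruit eligible participants", "activity"),
    Goal("Deliver management training sessions", "activity"),
    Goal("Increase participants' management knowledge", "bridging"),
    Goal("Improve participants' on-the-job performance", "bridging"),
    Goal("Strengthen the organizations participants serve", "outcome"),
]

# Print the hypothesized chain from activities to the outcome goal.
print(" -> ".join(f"{g.name} [{g.goal_type}]" for g in logic_model))
```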

Development of an impact model can help staff and stakeholders clarify how they expect a program to work and the questions they have about its operation, in the process perhaps suggesting how to use assessment data once it becomes available. As Savaya and Waysman (2005, pp. 85–86) have documented, impact models can be useful for a variety of other purposes, too—from “assessing the feasibility of proposed programs” to “developing performance monitoring systems.” To date, however, these models still appear to be used only infrequently by nonprofit organizations (Carman, 2007, p. 66).

Measuring Goals

Once the goals have been defined, attention must turn to how to measure them. Before thinking about specific measures, decision makers should become familiar with some basic measurement concepts and with the various types of measures available.

Concepts of Measurement

Measurement is an inexact process, as suggested by the fact that social scientists commonly speak of “indicators” rather than measures. As the term implies, measurement instruments indicate something about a concept (that is, about a goal), rather than provide perfect reflections of it. So, crime reported to police constitutes only a fraction of actual crime, and scores on paper-and-pencil aptitude tests reflect the test anxiety and/or cultural backgrounds of test takers as well as their aptitudes.

The concepts of measurement validity and measurement reliability rest on recognition of the inexactness of measurement. Measurement validity refers to whether or to what extent a measure taps what it purports to measure. More valid measures capture more of what they purport to measure. Measurement reliability refers to a measurement instrument's consistency from one application to another. Reliability is higher if the instrument produces the same reading (a) when applied to the same phenomenon at two different times or (b) when applied by different observers to the same phenomenon at the same time. Obviously, the better measures are those that are more valid and reliable.

Executives and staff of nonprofit agencies need not become experts on how to assess the validity and reliability of measures, but they should keep at least two points in mind. First, given the fallibility of any particular measure, multiple measures—two or more indicators—are desirable for any important goal, especially any major outcome goal. (One measure each may prove sufficient for many activity goals.) Once data collection begins, the different measures should then be compared to see if they appear to be tapping the same concept. Second, if there are concerns about reliability, taking multiple observations is recommended. Any important measure should, if possible, be applied at a number of time points to see if and how readings might fluctuate. (Multiple observations are also useful for other aspects of research design, as explained later in the chapter.)
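
As a concrete illustration of these two points, the following sketch compares two hypothetical indicators of the same outcome goal and checks the consistency of one indicator over repeated readings. The client scores and the use of simple correlation are assumptions made for the example; an agency's actual indicators and reliability checks would differ.

```python
import numpy as np

# Hypothetical scores for ten clients on two indicators of the same outcome goal
# (for example, a self-confidence scale and a counselor rating).
indicator_a = np.array([3, 4, 2, 5, 4, 3, 5, 2, 4, 3])
indicator_b = np.array([2, 4, 3, 5, 4, 3, 4, 2, 5, 3])

# If the two indicators tap the same concept, they should be strongly correlated.
validity_check = np.corrcoef(indicator_a, indicator_b)[0, 1]

# Hypothetical repeat readings of indicator A for the same clients two weeks later.
indicator_a_retest = np.array([3, 4, 3, 5, 4, 3, 5, 2, 4, 4])

# Test-retest reliability: a consistent instrument produces similar readings over time.
reliability_check = np.corrcoef(indicator_a, indicator_a_retest)[0, 1]

print(f"Correlation between the two indicators: {validity_check:.2f}")
print(f"Test-retest correlation for indicator A: {reliability_check:.2f}")
```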

Decision makers must also consider face validity, that is, whether measures appear valid to key stakeholders. Measurement experts sometimes discount the importance of face validity on the grounds that measures that appear valid sometimes are not. However, the appearance of validity can be crucial to the acceptance of a measure as really reflecting program performance. As a result, decision makers should attend to whether recommended measures appear valid, while avoiding seemingly attractive measures that may not actually tap the relevant goal.

In selecting measures, the ability of program staff to assess measurement validity should not be underestimated. By virtue of their experience with the program, staff often have unique insights into the merits of specific measures, insights that trained outside experts might miss.

Types of Measures

Outcome assessments can employ several types of measures and, to achieve the benefit of multiple measures, will typically use two or more of these types. The different types are briefly introduced below in terms of what nonprofit executives and staff may need to know about each.

Program Records and Statistics

An obvious first source for data is the program itself. Records can be kept and statistics maintained by program staff for a variety of measures. Almost every evaluation will employ at least some measures based on program records (for a detailed treatment, see Hatry, 2015).

These measures must be chosen and used with caution, however. For one thing, program staff ordinarily should be asked to record only relatively objective data such as numbers of clients served, gender and age of clients, dates and times services are delivered, and the like. Staff usually can record these more objective data with little difficulty and high reliability; they should not be expected, without training, to record more subjective data such as client attitudes, client progress toward goals, and so on.

In a similar vein, although program records can serve as an excellent source of measures of activity goals—the amount of activity in the program—they should be used only sparingly as outcome measures and probably never as the sole outcome measures. Program staff are placed in an untenable position if they are asked to provide the principal measures of their own effectiveness, especially if those measures include subjective elements.

That concern notwithstanding, the staff who will record the measures should be involved in defining them. In addition to offering insights about measurement validity, staff can speak to the feasibility of the proposed record keeping and help ensure that the process will not be so onerous that staff must choose between spending their time on the evaluation or on the actual program. If that were to happen, either the evaluation would interfere with the program because staff gave too much time to record keeping, or the measures would produce poor data because staff slighted record keeping in favor of working on the program's activities.
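
One way to keep record keeping objective and light is to agree in advance on a few factual fields that staff can record reliably. The sketch below shows what such a service record might look like in practice; the file name and field names are hypothetical and would need to be adapted to the program and its information system.

```python
import csv
import os
from datetime import date

# Hypothetical, deliberately objective fields for a program service record:
# no judgments about client progress, only facts staff can record reliably.
FIELDNAMES = ["client_id", "service_date", "service_type", "minutes_of_service", "staff_id"]

write_header = not os.path.exists("service_log.csv")   # header only for a new log file

with open("service_log.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    if write_header:
        writer.writeheader()
    writer.writerow({
        "client_id": "C-1042",
        "service_date": date.today().isoformat(),
        "service_type": "employment counseling",
        "minutes_of_service": 45,
        "staff_id": "S-07",
    })
```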

Client Questionnaire Surveys

Any program serving client populations, including most nonprofit programs, should include some measures of client perceptions and attitudes. These perceptions could include ratings of the program's services and service providers, client self-assessments, and other basic client information. The obvious means for obtaining these measures is a questionnaire survey, of which there are several forms. Each has its own advantages and disadvantages (see also Newcomer and Triplett, 2004; Rea and Parker, 2014).

Phone surveys can produce good response rates (that is, responses from a high proportion of the sample), assuming respondents are contacted at good times (usually in the evening) and interviewed for no more than ten to fifteen minutes. However, phone surveys can be expensive due to interviewer costs and the need for multiple phone calls in order to reach many respondents.

The desire for a lower-cost procedure often leads to consideration of mail surveys. Here questionnaires are mailed to respondents, who are asked to complete and return them by mail. Any reduction in costs through using a mail survey can be more than offset, however, by the frequent poor response rate, typically lower and less representative than with a phone survey. Questions on mail surveys must also be structured more simply since no interviewer is available to guide the respondent through the questionnaire. Mail surveys work best when sent to groups that are both highly motivated to respond (as sometimes with clients of nonprofit programs) and willing and able to work through written questions independently. Even then, obtaining a high response rate usually requires sending one or two follow-up mailings to nonrespondents.

E-mail surveys represent a contemporary variation on the mail survey. Relatively inexpensive online options are now available, too, for recording and summarizing responses. Obviously, though, this technique will work only with a computer-literate population, and, as with mail surveys, the population must be motivated to respond. As another alternative, e-mail surveys might be combined with traditional mail surveys, an approach that can often produce excellent response rates (see, for example, Thomas, Poister, and Ertas, 2010).

The best choice for many programs will be the so-called convenience survey, a survey of respondents who are available in some convenient setting such as when they receive program services. A program can capitalize on that availability by asking clients, while on site, to complete and return a brief questionnaire. As with mail surveys, the questionnaires must be kept simple and brief to permit easy and rapid completion. To reassure respondents about the confidentiality of their responses, ballot-box-like receptacles might be provided for depositing completed questionnaires. A well-planned convenience survey can produce a good response rate at a cost lower than that of any of the alternatives.

Construction of any kind of questionnaire requires some expertise. Agency executives wishing to economize might share the construction process with an outside consultant. The outcome assessment planners might draft initial questions for the consultant to critique before another review by staff and again by the consultant. This collaborative procedure can both reduce the organization's costs and provide training in questionnaire construction to program staff.

Formal Testing Instruments

With many programs, the outcomes desired for clients—self-confidence, sense of personal satisfaction—are sufficiently common that experts elsewhere have already developed appropriate measurement instruments. Some formal testing instruments are available free in the public domain; others may be available at a modest per-unit cost. In either case, it is sometimes wiser to obtain these instruments than to develop new measures.

Trained Observer Ratings

These ratings can be especially useful “for those outcomes that can be assessed by visual ratings of physical conditions,” such as physical appearance of a neighborhood for a community development program (Hatry and Lampkin, 2003, p. 15). As that example suggests, these ratings work best for subjective outcomes that are not easily measured by other techniques. These ratings can be expensive in terms of both time and money, however, since their use necessitates development of a rating system, training of raters, and a plan for oversight of the raters. It may also be difficult in a small or moderate-sized nonprofit agency to find raters who do not have a personal stake in a program's effectiveness.

Qualitative Measures

Outcome assessment will typically be enhanced by use of some qualitative measures, measures designed to capture nonnumerical in-depth description and understanding of program operations. Although experts long disdained these measures as too subjective to be trusted, most now recognize that programs with subjective goals cannot be evaluated without qualitative data.

Qualitative measures can be obtained through two principal techniques, observation and in-depth interviews. Observation can provide a sense of how a process is operating, as, for example, in evaluating how well group counseling sessions have functioned. By observing and describing group interaction, an evaluator could gain a sense of process unavailable from quantitative measures.

In-depth interviews have a similar value. In contrast to questionnaire surveys, relatively unstructured interviews are composed principally of open-ended questions designed to elicit respondent feelings about programs without the constraints of the predefined multiple-response choices of structured questionnaires. These interviews can be extremely useful as, for example, in assessing the success of individualized client treatment plans.

Still, nonprofit agencies should use qualitative measures with caution, taking care to avoid either over- or underreliance on them. Evaluation of most nonprofit programs calls for multiple measures, including both quantitative and qualitative measures. Outcome assessment planners should be sure that both perspectives are obtained.

Finally, outcome assessment planners should be prepared for the possibility that discussion of measures may rekindle debate about goals. Perhaps staff paid too little attention to the earlier goal definition, or perhaps thinking about measures prompts staff to see goals differently. When that happens, planners should be open to a possible need to reformulate the goals.

Data Collection, Analysis, and Reporting

Once the necessary measures have been defined, decision makers should plan for data collection, analysis, and reporting. They must ensure first that procedures are established for recording observations on the measures. They must also determine how the new data collection process will be integrated with the agency's information technology, including evaluating whether new software will be needed for the effort.

Before putting the full outcome assessment in place, the agency should pilot test the measures and the data collection procedures to see how well they work. Measures sometimes prove not to produce the anticipated information. Convenience surveys, for example, sometimes elicit only partial responses from program clients, which could require either improving or abandoning that instrument. Problems can also arise in the recording of data, perhaps necessitating rethinking the recording procedures.

Decision makers, certainly including the agency's chief executive, should also establish a schedule for regular reporting and review of the data. Depending on agency preferences and perceived needs, reviews might be planned as frequently as weekly or as infrequently as annually. Or, less intensive reviews might be planned more often with more intensive reviews scheduled only occasionally (e.g., on a quarterly or annual basis).

The details of the schedule are probably less important than that a schedule is established and implemented. Judging from the findings of one study (Morley, Hatry, and Cowan, 2002, p. 36), many nonprofit agencies that collect outcome information do not systematically tabulate or review the data, instead “leaving it to supervisors and caseworkers to mentally ‘process’ the data to identify patterns and trends.” Agencies unnecessarily hamstring themselves when they make such choices. If systematic outcome data are available, agencies should ensure that the data are tabulated and reviewed.
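
As a minimal illustration of what systematic tabulation might look like, the sketch below aggregates hypothetical client-level outcome records by quarter so that trends can be reviewed on a regular schedule rather than processed mentally. The data, column names, and use of the pandas library are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical client-level outcome records exported from the agency's system.
records = pd.DataFrame({
    "quarter": ["2023Q1", "2023Q1", "2023Q2", "2023Q2", "2023Q3", "2023Q3"],
    "satisfied": [1, 0, 1, 1, 1, 0],          # client reported being satisfied (1 = yes)
    "employed_90_days": [0, 0, 1, 0, 1, 1],   # outcome indicator: employed 90 days after exit
})

# Tabulate the share of clients meeting each indicator, by quarter,
# so executives and staff can compare current performance to past performance.
summary = records.groupby("quarter")[["satisfied", "employed_90_days"]].mean().round(2)
print(summary)
```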

Growing numbers of governments and some nonprofit agencies (see Carman, 2007, p. 65) are taking the additional step of putting program effectiveness and efficiency data into comprehensive performance management systems and then reporting the data through summary “scorecards” or “dashboards.” Popularized by Kaplan and Norton (1996), the so-called “balanced scorecards” are designed to present a 360-degree picture of an organization's performance at any given time. With their growing popularity among governments (see, for example, Edwards and Thomas, 2005), these scorecards seem likely to become increasingly common among nonprofit agencies in the coming years.

Actual review of the data can go in a number of directions depending on what the data look like and what questions the agency has about the programs. At the outset, initial data on any new measures can be at once the most interesting, yet the most difficult to interpret. Novelty accounts for the likely high interest: agency executives and staff may be looking at outcome readings they have only been able to guess at before. However, with initially only one data point to analyze, those readings may seem uninterpretable. Interpretation becomes easier as readings accumulate over time, permitting comparisons of current performance to past performance.

To increase interest and potential utilization of the data, agencies should consider asking key staff to predict results in advance. Poister and Thomas (2007) have documented that asking for such predictions (albeit on a limited number of measures) increased interest among state administrators in the results of stakeholder surveys. Prediction questions can be asked relatively simply, too (e.g., “what proportion of program clients will say they are satisfied with the program?”).

The focus of the interpretation depends on a variety of factors. If the data show an unexpected trend or pattern—such as an unanticipated decline on an outcome measure from one quarter to the next—attention may focus on explaining that pattern. More generally, though, the analysis of the data should be driven by the questions and concerns of the agency. Is there a concern about whether a program is working at all? Or, might the concern instead be whether a new program component is achieving desired improvements?

At the same time, care should be taken not to over-interpret outcome data. In particular, outcome data should not by themselves be read as implying causality—that is, to conclude that any observed changes resulted from a specific program or programs. Such changes could have resulted from other factors (a change in the economy, for example) that are wholly independent of the program. Outcome assessment data by themselves speak to important questions of whether progress is being made on key agency objectives, but cannot explain the part the agency has played in inducing those changes.

When questions about program performance turn toward these issues of causality, agency executives must move a step beyond outcome assessment to conduct a full program evaluation. Program evaluations, in essence, start from a foundation of strong outcome assessment and add the techniques of comparison and control necessary to speak more definitively to the role of specific programs in producing desired outcomes.

Two Approaches to Program Evaluation

Program evaluation can seem a frightening prospect, raising the specter of outside experts “invading” the organization, seeking information in a mysterious and furtive manner, and ultimately producing a report that may contain unexpected criticisms. Such fears are not ungrounded. The traditional approach to program evaluation, sometimes termed the objective scientist approach, often proceeds along those lines.

Borrowed from the natural sciences, the objective scientist approach entails several elements. To begin with, objectivity is valued above all else. To achieve objectivity, the evaluator seeks to maintain critical distance from the program being evaluated in order to minimize possible influence by program staff, who may be biased in the program's favor. The objective scientist also strongly prefers quantitative data, recognizing qualitative data to be subjective by nature—the antithesis of objectivity. Finally, the usual purpose of an evaluation for the objective scientist is to determine whether or to what extent the program has achieved its goals. Is the program sufficiently effective to be continued, or should it be terminated? The objective scientist takes little interest in how a program's internal mechanics are functioning.

Two decades of experience have revealed shortcomings in this approach. Evaluators who insist on keeping their distance miss the unique insights staff often have about their programs. Disdaining qualitative data further limits the ability to assess a program because the goals of most public and nonprofit programs are too subjective to be measured only by quantitative techniques. Finally, the insistence on critical distance, combined with an exclusive focus on program outcomes, can result in evaluations that fail to answer the questions decision makers have.

Recognition of these problems led to the development of an alternative, the approach that Michael Quinn Patton has termed utilization-focused evaluation. As Patton (2008, pp. 451–452) has explained, this approach begins with the goal of balance rather than objectivity. Where objectivity implies taking an unbiased view of a program by observing from a distance, balance recommends viewing program operation from up close as well as from afar, thus to discern important details as well as broad patterns. Achieving balance also requires qualitative as well as quantitative data because the latter are unlikely to capture all that is important about programs whose goals are subjective. A balanced assessment necessitates multiple perspectives.

The balanced approach also rejects outcome assessment—“did the program work?”—as the only purpose of an evaluation. A utilization-focused evaluation seeks information for use in modifying and improving programs, too. Getting close to the program helps by putting the evaluator in contact with the program administrators who have questions about how programs should be modified as well as the authority to implement those modifications.

The balanced approach is not appropriate for every program, every evaluator, or every nonprofit executive. In getting close to a program, an evaluator can risk being “captured” by the program and, at the extreme, becoming only a “mouthpiece” for those who are vested in the program. For that reason, if there are serious questions about the quality of a program or about the competence of its staff, the nonprofit executive may prefer an evaluation performed from the critical distance of the objective scientist.

For the most part, though, nonprofit executives will find that the utilization-focused evaluation approach promises both a more balanced assessment and information more likely to be useful in program development. As a consequence, the discussion that follows assumes a utilization-focused approach to evaluation.

Who Does the Evaluation?

A first question, when planning a program evaluation, is who should conduct the evaluation. Here the principal options are (a) an internal evaluation performed by the organization's staff, (b) an external evaluation performed by outside consultants, and (c) an externally directed evaluation with extensive internal staff assistance.

An internal evaluation is possible only if the organization has one or more staff members with extensive training and experience in program evaluation. Unlike outcome assessment, full-scale program evaluation is too technical a task to attempt without that expertise. An internal evaluation also requires that the nonprofit executive give the evaluation staff essentially free rein. Since inside evaluators may face strong pressures to conform their findings to the predispositions of program staff, standing up to those pressures is possible only if the nonprofit executive has made an unequivocal commitment to an unbiased evaluation.

As a practical matter, although many nonprofit organizations prefer to conduct evaluations internally (e.g., Carman, 2007, p. 70), most appear to lack sufficient in-house expertise to produce high-quality evaluations. They are probably better advised to seek outside assistance from United Way or private-sector consulting firms, management assistance agencies for the nonprofit sector, or university faculty (often found in public administration, education, or psychology departments).

Hiring an outside consultant carries its own risks. Perhaps the greatest risk is that the external evaluators, perhaps trained in the objective scientist tradition, may resist getting close to the program and consequently conduct the evaluation with insufficient concern for the organization's needs. A preference for critical distance may blind them to the questions and insights the agency has about the program.

To minimize this risk, the nonprofit executive should discuss at length with any prospective evaluators how the evaluation should be conducted, including whether they are capable of taking a utilization-focused approach. It is also wise to negotiate a contract that specifies in detail how the nonprofit organization will be involved in the evaluation.

Perhaps the best means for conducting an evaluation is through a combination of outside consultants and internal staff. In this mode, outside consultants provide technical expertise plus some independence from internal organizational pressures while internal staff perform much of the legwork and collaborate with the consultants in developing the research design, collecting data, and interpreting findings. The idea, as documented in one study of successful evaluations (Minich et al., 2006, p. 186), is to “not expect program staff to be researchers” and, in that spirit, to “shift measurement tasks to full-time evaluators.”

There are several advantages to this approach. First, it provides the necessary technical expertise without sacrificing closeness to the program. Second, greater staff involvement should produce greater staff commitment to the findings, increasing the likelihood that findings will be utilized. Third, the evaluation can be used to train staff to serve a greater role in future evaluations. Finally, having staff implement much of the legwork could reduce the out-of-pocket costs for the outside consultants. This reduction is possible, however, only if care is taken that working with the staff does not require too much of the consultants' time.

That time commitment can be limited by creating a small advisory team to oversee the evaluation. This team should include the outside evaluators, the nonprofit executive (or the executive's representative), and one to three other staff members from the nonprofit organization. The team should serve as the central entity to which the evaluator reports, reducing the time necessary for working with program staff. Keeping its size small (in the range of three to five members) facilitates the team's ability to provide clear and prompt feedback to the evaluation process. A team of this kind is probably desirable for wholly internal or external evaluations, too.

The only way to assure that the chief executive's concerns about the program are addressed is for that executive to be personally involved in the evaluation, optimally as a member of the evaluation advisory team. In addition, as the literature on organizational change attests (see, for example, Fernandez and Rainey, 2006), programmatic change is unlikely to occur through an evaluation unless the chief executive is involved and committed to the process.

The goal of this involvement should not be to obtain the “right” answers—answers that conform to the executive's predispositions—but to ensure that the right questions (the questions crucial to the program's future) are asked. The chief executive should emphasize this distinction to the evaluator(s) up front, and then monitor to be sure the distinction is observed as the evaluation proceeds.

Determining the Purpose of the Evaluation

The first task of an evaluation is to define its purpose. That is, what sort of information is desired and why? How will the information be used? Answers to these questions will be crucial in determining the other elements of the evaluation.

Discussion of evaluation purposes typically begins with a dichotomy between summative and formative purposes (see Rossi, Lipsey, and Freeman, 2004, pp. 34–36). A summative purpose implies a principal interest in program outcomes, in “summing up” a program's overall achievements. A formative purpose, by contrast, means that the principal interest is in forming or “re-forming” the program by focusing the evaluation on how well the program's internal operations function. In reality, though, the purposes of evaluations are much more complex than a dichotomy can convey. Saying an evaluation has a formative purpose, for example, does not indicate which of the program's internal mechanisms are of interest.

An evaluation's purpose should reflect the concerns key stakeholders have about the program. The process of defining this purpose thus should begin with the nonprofit organization's executive: What questions does he or she have about how the program is working? What kinds of information might speak to anticipated decisions about the program? Opinions of other stakeholders, including funders, may also be solicited.

In the end, a number of purposes is possible, depending on the perceptions of stakeholders and the specific program. An evaluation performed primarily for funders, who may be most interested in whether the program is having the desired impact, is likely to have a summative purpose. By contrast, a program that has only recently been implemented may be a good candidate for an implementation assessment—an evaluation of how well a program has been put into operation—but a poor candidate for a summative evaluation because the program has not been operating in the field long enough to expect an observable impact. Evaluations designed mainly for program staff are likely to have principally formative purposes to help staff modify and strengthen the program.

Because this purpose will guide decisions at all subsequent steps in the evaluation, a mistake at this stage can hamper the entire effort. The nonprofit executive should consequently review this purpose and make certain it reflects his or her concerns as well as the concerns of other key stakeholders. It is also true, though, that an evaluation's purpose may become clearer as the evaluation progresses. Stakeholders may be able to articulate their questions about programs only as they consider program goals and measures. Evaluators should be open to this possibility.

Evaluators and nonprofit executives must also be alert to the possibility of so-called covert purposes, unvoiced hidden purposes for an evaluation (Weiss, 1972, pp. 11–12). Program managers, for example, sometimes have an unspoken goal of “whitewashing” a program by producing a favorable evaluation. The responsible chief executive will reject such an evaluation as unethical as well as incapable of producing useful information.

It is at this stage that the evaluator and the organization's chief executive should also consider whether the evaluation is worth doing. Revelation of a dominant covert purpose would provide one reason to bow out. Or, it may be impossible to complete an evaluation in time to inform an approaching decision about the program. The resources necessary for a program evaluation are difficult to justify unless the results can be meaningful and useful.

Outcome Evaluation Designs

Most program evaluations will be concerned to some extent with assessing program impact—whether or to what extent a program has produced the desired outcomes. To achieve that end, evaluators can employ a number of outcome evaluation designs. Nonprofit executives and staff usually will neither need nor desire to become experts on these designs. However, to participate intelligently in the evaluation process, they need to understand at least their basic structure and underlying principles. This section will explain those principles and then briefly survey the most important of the designs. (For a more detailed discussion of the designs, see Rossi, Lipsey, and Freeman, 2004, Chapters 8–10.)

Causality

The goal of any outcome evaluation design is to assess causality—whether a program has caused the desired changes. To do so, the evaluation design must satisfy three conditions:

  1. Covariation: Changes in the program must covary with changes in the outcome(s). Changes in outcome measures should occur in tandem with changes in program effort.
  2. Time order: Since cause must come before effect, changes in the program must precede changes in the outcome measures.
  3. Nonspuriousness: The evaluator must be able to rule out alternative explanations of the relationship between the program and outcome. The evaluator must demonstrate that the relationship is not spurious, that it is not the result of a joint relationship between the program, the outcome, and some third variable.

An evaluation design has internal validity to the extent that it satisfies these three conditions. Internal validity, in other words, refers to how accurately the design describes what the program actually achieved or caused.

Evaluation designs can also be judged for their external validity: the extent to which findings can be generalized to contexts beyond that of the program being evaluated. Ordinarily, nonprofit organizations will have little or no concern for external validity; nonprofit executives usually will be interested only in how their own program works, not with how it might work elsewhere. External validity becomes a major concern only if, for example, a program is being run as a pilot to test its value for possible broader implementation. Even then, internal validity must still take first priority. We must be sure that findings are accurate before considering how they might be generalized.

Threats to Internal Validity

The difficulties of satisfying the three conditions for causality can be illustrated relative to three so-called pre-experimental designs, designs that are frequently but often carelessly used in program evaluations:

  1. One-shot case study: X  O1
  2. Posttest only with comparison group: X  O1 / O2
  3. One-group pretest/posttest: O1  X  O2

In each case, X refers to treatment, O1 to a first observation, and O2 to a second observation (on the comparison group in item 2, on the experimental group in item 3).

The one-shot case study satisfies none of the conditions of causality. As the most rudimentary design, it provides no mechanism for showing whether outcomes and program covary, much less for demonstrating either time order or nonspuriousness.

The posttest only with comparison group design can establish covariation since the comparison of a program group to a nonprogram group will show whether outcomes and program covary. However, this design can tell us nothing about time order; we cannot tell whether any outcome differences occurred after the program's inception or were already in place beforehand.

The one-group pretest/posttest design can satisfy the first two conditions for causality since taking observations before and after a program's inception tests for covariation and time order. The weakness of the design—and it is a glaring weakness—lies in its inability to establish nonspuriousness.

Take, for purposes of illustration, a rehabilitation program for substance abusers as evaluated by the one-group pretest/posttest design. This design can establish covariation, whether substance abuse decreases with program involvement, and it can establish time order, since substance abuse is measured both before and after the program intervention. But it does not control for such threats to nonspuriousness as the following:

  1. Maturation: Decreased substance abuse could have resulted from the maturing of participants during the time of the program, a maturation not caused by the program.
  2. Regression: Extreme scores tend to “regress toward the mean” rather than become more extreme. If program participants were selected on the basis of their extreme scores (that is, high levels of substance abuse), decreased abuse could be a function of irrelevant statistical regression rather than a program effect.
  3. History: Events concurrent with but unrelated to the program can affect program outcomes. Perhaps a rise in the street price of illegal drugs produced a decline in substance abuse, which could mistakenly be attributed to the program.

These flaws make the pre-experimental designs undesirable as the principal design for most evaluations. Stronger designs are necessary to provide reasonable tests of the conditions of causality.

Experiments

Experimental designs offer the strongest internal validity. The classic experimental design takes this form:

  R   O1   X   O2
  R   O3        O4

R refers to randomization, meaning that subjects are assigned by chance—for example, by lot or by drawing numbers from a hat—to the experimental or control group in advance of the experiment. O1 and O3 are preprogram observations on the two groups, X is the program treatment, and O2 and O4 are postprogram observations.

Randomization is a crucial defining element of experimental designs. With the inter-group and across-time components of this design testing for covariation and time order, randomization establishes the final condition of causality, nonspuriousness, by making the experimental and control groups essentially equivalent. As a consequence of that equivalence, the control group provides a test of “change across time”—the changes due to maturation, regression, history, and so forth, which could affect program outcomes. Comparing the experimental and control groups can thus separate program effects from other changes across time, as this simple subtraction illustrates:

  (O2 − O1) = program effect + other change across time
  (O4 − O3) = other change across time
  (O2 − O1) − (O4 − O3) = program effect

Unfortunately, many practical problems work against the use of experimental outcome designs in evaluations. In particular, randomization poses a number of difficulties. First, it must be done prior to the beginning of an intervention; participants must be randomly assigned before they receive treatment. Second, ethical objections may be raised to depriving some subjects of a treatment that other subjects receive, or political objections may be raised to providing treatment on anything other than a “first come, first served” basis. Experiments can also be costly, given the need to establish, maintain, and monitor distinct experimental and control groups. And since many programs are still changing as they begin operation, it sometimes proves impossible to maintain the same program structure throughout the length of the experiment, as an experiment requires.

But the possibility of conducting an experiment should not be dismissed too quickly. The need for prior planning can sometimes be surmounted by running an experiment not on the first cohort group of subjects, but on a second or later cohort group, such as a second treatment group of substance abusers. Ethical and political objections often can be overcome by giving the control group a traditional treatment rather than no treatment. That choice may make more sense for the purpose of the evaluation anyway, since the ultimate choice is likely to be between the new treatment and the old, not between the new treatment and no treatment.
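
The logic of the experimental comparison can also be sketched in code. The example below randomly assigns hypothetical participants to experimental and control groups and then applies the subtraction shown earlier to simulated pre and post scores; the numbers, including the built-in program effect, are invented solely to illustrate the calculation.

```python
import random
import statistics

random.seed(42)

# Hypothetical participants with a preprogram outcome score
# (for example, months employed in the past year).
participants = [{"id": i, "pre": random.gauss(6, 2)} for i in range(200)]

# Randomization: assign each participant by chance to the experimental or control group.
random.shuffle(participants)
experimental, control = participants[:100], participants[100:]

# Simulated post scores: everyone changes a little over time; only the
# experimental group also receives a (hypothetical) program effect of +1.5.
for p in experimental:
    p["post"] = p["pre"] + random.gauss(0.5, 1) + 1.5
for p in control:
    p["post"] = p["pre"] + random.gauss(0.5, 1)

# (O2 - O1): change in the experimental group = program effect + other change across time
# (O4 - O3): change in the control group      = other change across time
exp_change = statistics.mean(p["post"] - p["pre"] for p in experimental)
ctl_change = statistics.mean(p["post"] - p["pre"] for p in control)

print(f"Estimated program effect: {exp_change - ctl_change:.2f}")
```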

Quasi-Experiments

If an experimental design cannot be used, the evaluator should consider one of the so-called quasi-experimental designs. These designs are so named because they attempt, through a variety of means, to approximate the controls that experiments achieve through randomization. The strongest of these designs come close to achieving the rigor of an experiment.

A first quasi-experimental design is the nonequivalent control group:

  O1   X   O2
  O3        O4

Here, in lieu of randomization, a comparison group is matched to the experimental group in the hope that the pre-post comparison of the two groups will furnish an indication of program impact.

This design is as strong—or weak—as the quality of the match. The goal of matching is to create a comparison group that is as similar as possible to the experimental group, except that it does not participate in the program. A good match can be difficult to achieve because the available comparison groups often differ in crucial respects from the experimental group.

Consider a hypothetical job-training program for the unemployed that takes participants on a first come, first served basis. The obvious candidates for a comparison group are would-be participants who volunteer after the program has filled all of the available slots. The evaluator might select from those late volunteers a group similar to the experimental group in terms of race, sex, education, previous employment history, and the like—similar, in other words, on the extraneous variables that could affect the desired outcome of employment success.

The difficulty arises in trying to match on all of the key variables at once. Creating a comparison group similar to the experimental group on two of those variables—say, race and gender—may be possible, but the two groups are unlikely then also to have equivalent education levels, employment histories, and other characteristics. In addition, the two groups may differ on some unrecorded or intangible variable. Perhaps the early volunteers were more motivated than the late volunteers, which would account for why they volunteered sooner. If that difference were not measured and incorporated in the analysis, the program could erroneously be credited for employment gains that actually stemmed from the differences in motivation. In cases such as this, no match is preferable to a bad match.
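For illustration, the sketch below performs a crude nearest-neighbor match on two measured covariates using invented data. It is a minimal sketch of the matching idea, not a recommended procedure, and it displays the limitation just discussed: an unmeasured difference such as motivation never enters the calculation.

```python
# A simplified nearest-neighbor matching sketch (all data invented). Each person is
# described by covariates that could affect employment success; late volunteers are
# candidate comparison cases for the early-volunteer participants.
from math import dist

participants = {               # early volunteers (in the program)
    "A": (12, 3.0),            # (years of education, years of prior employment)
    "B": (10, 1.5),
}
late_volunteers = {            # potential comparison group members
    "X": (12, 2.5),
    "Y": (9, 1.0),
    "Z": (16, 8.0),
}

comparison_group = {}
for pid, covs in participants.items():
    # Pick the closest unmatched late volunteer on the measured covariates only.
    match = min((v for v in late_volunteers if v not in comparison_group.values()),
                key=lambda v: dist(covs, late_volunteers[v]))
    comparison_group[pid] = match

print(comparison_group)   # e.g. {'A': 'X', 'B': 'Y'} -- matched on measured covariates only
```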

A second kind of quasi-experimental design is the interrupted time series design, diagramed as follows:

O1    O2    O3    X    O4    O5    O6

The defining elements of this design are three or more observations recorded both before and after the program intervention. The multiple observations are important because they provide a reading on trends, thereby supplying much of the control for changes over time (maturation, regression, and so on) that experimental designs achieve through randomization. Those controls give this design relatively good internal validity.

With respect to internal validity, history is the principal weakness of this design. There is no control for any event that, because it occurs at the same time as the program, could affect the outcomes and be mistaken for program impact. A program to improve the situation of the homeless could be affected, for example, by an economic upturn (or downturn) that began at about the same time as the program.

Obtaining the necessary multiple observations can also prove difficult. On the front end, preprogram observations may be unavailable if measurement of key outcome indicators began only when the program itself began. On the back end, stakeholders may demand evidence of program impact before several post-program observations can be obtained.
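Where the multiple observations can be obtained, one common way to analyze an interrupted time series is a segmented regression that estimates the shift in level at the intervention point, net of the preexisting trend. The sketch below is a minimal version using invented monthly counts; note that it does nothing to rule out the threat of history just discussed.

```python
import numpy as np

# Hypothetical monthly outcome counts: six observations before and six after the program.
y = np.array([50, 52, 51, 53, 54, 55,    # pre-program trend
              60, 61, 63, 62, 64, 65],   # post-program observations
             dtype=float)
n = len(y)
time = np.arange(n, dtype=float)          # overall trend across time
post = (time >= 6).astype(float)          # 1 after the intervention, 0 before

# Segmented regression: outcome = a + b*time + c*post. The coefficient on `post`
# estimates the shift in level at the intervention, net of the preexisting trend.
X = np.column_stack([np.ones(n), time, post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"Pre-program trend per month: {coef[1]:.2f}")
print(f"Estimated level shift at intervention: {coef[2]:.2f}")
```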

One of the strongest of the quasi-experimental designs is the multiple interrupted time series:

O1    O2    O3    X    O4    O5    O6
O1    O2    O3          O4    O5    O6

The strength of this design results from combining the key features of the interrupted time series and nonequivalent control group designs. The time series dimension controls for most changes across time; the nonequivalent control group dimension controls for the threat of history.

The problems with this design derive from the possible weaknesses of its component parts. A bad match can provide a misleading comparison; a lack of longitudinal data can rule out use of this design altogether.
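To see how the comparison series helps control for history, the level shift can be estimated in both series and the two estimates differenced. The sketch below reuses the segmented-regression idea from above with invented counts for a program site and a matched comparison site; everything here is hypothetical.

```python
import numpy as np

def level_shift(y, cut):
    """Estimated shift in level at the intervention point, net of the linear trend."""
    t = np.arange(len(y), dtype=float)
    post = (t >= cut).astype(float)
    X = np.column_stack([np.ones(len(y)), t, post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[2]

# Invented series: program site and a matched comparison site exposed to the same "history".
program    = np.array([50, 52, 51, 53, 60, 61, 63, 62], dtype=float)
comparison = np.array([48, 49, 50, 51, 53, 54, 54, 55], dtype=float)

# Any shift shared by both series is attributed to history, not the program.
effect = level_shift(program, cut=4) - level_shift(comparison, cut=4)
print(f"Program effect net of shared history: {effect:.2f}")
```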

Other Designs and Controls

The realities of many programs preclude the use of either experimental or quasi-experimental designs. Perhaps no one planned for an evaluation until the program was well under way, thereby ruling out randomization and providing no preprogram observations. Finding a comparison group may also prove too difficult or too costly. Under these conditions, the evaluator may be forced to rely on one or more of the pre-experimental designs as the principal outcome evaluation design, leaving the evaluation susceptible to many threats to internal validity.

Fortunately, means are available to compensate for, if not eliminate, these design weaknesses. A first possibility is to use statistical controls. If their numbers and variability are sufficient, the subjects of a program can be divided for comparison and control. For example, the subjects in a one-group pretest/posttest design might be subdivided into those receiving a little of the program (x) and those receiving a lot (X). The resulting design becomes more like the stronger nonequivalent control group design:

O1    X    O2
O3    x    O4

There remains the question of whether the two groups are comparable in all respects other than the varying program involvement. If that comparability can be established, the design can provide a reading on whether more program involvement produces more impact, substituting for the unavailable comparison of program versus no program.

The option to strengthen designs through statistical controls can also be useful with quasi-experimental and experimental designs. When a nonequivalent control group design is used, the evaluator may want to subdivide and compare subjects on variables on which the matching was flawed. If the two groups were matched on race and gender but not on education, the experimental and control groups might be compared while statistically controlling for education. Or, where a time series design is employed, additional data might be sought to control for threats of history. In a study of how the 55-mile-per-hour speed limit affected traffic fatalities, researchers examined data on total miles traveled to test an alternative explanation that fatalities declined as a consequence of reduced travel (amid the 1974–1975 energy crisis), not as a consequence of reduced speed (Meier and Morgan, 1981, pp. 670–671). The data added to the evidence that reduced speed was the cause.
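To make the idea of statistical control concrete, the sketch below regresses an invented outcome gain on program involvement while holding education constant. All values are hypothetical, and a real analysis would examine many more cases and covariates.

```python
import numpy as np

# Hypothetical records (all numbers invented): outcome gain, program involvement
# (0 = a little, 1 = a lot), and years of education as a covariate the groups may differ on.
gain      = np.array([4.0, 5.5, 3.0, 8.0, 9.0, 7.5, 6.0, 10.0])
high_dose = np.array([0,   0,   0,   1,   1,   1,   0,   1], dtype=float)
education = np.array([12,  14,  10,  12,  16,  14,  11,  15], dtype=float)

# Regress gain on involvement while statistically controlling for education.
X = np.column_stack([np.ones(len(gain)), high_dose, education])
coef, *_ = np.linalg.lstsq(X, gain, rcond=None)
print(f"Gain associated with greater involvement, holding education constant: {coef[1]:.2f}")
```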

Combining several outcome evaluation designs can also add to the strength of the overall design. Many evaluations will employ multiple designs, each for a different measure. Stronger designs on some measures might then help to compensate for the weaker designs necessary for other measures.

Assuming an outside evaluator is involved, these design decisions will be made principally by that individual. Still, to the extent that executives and staff understand these basic principles of evaluation design, they will be able to advise evaluators on these decisions. The nonprofit executive can perform an even more important role by monitoring the design planning to assure its fit to the purposes of the evaluation. The most rigorous design will be of no use unless it speaks to the issues of concern to the organization's board, executive and/or stakeholders. It is up to the executive to ensure that the evaluation remains relevant and appropriate to the organization's needs.

Process Evaluation

With most program evaluations, nonprofit executives will want to evaluate the program process as well as its ultimate impact. Outcome evaluation designs usually indicate only whether a program is working, not why. Process evaluation may be able to discern which steps in a program's process are not working as intended, perhaps pointing to how the program can be changed to increase its effectiveness. Such suggestions for change often prove the most useful products of an evaluation.

The techniques of process evaluation are both simpler and less systematic than those for outcome evaluations (see also Thomas, 1980). In essence, process evaluation entails examining the internal workings of a program—as represented largely through activity goals—both for their functioning and for their role in producing the desired outcomes. It usually begins with the development of a good logic model (see Savaya and Waysman, 2005), then progresses to an examination of specific parts of that model.
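A logic model need not be elaborate to be useful. The sketch below records one hypothetical program's inputs, activities, outputs, and outcomes in a plain data structure that staff can review and amend; every entry is invented for illustration.

```python
# A hypothetical job-training program expressed as a simple logic model (entries invented).
logic_model = {
    "inputs":     ["program staff", "training facility", "grant funding"],
    "activities": ["recruit unemployed participants", "deliver 12-week skills course",
                   "provide job-placement counseling"],
    "outputs":    ["number of participants enrolled", "number completing the course"],
    "outcomes":   ["participants employed six months after completion",
                   "increase in participants' earnings"],
}

# A process evaluation can walk the model and ask, for each link,
# whether the preceding step is actually occurring as intended.
for stage, items in logic_model.items():
    print(f"{stage.upper()}:")
    for item in items:
        print(f"  - {item}")
```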

The executives and staff of nonprofit organizations should be key actors in any process evaluation. To begin with, they should attempt to define at the outset the specific questions they have about the program's process. Conceivably, they may already feel adequately informed about performance as it pertains to some activity goals, and so may not desire new information there. They will then want to be certain that the evaluation includes the questions they do have about program process.

The basics of a process evaluation can be illustrated by the case of an affirmative action program designed to increase the hiring of minority firefighters by a municipal government. The activity goals of interest in this evaluation included the following:

  1. Increase the number of minority applicants.
  2. Increase the success rate of minority applicants on the written examination.
  3. Increase the success rate of minority applicants on the physical examination.

These activity goals are designed to lead to this outcome goal (among others):

  1.  Increase the proportion of minority firefighters in the fire department.

These activity goals illustrate how a process evaluation can be useful. Data on the various activities could indicate where, if at all, the program might be failing. Are too few minorities applying? Or are minorities applying only to be eliminated disproportionately by the written or physical examinations? Answering these questions could help a program administrator decide whether, how, and where to change the program.
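That kind of diagnosis can be as simple as computing pass rates at each stage of the hiring pipeline and comparing them across groups, as the sketch below does with invented applicant counts.

```python
# Hypothetical applicant counts at each stage of the hiring process (numbers invented).
stages = ["applied", "passed written exam", "passed physical exam", "hired"]
minority    = [120, 60, 45, 20]
nonminority = [400, 280, 240, 110]

print(f"{'Stage':<22}{'Minority rate':>15}{'Nonminority rate':>18}")
for i in range(1, len(stages)):
    m_rate = minority[i] / minority[i - 1]
    n_rate = nonminority[i] / nonminority[i - 1]
    print(f"{stages[i]:<22}{m_rate:>14.0%}{n_rate:>17.0%}")
# A stage where the minority pass rate falls well below the nonminority rate is a
# candidate point at which to change the program.
```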

A good process evaluation often can help to compensate for weaknesses in the outcome evaluation design. When the difficulty of controlling for all threats to internal validity leaves unanswered questions about the linkage of program to outcomes, the process evaluation can provide an additional test of that linkage by documenting whether program activities occurred in a manner consistent with the observed outcomes. If the impact evaluation shows significant gains on the outcome measures and the process evaluation shows high levels of program activities, the evaluator can argue more convincingly that the program caused the impact. By contrast, evidence of low activity levels in the same scenario would cast doubt on the possibility that the program is responsible for the outcome gains.

Most program evaluations should contain some form of process evaluation. Though less systematic than outcome designs, process evaluation techniques will often provide the more useful information for nonprofit executives.

Data Development, Report Writing, and Follow-Up

Nonprofit executives should plan to involve themselves and the program staff extensively in analysis and review of evaluation findings. This involvement is necessary first for accuracy: staff review of data and reports minimizes the risk of outside evaluators reporting inaccurate conclusions. Staff members also are more likely to utilize findings and implement recommendations that they helped to develop.

When outside evaluators are used, the best approach to this involvement may be to ask for the opportunity to review and comment on interpretations and reports while still allowing the evaluators to retain final authority over the substance of reports. Most evaluators should welcome this arrangement for self-protection; no evaluator wants to go public with conclusions that are subsequently shown to be erroneous. Staff might also be involved in basic data interpretation by, for example, meeting with evaluators to review data printouts. As suggested earlier, interest among staff might be heightened by asking them to predict some of the results before the findings are in (Poister and Thomas, 2007).

The chief executive must also decide what final written products to request. A comprehensive evaluation report is usually desirable, both for the historical record and as a reference in case questions arise, along with a brief executive summary of one to three pages for broader distribution and readership. Other reports may be desirable for particular types of clients.

The job of the outside evaluator customarily concludes at this point, but the agency's chief executive and program staff should consider if and how the program should be changed in light of the evaluation. A program evaluation can provide both a direction and an impetus for change, but often with a limited window of opportunity to achieve any change. The agency's chief executive should take advantage of that window by discussing the evaluation with staff and, where appropriate, developing plans for what changes to make and how. Since the evaluation data presumably came from the agency's outcome assessment system, this is also a good time to consider any need to change that system. Only through such efforts can a nonprofit agency gain the full value of a program evaluation.

Summary

Nonprofit agencies today confront increasingly strong demands to demonstrate that their programs work. To meet these demands, contemporary nonprofit agencies must engage in systematic outcome assessment, measuring and monitoring the performance of their programs.

Outcome assessment data can speak to important questions of whether progress is being made on key agency objectives. As a result, every nonprofit agency, if it has not already done so, should consider if and how it can develop, collect, and analyze these data on a continuing basis.

Outcome assessment data alone cannot speak to issues of causality, that is, to whether any observed changes resulted from a specific agency program or programs. Agency executives who wish to investigate those kinds of causal connections should consider taking a step beyond outcome assessment to conduct a program evaluation as well. Program evaluations build from a foundation of strong outcome assessment, adding the techniques of comparison and control necessary to speak to the role of specific programs in producing desired outcomes.

Success in these efforts may not come easily. For one thing, nonprofit agencies often need to find additional funds to support new initiatives in either outcome assessment or program evaluation. Yet as Carman (2007, p. 71) has observed, “although funders and other stakeholders may be asking [nonprofit agencies] to report on evaluation and performance information, most are not receiving separate funds or additional grants to collect this information.” In the long term, the solution may lie in these agencies “investing in their own evaluation capacity,” as Carman (2007, p. 73) recommends, but that strategy offers no help in the near term.

Even if the necessary funding can be found, success in either outcome assessment or program evaluation also requires a delicate balance of analytic and scientific expertise with group process skills. On the analytic side, nonprofit executives and staff should acquire at least a basic expertise, which can be supplemented as necessary with the talents of skilled consultants. On the group process side, nonprofit executives must ensure that any outcome assessment planning or program evaluation includes extensive participation of the agency's stakeholders, including at least the program staff and funders. Achieving that balance can give the executives and staff of nonprofit organizations the knowledge necessary to provide better programs and services.

References

  1. Carman, J. G. Evaluation Practice Among Community-Based Organizations: Research into the Reality. American Journal of Evaluation, 2007, 28(1), 60–75.
  2. Edwards, D. J., and Thomas, J. C. Developing a Municipal Performance Measurement System: Reflections on the Atlanta Dashboard. Public Administration Review, 2005, 65(3), 369–376.
  3. Fernandez, S., and Rainey, H. G. Managing Successful Organizational Change in the Public Sector. Public Administration Review, 2006, 66(2), 168–176.
  4. Hatry, H. P. Using Agency Records. In K. E. Newcomer, H. P. Hatry, and J. S. Wholey (eds.), Handbook of Practical Program Evaluation (4th ed.). San Francisco: Jossey-Bass, 2015, 325–343.
  5. Hatry, H., and Lampkin, L. Key Steps in Outcome Management. Washington, D.C.: The Urban Institute, 2003.
  6. Herman, R. D., and Renz, D. O. Advancing Nonprofit Organizational Effectiveness Research and Theory: Nine Theses. Nonprofit Management & Leadership, 2008, 18(4), 399–415.
  7. W. K. Kellogg Foundation. Logic Model Development Guide. Battle Creek, Michigan: Author, 2004.
  8. Kaplan, R. S., and Norton, D. P. The Balanced Scorecard: Translating Strategy into Action. Boston: Harvard Business School Press, 1996.
  9. Kress, G., Springer, J. F., and Koehler, G. Policy Drift: An Evaluation of the California Business Enterprise Program. Policy Studies Journal, 1980, 8, 1101–1108.
  10. McLaughlin, J. A., and Jordan, G. B. Using Logic Models. In K. E. Newcomer, H. P. Hatry, and J. S. Wholey (eds.), Handbook of Practical Program Evaluation (4th ed.). San Francisco: Jossey-Bass, 2015, 62–87.
  11. Meier, K. J., and Morgan, D. P. Speed Kills: A Longitudinal Analysis of Traffic Fatalities and the 55 MPH Speed Limit. Policy Studies Review, 1981, 1, 157–167.
  12. Minich, L., Howe, S., Langmeyer, D., and Corcoran, K. Can Community Change Be Measured for an Outcomes-Based Initiative? A Comparative Case Study of the Success by 6 Initiative. American Journal of Community Psychology, 2006, 38, 183–190.
  13. Morley, E., Hatry, H., and Cowan, J. Making Use of Outcome Information for Improving Services: Recommendations for Nonprofit Organizations. Washington, D.C.: The Urban Institute, 2002.
  14. Newcomer, K. E., and Triplett, T. Using Surveys. In K. E. Newcomer, H. P. Hatry, and J. S. Wholey (eds.), Handbook of Practical Program Evaluation (4th ed.). San Francisco: Jossey-Bass, 2015, 344–382.
  15. Patton, M. Q. Utilization-Focused Evaluation, 4th ed. Thousand Oaks, Calif.: Sage, 2008.
  16. Poister, T. H., and Thomas, J. C. The ‘Wisdom of Crowds’: Learning from Administrators' Predictions of Citizen Perceptions. Public Administration Review, 2007, 67(3), 279–289.
  17. Rea, L. M., and Parker, R. A. Designing and Conducting Survey Research: A Comprehensive Guide (4th ed.). San Francisco: Jossey-Bass, 2014.
  18. Rossi, P. H., Lipsey, M. W., and Freeman, H. E. Evaluation: A Systematic Approach (7th ed.). Newbury Park, Calif.: Sage, 2004.
  19. Savaya, R., and Waysman, M. The Logic Model: A Tool for Incorporating Theory in Development and Evaluation of Programs. Administration in Social Work, 2005, 29(2), 85–103.
  20. Stone, M., Bigelow, B., and Crittenden, W. Research on Strategic Management in Nonprofit Organizations: Synthesis, Analysis, and Future Directions. Administration & Society, 1999, 3, 378–423.
  21. Thomas, J. C. ‘Patching Up’ Evaluation Designs: The Case for Process Evaluation. Policy Studies Journal, 1980, 8, 1145–1151.
  22. Thomas, J. C., Poister, T. H., and Ertas, N. Customer, Partner, Principal: Local Government Perspectives on State Agency Performance in Georgia. Journal of Public Administration Research and Theory, 2010, 20(4), 779–799.
  23. United Way of America. Outcome Measurement Resource Network. http://www.liveunited.org/outcomes/. Accessed December 17, 2009.
  24. Weiss, C. H. Evaluation Research: Methods of Assessing Program Effectiveness. Englewood Cliffs, N. J.: Prentice-Hall, 1972.