13.1 Introduction
13.2 DECIDE: A framework to guide evaluation
Designing useful and attractive products requires skill and creativity. As products evolve from initial ideas through conceptual design and prototypes, iterative cycles of design and evaluation help to ensure that they meet users' needs. Deciding when and how to evaluate a product requires careful consideration and may be different for different kinds of products.
The case studies in the previous chapter illustrate some of the approaches used.
The design process starts with the designers working to develop a product that meets users' requirements, but, as you have seen, understanding requirements tends to happen by a process of negotiation between designers and users. As designers understand users' needs better, their designs reflect this understanding. Similarly, as users see and experience design ideas, they are able to give better feedback that enables the designers to improve their designs further. The process is cyclical, with evaluation facilitating understanding between designers and users.
Evaluation is driven by questions about how well the design or particular aspects of it satisfy users' needs and offer appropriate user experiences. Some of these questions provide high-level goals to guide the evaluation. For example, does this product excite users so that they will buy and use it? Others are much more specific: can users find a particular menu item? Do they interpret a particular graphic as the designers intended, and do they find it attractive? Practical constraints play a big role in shaping how evaluation is done: tight schedules, low budgets, or little access to users constrain what evaluators can do. There are ethical considerations too: medical records must be kept private, for example, and so must certain areas of people's homes.
Experienced designers get to know what works and what doesn't. As you have seen in Chapter 12, there is a broad repertoire of evaluation methods that can be tailored to specific circumstances. Knowing these methods, and having the confidence to adapt them, is essential. The wide variety of mobile and ubiquitous systems coming onto the market challenges conventional evaluation practices, which must be adapted to provide useful feedback. Therefore, when planning evaluations, evaluators must consider the nature of each product, the kinds of users who will use it, and the contexts of use, as well as logistical issues such as the budget, the schedule, and the skills and equipment required for the evaluation. Planning evaluation studies involves asking questions about the process and anticipating potential problems. Within interaction design there are many books and websites that list techniques and guidelines for conducting an evaluation, but there is very little overarching guidance on how to plan one. To help you, we propose the DECIDE framework, which provides a structure for planning evaluation studies.
The main aims of this chapter are to:
Well-planned evaluations are driven by goals that seek answers to clear questions. These questions may be stated explicitly upfront, as in usability testing, or may emerge as the evaluation progresses, as in ethnographic evaluation. The way questions are stated also varies depending on the stage of the design process at which the evaluation occurs. Questions help to determine the kind of evaluation approach that is adopted and the methods used. Practical issues, such as the amount of time available to carry out the evaluation, the availability of participants, and suitable equipment, also impact these decisions. Ethical issues must be considered too, particularly when working with users, and evaluators must have enough time and expertise to evaluate, analyze, interpret, and present the data that they collect. The DECIDE framework provides a checklist to help you plan your evaluation studies and to remind you about the issues you need to think about. It has the following six items:
A list tends to suggest an order in which things should be done. When working with the DECIDE framework, however, it is common to think about and deal with the items iteratively, moving backwards and forwards between them: each item in the framework is related to the others, so it would be unusual to work through them strictly in sequence.
What are the high-level goals of the evaluation? Who wants it and why? An evaluation to help clarify that user needs have been met in an early design sketch has different goals from an evaluation (i) to select the best representation of a metaphor for a conceptual design, or (ii) to fine-tune an interface, or (iii) to examine how mobile technology changes working practices, or (iv) to inform how the next version of a product should be changed, or (v) to explore the impact of ambient technology in a social space, or (vi) to investigate what makes collaborative computer games engaging.
Goals guide the evaluation by helping to determine its scope, so identifying what these goals are is the first step in planning an evaluation. For example, we can restate the first general goal statement mentioned above more clearly as:
Check that the sketch indicates that designers have understood the users' needs.
Rewrite each of the general goal descriptions above (i–vi) as a goal statement and suggest which case study or part of a case study from Chapter 12 fits the goal statement.
Comment
In turn, these goals influence the approach chosen to guide the study and the selection of evaluation methods.
In order to make goals operational, we must clearly articulate the questions to be answered by the evaluation study. For example, the goal of finding out why some customers prefer to purchase paper airline tickets over the counter rather than e-tickets can be broken down into a number of relevant questions for investigation. What are customers' attitudes to e-tickets? Perhaps they don't trust the system and are not sure that they will actually get on the flight without a ticket in their hand. Do customers have adequate access to computers to make bookings? Are they concerned about security? Does the electronic system have a bad reputation? Is the user interface to the ticketing system so poor that they can't use it? Maybe some people can't manage to complete the transaction. Maybe some people like the social interaction with a ticketing agent.
Questions can be broken down into very specific subquestions to make the evaluation even more fine-grained. For example, what does it mean to ask, “Is the user interface poor?” Is the system difficult to navigate? Is the terminology confusing because it is inconsistent? Is the response time too slow? Is the feedback confusing or maybe insufficient? Subquestions can, in turn, be further decomposed if even more specific issues need to be addressed.
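The decomposition above is only a planning device, but recording it in a structured form can help to keep track of what the study must answer. Here is a minimal sketch, assuming Python; the `evaluation_plan` structure and `all_questions` helper are hypothetical names built from the e-ticket example:

```python
# A hypothetical nested structure for recording the goal -> question ->
# subquestion decomposition while planning a study (e-ticket example).
evaluation_plan = {
    "goal": "Find out why some customers prefer paper tickets over e-tickets",
    "questions": {
        "Is the user interface to the ticketing system poor?": [
            "Is the system difficult to navigate?",
            "Is the terminology confusing because it is inconsistent?",
            "Is the response time too slow?",
            "Is the feedback confusing or insufficient?",
        ],
        "Do customers have adequate access to computers?": [],
        "Are customers concerned about security?": [],
    },
}

def all_questions(plan):
    """Flatten the plan into a single checklist of questions to investigate."""
    for question, subquestions in plan["questions"].items():
        yield question
        yield from subquestions
```

Keeping the decomposition explicit like this makes it easy to check later, when choosing methods, that every question is actually answered by some source of data.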
Activity 13.2
Imagine you have been asked to evaluate the impact of the HelloWall on users' behavior. Based on what you know about the HelloWall from Chapter 12, write two or three questions that you could investigate.
Comment
You could ask a variety of questions. Here are some that we thought of:
Having identified the goals and some questions that you want to investigate, the next step is to choose the evaluation approach and the methods that you will use. As mentioned in Chapter 12, the evaluation approach influences the kinds of methods that are used. For example, in an analytical evaluation, methods that directly involve users will not be used; in usability testing, ethnography will not be used. Different approaches are often combined in this way, depending on the questions to be answered and the resources available for performing the study. Practical and ethical issues (discussed next) have to be considered and trade-offs made. For example, the methods that seem most appropriate may be too expensive, or may take too long, or may require equipment or expertise that is not available, so compromises are needed.
As you saw in several of the case studies discussed in Chapter 12, combinations of approaches and methods are often used to obtain different perspectives. For example, the methods used in field studies tend to involve observation, interviews, or informal discussions with participants. Questionnaires may also be used and so might diary studies in which participants record aspects of their technology usage. Usability testing also involves multiple methods and, as we have already said, is often supplemented with field studies. Each type of data tells the story from a different point of view. Together these perspectives give a broad picture of how well the design meets the usability and user experience goals that were identified during requirements gathering.
Activity 13.3
Which approaches and methods could be used in an evaluation to answer the questions that we provided for Activity 13.2? These were:
Comment
A field study is most appropriate because we want to investigate how people react to this new technology placed in their natural environment. Observation could be used to answer both questions and their subquestions, either by video or by a person taking notes. Questionnaires and interviews could also be designed to collect data to answer these questions.
There are many practical issues to consider when doing any kind of evaluation, and it is important to identify as many of them as possible before starting the study. However, even experienced evaluators encounter surprise events, which is why it is useful to do a pilot study (discussed in Chapter 7). Some issues that should be considered include access to appropriate users, facilities, and equipment, whether schedules and budgets are realistic, and whether the evaluators have the appropriate expertise to conduct the study. Depending on the availability of resources, there may have to be compromises that involve adapting or substituting methods. For example, evaluators may wish to perform usability tests with 20 users and then run a three-week-long field study, but the budget available may only cover the cost of five testers and a shorter field study. Another example is provided by the Nokia cell phone study, which involved evaluating cell phones in a country where the evaluators did not speak the language fluently and were only slightly aware of cultural norms. In this situation the evaluators had to work out how to collect the data that they needed to answer their evaluation questions. Furthermore, cell phone users are highly mobile, so the evaluators knew that there would be places where the cell phones would be used that they could not go, e.g. bathrooms and bedrooms. During the study the evaluators may also have experienced surprise events that required them to take decisions there and then. For example, it may not have been possible to ride in a taxi or car with the user because there was not enough room. Of course, no evaluation is going to be perfect, and a good field study can be done without the evaluator seeing how the product is used 100% of the time, but it is helpful to be aware of the kinds of compromises that may be necessary.
When planning a study, thinking about the kind of users who will be involved, about logistical issues such as the availability of equipment, the schedule, and the budget, and about the kind of expertise needed to perform the study will help to ensure its success.
It goes without saying that a key aspect of an evaluation is involving appropriate users or, in the case of analytical evaluation, focusing on the characteristics of the anticipated user population. When doing usability testing, for example, users must be found who represent the user population for which the product is targeted. This generally involves identifying users with a particular level of experience, e.g. novices or experts, or users with a range of expertise. The number of males and females within a particular age range, cultural diversity, educational experience, and personality differences may also need to be taken into account, depending on the kind of product being evaluated. Questionnaire surveys require large numbers of participants, so ways of identifying and reaching a representative sample of participants are needed. For field studies to be successful, the evaluator needs access to users who will interact with the technology in their natural setting.
Another issue to consider is how the users will be involved. The tasks used in a usability laboratory study should be representative of those for which the product is designed. However, there are no written rules about the length of time a user should be expected to spend on an evaluation task. Ten minutes is too short for most tasks and two hours is a long time, so what is reasonable? Task times will vary according to the type of evaluation, but when tasks go on for more than 20 minutes, consider offering breaks. It is accepted that people using desktop computers should stop, move around, and change their position regularly after every 20 minutes spent at the keyboard to avoid repetitive strain injury. Evaluators also need to put users at ease so that they are not anxious and will perform normally; it is important to treat them courteously, and participants should not be made to feel uncomfortable when they make mistakes. Greeting users, explaining that it is the product being tested and not them, and planning an activity to familiarize them with the product before starting the task all help to put users at ease in test situations.
In field studies the onus is on the evaluators to fit in with the users and to cause as little disturbance to participants and their activities as possible. This requires practice, and even anthropologists who are trained in ethnographic methods may cause unforeseen changes (see Dilemma box below).
Dilemma: Is it Possible to Study People's Behavior without Influencing it?
A newspaper article describes how an anthropology student traveling through northern Kenya happens by chance upon an unknown tribe. He studies their rituals and reports the study in his PhD dissertation and several articles published in acclaimed journals. The study draws considerable attention because finding an unknown tribe is unusual in this day and age. It is the dream of many anthropologists, because it allows them to study a tribe's customs before they are changed by outside influences. Of course, having published his work, the inevitable happens: more anthropologists make their way to the village, and soon members of the tribe are drinking Coke and wearing T-shirts from prestigious universities and well-known tourist destinations. The Western habits of these outsiders gradually change the tribe's behavior.
Ethnographers face a dilemma: is it possible to study people's behavior without changing it in the process?
There are many practical issues concerned with using equipment in an evaluation. For example, when using video you need to think about how you will do the recording: how many cameras and where do you put them? Some people are disturbed by having a camera pointed at them and will not perform normally, so how can you avoid making them feel uncomfortable? How will you record data about use of a mobile device when the users move rapidly from one environment to another? Several of the case studies in Chapter 12 addressed these issues. Think back, or reread the Nokia cell phone study, the Indian auxiliary midwife data collection study, and HutchWorld.
Activity 13.4
The evaluators of the Nokia cell phones described some of the logistics that they needed to consider; what were they?
Comment
The evaluators did not speak Japanese, the language of the users, and they knew that people using cell phones can be fast-moving as they go about their busy lives. Some of the things that the evaluators suggest may be necessary when conducting such a study include: spare batteries for recording devices; change and extra money for taxis or unforeseen expenses; additional clothes in case the weather suddenly changes, e.g. a rain jacket; medications; and snacks in case they don't have an opportunity to buy meals.
Time and budget constraints are important considerations to keep in mind. It might seem ideal to have 20 users test your interface, but if you need to pay them, then it could get costly. Planning evaluations that can be completed on schedule is also important, particularly in commercial settings. However, as you have seen in the interview with Sara Bly, there is rarely enough time to do the kind of studies that you would ideally like, so you have to compromise and plan to do the best job possible with the resources and time available.
Different evaluation methods require different expertise. For example, running user tests requires knowledge of experimental design and video recording. Does the evaluation team have the expertise needed to do the evaluation? If you need to analyze your results using statistical measures and you are unsure of how to use them, then consult a statistician before starting the evaluation and then again during data collection and analysis, if needed.
The Association for Computing Machinery (ACM) and many other professional organizations provide ethical codes (Box 13.1) that they expect their members to uphold, particularly if their activities involve other human beings (ACM, 1992). All data gathering requires you to consider ethical issues (see Chapter 7), but this is particularly important for evaluation because the participants are often put into unfamiliar situations. People's privacy should be protected, which means that their name should not be associated with data collected about them or disclosed in written reports (unless they give explicit permission). Personal records containing details about health, employment, education, financial status, and where participants live should be confidential. Similarly, it should not be possible to identify individuals from comments written in reports. For example, if a focus group involves nine men and one woman, the pronoun ‘she’ should not be used in the report because it will be obvious to whom it refers.
The ACM code outlines many ethical issues that professionals are likely to face (ACM, 1992). Section 1 outlines fundamental ethical considerations, while section 2 addresses additional, more specific considerations of professional conduct. Statements in section 3 pertain more specifically to individuals who have a leadership role. Principles involving compliance with the code are given in section 4. Three principles of particular relevance to this discussion are:
Most professional societies, universities, government departments, and other research offices require researchers to provide information about activities in which human participants will be involved. This documentation is reviewed by a panel, and the researchers are notified whether their plan of work, particularly the details of how human participants and the data collected about them will be treated, is acceptable. Drawing up such an agreement is mandatory in most universities and major organizations. Indeed, special review boards generally prescribe the format required, and many provide a detailed form that must be completed. Once the details are accepted, the review board checks periodically to oversee compliance. In American universities these panels are known as Institutional Review Boards (IRBs); other countries use different names for similar processes. Over the years IRB forms have become increasingly detailed, particularly now that much research involves the Internet and people's interaction via communication technologies across it. Several lawsuits at prominent universities have heightened attention to IRB compliance, to the extent that it sometimes takes several months and multiple amendments to get IRB acceptance. IRB reviewers are not only interested in the more obvious issues of how participants will be treated and what they will be asked to do; they also want to know how the data will be analyzed and stored. For example, data about subjects must be coded and stored so as to prevent linking participants' names with that data. This means that names must be replaced by a code, and that the code and the data must be stored separately, usually under lock and key. Figure 13.1 contains part of a completed IRB form to evaluate a Virtual Business Information Center (VBIC) at the University of Maryland.
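The requirement to separate codes from data can be sketched in a few lines. This is not a prescribed procedure, just an illustration with hypothetical function and field names, assuming Python; in practice the key mapping codes to names would itself be stored separately, e.g. encrypted or under lock and key:

```python
import secrets

def pseudonymize(records):
    """Replace participant names with random codes.

    Returns (coded_records, key), where `key` maps each code back to a
    name. The key must be stored separately from the coded data so that
    the data alone cannot identify anyone.
    """
    key = {}
    coded = []
    for record in records:
        code = "P" + secrets.token_hex(4)  # random code, e.g. 'P9f3a0c1d'
        key[code] = record["name"]
        # copy the record without the name, adding the code instead
        coded.append({**{k: v for k, v in record.items() if k != "name"},
                      "participant": code})
    return coded, key

# Illustrative data only
records = [{"name": "Ann", "task_time_s": 312},
           {"name": "Bo", "task_time_s": 275}]
coded, key = pseudonymize(records)
# `coded` contains no names; only a holder of `key` can re-identify anyone.
```

The design point is simply that re-identification requires both files: the coded data can be analyzed and reported freely, while access to the key is restricted.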
Activity 13.6
Imagine you plan to conduct online interviews with 20 participants in a new chat environment (perhaps using AIM, AOL, or Yahoo! chat). What privacy issues would you need to consider when completing your IRB form?
Comment
You would need to discuss how you will conduct the interviews so that they are private; how you will collect the data; and how the data will be stored, analyzed, and reported. For each, you will need to specify privacy and security considerations. For example, each participant will have a code. The codes and the names to which they relate will be stored separately from the data. At no time will real names be used, nor will there be reference to any markers that could reveal a participant's identity, e.g. where the participant lives or works, or their gender or ethnicity, if these are distinguishing features among the pool of participants.
People give their time and their trust when they agree to participate in an evaluation study and both should be respected. But what does it mean to be respectful to participants? What should participants be told about the evaluation? What are participants' rights? Many institutions and project managers require participants to read and sign an informed consent form similar to the one in Box 13.2. This form explains the aim of the study and promises participants that their personal details and performance will not be made public and will be used only for the purpose stated. It is an agreement between the evaluator and the participants that helps to confirm the professional relationship that exists between them.
Box 13.2: Informed Consent Form
I state that I am over 18 years of age and wish to participate in the evaluation study being conducted by Dr. Hoo and his colleagues at the College of Extraordinary Research, University of Highland, College Estate.
The purpose of the study is to assess the usability of HighFly, a website developed at the National Library to provide information to the general public.
The procedures involve the monitored use of HighFly. I will be asked to perform specific tasks using HighFly. I will also be asked open-ended questions about HighFly and my experience using it. In addition, the evaluators will observe my use of HighFly in my workplace and home using a handheld device and laptop or desktop computer.
All information collected in the study is confidential, and my name will not be identified at any time.
I understand that I am free to ask questions or to withdraw from participation at any time without penalty.
__________________ ____
Signature of Participant Date
(Adapted from Cogdill, 1999.)
The following summary provides guidelines that will help ensure evaluations are done ethically and that adequate steps to protect users' rights have been taken.
Activity 13.7
Think back to the HutchWorld and Indian auxiliary nurse midwives case studies. What ethical issues did the developers have to consider?
Comment
The developers of HutchWorld considered all the issues just listed above. In addition, because the study involved patients, they had to be particularly careful that medical and other personal information was kept confidential. They were also sensitive to the fact that cancer patients may become too tired or sick to participate, so they reassured them that they could stop at any time if the task became onerous.
The team working with the Indian auxiliary nurse midwives were particularly careful to make sure that the nurses knew their rights and that they felt treated with respect. This was essential in order to build trust. Furthermore, it is likely that the participants may not have known about the usual evaluation ethics so the team was particularly careful to ensure that they were informed. Since this study also involved a medical system the team needed to ensure that personal medical information was treated confidentially. Privacy and security were major considerations.
The explosion in Internet and web usage has resulted in more research on how people use these technologies and their effects on everyday life (Jones, 1999). Consequently, there are many projects in which developers and researchers are logging users' interactions, analyzing blogs, recording web traffic, or examining conversations in chatrooms, bulletin boards, or on email. These studies can be done without users knowing that they are being studied. This raises ethical concerns, chief among which are issues of privacy, confidentiality, informed consent, and appropriation of others' personal stories (Sharf, 1999). People often say things online that they would not say face to face. Furthermore, many people are unaware that the personal information they share online can be read by someone with technical know-how years later, even after they have deleted it from their personal mailbox (Erickson et al., 1999).
Studies of user behavior on the Internet may involve logging users' interactions and keeping a copy of their conversations with others. Should users be told that this is happening?
Comment
Yes, it is better to tell users in advance that they are being logged. Knowledge of being logged often ceases to be an issue as users become involved in what they are doing.
Dilemma: What Would You Do?
There is a famous and controversial story about a 1961–62 experiment by Yale social psychologist Stanley Milgram to investigate how people respond to orders given by people in authority. Much has been written about this experiment and details have been changed and embellished over the years, but the basic ethical issues it raises are still worth considering, even if the details of the actual study have been distorted.
The participants were ordinary residents of New Haven who were asked to administer increasingly high levels of electric shocks to victims when they made errors in the tasks they were given. As the electric shocks got more and more severe, so did the apparent pain of the victims receiving them, to the extent that some appeared to be on the verge of dying. Not surprisingly, those administering the shocks became increasingly disturbed by what they were being asked to do, but several continued, believing that they should do as their superiors told them. What they did not realize was that the so-called victims were, in fact, very convincing actors who were not being injured at all. Instead, the shock administrators were themselves the real subjects of the experiment. It was their responses to authority that were being studied in this deceptive experiment.
This story raises several important ethical issues. First, the experiment reveals how power relationships can be used to control others. Second, and equally important, the experiment relied on deception: the people who believed they were administering the shocks were, in fact, the real subjects, and the fake victims colluded with the scientists to deceive them. Without this deception the experiment would not have worked.
Is it acceptable to deceive subjects to this extent for the sake of scientific discovery? What do you think?
Decisions must be made about what data is needed to answer the study questions, how the data will be analyzed, and how the findings will be presented (see Chapter 8). To a great extent the method used determines the type of data collected, but there are still some choices. For example, should the data be treated statistically? Some general questions also need to be asked. Is the method reliable? Does the method measure what is intended, i.e. what is its validity? Are biases creeping in that will distort the results? Will the results be generalizable, i.e. what is their scope? Will the evaluation study be ecologically valid, or is the fundamental nature of the process being changed by studying it?
The reliability or consistency of a method is how well it produces the same results on separate occasions under the same circumstances. Another evaluator or researcher who follows exactly the same procedure should get similar results. Different evaluation methods have different degrees of reliability. For example, a carefully controlled experiment will have high reliability, whereas observing users in their natural setting will produce variable results. An unstructured interview will have low reliability: it would be difficult, if not impossible, to repeat exactly the same discussion.
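The chapter does not prescribe a way to quantify reliability, but when two evaluators independently code the same observations, their chance-corrected agreement is one common measure, and Cohen's kappa is a standard choice. A minimal sketch in Python, with invented category codes:

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two evaluators who assigned
    category codes to the same sequence of observations."""
    assert len(codes_a) == len(codes_b) and codes_a
    n = len(codes_a)
    # Observed proportion of observations coded identically
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each evaluator's own marginals
    count_a, count_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Two evaluators categorizing six observed usability problems (made-up data)
a = ["nav", "nav", "term", "nav", "term", "nav"]
b = ["nav", "term", "term", "nav", "term", "nav"]
kappa = cohens_kappa(a, b)  # 1 = perfect agreement, 0 = chance level
```

A reliable coding scheme should yield kappa values close to 1 when applied by different evaluators to the same material.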
Validity is concerned with whether the evaluation method measures what it is intended to measure. This encompasses both the method itself and the way it is performed. If, for example, the goal of an evaluation study is to find out how users use a new product in their homes, then it is not appropriate to plan a laboratory experiment. An ethnographic study in users' homes would be more appropriate. If the goal is to find average performance times for completing a task, then a method that only recorded the number of user errors would be invalid.
Bias occurs when the results are distorted. For example, expert evaluators performing a heuristic evaluation may be more sensitive to certain kinds of design flaws than others, and this will be reflected in the results. Evaluators collecting observational data may consistently fail to notice certain types of behavior because they do not deem them important. Put another way, they may selectively gather data that they think is important. Interviewers may unconsciously influence responses from interviewees by their tone of voice, their facial expressions, or the way questions are phrased, so it is important to be sensitive to the possibility of biases.
The scope of an evaluation study refers to how much its findings can be generalized. For example, some modeling methods, like the keystroke model, have a narrow, precise scope. The model predicts expert, error-free behavior so, for example, the results cannot be used to describe novices learning to use the system. The problems of overstating the results were discussed in more detail in Chapter 8.
Ecological validity concerns how the environment in which an evaluation is conducted influences or even distorts the results. For example, laboratory experiments are controlled and are quite different from workplace, home, or leisure environments. Laboratory experiments therefore have low ecological validity because the results are unlikely to represent what happens in the real world. In contrast, ethnographic studies do not impact the environment as much, so they have high ecological validity.
Ecological validity is also affected when participants are aware of being studied. This is sometimes called the Hawthorne effect after a series of experiments at the Western Electric Company's Hawthorne factory in the USA in the 1920s and 1930s. The studies investigated changes in length of working day, heating, lighting, etc., but eventually it was discovered that the workers were reacting positively to being given special treatment rather than just to the experimental conditions. Similar findings sometimes occur in medical trials. Patients given the placebo dose (a false dose in which no drug is administered) show improvement that is due to receiving extra attention that makes them feel good.
Find a journal or conference publication that describes an interesting evaluation study or select one from www.hcibib.org or from a digital library such as the ACM Digital Library. Then use the DECIDE framework and your knowledge from Chapters 7 and 8 to analyze it. Some questions that you should seek to answer include:
In this chapter we introduced the DECIDE framework, which will help you to plan an evaluation. There are six steps in the framework:
Key Points
DENZIN, N.K. and LINCOLN, Y.S. (2005) The Sage Handbook of Qualitative Research, 3rd edn. Sage Publications. This book is a collection of chapters by experts in qualitative research. It is an excellent resource.
HOLTZBLATT, K. (ed.) (2005) Designing for the mobile device: experiences, challenges and methods. Communications of the ACM 48(7): 32–66. This collection of papers points out the challenges that evaluators face when studying mobile devices, particularly when the most appropriate study is a field study that may involve working in a different culture and changing physical environments regularly.
JONES, S. (ed.) (1999) Doing Internet Research: Critical Issues and Methods for Examining the Net. Sage Publications. As the title states, this book is concerned with research. However, several of the chapters provide information that will be useful for those evaluating software used on the Internet.
MALONEY-KRICHMAR, D. and PREECE, J. (2005) A multilevel analysis of sociability, usability and community dynamics in an online health community. ACM Transactions on Computer-Human Interaction 12(2): 1–32. This paper describes how activity in an online community was evaluated using a combination of theoretical frameworks and evaluation methods.
SHNEIDERMAN, B. and PLAISANT, C. (2005) Designing the User Interface: Strategies for Effective Human-Computer Interaction, 4th edn. Chapter 4: Evaluating interface designs, pp. 139–171. Addison-Wesley. This chapter provides a useful overview of evaluation and provides valuable references.