Chapter 5

Issue-Based Metrics


Most user experience professionals probably consider identifying usability issues and providing design recommendations the most important parts of their job. A usability issue might involve confusion around a particular term or piece of content, method of navigation, or just not noticing something that should be noticed. These types of issues, and many others, are typically identified as part of an iterative process in which designs are being evaluated and improved throughout the design and development process. This process provides tremendous value to product design and is the cornerstone of the UX profession.

Usability issues are generally thought of as purely qualitative. They typically include the identification and description of a problem one or more participants experienced and, in many cases, an assessment of the underlying cause of the problem. Most UX professionals also include specific recommendations for remedying the problem and many report positive findings as well (i.e., something that worked particularly well).

Most UX professionals don’t strongly associate metrics with usability issues. This may be because of the gray areas in identifying issues or because identifying issues is part of an iterative design process, and metrics are perceived as adding little value. However, not only is it possible to measure usability issues, but doing so also adds value in product design while not slowing down the iterative design process.

This chapter reviews some simple metrics around usability issues. It also discusses different ways of identifying usability issues, prioritizing the importance of different types of issues, and factors you need to think about when measuring usability issues.

5.1 What is a Usability Issue?

What do we mean by usability issues? Usability issues are based on the behaviors people exhibit while using a product. As a UX professional, you interpret the cause of these issues, such as confusing terminology or hidden navigation. Examples of the more common types of usability issues include:

• Behaviors that prevent task completion

• Behaviors that take someone “off course”

• An expression of frustration by the participant

• Not seeing something that should be noticed

• A participant says a task is complete when it is not

• Performing an action that leads away from task success

• Misinterpreting some piece of content

• Choosing the wrong link to navigate through web pages

A key point to consider in defining usability issues is how they will be addressed. The most common use is in an iterative design process focused on improving the product. In that context, the most useful issues are those that point to possible improvements in the product. In other words, it helps if issues are reasonably actionable. If they don’t point directly to a part of the interface that was causing a problem, they should at least give you some hint of where to begin looking. For example, we once saw an issue in a usability test report that said, “The mental model of the application does not match the user’s mental model.” And that was it; no behavior was mentioned. Although this may be an interesting interpretation in a theoretical sense, it does very little to guide designers and developers in addressing the issue.

However, consider an issue like this: “Many participants were confused by the top-level navigation menu (which is the interpretation of the behavior), often jumping around from one section to another trying to find what they were looking for (the behavior).” Particularly if this issue is followed by a variety of detailed examples describing what happened, it could be very helpful. It tells you where to start looking (the top-level navigation), and the more detailed examples of additional behaviors may help focus on some possible solutions. Molich, Jeffries, and Dumas (2007) conducted an interesting study of usability recommendations and ways to make them more useful and usable. They suggest that all usability recommendations should improve the overall user experience of the application, take into account business and technical constraints, and be specific and clear.

Of course, not all usability issues are things to be avoided. Some usability issues are positive. These are sometimes called usability “findings,” as the term issues often has negative connotations. Here are some examples of positive usability issues:

• All participants were able to log into the application

• There were no errors in completing the search task

• Participants were faster at creating a report

The main reason for reporting positive findings, in addition to providing some positive reinforcement for the project team, is to make sure that these aspects of the interface don’t get “broken” in future design iterations.

5.1.1 Real Issues versus False Issues

One of the most difficult parts of any usability professional’s job is determining which usability issues are real and which are merely an aberration. Obvious issues are those that most, if not all, participants encounter. For example, it may be obvious when participants select the wrong option from a poorly worded menu, get taken down the wrong path, and then spend a significant amount of time looking for their target in the wrong part of the application. The cause of behaviors like these is usually a “no brainer” for almost anyone to identify.

Some usability issues are much less obvious, or it’s not completely clear whether something is a real issue. For example, what if only 1 out of 10 participants expresses some confusion around a specific piece of content or terminology on a website? Or if only 1 out of 12 participants doesn’t notice something she should have? At some point the UX professional must decide whether what he observed is likely to be repeatable with a larger population. In these situations, ask yourself whether the participant’s behavior, thought process, perception, or decisions during the task were logical. In other words, is there a consistent story or reasoning behind her actions or thoughts? If so, then it may be an issue even if only one participant encountered it. However, there may be no apparent rhyme or reason behind the behavior. If the participant can’t explain why he did what he did, and it only happened once, then it’s likely to be idiosyncratic and should probably be ignored.

5.2 How to Identify an Issue

The most common way to identify usability issues is during a study in which you are interacting with a participant directly. This might be in person or over the phone using remote testing technology. A less common way to identify usability issues is through some automated techniques such as an online study or by observing a video from a participant, similar to what is generated from a site like usertesting.com. This is where you don’t have an opportunity to observe participants directly but only have access to their behavioral and self-reported data. Identifying issues through this type of data is more challenging but still quite possible.

Possible usability issues might be predicted beforehand and tracked during test sessions. But be careful that you’re really observing the issues and not just finding them because you expected to. Your job is certainly easier when you know what to look for, but you might also miss other issues that you never considered. In our testing, we typically have an idea of what to look for, but we also try to keep an open mind to spot the surprise issues. There’s no “right” approach; it all depends on the goals of the evaluation. When evaluating products that are in an early conceptual stage, it’s more likely that you won’t have preset ideas about what the usability issues are. As the product is further refined, you may have a clearer idea of what specific issues you’re looking for.

The Issues You Expect May Not Be the Ones You Find

One of the earliest sets of guidelines for designing software interfaces was published by Apple (1982). It was called the Apple IIe Design Guidelines, and it contained a fascinating story of an early series of usability tests Apple conducted. They were working on the design of a program called Apple Presents Apple, which was a demonstration program for customers to use in computer stores. One part of the interface to which the designers paid little attention was asking users whether their monitor was monochrome or color. The initial design of the question was “Are you using a black-and-white monitor?” (They had predicted that users might have trouble with the word monochrome.) In the first usability test, they found that a majority of the participants who used a monochrome monitor answered this question incorrectly because their monitor actually displayed text in green, not white!

What followed was a series of hilarious iterations involving questions such as “Does your monitor display multiple colors?” or “Do you see more than one color on the screen?”—all of which kept failing for some participants. In desperation, they were considering including a developer with every computer just to answer this question, but then they finally hit on a question that worked: “Do the words above appear in several different colors?” In short, the issues you expect may not be the issues you find.

5.2.1 In-Person Studies

The best way to facilitate identifying usability issues during an in-person study is using a think-aloud protocol. This involves having participants verbalize their thoughts as they are working through the tasks. Typically, the participants are reporting what they are doing, what they are trying to accomplish, how confident they are about their decisions, their expectations, and why they performed certain actions. Essentially, it’s a stream of consciousness focusing on their interaction with the product. During a think-aloud protocol, you might observe the following:

• Verbal expressions of confusion, frustration, dissatisfaction, pleasure, or surprise

• Verbal expressions of confidence or indecision about a particular action that may be right or wrong

• Participants not saying or doing something that they should have done or said

• Nonverbal behaviors such as facial expressions and/or eye movements

In addition to listening to participants, it is very important to observe their behavior. Watching what they are doing, where they struggle, and how they succeed provides a great way to identify usability issues.

5.2.2 Automated Studies

Identifying usability issues through automated studies requires careful data collection. The key is to allow participants to enter verbatim comments at a page or task level. In most automated studies, several data points are collected for each task: success, time, ease-of-use rating, and verbatim comments. Verbatim comments are the best way to understand any possible issues.

One way to collect verbatim comments is to require the participant to provide a comment at the conclusion of each task. This might yield some interesting results, but it doesn’t always yield the best results. An alternative that seems to work better is to make the verbatim comment conditional. If the participant provides a low ease-of-use score (e.g., not one of the two highest ratings), then she is asked to provide feedback about why she rated the task that way. Having a more pointed question usually yields more specific, actionable comments. For example, participants might say that they were confused about a particular term or that they couldn’t find the link they wanted on a certain page. This type of task-level feedback is usually more valuable than one question after they complete all the tasks (post-study). The only downside of this approach is that participants may adjust their ratings after a few tasks in order to avoid the open-ended question.
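As a rough illustration of this conditional approach, the following Python sketch shows one way an unmoderated study might decide when to ask the follow-up question. The seven-point scale, the threshold, and the prompt wording are all assumptions made for the example, not features of any particular testing tool.

# A minimal sketch of conditional comment collection in an unmoderated study.
# The rating scale, threshold, and wording below are illustrative assumptions.

EASE_SCALE_MAX = 7                      # assume a 7-point ease-of-use rating
PROMPT_THRESHOLD = EASE_SCALE_MAX - 1   # prompt unless the rating is in the top two

def collect_task_feedback(task_name: str, ease_rating: int) -> dict:
    """Record the rating and ask "why" only when the rating is low."""
    record = {"task": task_name, "ease": ease_rating, "comment": None}
    if ease_rating < PROMPT_THRESHOLD:
        # A pointed follow-up tends to yield more specific, actionable comments
        # than a generic question asked after every task.
        record["comment"] = input(
            f"You rated '{task_name}' {ease_rating} out of {EASE_SCALE_MAX}. "
            "What made this task difficult? "
        )
    return record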

5.3 Severity Ratings

Not all usability issues are the same: Some are more serious than others. Some usability issues mildly annoy or frustrate users, whereas others cause them to make the wrong decisions or lose data. Obviously, these two different types of usability issues have a very different impact on the user experience, and severity ratings are a useful way to deal with them.

Severity ratings help focus attention on the issues that really matter. There’s nothing more frustrating for a developer or business analyst than being handed a list of 82 usability issues that all need to be fixed immediately. By prioritizing usability issues, you’re much more likely to have a positive impact on the design, not to mention lessening the likelihood of making enemies with the rest of the design and development team.

The severity of usability issues can be classified in many ways, but most severity rating systems can be boiled down to two different types. In one type of rating system, severity is based purely on the impact on the user experience: The worse the user experience, the higher the severity rating. A second type of severity rating system tries to bring in multiple dimensions or factors, such as business goals and technical implementation costs.

5.3.1 Severity Ratings Based on the User Experience

Many severity ratings are based solely on the impact on the user experience. These rating systems are easy to implement and provide very useful information. They usually have three levels, often something like low, medium, and high severity. Occasionally there is a “catastrophe” level, which is essentially a showstopper (delaying product launch or release—Nielsen, 1993).

When choosing a severity rating system, it’s important to look at your organization and the product you are evaluating. A three-level system works well in many situations:

Low: Any issue that annoys or frustrates participants but does not play a role in task failure. These are the types of issues that may lead someone off course, but he still recovers and completes the task. This issue may only reduce efficiency and/or satisfaction a small amount, if any.

Medium: Any issue that contributes to significant task difficulty but does not cause task failure. Participants often develop workarounds to get to what they need. These issues have an impact on effectiveness and most likely efficiency and satisfaction.

High: Any issue that leads directly to task failure. Basically, there is no way to encounter this issue and still complete the task. This type of issue has a significant impact on effectiveness, efficiency, and satisfaction.

Note that this scheme is essentially a rating based on task failure, which is only one of the measures of user experience. In a test in which there are no task failures, there can be no high-severity issues.

Another limitation of a three-level scheme from low to high is that user experience professionals are often reluctant to use the “low” category, fearing that those issues may be ignored. In practice, that limits the scale to two levels.

An Example of the Ultimate Issue Severity

Tullis (2011) described an example of what we consider the ultimate in issue severity. In the early 1980s he conducted a usability test of a prototype of a handheld device for detecting high voltage on a metallic surface. The device had two indicator lights: one simply indicated that the device was working, and the other indicated that high voltage was present, which could be fatal. Unfortunately, both indicator lights were green. And they were right next to each other. And neither was labeled. After pleading with the designers to change the design, he finally decided to do a quick usability test. He had 10 participants perform 10 simulated tasks with the device. The prototype was rigged to signal the hazardous voltage condition 20% of the time. Out of 100 participant tasks, the indicator lights were interpreted correctly 99 times. But that one error was when the device was signaling hazardous voltage. This usability issue could have resulted in serious injury or death to the user. The designers were convinced and the design was changed significantly.


5.3.2 Severity Ratings Based on a Combination of Factors

Severity rating systems that use a combination of factors usually are based on the impact on the user experience coupled with frequency of use and/or impact on the business goals. Nielsen (1993) provides an easy way to combine the impact on the user experience and frequency of use into a severity rating (Figure 5.1). This severity rating system is intuitive and easy to explain.


Figure 5.1 Severity rating scale taking into account problem frequency and impact on the user experience. Adapted from Nielsen (1993).

Alternatively, it’s possible to consider three or even four dimensions, such as impact to the user experience, predicted frequency of occurrence, impact on the business goals, and technical/implementation costs. For example, you might combine four different three-point scales:

• Impact on the user experience (0 = low, 1 = medium, 2 = high)

• Predicted frequency of occurrence (0 = low, 1 = medium, 2 = high)

• Impact on the business goals (0 = low, 1 = medium, 2 = high)

• Technical/implementation costs (0 = high, 1 = medium, 2 = low)

By adding up the four scores, you now have an overall severity rating ranging from 0 to 8. Of course, a certain amount of guesswork is involved in coming up with the levels, but at least all four factors are being taken into consideration. Or, if you really want to get fancy, you can weight each dimension based on some sort of organizational priority.
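As an illustration, the following Python sketch computes this kind of combined score. The example issue and the weights are hypothetical; the coding of each factor follows the four scales listed above.

# A minimal sketch of a combined severity score using the four 3-point scales
# described above. The example values and weights are hypothetical.

def severity_score(ux_impact, frequency, business_impact, tech_cost,
                   weights=(1, 1, 1, 1)):
    """Each factor is coded 0-2 as in the text; higher totals are more severe.
    Note that tech_cost is reverse-coded (0 = high cost, 2 = low cost), so a
    cheap fix raises the priority. With equal weights the score ranges 0-8."""
    factors = (ux_impact, frequency, business_impact, tech_cost)
    return sum(f * w for f, w in zip(factors, weights))

# Unweighted: high UX impact, medium frequency, high business impact, low cost
print(severity_score(2, 1, 2, 2))                          # 2 + 1 + 2 + 2 = 7

# Weighted: an organization that prioritizes business impact might double it,
# giving a 0-10 range in this case
print(severity_score(2, 1, 2, 2, weights=(1, 1, 2, 1)))    # 9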

5.3.3 Using a Severity Rating System

Once you have settled on a severity rating system, you still need to consider a few more things. First, be consistent: Decide on one severity rating system and use it for all your studies. By using the same severity rating system, you will be able to make meaningful comparisons across studies, as well as help train your audience on the differences between the severity levels. The more your audience internalizes the system, the more persuasive you will be in promoting design solutions.

Second, communicate clearly what each level means. Provide examples of each level as much as possible. This is particularly important for other usability specialists on your team who might also be assigning ratings. It’s important that developers, designers, and business analysts understand each severity level. The more the “nonusability” audience understands each level, the easier it will be to influence design solutions for the highest priority issues.

Third, try to have more than one usability specialist assign severity ratings to each issue. One approach that works well is to have the usability specialists independently assign severity ratings to each of the issues and then discuss any of the issues where they gave different ratings and try to agree on the appropriate level.

Finally, there’s some debate about whether usability issues should be tracked as part of a larger bug-tracking system (Wilson & Coyne, 2001). Wilson argues that it is essential to track usability issues as part of a bug-tracking system because it makes the usability issues more visible, lends more credibility to the usability team, and makes it more likely that the issues will be remedied. Coyne suggests that usability issues, and the methods to fix them, are much more complex than typical bugs. Therefore, it makes more sense to track usability issues in a separate database. Either way, it’s important to track the usability issues and make sure they are addressed, not simply forgotten.

5.3.4 Some Caveats about Rating Systems

Not everyone believes in severity ratings. Kuniavsky (2003) suggests letting your audience provide their own severity ratings. He argues that only those who are deeply familiar with the business model will be able to determine the relative priority of each usability issue.

Bailey (2005) strongly argues against severity rating systems altogether. He cites several studies that show there is very little agreement between usability specialists on the severity rating for any given usability issue (Catani & Biers, 1998; Cockton & Woolrych, 2001; Jacobsen, Hertzum, & John, 1998; Molich & Dumas, 2008). All of these studies generally show that there is very little overlap in what different usability specialists identify as a high-severity issue. Obviously, this is troubling given that many important decisions may be based on severity ratings.

Hertzum et al. (2002) highlight a potentially different problem in assigning severity ratings. In their research they found that when multiple usability specialists are working as part of the same team, each usability specialist rates the issues she personally identifies as more severe than the issues identified by the other usability specialists on her team. This is one aspect of what is known as the evaluator effect, and it poses a significant problem in relying on severity ratings from a single UX professional. As a profession, we don’t yet know why severity ratings are not consistent between specialists.

So where does this leave us? We believe that severity ratings are far from perfect, but they still serve a useful purpose. They help direct attention to at least some of the most pressing needs. Without severity ratings, the designers or developers will simply make their own priority list, perhaps based on what’s easiest or least expensive to implement. Even though there is subjectivity involved in assigning severity ratings, they’re better than nothing. We believe that most key stakeholders understand that there is more art than science involved, and they interpret the severity ratings within this broader context.

5.4 Analyzing and Reporting Metrics for Usability Issues

Once you’ve identified and prioritized the usability issues, it’s helpful to do some analyses of the issues themselves. This lets you derive some metrics related to the usability issues. Exactly how you do this will largely depend on the type of usability questions you have in mind. Three general questions can be answered by looking at metrics related to usability issues:

• How usable is the product overall? This is helpful if you simply want to get an overall sense of how the product did.

• Is the usability improving with each design iteration? Focus on this question when you need to know how the usability is changing with each new design iteration.

• Where should you focus your efforts to improve the design? The answer to this question is useful when you need to decide where to focus your resources.

All of the analyses we examine can be done with or without severity ratings. Severity ratings simply add a way to filter the issues. Sometimes it’s helpful to focus on the high-severity issues. Other times it might make more sense to treat all the usability issues equally.

5.4.1 Frequency of Unique Issues

The simplest way to measure usability issues is to count the number of unique issues. Analyzing the frequency of unique issues is most useful in an iterative design process when you want some high-level data about how the usability is changing with each new design iteration. For example, you might observe that the number of unique issues decreased from 24 to 12 to 4 through the first three design iterations. These data are obviously trending in the right direction, but they’re not necessarily iron-clad evidence that the design is significantly better. Perhaps the four remaining issues are so much bigger than all the rest that without addressing them, everything else is unimportant. Therefore, we suggest a thorough analysis and explanation of the issues when presenting this type of data.

Keep in mind that this frequency represents the number of unique issues, not the total number of issues encountered by all participants. For example, assume Participant A encountered 10 issues, whereas Participant B encountered 14 issues, but 6 of those issues were the same as those from Participant A. If A and B were the only participants, the total number of unique issues would be 18. Figure 5.2 shows an example of how to present the frequency of usability issues when comparing more than one design.
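The worked example above can be expressed directly with sets. The following Python sketch uses hypothetical issue labels chosen so that they reproduce the Participant A and Participant B arithmetic.

# A minimal sketch of counting unique versus total issues, using hypothetical
# issue labels that match the Participant A / Participant B example above.

participant_issues = {
    "A": {f"issue_{i}" for i in range(1, 11)},   # 10 issues
    "B": {f"issue_{i}" for i in range(5, 19)},   # 14 issues, 6 shared with A
}

unique_issues = set().union(*participant_issues.values())
total_encounters = sum(len(issues) for issues in participant_issues.values())

print(len(unique_issues))     # 18 unique issues
print(total_encounters)       # 24 total (nonunique) issue encounters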


Figure 5.2 Example data showing the number of unique usability issues by design iteration.

The same type of analysis can be performed using usability issues that have been assigned a severity rating. For example, if you have classified your usability issues into three levels (low, medium, and high severity), you can easily look at the number of issues by each type of severity rating. Certainly the most telling data item would be the change in the number of high-priority issues with each design iteration. Looking at the frequency of usability issues by severity rating, as illustrated in Figure 5.3, can be very informative since it is an indicator of whether the design effort between each iteration is addressing the most important usability issues.


Figure 5.3 Example data showing the number of unique usability issues by design iteration, categorized by severity rating. The change in the number of high-severity issues is probably of key interest.

5.4.2 Frequency of Issues Per Participant

It can also be informative to look at the number of (nonunique) issues each participant encountered. Over a series of design iterations, you would expect to see this number decreasing along with the total number of unique issues. For example, Figure 5.4 shows the average number of issues encountered by each participant for three design iterations. Of course, this analysis could also include the average number of issues per participant broken down by severity level. If the average number of issues per participant is steady over a series of iterations, but the total number of unique issues is declining, then you know there is more consistency in the issues that the participants are encountering. This would indicate that the issues encountered by fewer participants are being fixed, whereas those encountered by more participants are not.


Figure 5.4 Example data showing the average number of usability issues encountered by participants in each of three usability tests.

5.4.3 Frequency of Participants

Another useful way to analyze usability issues is to observe the frequency or percentage of participants who encountered a specific issue. For example, you might be interested in whether participants correctly used some new type of navigation element on your website. You might report that half of the participants encountered a specific issue in the first design iteration and that only 1 out of 10 encountered the same issue in the second design iteration. This is a useful metric when you need to focus on whether you are improving the usability of specific design elements as opposed to making overall usability improvements.

With this type of analysis, it’s important that your criteria for identifying specific issues are consistent between participants and designs. If a description of a specific issue is a bit fuzzy, your data won’t mean very much. It’s a good idea to explicitly document the issue’s exact nature, thereby reducing any interpretation errors across participants or designs. Figure 5.5 shows an example of this type of analysis.


Figure 5.5 Example data showing the frequency of participants who experienced specific usability issues.

The use of severity ratings with this type of analysis is useful in a couple of ways. First, you could use the severity ratings to focus your analysis only on the high-priority issues. For example, you could report that there are five outstanding high-priority usability issues. Furthermore, the percentage of participants who are experiencing these issues is decreasing with each design iteration. Another form of analysis is to aggregate all the high-priority issues to report the percentage of participants who experienced any high-priority issue. This helps you see how overall usability is changing with each design iteration, but it is less helpful in determining whether to address a specific usability problem.
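Both analyses are easy to compute from a simple issue log. The Python sketch below uses a hypothetical log of (participant, issue, severity) observations to produce the percentage of participants who encountered a specific issue and the percentage who encountered any high-severity issue.

# A minimal sketch of participant-frequency metrics. The issue log and the
# participant list are hypothetical.

observations = [
    ("P1", "nav_menu_confusion", "high"),
    ("P2", "nav_menu_confusion", "high"),
    ("P2", "jargon_in_form", "medium"),
    ("P3", "jargon_in_form", "medium"),
    ("P4", "nav_menu_confusion", "high"),
]
participants = ["P1", "P2", "P3", "P4", "P5"]

def pct_with_issue(issue_name):
    """Percentage of participants who encountered one specific issue."""
    hit = {p for p, issue, severity in observations if issue == issue_name}
    return 100 * len(hit) / len(participants)

def pct_with_any_high_severity():
    """Percentage of participants who encountered any high-severity issue."""
    hit = {p for p, issue, severity in observations if severity == "high"}
    return 100 * len(hit) / len(participants)

print(pct_with_issue("nav_menu_confusion"))   # 60.0
print(pct_with_any_high_severity())           # 60.0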

5.4.4 Issues by Category

Sometimes it’s helpful to know where to focus design improvements from a tactical perspective. Perhaps you feel that only certain areas of the product are causing the most usability issues, such as navigation, content, terminology, and so forth. In this situation, it can be useful to aggregate usability issues into categories. Simply examine each issue and then categorize it into a type of issue. Then look at the frequencies of issues that fall into each category. Issues can be categorized in many different ways. Just make sure the categorization makes sense to you and your audience, and use a limited number of categories, typically three to eight. If there are too many categories, it won’t provide much direction. Figure 5.6 provides an example of usability issues analyzed by category.


Figure 5.6 Example data showing the frequency of usability issues categorized by type. Note that both navigation and terminology issues were improved from the first to the second design iteration.

5.4.5 Issues by Task

Issues can also be analyzed at a task level. You might be interested in which tasks lead to the most issues, and you can report the number of unique issues that occur for each. This will identify the tasks you should focus on for the next design iteration. Alternatively, you could report the frequency of participants who encounter any issue for each task. This will tell you the pervasiveness of a particular issue. The greater the number of issues for each task, the greater the concern should be.

If you have assigned a severity rating to each issue, it might be useful to analyze the frequency of high-priority issues by task. This is particularly effective if you want to focus on a few of the biggest problems and your design efforts are oriented toward specific tasks. This is also helpful if you are comparing different design iterations using the same tasks.

5.5 Consistency in Identifying Usability Issues

Much has been written about consistency and bias in identifying and prioritizing usability issues. Unfortunately, the news is not so good. Much of the research shows that there is very little agreement on what a usability issue is or how severe it is.

Perhaps the most exhaustive set of studies, called CUE (Comparative Usability Evaluation), has been coordinated by Rolf Molich. To date, nine separate CUE studies have been conducted, dating back to 1998. Each study was set up in a similar manner. Different teams of usability experts all evaluated the same design. Each team reported their findings, including the identification of the usability issues, along with their design recommendations. The first study, CUE-1 (Molich et al., 1998), showed very little overlap in the issues identified. In fact, only 1 out of the 141 issues was identified by all four teams participating in the study, and 128 out of the 141 issues were identified by single teams. Several years later, in CUE-2, the results were no more encouraging: 75% of all the issues were reported by only 1 of 9 usability teams (Molich et al., 2004). CUE-4 (Molich & Dumas, 2008) showed similar results: 60% of all the issues were identified by only 1 of the 17 different teams participating in the study. More recently, CUE-8 focused on the consistency of how UX metrics are used and the conclusions that are drawn.

CUE-8—How Practitioners Measure Website Usability

by Rolf Molich, DialogDesign

Fifteen experienced professional usability teams simultaneously and independently measured a baseline for the usability of the car rental website Budget.com. This comparative study documented a wide difference in measurement approaches. The 8–10 teams that used similar and well-established approaches reached surprisingly similar results.

Fifteen Teams Measured the Same Website

In May 2009, 15 U.S. and European teams independently and simultaneously carried out usability measurements of the Budget.com car rental website. The goals were to investigate reproducibility of professional usability measurements and how experienced professionals actually carry out usability measurements.

The measurements were based on a common scenario and instructions. The scenario deliberately did not specify in detail which measures the teams were supposed to collect and report, although participants were asked to collect time-on-task, task success, and satisfaction data, as well as any qualitative data they normally would collect. The anonymous reports from the 15 participating teams are available publicly online (http://www.dialogdesign.dk/CUE-8.htm).

All teams were asked to measure the same five tasks in their study, for example, “Rent an intermediate size car at Logan Airport in Boston, Massachusetts, from Thursday 11 June 2009 at 09.00 AM to Monday 15 June at 3.00 PM. If asked for a name, use John Smith, email address [email protected]. Do not submit the reservation.”

Teams used from 9 to 313 test participants and from 21 to 128 hours to complete the study. Interestingly, the team that tested the most participants also spent the fewest hours on the study. This team used 21 person hours to conduct 313 sessions, which were all unmoderated.

Eight of the 15 teams used the SUS questionnaire for measuring subjective satisfaction. Despite its known shortcomings, SUS seems to be the current industry standard. No other questionnaire was used by more than one team.

Nine teams included qualitative results in addition to the required quantitative results. The general feeling seemed to be that the qualitative results were a highly useful by-product of the measurements.

The study is named CUE-8. It was the eighth in a series of Comparative Usability Evaluation studies (http://www.dialogdesign.dk/CUE.html).

Unmoderated Test Sessions

Six teams used unmoderated, automated measurements. Two of these six teams supplemented unmoderated measurements with moderated measurement sessions. These teams obtained valuable results but some also found that their data from the unattended test sessions were contaminated or invalid. Some participants reported impossible task times, perhaps because they wanted the reward with as little effort as possible.

An example of contaminated data is a reported time of 33 seconds to rent a car, which is impossible on the Budget.com website. The presence of obviously contaminated data in the data set raises serious doubts about the validity of all data in the data set. It’s easy to spot unrealistic data, but how about a reported time of, for example, 146 seconds to rent a car in a data set that also contains unrealistic data? The 146 seconds look realistic, but how do you know that the unmoderated test participant did not use an unacceptable approach to arrive at the reported time?

Unmoderated measurements are attractive from a resource point of view; however, data contamination is a serious problem and it is not always clear what you are actually measuring. While both moderated and unmoderated testing have opportunities for things to go wrong, it is more difficult to detect and correct these with unmoderated testing. Further studies of how data contamination can be prevented and how contaminated data can be cleaned efficiently are required.

For unmoderated measurements, the ease of use and intrusiveness of the remote tool influence measurements. Some teams complained about clunky interfaces. We recommend that practitioners demand usable products for usability measurements.

Practitioner’s Takeaway from CUE-8

CUE-8 confirmed a number of rules for good measurement practice. Perhaps the most interesting result from CUE-8 is that these rules were not always observed by the participating professional teams.

• Adhere strictly to precisely defined measurement procedures for quantitative tests.

• Report time on task, success/failure rate, and satisfaction for quantitative tests.

• Exclude failed times from average task completion times.

• Understand the inherent variability from samples. Use strict participant screening criteria. Provide confidence intervals around your results if this is possible. Keep in mind that time on task is not distributed normally and therefore confidence intervals as commonly computed on raw scores may be misleading.

• Combine qualitative and quantitative findings in your report. Present what happened (quantitative data) and support it with why it happened (qualitative data). Qualitative data provide considerable insight regarding the serious obstacles that users faced and it is counterproductive not to report this insight.

• Justify the composition and size of your participant samples. This is the only way you have to allow your client to judge how much confidence they should place in your results.

• When using unmoderated methodologies for quantitative tests ensure that you can distinguish between extreme and incorrect results. Although unmoderated testing can exhibit a remarkable productivity in terms of user tasks measured with a limited effort, quantity of data is no substitute for clean data.

Further Information

The 17-page refereed paper “Rent a Car in Just 0, 60, 240 or 1,217 Seconds? Comparative Usability Measurement, CUE-8” describes the results of CUE-8 in detail. The paper is freely available in the November 2010 issue of the Journal of Usability Studies.

5.6 Bias in Identifying Usability Issues

Many different factors can influence how usability issues are identified. Carolyn Snyder (2006) provides a review of many of the ways usability findings might be biased. She concludes that bias cannot be eliminated, but it must be understood. In other words, even though our methods have flaws, they are still useful. We’ve distilled the different sources of bias in a usability study into seven general categories:

Participants: Your participants are critical. Every participant brings a certain level of technical expertise, domain knowledge, and motivation. Some participants may be well targeted and others may not. Some participants are comfortable in a lab setting, whereas others are not. All of these factors make a big difference in what usability issues you end up discovering.

Tasks: The tasks you choose have a tremendous impact on what issues are identified. Some tasks might be well defined with a clear end state, others might be open ended, and yet others might be self-generated by each participant. The tasks basically determine what areas of the product are exercised and the ways in which they are exercised. Particularly with a complex product, this can have a major impact on what issues are uncovered.

Method: The method of evaluation is critical. Methods might include traditional lab testing or some type of expert review. Other decisions you make are also important, such as how long each session lasts, whether the participant thinks aloud, or how and when you probe.

Artifact: The nature of the prototype or product you are evaluating has a huge impact on your findings. The type of interaction will vary tremendously whether it is a paper prototype, functional or semifunctional prototype, or production system.

Environment: The physical environment also plays a role. The environment might involve direct interaction with the participant, indirect interaction via a conference call or behind a one-way mirror, or even at someone’s home. Other characteristics of the physical environment, such as lighting, seating, observers behind a one-way mirror, and videotaping, can all have an impact on the findings.

Moderators: Different moderators will also influence the issues that are observed. A UX professional’s experience, domain knowledge, and motivation all play a key role.

Expectations: Norgaard and Hornbaek (2006) found that many usability professionals come into testing with expectations about which areas of the interface are most problematic. These expectations have a significant impact on what they report, often causing them to miss many other important issues.

An interesting study that sheds some light on these sources of bias was conducted by Lindgaard and Chattratichart (2007). They analyzed the reports from the nine teams in CUE-4 who conducted actual usability tests with real users. They looked at the number of participants in each test, the number of tasks used, and the number of usability issues reported. They found no significant correlation between the number of participants in the test and the percentage of usability problems found. However, they did find a significant correlation between the number of tasks used and the percentage of usability problems found (r = 0.82, p < 0.01). When looking at the percentage of new problems uncovered, the correlation with the number of tasks was even higher (r = 0.89, p < 0.005). As Lindgaard and Chattratichart (2007) concluded, these results suggest “that with careful participant recruitment, investing in wide task coverage is more fruitful than increasing the number of users.”

One technique that works well to increase task coverage in a usability test is to define a set of tasks that all participants must complete and another set that is derived for each participant. These additional tasks might be selected based on characteristics of the participant (e.g., an existing customer or a prospect) or might be selected at random. Care must be exercised when making comparisons across participants, as not all participants had the same tasks. In this situation, you may want to limit certain analyses to the core tasks.
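One simple way to implement this kind of task assignment is sketched below. The task names and the number of extra tasks are hypothetical, and, as noted above, comparisons across participants should be limited to the core tasks.

# A minimal sketch of assigning a common core task set plus randomly sampled
# extra tasks to each participant. Task names and counts are hypothetical.
import random

CORE_TASKS = ["find store hours", "check order status", "update profile"]
OPTIONAL_TASKS = ["apply a coupon", "compare two products", "contact support",
                  "write a review", "change the shipping address"]

def build_task_list(n_extra=2, seed=None):
    """All core tasks for every participant, plus a random subset of optional
    tasks to widen overall task coverage across the study."""
    rng = random.Random(seed)
    return CORE_TASKS + rng.sample(OPTIONAL_TASKS, n_extra)

print(build_task_list(seed=1))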

The Special Case of Moderator Bias in an Eye-Tracking Study

One of the more difficult aspects of moderating a usability study is controlling where you look during the session. Moderators are usually looking either at the participant or at their interaction on a screen or some other interface. This works well, except in the case of an eye-tracking study. Most eye-tracking studies measure where participants look and whether participants are noticing key elements of the interface. As a moderator, it can be difficult not to look at the target when the participant is scanning the interface. Participants can pick up on this easily, begin to notice where you are looking, and use that information as a guide to the target. It happens very quickly and subtly. While this behavior has not been reported in the user experience literature, we have observed it during our own eye-tracking studies. The best thing to do is to be aware of it, and if you find your eyes starting to wander to the target, simply refocus on the participant, what they are doing, or, if you have to, some other element on the page. Or don’t sit in the room with the participant, if that’s an option. When you’re sitting with the participant in an eye-tracking study, there’s also a greater chance that the participant will naturally look at you and away from the screen.

5.7 Number of Participants

There has been much debate about how many participants are needed in a usability test to reliably identify usability issues. [For a summary of the debate, see Barnum et al. (2003).] Nearly every UX professional seems to have an opinion. Not only are many different opinions floating around out there, but quite a few compelling studies have been conducted on this very topic. From this research, two different camps have emerged: those who believe that five participants are enough to identify most of the usability issues and those who believe that five is nowhere near enough.

5.7.1 Five Participants is Enough

One camp believes that a majority, or about 80%, of usability issues will be observed with the first five participants (Lewis, 1994; Nielsen & Landauer, 1993; Virzi, 1992). This is known as the “magic number 5.” One of the most important ways to figure out how many participants are needed in a usability test is to measure p, or the probability of a usability issue being detected by a single test participant. It’s important to note that this p is different from the p value used in tests of significance. The probabilities vary from study to study, but they tend to average around 0.3, or 30%. [For a review of different studies, see Turner, Nielsen, and Lewis (2002).] In a seminal paper, Nielsen and Landauer (1993) found an average probability of 31% based on 11 different studies. This basically means that with each participant, about 31% of the usability problems are being observed.

Figure 5.7 shows how many issues are observed as a function of the number of participants when the probability of detection is 30%. (Note that this assumes all issues have an equal probability of detection, which may be a big assumption.) As you can see, after the first participant, 30% of the problems are detected; after the third participant, about 66% of the problems are observed; and after the fifth participant, about 83% of the problems have been identified. This claim is backed up not only by this mathematical formula, but by anecdotal evidence as well. Many UX professionals only test with five or six participants during an iterative design process. In this situation, it is relatively uncommon to test with more than a dozen, with a few exceptions. If the scope of the product is particularly large or if there are distinctly different audiences, then a strong case can be made for testing with more than five participants.

Calculating p, or Probability of Detection

Calculating the probability of detection is fairly straightforward. Simply line up all the usability issues discovered during the test. Then, for each participant, count how many of those issues that participant encountered and divide by the total number of issues; each test participant will have encountered anywhere from 0 to 100% of the issues. Finally, take the average of these proportions across all the test participants. This is the overall probability of detection for the test. Consider the example shown in this table.

[Table: a matrix showing which of the study’s issues each participant encountered; in this example the average proportion of issues detected per participant is 0.49.]

Once the average proportion has been determined (0.49 in this case), the next step is to estimate the proportion of issues that would be identified by a given number of users. Use the following formula:

Proportion of issues identified = 1 − (1 − p)^n

where p is the average probability of detection and n is the number of users.

So if you want to know the proportion of issues that would be identified by a sample of three users:

• 1 − (1 − 0.49)^3

• 1 − (0.51)^3

• 1 − 0.133

• 0.867, or about 87%, of the issues would be identified with a sample of three users from this study
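The same calculation is easy to script. The Python sketch below uses a small hypothetical issue-by-participant matrix to compute p, then applies the 1 − (1 − p)^n formula to reproduce the worked example above and the 30% case discussed earlier.

# A minimal sketch of the probability-of-detection calculation. The issue
# matrix below is hypothetical (1 = the participant encountered that issue).

issue_matrix = {
    "P1": [1, 0, 1, 1, 0, 1],
    "P2": [0, 1, 1, 0, 0, 1],
    "P3": [1, 1, 0, 1, 1, 0],
}
total_issues = 6

# Proportion of all issues each participant encountered, then the average (p)
proportions = [sum(row) / total_issues for row in issue_matrix.values()]
p = sum(proportions) / len(proportions)

def proportion_found(p, n):
    """Expected proportion of issues found by n participants: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

print(round(p, 2))                            # p for the hypothetical matrix
print(round(proportion_found(0.49, 3), 3))    # 0.867, the worked example above
print(round(proportion_found(0.30, 5), 2))    # about 0.83 with five participants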


Figure 5.7 Example showing the percentage of the total issues in a usability study that would be observed as a function of the number of participants, given a 30% probability of detection.

5.7.2 Five Participants is Not Enough

Other researchers have challenged this idea of the magic number 5 (Molich et al., 1998; Spool & Schroeder, 2001; Woolrych & Cockton, 2001). Spool and Schroeder (2001) asked participants to purchase various types of products, such as CDs and DVDs, at three different electronics websites. They discovered only 35% of the usability issues after the first five participants—far lower than the 80% predicted by Nielsen (2000). However, in this study the scope of the websites being evaluated was very large, even though the task of buying something was very well defined. Woolrych and Cockton (2001) discount the assertion that five participants are enough, primarily because it does not take into account individual differences.

The analyses by Lindgaard and Chattratichart (2007) of the nine usability tests from CUE-4 also raise doubts about the magic number 5. They compared the results of two teams, A and H, that both did very well, uncovering 42 and 43%, respectively, of the full set of usability problems. Team A used only 6 participants, whereas Team H used 12. At first glance, this might be seen as evidence for the magic number 5, as a team that tested only 6 participants uncovered as many problems as a team that tested 12. But a more detailed analysis reveals a different conclusion. In looking specifically at the overlap of usability issues between just these two reports, they found only 28% in common. More than 70% of the problems were uncovered by only one of the two teams, ruling out the possibility of the five-participant rule applying in this case.

The Evaluator Effect

The evaluator effect (Hornbaek & Frokjaer, 2008; Jacobsen, Hertzum, & John, 1998; Vermeeren, van Kesteren, & Bekker, 2003) in usability testing refers to the finding that different UX professionals identify different sets of usability issues. In other words, there is little agreement or overlap in the usability issues identified by different UX professionals. The evaluator effect has been observed consistently in the CUE studies led by Rolf Molich (http://www.dialogdesign.dk/CUE.html). Most recently, CUE-9 (Molich, 2011) focused on the evaluator effect. Most of the 34 test team leaders in CUE-9 were confident that they found the most significant usability issues. However, there was little overlap in the issues. Furthermore, the test teams felt that running more participants would not help them identify more usability issues.

How do we reconcile this finding in the context of the recommended number of participants? It is easy for UX professionals to say that they found most of the usability issues after testing with 5–10 participants. In fact, they are usually very confident. But how do they really know unless they compare their findings to those of another UX professional? The fact is, they don’t. It is quite possible that additional usability issues, often significant ones, may be uncovered with an independent assessment by another UX professional.

5.7.3 Our Recommendation

We recommend maintaining flexibility regarding sample sizes in usability tests. We feel it may be acceptable to test with 5–10 participants, using one UX team when the following conditions are met:

• It is OK to miss some of the major usability issues. You are more interested in capturing some of the big issues, iterating the design, and retesting. Any improvement is welcome.

• There is only one distinct user group that you believe will think about the design and tasks in a very similar way.

• The scope of the design is limited. There are a small number of screens and/or tasks.

We recommend increasing the number of participants and/or the number of UX teams when the following conditions apply:

• You must capture as many UX issues as possible. In other words, there will be significant negative repercussions if you miss any of the major usability issues.

• There is more than one distinct user group.

• The scope of the design is large. In this case we would recommend a broader set of tasks.

We fully realize that not everyone has access to multiple UX researchers. In this case, try to solicit feedback from any other observers. No one can see everything. Also, you might want to acknowledge that some of the major usability issues might not have been identified.

5.8 Summary

Many usability professionals make their living by identifying usability issues and by providing actionable recommendations for improvement. Providing metrics around usability issues is not commonly done, but it can be incorporated easily into anyone’s routine. Measuring usability issues helps you answer some fundamental questions about how good (or bad) the design is, how it is changing with each design iteration, and where to focus resources to remedy the outstanding problems. You should keep the following points in mind when identifying, measuring, and presenting usability issues.

1. The easiest way to identify usability issues is during an in-person lab study, but it can also be done using verbatim comments in an automated study. The more you understand the domain, the easier it will be to spot the issues. Having multiple observers is very helpful in identifying issues.

2. When trying to figure out whether an issue is real, ask yourself whether there is a consistent story behind the user’s thought process and behavior. If the story is reasonable, then the issue is likely to be real.

3. The severity of an issue can be determined in several ways. Severity always should take into account the impact on the user experience. Additional factors, such as frequency of use, impact on the business, and persistence, may also be considered. Some severity ratings are based on a simple high/medium/low rating system. Other systems are number based.

4. Some common ways to measure usability issues are measuring the frequency of unique issues, the percentage of participants who experience a specific issue, and the frequency of issues for different tasks or categories of issue. Additional analysis can be performed on high-severity issues or on how issues change from one design iteration to another.

5. When identifying usability issues, questions about consistency and bias may arise. Bias can come from many sources, and there can be a general lack of agreement on what constitutes an issue. Therefore, it’s important to work collaboratively as a team, focusing on high-priority issues, and to understand how different sources of bias impact conclusions. Maximizing task coverage and including an additional UX team may be key.
