Chapter 8

Combined and Comparative Metrics

Usability data are building blocks. Each piece of usability data can be used to create new metrics. Raw usability data can include task completion rates, time on task, or self-reported ease of use. All of these usability data can be used to derive new metrics that were not available previously, such as an overall usability metric or some type of “usability scorecard.” Why might you want to do this? We think the most compelling reason is to have an easy-to-understand score or summary of all the metrics you’ve collected in a study. This can be very handy when presenting to senior managers, for tracking changes across iterations or releases, and for comparing different designs.

Two of the common ways to derive new usability metrics from existing data are by (1) combining more than one metric into a single usability measure and (2) comparing existing usability data to expert or ideal results. Both methods are reviewed in this chapter.

8.1 Single Usability Scores

In many usability tests, you collect more than one metric, such as task completion rate, task time, and perhaps a self-reported metric such as a system usability scale (SUS) score. In most cases, you don’t care so much about the results for each of these metrics individually as you do about the total picture of the user experience as reflected by all of these metrics. This section covers the various ways you can combine or represent different metrics to get an overall view of the usability of a product, or different aspects of a product, perhaps as revealed by different tasks.

The most common question asked after a usability test is “How did it do?” People who ask this question (often the product manager, developer, or other members of the project team) usually don’t want to hear about task completion rates, task times, or questionnaire scores. They want an overall score of some type: Did it pass or fail? How did it do in comparison to the last round of usability testing? Making these kinds of judgments in a meaningful way involves combining the metrics from a usability test into some type of overall score. The challenge is figuring out how to combine scores from different scales in a meaningful way (e.g., task completion rates in percentages and task times in minutes or seconds).

8.1.1 Combining Metrics Based on Target Goals

Perhaps the easiest way to combine different metrics is to compare each data point to a target goal and represent one single metric based on the percentage of users who achieved a combined set of goals. For example, assume that the goal is for users to complete at least 80% of their tasks successfully in no more than 70 seconds each on the average. Given that goal, consider the data in Table 8.1, which shows task completion rate and average time per task for each of eight participants in a usability test.

Table 8.1

Sample task completion and task time data from eight participants.

[Table not shown.] Also shown are averages for task completion and time and an indication of whether each participant met the objective of completing at least 80% of the tasks in no more than 70 seconds.

Table 8.1 shows some interesting results. The average values for task completion (82%) and task time (67 seconds) would seem to indicate that the goals for this test were met. Even if you look at the number of users who met the task completion goal (six participants, or 75%) or the task time goal (five participants, or 62%), you would still find the results reasonably encouraging. However, the most appropriate way to look at the results is to see if each individual participant met the stated goal (i.e., the combination of completing at least 80% of the tasks in no more than 70 seconds each). It turns out, as shown in the last column of Table 8.1, that only three, or 38%, of the participants actually met the goal. This demonstrates the importance of looking at individual participant data rather than just looking at averages. This can be particularly true when dealing with relatively small numbers of participants.

This method of combining metrics based on target goals can be used with any set of metrics. The only real decision is what target goals to use. Target goals can be based on business goals and/or comparison to ideal performance. The math is easy (each person just gets a 1 or 0), and the interpretation is easy to explain (the percentage of users who had an experience that met the stated goal during the test).
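If you have the raw data in a spreadsheet or script, this calculation takes only a few lines. Here is a minimal Python sketch, using made-up completion rates and average task times (not the Table 8.1 data) and the same goals of at least 80% task completion in no more than 70 seconds:

# Percentage of participants who met BOTH goals: completion >= 80%
# and average task time <= 70 seconds. Values are illustrative only.
completion = [0.85, 0.90, 0.75, 0.80, 0.95, 0.70, 0.85, 0.75]  # proportion of tasks completed
avg_time   = [68, 72, 60, 65, 71, 80, 66, 75]                  # average seconds per task

met_goal = [c >= 0.80 and t <= 70 for c, t in zip(completion, avg_time)]
pct_met = 100 * sum(met_goal) / len(met_goal)
print(f"{pct_met:.0f}% of participants met the combined goal")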

8.1.2 Combining Metrics Based on Percentages

Although we’re well aware that we should have measurable target goals for our usability tests, in practice many of us don’t have them. So what can you do to combine different metrics when you don’t have target goals? One simple technique for combining scores on different scales is to convert each score to a percentage and then average them. For example, consider the data in Table 8.2, which shows results of a usability test with 10 participants.

Table 8.2

Sample data from a usability test with 10 participants.

[Table not shown.] Time per task is the average time to complete each task, in seconds. Tasks completed are the number of tasks (out of 15) that the user completed successfully. Rating is the average of a five-point task ease rating for each task, where higher is better.

One way to get an overall sense of the results from this study is to first convert each of these metrics to a percentage. In the case of the number of tasks completed and the subjective rating, it’s easy because we know the maximum (“best”) possible value for each of those scores: there were 15 tasks, and the maximum possible subjective rating on the scale was 4. So we just divide the score obtained for each participant by the corresponding maximum to get the percentage.

In the case of time data, it’s a little trickier, as there’s not a predefined “best” or “worst” time—the ends of the scale are not known beforehand. One way of handling this would be to have several experts do the task and treat the average of their times as the “best” time. Another way is to treat the fastest time obtained from the participants in the study as the “best” (25 seconds, in this example), the slowest time as the “worst” (70 seconds, in this example), and then express other times in relation to those. Specifically, you divide the difference between the longest time and the observed time by the difference between longest and shortest times. This way, the shortest time becomes 100% and the longest becomes 0%. Using that method of transforming the data, you get the percentages shown in Table 8.3.

We’re grateful to David Juhlin, of the Bentley University Design and Usability Center, for suggesting this transformation of time data. In the first edition of this book we used a different method, which resulted in a nonlinear transformation. This new approach is linear and more appropriate.

Transforming Time Data in Excel

Here are the steps for transforming time data to percentages using these rules in Excel:

1. Enter the raw times into a single column in Excel. For this example, we will assume they are in column “A” and that you started on row “1”. Make sure there are no other values in this column, such as an average at the bottom.

2. In the cell to the right of the first time, enter the formula:

=(MAX(A:A)-A1)/(MAX(A:A)-MIN(A:A))

3. Copy this formula down as many rows as there are times to be transformed.

Table 8.3

Data from Table 8.2 transformed to percentages.

[Table not shown.] For task completion data, the score was divided by 15. For rating data, the score was divided by 4. For time data, the difference between the longest time (70) and the observed time was divided by the difference between longest (70) and shortest (25) times.

Table 8.3 also shows the average of these percentages for each of the participants. If any one participant had completed all the tasks successfully in the shortest average time and had given the product a perfect score on the subjective rating scales, that person’s average would have been 100%. However, if any one participant had failed to complete any of the tasks, had taken the longest time per task, and had given the product the lowest possible score on the subjective rating scales, that person’s average would have been 0%. Of course, rarely do you see either of those extremes. Like the sample data in Table 8.3, most participants fall between those two extremes. In this case, averages range from a low of 28% (Participant 4) to a high of 85% (Participant 9), with an overall average of 58%.
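To make the arithmetic concrete, here is a short Python sketch that applies these transformations to made-up data in the same form as Table 8.2 (tasks completed out of 15, average time in seconds, and a rating on a 0–4 scale) and then averages the three percentages with equal weight:

# Illustrative raw scores for four participants (not the actual Table 8.2 data)
tasks_completed = [12, 15, 9, 14]       # out of 15 tasks
avg_time        = [45, 25, 70, 38]      # average seconds per task
rating          = [3.1, 3.8, 2.0, 3.5]  # 0-4 scale, higher is better

t_min, t_max = min(avg_time), max(avg_time)

scores = []
for done, time, rate in zip(tasks_completed, avg_time, rating):
    pct_done   = done / 15                          # known maximum: 15 tasks
    pct_rating = rate / 4                           # known maximum rating: 4
    pct_time   = (t_max - time) / (t_max - t_min)   # fastest -> 100%, slowest -> 0%
    scores.append((pct_done + pct_rating + pct_time) / 3)  # equal-weight average

overall = sum(scores) / len(scores)
print(f"Overall usability score: {overall:.0%}")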

Calculating Percentages Across Iterations or Designs

One of the valuable uses of this kind of overall score is in making comparisons across iterations or releases of a product or across different designs. But it’s important to do the transformation across all of the data at once, not separately for each iteration or design. This is particularly important for time data, where the times that you’ve collected are determining the best and worst times. That selection of the best and worst times should be done by looking across all of the conditions, iterations, or designs that you want to compare.

So if you had to give an “overall score” to the product whose test results are shown in Tables 8.2 and 8.3, you could say it got 58% overall. Most people wouldn’t be too happy with 58%. Many years of grades from school have probably conditioned most of us to think of a percentage that low as a “failing grade.” But you should also consider how accurate that percentage is. Because it’s an average based on individual scores from 10 different participants, you can construct a confidence interval for that average, as explained in Chapter 2. The 90% confidence interval in this case is ±11%, meaning that the confidence interval extends from 47 to 69%. Running more participants would probably give you a more accurate estimate of this value, whereas running fewer would probably have made it less accurate.

One thing to be aware of is that when we averaged the three percentages together (from task completion data, task time data, and subjective ratings), we gave equal weight to each of those measures. In many cases, that’s a perfectly reasonable thing to do, but sometimes the business goals of the product may indicate a different weighting. In this example, we’re combining two performance measures (task completion and task time) with one self-reported measure (rating). By giving equal weight to each, we’re actually giving twice as much weight to performance as to the self-reported measure. That can be adjusted by using weights in calculating the averages, as shown in Table 8.4.

Table 8.4

Calculation of weighted averages.

[Table not shown.] Each individual percentage is multiplied by its associated weight, these products are summed, and that sum is divided by the sum of the weights (4, in this example).

In Table 8.4, the subjective rating is given a weight of 2, and each of the two performance measures is given a weight of 1. The net effect is that the subjective rating gets as much weight in the calculation of the average as the two performance measures together. The result is that these weighted averages for each participant tend to be closer to the subjective ratings than the equal-weight averages in Table 8.3. The exact weights you use for any given product should be determined by the business goals for the product. For example, if you’re testing a website for use by the general public, and the users have many other competitors’ websites to choose from, you might want to give more weight to self-reported measures because you probably care more about the users’ perception of the product than anything else.

However, if you’re dealing with an application where speed and accuracy are more important, such as a stock-trading application, you would probably want to give more weight to performance measures. You can use any weights that are appropriate for your situation, but remember to divide by the sum of those weights in calculating the weighted average.

These basic principles apply to transforming any set of metrics from a usability test. For example, consider the data in Table 8.5, which includes number of tasks completed successfully (out of 10), number of web page visits, an overall satisfaction rating, and an overall usefulness rating.

Table 8.5

Sample data from a usability test with nine participants.

[Table not shown.] Tasks completed are the number of tasks (out of 10) that the user completed successfully. Number of page visits is the total number of web pages that the user visited in attempting the tasks. (Typically, each revisit to the same page is counted as another visit.) The two ratings are average subjective ratings of satisfaction and usefulness, each on a seven-point scale (0–6).

Calculating percentages from these scores is very similar to the previous example. The number of tasks completed is divided by 10, and the two subjective ratings are each divided by 6 (the maximum rating). The other metric, number of web page visits, is somewhat analogous to the time metric in the previous example. But in the case of web page visits, it is usually possible to calculate the minimum number of page visits that would be required to accomplish the tasks. In this example, it was 20. You can then transform the number of page visits by dividing 20 (the fewest possible) by the actual number of page visits. The closer the number of page visits is to 20, the closer the percentage will be to 100%. Table 8.5 shows original values, percentages, and then equal-weight averages. In this case, note that equal weighting (normal average) results in the same weight being given to performance data (task completion and page visits) and self-reported data (the two ratings).

Converting Ratings to Percentages

What if the subjective ratings you used were on a scale that started at 1 instead of 0? Would that make a difference in how you transform the ratings to a percentage? Most definitely. Let’s assume the ratings were on a scale of 1–7 instead of 0–6, with higher numbers being better. Both are seven-point scales. In both cases, you want the lowest possible rating to become 0% and the highest possible rating to become 100%. When the ratings are on a 0–6 scale, simply dividing each rating by 6 (the highest possible rating) gives the desired range (0 to 100%). But when the ratings are on a 1–7 scale, there’s a problem. If you divide each rating by 7 (the highest possible rating), you get a maximum score of 100%, which is okay, but the minimum score is 1/7, or 14%, not the 0% that you want. The solution is to first subtract 1 from each rating (rescaling it to 0–6) and then divide by the new maximum score (6, in this case). So, the lowest score becomes (1−1)/6, or 0%, and the highest becomes (7−1)/6, or 100%.

To look at transforming another set of metrics, consider the data in Table 8.6. In this case, the number of errors is listed, which would include specific errors the users made, such as data-entry errors. Obviously, it is possible (and desirable) for a user to make no errors, so the minimum possible is 0. But there’s usually no predefined maximum number of errors that a user could make. In a case like this, the best way to transform the data is to divide the number of errors obtained by the maximum number of errors and then subtract from 1. In this example, the maximum is 5, the number of errors made by participant 4. This is how the error percentages in Table 8.6 were obtained. If any user had no errors (optimum), their percentage would be 100%. The percentage for the user(s) with the highest number of errors would be 0%. Note that in calculating any of these percentages, we always want higher percentages to be better—to reflect better usability. So, in the case of errors, it makes more sense to think of the resulting percentage as an “accuracy” measure.

Watch out for Outliers

When transforming any data where you’re letting observed values determine the minimum or maximum (e.g., times or errors), you need to be particularly cautious about outliers. For example, in the data shown in Table 8.6, what if Participant #4 had made 20 errors instead of 5? The net effect would have been that his transformed percentage would still have been 0% but all of the others would have been pushed much higher. One of the standard ways of detecting outliers is by calculating the mean and standard deviation of all your data and then considering any values more than twice or three times the standard deviation away from the mean as outliers. (Most people use twice the standard deviation, but if you want to be really conservative, use three times.) For the purpose of transforming data, those outliers should be excluded. In this modified example, the mean plus twice the standard deviation of the number of errors is 14.2, while the mean plus three times the standard deviation is 19.5. By either criterion, you should treat 20 errors as an outlier and exclude it.
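Here is a brief Python sketch of that outlier check, using the two-standard-deviations criterion (swap in 3 for the more conservative cutoff) and illustrative error counts that include one suspect value of 20:

import statistics

errors = [2, 0, 1, 20, 3, 1, 0, 2, 4, 1, 0, 2]  # illustrative counts; 20 is the suspect value

mean = statistics.mean(errors)
sd = statistics.stdev(errors)      # sample standard deviation
cutoff = mean + 2 * sd             # use mean + 3 * sd to be more conservative

kept = [e for e in errors if e <= cutoff]      # exclude outliers before transforming
excluded = [e for e in errors if e > cutoff]
print(f"cutoff = {cutoff:.1f}, excluded: {excluded}")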

Table 8.6

Sample data from a usability test with 12 participants.

[Table not shown.] Tasks completed are the number of tasks (out of 10) that the user completed successfully. Number of errors is the number of specific errors that the user made, such as data-entry errors. Satisfaction rating is on a scale of 0 to 6.

When transforming any usability metric to a percentage, the general rule is to first determine the minimum and maximum values that the metric can possibly have. In many cases this is easy; they are predefined by the conditions of the usability test. Here are the various cases you might encounter, followed by a short code sketch that pulls them together.

• If the minimum possible score is 0 and the maximum possible score is 100 (e.g., a SUS score), then you’ve basically already got a percentage. Just divide by 100 to make it a true percentage.

• In many cases, the minimum is 0 and the maximum is known, such as the total number of tasks or the highest possible rating on a rating scale. In that case, simply divide the score by the maximum to get the percentage. (This is why it’s generally easier to code rating scales starting with 0 as the worst value.)

• In some cases, the minimum is 0 but the maximum is not known, such as the example of errors. In that situation, the maximum would need to be defined by the data—the highest number of errors any participant made. Specifically, the number of errors would be transformed by dividing the number of errors obtained by the maximum number of errors any participant made and subtracting that from 1.

• Finally, in some cases, neither minimum nor maximum possible scores are predefined, as with time data. In this case, you can use your data to determine the minimum and maximum values. Assuming higher values are worse, as is the case with time data, you would divide the difference between the highest value and the observed value by the difference between the highest and the lowest values.
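Here is a minimal Python sketch of a general transformation function covering these cases. It is only an illustration; the function name and defaults are our own, and it assumes the minimum and maximum are taken from the data whenever they are not supplied:

def to_percentage(values, minimum=None, maximum=None, higher_is_better=True):
    """Rescale raw scores so that 0.0 is the worst possible value and 1.0 is
    the best. If the minimum or maximum is not predefined, take it from the data."""
    lo = min(values) if minimum is None else minimum
    hi = max(values) if maximum is None else maximum
    if higher_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]

# Examples with illustrative data:
print(to_percentage([12, 15, 9], minimum=0, maximum=15))            # tasks completed out of 15
print(to_percentage([3, 0, 5], minimum=0, higher_is_better=False))  # errors: fewer is better
print(to_percentage([45, 25, 70], higher_is_better=False))          # times: min and max from the data

Passing higher_is_better=False handles the scale reversal discussed in the sidebar that follows.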

What If Higher Numbers Are Worse?

Although higher numbers are better in cases such as task success rates, in other cases they’re worse, such as time or errors. Higher numbers could also be worse in a rating scale if it was defined that way (e.g., 0–6, where 0 = Very Easy and 6 = Very Difficult). In any of these cases, you must reverse the scale before averaging these percentages with other percentages where higher numbers are better. For example, with the rating scale just shown, you would subtract each value from 6 (the maximum) to reverse the scale. So 0 becomes 6 and 6 becomes 0.

8.1.3 Combining Metrics Based on Z Scores

Another technique for transforming scores on different scales so that they can be combined is using z scores. (See, for example, Martin & Bateson, 1993, p. 124.) These are based on the normal distribution and indicate how many units above or below the mean of the distribution any given value is. When you transform a set of scores to their corresponding z scores, the resulting distribution by definition has a mean of 0 and standard deviation of 1. This is the formula for transforming any raw score to its corresponding z score:

z = (x − μ) / σ

where x is the score to be transformed, μ is the mean of the distribution of those scores, and σ is the standard deviation of the distribution of those scores.

This transformation can also be done using the “=STANDARDIZE” function in Excel. Data in Table 8.2 could also be transformed using z scores, as shown in Table 8.7.

Excel Tip

Step-by-Step Guide to Calculating z Scores

Here are the steps for transforming any set of raw scores (times, percentages, clicks, whatever) into z scores:

1. Enter raw scores into a single column in Excel. For this example, we will assume they are in column “A” and that you started on row “1”. Make sure there are no other values in this column, such as an average at the bottom.

2. In the cell to the right of the first raw score, enter the formula:

=STANDARDIZE(A1, AVERAGE(A:A), STDEV(A:A))

3. Copy this “standardize” formula down as many rows as there are raw scores.

4. As a double check, calculate the mean and standard deviation for this z-score column. The average should be 0, and the standard deviation should be 1 (both within rounding error).

Table 8.7

Sample data from Table 8.2 transformed using z scores.

[Table not shown.] For each original score, the z score was determined by subtracting the mean of the score’s distribution from it and then dividing by the standard deviation. This z score tells you how many standard deviations above or below the mean that score is. Since all the scales need to have higher numbers meaning better, the scale of the z scores for times is reversed by multiplying by (–1).

The bottom two rows of Table 8.7 show the mean and standard deviation for each set of z scores, which should always be 0 and 1, respectively. Note that in using z scores, we didn’t have to make any assumptions about the maximum or minimum values that any of the scores could have. In essence, we let each set of scores define its own distribution and rescale them so those distributions would each have a mean of 0 and standard deviation of 1. In this way, when they are averaged together, each of the z scores makes an equal contribution to the average z score. Note that when averaging the z scores together, each of the scales must be going the same direction—in other words, higher values should always be better. In the case of time data, the opposite is almost always true. Since z scores have a mean of 0, this is easy to correct simply by multiplying the z score by (–1) to reverse its scale.
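If you prefer code to Excel, here is a minimal Python sketch of the same idea, with the time scale reversed so that higher z scores always mean better performance (the data are made up):

import statistics

def z_scores(values, reverse=False):
    """Convert raw scores to z scores; reverse=True multiplies by -1 for
    metrics where lower raw values are better, such as task time."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    sign = -1 if reverse else 1
    return [sign * (v - mean) / sd for v in values]

tasks_completed = [12, 15, 9, 14, 11]
avg_time        = [45, 25, 70, 38, 52]
rating          = [3.1, 3.8, 2.0, 3.5, 2.9]

z_done = z_scores(tasks_completed)
z_time = z_scores(avg_time, reverse=True)
z_rate = z_scores(rating)

# Equal-weight combination for each participant
combined = [(d + t + r) / 3 for d, t, r in zip(z_done, z_time, z_rate)]
print([round(z, 2) for z in combined])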

If you compare the z-score averages in Table 8.7 to the percentage averages in Table 8.3, you will find that the ordering of the participants based on those averages is nearly the same: Both techniques yield the same top three participants (9, 5, and 3) and the same bottom three participants (4, 8, and 1).

One disadvantage of using z scores is that you can’t think of the overall average of the z scores as some type of overall usability score, as by definition that overall average will be 0. So when would you want to use z scores? They mainly are useful when you want to compare one set of data to another, such as data from iterative usability tests of different versions of a product, data from different groups of users in the same usability test, or data from different conditions or designs within the same usability test. You should also have a reasonable sample size (e.g., at least 10 participants per condition) to use the z-score method.

For example, consider the data shown in Figure 8.1 from Chadwick-Dias, McNulty, and Tullis (2003), which shows z scores of performance for two iterations of a prototype. This research studied the effects of age on performance in using a website. Study 1 was a baseline study. Based on their observations of the participants in Study 1, especially the problems encountered by the older participants, they made changes to the prototype and then conducted Study 2 with a new group of participants. The z scores were equal-weighted combinations of task time and task completion rate.


Figure 8.1 Data showing performance z scores from two studies of a prototype with participants over a wide range of ages. The performance z score was an equal-weighted combination of task time and task completion rate. Changes were made to the prototype between Study 1 and Study 2. The performance z scores were significantly better in Study 2, regardless of the participant’s age. Adapted from Chadwick-Dias et al. (2003); used with permission.

It’s important to understand that the z-score transformations were done using the full set of data from Study 1 and Study 2 combined. They were then plotted appropriately to indicate from which study each z score was derived. The key finding was that the performance z scores for Study 2 were significantly higher than the performance z scores for Study 1; the effect was the same regardless of age (as reflected by the fact that the two lines are parallel to each other). If the z-score transformations had been done separately for Study 1 and Study 2, the results would have been meaningless because the means for Study 1 and Study 2 would both have been forced to 0 by the transformations.

8.1.4 Using the Single Usability Metric (SUM)

Jeff Sauro and Erika Kindlund (2005) developed a quantitative model for combining usability metrics into a single usability score. Their focus is on task completion, task time, error counts per task, and post-task satisfaction rating (similar to ASQ described in Chapter 6). Note that all of their analyses are at the task level, whereas the previous sections have described analyses at the “usability test” level. At the task level, task completion is typically a binary variable for each participant: that person either completed the task successfully or did not. At the usability-test level, task completion, as shown in previous sections, indicates how many tasks each person completed, and it can be expressed as a percentage for each participant.

Sauro and Kindlund used techniques derived from Six Sigma methodology (e.g., Breyfogle, 1999) to standardize their four usability metrics (task completion, time, errors, and task rating) into a single score, the Single Usability Metric (SUM). Conceptually, their techniques are not that different from the z score and percentage transformations described in the previous sections. In addition, they used Principal Components Analysis, a statistical technique that looks at correlations between variables, to determine if all four of their metrics were contributing significantly to the overall calculation of the single metric. They found that all four were significant and, in fact, that each contributed about equally. Consequently, they decided that each of the four metrics (once standardized) should contribute equally to the calculation of the SUM score.

An online tool for entering data from a usability test and calculating the SUM score is available from Jeff Sauro’s “Usability Scorecard” website at http://www.usabilityscorecard.com/. For each task and each participant in the usability test, you must enter the following:

• Whether the participant completed the task successfully (0 or 1).

• Number of errors committed on that task by that participant. (You also specify the number of error opportunities for each task.)

• Task time in seconds for that participant.

• Post-task satisfaction rating, which is an average of three post-task ratings on five-point scales of task ease, satisfaction, and perceived time—similar to ASQ.

After entering these data for all the tasks, the tool standardizes the scores and calculates the SUM score for each task. Standardized data shown for each task are illustrated in Table 8.8. Note that a SUM score is calculated for each task, which allows for overall comparisons of tasks. In these sample data, participants did best on the “Cancel reservation” task and worst on the “Check restaurant hours” task. An overall SUM score, 68% in this example, is also calculated, as is a 90% confidence interval (53 to 88%), which is the average of the confidence intervals of the SUM score for each task.

Table 8.8

Sample standardized data from a usability test.

[Table not shown.] After entering data for each participant and each task, these are the standardized scores calculated by SUM, including an overall SUM score and a confidence interval for it.

The online tool also provides the option to graph task data from a usability study, including the SUM scores. Figure 8.2 shows a sample graph from the tool.


Figure 8.2 Sample graph of SUM scores from http://www.usabilityscorecard.com/. The tasks of this usability test are listed down the left. For each task, the orange circle shows the mean SUM score and bars show the 90% confidence interval for each. In this example, it’s apparent that the “Reconcile Accounts” and “Manage Cash-Flow” tasks are the most problematic.

8.2 Usability Scorecards

An alternative to combining different metrics to derive an overall usability score is to present the results of the metrics graphically in a summary chart. This type of chart is often called a Usability Scorecard. The goal is to present data from the usability test in such a way that overall trends and important aspects of the data can be detected easily, such as tasks that were particularly problematic for the users. If you only have two metrics that you’re trying to represent, a simple combination graph from Excel may be appropriate. For example, Figure 8.3 shows the task completion rate and task ease rating for each of 10 tasks in a usability test.


Figure 8.3 A sample combination column and line chart for 10 tasks. Task rating is shown via the columns and labeled on the right axis. Task success is shown via the lines and is labeled on the left axis.

The combination chart in Figure 8.3 has some interesting features. It clarifies which tasks were the most problematic for the participants (Tasks 4 and 8) because they have the lowest values on both scales. It’s also obvious where there were significant disparities between task success data and task ease ratings, such as Tasks 9 and 10, which had only moderate task completion rates but the highest task ratings. (This is an especially troubling finding because it might indicate that some of the users did not complete the task successfully but thought they did.) Finally, it’s easy to distinguish the tasks that had reasonably high values for both metrics, such as Tasks 3, 5, and 6.

How to Create a Combination Chart in Excel

Older versions of Excel made it easy to create this type of combination chart with two axes, but it’s a bit more challenging in the newer versions (2007 and higher). Here’s what you do:

1. Enter your data into two columns in the spreadsheet (e.g., one column for task success and the other for task rating). Create a column chart like you normally would for both variables. This will look strange because the two variables will be plotted on the same axis, with one scale overshadowing the other greatly.

2. Right-click on one of the columns in the chart and choose “Format Data Series.” In the resulting dialog box, choose “Series Options.” In the “Plot Series On” area, choose “Secondary Axis.”

3. Close that dialog box. The chart will still look odd because now the two columns are on top of each other.

4. Right click on a column being charted on the primary (left) axis and select “Change Series Chart Type.”

5. Change that variable to a line graph. Close that dialog box.

(Yes, we know this type of combination chart breaks the rule about only using line graphs for continuous data, like times. But you have to break the rule to make it work in Excel. And rules are made to be broken anyway!)
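If you would rather build this kind of two-axis chart in code, here is a rough matplotlib sketch of the same idea, offered as an alternative to the Excel steps above (the data are made up, not those of Figure 8.3):

import matplotlib.pyplot as plt

tasks        = [f"Task {i}" for i in range(1, 11)]
task_rating  = [3.2, 3.8, 4.1, 2.1, 4.0, 3.9, 3.0, 2.3, 4.4, 4.5]            # columns, right axis
task_success = [0.70, 0.85, 0.90, 0.40, 0.95, 0.88, 0.65, 0.45, 0.60, 0.62]  # line, left axis

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()                     # secondary axis, like Excel's "Secondary Axis"

ax_right.bar(tasks, task_rating, color="lightgray")
ax_left.plot(tasks, task_success, marker="o", color="steelblue")

ax_left.set_zorder(ax_right.get_zorder() + 1)  # keep the line in front of the columns
ax_left.patch.set_visible(False)               # so the left axes' background doesn't hide the bars

ax_left.set_ylabel("Task success")
ax_left.set_ylim(0, 1)
ax_right.set_ylabel("Task rating (0-5)")
ax_right.set_ylim(0, 5)
plt.title("Task success and task ease rating by task")
plt.tight_layout()
plt.show()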

This type of combination chart works well if you have only two metrics to represent, but what if you have more? One way of representing summary data for three or more metrics is using radar charts (which were also illustrated in Chapter 6). Figure 8.4 shows an example of a radar chart for summarizing the results of a usability test with five factors: task completion, page visits, accuracy (lack of errors), satisfaction rating, and usefulness rating. In this example, although task completion, accuracy, and usefulness rating were relatively high (good), the page visits and satisfaction rating were relatively low (poor).


Figure 8.4 A sample radar chart summarizing task completion, page visits, accuracy (lack of errors), satisfaction rating, and usefulness rating from a usability test. Each has been transformed to a percentage using the techniques outlined earlier in this chapter.

Although radar charts can be useful for a high-level view, it’s not really possible to represent task-level information in them. The example in Figure 8.4 averaged data across the tasks. What if you want to represent summary data for three or more metrics but also maintain task-level information? One technique for doing that is using what are called Harvey Balls. A variation on this technique has been popularized by Consumer Reports. For example, consider the data shown earlier in Table 8.8, which presents the results for six tasks in a usability test, including task completion, time, satisfaction, and errors. These data could be summarized in a comparison chart as shown in Figure 8.5. This type of comparison chart allows you to see at a glance how the participants did for each of the tasks (by focusing on the rows) or how the participants did for each of the metrics (by focusing on the columns).

What Are Harvey Balls?

Harvey Balls are small, round pictograms used typically in a comparison table to represent values for different items. [Pictograms not shown.]

They’re named for Harvey Poppel, a Booz Allen Hamilton consultant who created them in the 1970s as a way of summarizing long tables of numeric data. There are five levels, progressing from an open circle to a completely filled circle. Typically, the open circle represents the worst values, and the completely filled circle represents the best values. Links to images of Harvey Balls of different sizes can be found on our website, www.MeasuringUX.com. Harvey Balls shouldn’t be confused with Harvey Ball, who was the creator of the smiley face!


Figure 8.5 A sample comparison chart using data from Table 8.8. Tasks have been ordered by their SUM score, starting with the highest. For each of the four standardized scores (task completion, satisfaction, task time, and errors), the value has been represented by coded circles (known as Harvey Balls), as shown in the key.

8.3 Comparison to Goals and Expert Performance

Although the previous section focused on ways to summarize usability data without reference to an external standard, in some cases you may have an external standard that can be used for comparison. The two main flavors of an external standard are predefined goals and expert, or optimum, performance.

8.3.1 Comparison to Goals

Perhaps the best way to assess the results of a usability test is to compare those results to goals that were established before the test. These goals may be set at the task level or an overall level. Goals can be set for any of the metrics we’ve discussed, including task completion, task time, errors, and self-reported measures. Here are some examples of task-specific goals:

• At least 90% of representative users will be able to reserve a suitable hotel room successfully.

• Opening a new account online should take no more than 8 minutes on average.

• At least 95% of new users will be able to purchase their chosen product online within 5 minutes of selecting it.

Similarly, examples of overall goals could include the following:

• Users will be able to complete at least 90% of their tasks successfully.

• Users will be able to complete their tasks in less than 3 minutes each, on average.

• Users will give the application an average SUS rating of at least 80%.

Typically, usability goals address task completion, time, accuracy, and/or satisfaction. The key is that the goals must be measurable. You must be able to determine whether the data in a given situation supports the attainment of the goal. For example, consider the data in Table 8.9.

Table 8.9

Sample data from eight tasks showing target number of page visits and mean of actual number of page visits.

         Target # of Page Visits    Actual # of Page Visits
Task 1   5                          7.9
Task 2   8                          9.3
Task 3   3                          7.3
Task 4   10                         11.5
Task 5   4                          7.0
Task 6   6                          6.9
Task 7   9                          9.8
Task 8   7                          10.2

Table 8.9 shows data for eight tasks in a usability study of a website. For each task, a target number of page visits has been predetermined (ranging from 3 to 10). Figure 8.6 depicts the target and actual page views for each task graphically. This chart is useful because it allows you to visually compare the actual number of page visits for each task, and its associated confidence interval, to the target number of page views. In fact, all the tasks had significantly more page views than the targets. What’s perhaps not so obvious is how the various tasks performed relative to each other—in other words, which ones came out better and which ones worse. To make that kind of comparison easier, Figure 8.7 shows the ratio of the target to actual page views for each task. This can be thought of as a “page view efficiency” metric: the closer it is to 100%, the more efficient the participants were being. This makes it easy to spot tasks where the participants had trouble (e.g., Task 3) versus tasks where they did well (e.g., Task 7). This technique could be used to represent the percentage of participants who met any particular objective (e.g., time, errors, SUS rating) either at the task level or at the overall level.
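The "page view efficiency" numbers in Figure 8.7 are just the ratio of the two columns in Table 8.9, which is easy to compute directly. A quick Python sketch using those values:

# Target and actual mean page visits from Table 8.9
target = {"Task 1": 5, "Task 2": 8, "Task 3": 3, "Task 4": 10,
          "Task 5": 4, "Task 6": 6, "Task 7": 9, "Task 8": 7}
actual = {"Task 1": 7.9, "Task 2": 9.3, "Task 3": 7.3, "Task 4": 11.5,
          "Task 5": 7.0, "Task 6": 6.9, "Task 7": 9.8, "Task 8": 10.2}

efficiency = {task: target[task] / actual[task] for task in target}
for task, eff in sorted(efficiency.items(), key=lambda kv: kv[1]):
    print(f"{task}: {eff:.0%}")  # lowest efficiency (most trouble) listed first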


Figure 8.6 Target and actual number of page visits for each of eight tasks. Error bars represent the 90% confidence interval for the actual number of page visits.


Figure 8.7 Ratio of target to actual page views for each of the eight tasks.

8.3.2 Comparison to Expert Performance

An alternative to comparing the results of a usability test to predefined goals is to compare the results to the performance of experts. The best way to determine the expert performance level is to have one or more presumed experts actually perform the tasks and to measure the same things that you’re measuring in the usability test. Obviously your experts really need to be experts—people with subject-matter expertise, in-depth familiarity with the tasks, and in-depth familiarity with the product, application, or website being tested. And your data will be better if you can average the performance results from more than one expert. Comparing results of a usability test to results for experts allows you to compensate for the fact that certain tasks may be inherently more difficult or take longer, even for an expert. The goal, of course, is to see how close the performance of the participants in the test actually comes to the performance of the experts.

Although you could theoretically do a comparison to expert performance for any performance metric, it’s used most commonly for time data. With task success data, the usual assumption is that a true expert would be able to perform all the tasks successfully. Similarly, with error data, the assumption is that an expert would not make any errors. But even an expert would require some amount of time to perform the tasks. For example, consider the task time data shown in Table 8.10.

Table 8.10

Sample time data from 10 tasks in a usability test showing average actual time per task (in seconds), expert time per task, and ratio of expert to actual time.


Graphing the ratio of expert to actual times, as shown in Figure 8.8, makes it easy to spot tasks where the test participants did well in comparison to the experts (Tasks 3 and 9) and tasks where they did not do so well (Tasks 2 and 4).


Figure 8.8 Graph of the ratio of expert to actual times from Table 8.10.

8.4 Summary

Some of the key takeaways from this chapter are as follows.

1. An easy way to combine different usability metrics is to determine the percentage of users who achieve a combination of goals. This tells you the overall percentage of users who had a good experience with your product (based on the target goals). This method can be used with any set of metrics and is understood easily by management.

2. One way of combining different metrics into an overall “usability score” is to convert each of the metrics to a percentage and then average them together. This requires being able to specify, for each metric, an appropriate minimum and maximum value.

3. Another way to combine different metrics is to convert each metric to a z score and then average them together. Using z scores, each metric gets equal weight when they are combined. But the overall average of the z scores will always be 0. This metric is useful in comparing different subsets of the data to each other, such as data from different iterations, different groups, or different conditions.

4. The SUM technique is another method for combining different metrics, specifically task completion, task time, errors, and task-level satisfaction rating. The method requires entry of individual task and participant data for the four metrics. Calculations yield a SUM score, as a percentage, for each task and across all tasks, including confidence intervals.

5. Various types of graphs and charts can be useful for summarizing the results of a usability test in a “usability scorecard.” A combination line and column chart is useful for summarizing the results of two metrics for tasks in a test. Radar charts are useful for summarizing the results of three or more metrics overall. A comparison chart using Harvey Balls to represent different levels of the metrics can summarize effectively the results for three or more metrics at the task level.

6. Perhaps the best way to determine the success of a usability test is to compare the results to a set of predefined usability goals. Typically these goals address task completion, time, accuracy, and satisfaction. The percentage of users whose data met the stated goals can be a very effective summary.

7. A reasonable alternative to comparing to predefined goals, especially for time data, is to compare actual performance data to data for experts. The closer the actual performance is to expert performance, the better.
