Chapter 4

Performance Metrics


Anyone who uses technology has to interact with some type of interface to accomplish their goals. For example, a user of a website clicks on different links, a user of a word-processing application enters information via a keyboard, or a user of a video game system pushes buttons on a remote control or waves a controller in the air. No matter the technology, users are behaving or interacting with a product in some way. These behaviors form the cornerstone of performance metrics.

Every type of user behavior is measurable in some way. Behaviors that achieve a goal for a user are especially important to the user experience. For example, you can measure whether users clicking through a website (behavior) found what they were looking for (goal). You can measure how long it took users to enter and format a page of text properly in a word-processing application or how many buttons users pressed in trying to cook a frozen dinner in a microwave. All performance metrics are calculated based on specific user behaviors.

Performance metrics not only rely on user behaviors, but also on the use of scenarios or tasks. For example, if you want to measure success, the user needs to have specific tasks or goals in mind. The task may be to find the price of a sweater or submit an expense report. Without tasks, performance metrics aren’t possible. You can’t measure success if the user is only browsing a website aimlessly or playing with a piece of software. How do you know if he or she was successful? But this doesn’t mean that the tasks must be something arbitrary given to the users. They could be whatever the users came to a live website to do or something that the participants in a usability study generate themselves. Often we focus studies on key or basic tasks.

Performance metrics are among the most valuable tools for any usability professional. They’re the best way to evaluate the effectiveness and efficiency of many different products. If users are making many errors, you know there are opportunities for improvement. If users are taking four times longer to complete a task than what was expected, efficiency can be improved greatly. Performance metrics are the best way of knowing how well users are actually using a product.

Performance metrics are also useful in estimating the magnitude of a specific usability issue. Many times it’s not enough to know that a particular issue exists. You probably want to know how many people are likely to encounter the same issue after the product is released. For example, by calculating a success rate that includes a confidence interval, you can derive a reasonable estimate of how big a usability issue really is. By measuring task completion times, you can determine what percentage of your target audience will be able to complete a task within a specified amount of time. If only 20% of the target users are successful at a particular task, it should be fairly obvious that the task has a usability problem.

Senior managers and other key stakeholders on a project usually sit up and pay attention to performance metrics, especially when they are presented effectively. Managers will want to know how many users are able to complete a core set of tasks successfully using a product. They see these performance metrics as a strong indicator of overall usability and a potential predictor of cost savings or increases in revenue.

Performance metrics are not a magical elixir for every situation. As with other metrics, an adequate sample size is required. Although the statistics will work whether you have 2 or 100 users, the precision of your estimates (for example, the width of your confidence intervals) will change dramatically depending on the sample size. If you’re only concerned with identifying the lowest-hanging fruit, such as the most severe problems with a product, performance metrics are probably not a good use of time or money. But if you need a more fine-grained evaluation and have the time to collect data from 10 or more users, you should be able to derive meaningful performance metrics with reasonable confidence levels.

Avoid overrelying on performance metrics when your goal is simply to uncover basic usability problems. When reporting task success or completion time, it can be easy to lose sight of the underlying issues behind the data. Performance metrics tell the what very effectively but not the why. Performance data can point to tasks or parts of an interface that were particularly problematic for users, but they don’t identify the causes of the problems. You will usually want to supplement performance data with other sources, such as observational or self-reported data, to better understand why the problems occurred and how they might be fixed.

Five basic types of performance metrics are covered in this chapter.

1. Task success is perhaps the most widely used performance metric. It measures how effectively users are able to complete a given set of tasks. Two different types of task success are reviewed: binary success and levels of success. Of course you can also measure task failure.

2. Time on task is a common performance metric that measures how much time is required to complete a task.

3. Errors reflect the mistakes made during a task. Errors can be useful in pointing out particularly confusing or misleading parts of an interface.

4. Efficiency can be assessed by examining the amount of effort a user expends to complete a task, such as the number of clicks in a website or the number of button presses on a mobile phone.

5. Learnability is a way to measure how performance improves or fails to improve over time.

4.1 Task Success

The most common usability metric is task success, which can be calculated for practically any usability study that includes tasks. It’s almost a universal metric because it can be calculated for such a wide variety of things being tested—from websites to kitchen appliances. As long as the user has a reasonably well-defined task, you can measure success.

Task success is something that almost anyone can relate to. It doesn’t require elaborate explanations of measurement techniques or statistics to get the point across. If your users can’t complete their tasks, then you know something is wrong. Seeing users fail to complete a simple task can be pretty compelling evidence that something needs to be fixed.

To measure task success, each task that users are asked to perform must have a clear end state or goal, such as purchasing a product, finding the answer to a specific question, or completing an online application form. To measure success, you need to know what constitutes success, so you should define success criteria for each task prior to data collection. If you don’t predefine criteria, you run the risk of constructing a poorly worded task and not collecting clean success data. Here are examples of two tasks with clear and not-so-clear end states:

• Find the 5-year gain or loss for IBM stock (clear end state)

• Research ways to save for your retirement (not a clear end state)

Although the second task may be perfectly appropriate in certain types of usability studies, it’s not appropriate for measuring task success.

The most common way of measuring success in a lab-based usability test is to have the user articulate the answer verbally after completing the task. This is natural for the user, but it sometimes produces answers that are hard to interpret because users add extra or tangential information. In these situations, you may need to probe the users to make sure they actually completed the task successfully.

Another way to collect success data is by having users provide their answers in a more structured way, such as using an online tool or paper form. Each task might have a set of multiple-choice responses, with users choosing the correct answer from a list that also includes four or five distracters. It’s important to make the distracters as realistic as possible. Try to avoid write-in answers if possible: it’s much more time-consuming to analyze each write-in answer, and it may involve judgment calls, thereby adding more noise to the data.

In some cases the correct solution to a task may not be verifiable because it depends on the user’s specific situation, and testing is not being performed in person. For example, if you ask users to find the balance in their savings account, there’s no way to know what that amount really is unless you’re sitting next to them while they do it. So in this case, you might use a proxy measure of success. For example, you could ask users to identify the title of the page that shows their balance. This works well as long as the title of the page is unique and obvious and you’re confident that they could actually see the balance once they reached that page.

4.1.1 Binary Success

Binary success is the simplest and most common way of measuring task success. Users either completed a task successfully or they didn’t. It’s kind of like a “pass/fail” course in college. Binary success is appropriate to use when the success of the product depends on users completing a task or set of tasks. Getting close doesn’t count; the only thing that matters is whether users accomplish the goal of each task. For example, when evaluating the usability of a defibrillator device (to resuscitate people during a heart attack), the only thing that matters is being able to use it correctly, without making any mistakes, within a certain amount of time. Anything less would be a major problem, especially for the recipient! A less dramatic example might be a task that involves purchasing a book on a website. Although it may be helpful to know where in the process someone failed, if your company’s revenue depends on selling those books, completing the purchase is what really matters.

Each time users perform a task, they should be given a “success” or “failure” score. Typically, these scores are in the form of 1’s (for success) and 0’s (for failure). (Analysis is easier if you assign a numeric score rather than a text value of “success” or “failure.”) By using a numeric score, you can easily calculate the percent correct as well as other statistics you might need. Simply calculate the average of the 1’s and 0’s to determine the percent correct. Assuming you have more than one participant and more than one task, there are always two ways you can calculate task success:

• By looking at the average success rate for each task across the participants

• By looking at the average success rate for each participant across the tasks

As an example, consider the data in Table 4.1. Averages across the bottom represent the task success rates for each task. Averages along the right represent the success rates for each participant. As long as there are no missing data, the averages of those two sets of averages will always be the same.
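To make the arithmetic concrete, here is a minimal sketch in Python (the 1’s and 0’s are hypothetical, standing in for a table like Table 4.1):

    import numpy as np

    # Rows are participants, columns are tasks; 1 = success, 0 = failure.
    # Hypothetical data, not the actual values from Table 4.1.
    success = np.array([
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 1, 1, 1],
    ])

    by_task = success.mean(axis=0)         # success rate for each task
    by_participant = success.mean(axis=1)  # success rate for each participant

    print(by_task)                         # roughly [0.67 0.67 0.67 1.0]
    print(by_participant)                  # [0.75 0.75 0.75]
    print(by_task.mean(), by_participant.mean())  # both 0.75 when no data are missing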

Does Task Success Always Mean Factual Success?

The usual definition of task success is achieving some factually correct or clearly defined state. For example, if you’re using the NASA site to find who the Commander of Apollo 12 was, there’s a single factually correct answer (Charles “Pete” Conrad, Jr.). Or if you’re using an e-commerce site to purchase a copy of “Pride and Prejudice,” then purchasing that book would indicate success. But in some cases, perhaps what’s important is not so much reaching a factual answer or achieving a specific goal, but rather users being satisfied that they have achieved a certain state. For example, just before the 2008 U.S. presidential election, we conducted an online study comparing the websites of the two primary candidates, Barack Obama and John McCain. The tasks included things such as finding the candidate’s position on Social Security. Task success was measured by self-report only (Yes I Found It, No I Didn’t Find It, or I’m Not Sure), since for this kind of site the important thing is whether users believe they found the information they were looking for. Sometimes it can be very interesting to look at the correlation between perceived success and factual success.

Table 4.1

Task success data for 10 participants and 10 tasks.


The most common way to analyze and present binary success rates is by task. This involves simply presenting the percentage of participants who completed each task successfully. Figure 4.1 shows the task success rates for the data in Table 4.1. This approach is most useful when you want to compare success rates for each task. You can then do a more detailed analysis of each task by looking at the specific problems to determine what changes may be needed to address them. For example, Figure 4.1 shows that the “Find Category” and “Checkout” tasks appear to be problematic.

Types of Task Failure

There are many different ways in which a participant might fail a task, but they tend to fall into a few categories:

• Giving up—Participants indicate that they would not continue with the task if they were doing this on their own.

• Moderator “calls” it—The study moderator stops the task because it’s clear that the participant is not making any progress or is becoming especially frustrated.

• Too long—The participant completed the task but not within a predefined time period. (Certain tasks are only considered successful if they can be accomplished within a given time period.)

• Wrong—Participants thought that they completed the task successfully, but they actually did not (e.g., concluding that Neil Armstrong was the Commander of Apollo 12 instead of Pete Conrad). In many cases, these are the most serious kinds of task failures because the participants don’t realize they are failures. In the real world, the consequences of these failures may not become clear until much later (e.g., you intended to order a copy of “Pride and Prejudice” but are rather surprised when “Pride and Prejudice and Zombies” shows up in the mail several days later!)


Figure 4.1 Task success rates for the data in Table 4.1, including a 90% confidence interval for each task.

Another common way of looking at binary success is by user or type of user. As always when reporting usability data, you should be careful to maintain the anonymity of users in the study by using numbers or other nonidentifiable descriptors. The main value of looking at success data from a user perspective is that you can identify different groups of users who perform differently or encounter different sets of problems. Here are some of the common ways to segment different users:

• Frequency of use (infrequent users versus frequent users)

• Previous experience using the product

• Domain expertise (low-domain knowledge versus high-domain knowledge)

• Age group

Task success for different groups of participants is also used when each group is given a different design to work with. For example, participants in a usability study might be assigned randomly to use either Version A or Version B of a prototype website. A key comparison will be the average task success rate for participants using Version A versus those using Version B.

If you have a relatively large number of users in a usability study, it may be helpful to present binary success data as a frequency distribution (Figure 4.2). This is a convenient way to visually represent the variability in binary task success data. For example, in Figure 4.2, six users in the evaluation of the original website completed 61 to 70% of the tasks successfully, one completed fewer than 50%, and only two completed as many as 81 to 90%. In a revised design, six users had a success rate of 91% or greater, and no user had a success rate below 61%. Illustrating that the two distributions of task success barely overlap is a much more dramatic way of showing the improvement across the iterations than simply reporting the two means.


Figure 4.2 Frequency distributions of binary success rates from usability tests of the original version of a website and the redesigned version (data from LeDoux, Connor, & Tullis, 2005).
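If you want to build a frequency distribution like the one in Figure 4.2, a quick sketch (with hypothetical per-participant success rates and 10% bins) might look like this:

    import numpy as np

    # Hypothetical per-participant success rates (proportion of tasks completed)
    rates = np.array([0.45, 0.62, 0.65, 0.68, 0.70, 0.70, 0.75, 0.82, 0.88, 0.95])

    bins = np.arange(0.0, 1.01, 0.10)          # 0-10%, 11-20%, ..., 91-100%
    counts, _ = np.histogram(rates, bins=bins)

    for low, count in zip(bins[:-1], counts):
        print(f"{low:.0%}-{(low + 0.10):.0%}: {count} participant(s)")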

Calculating Confidence Intervals for Binary Success

One of the most important aspects of analyzing and presenting binary success is including confidence intervals. Confidence intervals are essential because they reflect your trust or confidence in the data. In most usability studies, binary success data are based on relatively small samples (e.g., 5 to 20 users). Consequently, the binary success metric may not be as reliable as we would like it to be. For example, if 4 out of 5 users completed a task successfully, how confident can we be that 80% of the larger population of users will be able to complete that task successfully? Obviously, we would be more confident if 16 out of 20 users completed the task successfully and even more confident if 80 out of 100 did.

Fortunately, there is a way to take this into account. Binary success rates are essentially proportions: the proportion of users who completed a given task successfully. For example, if 5 of the 10 participants completed a task, the success rate is 5/10 = 0.5. The appropriate way to calculate a confidence interval for a proportion like this is to use a binomial confidence interval. Several methods are available for calculating binomial confidence intervals, such as the Wald Method and the Exact Method. But as Sauro and Lewis (2005) have shown, many of those methods are too conservative or too liberal in their calculation of the confidence interval when dealing with the small sample sizes we commonly have in usability tests. They found that a modified version of the Wald Method, called the Adjusted Wald, yielded the best results when calculating a confidence interval for task success data.

Confidence Interval Calculator

Jeff Sauro has provided a very useful calculator for determining confidence intervals for binary success on his website http://www.measuringusability.com/wald. By entering the total number of people who attempted a given task and how many of them completed it successfully, this tool will perform the Wald, Adjusted Wald, Exact, and Score calculations of the confidence interval for the mean task completion rate automatically. You can choose to calculate a 99, 95, or 90% confidence interval. If you really want to calculate confidence intervals for binary success data yourself, the details are included on our website.

If 4 out of 5 users completed a given task successfully, the Adjusted Wald Method yields a 95% confidence interval for that task completion rate ranging from 36 to 98%—a rather large range! However, if 16 out of 20 users completed the task successfully (the same proportion), the Adjusted Wald Method yields a 95% confidence interval of 58 to 93%. If you really got carried away and ran a usability test with 100 participants, of whom 80 completed the task successfully, the 95% confidence interval would be 71 to 87%. As is almost always the case with confidence intervals, larger sample sizes yield smaller (i.e., more precise) intervals.
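If you prefer to compute the interval yourself, the following is a minimal sketch of the standard Adjusted Wald calculation (add z²/2 successes and z² trials, then compute an ordinary Wald interval); it approximately reproduces the intervals quoted above:

    from math import sqrt
    from statistics import NormalDist

    def adjusted_wald(successes, trials, confidence=0.95):
        """Adjusted Wald confidence interval for a binary task success rate."""
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
        n_adj = trials + z ** 2
        p_adj = (successes + z ** 2 / 2) / n_adj
        margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
        return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

    print(adjusted_wald(4, 5))     # roughly (0.36, 0.98)
    print(adjusted_wald(16, 20))   # roughly (0.58, 0.93)
    print(adjusted_wald(80, 100))  # roughly (0.71, 0.87)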

4.1.2 Levels of Success

Identifying levels of success is useful when there are reasonable shades of gray associated with task success, such as when the user receives some value from completing a task partially. Think of it as partial credit on a homework assignment: you showed your work, even though you got the wrong answer. For example, assume that a user’s task is to find the least expensive digital camera with at least 10 megapixel resolution, at least 12× optical zoom, and weighing no more than 3 pounds. What if the user found a camera that met most of those criteria but had a 10× optical zoom instead of 12×? According to a strict binary success approach, that would be a failure, but scoring it that way loses some important information: the user actually came very close to completing the task successfully. For some types of products, coming close to fully completing a task may still provide value to the user. Also, it may be helpful for you to know why some users failed a task or with which particular tasks users needed help.

Should You Include Tasks That Can’t Be Done?

An interesting question is whether a usability study should include tasks that can’t be done using the product being tested. For example, assume you’re testing an online bookstore that only carries mystery novels. Would it be appropriate to include a task that involves trying to find a book that the store doesn’t carry, such as a science-fiction novel? If one of the goals of the study is to determine how well users can tell what the store does not carry, we think it could make sense. In the real world, when you come to a new website, you don’t automatically know everything that can or can’t be done using the site. A well-designed site makes it clear not only what is available on the site, but also what’s not available. However, when tasks are presented in a usability study, there’s probably an implicit understanding that they can be done. So we think if you do include tasks that can’t be done, you should make it clear up front that some of the tasks may not be possible.

How to Collect and Measure Levels of Success

Collecting and measuring levels of success data is very similar to binary success data except that you must define the various levels. There are a couple of approaches to levels of success:

• Based on the user’s experience in completing a task. Some users might struggle or require assistance, while others complete their tasks without any difficulty.

• Based on the users accomplishing the task in different ways. Some users might accomplish the task in an optimal way, while others might accomplish it in ways that are less than optimal.

Levels of success based on the degree to which users complete a task typically have between three and six levels. A common approach is to use three levels: complete success, partial success, and complete failure.

Levels of success data are almost as easy to collect and measure as binary success data. It just means defining what you mean by “complete success” and by “complete failure.” Anything in between is considered a partial success. A more granular approach is to break out each level according to whether assistance was given or not. Below is an example of six different levels of completion:

• Complete success

    ○ With assistance

    ○ Without assistance

• Partial success

    ○ With assistance

    ○ Without assistance

• Failure

    ○ User thought it was complete, but it wasn’t

    ○ User gave up

If you do decide to use levels of success, it’s important to clearly define the levels beforehand. Also, consider having multiple observers independently assess the levels for each task and then reach a consensus.

A common issue when measuring levels of success is deciding what constitutes “giving assistance” to the participant. Here are some examples of situations we define as giving assistance:

• Moderator takes the participant back to a home page or resets to an initial (pretask) state. This form of assistance may reorient the participant and help avoid certain behaviors that initially resulted in confusion.

• Moderator asks the participant probing questions or restates the task. This may cause the user to think about her behavior or choices in a different way.

• Moderator answers a question or provides information that helps the participant complete the task.

• Participant seeks help from an outside source. For example, the participant calls a phone representative, uses some other website, consults a user manual, or accesses an online help system.

Level of success can also be examined in terms of the user experience. We commonly find that some tasks are completed without any difficulty, whereas others are completed with minor or major problems along the way. It’s important to distinguish between these different experiences. A four-point scoring method can be used for each task:

1 = No problem. The user completed the task successfully without any difficulty or inefficiency.

2 = Minor problem. The user completed the task successfully but took a slight detour. He made one or two small mistakes but recovered quickly and was successful.

3 = Major problem. The user completed the task successfully but had major problems. She struggled and took a major detour in her eventual successful completion of the task.

4 = Failure/gave up. The user provided the wrong answer or gave up before completing the task or the moderator moved on to the next task before successful completion.

When using this scoring system, it’s important to remember that these data are ordinal (see Chapter 2). Therefore, you should not report an average score. Rather, present the data as frequencies for each level of completion. This scoring system is relatively easy to use, and we usually see agreement on the various levels by different usability specialists observing the same interactions. Also, you can aggregate the data into a binary success rate if you need to. Finally, this scoring system is usually easy to explain to your audience. It’s also helpful to focus on the 3’s and 4’s as part of design improvements; there’s usually no need to worry about the 1’s and 2’s.

How to Analyze and Present Levels of Success

In analyzing levels of success, the first thing you should do is create a stacked bar chart. This will show the percentage of users who fall into each category or level, including failures. Make sure that the bars add up to 100%. Figure 4.3 is an example of a common way to present levels of success.


Figure 4.3 Stacked bar chart showing different levels of success based on task completion.
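A rough sketch of how such a stacked bar chart can be produced (the task names and percentages are hypothetical):

    import matplotlib.pyplot as plt

    tasks = ["Task 1", "Task 2", "Task 3"]
    # Hypothetical percentages of participants at each level; each task sums to 100
    complete = [70, 40, 55]
    partial = [20, 35, 30]
    failure = [10, 25, 15]

    plt.bar(tasks, complete, label="Complete success")
    plt.bar(tasks, partial, bottom=complete, label="Partial success")
    plt.bar(tasks, failure, bottom=[c + p for c, p in zip(complete, partial)],
            label="Failure")
    plt.ylabel("Percentage of participants")
    plt.legend()
    plt.show()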

4.1.3 Issues in Measuring Success

Obviously, an important issue in measuring task success is simply how you define whether a task was successful. The key is to clearly define beforehand what the criteria are for completing each task successfully. Try to think through the various situations that might arise for each task and decide whether or not they constitute success. For example, is a task successful if the user finds the right answer but reports it in the wrong format? What happens if he reports the right answer but then restates his answer incorrectly? When unexpected situations arise during the test, make note of them and try to reach a consensus among the observers afterward about those cases.

One issue that commonly arises during a usability evaluation is how or when to end a task if the user is not successful. In essence, this is the “stopping rule” for unsuccessful tasks. Here are some of the common approaches to ending an unsuccessful task:

1. Tell the users at the beginning of the session that they should continue to work on each task until they either complete it or reach the point at which, in the real world, they would give up or seek assistance (from technical support, a colleague, etc.).

2. Apply a “three strikes and you’re out” rule. This means that the users get three attempts (or whatever number you decide) to complete a task before you stop them. The main difficulty with this approach is defining what is meant by an “attempt.” It could be three different strategies, three wrong answers, or three different “detours” in finding specific information. However you define it, there will be a considerable amount of discretion on the part of the moderator or scorer.

3. “Call” the task after a predefined amount of time has passed. Set a time limit, such as 5 minutes. After the time has expired, move on to the next task. In most cases, it is better not to tell the user that you are timing them; doing so creates a more stressful, “test-like” environment.

Of course, you always have to be sensitive to the user’s state in any usability test and potentially end a task (or even the session) if you see that the user is becoming particularly frustrated or agitated.

4.2 Time on Task

Time on task (sometimes referred to as task completion time or simply task time) is a good way to measure the efficiency of a product. In most situations, the faster a user can complete a task, the better the experience. In fact, it would be pretty unusual for a user to complain that a task took less time than expected. But there are some exceptions to the assumption that faster is better. One could be a game, where the user doesn’t want to finish it too quickly. The main purpose of most games is the experience itself rather than quick completion of a task. Another exception may be e-learning. For example, if you’re putting together an online training course, slower may be better. Users may retain more if they spend more time completing the tasks rather than rushing through the course.

Time on Task vs Web Session Duration

Our assertion that faster task times are generally better seems at odds with the view from web analytics that you want longer page view or session durations. From a web-analytics perspective, longer page-view durations (the amount of time each user is viewing each page) and longer session durations (the amount of time each user is spending on the site) are generally considered good things. The argument is that they represent greater “engagement” with the site, or the site is considered “stickier”. Part of the reason that our assertion seems at odds with that perspective is that we don’t agree with it. Session and page-view duration are examples of metrics that are from the perspective of the site owner rather than the user. We would still argue that users generally want to be spending less time on the site, not more. But there is a way in which the two viewpoints might be reconciled. Perhaps a goal of a site might be to get users to perform more in-depth or complex tasks rather than just superficial ones (e.g., rebalancing their financial portfolio instead of just checking their balances). More complex tasks will generally yield longer times on the site and longer task times than superficial tasks.

4.2.1 Importance of Measuring Time on Task

Time on task is particularly important for products where tasks are performed repeatedly by the user. For example, if you’re designing an application for use by customer service representatives of an airline, the time it takes to complete a phone reservation would be an important measure of efficiency. The faster the airline agent can complete a reservation, presumably the more calls that can be handled and, ultimately, the more money can be saved. The more often a task is performed by the same user, the more important efficiency becomes. One of the side benefits of measuring time on task is that it can be relatively straightforward to calculate cost savings due to an increase in efficiency and then derive an actual return on investment (ROI). Calculating ROI is discussed in more detail in Chapter 9.

4.2.2 How to Collect and Measure Time on Task

Time on task is simply the time elapsed between the start of a task and the end of a task, usually expressed in minutes and seconds. Logistically, time on task can be measured in many different ways. The moderator or note taker can use a stopwatch or any other time-keeping device that can measure at the minute and second levels. Using a digital watch or an application on a smartphone, you could simply record the start and end times. When video recording a usability session, we find it’s helpful to use the time-stamp feature of most recorders to display the time and then to mark those times as the task start and stop times. If you choose to record time on task manually, it’s important to be very diligent about when to start and stop the clock and about recording the start and stop times. It may also be helpful to have two people record the times and to be as unobtrusive as possible when doing so.

Automated Tools for Measuring Time on Task

A much easier and less error-prone way of recording task times is using an automated tool. Some tools that can assist in logging of task times include the following:

• Usability Activity Log from Bit Debris Solutions (http://www.bitdebris.com/category/Usability-Activity-Log.aspx)

• The Observer XT from Noldus Information Technology (http://www.noldus.com/human-behavior-research/products/the-observer-xt)

• Ovo Logger from Ovo Studios (http://www.ovostudios.com/ovologger.asp)

• Morae from TechSmith (http://www.techsmith.com/morae.html)

• Usability Testing Environment (UTE) from Mind Design Systems (http://utetool.com/)

• Usability Test Data Logger from UserFocus (http://www.userfocus.co.uk/resources/datalogger.html)

Our website, MeasuringUX.com, also includes a simple macro for use in Microsoft Word for logging start and finish times. An automated method of logging has several advantages. Not only is it less error-prone but it’s also much less obtrusive. The last thing you want is for a participant in a usability test to feel nervous watching you press the start and stop buttons on your stopwatch or smartphone.

Turning on and off the Clock

Not only do you need a way to measure time, but you also need some rules about how to measure time. Perhaps the most important rule is when to turn the clock on and off. Turning on the clock is fairly straightforward: If you have the participants read the task aloud, you start the clock as soon as they finish reading the task.

Turning off the clock is a more complicated issue. Automated time-keeping tools typically have an “answer” button. Users are required to hit the “answer” button, at which point the timing ends, and they are then asked to provide an answer and perhaps respond to a few additional questions. If you are not using an automated method, you can have users report the answer verbally or perhaps even write it down. However, there are many situations in which you may not be sure whether they have found the answer, so it’s important to ask participants to indicate their answer as quickly and clearly as possible. In any case, you want to stop timing when the participant states the answer or otherwise indicates that she has completed the task.

Tabulating Time Data

The first thing you need to do is arrange the data in a table, as shown in Table 4.2. Typically, you will want a list of all the participants in the first column, followed by the time data for each task in the remaining columns (expressed in seconds, or minutes if the tasks are long). Table 4.2 also shows summary data, including the average, median, geometric mean, and confidence intervals for each task.
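As a sketch of how those summary statistics might be computed for a single task (the times below are hypothetical, not the Table 4.2 data, and the example assumes the scipy library is available):

    import numpy as np
    from scipy import stats

    # Hypothetical task times, in seconds, for one task
    times = np.array([35, 41, 47, 52, 58, 64, 70, 79, 95, 130])

    mean = times.mean()
    median = np.median(times)
    geo_mean = stats.gmean(times)

    # 90% confidence interval around the mean, based on the t distribution
    ci_low, ci_high = stats.t.interval(0.90, df=len(times) - 1,
                                       loc=mean, scale=stats.sem(times))

    print(mean, median, geo_mean, (ci_low, ci_high))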

Working with Time Data in Excel

If you use Excel to log data during a usability test, it’s often convenient to use times that are formatted as hours, minutes, and (sometimes) seconds (hh:mm:ss). Excel provides a variety of formats for time data. This makes it easy to enter times, but it complicates matters slightly when you need to calculate an elapsed time. For example, assume that a task started at 12:46 PM and ended at 1:04 PM. Although you can look at those times and determine that the elapsed time was 18 minutes, how to get Excel to calculate that isn’t so obvious. Internally, Excel stores times as fractions of a day (noon, for example, is 0.5), so the difference between two times is also a fraction of a day. To convert an elapsed Excel time to minutes, multiply it by 24 (the number of hours in a day) and then by 60 (the number of minutes in an hour). To convert to seconds, multiply by another 60 (the number of seconds in a minute). Here’s what the calculation looks like:

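For example (the cell references here are assumptions for illustration), if the start time 12:46 PM is in cell A2 and the end time 1:04 PM is in cell B2, the formula =(B2-A2)*24*60 returns 18: the 18-minute difference is stored as 18/1440 = 0.0125 of a day, and 0.0125 × 24 × 60 = 18 minutes.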

Table 4.2

Time-on-task data for 20 participants and five tasks (all data are expressed in seconds).


4.2.3 Analyzing and Presenting Time-on-Task Data

You can analyze and present time-on-task data in many different ways. Perhaps the most common way is to look at the average amount of time spent on any particular task or set of tasks by averaging all the times for each user by task (Figure 4.4). This is a straightforward and intuitive way to report time-on-task data. One downside is the potential variability across users. For example, if you have several users who took an exceedingly long time to complete a task, it may increase the average considerably. Therefore, you should always report a confidence interval to show the variability in the time data. This will not only show the variability within the same task but also help visualize the difference across tasks to determine whether there is a statistically significant difference between tasks.

What’s the Right Precision for Time Data?

How accurate do you need to be with your time data? Of course, it depends on what you’re measuring, but most of the times we deal with in the user experience world are measured in seconds or minutes. It’s very rare that we need to record subsecond times. Similarly, if you’re dealing with times that are more than an hour, it’s probably not necessary to be more accurate than whole minutes.


Figure 4.4 Mean time on task, in seconds, for 19 tasks. Error bars represent a 90% confidence interval. These data are from an online study of a prototype website.

Sometimes it’s more appropriate to summarize time-on-task data using the median rather than the mean. The median is the middle point in an ordered list of all the times: Half of the times are below the median and half are above the median. Similarly, the geometric mean is potentially less biased than the mean. Time data are typically skewed, in which case the median or geometric mean may be more appropriate. In practice, we find that using these other methods of summarizing time data may change the overall level of the times, but the kinds of patterns you’re interested in (e.g., comparisons across tasks) usually stay the same; the same tasks still took the longest or shortest times overall.

Excel Tip

The median can be calculated in Excel using the =MEDIAN function. The geometric mean can be calculated using the =GEOMEAN function.

What’s a Geometric Mean?

While the mean (or arithmetic average) is based on the sum of a set of numbers, the geometric mean is based on their product. For example, the mean of 2 and 8 is (2 + 8)/2, or 10/2, which is 5. The geometric mean of 2 and 8 is sqrt(2*8), or sqrt(16), which is 4. The geometric mean will usually be smaller than the mean.

Ranges

A variation on calculating average completion time by task is to create ranges, or discrete time intervals, and report the frequency of users who fall into each time interval. This is a useful way to visualize the spread of completion times by all users. In addition, this might be a helpful approach to look for any patterns in the type of users who fall within certain segments. For example, you may want to focus on those users who had particularly long completion times to see if they share any common characteristics.

Thresholds

Another useful way to analyze task time data is by using a threshold. In many situations, the only thing that matters is whether users can complete certain tasks within an acceptable amount of time. In many ways, the average is unimportant. The main goal is to minimize the number of users who need an excessive amount of time to complete a task. The main issue is determining what the threshold should be for any given task. One way is to perform the task yourself, keeping track of the time, and then double or triple that number. Alternatively, you could work with the product team to come up with a threshold for each task based on competitive data or even a best guess. Once you have set your threshold, simply calculate the percentage of users above or below the threshold and plot as illustrated in Figure 4.5.


Figure 4.5 An example showing the percentage of users who completed each task in less than 1 minute.
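The threshold calculation itself is simple; here is a sketch with hypothetical times and a 60-second threshold:

    import numpy as np

    times = np.array([38, 52, 47, 75, 90, 41, 55, 63, 49, 58])  # hypothetical, in seconds
    threshold = 60

    pct_within = (times < threshold).mean() * 100
    print(f"{pct_within:.0f}% of users completed the task in under {threshold} seconds")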

Distributions and Outliers

Whenever analyzing time data, it’s critical to look at the distribution. This is particularly true for time-on-task data collected via automated tools (when the moderator is not present). Participants might take a phone call or even go out to lunch in the middle of a task. The last thing you want is to include a task time of 2 hours among other times of only 15 to 20 seconds when calculating an average! It’s perfectly acceptable to exclude outliers from your analysis, and many statistical techniques for identifying them are available. Sometimes we exclude any times that are more than two or three standard deviations above the mean. Alternatively, we sometimes set up thresholds, knowing that it should never take a user more than x seconds to complete a task. Whatever cutoff you use for excluding outliers, you should have some rationale for it rather than picking a number arbitrarily.

The opposite problem—participants apparently completing a task in unusually short amounts of time—is also common in online studies. Some participants may be in such a hurry or only care about the compensation that they simply fly through the study as fast as they can. In most cases, it’s very easy to identify these individuals through their time data. For each task, determine the fastest possible time. This would be the time it would take someone with perfect knowledge and optimal efficiency to complete the task. For example, if there is no way you, as an expert user of the product, can finish the task in less than 8 seconds, then it is highly unlikely that a typical user could complete the task any faster. Once you have established this minimum acceptable time, you should identify the tasks that have times less than that minimum. These are candidates for removal—not just of the time but of the entire task (including any other data for the task such as success or subjective rating). Unless you can find evidence suggesting otherwise, the time indicates that the participant did not make a reasonable attempt at the task. If a participant did this for multiple tasks, you should consider dropping that participant. You can expect anywhere from 5 to 10% of the participants in an online study to be in it only for the compensation.
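One way to operationalize both screens is sketched below (the data, the two-standard-deviation cutoff, and the 8-second minimum are all assumptions you would tailor to your own study):

    import numpy as np

    times = np.array([18, 22, 25, 31, 3, 27, 24, 7200, 29, 21])  # hypothetical, in seconds
    min_plausible = 8      # assumed fastest time an expert could plausibly achieve

    upper_cutoff = times.mean() + 2 * times.std()   # two standard deviations above the mean

    too_fast = times < min_plausible   # candidates for dropping the whole task
    too_slow = times > upper_cutoff    # candidates for excluding the time as an outlier

    keep = times[~(too_fast | too_slow)]
    print(keep)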

4.2.4 Issues to Consider When Using Time Data

Some of the issues to think about when analyzing time data are whether to look at all tasks or just successful tasks, what the impact of using a think-aloud protocol might be, and whether to tell test participants that time is being measured.

Only Successful Tasks or All Tasks?

Perhaps the first issue to consider is whether you should include times for only successful tasks or all tasks in the analysis. The main advantage of only including successful tasks is that it is a cleaner measure of efficiency. For example, time data for unsuccessful tasks are often very difficult to estimate. Some users will keep on trying until you practically unplug the computer. Any task that ends with the participant giving up or the moderator “pulling the plug” is going to result in highly variable time data.

The main advantage of analyzing time data for all tasks, successful or not, is that it is a more accurate reflection of the overall user experience. For example, if only a small percentage of users were successful, but that particular group was very efficient, the overall time on task is going to be low. Therefore, it is easy to misinterpret time-on-task data when only analyzing successful tasks. Another advantage of analyzing time data for all tasks is that it is an independent measure in relation to task success data. If you only analyze time data for successful tasks, you’re introducing a dependency between the two sets of data.

A good rule is that if the participant always determined when to give up on unsuccessful tasks, you should include all times in the analyses. If the moderator sometimes decided when to end an unsuccessful task, then use only the times for the successful tasks.

Using a Concurrent Think-Aloud Protocol

Another important issue to consider is whether to use a concurrent think-aloud protocol when collecting time data (i.e., asking participants to think aloud while they are going through the tasks). Most usability specialists rely heavily on a concurrent think-aloud protocol to gain important insight into the user experience. But sometimes a think-aloud protocol leads to a tangential topic or a lengthy interaction with the moderator. The last thing you want to do is measure time on task while a participant is giving a 10-minute diatribe on the importance of fast-loading web pages. When you want to capture time on task but also use a concurrent think-aloud protocol, a good solution is to ask participants to “hold” any longer comments for the time between tasks. Then you can have a dialog with the participant about the just-completed task after the “clock is stopped.”

Retrospective Think Aloud (RTA)

A technique that’s gaining in popularity among many usability professionals is retrospective think aloud (e.g., Birns, Joffre, Leclerc, & Paulsen, 2002; Guan, Lee, Cuddihy, & Ramey, 2006; Petrie & Precious, 2010). With this technique, participants typically remain silent while they are interacting with the product being tested. Then, after all the tasks, they are shown some kind of “reminder” of what they did during the session and are asked to describe what they were thinking or doing at various points in the interaction. The reminder can take several different forms, including a video replay of screen activity, perhaps with a camera view of the user, or an eye-tracking replay showing what the user was looking at. This technique probably yields the most accurate task time data. There’s also some evidence that the additional cognitive load of concurrent think aloud causes participants to be less successful with their tasks. For example, van den Haak, de Jong, and Schellens (2004) found that participants in a usability study of a library website were successful with only 37% of their tasks when using concurrent think aloud, but they were successful with 47% when using RTA. But keep in mind that it will take almost twice as long to run sessions using RTA.

Should You Tell Participants about the Time Measurement?

An important question to consider is whether to tell participants you are recording their time. It’s possible that if you don’t, participants won’t behave in an efficient manner; it’s not uncommon for participants to explore different parts of a website when they are in the middle of a task. On the flip side, if you tell them they are being timed, they may become nervous and feel that they, and not the product, are the ones being tested. A good compromise is asking the participants to perform the tasks as quickly and accurately as possible, without volunteering that they are being explicitly timed. If participants happen to ask (which they rarely do), simply state that you are noting the start and finish time for each task.

4.3 Errors

Some user experience professionals believe errors and usability issues are essentially the same thing. Although they are certainly related, they are actually quite different. A usability issue is the underlying cause of a problem, whereas one or more errors are a possible outcome of an issue. For example, if users are experiencing a problem in completing a purchase on an e-commerce website, the issue (or cause) may be confusing labeling of the products. The error, or the result of the issue, may be the act of choosing the wrong options for the product they want to buy. Essentially, errors are incorrect actions that may lead to task failure.

4.3.1 When to Measure Errors

In some situations it’s helpful to identify and classify errors rather than just document usability issues. Measuring errors is useful when you want to understand the specific action or set of actions that may result in task failure. For example, a user may make the wrong selection on a web page and sell a stock instead of buying more. A user may push the wrong button on a medical device and deliver the wrong medication to a patient. In both cases, it’s important to know what errors were made and how different design elements may increase or decrease the frequency of errors.

Errors are a useful way of evaluating user performance. While being able to complete a task successfully within a reasonable amount of time is important, the number of errors made during the interaction is also very revealing. Errors can tell you how many mistakes were made, where they were made while interacting with the product, how various designs produce different frequencies and types of errors, and generally how usable something really is.

Measuring errors is not right for every situation. We’ve found that there are three general situations where measuring errors might be useful:

1. When an error will result in a significant loss in efficiency—for example, when an error results in a loss of data, requires the user to reenter information, or slows the user significantly in completing a task.

2. When an error will result in significant costs to your organization or the end user—for example, if an error will result in increased call volumes to customer support or in increased product returns.

3. When an error will result in task failure—for example, if an error will cause a patient to receive the wrong medication, a voter to vote for the wrong candidate accidentally, or a web user to buy the wrong product.

4.3.2 What Constitutes an Error?

Surprisingly, there is no widely accepted definition of what constitutes an error. Obviously, it’s some type of incorrect action on the part of the user. Generally an error is an action that causes the user to stray from the path to successful completion. Sometimes failing to take an action can be an error. Errors can be based on many different types of actions by the user, such as the following:

• Entering incorrect data into a form field (such as typing the wrong password during a login attempt)

• Making the wrong choice in a menu or drop-down list (such as selecting “Delete” instead of “Modify”)

• Taking an incorrect sequence of actions (such as reformatting their home media server when all they were trying to do was play a recorded TV show)

• Failing to take a key action (such as clicking on a key link on a web page)

Obviously, the range of possible actions will depend on the product you are studying (website, cell phone, DVD player, etc.). When you’re trying to determine what constitutes an error, first make a list of all the possible actions a user can take on your product. Some of those actions are errors. Once you have a universe of possible actions, you can then start to define many of the different types of errors that can be made using the product.

4.3.3 Collecting and Measuring Errors

Measuring errors is not always easy. Similar to other performance metrics, you need to know what the correct action should be or, in some cases, the correct set of actions. For example, if you’re studying a password reset form, you need to know what is considered the correct set of actions to reset the password successfully and what is not. The better you can define the universe of correct and incorrect actions, the easier it will be to measure errors.

An important consideration is whether a given task presents only a single error opportunity or multiple error opportunities. An error opportunity is basically a chance to make a mistake. For example, if you’re measuring the usability of a typical login screen, at least two error opportunities are possible: making an error when entering the user name and making an error when entering the password. If you’re measuring the usability of an online form, there could be as many error opportunities as there are fields on the form.

In some cases there might be multiple error opportunities for a task but you only care about one of them. For example, you might be interested only in whether users click on a specific link that you know will be critical to completing their task. Even though errors could be made on other places on the page, you’re narrowing your scope of interest to that single link. If users don’t click on the link, it is considered an error.

The most common way of organizing error data is by task. Simply record the number of errors for each task and each user. If there is only a single opportunity for error, the numbers will be 1’s and 0’s:

0 = No error

1 = One error

If multiple error opportunities are possible, numbers will vary between 0 and the maximum number of error opportunities. The more error opportunities, the harder and more time-consuming it will be to tabulate the data. You can count errors while observing users during a lab study, by reviewing videos after the sessions are over, or by collecting the data using an automated or online tool.

If you can clearly define all the possible error opportunities, another approach could be to identify the presence (1) or absence (0) of each error opportunity for each user and task. The average of these for a task would then reflect the incidence of those errors.
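As a minimal sketch of that tabulation (hypothetical data for one task with four error opportunities and five participants):

    import numpy as np

    # 1 = the participant made that error, 0 = they did not (hypothetical data)
    errors = np.array([
        [0, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 1],
    ])

    errors_per_participant = errors.sum(axis=1)        # e.g., [1, 2, 0, 2, 1]
    error_rate_per_opportunity = errors.mean(axis=0)   # incidence of each specific error
    task_error_rate = errors.mean()                    # total errors / total opportunities

    print(errors_per_participant, error_rate_per_opportunity, task_error_rate)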

4.3.4 Analyzing and Presenting Errors

The analysis and presentation of error data differ slightly depending on whether a task has only one error opportunity or multiple error opportunities. If each task has only one error opportunity, then the data are binary for each task (the user made an error or didn’t), which means that the analyses are basically all the same as they are for binary task success. You could, for example, look at average error rates per task or per participant. Figure 4.6 is an example of presenting errors based on a single opportunity per task. In this example, they were interested in the percentage of participants who experienced an error when using different types of on-screen keyboards (Tullis, Mangan, & Rosenbaum, 2007). The control condition was the current QWERTY keyboard layout.


Figure 4.6 An example showing how to present data for single error opportunities. In this study, only one error opportunity per task (entering a password incorrectly) was possible, and the graph shows the percentage of participants who made an error for each condition.

In many situations, there are multiple opportunities for errors per task (e.g., multiple input fields in a “new account” application). Here are some of the common ways to analyze data from tasks with multiple error opportunities:

• A good place to start is to look at the frequency of errors for each task. You will be able to see which tasks are resulting in the most errors. But this may be misleading if each task has a different number of error opportunities. In that case, it might be better to divide the total number of errors for the task by the total number of error opportunities. This creates an error rate that takes into account the number of opportunities.

• You could calculate the average number of errors made by each participant for each task. This will also tell you which tasks are producing the most errors. However, it may be more meaningful because it suggests that a typical user might experience x number of errors on a particular task when using the product. Another advantage is that it takes into account extremes. If you are simply looking at the frequency of errors for each task, some users may be the source of most of the errors, whereas many others are performing the task error-free. By taking an average number of errors for each user, this bias is reduced.

• In some situations it might be interesting to know which tasks fall above or below a threshold. For example, for some tasks, an error rate above 20% is unacceptable, whereas for others, an error rate above 5% is unacceptable. The most straightforward analysis is to first establish an acceptable threshold for each task or each participant. Next, calculate whether that specific task’s error rate or user error count was above or below the threshold.

• Sometimes you want to take into account that not all errors are created equal. Some errors are much more serious than others. You could assign a severity level to each error, such as high, medium, or low, and then calculate the frequency of each error type. This could help the project team focus on the issues that seem to be associated with the most serious errors.

4.3.5 Issues to Consider When Using Error Metrics

Several important issues must be considered when looking at errors. First, make sure you are not double counting errors. Double counting happens when you assign more than one error to the same event. For example, assume you are counting errors in a password field. If a user typed an extra character in the password, you could count that as an “extra character” error, but you shouldn’t also count it as an “incorrect character” error.

Sometimes you need to know more than just an error rate; you need to know why different errors are occurring. The best way to do this is by looking at each type of error. Basically, you want to try to code each error by type of error. Coding should be based on the various types of errors that occurred. For example, continuing with the password example, the types of errors might include “missing character,” “transposed characters,” “extra character,” and so on. At a higher level, you might have “navigation error,” “selection error,” “interpretation error,” and so on. Once you have coded each error, you can run frequencies on the error type for each task to better understand exactly where the problems lie. This will also help improve the efficiency with which you collect error data.

In some cases, an error is the same as failing to complete a task—for example, with a login page that allows only one chance at logging in. If no errors occur while logging in, it is the same as task success. If an error occurs, it is the same as task failure. In this case, it might be easier to report errors as task failure. It’s not so much a data issue as it is a presentation issue. It’s important to make sure your audience understands your metrics clearly.

Another enlightening metric can be the incidence of repeated errors—namely the case where a participant makes essentially the same mistake more than once, such as repeatedly clicking on the same link that looks like it might be the right one but isn’t.

4.4 Efficiency

Time on task is often used as a measure of efficiency, but another way to measure efficiency is to look at the amount of effort required to complete a task. This is typically done by measuring the number of actions or steps that users took in performing each task. An action can take many forms, such as clicking a link on a web page, pressing a button on a microwave oven or a mobile phone, or flipping a switch on an aircraft. Each action a user performs represents a certain amount of effort. The more actions taken by a user, the more effort involved. In most products, the goal is to increase productivity by minimizing the number of discrete actions required to complete a task, thereby minimizing the amount of effort.

What do we mean by effort? There are at least two types of effort: cognitive and physical. Cognitive effort involves finding the right place to perform an action (e.g., finding a link on a web page), deciding what action is necessary (should I click this link?), and interpreting the results of the action. Physical effort involves the physical activity required to take action, such as moving a mouse, inputting text on a keyboard, turning on a switch, and many others.

An Interesting Way of Measuring Cognitive Effort

One way of measuring cognitive effort is using performance on a task that’s peripheral, or secondary, to the participants’ primary task. The more cognitive effort the primary task requires, the worse the performance on the secondary task will be. An interesting variation of this was used by Ira Hyman and associates at Western Washington University to measure cell phone distraction (Hyman et al., 2010). They had one of their students ride a unicycle around a popular square on campus while wearing a clown suit (a rather memorable sight!). Then they observed 347 pedestrians walking across the square, some of whom were talking on their cell phones. After crossing the square, they asked the pedestrians if they had seen a unicycling clown. The clown was remembered by 71% of those walking with a friend, 61% of those listening to music, and 51% of those walking alone. But only 25% of those talking on a cell phone remembered the unicycling clown!

Efficiency metrics work well if you are concerned not only with the time it takes to complete a task but also the amount of cognitive and physical effort involved. For example, if you’re designing an automobile navigation system, you need to make sure that it does not take much effort to interpret its navigation directions, as the driver’s attention must be focused on the road. It would be important to minimize both physical and cognitive effort to use the navigation system.

4.4.1 Collecting and Measuring Efficiency

There are some important points to keep in mind when collecting and measuring efficiency.

• Identify the action(s) to be measured: For websites, mouse clicks or page views are common actions. For software, it might be mouse clicks or keystrokes. For appliances or consumer electronics, it could be button presses. Regardless of the product being evaluated, you should have a clear idea of all the possible actions.

• Define the start and end of an action: You need to know when an action begins and ends. Sometimes the action is very quick, such as a press of a button, but other actions can take much longer. An action may be more passive in nature, such as looking at a web page. Some actions have a very clear start and end, whereas other actions are less defined.

• Count the actions: You must be able to count the actions. Actions must happen at a pace that can be identified visually or, if they are too fast, by an automated system. Try to avoid having to review hours of video to collect efficiency metrics.

• Actions must be meaningful: Each action should represent an incremental increase in cognitive and/or physical effort. The more actions, the more effort. For example, each click of a mouse is almost always an incremental increase in effort.

Once you have identified the actions you want to capture, counting those actions is relatively simple. You can do it manually, such as counting page views or presses of a button. This will work for fairly simple products, but in most cases, it is not practical. Many times a participant is performing these actions at amazing speeds. There may be more than one action every second, so using automated data collection tools is far preferable.

Keystroke-Level Modeling

This discussion of low-level actions such as keystrokes and mouse clicks should sound familiar if you’ve ever studied theories of human–computer interaction. A framework called GOMS (Goals, Operators, Methods, and Selection rules) dates back to a classic book, “The Psychology of Human–Computer Interaction” (Card, Moran, & Newell, 1983). In it, a user’s interaction with a computer is decomposed into its fundamental units, which could be physical, cognitive, or perceptual. Identifying these fundamental units and assigning times to each of them allow you to predict how long a particular interaction will take. A simplified version of GOMS is called the keystroke-level model, which, as its name implies, focuses on keystrokes and mouse clicks (e.g., Sauro, 2009).
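
As a rough illustration of the keystroke-level model, the sketch below adds up commonly cited approximate operator times (keystroke, pointing, mouse button press, homing, and mental preparation). Treat the times and the example operator sequence as illustrative assumptions rather than definitive values.

# Commonly cited approximate KLM operator times, in seconds:
# K = keystroke, P = point with the mouse, B = mouse button press or release,
# H = home hands between keyboard and mouse, M = mental preparation.
OPERATOR_TIMES = {"K": 0.2, "P": 1.1, "B": 0.1, "H": 0.4, "M": 1.35}

def klm_estimate(operators: str) -> float:
    """Return the predicted task time (in seconds) for a string of operators."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# Example: mentally prepare, point at a field and click it, home to the
# keyboard, then type a 7-character password (M + P + BB + H + 7 Ks).
print(round(klm_estimate("MPBBH" + "K" * 7), 2), "seconds")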

4.4.2 Analyzing and Presenting Efficiency Data

The most common way to analyze and present efficiency metrics is by looking at the number of actions each participant takes to complete a task. Simply calculate an average for each task (across participants) to see how many actions are taken. This analysis is helpful in identifying which tasks required the most effort, and it works well when each task requires about the same number of actions. However, if some tasks are more complicated than others, it may be misleading. It's also important to represent the confidence intervals (based on a continuous distribution) for this type of chart.
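
For example, a minimal sketch of this analysis in Python, with hypothetical click counts and a 95% confidence interval based on the t distribution, might look like this:

from statistics import mean, stdev
from scipy import stats

# Hypothetical number of actions (clicks) per participant for each task.
clicks = {
    "Task 1": [4, 6, 5, 7, 4, 5],
    "Task 2": [9, 14, 11, 10, 16, 12],
}

for task, values in clicks.items():
    m = mean(values)
    std_err = stdev(values) / len(values) ** 0.5     # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=len(values) - 1)  # two-tailed 95% critical value
    print(f"{task}: mean = {m:.1f} actions, 95% CI = +/- {t_crit * std_err:.1f}")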

Shaikh, Baker, and Russell (2004) used an efficiency metric based on the number of clicks to accomplish the same task on three different weight-loss sites: Atkins, Jenny Craig, and Weight Watchers. They found that users were significantly more efficient (needed fewer clicks) with the Atkins site than with the Jenny Craig or Weight Watchers sites.

Lostness

Another measure of efficiency sometimes used in studying behavior on the web is called “lostness” (Smith, 1996). Lostness is calculated using three values:

N: The number of different web pages visited while performing the task

S: The total number of pages visited while performing the task, counting revisits to the same page

R: The minimum (optimum) number of pages that must be visited to accomplish the task

Lostness, L, is then calculated using the following formula:

L = √[(N/S − 1)² + (R/N − 1)²]

Consider the example shown in Figure 4.7. In this case, the user’s task is to find something on Product Page C1. Starting on the home page, the minimum number of page visits (R) to accomplish this task is three. However, Figure 4.8 illustrates the path a particular user took in getting to that target item. This user started down some incorrect paths before finally getting to the right place, visiting a total of six different pages (N), or a total of nine page visits (S). So for this example:

N = 6

S = 9

R = 3

L = √[(6/9 − 1)² + (3/6 − 1)²] = √(0.111 + 0.25) ≈ 0.6

Figure 4.7 Optimum number of steps (three) to accomplish a task that involves finding a target item on Product Page C1 starting from the home page.


Figure 4.8 Actual number of steps a user took in getting to the target item on Product Page C1. Note that each revisit to the same page is counted, giving a total of nine steps.

A perfect lostness score would be 0. Smith (1996) found that participants with a lostness score less than 0.4 did not exhibit any observable characteristics of being lost. However, she reported that participants with a lostness score greater than 0.5 definitely did appear to be lost. Note that additional measures of lostness have been proposed by Otter & Johnson (2000) and Gwizdka & Spence (2007).

Once you calculate a lostness value, you can easily calculate the average lostness value for each task. The number or percentage of participants who exceed the ideal number of actions can also be indicative of the efficiency of the design. For example, you could report that 25% of the participants exceeded the ideal (minimum) number of steps, or that 50% of the participants completed a task using the minimum number of actions.
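
Here is a minimal Python sketch of a lostness calculation from a logged sequence of page visits. The page names are hypothetical and simply mirror the example above.

from math import sqrt

def lostness(path, optimum):
    """Smith's (1996) lostness measure from a sequence of page visits."""
    s = len(path)        # S: total page visits, counting revisits
    n = len(set(path))   # N: number of different pages visited
    r = optimum          # R: minimum number of pages needed for the task
    return sqrt((n / s - 1) ** 2 + (r / n - 1) ** 2)

# 9 total visits to 6 different pages, with an optimum path of 3 pages.
path = ["Home", "Category A", "Home", "Category B", "Product B1",
        "Category B", "Home", "Category C", "Product C1"]
print(round(lostness(path, optimum=3), 2))   # prints 0.6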

Backtracking Metric

Treejack (http://www.optimalworkshop.com/treejack.htm) is a tool from Optimal Workshop for testing information architectures (IAs). Participants in a Treejack study navigate an information hierarchy to indicate where in the hierarchy they would expect to find a given piece of information or perform some action. Participants can move down the hierarchy or, if they need to, they can move back up it. Several useful metrics come out of a Treejack study, including traditional ones, such as where participants indicated they would expect to find each function. But a particularly interesting metric is a “backtracking” metric that indicates cases where a participant went back up the hierarchy. You can then look at the percentage of participants who “backtracked” while performing each task. In our IA studies, we’ve found this was often the most revealing metric.

4.4.3 Efficiency as a Combination of Task Success and Time

Another view of efficiency is that it’s a combination of two of the metrics discussed in this chapter: task success and time on task. The Common Industry Format for Usability Test Reports (ISO/IEC 25062:2006) specifies that the “core measure of efficiency” is the ratio of the task completion rate to the mean time per task. Basically, it expresses task success per unit time. Most commonly, time per task is expressed in minutes, but seconds could be appropriate if the tasks are very short or even hours if they are unusually long. The unit of time used determines the scale of the results. Your goal is to choose a unit that yields a “reasonable” scale (i.e., one where most of the values fall between 1 and 100%). Table 4.3 shows an example of calculating an efficiency metric based on task completion and task time. Figure 4.9 shows how this efficiency metric looks in a chart.

Table 4.3

The efficiency measure is simply the ratio of task completion to task time in minutes.ᵃ


ᵃOf course, higher values of efficiency are better. In this example, users appear to have been more efficient in performing tasks 5 and 6 than the other tasks.


Figure 4.9 An example showing efficiency as a function of completion rate/time.

A slight variation on this approach to calculating efficiency is to count the number of tasks completed successfully by each participant and divide that by the total time spent by the participant on all tasks (successful and unsuccessful). This gives a very straightforward efficiency score for each participant: number of tasks completed successfully per minute (or whatever unit of time you used). If a participant completed 10 tasks successfully in a total time of 10 minutes, then that participant was successfully completing 1 task per minute overall. This works best when all participants attempted the same number of tasks and the tasks are relatively comparable in terms of their level of difficulty.
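
A minimal sketch of both calculations, using hypothetical success and time data, is shown below.

from statistics import mean

# Efficiency per task: completion rate divided by mean time per task.
# Hypothetical data: success flags (1/0) and task times in minutes.
task_results = {
    "Task 1": {"successes": [1, 1, 0, 1, 1], "minutes": [1.5, 2.0, 3.0, 1.8, 2.2]},
    "Task 2": {"successes": [1, 0, 0, 1, 1], "minutes": [3.0, 4.5, 5.0, 2.5, 3.2]},
}
for task, data in task_results.items():
    efficiency = mean(data["successes"]) / mean(data["minutes"])
    print(f"{task}: {efficiency:.0%} per minute")

# Efficiency per participant: tasks completed successfully per minute.
participants = {
    "P1": {"tasks_completed": 10, "total_minutes": 10.0},
    "P2": {"tasks_completed": 8, "total_minutes": 12.5},
}
for pid, data in participants.items():
    rate = data["tasks_completed"] / data["total_minutes"]
    print(f"{pid}: {rate:.2f} tasks per minute")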

Figure 4.10 shows data from an online study comparing four different navigation prototypes for a website. This was a between-subjects study, in which each participant used only one of the prototypes, but all participants were asked to perform the same 20 tasks. Over 200 participants used each prototype. We were able to count the number of tasks completed successfully by each participant and divide that by the total time that participant spent. The averages of these (and the 95% confidence intervals) are shown in Figure 4.10.


Figure 4.10 Average number of tasks completed successfully per minute in an online study of four different prototypes of navigation for a website. Over 200 participants attempted 20 tasks with each prototype. Participants using Prototype 2 were significantly more efficient (i.e., completed more tasks per minute) than those using Prototype 3.

4.5 Learnability

Most products, especially new ones, require some amount of learning. Usually, learning does not happen in an instant but occurs over time as experience increases. Experience is based on the amount of time spent using a product and the variety of tasks performed. Learning is sometimes quick and painless, but at other times it is quite arduous and time-consuming. Learnability is the extent to which something can be learned efficiently. It can be measured by looking at how much time and effort are required to become proficient, and ultimately expert in using something. We believe that learnability is an important user experience metric that doesn’t receive as much attention as it should. It’s an essential metric if you need to know how someone develops proficiency with a product over time.

Consider the following example. Assume you're a user experience professional who has been asked to evaluate a time-keeping application for employees within your organization. You could go into the lab and test with 10 participants, giving each participant a set of core tasks. You might measure task success, time on task, errors, and even overall satisfaction. These metrics will give you some sense of the usability of the application, but they can also be misleading. Because the use of a time-keeping application is not a one-time event, but happens with some degree of frequency, learnability is very important. A key question is how much time and effort are required to become proficient in using the application. Yes, there may be some initial obstacles when first using the application, but what really matters is “getting up to speed.” It's quite common in usability studies to look only at a participant's initial exposure to something, but sometimes it's more important to look at the amount of effort needed to become proficient.

Learning can happen over a short period of time or over longer periods. When learning happens over a short period, the user tries out different strategies to complete tasks. A short period might be several minutes, hours, or days. For example, if users have to submit their timesheets every day using a time-keeping application, they try to quickly develop some type of mental model of how the application works. Memory is not a big factor in this kind of learning; it is more about adapting strategies to maximize efficiency. Within a few hours or days, it is hoped that maximum efficiency is achieved.

Learnability and “Self-Service”

Learnability is much more important today than it was in the early days of computers. The web has facilitated a move toward many more “self-service” applications than we previously had. At the same time, it has fostered an expectation that you should be able to use just about anything on the web without extensive training or practice. In the 1980s, if you wanted to book an airline flight yourself, you called and spoke to a representative who had extensive training in the use of a mainframe-based airline reservation system. Today you go to one of many different websites that let you book an airline flight. How long do you think one of those sites would stay in business if it started by saying, “OK, let’s start with a 3-hour training course on the use of our site?” Learnability is a key differentiator in today’s self-service economy.

Learning can also happen over a longer time period, such as weeks, months, or years. This is the case where significant gaps exist in time between each use. For example, if you only fill out an expense report every few months, learnability can be a significant challenge because you may have to relearn the application each time you use it. In this situation, memory is very important. The more time there is between experiences with the product, the greater the reliance on memory.

4.5.1 Collecting and Measuring Learnability Data

The process of collecting and measuring learnability data is basically the same as it is for the other performance metrics, but you’re collecting the data at multiple times. Each instance of collecting the data is considered a trial. A trial might be every 5 minutes, every day, or once a month. The time between trials, or when you collect the data, is based on expected frequency of use.

The first decision is which type of metrics you want to use. Learnability can be measured using almost any performance metric over time, but the most common ones are those that focus on efficiency, such as time on task, errors, number of steps, or task success per minute. As learning occurs, you expect to see efficiency improve.

After you decide which metrics to use, you need to decide how much time to allow between trials. What do you do when learning occurs over a very long time? What if users interact with a product once every week, month, or even year? The ideal situation would be to bring the same participants into the lab every week, month, or even year. In many cases, this is not very practical. The developers and the business sponsors might not be very pleased if you told them the study will take 3 years to complete. A more realistic approach is to bring in the same participants over a much shorter time span and acknowledge the limitation in the data. Here are a few alternatives:

• Trials within the same session: The participant performs the task, or set of tasks, one right after the other, with no breaks in between. This is very easy to administer, but it does not capture the forgetting that would occur between real-world uses.

• Trials within the same session, with breaks between tasks: The break might be a distracter task or anything else that might promote forgetting. This is fairly easy to administer, but it tends to make each session relatively long.

• Trials between sessions: The participant performs the same tasks over multiple sessions, with at least 1 day in between. This may be the least practical, but most realistic, if the product is used sporadically over an extended period of time.

4.5.2 Analyzing and Presenting Learnability Data

The most common way to analyze and present learnability data is by examining a specific performance metric (such as time on task, number of steps, or number of errors) by trial for each task or aggregated across all tasks. This will show you how that performance metric changes as a function of experience, as illustrated in Figure 4.11. You could aggregate all the tasks together and represent them as a single line of data, or you could look at each task as a separate line. This can help you see how the learnability of different tasks compares, but it can also make the chart harder to interpret.


Figure 4.11 An example of how to present learnability data based on time on task.
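
A simple sketch of this kind of aggregation, using hypothetical task times collected over three trials, follows.

from statistics import mean

# Hypothetical time-on-task values (in seconds) for each trial,
# aggregated across tasks and participants.
times_by_trial = {
    1: [80, 95, 70, 88],
    2: [62, 70, 58, 66],
    3: [60, 64, 55, 61],
}
for trial in sorted(times_by_trial):
    print(f"Trial {trial}: mean time on task = {mean(times_by_trial[trial]):.0f} seconds")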

The first aspect of the chart you should notice is the slope of the line(s). Ideally, the line (sometimes called the learning curve) is fairly flat and sits low on the y axis (in the case of errors, time on task, number of steps, or any other metric where a smaller number is better). If you want to determine whether a statistically significant difference exists between the learning curves (or slopes), you need to perform an analysis of variance and see whether there is a main effect of trial. (See Chapter 2 for a discussion of analysis of variance.)
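
One way to run that test is with a repeated-measures analysis of variance; the sketch below uses the AnovaRM class from the statsmodels package on hypothetical time-on-task data (three participants, three trials).

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical time on task (seconds): each participant measured on each trial.
data = pd.DataFrame({
    "participant": ["P1", "P2", "P3"] * 3,
    "trial":       [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time":        [82, 95, 77, 64, 71, 60, 58, 66, 57],
})

# The F test for "trial" indicates whether there is a main effect of trial.
result = AnovaRM(data, depvar="time", subject="participant", within=["trial"]).fit()
print(result)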

You should also notice the point of asymptote, or essentially where the line starts to flatten out. This is the point at which users have learned as much as they can, and there is very little room for improvement. Project team members are always interested in how long it will take someone to reach maximum performance.

Finally, you should look at the difference between the highest and the lowest values on the y axis. This will tell you how much learning must occur to reach maximum performance. If the gap is small, users will be able to learn the product quickly. If the gap is large, users may take quite some time to become proficient with the product. One easy way to analyze the gap between highest and lowest scores is by looking at the ratio of the two. Here is an example:

• If the average time on the first trial is 80 seconds and on the last trial is 60 seconds, the ratio shows that users are initially taking 1.3 times longer.

• If the average number of errors on the first trial is 2.1 and on the last trial is 0.3, the ratio shows a 7 times improvement from the first trial to the last trial.

It may be helpful to look at how many trials are needed to reach maximum performance. This is a good way to characterize the amount of learning required to become proficient in using the product.

In some cases you might want to compare learnability across different conditions, as shown in Figure 4.12. This study (Tullis, Mangan, & Rosenbaum, 2007) looked at how the speed (efficiency) of entering a password changed over time using different types of on-screen keyboards. As you can see from the data, there is an improvement from the first trial to the second, but then the times flatten out fairly quickly. Also, all the on-screen keyboards were significantly slower than the control condition, which was a real keyboard.


Figure 4.12 Looking at the learnability of different types of on-screen keyboards.

4.5.3 Issues to Consider When Measuring Learnability

Two of the key issues to address when measuring learnability are (1) what should be considered a trial and (2) how many trials to include.

What Is a Trial?

In some situations learning is continuous. This means that the user is interacting with the product fairly continuously without any significant breaks in time. Memory is much less a factor in this situation. Learning is more about developing and modifying different strategies to complete a set of tasks. The whole concept of trials does not make much sense for continuous learning. What do you do in this situation? One approach is to take your measurements at specified time intervals. For example, you may need to take measurements every 5 minutes, 15 minutes, or every hour. In one usability study we conducted, we wanted to evaluate the learnability of a new suite of applications that would be used many times every day. We started by bringing the participants into the lab for their first exposure to the applications and their initial tasks. They then went back to their regular jobs and began using the applications to do their normal work. We brought them back into the lab 1 month later and had them perform basically the same tasks again (with minor changes in details) while we took the same performance measures. Finally, we brought them back one more time after another month and repeated the procedure. In this way, we were able to look at learnability over a 2-month period.

Number of Trials

How many trials should you plan for? Obviously there must be at least two, but in most cases there should be at least three or four. Sometimes it’s difficult to predict where in the sequence of trials the most learning will take place or even if it will take place. In this situation, you should err on the side of more trials than you think you might need to reach stable performance.

4.6 Summary

Performance metrics are powerful tools for evaluating the usability of any product. They are the cornerstone of usability and can inform key decisions, such as whether a new product is ready to launch. Performance metrics are always based on what users actually do rather than what they say. There are five general types of performance metrics:

1. Task success metrics are used when you are interested in whether users are able to complete tasks using the product. Sometimes you might only be interested in whether a user is successful or not based on a strict set of criteria (binary success). Other times you might be interested in defining different levels of success based on the degree of completion, the user’s experience in finding an answer, or the quality of the answer given.

2. Time on task is helpful when you are concerned about how quickly users can perform tasks with the product. You might look at the time it takes to complete a task for all users, a subset of users, or the proportion of users who can complete a task within a desired time limit.

3. Errors are a useful measure based on the number of mistakes users make while attempting to complete a task. A task might have a single error opportunity or multiple error opportunities, and some types of errors may be more important than others.

4. Efficiency is a way of evaluating the amount of effort (cognitive and physical) required to complete a task. Efficiency is often measured by the number of steps or actions required to complete a task or by the ratio of the task success rate to the average time per task.

5. Learnability involves looking at how any efficiency metric changes over time. Learnability is useful if you want to examine how and when users reach proficiency in using a product.
