Chapter 9

Special Topics


This chapter introduces a number of topics related to the measurement or analysis of user experience data not traditionally thought of as part of “mainstream” UX data. These include information you can glean from live data on a production website, data from card-sorting studies, data related to the accessibility of a website, and UX return on investment (ROI). These topics didn’t fit neatly into the other chapters, but we believe they are an important part of a complete UX metrics toolkit.

9.1 Live Website Data

If you’re dealing with a live website, there’s a potential treasure trove of data about what the visitors to your site are actually doing—what pages they’re visiting, what links they’re clicking on, and what paths they’re following through the site. The challenge usually isn’t getting raw data but making sense of it. Unlike lab studies with perhaps a dozen participants or online studies with perhaps 100 participants, live sites have the potential to yield data from thousands or even hundreds of thousands of users.

Entire books have been written solely on the subject of web metrics and web analytics (e.g., Burby & Atchison, 2007; Clifton, 2012; Kaushik, 2009). There’s even a “For Dummies” book on the topic (Sostre & LeClaire, 2007). So obviously we won’t be able to do justice to the topic in just one section of one chapter in this book. What we’ll try to do is introduce you to some of the things you can learn from live website data and specifically some of the implications they might have for the usability of your site.

9.1.1 Basic Web Analytics

Some websites get huge numbers of visitors every day. But regardless of how many visitors your site gets (assuming it gets some), you can learn from what they’re doing on the site.

Some Web Analytics Terms

Here are the meanings of some of the terms used commonly in web analytics.

• Visitors. The people who have visited your website. Usually a visitor is counted only once during the time period of a report. Some analytics packages use the term “unique visitor” to indicate that they’re not counting the same person more than once. Some also report “new visitors” to distinguish them from ones who have been to your site before.

• Visits. The individual times that your website was accessed; sometimes also called “sessions.” An individual visitor can have multiple visits to your site during the time period of the report.

• Page views. The number of times individual pages on your site are viewed. If a visitor reloads a page, that typically counts as a new page view; likewise, if visitors navigate to another page in your site and then return to a page, that will count as a new page view. Page views let you see which pages on your site are the most popular.

• Landing page or entrance page. The first page that a visitor visits on your site. This is often the home page, but might be a lower level page if they found it through a search engine or had bookmarked it.

• Exit page. The last page that a visitor visits on your site.

• Bounce rate. The percentage of visits in which the visitor views only one page on your site and then leaves the site. This could indicate a lack of engagement with your site, but it could also mean that they found what they were looking for from that one page.

• Exit rate (for a page). The percentage of visits to a given page in which that page was the last one viewed before the visitor left the site. Exit rate, which is a metric at an individual page level, is often confused with bounce rate, which is an overall metric for a site.

• Conversion rate. The percentage of visitors to a site who convert from being simply a casual visitor to taking some action, such as making a purchase, signing up for a newsletter, or opening an account.

A number of tools are available for capturing web analytics. Most web-hosting services provide basic analytics as part of the hosting service, and other web analytics services are available for free. Perhaps the most popular free analytics service is Google Analytics (http://www.google.com/analytics/). Figure 9.1 shows a screenshot from Google Analytics.

image

Figure 9.1 Sample Google Analytics screen for the MeasuringUX.com site.

As can be seen in Figure 9.1, you can look at many of the metrics for your site over time, such as line graphs for visits, average visit duration, and page views. These graphs of visits and page views show a pattern that’s typical for some websites, which is a difference in the number of visitors, visits, and page views for the weekend vs the weekdays. You also can capture some basic information about the visitors to your site, such as the operating system they’re running, their screen resolution, and the browsers they’re using, as illustrated in the pie charts.

Simply looking at the number of page views for various pages in your site can be enlightening, especially over time or across iterations of the site. For example, assume that a page about Product A on your site was averaging 100 page views per day for a given month. Then you modified the homepage for your site, including the description of the link to Product A’s page. Over the next month, the Product A page then averaged 150 page views per day. It would certainly appear that the changes to the homepage significantly increased the number of visitors accessing the Product A page. But you need to be careful that other factors didn’t cause the increase. For example, in the financial-services world, certain pages have seasonal differences in their number of page views. A page about contributions to an Individual Retirement Account (IRA), for example, tends to get more visits in the days leading up to April 15 because of the deadline in the United States for contributing to the prior year’s IRA.

It’s also possible that something caused your site as a whole to start getting more visitors, which certainly could be a good thing. But it could also be due to factors not related to the design or usability of your site, such as news events related to the subject matter of your site. This also brings up the issue of the impact that search “bots” can have on your site’s statistics. Search bots, or spiders, are automated programs used by most of the major search engines to “crawl” the web by following links and indexing the pages they access. One of the challenges, once your site becomes popular and is being “found” by most of the major search engines, is filtering out the page views due to these search bots. Most bots (e.g., Google, Yahoo!) usually identify themselves when making page requests and thus can be filtered out of the data.

What analyses can be used to determine if one set of page views is significantly different from another set? Consider the data shown in Table 9.1, which shows the number of page views per day for a given page over two different weeks. Week 1 was before a new homepage with a different link to the page in question was launched, and Week 2 was after.

Table 9.1

Number of page views for a given web page over 2 different weeks.ᵃ

Week 1 Week 2
Sun 237 282
Mon 576 623
Tue 490 598
Wed 523 612
Thu 562 630
Fri 502 580
Sat 290 311
Averages 454 519

ᵃ Week 1 was before a new homepage was launched, and Week 2 was after. The new homepage contained different wording for the link to this page.

These data can be analyzed using a paired t test to see if the average for Week 2 (519) is significantly different from the average for Week 1 (454). It’s important to use a paired t test because of the variability due to the days of the week; comparing each day to itself from the previous week takes out the variability due to days. A paired t test shows that this difference is statistically significant (p < 0.01). If you had not used a paired t test, and just used a t test for two independent samples, the result (p = 0.41) would not have been anywhere near significant. (See Chapter 2 for details on how to run a paired t test in Excel.)
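If you’d rather script this analysis than run it in Excel, here is a minimal sketch of the same comparison in Python using SciPy, with the data from Table 9.1.

```python
# Paired vs independent t tests on the page-view data from Table 9.1.
from scipy import stats

week1 = [237, 576, 490, 523, 562, 502, 290]  # Sun-Sat, before the new homepage
week2 = [282, 623, 598, 612, 630, 580, 311]  # Sun-Sat, after the new homepage

# Paired test: each day is compared to the same day of the prior week,
# which removes the large day-of-week variability.
t_paired, p_paired = stats.ttest_rel(week2, week1)
print(f"Paired t test:      p = {p_paired:.4f}")   # well under 0.01

# For contrast, the (inappropriate) independent-samples test on the same data.
t_indep, p_indep = stats.ttest_ind(week2, week1)
print(f"Independent t test: p = {p_indep:.2f}")    # about 0.4, not significant
```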

9.1.2 Click-Through Rates

Click-through rates can be used to measure the effectiveness of different ways of presenting a link or button. They indicate the percentage of visitors who, when shown a particular link or button, actually click on it. If a link is shown 100 times and it is clicked on 1 of those times, its click-through rate is 1%. Most commonly the term is used to measure the effectiveness of web ads, but the concept applies to any link, button, or clickable image. For example, Holland (2012a) describes a study of two different buttons on a product page for an ecommerce site, as shown in Figure 9.2. The only difference between the two pages was the wording of the green button: “Personalize Now” vs “Customize It”. The click-through rate for the button labeled “Personalize Now” was 24% higher. They continued tracking through to actual sales and found that clicks on that version of the button also resulted in 48% higher revenue per visitor. Why did the “Personalize Now” text yield more clicks and more sales? We could speculate, but we don’t really know. That’s one of the limitations of live-site data.

image

Figure 9.2 Example of two button designs tested on a product page.

What analyses can be used to determine if the click-through rate for one link is significantly different from that for another link? One appropriate analysis is the χ2 test. A χ2 test lets you determine whether an observed set of frequencies is significantly different from an expected set of frequencies. (See Chapter 2 for more details.) For example, consider the data shown in Table 9.2 that represent click rates for two different links. The click-through rate for Link #1 is 1.4% [145/(145 + 10,289)]. The click-through rate for Link #2 is 1.7% [198/(198 + 11,170)]. But are these two significantly different from each other? Link #2 got more clicks, but it was also presented more times. To do a χ2 test, you must first construct a table of expected frequencies as if there were no difference in the click-through rates of Link #1 and Link #2. This is done using the sums of the rows and columns of the original table, as shown in Table 9.3.

Table 9.2

Click rates for two different links: The number of times each link was clicked and the number of times each was presented but not clicked.

Click No Click
Link #1 145 10289
Link #2 198 11170

Table 9.3

Same data as Table 9.2 but with sums of rows and columns added.ᵃ

Image

ᵃ These are used to calculate expected frequencies if there were no differences in the click-through rates.

By taking the product of each pair of row and column sums and dividing that by the grand total you get the expected values as shown in Table 9.4. For example, the expected frequency for a “Click” on “Link #1” (164.2) is the product of the respective row and column sums divided by the grand total: (343×10,434)/21,802. The “CHITEST” function in Excel can then be used to compare the actual frequencies in Table 9.2 to the expected frequencies in Table 9.4. The resulting value is p = 0.04, indicating that a significant difference exists between the click-through rates for Link #1 and Link #2.

Table 9.4

Expected frequencies if there were no differences in click-through rates for Link #1 and Link #2, derived from sums shown in Table 9.3.

Expected Click No Click
Link #1 164.2 10269.8
Link #2 178.8 11189.2

You should keep two important points about the χ2 test in mind. First, the χ2 test must be done using raw frequencies or counts, not percentages. Click-through rates are commonly thought of in terms of percentages, but that’s not how you test for significant differences between them. Second, the categories used must be mutually exclusive and exhaustive, which is why the preceding example used “Click” and “No Click” as the two categories of observations for each link. Those two categories are mutually exclusive and account for all possible actions that could be taken on the link—either the user clicked on it or didn’t.
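If you prefer to script the test rather than use Excel’s CHITEST, here is a minimal sketch using SciPy, which computes the expected frequencies of Table 9.4 for you.

```python
# Chi-square test of the click-through data in Table 9.2.
from scipy.stats import chi2_contingency

observed = [[145, 10289],   # Link #1: clicks, non-clicks
            [198, 11170]]   # Link #2: clicks, non-clicks

# correction=False gives the ordinary Pearson chi-square, matching CHITEST.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(f"chi-square = {chi2:.2f}, p = {p:.3f}")  # p is about 0.04
print(expected)  # matches the expected frequencies in Table 9.4
```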

9.1.3 Drop-Off Rates

Drop-off rates can be a particularly useful way of detecting where there might be some usability problems on your site. The most common use of drop-off rates is to identify where in a sequence of pages users are dropping out of or abandoning a process, such as opening an account or completing a purchase. For example, assume that the user must fill out the information on a sequence of five pages to open some type of account. Table 9.5 reflects the percentage of users who started the process and actually completed each of the five pages.

Table 9.5

Percentage of users who started a multipage process and actually completed each of the steps.

Page #1 89%
Page #2 80%
Page #3 73%
Page #4 52%
Page #5 49%

In this example, all of the percentages are relative to the number of users who started the entire process—that is, who got to Page #1. So 89% of the users who got to Page #1 completed it successfully, 80% of that original number completed Page #2, and so on. Given the data in Table 9.5, which of the five pages do the users seem to be having the most trouble with? The key is to look at how many users dropped off from each page—in other words, the difference between how many got to the page and how many completed it. Those “drop-off percentages” for each of the pages are shown in Table 9.6.

Table 9.6

Drop-off percentages for each page shown in Table 9.5: The difference between percentage who got to the page and percentage who completed it successfully.

Page #1 11%
Page #2 9%
Page #3 7%
Page #4 21%
Page #5 3%

This makes it clear that the largest drop-off rate, 21%, is associated with Page #4. If you’re going to redesign this multipage process, you would be well advised to learn what’s causing the drop-off at Page #4 and then try to address that in the redesign.
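Computing drop-off percentages from cumulative completion data like Table 9.5 is simple arithmetic; here is a minimal sketch.

```python
# Drop-off percentage for each page: the difference between the percentage of
# users who reached the page and the percentage who completed it.
completed = {"Page #1": 89, "Page #2": 80, "Page #3": 73,
             "Page #4": 52, "Page #5": 49}   # % of starters completing each page

reached = 100  # everyone who started the process reached Page #1
for page, pct in completed.items():
    print(f"{page}: drop-off = {reached - pct}%")
    reached = pct  # users who completed this page went on to the next one
```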

9.1.4 A/B Tests

A/B tests are a special type of live-site study in which you manipulate elements of the pages that are presented to the users. The traditional approach to A/B testing on a website involves posting two alternative designs for a given page or elements of a page. Some visitors to the site see the “A” version whereas others see the “B” version. In many cases, this assignment is random, so about the same number of visitors sees each version. In some cases, the majority of visitors see the existing page, and a smaller percentage see an experimental version that’s being tested. Although these studies are typically called A/B tests, the same concept applies to any number of alternative designs for a page.

What Makes a Good A/B Test?

A good A/B test requires careful planning. Here are some tips to keep in mind:

• Make sure the method you’re using to “split” visitors between “A” and “B” versions really is random. If someone tells you it’s good enough to just send all visitors in the morning to version “A” and all visitors in the afternoon to version “B”, don’t believe it. There could be something different about morning visitors vs afternoon visitors.

• Test small changes, especially at first. It might be tempting to design two completely different versions of a page, but you’ll learn much more by testing small differences. If the two versions are completely different from each other, and one performs significantly better than the other, you still don’t know why that one was better. If the only difference is, for example, wording of the call-to-action button, then you know the difference is due to that wording.

• Test for significance. It might look like one version is beating the other one, but do a statistical test (e.g., χ2) to make sure.

• Be agile. When you’re confident that one version is outperforming the other, then “promote” the winning version (i.e., send all visitors to it) and move on to another A/B test.

• Believe the data, not the HIPPO (Highest Paid Person’s Opinion). Sometimes the results of A/B tests are surprising and counterintuitive. One of the advantages that UX researchers bring to the mix is that you can follow up on these surprising findings using other techniques (e.g., surveys, lab, or online studies) to try to understand them better.

Technically, visitors to a page can be directed to one of the alternative pages in a variety of ways, including random number generation or the exact time of the request (e.g., an even or odd number of seconds since midnight). Typically, a cookie is set to indicate which version the visitor was shown so that if he or she returns to the site within a specified time period, the same version will be shown again. Keep in mind that it’s important to test the alternative versions at the same time because of the external factors mentioned earlier that could affect the results if you tested at different times.
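As one illustration of a stable random split (not a description of any particular product), here is a minimal sketch that hashes a persistent visitor identifier, so a returning visitor gets the same version even if the cookie is lost. The visitor ID and version labels are hypothetical; commercial A/B testing tools handle assignment, cookies, and significance testing for you.

```python
# Minimal sketch of stable random assignment for an A/B test.
import hashlib

def assign_version(visitor_id: str, versions=("A", "B")) -> str:
    """Hash a persistent visitor ID so the same visitor always gets the same version."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return versions[int(digest, 16) % len(versions)]

print(assign_version("visitor-12345"))  # always returns the same version for this ID
```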

Holland (2012b) described an A/B test of the page layout for an online newspaper. As shown in Figure 9.3, the two versions differed in the relationship of the photos to the articles they accompanied. In Version A, photos alternated between the left and the right of the articles. In Version B, photos were always to the right. They measured the click rates on the articles (to read the full version of the article). Version B, with article photos always to the right, increased clicks by 20% and total site pages viewed by 11%.

WhichTestWon.com

Anne Holland runs a website called “Which Test Won” that’s a treasure trove of examples of A/B tests. As of the writing of the second edition of this book, she has about 300 examples of different A/B tests on her site, ranging from tests that manipulated entire page designs to tests where the only difference was the color of a single button. She posts a new test every week, encouraging readers to guess whether the A or the B version of the test won. She also has a free e-mail newsletter alerting readers to new tests on the site.

image

Figure 9.3 Two versions of page layout for an online newspaper. In Version A (left), photos for each article alternated between the left and the right of the article. In Version B (right), they were always to the right.

Carefully designed A/B tests can give you significant insight into what works and what doesn’t work on your website. Many companies, including Amazon, eBay, Google, Microsoft, Facebook, and others, are constantly doing A/B tests on their live sites, although most users don’t notice it (Kohavi, Crook, & Longbotham, 2009; Kohavi, Deng, Frasca, Longbotham, Walker, & Xu, 2012; Tang, Agarwal, O’Brien, & Meyer, 2010). In fact, as Kohavi and Round (2004) explained, A/B testing is constant at Amazon, and experimentation through A/B testing is the main way they make changes to their site.

9.2 Card-Sorting Data

Card sorting as a technique for organizing the elements of an information system in a way that makes sense to the users has been around at least since the early 1980s. For example, Tullis (1985) used the technique to organize the menus of a mainframe operating system. More recently, the technique has become popular as a way of informing decisions about the information architecture of a website (e.g., Maurer & Warfel, 2004; Spencer, 2009). Over the years the technique has evolved from a true card-sorting exercise using index cards to an online exercise using virtual cards. Although many UX professionals seem to be familiar with the basic card-sorting techniques, fewer seem to be aware that various metrics can be used in the analyses of card-sorting data.

The two major types of card-sorting exercises are (1) open card sorts, where you give users the cards that are to be sorted but let them define their own groups that the cards will be sorted into, and (2) closed card sorts, where you give users the cards to be sorted as well as the names of the groups to sort them into. Although some metrics apply to both, others are unique to each.

Card-Sorting Tools

A number of tools are available for conducting card-sorting exercises. Some are desktop applications, whereas others are web based. Most of these include basic analysis capabilities (e.g., hierarchical cluster analysis). Here are some of the ones we’re familiar with:

• CardZort (http://www.cardzort.com/cardzort/) (a Windows application)

• OptimalSort (http://www.optimalworkshop.com/optimalsort.htm) (a web-based service)

• UsabiliTest Card sorting (http://www.usabilitest.com/CardSorting) (a web-based service)

• UserZoom Card sorting (http://www.userzoom.com/products/card-sorting) (a web-based service)

• UzCardSort (http://uzilla.mozdev.org/cardsort.html) (a Mozilla extension)

• Websort (http://www.websort.net/) (a web-based service)

• XSort (http://www.xsortapp.com/) (a Mac OS X application)

Although it’s not a card-sorting tool, PowerPoint (or a similar program) can also be used for card-sorting exercises when the number of cards is relatively small. Create a slide that has the cards to be sorted along with empty boxes and then e-mail that to participants, asking them to put the cards into the boxes and to name the boxes. Then they simply e-mail the file back. Of course, you’re on your own for the analysis in this case.

9.2.1 Analyses of Open Card-Sort Data

One way to analyze data from an open card sort is to create a matrix of the “perceived distances” (also called a dissimilarity matrix) among all pairs of cards in the study. For example, assume you conducted a card-sorting study using 10 fruits: apples, oranges, strawberries, bananas, peaches, plums, tomatoes, pears, grapes, and cherries. Assume one participant in the study created the following names and groupings:

• “Large, round fruits”: apples, oranges, peaches, tomatoes

• “Small fruits”: strawberries, grapes, cherries, plums

• “Funny-shaped fruits”: bananas, pears

You can then create a matrix of “perceived distances” among all pairs of fruits for each participant using the following rules:

• If this person put a pair of cards in the same group, it gets a distance of 0.

• If this person put a pair of cards into different groups, it gets a distance of 1.

Using these rules, the distance matrix for the preceding participant would look like what’s shown in Table 9.7.
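Here is a minimal sketch of these rules in Python, assuming each participant’s sort is recorded as a list of groups (each group a list of card names); summing the per-participant matrices gives the overall distance matrix discussed later in this section.

```python
# Build a 0/1 distance (dissimilarity) matrix for one participant's card sort,
# then sum across participants to get the overall matrix (as in Table 9.8).
import numpy as np

cards = ["apples", "oranges", "strawberries", "bananas", "peaches",
         "plums", "tomatoes", "pears", "grapes", "cherries"]

def distance_matrix(groups, cards):
    """Distance 0 for cards sorted into the same group, 1 otherwise."""
    group_of = {card: i for i, group in enumerate(groups) for card in group}
    n = len(cards)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and group_of[cards[i]] != group_of[cards[j]]:
                d[i, j] = 1
    return d

# The example participant's sort described above:
one_sort = [["apples", "oranges", "peaches", "tomatoes"],
            ["strawberries", "grapes", "cherries", "plums"],
            ["bananas", "pears"]]

all_sorts = [one_sort]  # in a real study, one entry per participant
overall = sum(distance_matrix(s, cards) for s in all_sorts)
```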

Card-Sort Analysis Spreadsheets

Donna Maurer has developed an Excel spreadsheet for the analysis of card-sorting data. She uses some very different techniques for exploring the results of a card-sorting exercise than the more statistical techniques we’re describing here, including support for the person doing the analysis to standardize the categories by grouping the ones that are similar. The spreadsheet and instructions can be downloaded from http://www.rosenfeldmedia.com/books/cardsorting/blog/card_sort_analysis_spreadsheet/.

In addition, Mike Rice has developed a spreadsheet for creating a co-occurrence matrix from card-sorting data. This type of analysis allows you to see how often any two cards were sorted into the same group. His analysis spreadsheet works with the same spreadsheets that Donna Maurer uses for her analyses. Mike’s analysis spreadsheet, and the instructions for using it, can be found at http://www.informoire.com/co-occurrence-matrix/.

Table 9.7

Distance matrix for one participant in the fruit card-sorting example.

Image

We’re only showing the top half of the matrix for simplicity, but the bottom half would be exactly the same. The diagonal entries are not meaningful because the distance of a card from itself is undefined. (Or it can be assumed to be zero if needed in the analyses.) So for any one participant in the study, the entries in this matrix will only be 0’s or 1’s. The key is to then combine these matrices for all the participants in the study. Let’s assume you had 20 participants do the card-sorting exercise with the fruits. You can then sum the matrices for the 20 participants. This will create an overall distance matrix whose values can, in theory, range from 0 (if all participants put that pair into the same group) to 20 (if all participants put that pair into different groups). The higher the number, the greater the distance. Table 9.8 shows an example of what that might look like. In this example, only 2 of the participants put the oranges and peaches in different groups, whereas all 20 of the participants put the bananas and tomatoes into different groups.

Table 9.8

Overall distance matrix for 20 participants in the fruit card-sorting study.

Image

This overall matrix can then be analyzed using any of several standard statistical methods for studying distance (or similarity) matrices. Two that we find useful are hierarchical cluster analysis (e.g., Aldenderfer & Blashfield, 1984) and multidimensional scaling (MDS) (e.g., Kruskal & Wish, 2006). Both are available in a variety of commercial statistical analysis packages, including SAS (http://www.sas.com), IBM SPSS (http://www.spss.com), and NCSS (http://www.ncss.com/), as well as some add-on packages for Excel (e.g., Unistat, http://www.unistat.com; XLStat, http://www.xlstat.com).

Hierarchical Cluster Analysis

The goal of hierarchical cluster analysis is to build a tree diagram where the cards that were viewed as most similar by the participants in the study are placed on branches that are close together. For example, Figure 9.4 shows the result of a hierarchical cluster analysis of the data in Table 9.8. The key to interpreting a hierarchical cluster analysis is to look at the point at which any given pair of cards “join together” in the tree diagram. Cards that join together sooner are more similar to each other than those that join together later. For example, the pair of fruits with the lowest (shortest) distance in Table 9.8 (peaches and oranges; distance = 2) joins together first in the tree diagram.

image

Figure 9.4 Result of a hierarchical cluster analysis of data shown in Table 9.8.

Several different algorithms can be used in hierarchical cluster analysis to determine how the “linkages” are created. Most of the commercial packages that support hierarchical cluster analysis let you choose which method to use. The linkage method we think works best is one called the Group Average method. But you might want to experiment with some of the other linkage methods to see what the results look like; there’s no absolute rule saying one is better than another.

One thing that makes hierarchical cluster analysis so appealing for use in the analysis of card-sorting data is that you can use it to directly inform how you might organize the cards (pages) in a website. One way to do this is to take a vertical “slice” through the tree diagram and see what groupings that creates. For example, Figure 9.4 shows a four-cluster “slice”: The vertical line intersects four horizontal lines, forming the four groups whose members are color coded. How do you decide how many clusters to create when taking a “slice” like this? Again, there’s no fixed rule, but one method we like is to calculate the average number of groups of cards created by the participants in the card-sorting study and then try to approximate that.
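If you want to run this kind of analysis yourself rather than use a commercial package, here is a minimal sketch using SciPy; it assumes `overall` is the summed distance matrix (with zeros on the diagonal, as in Table 9.8) and `cards` is the corresponding list of card names from the earlier sketch.

```python
# Hierarchical cluster analysis with average ("Group Average") linkage,
# plus a four-cluster "slice" through the resulting tree.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

condensed = squareform(overall)               # condensed form required by linkage
tree = linkage(condensed, method="average")   # try "complete" or "single" as well

dendrogram(tree, labels=cards, orientation="left")
plt.show()                                    # a tree diagram like Figure 9.4

groups = fcluster(tree, t=4, criterion="maxclust")  # the four-cluster slice
for card, group in zip(cards, groups):
    print(card, "-> group", group)
```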

After taking a “slice” through the tree diagram and identifying the groups created by that, the next thing you might want to do is determine how those groups compare to the original card-sorting data—in essence, to come up with a “goodness-of-fit” metric for your derived groups. One way of doing that is to compare the pairings of cards in your derived groups with the pairings created by each participant in the card-sorting study and to identify what percentage of the pairs match. For example, for the data in Table 9.7, only 7 of the 45 pairs do not match those identified in Figure 9.4. The 7 nonmatching pairings are apples–tomatoes, apples–pears, oranges–tomatoes, oranges–pears, bananas–pears, peaches–tomatoes, and peaches–pears. That means 38 pairings do match, or 84% (38/45). Averaging these matching percentages across all the participants will give you a measure of the goodness of fit for your derived groups relative to the original data.
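Here is a minimal sketch of that pair-matching calculation. The participant’s sort is the one from Table 9.7, and the derived groups are reconstructed from the description of the four-cluster slice in Figure 9.4.

```python
# Goodness of fit for derived groups: the percentage of card pairs where a
# participant's sort and the derived groups agree on "same group" vs
# "different group," averaged across participants.
from itertools import combinations

def pair_agreement(participant_groups, derived_groups):
    to_map = lambda groups: {c: i for i, g in enumerate(groups) for c in g}
    p, d = to_map(participant_groups), to_map(derived_groups)
    pairs = list(combinations(p, 2))
    matches = sum((p[a] == p[b]) == (d[a] == d[b]) for a, b in pairs)
    return matches / len(pairs)

participant = [["apples", "oranges", "peaches", "tomatoes"],
               ["strawberries", "grapes", "cherries", "plums"],
               ["bananas", "pears"]]
derived = [["apples", "oranges", "peaches", "pears"],
           ["strawberries", "grapes", "cherries", "plums"],
           ["bananas"], ["tomatoes"]]

print(pair_agreement(participant, derived))  # 38 of 45 pairs match, about 0.84
```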

Multidimensional Scaling

Another way of analyzing and visualizing data from a card-sorting exercise is using multidimensional scaling (MDS). Perhaps the best way to understand MDS is through an analogy. Imagine that you had a table of the mileages between all pairs of major U.S. cities but not a map of where those cities are located. An MDS analysis could take that table of mileages and derive an approximation of the map showing where those cities are relative to each other. In essence, MDS tries to create a map in which the distances between all pairs of items match the distances in the original distance matrix as closely as possible.

The input to an MDS analysis is the same as the input to hierarchical cluster analysis—a distance matrix, like the example shown in Table 9.8. The result of an MDS analysis of the data in Table 9.8 is shown in Figure 9.5. The first thing that’s apparent from this MDS analysis is how the tomatoes and bananas are isolated from all the other fruit. That’s consistent with the hierarchical cluster analysis, where those two fruits were the last two to join all the others. In fact, our four-cluster “slice” of the hierarchical cluster analysis (Figure 9.4) had these two fruits as groups unto themselves. Another thing apparent from the MDS analysis is how the strawberries, grapes, cherries, and plums cluster together on the left, and the apples, peaches, pears, and oranges cluster together on the right. That’s also consistent with the hierarchical cluster analysis.
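As a minimal sketch (again assuming `overall` and `cards` from the earlier sketches), a two-dimensional MDS map like Figure 9.5 can be generated with scikit-learn.

```python
# Multidimensional scaling (MDS) of the overall card-sort distance matrix.
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(overall)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), card in zip(coords, cards):
    plt.annotate(card, (x, y))
plt.show()  # a two-dimensional "map" like Figure 9.5

# Note: mds.stress_ is scikit-learn's raw stress value, not the normalized
# stress discussed later, so the rule-of-thumb thresholds don't apply to it directly.
```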

How Many Participants Are Enough for a Card-Sorting Study?

Tullis and Wood (2004) conducted a card-sorting study in which they addressed the question of how many people are needed for a card-sorting study if you want reliable results from your analyses. They did an open sort with 46 cards and 168 participants. They then analyzed the results for the full data set (168 participants), as well as many random subsamples of the data from 2 to 70 participants. Correlations of the results for those subsamples to the full data set looked like the chart here.

image

The “elbow” of that curve appears to be somewhere between 10 and 20, with a sample size of 15 yielding a correlation of 0.90 with the full data set. Although it’s hard to know how well these results would generalize to other card-sorting studies with different subject matter or different numbers of cards, they at least suggest that about 15 may be a good target number of participants.

image

Figure 9.5 Multidimensional scaling analysis of the distance matrix in Table 9.8.

Note that it’s also possible to use more than two dimensions in an MDS analysis, but we’ve rarely seen a case where adding even just one more dimension yields particularly useful insights into card-sorting data. Another point to keep in mind is that the orientation of the axes in an MDS plot is arbitrary. You could rotate or flip the map any way you want and the results would still be the same. The only thing that’s actually important is the relative distances between all pairs of the items.

The most common metric that’s used to represent how well an MDS plot reflects the original data is a measure of “stress” that’s sometimes referred to as Phi. Most of the commercial packages that do MDS analysis can also report the stress value associated with a solution. Basically, it’s calculated by looking at all pairs of items, finding the difference between each pair’s distance in the MDS map and its distance in the original matrix, squaring that difference, and summing those squares. That measure of stress for the MDS map shown in Figure 9.5 is 0.04. The smaller the value, the better. But how small does it really need to be? A good rule of thumb is that stress values under 0.10 are excellent, whereas stress values above 0.20 are poor.

We find that it’s useful to do both a hierarchical cluster analysis and an MDS analysis. Sometimes you see interesting things in one that aren’t apparent in the other. Because they are different statistical analysis techniques, you shouldn’t expect them to give exactly the same answers. For example, one thing that’s sometimes easier to see in an MDS map is which cards are “outliers”—those that don’t obviously belong with a single group. There are at least two reasons why a card could be an outlier: (1) It could truly be an outlier—a function that really is different from all the others, or (2) it could have been “pulled” toward two or more groups. In the latter case, when designing a website, you would probably want to make that function available from each of those clusters.

9.2.2 Analyses of Closed Card-Sort Data

Closed card sorts, where you not only give participants the cards but also the names of the groups in which to sort them, are probably done less often than open card sorts. Typically, you would start with an open sort to get an idea of the kinds of groups that users would naturally create and the names they might use for them. Sometimes it’s helpful to follow up an open sort with one or more closed sorts, mainly as a way of testing your ideas about organizing the functions and naming the groupings. With a closed card sort you have an idea about how you want to organize the functions, and you want to see how close users come to matching the organization you have in mind.

We used closed card sorting to compare different ways of organizing the functions for a website (Tullis, 2007). We first conducted an open sort with 54 functions. We then used those results to generate six different ways of organizing the functions that we then tested in six simultaneous closed card-sorting exercises. Each closed card sort used the same 54 functions but presented different groups to sort the functions into. The number of groups in each “framework” (set of group names) ranged from three to nine. Each participant only saw and used one of the six frameworks.

In looking at the data from a closed card sort, the main thing you’re interested in is how well the groups “pulled” the cards to them that you intend to belong to those groups. For example, consider the data in Table 9.9 which shows the percentage of participants in a closed card-sorting exercise who put each card into each of the groups.

Table 9.9

Percentage of participants in a closed card sort who put each of 10 cards into each of the three groups provided.

Image

The additional percentage, shown on the right in Table 9.9, is the highest percentage for each card across the three groups. This is an indicator of how well the “winning” group pulled the appropriate cards to it. What you hope to see are cases like Card #10 in this table, which was pulled very strongly to Group C, with 92% of the participants putting it in that group. Ones that are more troubling are cases such as Card #7, where 46% of the participants put it in Group A, but 37% put it in Group C—participants were very “split” in terms of deciding where that card belonged in this set of groups.

One metric you could use for characterizing how well a particular set of group names fared in a closed card sort is the average of these maximum values for all the cards. For the data in Table 9.9, that would be 73%. But what if you want to compare results from closed card sorts that had the same cards but different sets of groups? That average maximum percentage will work well for comparisons as long as each set contained the same number of groups. But if one set had only three groups and another had nine groups, as in the Tullis (2007) study, it’s not a fair metric for comparison. If participants were simply acting randomly in doing the sorting with only three groups, by chance they would get a maximum percentage of 33%. But if they were acting randomly in doing a sort with nine groups, they would get a maximum percentage of only 11%. So using this metric, a framework with more groups is at a disadvantage in comparison to one with fewer groups.

We experimented with a variety of methods to correct for the number of groups in a closed card sort. The one that seems to work best is illustrated in Table 9.10. These are the same data as shown earlier in Table 9.9 but with two additional columns. The “2nd place” column gives the percentage associated with the group that had the next-highest percentage. The “Difference” column is simply the difference between the maximum percentage and the 2nd-place percentage. A card that was pulled strongly to one group, such as Card #10, gets a relatively small penalty in this scheme. But a card that was split more evenly, such as Card #7, takes quite a hit.

Table 9.10

Same data as shown in Table 9.9 but with an additional two columns.ᵃ

Image

ᵃ “2nd place” refers to the next-highest percentage after the maximum percentage, and “Difference” indicates the difference between the maximum percentage and the 2nd-place percentage.

The average of these differences can then be used to make comparisons between frameworks that have different numbers of groups. For example, Figure 9.6 shows the data from Tullis (2007) plotted using this method. We call this a measure of the percent agreement among the participants about which group each card belongs to. Obviously, higher values are better.

image

Figure 9.6 Comparison of six frameworks in six parallel closed card sorts. Because the frameworks had different numbers of groups, a correction was used in which the percentage associated with the 2nd-place group was subtracted from the winning group. Adapted from Tullis (2007); used with permission.
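Here is a minimal sketch of this calculation. Only the percentages for Cards #7 and #10 are based on values mentioned in the text (with the remaining percentages filled in illustratively); a real data set would come from your own closed sort.

```python
# For each card: the winning group's percentage minus the 2nd-place percentage,
# then the average of those differences across cards ("percent agreement").
def difference_scores(sort_results):
    diffs = {}
    for card, pcts in sort_results.items():
        ranked = sorted(pcts.values(), reverse=True)
        diffs[card] = ranked[0] - ranked[1]
    return diffs

closed_sort = {
    "Card #7":  {"Group A": 46, "Group B": 17, "Group C": 37},  # a "split" card
    "Card #10": {"Group A": 5,  "Group B": 3,  "Group C": 92},  # strongly pulled
}
diffs = difference_scores(closed_sort)
print(diffs)                                   # {'Card #7': 9, 'Card #10': 87}
print(sum(diffs.values()) / len(diffs))        # average difference score
```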

Data from a closed card sort can also be analyzed using hierarchical cluster analysis and MDS analysis, just like data from an open card sort. These give you visual representations of how well the framework you presented to the participants in the closed card sort actually worked for them.

9.2.3 Tree Testing

A technique that’s closely related to closed card sorting is tree testing. This is a technique where you provide an interactive representation of the proposed information organization for a site, typically in the form of menus that let the user traverse the information hierarchy. For example, Figure 9.7 shows a sample study in Treejack (http://www.optimalworkshop.com/treejack.htm) from the participant’s perspective.

image

Figure 9.7 Sample study in Treejack. The task is shown at the top. Initially the participant sees only the menu on the left. After selecting “Cell Phones & Plans” from that menu, a submenu is shown. This continues until the participant chooses the “I’d find it here” button. The participant can go back up the tree at any time.

Although the interface is very different, conceptually this is similar to a closed card-sorting exercise. In a tree test, each task is similar to a “card” in that the participants are telling you where they would expect to find that element in the tree structure.

Figure 9.8 shows an example of data for one task provided by Treejack, including the following:

• Task success data. You tell Treejack which nodes in the tree you consider to be successful for each task.

• Directness. This is the percentage of participants who didn’t backtrack up the tree at any point during the task. This can be a useful indication of how confident the participants are in making their selections.

• Time taken. Average time taken by participants to complete the task.

image

Figure 9.8 Sample data for one task in Treejack, including task success, directness, and time taken.

And all three of these metrics are shown with 95% confidence intervals!

Treejack also provides an interesting visualization of data for each task called a “PieTree,” shown in Figure 9.9. In this visualization, the size of each node reflects the number of participants who visited that node for this task. Colors within each node reflect the percentage of participants who continued down a correct path, an incorrect path, or nominated a “leaf” node as the correct answer. In the online version of the PieTree, hover information for each node gives you more details about what the participants did at that node.

Some Tree-Testing Tools

The following are some of the tree-testing tools that we’re aware of:

• C-Inspector (http://www.c-inspector.com)

• Optimal Workshop’s Treejack (http://www.optimalworkshop.com/treejack.htm)

• PlainFrame (http://uxpunk.com/plainframe/)

• UserZoom Tree Testing (http://www.userzoom.com/products/tree-testing)

image

Figure 9.9 A “PieTree” from Treejack showing the paths that participants took in performing a single task, in this case indicating where they would expect to find information about lowest-cost home Internet plans. Green highlights the correct path (starting from the center).

9.3 Accessibility Data

Accessibility usually refers to how effectively someone with disabilities can use a particular system, application, or website (e.g., Cunningham, 2012; Henry, 2007; Kirkpatrick et al., 2006). We believe that accessibility is really just usability for a particular set of users. When viewed that way, it becomes obvious that most of the other metrics discussed in this book (e.g., task completion rates and times, self-reported metrics) can be applied to measure the usability of any system for users with different types of disabilities. For example, Nielsen (2001) reported four usability metrics from a study of 19 websites with three groups of users: blind users, who accessed the sites using screen-reading software; low-vision users, who accessed the sites using screen-magnifying software; and a control group who did not use assistive technology. Table 9.11 shows results for the four metrics.

Table 9.11

Data from usability tests of 19 websites with blind users, low-vision users, and users with normal vision.ᵃ

Image

ᵃ Adapted from Nielsen (2001b); used with permission.

These results point out that the usability of these sites is far worse for the screen-reader and screen-magnifier users than it is for the control users. But the other important message is that the best way to measure the usability of a system or website for users with disabilities is to actually test with representative users. Although that’s a very desirable objective, most designers and developers don’t have the resources to test with representative users from all the disability groups that might want to use their product. That’s where accessibility guidelines can be helpful.

Perhaps the most widely recognized web accessibility guidelines are the Web Content Accessibility Guidelines (WCAG), Version 2.0, from the World Wide Web Consortium (W3C) (http://www.w3.org/TR/WCAG20/). These guidelines are divided into four major categories:

1. Perceivable

a. Provide text alternatives for nontext content.

b. Provide captions and other alternatives for multimedia.

c. Create content that can be presented in different ways, including assistive technologies, without losing meaning.

d. Make it easier for users to see and hear content.

2. Operable

a. Make all functionality available from a keyboard.

b. Give users enough time to read and use content.

c. Do not use content that causes seizures.

d. Help users navigate and find content.

3. Understandable

a. Make text readable and understandable.

b. Make content appear and operate in predictable ways.

c. Help users avoid and correct mistakes.

4. Robust

a. Maximize compatibility with current and future user tools.

One way of quantifying how well a website meets these criteria is to assess how many of the pages in the site fail one or more of each of these guidelines.

Some automated tools can check for certain obvious violations of these guidelines (e.g., missing “Alt” text on images). Although the errors they detect are generally true errors, they also commonly miss many errors. Many of the items that the automated tools flag as warnings may in fact be true errors, but it takes a human to find out. For example, if an image on a web page has null Alt text defined (ALT=“”), that may be an error if the image is informational or it may be correct if the image is purely decorative. The bottom line is that the only really accurate way to determine whether accessibility guidelines have been met is by manual inspection of the code or by evaluation using a screen reader or other appropriate assistive technology. Often both techniques are needed.

Automated Accessibility-Checking Tools

Some of the tools available for checking web pages for accessibility errors include the following:

• Cynthia Says (http://www.contentquality.com/)

• Accessibility Valet Demonstrator (http://valet.webthing.com/access/url.html)

• WebAIM’s WAVE tool (http://wave.webaim.org/)

• University of Toronto Web Accessibility Checker (http://achecker.ca/checker/)

• TAW Web Accessibility Test (http://www.webdevstuff.com/103/taw-web-accessibility-test.html)

Once you’ve analyzed the pages against the accessibility criteria, one way of summarizing the results is to count the number of pages with errors. For example, Figure 9.10 shows results of a hypothetical analysis of a website against the WCAG guidelines. This shows that only 10% of the pages have no errors, whereas 25% have more than 10 errors. The majority (53%) have 3–10 errors.

image

Figure 9.10 Results of analysis of a website against the WCAG guidelines.

In the United States, another important set of accessibility guidelines is the so-called Section 508 guidelines or, technically, the 1998 Amendment to Section 508 of the 1973 Rehabilitation Act (Section 508, 1998; also see Mueller, 2003). This law requires federal agencies to make their electronic and information technology accessible to people with disabilities, including their websites. The law applies to all federal agencies when they develop, procure, maintain, or use electronic and information technology. Section 508 specifies 16 standards that websites must meet. The Section 508 requirements are essentially a subset of the full WCAG 2.0 guidelines. We believe the most useful metric for illustrating Section 508 compliance is a page-level metric, indicating whether the page passes all 16 standards or not. You can then chart the percentage of pages that pass versus those that fail.

Update to Section 508

At the time of the writing of this second edition, an update of Section 508 is expected shortly. A 2011 draft has been released, commented on, and public hearings have been held. The new version is much more complete and closely reflects WCAG 2.0. The latest information can be found at http://www.access-board.gov/508.htm.

9.4 Return-On-Investment Data

A book about usability metrics wouldn’t be complete without at least some discussion of return on investment, as the usability metrics discussed in this book often play a key role in calculating ROI. But because entire books have been written on this topic (Bias & Mayhew, 2005; Mayhew & Bias, 1994), our purpose is to just introduce some of the concepts.

The basic idea behind usability ROI, of course, is to calculate the financial benefit attributable to usability enhancements for a product, system, or website. These financial benefits are usually derived from such measures as increased sales, increased productivity, or decreased support costs that can be attributed to the usability improvements. The key is to identify the cost associated with the usability improvements and then compare those to the financial benefits.

As Bias and Mayhew (2005) summarize, there are two major categories of ROI, with different types of returns for each:

• Internal ROI:

◦ Increased user productivity

◦ Decreased user errors

◦ Decreased training costs

◦ Savings gained from making changes earlier in the design life cycle

◦ Decreased user support

• External ROI:

◦ Increased sales

◦ Decreased customer support costs

◦ Savings gained from making changes earlier in the design life cycle

◦ Reduced cost of providing training (if training is offered by the company)

To illustrate some of the issues and techniques in calculating usability ROI, we’ll look at an example from Diamond Bullet Design (Withrow, Brinck, & Speredelozzi, 2000). This case study involved the redesign of a state government web portal. They conducted usability tests of the original website and a new version that had been created using a user-centered design process. The same 10 tasks were used to test both versions. A few of them were as follows:

• You are interested in renewing a {state} driver’s license online.

• How do nurses get licensed in {the state}?

• To assist in traveling, you want to find a map of {state} highways.

• What 4-year colleges are located in {the state}?

• What is the state bird of {the state}?

Twenty residents of the state participated in the study, which was a between-subjects design (with half using the original site and half using the new). Data collected included task times, task completion rates, and various self-reported metrics. They found that the task times were significantly shorter for the redesigned site, and the task completion rates were significantly higher. Figure 9.11 shows task times for the original and redesigned sites. Table 9.12 shows a summary of the task completion rates and task times for both versions of the site, as well as an overall measure of efficiency for both (task completion rate per unit time).

image

Figure 9.11 Task times for the original and the redesigned sites (an asterisk indicates significant difference). Adapted from Withrow et al. (2000); used with permission.

Table 9.12

Summary of task performance data.ᵃ

Original Site Redesigned Site
Average Task Completion Rate 72% 95%
Average Task Time (mins) 2.2 0.84
Average Efficiency 33% 113%

ᵃ Average efficiency is the task completion rate per unit of time (Task Completion Rate/Task Time). Adapted from Withrow et al. (2000); used with permission.

So far, everything is very straightforward and simply illustrates some of the usability metrics discussed in this book. But here’s where it gets interesting. To begin calculating ROI from the changes made to the site, Withrow et al. (2000) made the following assumptions and calculations related to time savings:

• Of the 2.7 million residents of the state, we might “conservatively estimate” a quarter of them use the website at least once per month.

• If each of them saved 79 seconds (as was the average task savings in this study), then about 53 million seconds (14,800 hours) are saved per month.

• Converting this to labor costs, we find 370 person-weeks (at 40 hours per week) or 7 person-years are saved per month; 84 person-years are saved each year.

• On average, a citizen in the target state had an annual salary of $14,700.

• This leads to a yearly benefit of $1.2 million based only on the time savings.

Note that this chain of reasoning had to start with a pretty big assumption: that a quarter of the residents of the state use the site at least once per month. So that assumption, which all the rest of the calculations hinge upon, is certainly up for debate. A better way of generating an appropriate starting value for these calculations would have been to use actual usage data from the current site.

They went on to calculate an increase in revenue due to the increased task completion rate for the new site:

1. The task failure rate of the old portal was found to be 28%, whereas the new site was 5%.

2. We might assume that 100,000 users would pay a service fee on the order of $2 per transaction at least once a month.

3. Then the 23% of users who now succeed on the new site, where they formerly would have failed, are generating an additional $552,000 in revenue per year.

Again, a critical assumption had to be made early in the chain of reasoning: that 100,000 users would pay a service fee to the state on the order of $2 per transaction at least once a month. A better way of doing this calculation would have been to use data from the live site specifically about the frequency of fee-generating transactions (and amounts of the fees). These could then have been adjusted to reflect the higher task completion rate for the redesigned site. If you agree with their assumptions, these two sets of calculations yield a total of about $1.75 million annually, either in time savings to the residents or increased fees to the state. Although Withrow and colleagues (2000) don’t specify how much was spent on the redesign of this portal, we can safely assume it was dramatically less than $1.75 million!
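The arithmetic behind these estimates is easy to reproduce. Here is a sketch that follows the chain of assumptions above; small differences from the figures in the text are due to the text’s rounding of intermediate values.

```python
# Reproducing the ROI arithmetic from Withrow et al. (2000).

# Time-savings estimate
residents = 2_700_000
monthly_users = residents / 4                 # assume a quarter use the site monthly
seconds_saved = 79                            # average task-time savings observed
hours_per_month = monthly_users * seconds_saved / 3600
person_years_per_year = hours_per_month * 12 / (40 * 52)
avg_salary = 14_700
time_savings = person_years_per_year * avg_salary
print(f"Hours saved per month: {hours_per_month:,.0f}")       # about 14,800
print(f"Person-years per year: {person_years_per_year:.0f}")  # about 85 (text rounds to 84)
print(f"Annual time savings:   ${time_savings:,.0f}")         # roughly $1.2-1.3 million

# Revenue estimate from the higher task completion rate
fee_paying_users = 100_000                    # assumed to pay a ~$2 fee monthly
added_success = 0.28 - 0.05                   # old failure rate minus new failure rate
added_revenue = fee_paying_users * added_success * 2 * 12
print(f"Added annual revenue:  ${added_revenue:,.0f}")        # $552,000

# About $1.8 million here; the text, using rounder figures, reports about $1.75 million.
print(f"Total annual benefit:  ${time_savings + added_revenue:,.0f}")
```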

This example points out some of the challenges associated with calculating usability ROI. In general, there are two major classes of situations where you might try to calculate a usability ROI: when users of the product are employees of your company and when users of the product are your customers. It tends to be much more straightforward to calculate ROI when the users are employees of your company. You generally know how much employees are paid, so time savings in completing certain tasks (especially highly repetitive ones) can be translated directly to monetary savings. In addition, you may know the costs involved in correcting certain types of errors, so reductions in the rates of those errors could also be translated to monetary savings.

Calculating usability ROI tends to be much more challenging when the users are your customers (or really anyone who is not an employee of your company). Your benefits are much more indirect. For example, it might not make any real difference to your bottom line that your customers can complete a key income-generating transaction in 30% less time than before. It probably does not mean that they will then perform significantly more of those transactions. But it might mean that over time those customers will remain your customers, and that others who might not have otherwise will become your customers (assuming the transaction times are significantly shorter than they are for your competitors), thus increasing revenue. A similar argument can be made for increased task completion rates.

Some ROI Case Studies

A variety of other case studies of usability ROI are available. Here’s just a sampling.

• The Nielsen Norman Group did a detailed analysis of 72 usability ROI case studies and found increases in key performance indicators of 0% to over 6000%. The case studies covered a wide variety of websites, including Macy’s, Bell Canada, New York Life, Open Table, a government agency, and a community college (Nielsen, Berger, Gilutz, & Whitenton, 2008).

• A redesign of the BreastCancer.org discussion forums resulted in a 117% increase in site visitors, a 41% increase in new memberships, a 53% reduction in time taken to register, and a 69% reduction in monthly help desk costs (Foraker, 2010).

• After a redesign of Move.com’s home search and contact an agent features, users’ ability to find a home increased from 62 to 98%, sales lead generation to real estate agents increased over 150%, and their ability to sell advertising space on the site improved significantly (Vividence, 2001).

• A user-centered redesign of Staples.com resulted in 67% more repeat customers and a 10% improvement in ratings of ease of placing orders, overall purchasing experience, and likelihood of purchasing again. Online revenues went from $94 million in 1999 to $512 million after implementation of the new site (Human Factors International, 2002).

• A major computer company spent $20,700 on usability work to improve the sign-on procedure in a system used by several thousand employees. The resulting productivity improvement saved the company $41,700 the first day the system was used (Bias & Mayhew, 1994).

• After a redesign of the navigational structure of Dell.com, revenue from online purchases went from $1 million per day in September 1998 to $34 million per day in March 2000 (Human Factors International, 2002).

• A user-centered redesign of a software product increased revenue by more than 80% over the initial release of the product (built without usability work). The revenues of the new system were 60% higher than projected, and many customers cited usability as a key factor in deciding to buy the new system (Wixon & Jones, 1992).

9.5 Summary

Here are some of the key takeaways from this chapter.

1. If you’re dealing with a live website, you should be studying what your users are doing on the site as much as you can. Don’t just look at page views. Look at click-through rates and drop-off rates. Whenever possible, conduct live A/B tests to compare alternative designs (typically with small differences). Use appropriate statistics (e.g., χ2) to make sure any differences you’re seeing are statistically significant.

2. Card sorting can be immensely helpful in learning how to organize some information or an entire website. Consider starting with an open sort and then following up with one or more closed sorts. Hierarchical cluster analysis and multidimensional scaling are useful techniques for summarizing and presenting the results. Closed card sorts can be used to compare how well different information architectures work for the users. Tree-testing tools can also be a useful way to test a candidate organization.

3. Accessibility is just usability for a particular group of users. Whenever possible, try to include older users and users with various kinds of disabilities in your usability tests. In addition, you should evaluate your product against published accessibility guidelines or standards, such as WCAG or Section 508.

4. Calculating ROI data for usability work is sometimes challenging, but it usually can be done. If the users are employees of your company, it’s generally easy to convert usability metrics such as reductions in task times into dollar savings. If the users are external customers, you generally have to extrapolate usability metrics such as improved task completion rates or improved overall satisfaction to decreases in support calls, increases in sales, or increases in customer loyalty.
