This chapter covers background information about data, statistics, and graphs that applies to just about any user experience metric. Specifically, we address the following:
• The basic types of variables and data in any user experience study, including independent and dependent variables, and nominal, ordinal, interval, and ratio data.
• Basic descriptive statistics such as the mean and median, standard deviation, and the concept of confidence intervals, which reflect how accurate your estimates of measures such as task times, task success rates, and subjective ratings actually are.
• Simple statistical tests for comparing means and analyzing relationships between variables.
• Tips for presenting your data visually in the most effective way.
We use Microsoft Excel 2010 for all of the examples in this chapter (and really in most of this book) because it is so popular and widely available. Most of the analyses can also be done with other readily available spreadsheet tools such as Google Docs or OpenOffice.org.
At the broadest level, there are two types of variables in any usability study: independent and dependent. Independent variables are the things you manipulate or control, such as the designs you’re testing or the ages of your participants. Dependent variables are the things you measure, such as success rates, number of errors, user satisfaction, completion times, and many more. Most of the metrics discussed in this book are dependent variables.
When designing a user experience study, you should have a clear idea of what you plan to manipulate (independent variables) and what you plan to measure (dependent variables). The most interesting outcomes of a study are at the intersection of the independent and dependent variables, such as whether one design resulted in a higher task success rate than the other.
Both independent and dependent variables can be measured using one of four general types of data: nominal, ordinal, interval, and ratio. Each type of data has its own unique characteristics and, most importantly, supports specific types of analyses and statistics. When collecting and analyzing user experience data, you should know what type of data you’re dealing with and what you can and can’t do with each type.
Nominal (also called categorical) data are simply unordered groups or categories. Without order between the categories, you can say only that they are different, not that one is any better than the other. For example, consider apples, oranges, and bananas. They are just different; no one fruit is inherently better than any other.
In user experience, nominal data might be characteristics of different types of users, such as Windows versus Mac users, users in different geographic locations, or males versus females. These are typically independent variables that allow you to segment data by these different groups. Nominal data also include some commonly used dependent variables, such as task success, the number of users who clicked on link A instead of link B, or users who chose to use a remote control instead of the controls on the DVD player itself.
Among the statistics you can use with nominal data are simple descriptive statistics such as counts and frequencies. For example, you could say that 45% of the users are female, there are 200 users with blue eyes, or 95% were successful on a particular task.
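If the raw data live in a spreadsheet, counts and frequencies like these take only a formula or two. Here is a minimal sketch, assuming each participant’s outcome for a task is recorded as “success” or “failure” in cells B2:B61 (the cell range and labels are hypothetical):
=COUNTIF(B2:B61, "success") counts the number of successful participants.
=COUNTIF(B2:B61, "success")/COUNTA(B2:B61) gives the success rate as a proportion; format the cell as a percentage to report it as one.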
Ordinal data are ordered groups or categories. As the name implies, the data are organized in a certain way; however, the intervals between measurements are not meaningful. Some people think of ordinal data as ranked data. For example, the list of the top 100 movies, as rated by the American Film Institute (AFI), shows that their 10th best movie of all time, Singin’ in the Rain, is better than their 20th best, One Flew Over the Cuckoo’s Nest. But because the distance between the ranks is not meaningful, these rankings don’t say that Singin’ in the Rain is twice as good as One Flew Over the Cuckoo’s Nest; one film is just better than the other, at least according to the AFI. Ordinal data might be ordered as better or worse, more satisfied or less satisfied, or more severe or less severe. The relative ranking (the order of the rankings) is the only thing that matters.
In user experience, the most common examples of ordinal data come from self-reported data. For example, a user might rate a website as excellent, good, fair, or poor. These are relative rankings: The distance between excellent and good is not necessarily the same as the distance between good and fair. Or if you were to ask the participants in a usability study to rank order four different designs for a web page according to which they prefer, that would also be ordinal data. There’s no reason to assume that the distance between the page ranked first by a participant and the page ranked second is the same as the distance between the page ranked second and the one ranked third. It could be that the participant really loved one page and hated all three of the others.
The most common way to analyze ordinal data is by looking at frequencies. For example, you might report that 40% of the users rated the site as excellent, 30% as good, 20% as fair, and 10% as poor. Calculating an average ranking may be tempting, but it’s statistically meaningless.
Interval data are continuous data where differences between the values are meaningful, but there is no natural zero point. An example of interval data familiar to most of us is temperature. Defining 0° Celsius or 32° Fahrenheit based on when water freezes is completely arbitrary. The freezing point of water does not mean the absence of heat; it only identifies a meaningful point on the scale of temperatures. But the differences between the values are meaningful: the distance from 10° to 20° is the same as the distance from 20° to 30° (using either scale). Dates are another common example of interval data.
In usability, the System Usability Scale (SUS) is one example of interval data. SUS (described in detail in Chapter 6) is based on self-reported data from a series of questions about the overall usability of any system. Scores range from 0 to 100, with a higher SUS score indicating better usability. The distance between each point along the scale is meaningful in the sense that it represents an incremental increase or decrease in perceived usability.
Interval data allow you to calculate a wide range of descriptive statistics (including means and standard deviations). There are also many inferential statistics that can be used to generalize about a larger population. Interval data provide many more possibilities for analysis than either nominal or ordinal data. Much of this chapter reviews statistics that can be used with interval data.
One of the debates you can get into with people who collect and analyze subjective ratings is whether you must treat the data as purely ordinal or if you can treat it as being interval. Consider these two rating scales:
At first glance, you might say those two scales are the same, but the difference in presentation makes them different. Putting explicit labels on every item in the first scale makes the data ordinal. Leaving the intervening labels off in the second scale and labeling only the end points makes the data more “interval-like,” which is why most subjective rating scales label only the ends, or “anchors,” and not every data point. Consider a slightly different version of the second scale:
Presenting it that way, with 9 points along the scale, makes it even more obvious that the data can be treated as if they were interval data. The reasonable interpretation of this scale by a user is that the distances between all the data points along the scale are equal. A useful question to ask when deciding whether data like these can be treated as interval is whether a point halfway between any two of the defined data points makes sense. If it does, then it makes sense to analyze the data as interval data.
Ratio data are the same as interval data but with the addition of an absolute zero. This means that the zero value is not arbitrary, as with interval data, but has some inherent meaning. With ratio data, differences between the measurements are interpreted as a ratio. Examples of ratio data are age, height, and weight. In each example, zero indicates the absence of age, height, or weight.
In user experience, the most obvious example of ratio data is time. A task time of zero seconds means the absence of any time or duration. Ratio data let you say that something is twice as fast or half as fast as something else. For example, you could say that one user is twice as fast as another user in completing a task.
There aren’t many additional analyses you can do with ratio data compared to interval data in usability. One exception is calculating a geometric mean, which might be useful in measuring differences in time. Aside from that calculation, there really aren’t many differences between interval and ratio data in terms of the available statistics.
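Excel has a built-in function for the geometric mean. As a sketch, assuming a set of task times is in cells B2:B13 (a hypothetical range):
=GEOMEAN(B2:B13) returns the geometric mean of the times.
Because it dampens the influence of a few very large values, the geometric mean can be a useful summary for skewed time data.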
Descriptive statistics are essential for any interval or ratio-level data. Descriptive statistics, as the name implies, describe the data, without saying anything about the larger population. Inferential statistics let you draw some conclusions or infer something about a larger population above and beyond your sample.
The most common types of descriptive statistics are measures of central tendency (such as the mean), measures of variability (such as the standard deviation), and confidence intervals, which pull the other two together. The following sections use the sample data shown in Table 2.1 to illustrate these statistics. These data represent the time, in seconds, that it took each of 12 participants in a usability study to complete the same task.
Table 2.1
Time to complete a task, in seconds, for each of 12 participants in a usability study.
Participant | Task Time (seconds)
P1 | 34
P2 | 33
P3 | 28
P4 | 44
P5 | 46
P6 | 21
P7 | 22
P8 | 53
P9 | 22
P10 | 29
P11 | 39
P12 | 50
Measures of central tendency are simply a way of choosing a single number that is in some way representative of a set of numbers. The three most common measures of central tendency are the mean, median, and mode.
The mean is what most people think of as the average: the sum of all values divided by how many values there are. The mean of most user experience metrics is extremely useful and is probably the most common statistic cited in a usability report. For the data in Table 2.1, the mean is 35.1 seconds.
The median is the middle number if you put them in order from smallest to largest: half the values are below the median and half are above the median. If there is no middle number, the median is halfway between the two values on either side of the middle. For the data in Table 2.1, the median is 33.5 seconds (halfway between the middle two numbers, 33 and 34). Half of the users were faster than 33.5 seconds and half were slower. In some cases, the median can be more revealing than the mean. For example, let’s assume the task time for P12 had been 150 seconds rather than 50. That would change the mean to 43.4 seconds, but the median would be unchanged at 33.5 seconds. It’s up to you to decide which is a more representative number, but this illustrates the reason that the median is sometimes used, especially when larger values (or so-called “outliers”) may skew the distribution.
The mode is the most commonly occurring value in the set of numbers. For the data in Table 2.1, the mode is 22 seconds, because two participants completed the task in 22 seconds. It’s not common to report the mode in usability test results. When data are continuous over a broad range, such as the task times shown in Table 2.1, the mode is generally less useful. When data have a more limited set of values (such as subjective rating scales), the mode is more useful.
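All three measures of central tendency are single functions in Excel. A quick sketch, assuming the task times from Table 2.1 are in cells B2:B13 (the cell range is an assumption):
=AVERAGE(B2:B13) returns the mean, 35.1.
=MEDIAN(B2:B13) returns the median, 33.5.
=MODE(B2:B13) returns the mode, 22.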
Measures of variability reflect how much the data are spread or dispersed across the range of values. For example, these measures help answer the question, “Do most users have similar task completion times or is there a wide range of times?” In most usability studies, variability is caused by individual differences among your participants. There are three common measures of variability: range, variance, and standard deviation.
The range is the distance between the minimum and maximum values. For the data in Table 2.1, the range is 32, with a minimum time of 21 seconds and a maximum time of 53 seconds. The range can vary wildly depending on the metric. For example, with rating scales the range is constrained by the scale itself, such as 1 to 5 or 1 to 7, depending on the number of values used in the scale. When you study completion times, the range is very useful because it will help identify “outliers” (data points at the extreme top and bottom of the range). Looking at the range is also a good check to make sure that the data are coded properly. If the range is supposed to be from one to five, and the data include a seven, you know there is a problem.
Variance tells you how spread out the data are relative to the average or mean. The formula for calculating variance measures the difference between each individual data point and the mean, squares that value, sums all of those squares, and then divides the result by the sample size minus 1. For the data in Table 2.1, the variance is 126.4.
Once you know the variance, you can calculate the standard deviation easily, which is the most commonly used measure of variability. The standard deviation is simply the square root of the variance. The standard deviation of the data shown in Table 2.1 is 11.2 seconds. Interpreting the standard deviation is a little easier than interpreting the variance, as the unit of the standard deviation is the same as the original data (seconds, in this example).
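In formula terms, the sample variance is s² = Σ(x − x̄)² / (n − 1), and the standard deviation is its square root. Both are single functions in Excel; again assuming the Table 2.1 times are in B2:B13:
=VAR(B2:B13) returns the sample variance, 126.4.
=STDEV(B2:B13) returns the sample standard deviation, 11.2 (equivalently, =SQRT(VAR(B2:B13))).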
A confidence interval is an estimate of a range of values that includes the true population value for a statistic, such as a mean. For example, assume that you need to estimate the true population mean for a task time whose sample times are shown in Table 2.1. You could construct a confidence interval around that mean to show the range of values that you are reasonably certain will include the true population mean. The phrase “reasonably certain” indicates that you will need to choose how certain you want to be or, put another way, how willing you are to be wrong in your assessment. This is what’s called the confidence level that you choose or, conversely, the alpha level for the error that you’re willing to accept. For example, a confidence level of 95%, or an alpha level of 5%, means that you want to be 95% certain, or that you’re willing to be wrong 5% of the time.
There are three variables that determine the confidence interval for a mean:
• The sample size, or the number of values in the sample. For the data in Table 2.1, the sample size is 12, as we have data from 12 participants.
• The standard deviation of the sample data. For our example, that is 11.2 seconds.
• The alpha level we want to adopt. The most common alpha levels (primarily by convention) are 5% and 10%. Let’s choose an alpha of 5% for this example, which corresponds to a 95% confidence interval.
The 95% confidence interval is then calculated using the following formula:
confidence interval = mean ± 1.96 × (standard deviation ÷ √sample size)
The value “1.96” is a factor that reflects the 95% confidence level. Other confidence levels have other factors. This formula shows that the confidence interval will get smaller as the standard deviation (the variability of data) decreases or as the sample size (number of participants) increases.
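Excel’s CONFIDENCE function computes the “1.96 × standard deviation ÷ √sample size” portion (the half-width of the interval) directly. A sketch using the Table 2.1 data, assumed to be in B2:B13:
=CONFIDENCE(0.05, STDEV(B2:B13), COUNT(B2:B13)) returns approximately 6.4 seconds,
so the 95% confidence interval for the mean is roughly 35.1 ± 6.4 seconds. Note that CONFIDENCE is based on the normal distribution; Excel 2010 also offers CONFIDENCE.T, which uses the t distribution and yields a slightly wider (more conservative) interval for small samples like this one.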
Confidence intervals are incredibly useful. We think you should calculate and display them routinely for just about any means that you report from a usability study. When displayed as error bars on a graph of means, they make it visually obvious how accurate the measures actually are.
Let’s now consider the data in Figure 2.2, which shows the checkout times for two different designs of a prototype website. In this study, 10 participants performed the checkout task using Design A and another 10 participants performed the checkout task using Design B. Participants were assigned randomly to one group or the other. The means and 90% confidence intervals for both groups have been calculated using the AVERAGE and CONFIDENCE functions. The means have been plotted as a bar graph, and the confidence intervals are shown as error bars on the graph. Even just a quick glance at this bar graph shows that the error bars for these two means don’t overlap with each other. When that is the case, you can safely assume that the two means are significantly different from each other.
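To reproduce a chart like this, one approach in Excel 2010 (menu labels are from memory, so verify them in your version) is: (1) compute each group’s mean with AVERAGE and its interval half-width with CONFIDENCE, for example =CONFIDENCE(0.1, STDEV(A2:A11), COUNT(A2:A11)) for a 90% interval on Design A’s times in the hypothetical range A2:A11; (2) plot the two means as a bar chart; and (3) select the data series, choose Chart Tools > Layout > Error Bars > More Error Bars Options, pick “Custom,” and point both the positive and negative error values at the cells holding the CONFIDENCE results.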
One of the most useful things you can do with interval or ratio data is to compare different means. If you want to know whether one design has higher satisfaction ratings than another or if the number of errors is higher for one group of users compared to another, your best approach is through statistics.
There are several ways to compare means, but before jumping into the statistics, you should know the answers to a couple of questions:
1. Is the comparison within the same set of users or across different users? For example, if you are comparing some data for men versus women, it is highly likely that these are different users. A comparison across different users like this is called independent samples. But if you’re comparing the same group of users on different products or designs, you will use something called paired samples.
2. How many samples are you comparing? If you are comparing two samples, use a t test. If you are comparing three or more samples, use an analysis of variance (also called ANOVA).
Perhaps the simplest way to compare means from independent samples is using confidence intervals, as shown in the previous section. In comparing the confidence intervals for two means, you can draw the following conclusions:
• If the confidence intervals don’t overlap, you can safely assume the two means are significantly different from each other (at the confidence level you chose).
• If the confidence intervals overlap slightly, the two means might still be significantly different. Run a t test to determine if they are different.
• If the confidence intervals overlap widely, the two means are not significantly different.
Let’s consider the data in Figure 2.3 to illustrate running a t test for independent samples. This shows the ratings of ease of use on a 1 to 5 scale for two different designs as rated by two different groups of participants (who were assigned randomly to one group or the other). We’ve calculated the means and confidence intervals and graphed those. But note that the two confidence intervals overlap slightly: Design 1’s interval goes up to 3.8, whereas Design 2’s goes down to 3.5. This is a case where you should run a t test to determine if the two means are significantly different.
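In Excel, the T.TEST function (TTEST in versions before 2010) returns the p value for a t test directly. A sketch for this comparison, assuming Design 1’s ratings are in A2:A11 and Design 2’s are in B2:B11 (hypothetical ranges):
=T.TEST(A2:A11, B2:B11, 2, 3)
The third argument (2) requests a two-tailed test; the fourth selects the type of test: 3 for independent samples without assuming equal variances, or 2 if equal variances can be assumed. If the result falls below your alpha level (e.g., 0.05), the two means are significantly different.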
A paired samples t test is used when comparing means within the same set of users. For example, you may be interested in knowing whether there is a difference between two prototype designs. If you have the same set of users perform tasks using prototype A and then prototype B, and you are measuring variables such as self-reported ease of use and time, you will use a paired samples t test.
With paired samples like these, the key is that you’re comparing each person to themselves. Technically, you’re looking at the difference in each person’s data for the two conditions you’re comparing. Let’s consider the data shown in Figure 2.4, which shows “Ease of Use” ratings for an application after participants’ initial use and then again at the end of the session, so 10 participants gave two ratings each. The means and 90% confidence intervals are shown and have been graphed. Note that the confidence intervals overlap pretty widely. If these were independent samples, you could conclude that the ratings are not significantly different from each other. However, because these are paired samples, we’ve run a t test on paired samples (with the “Type” argument set to 1). That result, 0.0002, shows that the difference is highly significant.
Figure 2.4 Data showing paired samples in which each of 10 participants gave an ease of use rating (on a 1–5 scale) to an application after an initial task and at the end of the study.
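The “Type” mentioned above is the fourth argument of Excel’s T.TEST (or TTEST) function. As a sketch, assuming the initial ratings are in B2:B11 and the final ratings in C2:C11 (hypothetical ranges):
=T.TEST(B2:B11, C2:C11, 2, 1)
The final argument of 1 requests a paired test; a formula like this is what produces the 0.0002 result described above.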
Let’s look at the data in Figure 2.4 in a slightly different way, as shown in Figure 2.5. This time we’ve simply added a third column to the data in which the initial rating was subtracted from the final rating for each participant. Note that for 8 of the 10 participants, the rating increased by one point, whereas for 2 participants it stayed the same. The bar graph shows the mean of those differences (0.8) as well as the confidence interval for that mean difference. In a paired-samples test like this, you’re basically testing to see if the confidence interval for the mean difference includes 0 or not. If not, the difference is significant.
Figure 2.5 Same data as in Figure 2.4, but also showing the difference between initial and final ratings, the mean of those differences, and the 90% confidence interval.
Note that in a paired samples test, you should have an equal number of values in each of the two sets of numbers that you’re comparing (although it is possible to have missing data). In the case of independent samples, the number of values does not need to be equal. You might happen to have more participants in one group than the other.
We don’t always compare only two samples. Sometimes we want to compare three, four, or even six different samples. Fortunately, there is a way to do this without a lot of pain. An ANOVA lets you determine whether there is a significant difference across more than two groups.
Excel lets you perform three types of ANOVAs. We will give an example for just one type of ANOVA, called a single-factor ANOVA. A single-factor ANOVA is used when you just have one variable you want to examine. For example, you might be interested in comparing task completion times across three different prototypes.
Let’s consider the data shown in Figure 2.6, which shows task completion times for three different designs. There were a total of 30 participants in this study, with each using only one of the three designs.
Figure 2.6 Task completion times for three different designs (used by different participants) and results of a single-factor ANOVA.
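A single-factor ANOVA is not a worksheet function; it is run from Excel’s Analysis ToolPak add-in. The following steps reflect Excel 2010, and the exact labels may differ in other versions: (1) if necessary, enable the add-in via File > Options > Add-Ins > Manage: Excel Add-ins > Go, and check “Analysis ToolPak”; (2) arrange the task times in three adjacent columns, one per design; (3) choose Data > Data Analysis > “Anova: Single Factor,” select the three columns as the input range, and set your alpha level (e.g., 0.05); (4) in the output, compare the p value to your alpha level (Excel also reports the F statistic and the critical value of F).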
Results are shown in two parts (the right-hand portion of Figure 2.6). The top part is a summary of the data. As you can see, the average completion time for Design 2 is quite a bit longer (slower) than for Designs 1 and 3, and the variance for Design 2 is also greater. The second part of the output tells us whether this difference is significant. The p value of 0.000003 reflects the statistical significance of this result. Understanding exactly what this means is important: It means that there is a significant effect of the “designs” variable overall. It does not necessarily mean that each design mean is significantly different from each of the others. To see whether any two means are significantly different from each other, you could run a two-sample t test on just those two sets of values.
Sometimes it’s important to know about the relationship between different variables. We’ve seen many cases where someone observing a usability test for the first time remarks that what users say and what they do don’t always correspond with each other. Many users will struggle to complete just a few tasks with a prototype, but when asked to rate how easy or difficult it was, they often give it good ratings. This section provides examples of how to perform analyses that investigate these kinds of relationships (or lack thereof).
When you first begin examining the relationship between two variables, it’s important to visualize what the data look like. That’s easy to do in Excel using a scatterplot. Figure 2.7 is an example of a scatterplot of actual data from an online usability study. The horizontal axis shows mean task time in minutes, and the vertical axis shows mean task rating (1–5, with higher numbers being better). Note that as the mean task time increases, the average task rating drops. This is called a negative relationship because as one variable increases (task time), the other variable decreases (task rating). The line that runs through the data is called a trend line and is added easily to the chart in Excel by right-clicking on any one of the data points and selecting “Add Trendline.” The trend line helps you better visualize the relationship between the two variables. You can also have Excel display the R² value (a measure of the strength of the relationship) by right-clicking on the trend line, choosing “Format Trendline,” and checking the box next to “Display R-squared value on chart.”
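The strength of a relationship like this can also be computed directly with worksheet functions. A sketch, assuming the mean task times are in A2:A40 and the mean task ratings in B2:B40 (hypothetical ranges):
=CORREL(A2:A40, B2:B40) returns the correlation coefficient r, which would be negative here because ratings fall as times rise.
=RSQ(B2:B40, A2:A40) returns R² (the square of r), the same value Excel can display next to the trend line.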
Nonparametric tests are used for analyzing nominal and ordinal data. For example, you might want to know if a significant difference exists between men and women for success and failure on a particular task. Or perhaps you’re interested in determining whether there is a difference among experts, intermediates, and novices on how they ranked different websites. To answer questions that involve nominal and ordinal data, you will need to use some type of nonparametric test.
Nonparametric statistics make different assumptions about the data than the statistics we’ve reviewed for comparing means and describing relationships between variables. For instance, when we run t tests and correlation analyses, we assume that the data are distributed normally and the variances are approximately equal. The distribution is not normal for nominal or ordinal data, so we don’t make the same assumptions in nonparametric tests. For example, in the case of (binary) success, when there are only two possibilities, the data follow the binomial distribution. Some people like to refer to nonparametric tests as “distribution-free” tests. There are a few different types of nonparametric tests, but we will cover just the χ² test because it is probably the most commonly used.
The χ² (pronounced “chi square”) test is used when you want to compare nominal (or categorical) data. Let’s consider an example. Assume you’re interested in knowing whether there is a significant difference in task success among three different groups: novices, intermediates, and experts. You run a total of 60 people in your study, 20 in each group. You measure task success or failure on a single task. You count the number of people who were successful in each group. For novices, only 6 out of 20 were successful, 12 out of 20 intermediates were successful, and 18 out of 20 experts were successful. You want to know if there is a statistically significant difference among the groups.
Figure 2.8 shows what the data look like and output from the CHITEST function. In this example, the likelihood that this distribution is due to chance is about 2.9% (0.029). Because this number is less than 0.05 (95% confidence), we can reasonably say that there is a difference in success rates among the three groups.
In this example we were just examining the distribution of success rates across a single variable (experience group). There are some situations in which you might want to examine more than one variable, such as experience group and design prototype. Performing this type of evaluation works the same way. Figure 2.9 shows data based on two different variables: group and design. For a more detailed example of using χ² to test for differences in live website data for two alternative pages (so-called A/B tests), see Chapter 9.
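Mechanically, CHITEST (CHISQ.TEST in Excel 2010) takes two same-sized ranges: the observed counts and the expected counts, where each expected count is its row total times its column total divided by the grand total. As a sketch with hypothetical cell ranges, if the observed counts occupy B2:D3 and the expected counts are computed in B6:D7:
=CHITEST(B2:D3, B6:D7)
The result is the probability of seeing a distribution at least this far from the expected one by chance alone; a value below your alpha level (e.g., 0.05) indicates a significant difference.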
You might have collected and analyzed the best set of usability data ever, but it’s of little value if you can’t communicate it effectively to others. Data tables are certainly useful in some situations, but in most cases you’ll want to present your data graphically. A number of excellent books on the design of effective data graphs are available, including those written by Edward Tufte (1990, 1997, 2001, 2006), Stephen Few (2006, 2009, 2012), and Dona Wong (2010). Our intent in this section is simply to introduce some of the most important principles in the design of data graphs, particularly as they relate to user experience data.
We’ve organized this section around tips and techniques for five basic types of data graphs:
• Column or bar graphs
• Line graphs
• Scatterplots
• Pie or donut charts
• Stacked bar graphs
We begin each of the following sections with one good example and one bad example of that particular type of data graph.
Column graphs and bar graphs (Figure 2.10) are the same thing; the only difference is their orientation. Technically, column graphs are vertical and bar graphs are horizontal. In practice, most people refer to both types simply as bar graphs, which is what we will do.
Figure 2.10 Good (top) and bad (bottom) examples of bar graphs for the same data. Mistakes in the bad version include failing to label data, not starting the vertical axis at 0, not showing confidence intervals when you can, and showing too much precision in the vertical axis labels.
Bar graphs are probably the most common way of displaying usability data. Almost every presentation of data from a usability test that we’ve seen has included at least one bar graph, whether it was for task completion rates, task times, self-reported data, or something else. The following are some of the principles used for bar graphs.
• Bar graphs are appropriate when you want to present the values of continuous data (e.g., times, percentages) for discrete items or categories (e.g., tasks, participants, designs). If both variables are continuous, a line graph is appropriate.
• The axis for the continuous variable (the vertical axis in Figure 2.10) should normally start at 0. The whole idea behind bar graphs is that the lengths of the bars represent the values being plotted. By not starting the axis at 0, you’re manipulating their lengths artificially. The bad example in Figure 2.10 gives the impression that there’s a larger difference between the tasks than there really is. A possible exception is when you include error bars, making it clear which differences are real and which are not.
• Don’t let the axis for the continuous variable go any higher than the maximum value that’s theoretically possible. For example, if you’re plotting percentages of users who completed each task successfully, the theoretical maximum is 100%. If some values are close to that maximum, Excel and other packages will tend to automatically increase the scale beyond the maximum, especially if error bars are shown.
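To keep Excel from auto-extending the axis, the bounds can be fixed manually. In Excel 2010 (labels from memory, so verify in your version): right-click the vertical axis, choose “Format Axis,” and under Axis Options change Minimum and Maximum from “Auto” to “Fixed,” entering 0 and 1.0 (or 100%) for a percentage scale.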
Line graphs (Figure 2.11) are used most commonly to show trends in continuous variables, often over time. Although not as common as bar graphs in presenting usability data, they certainly have their place. The following are some of the key principles for using line graphs.
• Line graphs are appropriate when you want to present the values of one continuous variable (e.g., percent correct, number of errors) as a function of another continuous variable (e.g., age, trial). If one of the variables is discrete (e.g., gender, participant, task), then a bar graph is more appropriate.
• Show your data points. Your actual data points are the things that really matter, not the lines. The lines are just there to connect the data points and make the trends more obvious. You may need to increase the default size of the data points in Excel.
• Use lines that have sufficient weight to be clear. Very thin lines are not only hard to see; their color is harder to detect, and they may imply greater precision in the data than is appropriate. You may need to increase the default weight of lines in Excel.
• Include a legend if you have more than one line. In some cases, it may be clearer to move the labels manually from the legend into the body of the graph and put each label beside its appropriate line. It may be necessary to do this in PowerPoint or some other drawing program.
• As with bar graphs, the vertical axis normally starts at 0, but it’s not as important with a line graph to always do that. There are no bars whose length is important, so sometimes it may be appropriate to start the vertical axis at a higher value. In that case, you should mark the vertical axis appropriately. The traditional way of doing this is with a “discontinuity” marker on that axis. Again, it may be necessary to do that in a drawing program.
Figure 2.11 Good (top) and bad (bottom) examples of line graphs for the same data. Mistakes in the bad version include failing to label the vertical axis, not showing data points, not including a legend, and not showing confidence intervals.
Scatterplots (Figure 2.13), or X/Y plots, show pairs of values. Although they’re not very common in usability reports, they can be very useful in certain situations, especially to illustrate relationships between two variables. Here are some of the key principles for using scatterplots.
• You must have paired values that you want to plot. A classic example is heights and weights of a group of people. Each person would appear as a data point, and the two axes would be height and weight.
• Normally, both of the variables would be continuous. In Figure 2.13, the vertical axis shows mean values for a visual appeal rating of 42 web pages (from Tullis & Tullis, 2007). Although that scale originally had only four values, the means come close to being continuous. The horizontal axis shows the size, in k pixels, of the largest nontext image on the page, which truly is continuous.
• You should use appropriate scales. In Figure 2.13, the values on the vertical axis can’t be any lower than 1.0, so it’s appropriate to start the scale at that point rather than 0.
• Your purpose in showing a scatterplot is usually to illustrate a relationship between the two variables. Consequently, it’s often helpful to add a trend line to the scatterplot, as in the good example in Figure 2.13. You may want to include the R² value to indicate the goodness of fit.
Pie or donut charts (Figure 2.14) illustrate the parts or percentages of a whole. They can be useful any time you want to illustrate the relative proportions of the parts of a whole to each other (e.g., how many participants in a usability test succeeded, failed, or gave up on a task). Here are some key principles for their use.
• Pie or donut charts are appropriate only when the parts add up to 100%. You have to account for all the cases. In some situations, this might mean creating an “other” category.
• Minimize the number of segments in the chart. Even though the bad example in Figure 2.14 is technically correct, it’s almost impossible to make any sense out of it because it has so many segments. Try to use no more than six segments. Logically combine segments, as in the good example, to make the results clearer.
• In almost all cases, you should include the percentage and label for each segment. Normally these should be next to each segment, connected by leader lines if necessary. Sometimes you have to move the labels manually to prevent them from overlapping.
Figure 2.14 Good (top) and bad (bottom) examples of pie or donut charts for the same data. Mistakes in the bad version include too many segments, poor placement of the legend, not showing percentages for each segment, and using 3D, for which the creator of this pie chart should be pummeled with a wet noodle.
Stacked bar graphs (Figure 2.15) are basically multiple pie charts shown in bar or column form. They’re appropriate whenever you have a series of data sets, each of which represents parts of the whole. Their most common use in user experience data is to show different task completion states for each task. Here are some key principles for their use.
• Like pie charts, stacked bar graphs are only appropriate when the parts for each item in the series add up to 100%.
• The items in the series are normally categorical (e.g., tasks, participants).
• Minimize the number of segments in each bar. More than three segments per bar can make it difficult to interpret. Combine segments as appropriate.
• When possible, make use of color-coding conventions that your audience is likely to be familiar with. For many U.S. audiences, green is good, yellow is marginal, and red is bad. Playing off of these conventions can be helpful, as in the good example in Figure 2.15, but don’t rely solely on them.
In a nutshell, this chapter is about knowing your data. The better you know your data, the more likely you are to answer your research questions clearly. The following are some of the key takeaways from this chapter.
1. When analyzing your results, it’s critical to know your data. The specific type of data you have will dictate what statistics you can (and can’t) perform.
2. Nominal data are categorical, such as binary task success or males and females. Nominal data are usually expressed as frequencies or percentages. χ² tests can be used when you want to learn whether the frequency distribution is random or there is some underlying significance to the distribution pattern.
3. Ordinal data are rank orders, such as a severity ranking of usability issues. Ordinal data are also analyzed using frequencies, and the distribution patterns can be analyzed with a χ² test.
4. Interval data are continuous data where the intervals between each point are meaningful but without a natural zero. The SUS score is one example. Interval data can be described by means, standard deviations, and confidence intervals. Means can be compared to each other for the same set of users (paired samples t test) or across different users (independent samples t test). ANOVA can be used to compare more than two sets of data. Relationships between variables can be examined through correlations.
5. Ratio data are the same as interval but with a natural zero. One example is completion times. Essentially, the same statistics that apply to interval data also apply to ratio data.
6. Any time you can calculate a mean, you can also calculate a confidence interval for that mean. Displaying confidence intervals on graphs of means helps the viewer understand the accuracy of the data and to see quickly any differences between means.
7. When presenting your data graphically, use the appropriate types of graphs. Use bar graphs for categorical data and line graphs for continuous data. Use pie charts or stacked bar graphs when data sum to 100%.