Chapter 6. The Analysis Phase (Getting Answers From Your Experiments)

LAUNCHING YOUR A/B TEST out “in the wild” is exciting—it’s the first time you’ll get to have a conversation with your users at scale about how your designs will serve them and meet their needs! This process of launching and learning from your test will be our focus in this chapter. One of the principles behind A/B testing is that it’s a good idea to always test first with a very small portion of your user base. The data you get back from your experiments will give you a sense of how well your design is performing with respect to your goals, before you invest the time to launch a design to all of your users. Did the design(s) lead to the behaviors you were expecting to see? Did you see people acting in ways that map more closely to your business metrics? Were you able to successfully improve the metrics that you were targeting? For example, if your goal is that your users watch more movies they like, did you see an uptick in movies selected and watched all the way through? This would indicate that your recommendation algorithms and/or your presentation of selections are more effective than in your previous design.

Once you get back data from your initial tests, you then need to decide what you’ll do next. Do you take that experience and then roll it out to a larger part of your user base? Did the results suggest you are on the wrong track, that you should abandon the idea entirely? Or do you keep working on your hypothesis to improve it before testing it again? We will also tackle these types of questions in this chapter. Referring back to the framework we introduced in Chapter 3, this chapter focuses on the activities shown in Figure 6-1 (outlined in the dotted line).

Figure 6-1. “The Analysis Phase” of our experimentation framework is where you launch your test to your users, collect, and then analyze the resulting data.

Throughout this book we have been championing a data-aware strategy; part of that strategy is not reacting instantly to partial results but instead being deliberate and reflective about what your results mean within the context of your overall company and product goals.

In Chapter 4 and Chapter 5, we’ve highlighted the importance of thinking about how you’ll launch your A/B test, and how that impacts the types of results you’ll get and the things you can learn, right from the beginning. In this chapter, we’ll have a more in-depth conversation about launching your A/B test, analyzing your results, and deciding what your next steps should be. We hope that you’re excited to see everything come together as we introduce how to make actionable and data-aware decisions based on A/B testing.

Vetting Your Designs Ahead of Launch

Although we are focusing on A/B testing in this book, remember that throughout your design process you might use other forms of data to inform how you craft your experience. You can also use these methods and tools to evaluate your designs well ahead of launching them in an A/B test; you might find that you are able to eliminate a test cell or two before launch. We’ll give you a quick survey (no pun intended!) of a few of the tools you might consider using. Many of the tools that we’ll list here are equally useful for answering broader questions that you might be asking when first working to define your hypothesis, as well as later on in the process when you might be looking to fine-tune different elements of your experience (e.g., specific variables).

Lab Studies: Interviews and Usability Testing

Throughout the process leading up to launching your A/B test, there are many places where it will make sense to use lab-based user research methodologies like interviews and usability testing to refine your thinking. What we’d like to emphasize here is that speaking directly with your users throughout both the analytical and execution phases is a great way to ensure that you’re ultimately designing and building something that has the potential to impact your user behavior (and therefore your metrics). And, of equal importance, doing so will ensure that the designs you launch and test at scale are a fair representation of the hypothesis and concept you’re hoping to understand, not one that is doomed to fail due to implementation details.

Remember that speaking directly to users can give us insight into their motivations and the “why” behind the data we might be capturing. Here’s where you can actually observe whether certain design decisions resulted in the behavior you were hoping for. Did making the image larger have the intended effect of making that feature more noticeable? Do your users describe their experience in terms that reflect your hypothesis? Are you confident that the differences between your test cells are pronounced enough that you’ll be able to measure them in the data you get back? Do the designs that you’ve crafted feel like they address some of the problems that users identified in your analysis stage?

Many companies leverage different methods to vet designs that later get A/B tested. In a process called “Soundcheck,” for example, the user research team at Spotify brings real users into the office every two or three weeks to collect user feedback on projects from across the company. Teams can sign up for a slot to get their projects Soundchecked, and because the cadence and dates of these sessions are known in advance, teams can plan to get feedback on their work in a way that aligns with their timeline and needs. Every time Spotify runs a Soundcheck, the product teams learn actionable things about their users to inspire copy and design changes that evolve the test cells before they are launched in an A/B test at scale, or encourage teams to go back to the drawing board on ideas that didn’t succeed in small sample research. This helps those teams put forth the best possible designs when they deploy their A/B test to experiment at scale.

Surveys

Surveys are also a useful way to vet designs ahead of a launch. Certain kinds of experiments lend themselves particularly well to surveys as a tool for getting data back on your designs before launch. We have seen this be very effective especially when testing changes that are more focused on the visual layer and don’t necessarily require the user to step through a sequence of actions. Examples here might be logo or icon changes, or different visual and UI treatments. We’ve found surveys to be useful for measuring and understanding the emotional impact that a design might have, and for identifying reasons why different groups of users might have different responses to a design.

Surveys are a good complement to A/B tests because while surveys can give you a quantifiable signal about how users will perceive or react to something, they can’t tell you how the design will actually perform in the wild. People are notoriously bad at predicting their own behaviors. However, surveys help provide nuance and context to the behaviors you might see in an A/B test by explaining the “why”—the emotions, attitudes, and need states that led to or correlate with those behaviors.

Working with Your Peers in Data

Finally, we’ll return to one of the themes we’ve been emphasizing throughout the book: if you happen to work with people who are focused on analyzing data or gathering research, we have often found that they have even better-tuned instincts about how customers will react than product managers or designers. Analysts and user researchers are exposed to a wide range of data from many different sources and are often quite good at anticipating how your designs will perform once you test them.

This is another point where we’d like to highlight that if you are working with data analysts who will be responsible for doing the analysis on your tests after they are launched, then bringing them in early in the process to help you “design” your test cell structure can be really useful. Data analysts can often help you understand which variables will be most useful to test, whether you have test cells that seem redundant with each other, and ultimately how many test cells you should include. As we’ve discussed before, analysts and user researchers can help you determine how to sequence your tests based on what it is that you hope to learn and what is most important to learn first, second, third, and so on.

Launching Your Design

By this point you’ve designed several test cells that you want to put in front of the world. You might have used complementary methods to vet the design ahead of launch, but now it’s time to deploy your experiment at scale. The mechanics of launching your A/B test will depend a lot on things that might be particular to your business as well as your product and the technology it leverages. However, we will cover some of the things to consider when deploying your A/B test regardless of exactly how you do it.

As a reminder of where we are in the overall process, this is the first time that a group of your users at scale will be exposed to your design/experiment. After your test launches, there will be a period of time (which you will have designated in your test design) during which your users interact with your design; you’ll need to wait to see its behavioral impact. Only after you get sufficient data can you decide what to do next; refer back to our discussion on the relationship between effect size and sample size.

Also remember that you’ll only be launching this experience to a subset of your users, and therefore you’ll want to take the necessary precautions so that you can be sure that the results that you get from your A/B test will be applicable to your broader user base (this includes thinking about cohorts, segments, and representativeness as discussed in Chapter 2, and making sure that there are no confounding variables such as seasonal variations or edge case conditions that might undermine what you can learn from your test). This is also where concepts that we first introduced in earlier chapters around minimum detectable effect, power, and sample size become important. We’ll spend a little time in this section revisiting those concepts in service of helping you prepare to launch your test.

First, you should define the minimum detectable effect (MDE), or the minimum change that you want to detect in order to call your test a success. You should be partnering with your data friends in the business functions at your company to model the long-term impact of the MDE on your business. Remember that making any change to your experience isn’t free—your business will have to pay for the time it takes to design, test, and implement those changes and roll them out to your entire user base. And, of equal or greater importance, your consumers will “pay” through the friction it takes them to relearn an existing experience, where bigger changes result in greater friction. Essentially, when you define an MDE you’re saying that observing an increase (or decrease) of at least x in your metrics will justify those business and experiential costs. Any difference below that MDE means it’s probably not worth launching that experience right now.
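In practice, translating an MDE into a required sample size is a calculation your data partners will usually own, but it helps to see the shape of it. Here is a minimal sketch in Python using the statsmodels library; the baseline rate, MDE, significance level, and power below are hypothetical placeholders, not recommendations:

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Hypothetical numbers: a 10% baseline conversion rate, and an MDE of
    # one percentage point (we only care about detecting 11% or better).
    baseline = 0.10
    mde = 0.01

    effect_size = proportion_effectsize(baseline + mde, baseline)

    # Users needed per cell for a two-sided test at alpha = 0.05 with 80% power.
    n_per_cell = NormalIndPower().solve_power(
        effect_size=effect_size,
        alpha=0.05,
        power=0.80,
        alternative="two-sided",
    )
    print(round(n_per_cell))  # on the order of 15,000 users in each cell

Notice how quickly the required sample grows as the MDE shrinks: halving the MDE roughly quadruples the sample you need, which is exactly the trade-off the next section explores.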

Balancing Trade Offs to Power Your Test

Recall from Chapter 2 that you’ll need a powerful enough and properly designed test to detect a difference at least as big as your MDE. Power in particular depends on a few things, including your sample size and how confident you want to be. And in turn, defining your sample size will force you to trade off between the percentage of users you roll out to and the time you leave your test running. As you can see from Figure 6-2, then, the different considerations for rollout that we’ve already introduced throughout the book so far constrain each other. In practice, this means that as you decide to roll out an A/B test, you’ll be weighing different decisions against each other. We’ll walk you through some of these tensions to help you think about how to make those types of decisions.

Figure 6-2. Considerations you should take into account when rolling out your A/B test.

Weighing sample size and significance level

In Chapter 2, we told you that the power of your test depends on the sample size you have and the significance level you are comfortable with. In practice, you’ll probably use a standard significance level for all experiments, usually 0.05, but there are scenarios when you may consider changing this, and we’ll explore them here.

Let’s say that you’re thinking about rolling out a really substantial change to your experience, and now you want to evaluate whether this global change is worthwhile. In this specific case, your goal in making the proposed changes is to improve your overall user experience, which you hope leads to a measurable increase in your metric of interest, for instance, by increasing retention rates. Because the change is so huge, you won’t roll the test or the change out to everyone unless you observe a measurable increase in retention in a well-crafted experiment. If the experiment indicates no change or a negative change, then the rollout to the entire user base should be blocked. In the scenario we just described, you’ll notice that we kept emphasizing how important it was to observe a meaningful and measurable change. In other words, this is a scenario in which you and your coworkers would likely be relatively risk averse: you’re looking for high confidence that the changes have the desired impact (since they change your product in a way that would be hard to turn back from); otherwise you won’t roll out. In cases like this you’ll want high confidence in your result to minimize the risk of being wrong. As you might recall from Chapter 2, being this risk averse means choosing a stricter significance level, let’s say 0.01, and you’ll need a relatively large sample size in order to power a test at that level. For you, the added cost of a large sample size is worthwhile to have a higher degree of confidence in your result.

Compare the scenario we just described against a very early stage exploration of a concept—an exploratory learning test. In the latter case, you’re looking for directional feedback about a new idea you’re considering. Recall our discussion in Chapter 5 about designing and testing for the appropriate level of polish, and our earlier discussions of balancing speed and learning. In these exploratory tests, you should have already determined with your team that you aren’t rolling out this first iteration of the concept. Instead, you are simply looking for a signal to understand whether you’ve identified a promising area to explore further, or whether abandoning the idea and moving on to something else before investing too much time and resources will serve you better. There will be more opportunities to validate the idea and verify that your design delivers the intended results; this is just the first iteration of the idea. When replication like this is part of the plan, this is the one case where you might consider relaxing your significance level to something like 0.1, which reduces the sample size you need (and means your test can be faster and less costly to run).

So, as you can see, you’ll have to balance your degree of confidence in the data with the amount of time you have to learn.

Getting the sample that you need (rollout % versus test time)

Let’s say you’ve made the decision about how to trade off confidence and sample size. You know the MDE you are aiming for, and have determined the sample size you need and how many treatments you can test. The last key question you’ll need to think about is how to get that sample: Will you deploy the experience to a greater percentage of your users, or run your test for a longer time?

As we already mentioned, when you roll out an A/B test, it is not launched to your entire population at once. In practice, this means that you’ll allocate some of your users to the test cells, and keep the rest of your users out of the experiment. This is because you want to avoid constantly shifting experiences for your users—remember that what is an experiment from your perspective will always feel like a new experience, and possibly a bewildering one, to your users.

In Chapter 4 we talked about the balance between learning and speed. Larger companies with many active users generally roll out an A/B test to 1% or less, because they can afford to keep the experimental group small while still collecting a large enough sample in a reasonable amount of time without underpowering their experiment. Imagine comparing a company with a million daily active users to a company with 10,000. The company with a million daily active users can allot 1% of their daily users to the test condition and get 10,000 users in a single day. It would take the smaller company 100 days to get that same sample size—far too long for most A/B tests. If they ramped up their rollout rate from 1% to 10%, though, they could cut that time down to 10 days, which is much more reasonable.
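The arithmetic in that example is simple, but worth making explicit. Here is a small, illustrative Python sketch using the hypothetical numbers above; it assumes, simplistically, that each day of the test brings new users into the allocation:

    def days_to_reach_sample(daily_active_users, rollout_fraction, required_sample):
        """Rough estimate of how long a test must run to fill its sample."""
        users_allocated_per_day = daily_active_users * rollout_fraction
        return required_sample / users_allocated_per_day

    # Both hypothetical companies need 10,000 users in the test cell.
    print(days_to_reach_sample(1_000_000, 0.01, 10_000))  # 1.0 day for the large company
    print(days_to_reach_sample(10_000, 0.01, 10_000))     # 100.0 days at a 1% rollout
    print(days_to_reach_sample(10_000, 0.10, 10_000))     # 10.0 days at a 10% rollout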

It’s important to allocate enough users to a test so that you can measure small changes to your metric of interest. Tests where your sample is limited, either because your user base is too small or because you don’t have enough time to let your test run, can pose challenges. For example, let’s say you have significantly more users on iOS than you do on Android. You might find that running tests on the Android platform that get you results you can be confident in is much harder as a result. This is why we introduced the concepts of sample size and power early on—knowing about these constraints early will help you think big enough with the hypotheses and tests you’re running.

Who are you including in your sample?

Finally, we want to remind you that it’s important to be sensitive to who you allocate into your test group, and how. Remember that when you’re rolling out an A/B test, you should select a group of users to act as the control group (no changes to their experience) and then allocate other users into each of your test cells. Both your control and your test groups should be representative of the portion of your user base that you hope to learn about.

Let’s return to our camp metaphor. Imagine if you allocated your campers to the experimental conditions for your race to the campsite based on when they arrived at camp. Let’s say you had 20 people that you needed to put into four groups. One way to do this is to take the first 5 people and put them into one group, the next 5 people in the next group, and so on. The earliest campers might be the most eager because they’ve come to camp before, which would introduce bias because you’re comparing a group of experienced campers to a group of new campers, who may never have gone hiking before. You would have inadvertently created a selection process where team 1 is stronger than team 2, which is stronger than team 3, and so on. So to counter this, you might instead have people count off in order—1, 2, 3, 4—and then ask all the 1s to form a team, all the 2s to form another team, and so on. This might be a better strategy for creating a more random sampling.
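The count-off trick is really just an attempt at random assignment. In code, the cleanest equivalent is to shuffle first and then split; here is a minimal Python sketch with hypothetical camper labels:

    import random

    campers = [f"camper_{i}" for i in range(1, 21)]  # 20 hypothetical campers
    random.shuffle(campers)                          # break any arrival-order pattern
    teams = [campers[i::4] for i in range(4)]        # deal them out into 4 teams of 5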

Recall from our conversation in Chapter 2 that the portion of your audience that you test with will constrain who the data represents. For instance, if your hypothesis targets a specific audience, you might want to experiment specifically on that audience first. In other cases, it might make sense to test with new or existing users. Another important reminder is that in order to meaningfully compare your test and control conditions, you’ll need to be sure that there are no hidden differences between the users in your control condition and your test conditions that might lead to confounds: differences in your groups that did not result from the change you made, but instead from other differences that you did not anticipate and control for. We encourage you to revisit Chapter 2 as you’re thinking about allocating your users to tests.

In line with this point, Eric Colson shared a story with us about how important it was for the team at Stitch Fix to be especially disciplined about how they selected users to be in the test and control cells for their A/B tests:

A rookie mistake in A/B testing is not properly randomizing your test and control groups. Suppose you have a particular treatment you wanted to test—it could be a new offer, a new UI, whatever. It’s intuitive to draw a sample of new users—say, 100,000 of them—and then give them a test treatment and then compare their behaviors to ‘everybody else.’ A naive team may draw the sample by exposing the next 100,000 users that come along to the experimental treatment and then resuming the default experience for everybody else after. But this is an arbitrary sample—not a random one! The naive team may push back. “What are you talking about? There was nothing special about those users. I just took the next 100,000 that came along.” But there was a difference: time. The users that received the test treatment were selected at a different point in time from everyone else. The difference may have been just a few days—or even just a few hours or minutes. It may seem like a subtle distinction, but this can result in biasing the sample.

We found that the difference between our test & control groups and “everyone else” can be quite pronounced. We do properly sample a set of users and then randomly assign them to test and control. The sample may have been taken over a particular time period. But the test and control groups were randomly assigned from within this sample so any nuance as a result of the time period will apply equally to both groups. Often, we observe no difference in behavior between test and control groups. This can be disappointing as the treatment often represents someone’s clever idea to improve the user experience. Inevitably—perhaps in desperation—someone will compare the test group to ‘everybody else.’ And alas, there is a difference—sometimes a striking one! Metrics for the test cell can be materially higher and statistically significant when compared to ‘everybody else.’ Perhaps the idea was a winner after all? Of course, this is a specious comparison! The samples were inherently different and can not be compared. The difference observed in metrics is due to sampling—not the treatment.

We found that the day-of-week and time-of-day that a new user visits the site can really matter. For example, busy professionals may be less likely to visit the site on a weekday or during business hours. Even the inner workings of our internal systems can behave differently depending on the time-of-day or day-of-week. It’s important to control for these idiosyncrasies by drawing a sample and then dividing it randomly into test and control groups.

We may not all have a user base that shifts based on day of the week in the way that the customers at Stitch Fix do. But we thought this was a great example of how you, as a designer, need to consider some of the ways in which the environment that you launch your test in might affect the outcome and results that you get. Recall that in our discussion about balancing learning and speed, running tests for at least a whole business cycle can help remediate some of these variations based on day of the week or hour. Being aware of these different conditions, both technical and behavioral (in your users), can help you understand how to design your test and how to interpret the results.

One way to help identify whether you’ve made mistakes in how you do your sampling is to run “A/A” tests every once in a while. In other words, you allocate users to two identical experiences, and see whether there are any differences in the metrics you observe. Since there is no change between the test cell and control, if you get false positives (that is, significant differences between these two groups) noticeably more than 5% of the time at a 0.05 significance level, it’s likely that there’s an issue with how you are allocating users that is biasing your results. You should reevaluate your methodologies to address these concerns, and be skeptical of any results run under these imperfect conditions.
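To build intuition for what a healthy A/A setup looks like, a quick simulation can help. This is a sketch with made-up traffic numbers (using NumPy and SciPy); with unbiased allocation and a 0.05 threshold, “significant” differences should show up only about 5% of the time:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_simulations, n_users, base_rate = 2_000, 5_000, 0.10  # hypothetical values

    false_positives = 0
    for _ in range(n_simulations):
        # Both "cells" get the identical experience, so any difference is pure noise.
        cell_a = rng.binomial(1, base_rate, n_users)
        cell_b = rng.binomial(1, base_rate, n_users)
        _, p_value = stats.ttest_ind(cell_a, cell_b)
        if p_value < 0.05:
            false_positives += 1

    print(false_positives / n_simulations)  # should hover around 0.05

If your real allocation system produces a rate well above that, it is a signal to investigate before trusting any A/B results that come out of it.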

We don’t mean to overcomplicate the process of allocating your users into different test cells, but it is worth thinking about all the factors that might bias the results you get for your test, and to try to address them. However, you will also weigh these factors with some rational thinking around whether they actually would have major impact on your results. Some of the factors that may or may not influence what kinds of users you get are:

  • Are these users that are familiar with your product or are they completely new to your product or service?

  • What devices/platforms are they using? Does your product look or perform differently on different platforms? (Android, iOS, browser, etc.)

  • How much technical expertise do your users have? Will your future users have similar technical expertise?

For users that are completely new to your experience:

  • When did they sign up? (Time of day? Day of week?)

  • Where did they sign up for it? (From an ad, from a search result, etc.?)

  • What country are they from? There are cultural differences as well as national standards and legal requirements that may affect the user experience and therefore users’ behaviors.

  • Will different types of users use your product in different ways, or with different frequency?

This list of questions will likely differ based on the kind of test that you are running, but also based on what kind of product you have. And for each of these kinds of questions that you might ask yourself, you should also ask “Does it really matter?”

As a very obvious example, suppose you were running an A/B test on a newly designed sign-up process on mobile. You wouldn’t bother to allocate any users who are signing up on the web to your test cells; you would only put users into your sample who were signing up on mobile, as this is the thing you are trying to test. Let’s say that for your particular product, you know that people who shop during the day tend to be students (who have time to sign up for things during the day) but that people who sign up for your service in the evenings and weekends tend to be working professionals. Again, depending on the nature of what you are testing, you might want to make sure that you are thoughtfully getting both kinds of users. If you were only allocating users into your test cells on Saturday and Sunday, you might have inadvertently biased toward getting more professionals into your test cells, and it would have been worthwhile to make sure you were allocating users into your test cells over the course of a full week. Similarly, if you started your experiment on Saturday and ran it for eight days, you may have inadvertently biased toward weekend behavior (because you have two Saturdays in your sample). A good rule of thumb is to run your experiment for multiples of seven days (or whatever the length of your business cycle is) to avoid these biases creeping into your data.

All of these variations can be challenging to keep straight. If you can, we encourage you to lean on the data-savvy folks within your company to help you think through these considerations. However, we believe that having a solid grasp on these concepts will let you participate in (and possibly even drive) those discussions. As a designer, your well-honed intuition about your users provides a valuable and unique perspective in identifying other possible confounds in allocation. What have you learned about your users and how they engage with your product? Have you seen systematic variation in how your users engage with your product? How might they affect sampling? As you get data back on your designs and make decisions on next steps, it’s really important for you to be aware and thoughtful about how something like the sample that is being allocated to your test cells might have an impact on the results you get. Throughout the analysis portion of the chapter, we’ll give you some practical tips to minimize the risk of bias in your results.

Practical Implementation Details

So far in this book we have talked about the implementation of your experiment, such as which users to run your experiment with, which countries to run your hypothesis in, and which platform to use for testing. These are important considerations for launching your test.

In many cases, the implementation details of your hypothesis are flexible, and you can choose them wisely to give your hypothesis the best chance of providing valuable insights. For instance, companies that have large international user bases like Facebook will run a test experience on an entire country. Since Facebook is a social network, it’s important for their testing that the people in their test cells are actually connected to each other. By testing on an entire country that is somewhat isolated (e.g., by language and/or culture) from the rest of the world, you get a sample that will have good social connections, but is decoupled from the rest of the world so it allows for some testing to be done. Facebook uses New Zealand as one of its favorite testing grounds for exactly these reasons.1 Localized testing can also help reduce or eliminate the risk that a single user is allocated to conflicting tests, since a user can’t be both in China and in New Zealand at the same time. Again, because of cultural and other considerations, companies might choose countries that are either very similar to their key markets or countries that are very different from the market that they are most familiar with.

Other times, you can be flexible about choosing to test a hypothesis on just one platform, even if you want that hypothesis to eventually apply to all your platforms. You should ask yourself, which platform is going to be the best one for you to learn on? Is there a particular platform that will amplify the effects of what you are looking for? For instance, consider Netflix’s hypothesis that automatically starting the next episode of a TV show after the previous one finished would increase hours streamed, one of their main metrics of interest (because of its ability to predict likelihood of renewing one’s subscription). This hypothesis could have been tested on any platform. However, the designers at Netflix decided to first pursue the hypothesis on the TV, since it’s a lean-back experience that often occurs during longer sessions.

These two examples are cases where the details of how to launch the A/B test weren’t fixed—the teams behind those tests could choose which platforms and countries that were most ideal for testing. In other cases, however, details of your test will be determined by the hypothesis itself. For instance, say you are working on a web application that lets individuals log their workouts to track progress toward their fitness goals. You’ve noticed that although your retention is high with people whose primary activity is weight lifting, retention is relatively low in runners. As a result, you formulate the following hypothesis:

For runners, by using GPS to automatically import the details about a run, we will lower the friction for runners to use our app because they don’t have to remember to track and log their details manually, which we will measure by an increase in retention for this population.

Runners don’t run with their laptops or desktop computers: they run with their phones. Therefore, this hypothesis implicitly suggests that it must be tested on a mobile device, which may or may not be possible due to the limitations of your product (perhaps you don’t have A/B testing capabilities on mobile just yet, or you don’t have a mobile app at all).

Another case in which you might run into implementation limitations is when you’re targeting a particular audience—for instance, users in an international market such as Japan. If you are an international company, only a subset of your users will be in Japan. You’ll have to be cognizant of whether you can still get a large enough sample size in a reasonable amount of time to appropriately power your test.

Finally, we just want to remind you to make sure that you are set up to collect the data necessary to make a good and informed decision when you do get your results. Again, before you set off to launch your experience to your users, take the time to think about what data you will need to collect and whether there are any other parts of your experience that you will want to make sure are tracked so that you can learn from them. Instrumentation refers to the types of data that your company is able to collect. You’ll need appropriate instrumentation for your metric of interest, and for all of the secondary metrics you care to observe. Once you launch your test, it is often much harder to retroactively put things into place to measure the data and get the information that you want. We have seen many people have to rerun or relaunch A/B tests because they didn’t take the time to consider some of these basics before they launched. Remember that if your metric of interest is how you will evaluate the success or failure of your test, you’ll need to have that measurable at a minimum in order to know what you’ve learned.

Is your experience “normal” right now?

If you recall our conversation in Chapter 2, one of the major strengths of A/B testing is that it is “in the wild.” That is, you’re not in a lab setting where users’ attention and distractions might be different than in the real world; instead, you have all the messy considerations of life accounted for.

However, sometimes the nature of contextual research can be a pitfall, too. Major events in the world, or changing components of your experience, can confound your experiment. For instance, the Google home page is known for using a Google Doodle in place of its logo. Jon Wiley shared a story about a Google search page redesign, where the experience was made cleaner by removing everything except the Google doodle and the search bar.

The problem Jon shared with us was that they originally ran this experiment on the same day that the Google Doodle was a barcode (Figure 6-3). As a result, a lot of the users who were in the experiment landed on a home page that had a barcode and a search box, but nothing else, which created an unusual experience that confounded the test results. This story is a good reminder that in many of your products there are other things that might be going on at the same time as you are running your experiment that you didn’t account for when you were crafting the design you want to test. When this happens, your best course of action to avoid being misled is to throw away the data and restart the experiment under normal conditions. Though it can feel unfortunate, this is the best way to avoid making decisions with biased and misleading data, which could prove more costly in the end.

Figure 6-3. This unusual Google Doodle confounded an A/B test that launched on the same day, by creating a strange, unintended experience for users in the test.

Sanity check: Questions to ask yourself

Before you launch, there are a few questions you will want to consider. Some of these are repeated from questions we’ve shared in the previous chapters; however, given that some time has probably transpired as you’ve been going through this process, asking them again might be a refreshing way to make sure you are still on track with your original goals:

  • What am I trying to learn? Do I still believe that my design articulates what I’m trying to learn?

  • What will I do if my experiment works or doesn’t work? (Do I have a sense for what my next steps might be?)

  • Does my test have a large enough sample size to be powerful at the significance level I want?

  • Do I understand all the variables in my test? (So that when the results come back, I’ll have some ideas as to what influenced which changes.)

  • Do I have good secondary metrics or additional tracking built into my test so that I can do deeper analysis if needed?

  • Will this data be informative for developing new hypotheses or further experiments?

With these questions in mind, you’ll be well prepared to launch an informative experiment to help you learn about your users and your product. In the next section, we’ll give an in-depth treatment of how to evaluate your results after your test has been completed.

Evaluating Your Results

Thus far in this chapter, we’ve focused on prelaunch considerations. Now, we’ll imagine that you’ve paired closely with your engineering friends and managed to deploy your test to users. You’ve given the test the time it needs to acquire a sufficient sample size, and now you’re ready to analyze the results.

If you’ve done everything right, then as you launch your test you should:

  • Understand and know how different factors could influence or be reflected in your results (e.g., audience, platform, etc.)

  • Have good secondary metrics or additional tracking built into your tests so that you can do deeper analysis if needed

After you launch a test, your work isn’t over. In fact, some of the most important work comes with what you do after. A/B testing gives you huge potential to learn about your users and how your design affects them, but it’s in the analysis part of your work that insights will crystalize and become concretely valuable for you, your team, and your product. During the analysis, focus on questions like the following:

  • What impact did your changes have on your metrics of interest? Was this impact surprising, or did it align with your expectations?

  • Did you observe impact to any other key, proxy, or secondary metrics? What was it?

  • Are there any results that require further investigation using other techniques?

  • What did these results show about your hypotheses?

It’s useful at this point to revisit and reiterate some of the things you originally set out to learn and look for evidence of whether you were right or wrong in what actually happened. You should also be equally open to the possibility that the results you get are inconclusive—that you can’t tell if you were either right or wrong. Getting an inconclusive result doesn’t mean that you didn’t learn anything. You might have learned that the behavior you were targeting is in fact not as impactful as you were hoping for.

Revisiting Statistical Significance

Recall that in Chapter 2, we told you that measures of statistical significance such as p-values help you quantify the probability that the difference you observed between your test cells and control group was due to random chance, rather than something true about the world. A p-value ranges between 0 and 1. When we see a very large p-value (that is, closer to 1 than to 0), this means that it is likely that the difference we observed was due to chance. In these cases, we conclude that the test did not have the intended effect. Smaller p-values suggest that the observed difference was unlikely to be caused by random chance. You can’t statistically prove a hypothesis to be true, but in a practical setting we often take smaller p-values as evidence that there is a causal relationship between the change we made and the metrics we observed. In many social science fields like psychology, and also in most A/B testing environments, we take a p-value of p = 0.05 or less to be statistically significant. This means we have 95% confidence in our result (computed as 1 – 0.05 = .95, or 95%). However, your team may work with a larger significance threshold if you are willing to be wrong more often.
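To make this concrete, here is a hypothetical sketch of how a simple two-proportion comparison might be checked in Python with the statsmodels library; the conversion counts are invented for illustration:

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical results: conversions out of 10,000 users in each cell.
    conversions = [580, 500]          # test cell, control cell
    observations = [10_000, 10_000]

    z_stat, p_value = proportions_ztest(conversions, observations)
    # Compare p_value to your significance level (e.g., 0.05) to decide
    # whether the difference is unlikely to be due to chance alone.
    print(p_value)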

In quantifying the likelihood that your result was due to chance, p-values are a good indicator for the amount of risk you’re taking on. Very small p-values can give you more confidence in your result, by minimizing the risk of false positives. A false positive is when you conclude that there is a difference between groups based on a test, when in fact there is no difference in the world. Recall that p < 0.05 means that if in reality there was no difference between our test and control group, we would see the difference we saw (or a bigger difference) less than 5% of the time (Figure 6-4).

Figure 6-4. P-values range from 0 to 1. The larger the p-value, the more likely the result was to be caused by chance. We often take p-values below 0.05 to be statistically significant in product design, but smaller significance levels can further reduce risk.

This means that when there is truly no difference, we will still see a statistically significant result 5% of the time, a false positive. That’s 1 in every 20 tests of changes that actually do nothing, which, if you’re a company that runs a lot of A/B tests, can be quite a lot! If you set a lower threshold for statistical significance, say, 1%, then you reduce your risk: you’ll only be wrong in this way 1 in 100 times. However, to be confident at that level requires a larger sample size and more statistical power, so as we discussed before there are trade offs.

Fortunately, although false positives are a reality, there are methods you can use to reduce your risk of falling into that trap. Later in the chapter, we’ll talk a little bit more about how you can use some of these methods to better understand whether you’re observing a false positive. First, though, we’ll talk a little bit about evaluating your results.

What Does the Data Say?

Remember that when you defined your hypothesis, you should have specified the metric you wanted to change and the expected direction of the change (e.g., increase retention, or decrease churn). Knowing this, you could see three types of results: an expected result, an unexpected result, or a flat result. An expected result is when you observe a statistically significant result in the direction you expected. An unexpected result is when your result is statistically significant but in the opposite direction. And finally, a flat result is when the difference you observed is not statistically significant at all.

In the next few pages, we’ll talk about each of these results in turn.

Expected (“Positive”) Results

It’s always great to see the results you hoped to see! You’ve probably found evidence that your hypothesis has merit, which is an exciting learning to share with your team. But remember that your goal in A/B testing is not to ship the tested design(s) as soon as you see the results you wanted, but to understand the learning behind that result, and what it means for your users and your product. From a practical and ethical standpoint, too, a positive result doesn’t mean you should make your new default experience the test cell that won just yet. Remember that every change you make to your experience is a change that your users will have to live with. Constantly altering experiences for them can cause friction and frustration. We encourage you to ask yourself the following questions:

  • How large of an effect will your changes have on users? Will this new experience require any new training or support? Will the new experience slow down the workflow for anyone who has become accustomed to how your current experience is?

  • How much work will it take to maintain?

  • Did you take any “shortcuts” in the process of running the test that you need to go back and address before you roll it out to a larger audience (e.g., edge cases or fine-tuning details)?

  • Are you planning on doing additional testing and if so, what is the time frame you’ve established for that? If you have other large changes that are planned for the future, then you may not want to roll your first positive tests out to users right away.

As you can see from these questions, if the change that you’re rolling out is substantially different from what your users are accustomed to and disruptive to the user, then you’ll want to think carefully about how you roll out those changes, even if your metrics show that the result will be positive. As with any new feature rollout, you will want to think about how those changes are communicated to your customers, if you need to educate other people in your company about the changes to come, and what other steps you might need to take to ensure that your work is ready for a broader audience.

Sometimes, you’ll decide that the results from a positive test aren’t worth rolling out. Dan McKinley, formerly of the peer-to-peer independent marketplace Etsy, shared an example of a test that had expected results but never got rolled out. He said:

In the seller backend we had people who are very heavily invested in specific workflows. Our best sellers were very important to us. We could design a listing process, the way to get an item for sale on Etsy, that’s objectively better for any visitor. However, it could totally be the wrong thing to release because it’s a different thing that thousands of people have to learn. So you could get a positive A/B testing result there and the decision would be “We’re not going to release it” or “We’re going to slowly evolve our way to that” to reduce the friction.

McKinley is speaking to the point that a rollout decision involves more than seeing positive results—it’s also doing the “right” thing by your user base, which might be to leave the experience the same so they don’t have to learn something new, or finding ways to slowly onboard them to the new experience so it doesn’t feel so jarring.

One final consideration to keep in mind: Don’t follow statistically significant results blindly. A statistically significant difference does not always mean a significant difference. Just because a result is statistically significant doesn’t guarantee it will be practically important (or what statisticians would call “substantive”). Your data friends can help, and we link to a relevant paper in the Resources appendix.

Unexpected and Undesirable (“Negative”) Results

Sometimes, you see results that are unexpected or undesirable. A design change that you hoped would make a positive impact for your users caused the opposite: those test cells perform worse according to your metric of interest than the existing experience. Even though it can feel disappointing to have an A/B test that went differently than you were hoping, we want to remind you that these failures can sometimes be the greatest opportunity for learning. Unexpected results challenge your and your team’s assumptions, and being challenged in this way forces you to reevaluate and hone your instinct about your users.

Seeing a negative result might mean that something was wrong with your hypothesis, and your hypothesis is a reflection of your own instincts about what would cause a positive reaction in your users. Always reflect on what led you to devise a hypothesis that didn’t resonate with your users. Here are a few questions to get you started:

  • Are they using the feature the way you think they do?

  • Do they care about different things than you think they do?

  • Are you focusing on something that only appeals to a small segment of the base but not the majority?

Spend a lot of time with the data to see if you can understand why your test had a negative result. Exploring secondary metrics here can help you gather a more holistic image of what happened that led to the result you observed. You may also consider supplementing your test with usability studies, surveys, or interviews to get a better understanding of why your test showed the results you got.

Understanding the context of why the experiment failed can also help you decide what to do next. Dan McKinley reminds us with this example that the result of an A/B test is one piece of information that should be weighed among many other considerations.

There would be cases where you’d have negative results that you’d still want to release. You do A/B tests so you have data when you’re discussing whether or not to release the experience. You’re not acting out an algorithm and taking human judgment out of the process. This is only one of the inputs, but we still talk about it. And we decide, “OK, this affects us negatively but we’re going to release it anyway,” or “This is positive but we think it might be an evil change that we shouldn’t do.” Both of those things are possible. Human judgments are very much part of the decision to release a thing or not.

The best example of negative results that we released anyway was switching Etsy from sorting search results by recency to sorting by relevance. Very early on, search results on Etsy were ordered so the most recent item was first. That made sense when there were a thousand items on the site.

Then a few years later, there were 20 million items on the site and it made less sense. At that time, listing an item cost 20 cents, so Etsy was getting a decent amount of its revenue from people renewing items [so that they would get pushed to the top of search results]. We realized this was a horrible user experience. We thought we could improve conversion somewhat by releasing search relevance. In the test, we improved conversion very slightly. But nowhere near enough to pay us back for the loss in revenue. We ultimately released an ad product that was a better use of sellers’ marketing budget than just pressing the 20 cent button over and over again. But there was a discussion of “Here’s how much money we’re going to lose when we release this—we’re just going to do it.” So we could have made the globally optimal decision [for revenue] and not released that at the time. My instinct is that the company wouldn’t exist now if we did.

As we’ve discussed, deciding whether or not to release a specific experience isn’t always as simple as just looking to see if the results of your experiment are positive or negative relative to your metric of interest. As you can see, sometimes unexpected results can still lead to a rollout of an idea, because it’s the learning that’s more important than the outcome of the test. Making such decisions should involve thinking about the broader decisions you are making to ensure the best possible experience for your customers.

In other contexts, however, all of the evidence you receive will point to your test cell being unfavorable. This is true when both your key and secondary metrics tell the same story: the test cell had an unexpected and unwanted consequence for user behavior. When this happens, you should stop running the test, consider putting the users in that test cell back into the control experience, and decide whether or not there is any value in further exploring that hypothesis.

At the design level, you should resist the urge to revisit and polish the “failed” design. In tests like this, incremental tweaks to language and execution rarely overcome a negative result. Trying to turn an unwanted result into a positive outcome this way is often too costly and unlikely to work. In these cases, it’s often better to go back to the drawing board and see if you can refine or revise your hypothesis in a way that will create a more favorable result. Or perhaps consider abandoning that hypothesis for the time being in favor of better-performing hypotheses.

When the World is Flat

So far, we’ve focused on two cases where you find a statistically significant difference between your two groups, whether it was the one you intended or not. In practice, many A/B tests are inconclusive, where there doesn’t seem to be any measurable difference between the test experience and the control. Null or flat results are common and occur even when you feel that there are huge differences between your test cells. For instance, in his book Uncontrolled, Jim Manzi reports that only 10% of experiments lead to business changes at Google,2 and at Skyscanner,3 roughly 80% of tests fail to improve predicted metrics in a statistically significant way.

There are two ways you can interpret these kinds of results:

  • Your hypothesis may have merit, but you haven’t found the right expression of it yet (you haven’t gone big enough yet). More formally, it could be that your experiment was underpowered: a true effect exists, but it is smaller than the effect you designed the experiment to detect, so you need more power to see it.

  • The hypothesis that you were testing won’t affect the behavior of your users in a large enough way that it changes metrics.

The “art” of A/B testing becomes apparent when you hit impasses such as this. Because you don’t know whether your hypothesis has the potential to cause a meaningful impact to the user behavior you want to alter, you’ll have to trust your instincts on whether this line of thinking is worth investing in further. If you have other evidence to support the hypothesis (for instance, a strong signal from in-lab user research), it might be worth investing additional time making some refinements to your explorations. However, if you’ve investigated this hypothesis unsuccessfully several times, or you don’t have other evidence to support it, it might be that you’ve exhausted your options and should abandon the idea. A lot of A/B testing is an iterative process where each test helps to further develop your thinking. So if you’re considering retesting, think about what your specific concerns are and why you didn’t get stronger results. If we revisit some of the factors listed earlier in this chapter, you might ask yourself the following:

  • Did you select the right sample?

  • Do you need more users to be able to measure the effect of this change?

  • Did you keep your test running for long enough? (Remember that depending on the metrics you are measuring you might need to run longer tests to see the effects—retention is a good example of this.) Remember also that you should run your tests for a whole number of business cycles.

  • Were there other external factors that might have affected your test?

Think also about the execution of the test:

  • Were the changes between the test cells pronounced enough? Were they substantially differentiated? (Did you jump too quickly into testing details before you had established the basic concept?)

  • Is there something upstream in the user flow that might have affected your results?

  • Can you dig into your secondary metrics to see if there are any clues or differences in these metrics that might give you a hint as to what is going on?

Address these concerns, decide whether you have other evidence to support your hypothesis or not, and then consider retesting. Secondary metrics can be of huge value here to help you tease out whether there were any meaningful behavioral changes for your users that may not have ultimately caused a change in your metric of interest. Remember that you shouldn’t base rollout decisions solely on secondary metrics, but they can be a good indicator to help you gain a more complete picture of why your result was flat: were there behaviors that counteracted the metric you were interested in? Did your test have no effect on any behavior at all?

If you didn’t see an impact from your changes, there is no clear benefit to forcing your users through the friction of adopting a new workflow or experience. Remember, any change you make to the experience will have an effect on your existing users. They’ve become accustomed to using your product in a certain way. If there isn’t a clear benefit to the changes you’ve made, they might still cause some confusion and frustration.

There are, of course, times where it makes sense to roll out your changes even if you come to the conclusion that you didn’t affect any of your key metrics. For example, if you believe that your feature adds value because it’s something that users have been requesting, it is a simplification (or improvement) to the UI, or it has strategic value, then you might decide to roll it out. In these situations, the changes you are making have value beyond what you might be able to measure in metrics. (You might, of course, find other ways to capture the impact in data—for example, usability studies, surveys, and other methods could be employed to better understand this.) For instance, you might be able to make an announcement that you’ve addressed the most requested feature from users, you might be able to make a change that is more in line with your brand or your design principles, or you can set the stage for being able to do something more strategic in the future. In these situations, you can make the decision with confidence that your change won’t have a negative impact on your key metrics.

One example of this is when Spotify rolled out their new and unified dark interface. The new redesign did not affect retention. However, the design team at Spotify got to examine the holistic experience of Spotify as a brand and an app, which “cleaned up” a lot of the issues with the UI. Survey results also showed positive user sentiment toward the redesign, which gave the team additional confidence that they should roll out the new interface.

Another example comes from a conversation with Dan McKinley, who shared some interesting insights into how the nature of the work you are testing influences the kind of results you can expect to see. Specifically, Dan talked about the series of tests his team ran to ensure that their site worked as well on mobile as it did on the web. At that time, many of their experiments were run with the goal of creating a coherent design and a responsive site on mobile rather than directly impacting the bottom line.

Flat tests are the normal experience. Most experiments were run to make sure that we’re not inadvertently impacting users in a way we’re not meaning to. We always hope for improvements but a lot of work wasn’t to get a direct monetary response. You know, it’s more like “All right, this is a site that’s been on the internet since 2006 and we’re accidentally using 20 slightly different shades of grey.” We’re gradually making a more coherent design and making a responsive site. This is a massive undertaking; there’s a certain amount of work you have to do just to keep up with the state of the internet, right? It was rare that any of those changes affected the bottom line.

But that’s still work that we want to be doing. It’s strategically important even though we don’t expect massive windfalls from single tests.

The caution with rolling out changes in any scenario is that you should be thoughtful about the inconvenience to users, and the cost in company resources, that rolling out those changes will incur. For example, if you’re adding a completely new feature (rather than improving an existing one), how much additional work does that feature represent for the future? Will it somehow constrain future innovation?

In summary, you generally won’t roll out changes if you don’t see the impact you were hoping for. Any change to your experience has costs. Those costs might be as small as causing a little friction for users who have become accustomed to a specific experience, or as large as adding a new part to your experience that you now need to maintain and evolve going forward.

Errors

Errors are inevitable in experiments. With the industry-standard significance level of p = 0.05, around 1 in 20 A/B experiments will result in a false positive, and therefore a false learning! Worse yet, every new treatment you add can raise your overall error rate by roughly another 5 percentage points, so with 4 additional treatments your chance of at least one false positive could approach 25%. Keeping this at the top of your mind will help you keep an eye out for these errors so that you can spot them before you make decisions on “bad” data. The best way to limit errors is to run fewer treatments, or to speak to your data friends about corrections for multiple comparisons.
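If it helps to make this concrete, here is a minimal Python sketch (our own illustration, not a method prescribed by any particular testing tool) of how the chance of at least one false positive grows as you add treatments, and of the simple Bonferroni correction your data friends might suggest:

    # Minimal sketch: how the chance of at least one false positive grows
    # with the number of treatment-versus-control comparisons, assuming
    # independent comparisons at a 0.05 significance level.
    alpha = 0.05

    for num_treatments in range(1, 6):
        # Probability of at least one false positive across the comparisons
        familywise_error = 1 - (1 - alpha) ** num_treatments
        # Bonferroni correction: tighten the per-comparison threshold
        corrected_alpha = alpha / num_treatments
        print(f"{num_treatments} treatment(s): "
              f"chance of a false positive ~ {familywise_error:.1%}, "
              f"corrected per-test alpha = {corrected_alpha:.4f}")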

There are two kinds of errors. In a false positive, we conclude that there was a change when in fact there wasn’t. In a false negative, we declare that there was no effect when an effect was there—our experiment just didn’t find it. It’s impossible to eliminate both kinds of error entirely; at a fixed sample size, reducing one tends to increase the other. Instead, our goal as good experimenters should be to limit them as much as possible. More importantly, we should keep in mind that we are not dealing with certainty, only data with a specified level of confidence, and therefore the potential to be wrong.

Replication

The idea of replication, or repeating an experiment to see whether it produces the same findings, is borrowed from academic science. Replicating an experiment helps increase our confidence that we aren’t observing a false positive. Remember that with a p-value threshold of .05, we would expect to be wrong about 1 out of every 20 times. But getting a p-value below that threshold twice purely due to chance has a probability of only 0.25%—about 1 in every 400 times. As you can see, repeating an experiment can vastly increase your confidence in your result.

If an uplift or a drop seems highly unusual or particularly surprising, your best course of action is to run a new experiment and see whether your finding replicates. This will help you gain increasing confidence in your results, and therefore reduce the risk of error.

Using secondary metrics

Another way to help spot errors is by leveraging your secondary metrics. We introduced the concept of a secondary metric in Chapter 4. These can help you in two ways:

  • As a gut check on your key metrics

  • To gather other insights or learning that build your knowledge base

If you see that your secondary metrics align with your primary metrics, then you have even greater confidence in your results. This is because if multiple metrics point to the same underlying user behavior, then you’ll feel more comfortable that you’ve correctly identified the cause of the change in metrics. This helps reduce the probability of a false positive.

Recall our earlier example from Etsy. If you see something unexpected or even counterintuitive, secondary metrics can help you understand what’s going on and paint a more complete picture of how your users are behaving. They can also flag that the results you are seeing aren’t sustainable: for instance, your primary metric of interest did change, but other metrics that should move with it did not change (or changed in the wrong direction). This might indicate that your test falsely increased a key metric without building the right behavior in your user base.

You can also use the secondary metrics you are gathering to build up your existing knowledge about how your users interact with your experience. How much time do your users generally spend on the site? Where are they going? Use this information to help feed into future hypotheses or to further refine your tests.

Secondary metrics and digging deeper with other data that you have can also help dispel false narratives that arise from the human desire to “explain” things. Eric Colson from Stitch Fix talked about how they are able to use data to understand what is really happening with their customers and how having access to rich data can give you insights that you might not have had otherwise:

Our model affords us rich data that can be used to understand the variation in business metrics. This prevents us from falling victim to narrative fallacies—the plausible but unsupported explanations for the variation. Traditional retail models lack the data and therefore are often prone to latch on to easy-to-grasp narratives. I am told that it is common for store managers at department stores to blame the weather for a bad day of sales. Other times they blame the merchandise (“how can you expect me to sell this stuff?”). Other times still they blame the customers (“Customers these days are so promotion-driven. How do you expect me to sell this stuff at our prices?”). Of course, when sales are good the store manager accepts the credit (“it’s due to the service level we provide in this store”). Each narrative is plausible. And, indeed we gravitate to any explanation that seems satisfying—it’s not in human nature to let things go unexplained. We’ll even accept them in the absence of evidence. The response, “I don’t know,” just isn’t a tenable answer to the question of ‘why did sales go up or down?’

At Stitch Fix we have so much rich data that we not only have the ability to explain phenomenon, but also the obligation. We can statistically tease apart the effects owing to the client, the merchandise, even seasonality. Narratives still surface, however, they are offered up as hypotheses until validated. By studying the data with proper statistical rigor we can either accept or reject them. For example, a few Halloweens ago we had this weird drop in metrics. A tempting narrative circulated that the drop was due to the fact that Halloween fell on a Friday this year, distracting our clients from their typical shopping behaviors. It was a reasonable and relatable explanation. Yet, the narrative was stopped from spreading in the office because it had no evidence to support it. Instead it was recast as a hypothesis. We framed it up as such and looked for evidence. Customer behavior was not consistent with the hypothesis, causing us to reject the hypothesis. Eventually it was revealed that it had nothing to do with Friday Halloween. It was a bug in some recently released code!

While the amount of data that a company like Stitch Fix has is perhaps a lot more than your average company, we think this story helps to give a good overview of the approach you can use as you learn to leverage your data to understand your users’ behaviors in a more granular way. As humans, we can be quick to jump onto stories that resonate with us, but data (and especially data triangulation) can help us to either back up those stories or debunk them.

Using multiple test cells

When you’re designing to learn, your goal is to understand how your hypothesis “holds up” in the world. Although it’s important to get the design execution right, understanding how your hypothesis performs is a generalizable learning that you can apply to tweaking your design in the future. Remember that in Chapter 5 we talked about different test cells. Sometimes, each test cell will express a different hypothesis. However, in other cases you may have multiple treatments of a single hypothesis, which you can use to help you learn more.

When you have different test cells expressing the same hypothesis, you’ll want to evaluate your results both by looking at individual test cells as well as looking at your test as a group of test cells. You might find that all of the test cells in your experiment changed your metrics of interest in the way you intended compared to the control. If each of the test cells changed in the same direction relative to the control, your confidence about the underlying hypothesis should increase because each of those treatments led to the user behavior you expected. If all of the test cells caused unwanted changes to your metrics compared to the control, then you’ve probably disproven your hypothesis. Assuming that there was no issue with how you allocated users into your test conditions, it’s unlikely that every treatment would have failed if your hypothesis was true of the world.

Sometimes, you’ll see a mix within the test cells, where some results are expected and desired, and others are unexpected and undesirable. In those cases, you’ll want to dig deeply to understand why your test cells performed differently. Is it possible that you might be observing a false positive in some of your test cells? Or can you use secondary metrics to learn what difference in user behavior caused only some test cells to exhibit the intended behavior? This will help you tease out the cause of these discrepancies, be more confident in the results you’re observing, and understand what they mean for your hypothesis.
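As a concrete illustration, the sketch below compares each test cell to the control individually using a two-proportion z-test from the statsmodels library; the cell names and numbers are invented for illustration, and in practice you would also correct for the multiple comparisons discussed under “Errors.”

    # Hypothetical results: conversions and users per cell (invented numbers).
    from statsmodels.stats.proportion import proportions_ztest

    control = {"conversions": 1180, "users": 20000}
    cells = {
        "cell_a": {"conversions": 1265, "users": 20000},
        "cell_b": {"conversions": 1240, "users": 20000},
        "cell_c": {"conversions": 1150, "users": 20000},
    }

    for name, cell in cells.items():
        counts = [cell["conversions"], control["conversions"]]
        nobs = [cell["users"], control["users"]]
        z_stat, p_value = proportions_ztest(counts, nobs)
        direction = "up" if counts[0] / nobs[0] > counts[1] / nobs[1] else "down"
        print(f"{name}: {direction} versus control, p = {p_value:.3f}")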

Rolling out to more users

One of the best ways to reduce the risk of false positives is to roll out an experience to users in increments, allowing you to gain additional confidence in your results before you launch to 100% of your user base. This helps you understand how the data you observed in an A/B test scales to the larger population, and increases the probability that you “catch” an unexpected impact.

Sometimes you might still get a surprising result even if all of your data turned out positive, so ramping up is a way to see whether an effect you observed in an initial A/B test scales to the larger user base. This story from Jon Wiley at Google is a great example of how even when you get positive results from an experiment, you might find that the response is different when you launch your winning experience to your full user base. The Google home page is very iconic and poses an interesting design problem because of its simplicity. One exploration that Jon did was to clean up the home page by removing all of the navigation, all of the links, anything in the footer and header, sign-in, and so on. However, many of those links are quite useful, and he also wanted to make sure that the home page didn’t lose that utility. The team came up with the idea to initially hide the links and then to fade them in if the user moved their mouse (which they believed showed intent). The team believed that this way the links would become available to the user at the moment they needed them:

After running the experiment, all of the metrics that we had tracked told us that it was a good experience for users. We just said, “Oh, OK, we’ll launch this. The numbers say it’s good.”

Then, we got a wall of user wrath. I just started getting tons and tons of email from people who were furious. We had done the opposite of what we had intended to do. Rather than making those things on the page go away unless you needed them, it turns out that everybody used their mouse even when they had no intention to click. We were actually calling attention to these things on the page. We were giving them more prominence than they deserved. This was one of those occasions where our metrics told us one thing, but the reality was that it drove everybody nuts.

Now, we had other methods of trying to elucidate this reaction. We have user research. We do qualitative research. In our qualitative research, people didn’t really notice it because it’s one of those cases where we tested it with 8 people or 16 people. People weren’t really paying attention to it. However, in aggregate, over a much larger number of people, it actually struck a nerve and really bothered people. This is a place where we were just very surprised that all of the signals that we saw weren’t true.

This story is a good illustration of some of the limitations of using data in the design process, and how sometimes the data that you observe in an A/B test might not scale in the larger population. It also points to the ongoing “conversation” between your designs and your customers that we discussed in Chapter 4. That “conversation” might evolve over time. At each stage of your experiment, from running usability tests in-house, to an initial A/B test, to a broader rollout test, and finally to launching to your full customer base, the conditions in which your customers are evaluating your design might be different. The response you get could be different if it’s in a lab setting in your offices, on a small segment of your customer base, or launched to the full population. This was also a great example of recognizing how we need to consider both the emotional response to our designs and the impact on metrics. We will discuss the idea of “ramping up” later in the chapter.

Revisiting “thick” data

Finally, remember that data triangulation can be a great way to improve confidence in your results. Tricia Wang, global tech ethnographer, draws a distinction between “big data” (the type of data collected at scale, such as that which comes from an A/B test) and “thick data” (for instance, that which is collected from ethnography). Thick data is drawn from a small sample, but its depth can give you good insights into the very human reasons why you might have seen a behavior emerge through numbers (Figure 6-5).4

Figure 6-5. An illustration of the difference between big data and “thick data” by Tricia Wang.

At the end of an A/B test, we encourage you to be thoughtful about using other sources of data to supplement what you’ve learned, and make sure you haven’t made any mistakes in your measurements along the way. Arianna McClain shared an IDEO project to explain why collecting “thick data” is so important to catch errors in “big data.” By triangulating different research methodologies, the IDEO team learned that one of their client’s metrics—churn—wasn’t actually measuring what they thought it was measuring:

One of our clients sought to deepen their customer relationships, from new to current to churned customers. They were concerned about their churn rate, which was measured as days since last purchase. In addition to working with a team that ran quantitative analyses, IDEO conducted qualitative interviews with multiple customers who churned–some after one purchase, others after dozens of purchases.

What IDEO learned was that many of our client’s customers were continuing to engage with the product, they just weren’t purchasing. There was no way for our client to know this because customers weren’t signed in. When we asked customers why they never signed in, they simply said that there was no reason to—signing in still showed them the same content. This insight was a relief to our client. First, they realized they were not losing as many customers as they originally believed. Second, they learned they needed to measure churn differently. Finally, they saw an opportunity for how they could personalize their product by addressing churned customers: give them new content that incentivized those customers to sign in.

This example from Arianna shows how sometimes our assumptions about how to measure something (e.g., churn) can be wrong. Additionally, the measurements we get back are dependent on the design and experience that we’re offering. Because their client’s product did not offer an incentive to sign in, users weren’t signing in, and this reflected back into their churn metrics—but not in the way that they had expected.

We encourage you to think about how to apply this to your own work. Should you double-check your big data with thick data? How can you apply other methodologies to help you reduce the risk of an error in your data? Doing so can help you increase your confidence that you’ve measured the right thing—and help you avoid mistakes along the way.

Getting Trustworthy Data

“Getting numbers is easy, getting numbers you can trust is hard.”

—RONNY KOHAVI

The worst situation (which happens even with teams that have a lot of experience with A/B testing) is that you go through all this effort to plan, design, and then build and launch an A/B test, but you aren’t able to get good takeaways due to poor data quality. This is why it’s really important to be confident that the data you get back is trustworthy, that your results are statistically significant, and that you are basing your decisions on a representative sample.

It’s always very tempting to look at your results as soon as they start to come back and to immediately want to take some kind of action based on what you are seeing. As we’ve articulated in several chapters, it’s an important exercise in discipline to wait until your test is complete (that is, has run as long as you intended and has obtained the predetermined sample size) before checking for statistical significance and taking action as a result. There are a number of examples where a team decided to act prematurely on some data, only to find that the results changed after they had a sufficiently large sample size (and therefore a powerful enough test). If you have data analysts on your team, they’re experts in knowing and communicating how clean your data is, and what the limitations of it might be.
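One way to make “run as long as you intended” concrete is to compute the required sample size before the test starts. The sketch below is a rough example using the statsmodels library; the baseline rate, the lift worth detecting, and the power target are all illustrative assumptions, not recommendations.

    # Rough power calculation: how many users per cell before it is safe to
    # check for significance. All numbers here are illustrative.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    baseline_rate = 0.10   # current conversion rate (assumed)
    target_rate = 0.11     # smallest lift worth detecting (assumed)

    effect_size = proportion_effectsize(target_rate, baseline_rate)
    users_per_cell = NormalIndPower().solve_power(
        effect_size=effect_size,
        alpha=0.05,            # significance level
        power=0.80,            # probability of detecting the effect if it is real
        alternative="two-sided",
    )
    print(f"Run each cell until it has roughly {users_per_cell:,.0f} users.")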

So, let’s assume that you feel confident in beginning to evaluate the results that you’re getting from your test. This means that you feel good about the quality of the data that you’re getting back, and that the test has run for a long enough time and with enough people in each test cell that you are responding to significant data. However, even if you’ve been hygienic about your data up until this point, we want to review a few pitfalls you’ll need to be careful to avoid so that you can make sure you have the most trustworthy data possible.

Novelty effect

In previous chapters we talked a lot about the differences between new and existing users. In practice, this difference is important not only when you’re planning your test, but also as you’re analyzing your results.

When teams see a flat or negative result, they sometimes attribute it to users needing to “get used to” the new experience. Looking at new users can be a good way to check whether something like this is happening. New users don’t have habits or assumptions that have developed based on your control condition—they’re not familiar with your product at all. If the experiment is flat or negative on your metrics of interest for new users as well, then needing to get used to the new experience is probably not the cause of the undesirable results.

On the flip side, if an experiment is a win for new users only, it suggests that you could be observing a novelty effect. The novelty effect is when people “perform better” (in this context, use your product better) because of excitement about the new product rather than a true improvement in behavior or achievement. You’ll want to make sure you run your experiment for long enough to let this novelty effect wear off—after some time when the feature is no longer novel, you would expect to see a leveling off of performance.

Seasonality bias

One of the strengths of A/B testing is that it occurs in context, allowing you to measure how your products perform in the wild. This lets you take into account the practical factors that will impact how your product gets used and adopted. However, this can become a weakness for factors that are hard to control. Seasonality bias is bias introduced due to seasonal patterns in the world—for instance, due to weather or holidays. Imagine that you were running an ecommerce website. A/B tests run during the month of December might be subjected to seasonality bias because of last-minute holiday shopping.

As you are thinking about your A/B tests, be thoughtful about whether seasonal effects might confound your results. Remember that such effects are often specific to your product. If so, consider waiting until the season has passed to test your ideas, at a time when you can be clearer about what is actually driving your results.

Rolling Out Your Experience, or Not

By now, you should know what the impact of your experiment was: was it flat and inconclusive? Did it make a measurable improvement to your existing solution? Or did it make the behavior worse? Remember that your goal in A/B testing is to learn something, and take that learning to decide what to do next for your users.

We’ll spend the next part of this section talking in a little more detail about how you will go about making the decision on what to do. Part of what you will factor into your decision is the result from your test and whether it performed as you expected. You will also be influenced by how mature your testing is, how refined the work has been up until this point, how many more iterations you expect to do, and how large the changes will feel to your existing user base.

You might decide on any of the following paths:

Roll out to 100%

Take one of your test experiences and give it to your entire user base; this experience now becomes your new default and new control for all future tests. You do this when your test is successful, but also when you feel like your testing is mature enough. For example, you don’t want to roll out a change to 100% of your audience if you feel like you’ll be making another large change just a few weeks later, or if you haven’t done the due diligence to polish your experience and satisfy your 100% rollout plan (see Chapter 5).

Roll out to a larger audience

You have a good feeling about where you’re going but you don’t yet want to commit all the way. Recall that this is one way to reduce the risk of errors. Additionally, you might still have some questions about the data that you’ve gotten back from your current experience and want to gather data from a broader audience. You might have observed interesting changes while segmenting that data that you want to validate with a new experiment. You may have learned you didn’t have the power you needed to measure a change and need a bigger sample to obtain significance.

Keep your existing experience

No changes for the majority of your user base. This is usually the situation when you find that your test didn’t have the results or impact that you were expecting, or if you plan on doing further iterations or refinements on the experience you are testing (even if the results were positive).

We’ll talk a little bit about each of these decisions in the coming pages to help you decide what makes most sense based on what you’ve learned, and where you are in your product process.

What’s Next for Your Designs?

A few times throughout this book, we’ve used the framework of global versus local problems and exploration versus evaluation. This was partly to encourage focusing your design and A/B testing on solving the right type of problem. Now that you’re deciding what to do with your results, revisiting what you set out to do will help you make decisions on how to apply your data and new learning.

Were you exploring or evaluating?

Recall that exploration and evaluation exist on a spectrum that helps you and your team establish the goal(s) of your A/B test. In more exploratory problems, A/B tests allow you to understand the problem space and explore possible solutions; to that end, you can seek directional feedback about what to keep pursuing. Comparatively, in evaluatory contexts, your goal is to see whether the design changes you made have the expected causal impact on real users. These types of problems are probably closer to being “complete” and therefore closer to rolling out.

Our point in introducing this part of the framework was to help you and your teams align on how close you are to a “finished” product. Intuitively, this is an essential component in deciding what to do with your insights now that you’ve finished your test. Figure 6-6 helps summarize what you’ll probably be thinking about after different types of tests along this spectrum.

Figure 6-6. The spectrum of exploration to evaluation.

If you were exploring the space of possible designs and hypotheses, it’s unlikely that you’ve found the best solution on your first or second try. Doing exploratory A/B tests is in service of helping you eventually narrow down a design to evaluate. The results you got from this test can help you with directional feedback on which of your hypotheses (if any) makes the most sense to pursue further. This might lead you to explore at a smaller scale, looking at different treatments of a single hypothesis to tweak smaller details. Or you could feel confident enough to begin evaluating whether a treatment of that hypothesis is a good candidate to roll out more broadly. You might also decide that you haven’t learned enough to narrow down just yet. Perhaps your test was inconclusive or you haven’t yet collected enough data to rally behind a hypothesis. In this case, you might find yourself running additional broad exploratory tests to complement the learnings from this one. In either case, you probably won’t be rolling out any designs from an exploratory test. Instead, you’ll be taking the insights back to your team to keep working toward the right solution.

Comparatively, tests for the purposes of evaluation are close to being released: you were probably looking for major blockers or issues such as unexpected negative impact to your metrics (such as in Dan McKinley’s example from before—they anticipated “flat” results, since the goal was to create a more coherent design experience rather than to improve metrics), and to confirm whether your design caused the desired user behavior. In other words, the test cells you put forth in an evaluation test were probably candidates for rollout already, and you and your team were seeking confirmation on the best cell. If you’ve found evidence in favor of one of your test cells, you might consider rolling it out to a larger part of your audience to gain increasing confidence, until you eventually feel strongly enough to roll out to 100% and make it your new default experience.

Was your problem global or local?

The other dimension we introduced was whether your problem is global or local. This ultimately comes down to how many variables your test cell changed relative to your default experience. If you are changing just one or two variables, or pieces of your experience (e.g., color, language, placement), your problem is probably on the local end of the spectrum. However, if your change is impacting more than a few variables (closer to a “redesign” than a “tweak”) you’re probably thinking about a more global problem. Several recent examples of “global” changes are the brand redesigns of companies like Uber and Instagram, which completely altered the look and feel of their products.

Remember that our goal in running an A/B test is to causally attribute a change in the experience to a change in user behavior. One of the things that can be challenging about the difference between a global and local A/B test is that it can be hard to know what part of the change caused an impact when you’re changing many variables at once (global test). Arianna McClain articulated this well, when talking about the differences between testing new user flows and testing button colors:

Say we’re testing a three-step user flow versus a one-step flow. User flows can contain dozens of features and confounders, so it’s difficult to take the results of an A/B test at face value. If the three-step flow performs significantly better than the one-step flow, designers and analysts really need to dig in to understand why they are seeing these statistical differences. Knowing the reasons why the three-step flow worked so well may let them design even more effective products in the future. However, if a designer is simply trying to determine if a red or blue button gets more people to sign up, then they’re more able to just trust the data directly.

She reminds us that for these global tests, you might want to triangulate with other sources of data to understand which parts of the change might be leading to the difference in behavior that you observed.

Once you’ve triangulated with other sources of data for your global test, it’s time to think about rollout. Unlike the distinction between exploration and evaluation A/B tests, both global and local test cells are potential candidates for rollout. However, how you go about rolling out a global and a local test that performs well in an A/B test can be quite different. We summarize this in Figure 6-7.

Figure 6-7. The spectrum of global to local.

Local changes are only small tweaks to your existing experience. The result of this from your user’s perspective is that nothing much has changed, and therefore there’s little friction to adoption. In many cases, local changes get rolled out without anybody noticing. If you decide to move forward with a local change, you can just slowly ramp up the percentage of users with that new change until you’ve hit 100%.

Comparatively, global changes to your experience will generally draw attention. If you pursue a global change, know that your users will not only notice it, but may also have strong emotions about it or face significant friction when adapting to the new look and feel or flow of your product. In addition to rolling out the change, global changes require communication and PR to your users to help them understand the change and support them through the transition. For instance, when Instagram rebranded in May 2016, they posted a video and write-up on their blog to reintroduce the new iconography and color scheme, and to explain why the change was important.5 This helped loyal Instagram users understand the change, and feel supported rather than abandoned by the Instagram team. In these cases, a team might also want to notify or staff up their Customer Support team to accommodate the potential increase in support outreach that a large change can generate.

Knowing when to stop

Finally, one additional consideration is that if you’re doing optimization tests, you’ll have to decide when to stop. Make sure you don’t get so focused on fine-tuning a test that you come to a point of diminishing returns. After all, there is only so much time that you really want to spend on tweaking button colors or layout. By keeping an eye on how much your changes are reflected in the metrics, you can determine whether you want to keep investing in optimizing your design. If you feel like you are making big changes but not a big impact on your metrics, then it’s probably not worth it for you to continue investing time in these changes.

Ramp Up

One of the choices we offered you when deciding what to do next is to “roll out to 100%.” However, it’s rare to do this straight after a small experiment. Even if you have positive results, it’s best practice in the industry to slowly increase the percentage of users who get the new experience over a period of time until you eventually reach 100%. How fast or slow you ramp up your user base onto the new experience depends somewhat on your company’s engineering resources and/or the engineering release schedule, which might be out of your control. This is one of many reasons why designers and engineers should maintain open communication and collaboration channels. However, you might also consider the speed at which you roll out the new experience to be a factor of how complicated or large the change is to the experience.

Assuming that you had a large enough user base to start A/B testing with 1% of your population, the first ramp-up stage might expose roughly 5% of users, with the aim of identifying any large negative changes that indicate something is wrong with the experiment. Keeping the percentage small at this stage limits the risk of repeatedly changing your users’ experience before you feel confident in the design and hypothesis underlying the test cell you are ramping up. The second stage gives you even more confidence—you might expand to 10% or even 25% of your population. A third and final stage, with 50% test cell/50% control, gives you the most power to detect the impact of your test cell and may be used to learn more about segments and the impact on your overall user base. See Figure 6-8 for an illustration of this. Colin McFarland talks about the importance of having an effective ramp-up strategy with clear requirements that you need to meet before moving to the next stage of testing.6 He says, “It’s crucial your ramp-up strategy doesn’t slow you down unnecessarily. An effective ramp-up strategy should have clear goals against each stage, and move forward to the next stage unless results are unsatisfactory. For example, stage 1 may look to validate there is no significant detrimental impact while ensuring engineering and platform resources are performing reliably in less than 24 hours.”

Figure 6-8. Ramp-up in phases: by first launching to a very small group of users (< 5%), then to 50%, and then finally rolling out to a majority group.
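To illustrate Colin’s point, here is one hypothetical way to write a ramp-up plan down as data, with an explicit goal per stage; the stages, percentages, and goals are examples only, and your own plan should reflect your product and your risk tolerance.

    # A hypothetical staged ramp-up plan; percentages and goals are
    # illustrative, not a universal recipe.
    RAMP_PLAN = [
        {"stage": 1, "traffic_pct": 5,
         "goal": "Catch large regressions; confirm logging and platform health."},
        {"stage": 2, "traffic_pct": 25,
         "goal": "Build confidence in the metric movement on a broader audience."},
        {"stage": 3, "traffic_pct": 50,
         "goal": "Maximum power to measure impact and examine key segments."},
    ]

    def next_stage(current_stage: int, results_ok: bool) -> int:
        """Move forward unless results are unsatisfactory; otherwise hold and investigate."""
        if not results_ok:
            return current_stage
        return min(current_stage + 1, len(RAMP_PLAN))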

As you roll out your experience to an increasingly larger audience, you’ll want to continuously keep an eye on the impact of that new experience on your metrics. Remember that if you have statistically significant results, then you have a fair amount of confidence that as you expose this feature to a broader group, those results will continue to hold. However, there might be things that you didn’t predict would happen, so monitoring the data along the way is a bit of insurance to make sure that you are indeed having the impact that you expect to have. Another way to look for unexpected results is to use holdback groups, which we will discuss next.

Holdback Groups

Note that even when we say that you are going to roll out your results to your full user base, the common and best practice is to release a new experience to the vast majority of your user base and “hold back,” keeping just a small percentage of users (usually 1% or 5%) on the old experience (the former control). This practice can be helpful to keep measuring the long-term impact of your design change. Some tests might have a prolonged effect, where you could see interesting results a few months after you launch and you want to have a way to monitor these changes over a much longer period of time (but where that period of time might be too long to run as a prolonged A/B test). Additionally, some metrics, called “lagging metrics,” can take a long time to impact. You want a way to measure causally how your design change affected these lagging metrics, which you can only do using a holdback group.

You can think of holdback groups as a safety measure to ensure that you didn’t make any mistakes in the analysis of your experiment. You can never predict if you might have missed something that could get amplified or change the results of your test as you expose your new design to a much larger population. If you roll out your new experiment directly to a full 100% of your user population, you’ll no longer have an old experience to compare to, and therefore no way to confirm that at scale, your experiment had the intended effect. Keeping a small percentage of your population on the original experience will allow you to make that comparison. Then, when you are certain that your experience is sound and that it indeed has the positive effect you were hoping for, you can move the last percentage of users to the new experience.
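Mechanically, a holdback is just another allocation decision. The sketch below shows one simple, hypothetical way to keep roughly 1% of users on the old experience by hashing a user ID so that each user is assigned consistently; the experiment name and percentage are placeholders.

    # Deterministic allocation with a ~1% holdback on the old experience.
    # The experiment name and holdback percentage are placeholders.
    import hashlib

    HOLDBACK_PCT = 1  # percent of users kept on the old experience

    def assign_experience(user_id: str) -> str:
        # Hash the user ID so the same user always lands in the same bucket.
        digest = hashlib.sha256(f"new_default_rollout:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return "old_experience" if bucket < HOLDBACK_PCT else "new_experience"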

One more reason to care about holdback groups is that they let you continue to evaluate whether your proxy metrics eventually have impact on your company’s key metrics. Remember that sometimes the metrics your company cares about can take a very long time to measure. Letting these experiments run in the background is one strategy to reassess the causal relationships between proxy and key metrics. There are some pitfalls associated with long-running experiments. We’ll note some additional resources at the end of the book. John Ciancutti described this well to us when he said:

You run that test, you improve your proxy metric on the first day. You’re like, “Great, we’re going to act as if this drives the key metric, and we’re going to move on,” but occasionally you have to reassess that.

You leave it running and you confirm that in fact there’s still a causal relationship between the key and proxy metrics. In other words, in this A/B test, we raised that proxy metric. And later, it did raise that hard-to-move key metric that we actually care about. It can take a very long time to get some of the data.

So you leave these things running in the background, and act as if it’s true, but then you validate it. Also, you try all of these other metrics and see what else correlates. Lots of things correlate, so you gotta find those causal effects. Lots of metrics move together, so you find the ones that are intuitively strongest and have the clearest relationship.

By continually looking for ways to evaluate these relationships between your company’s metrics, you can strengthen your data toolkit in the long term. A single experiment may be important, but it is the cumulative impact of multiple experiments, whether they address one metric in depth or many metrics over time, that will allow you and your data friends to develop an impactful data practice for your entire company.

We hope that this chapter so far has helped give you a high-level view of the decisions that come after an A/B test. To close out our discussion about analyzing a test, we want to share the decision tree shown in Figure 6-9. We believe that structuring your thinking when deciding to roll out, abandon a hypothesis, or keep iterating can help you be systematic about applying A/B testing insights in the best possible way.

Figure 6-9. A simple decision tree illustrates how you might approach deciding whether or not to roll out your A/B test.

Taking Communication into Account

Katie Dill from Airbnb reminds us that we need to consider the effect that A/B testing can have on our customers. For example, if Airbnb wants to test a different listing page in Germany, the team needs to recognize that there are Americans who will travel to Germany and that they will talk to each other. Airbnb takes into consideration how people move, how they might talk to each other, and how they might see each other’s things. They need to think about the reaction that a guest or host might have if they see something different than what they are experiencing themselves. Communication and awareness around their testing is really important to them.

She cautions against designers getting too relaxed about testing and thinking “it’s just a test, it’s OK if it’s not great.” Katie emphasizes the importance of having an opinion about the experience you’re testing and being proud of it. By being confident about the experiences that you are testing, you are exerting your own judgment in the process. She notes that it’s important to remember that a test is still going to your customers and you must maintain a certain quality bar no matter how big or small the recipient group is. We want to remind you here that although we’ve emphasized that you don’t always need to have a perfectly polished experience in order to test, you should try to maintain a level of quality that you’re comfortable exposing at scale.

We know that we’ve introduced a lot of content in this chapter, and it can be challenging to understand some of the concepts we touched on in the abstract. To help address this, we’ll close out this chapter with an extended case study from Netflix. We believe that this example illustrates the major ideas that have been discussed in this chapter.

Case Study: Netflix on PlayStation 3

In Chapter 5, we used the example of Netflix launching on the PS3 to show how different hypotheses were expressed and tested at the same time. We’ll return now to that example to see how they actually performed. Before we get to the actual results, we want to ask you to think first about which of the different hypotheses you would have bet on winning if you were part of the team. (As a quick reminder, the main success criterion for the test was viewing hours, or “consumption.” Which experience would result in Netflix users watching the most TV and movies?) Examples of the four hypotheses the team tested are shown in Figure 6-10.

Figure 6-10. As a reminder, these four designs represent the four main hypotheses that the Netflix team was exploring.

Whenever we share this example, most people assume that Cell 2 or Cell 4 was most successful. About a quarter select Cell 1, and usually only about 5% of people select Cell 3. We love this exercise because it demonstrates how even in a room filled with designers and product managers—all of whom should have great consumer instinct—there can be disagreement. Using data can help address these disagreements and make everyone’s consumer instinct work better the next time around.

Within Netflix, the team was pretty split about which cell would perform best. The designers tended to favor Cell 1. This was probably because in many ways it was the most robust design, and it allowed for the most scalability. They felt like it did the best job at addressing the majority of the user complaints about the original UI. It also performed well in usability testing during the prototyping phase. On the other hand, the product managers and engineers tended to favor Cell 4. They were passionate about using video as a way to evaluate content (as opposed to having to read through a long paragraph that described what you were going to see). Why read when you can watch? The engineers also did an amazing job on the technical backend of making the video load and playback extremely quickly, so there was some development pride in the performance of this cell as well. However, Cell 4 got really polarizing feedback when it was user tested. Some people loved the immersive experience, but others found it distracting.

This example illustrates how the biases of the people that are working on the experience might manifest in what they believe to be the best experience for the customers. So which group was “right”? Neither.

One interesting note here is that because the team was using a completely new platform, they weren’t able to use the original PlayStation experience as a “control.” For the sake of testing, the team had to assign one of the four new hypotheses to be the control. They chose Cell 1 because it was the experience that had been under development the longest; consequently, it was the most thoroughly user tested and therefore the best understood. In some ways, by declaring Cell 1 as the control, the team overall was betting that it would be the “winning” cell.

So we mentioned that neither the designers nor the engineers and product managers accurately predicted which experience would fare the best with real users. We illustrate the outcomes in Figure 6-11.

Figure 6-11. When compared to Cell 1, which was deemed the “control” cell, only Cell 2 led to a statistically significant increase in consumption metrics.

Compared to Cell 1 (the control group), both Cell 3 and Cell 4 had a negative impact on consumption, while Cell 2 increased it. This caused the collective teams to swallow their pride a bit and reexamine what they thought worked well on a TV. Looking back on this initiative, without A/B testing there probably would have been a huge disagreement between design, product, and development before launch about which version to release. Without data, both sides would ultimately have been wrong, and the end result would have been lost viewing hours rather than increased ones.

Based on what we discussed earlier in this chapter, you would probably expect test Cells 1, 3, and 4 to be scrapped, and all users moved to Cell 2. While that is what eventually happened, it’s worth going into a few more details about this test and how it evolved over time. Remember that this is just the first iteration of testing on the PS3 platform, essentially the first experiment ever run for the TV experience. This made it a more exploratory test than an evaluatory test, so the team used what they learned to keep iterating on the experience to make it as good as possible.

Many Treatments of the Four Hypotheses

For the sake of simplicity, we shared only one test cell for each of the four hypotheses when we walked through the preceding example. However, that’s not quite the full picture. In this test, the team didn’t just test four different cells; they had close to 20 cells overall. Each test cell had a “sub-hypothesis” on exactly how to treat the overarching hypothesis at hand. So, for example, with Cell 3, there was a version that had box art at the top level. It had only four pieces of box art (so thematically it stayed more “true” to the original concept). In fact, the test cell that had the box art at the top level performed better than the version that didn’t have any box art at all—that is to say, it had the most positive impact on the metric of interest for the team (consumption). The positive performance of Cell 2 and the improved performance of this small variation helped solidify the team’s conviction that having some visual representation of the content at the top level was extremely important. This example illustrates the importance of evaluating the results of your test both holistically and at the level of individual cells. Doing so can help you spot high-level trends across your data that could serve as potential design principles for your product or service moving forward. As a principle, the importance of visual content helped explain why Cell 3 didn’t perform well.

When analyzing and interpreting your results, remember that A/B testing shows you behaviors but not why they occurred. To do great design and really understand your users, you’ll also need to understand the psychology behind what influenced your users to behave in the way that they did. For this test, the team tried to tease apart some of the reasons why they got these results using surveys and qualitative user research. This helped the team understand why they saw the results they did, and gave them a clear directive on what pieces of each design they should move forward into future testing. For instance, this helped show the team that focusing on simplicity was more important than giving users full control.

Evolving the Design Through Iterative Tests

It’s very unlikely that you’ll land on the optimal experience from a single test. Therefore, you’ll want to continue to iterate and see if you can improve your results even more. In this stage, a team might borrow features or variables from some of the other hypotheses to “beef up” the winners from a previous test, or continue to pursue an idea that they believed had potential (or that performed well in other forms of research). Figure 6-12 and Figure 6-13 give some examples of how the winning Cell 2 was iterated on.

In the design featured in Figure 6-12, the team borrowed a feature that performed well on the website version of Netflix and applied it to the TV UI. When you mouse over a piece of content on the website, a pop-over with information about the movie or TV show appears. This seemed like a good mechanism to use that might simplify this experience even further, and you could argue that the team was pushing even further on the theme or concept of simplicity. Internally at Netflix, this mechanism was called the “BOB” or “back of box” because it represented the kind of information you’d find on the back of a DVD box if you were looking for a movie in the video store.

Figure 6-12. Removing the informational side area, and using the “BOB.”

In the design shown in Figure 6-13, the “BOB” was reduced even further. Less information was included to see if the team could determine the bare-minimum information that users needed to make a decision about what to watch. If the team found that this cell performed well, they could consider applying what they learned to other platforms as well (website, mobile, etc.). This demonstrates the point we made earlier about how insights might generalize and apply beyond the specific context of testing, helping you and your team build a wealth of knowledge from your different tests.

Figure 6-13. A design with a very minimal “BOB.”

What If You Still Believe?

One downside of doing global A/B tests, where multiple variables change at a time, is that there is a higher chance of masking an interesting positive result from several of your changes with the negative effect of one poor choice. This occurs when one of the many variables that you changed in that cell was actually limiting its success. If you had changed all the variables except for that one, you might have had a positive result.

If you believe that a failed or flat hypothesis has merit (especially when you have other evidence to support this belief), you should continue to do large-scale concept testing until you understand exactly what you’ve been missing. This is exactly what happened with Cell 4, the video cell. There was a strong belief within the company that finding what you want to watch through the act of watching is actually a very powerful proposition. So, why didn’t we see that reflected in the results from the A/B test? There were several ideas as to what might have happened:

Performance

Even though the engineering team had done a great job of making the video start up as fast as possible, it may still not have felt quick enough to users. Or they may not have had an especially fast internet connection at home, so it just felt too slow and then they lost patience.

Limited choices

They learned from Cell 2 winning that having a lot of box shots on the screen seemed to make a difference in getting people to play more. Perhaps the original video UI felt too limited in terms of the number of options.

Loss of content

Perhaps the full-bleed video was not the right execution because it overtook the information that is normally conveyed in the “BOB.” That information might still be valuable to users in a video preview context.

Outlining a few ideas like this is a great way to help refine your concept while keeping your ideas broad. For instance, were there ways to incorporate video into the winning Cell 2? By getting creative, the team found new ways to incorporate video into the previewing and movie selection process. Even after they released Cell 2 to the larger population, the team continued to explore video because they wanted to keep open the possibility of video in the interface. However, after two or three more rounds of testing different conceptual ideas around video, the team saw no uplift in metrics and decided to abandon the idea for the time being.

Once you have the data back on your first iteration, it’s tempting to roll out the winning cell to the rest of the population right away. However, this is rarely what you want to do with a global test such as this example from Netflix on PS3. Why? If your concepts are wildly different, one round of testing probably isn’t enough to settle on the best concept. Another reason you might not want to roll out after a single test is that there may be other big hypotheses that you want to test but didn’t build out yet. Running those concepts first can give you more insights to work with as you begin to refine your own ideas.

When A/B testing smaller variables, it’s easy to change those things without causing too much negative user feedback. However, in cases like this when you are about to fundamentally change the way that your service or product works, remember that your customers will have to live through (and often, struggle through) every change you make. Keeping your control running until you have a stable idea of what experience you want to settle on will limit the number of changed experiences you put your users through. This isn’t to say that once you launch a design you can never change it, but keeping this in the back of your mind will ensure that you only roll out ideas that you feel good about to 100% of your user base.

We hope that this case study helped the concepts from this chapter come to life. As you can see, data can feed back into your design and product process in many ways—by challenging your preconceptions about your user, inspiring new designs, and helping you know when you’re on the right track. A/B testing data can’t make decisions for you, and is one of many factors that weigh into the choice of what to do next. But it can uncover valuable insights so that you can ask more thoughtful and well-informed questions to better understand your users in the future, and make the most informed choices possible to do right by your users when you launch design changes at scale.

One final quick example that demonstrates the value of iterating on your design further despite negative results comes from Katie Dill at Airbnb. Katie talks about ways that the team at Airbnb has reacted to negative or confounding results:

Sometimes the data doesn’t tell the whole story. For example, we made changes to our navigation on our website and the data showed that key metrics went down. The conclusion at first was that the core idea for the information hierarchy was wrong. But when the multidisciplinary team assembled to discuss the findings and next steps, one of the designers pushed to keep the original idea but make a few changes. Through review of the design work that was done, the team found that some of the design decisions were likely at fault for the poor performance. They made a few changes including adding a caret to indicate a drop-down menu, and saw the key metrics go right back up. It was a simple change, but a good example that when something isn’t working before throwing the whole thing out, it’s best to identify potential spot fixes. It’s just a matter of fighting for what you believe in and making sure that we’re giving it its best chance to be successful.

What’s important here is that they didn’t just take the results at face value as a direct indication of what to do next; rather, they used their experience and intuition to understand that the design didn’t perform to expectations because of bad execution on the part of design.

This chapter closes out our practical how-to guide to A/B testing as a designer. We hope that you feel empowered to get your hands dirty by being involved in A/B testing from the start. In the next chapter, we’ll give you some tips on how to build a culture of data at your company. We hope that this book so far has gotten you excited about using data in your design process. But only by building a culture to support that data can it truly make maximal impact. This topic will be the focus of Chapter 7.

Summary

We started this chapter by reviewing the things to take into consideration when launching your test and exposing your "experiment" to your users at scale for the first time. We echoed a constant theme of this book: good planning ahead of time, and careful thought about how your current experiment fits into your holistic framework of testing to learn (and growing your knowledge base), are essential.

Core to this is understanding that to gather data that is meaningful to your business, and that allows you to make good decisions, you need data that is representative of your user base. How you allocate users to the test cells in your experiment, and what biases you might encounter in that process, therefore matters. "Good data," gathered programmatically and able to generalize from a small population to a broader one, helps ensure that what you learn from a small test population can be applied to your customer base as a whole.
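
To make the allocation point concrete, here is a minimal sketch of ours (not taken from any case study in this book) of deterministic, hash-based assignment of users to test cells, written in Python. The function name assign_cell, the experiment name, and the cell labels are all hypothetical; a real experimentation platform would also handle holdout groups, unequal allocations, and exposure logging.

    import hashlib

    def assign_cell(user_id, experiment, cells):
        """Deterministically assign a user to one of the test cells.

        Hashing the user ID together with the experiment name keeps a user's
        assignment stable across sessions and independent across experiments,
        which helps avoid the allocation biases discussed above.
        """
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % len(cells)
        return cells[bucket]

    # Example: split users evenly across a control and two variants.
    print(assign_cell("user-12345", "homepage-redesign", ["control", "cell_b", "cell_c"]))

Because the assignment depends only on the user ID and the experiment name, a sufficiently large sample of users will be split evenly and effectively at random across the cells, which is what makes the results generalizable to your broader population.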

Once your test is launched and has been running long enough to achieve the necessary sample size and statistical power, you can evaluate your hypothesis; that is, see whether the changes you made are having the effect on your users' behavior that you were hoping for. You'll gather the results, analyze them, and then evaluate the impact of the experiment. As you work more with experimentation and data analysis, you will find that design explorations may not have clear "winners" and that a number of the solutions you propose will not have the intended effect on your users' behavior. Planning and articulating your work upfront helps ensure that you will always learn something, no matter what the outcome. The results you get from your test will also tell you whether you did a good job of designing your test cells to maximize learning, the key concept we articulated in Chapter 5.
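
To make the sample-size point a little more tangible, the sketch below (an illustration of ours, not a prescription from any case study) uses the statsmodels Python library to estimate how many users each cell would need in order to detect a lift in a conversion-style metric from 10% to 12%, at the conventional 5% significance level with 80% power. The baseline and target rates are made-up numbers; substitute your own metric and the smallest change you care about detecting.

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.10  # hypothetical conversion rate of the control experience
    target = 0.12    # hypothetical rate we hope the new design achieves

    # Convert the two proportions into a standardized effect size (Cohen's h),
    # then solve for the number of users needed in each test cell.
    effect = abs(proportion_effectsize(baseline, target))
    n_per_cell = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
    )
    print(f"Roughly {n_per_cell:.0f} users per cell")  # about 1,900 with these rates

Letting the test run until every cell has at least this many users, rather than stopping as soon as the numbers look encouraging, is what gives you the statistical power to trust the comparison.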

Finally, deciding the next steps for your experiment and the experiences you designed isn't always as straightforward as asking "was our experiment successful or not?" If your results support your hypothesis, consider whether you should continue to evolve and test your thinking before slowly rolling the design(s) out to a larger audience. If your results were negative or inconclusive (flat), consider whether there was more you could have done to make the experiment succeed, or whether there was a flaw in your hypothesis. Many of the decisions you make at this stage cannot be reduced to a simple chart or diagram; they require you to draw on your experience as a designer, your understanding of what has worked well for your customers in the past, and a sense of where you and your team are in the larger cycle of experimenting and learning. Hopefully, seeing how complex some of these decisions can be gives you an appreciation for why leveraging experimentation and data in the design process is not just a simple matter of "doing what the data says." Instead, it's a blend of art and science.

We believe that by being more involved in the process of experimentation, and by participating within each of the phases in the process, designers can help to shape the future of their products for the better.

Because the practice of data and design is still at such an early stage, it is of paramount importance that you, as a user-centered designer, be part of the conversation.

Questions to Ask Yourself

As you go through this last stage in the process of A/B testing and experimentation, there are a number of questions you can ask yourself to reflect on how experimentation and data analysis are helping you, and on how well you are maximizing what you learn from your data. Each experiment you run is an opportunity to hone your instincts about what will or will not work with your customers.

Good ideation processes stand up to scrutiny; collecting data on your designs and vetting them with users is a great way to measure the value and impact of your work.

Here are some questions you can ask yourself:

  • Did you correctly identify the behavior you wanted to change? Was that behavior appropriately reflected in your metrics?

  • Was your team (and your stakeholders) in agreement with what you were measuring and the success metrics that you selected?

  • Did you craft a clear hypothesis that accurately reflected what you wanted to learn and what you predicted would happen?

  • Were the experiences you designed differentiated enough to provide you with good learning? Were they impactful enough that you could measure and observe a change in behavior from your users?

  • Were you too focused in your tests? Were you too broad?

  • What was the impact of your designs on your user behavior? What did you do as a result, and why did you make that decision?

  • On reviewing your data, is there something different you would do next time in designing your experiment?

Remember, there is no negative consequence to being "wrong." In the end, asking these questions will help you learn from "failed" tests as well as from the "successes." Applying a "learning" mindset to the overall process of designing with data is about engaging your customers in an ongoing conversation about your experience. Beyond the design itself, you'll also be developing the deeper skills you need to continually respond to changes in your customers' behavior as it evolves over time.
