Chapter 22: An Illustrated Guide to Web Experiments

Editor's Note: Experimentation is an essential conversion rate optimization (CRO) activity. However, the statistics that form the foundation of running web experiments can intimidate inbound marketers who are new to CRO. The author created this illustrated guide to conducting web experiments with this in mind. It was originally published March 26, 2012, on The Moz Blog.

Web experimentation is a great methodology for improving most things on the web. It can be used to fine-tune existing products, for example by increasing user engagement and conversion rates. But it can also be used to guide entire business decisions, as suggested by Eric Ries (http://theleanstartup.com). The primary strength of controlled web experiments is the ability to isolate variables, and thus examine the causal relationship between a change (such as a new tagline) and a metric (such as conversion rate); see www.moz.com/blog/correlation-vs-causation-mathographic.

Much of the literature on experimental design has its roots in statistics and can be quite intimidating. To make it more accessible, I created this illustrated guide to web experiments.

Author's Note: Special thanks to Andreas Høgenhaven, who kindly made the illustrations for this guide.

Before getting started on the experiment, you need to get the basics right: Test metrics that align with your long-term business goals (see www.kaushik.net/avinash/rules-choosing-web-analytics-key-performance-indicators). Test big changes, not small ones (see http://blog.hubspot.com/blog/tabid/6307/bid/20569/Why-Marketers-A-B-Testing-Shouldn-t-Be-Limited-to-Small-Changes.aspx). And remember that the test winner is only the best known variation, not the optimal one; winning a test does not mean you have found the all-time best-performing variation. You can (almost) always do better in another test.

A/B or MVT

One of the first things to consider is the experimental design. An A/B test design is usually preferable when one or two factors are tested, while a multivariate test (MVT) design is used when several independent factors are tested. However, it is worth noting that more than two factors can also be tested with A/B/…/n tests or with sequential A/B tests. The downside of using A/B tests for several factors is that they do not capture interaction effects.


MVT Face-off: Full Factorial versus Fractional Factorial

So you want to go multivariate, huh? Wait a second. There are different kinds of multivariate tests. If you have ever visited WhichMVT.com, you probably came across terms such as full factorial, fractional factorial, and modified Taguchi. Before getting into these wicked words, let's get our multivariate test down to earth with an example. In this example we have three different factors, and each factor has two conditions.


This example has three factors, each with two conditions, giving a total of 2³ = 8 groups. In the full factorial design, all possible combinations are tested. This means eight variations are created, and users are split among them. To get 100 users for each combination, a total of 800 users are needed. In the following table, +1 indicates condition 1, while -1 indicates condition 2.

Group    Factor A    Factor B    Factor C
1           +1          +1          +1
2           +1          +1          -1
3           +1          -1          +1
4           +1          -1          -1
5           -1          +1          +1
6           -1          +1          -1
7           -1          -1          +1
8           -1          -1          -1

This design is not too bad when you have three factors with two conditions each. But if you want to test four factors, each with four conditions, you will have 4⁴ = 256 groups. This means we would need 25,600 users to get 100 users into each group! Or if you want to test 10 different factors with two conditions each, you will end up with 2¹⁰ = 1,024 groups, requiring a lot of subjects to detect any significant effect of the factors. This is not a problem if you are Google or Twitter, but it is a problem if you are selling sausages in the wider Seattle area.
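The arithmetic behind these numbers is simply the number of conditions raised to the power of the number of factors, multiplied by the users you want in each group. Here is a quick back-of-the-envelope check (my own sketch, not part of the original article; the 100-users-per-group figure is just the example used above):

```python
# Full factorial arithmetic: every combination of conditions gets its own group.
def users_needed(conditions, factors, per_group=100):
    groups = conditions ** factors
    return groups, groups * per_group

print(users_needed(2, 3))    # (8, 800)
print(users_needed(4, 4))    # (256, 25600)
print(users_needed(2, 10))   # (1024, 102400)
```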

Author's Note: You can estimate the test duration with Visual Website Optimizer's test duration calculator (http://visualwebsiteoptimizer.com/split-testing-blog/ab-test-duration-calculator). The output of this calculator does, however, come with great uncertainty, because the change in conversion rate is unknown in advance. That is kind of the point of the test.
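If you would rather estimate the required sample size yourself, a standard power calculation for two proportions gives a rough answer. The sketch below is my own, uses the statsmodels library, and assumes a 3 percent baseline conversion rate, a hoped-for lift to 4 percent, 95 percent confidence, and 80 percent power; every one of those numbers is a guess you have to supply, which is exactly the uncertainty the calculator's output carries.

```python
# Rough sample-size sketch for an A/B test on conversion rate.
# The baseline (3%) and expected (4%) rates are assumptions; plug in your own.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline, expected = 0.03, 0.04
effect = proportion_effectsize(baseline, expected)  # Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(f"Roughly {n_per_group:.0f} visitors per variation")
# Divide by your daily traffic per variation to get a duration estimate.
```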

Enter the fractional factorial design. The fractional factorial design was popularized by Genichi Taguchi and is sometimes called the Taguchi design. In a fractional factorial design, only a fraction of the total number of combinations is included in the experiment (hence the name). Instead of testing all possible combinations, the fractional factorial design tests only enough of them to estimate the conversion rates of all possible combinations.

In the previous example, comprising three factors each with two conditions, it is sufficient to run four different combinations and use the interactions between the included factors to estimate the combinations that are not included in the experiment. The four groups included are ABC; A + (BC); B + (CA); C + (BA).

Instead of testing Factor A three times, this factor is only tested once while holding B and C constant. Similarly, Factor B is tested once while holding A and C constant, and Factor C tested once while holding A and B constant. I'll not dive too deeply into the statistics here, as the experimental software does the math for us anyway.

The fractional factorial test assumes that the factors are independent of one another. If there are interactions between factors (for example, image and headline), that would affect the validity of the test. One caveat of the fractional factorial design is that one factor (e.g., A) might be confounded with two-factor interactions (e.g., BC). This means that there is a risk that you end up not knowing if the variance is caused by A or by the interaction BC. Thus, if you have enough time and visitors, full factorial design is often preferable to fractional factorial design.
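To make the half-fraction and the confounding caveat concrete, here is a sketch of my own (not the article's exact grouping): a standard 2^(3-1) design built by taking a full factorial in factors A and B and setting C equal to the product of A and B. The aliasing described above falls straight out of that construction.

```python
# Sketch: a 2^(3-1) fractional factorial built from the generator C = A*B.
from itertools import product

runs = []
for a, b in product([+1, -1], repeat=2):   # full factorial in A and B
    c = a * b                              # generator: C = A*B
    runs.append((a, b, c))

for run in runs:
    print(dict(zip("ABC", run)))
print(f"{len(runs)} runs instead of 8")

# Because C = A*B, the main effect of C is aliased with the A*B interaction
# (and likewise A with B*C, and B with A*C), which is exactly the confounding
# risk mentioned above.
```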

Testing the Test Environment with the A/A Test

Most inbound marketers are quite familiar with A/B tests. Less well known is the A/A test. The A/A test is useful for testing the experimental environment itself, and it is worth running before starting A/B or MVT tests. The A/A test shows whether users are split correctly, and whether there are any potentially misleading biases in the test environment.


In the A/A design, users are split up just as they are in an A/B or MVT test, but all groups see the same variation. You want the test result to be non-significant, that is, to find no difference between the groups. If the test is significant, something is wrong with the test environment, and subsequent tests are likely to be flawed. But as discussed in the following section, an A/A test will sometimes come out significant anyway, due to random error/noise.
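To see how often pure noise looks "significant", you can simulate A/A tests: split identical traffic in two and run a two-proportion z-test many times. The sketch below is my own (using statsmodels; the 3 percent conversion rate and the visitor counts are made-up numbers) and will flag a difference in roughly 5 percent of runs at a 95 percent threshold, even though no real difference exists.

```python
# Simulate A/A tests: both groups see the same page, yet ~5% of tests
# come out "significant" at the 95 percent confidence level.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
true_rate, n_per_group, n_tests = 0.03, 5000, 2000  # assumed numbers

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(n_per_group, true_rate)  # group A conversions
    conv_b = rng.binomial(n_per_group, true_rate)  # group B, same page
    _, p_value = proportions_ztest([conv_a, conv_b], [n_per_group, n_per_group])
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' A/A results: {false_positives / n_tests:.1%}")
```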

The A/A test is also a good way to show co-workers, bosses, and clients how data fluctuate, and that they should not get too excited when seeing an increase in conversion rate with 80 percent confidence. Let's call it a sanity check—especially in the early phases of experiments.

Statistical Significance

In the ideal experiment, all variables are held constant except the independent variable (the thing you want to investigate, such as the tagline, call to action, or images). But in the real world, many variables are not constant. For example, when conducting an A/B test, the users are split between two groups. Because people are different, the two groups will never consist of identical individuals. This is not a problem as long as the other variables are randomized, but it does introduce noise into the data. This is why we use statistical tests.


We conclude that a result is statistically significant when there is only a low probability that the difference between groups is caused by random error. In other words, the purpose of statistical tests is to examine the likelihood that the two samples of scores were drawn from populations with the same mean, meaning there is no “true” difference between the groups, and all variation is caused by noise.

In most experiments and experimental software, 95 percent confidence is used as the threshold of significance, although this number is somewhat arbitrary. If the difference between two group means is significant at 98 percent confidence, we accept it as significant even though there is a 2 percent probability that the difference is caused by chance. Thus, statistical tests show us how confident we can be that differences in results are not caused by chance/random error. In Google Website Optimizer, this probability is called chance to beat original.
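As a concrete illustration of the kind of calculation the testing software does for you, the sketch below (my own, with made-up conversion counts) runs a two-proportion z-test and reports one minus the p-value as a confidence figure. Treat this only as an approximation of what commercial tools show; Google Website Optimizer's chance to beat original is computed differently.

```python
# Sketch: a two-proportion z-test on made-up A/B results.
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]   # control, variation (assumed numbers)
visitors = [4000, 4000]

z_stat, p_value = proportions_ztest(conversions, visitors)

print(f"p-value: {p_value:.3f}")
print(f"Confidence that the difference is not just noise: {1 - p_value:.1%}")
```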

Pro Tip: Ramp Up Traffic to Experimental Conditions Gradually

One last tip I really like is to ramp up the percentage of traffic sent to the experimental condition(s) slowly. If you start out sending 50 percent of visitors to the control condition and 50 percent to the experimental condition, you might have a problem if something in the experimental condition is broken. A better approach is to start by sending only 5 percent of users to the experimental condition(s). If everything is fine, go to 10 percent, then 25 percent, and finally 50 percent. This will help you discover critical errors before too many users do.
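One simple way to implement such a ramp is to keep the experimental traffic share as a configuration value and increase it step by step. A minimal sketch, assuming a hypothetical assign_variation helper and ramp schedule rather than any particular testing tool:

```python
# Sketch: assign each visitor to control or experiment based on a ramp share.
import random

RAMP_SCHEDULE = [0.05, 0.10, 0.25, 0.50]  # fraction of traffic to the experiment

def assign_variation(experiment_share: float) -> str:
    """Send roughly `experiment_share` of visitors to the experimental condition."""
    return "experiment" if random.random() < experiment_share else "control"

# Example: currently at the first step of the ramp (5 percent).
current_share = RAMP_SCHEDULE[0]
print(assign_variation(current_share))
```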
