What you’ll get from this post: A deep exploration of significance testing, with
useful recommendations for avoiding statistical error.
Estimated reading time: 3 minutes; approximately 630 words.
When we run an A/B test, we are
evaluating the performance of a page and its variations against one another. In
statistical terms, we classify such an experiment as a hypothesis test—at the
end of which, we can hopefully determine whether there’s a relationship between
a change to the page and an increase in performance. To estimate the strength
of such a relationship, we use significance testing.
Significance testing measures the effect of the independent variable (the test variation) on the dependent variable (whether someone purchased a product, for example) and determines the likelihood that the relationship observed in the trial could be attributable to random chance. The more tests you run, the more likely you are to surface relationships that look significant but are actually false.
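To make this concrete, here is a minimal sketch of such a test for a single control and a single challenger, using a chi-square test on conversion counts. The visitor and conversion numbers below are made up purely for illustration.

```python
# A minimal sketch of a significance test for one control page and one
# challenger, using a chi-square test on a 2x2 table of conversion counts.
# The visitor and conversion numbers here are hypothetical.
from scipy.stats import chi2_contingency

control_conversions, control_visitors = 480, 10_000
challenger_conversions, challenger_visitors = 540, 10_000

table = [
    [control_conversions, control_visitors - control_conversions],
    [challenger_conversions, challenger_visitors - challenger_conversions],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests the lift is unlikely to be pure chance
```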
Read more: Does it matter how we measure the significance of test
results?
Imagine flipping 10,000 coins 10
times each to determine if each is evenly weighted. The more coins you flip,
the more likely you are to see “interesting” results—like certain coins landing
all heads or all tails. Does this mean that these interesting coins are truly
biased? Are the observations simply an effect caused by running multiple
trials? Or are the results simply a manifestation of random variation,
something that could be expected when looking at a multitude of data and
retroactively developing a hypothesis?
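A quick simulation makes the point. The sketch below (the counts and random seed are arbitrary) flips 10,000 fair coins 10 times each and counts how many look "interesting" purely by chance.

```python
# Simulate the thought experiment: flip 10,000 fair coins 10 times each and
# count how many land all heads or all tails purely by chance.
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
flips = rng.integers(0, 2, size=(10_000, 10))  # 0 = tails, 1 = heads
heads_per_coin = flips.sum(axis=1)

extreme_coins = np.sum((heads_per_coin == 0) | (heads_per_coin == 10))
print(f"Coins landing all heads or all tails: {extreme_coins}")
# Expected count is roughly 10,000 * 2 * 0.5**10, or about 20, even though every coin is fair.
```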
Any challenger that you identify as having a significant lift may in fact be a false positive. As the coin example demonstrates, the false positive rate, set by alpha (α, or 1 minus the significance level), depends on how many variations (k, the number of challengers) are included in any given test. If you are testing at 95% significance with only one challenger, you have the expected 5% chance of a false positive. If you are testing at 95% significance with more than one challenger, the simple formula 1 - (1 - α)^k gives your actual false positive rate. In the case of testing at 95% significance with 3 challengers, your actual false positive rate increases to approximately 14%. As a result, you are more likely to call the lift from any one challenger significant when in fact it is not. When this occurs, you have committed what is called a Type I error.
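The arithmetic behind that 14% figure is easy to verify; the short sketch below simply evaluates 1 - (1 - α)^k for a few values of k.

```python
# Family-wise false positive rate for k challengers tested at 95% significance
# (alpha = 0.05), using the formula 1 - (1 - alpha)^k from above.
alpha = 0.05

for k in (1, 2, 3, 5, 10):
    family_wise_rate = 1 - (1 - alpha) ** k
    print(f"{k} challenger(s): {family_wise_rate:.1%} chance of at least one false positive")
# With 3 challengers this comes out to about 14.3%, matching the figure above.
```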
To limit Type I errors in an
environment with multiple challengers, you can calculate “adjusted” p-values
using a variety of methods. Here at Brooks Bell, we are particularly fond of a
method called target shuffling, which the analytically minded folks at Elder Research developed in the '90s. Target shuffling reveals how likely it is that statistical significance could have occurred by chance, given the number of challengers, the sample size, and the observed response proportion. It is used to verify whether observed relationships are truly causal or merely statistical anomalies.
Read more: Understanding data discrepancies across testing and
analytics tools
Target shuffling randomly “shuffles”
the results of your test to break any association between the test variation
and the actual response. After randomizing the response data across variations,
you perform a significance test for each challenger and record the p-value for
the most significant result. This randomized simulation is repeated numerous times, and you evaluate at what percentile the p-value from the actual test falls among the cumulative results of all simulations. The resulting
percentile becomes your “adjusted” p-value and represents the proportion of
results from the random simulation that were more “interesting” than your
original significance test. In effect, target shuffling measures how truly
significant your original test was in the context of your specific testing
environment.
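To show how this could look in practice, here is a rough sketch of the shuffling loop described above, for one control and three challengers with a binary (converted / did not convert) response. The data, conversion rates, and helper function are hypothetical; this illustrates the general idea rather than Elder Research's exact implementation.

```python
# A rough sketch of target shuffling for one control (variation 0) and three
# challengers with a binary response. All data below is simulated.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def best_p_value(variations, responses):
    """Test each challenger against the control and return the smallest p-value."""
    control = responses[variations == 0]
    p_values = []
    for v in np.unique(variations):
        if v == 0:
            continue
        challenger = responses[variations == v]
        table = [
            [control.sum(), control.size - control.sum()],
            [challenger.sum(), challenger.size - challenger.sum()],
        ]
        p_values.append(chi2_contingency(table)[1])
    return min(p_values)

# Hypothetical test data: 5,000 visitors per variation, with made-up conversion rates.
rates = [0.048, 0.050, 0.055, 0.049]
variations = np.repeat(np.arange(4), 5_000)
responses = np.concatenate([rng.binomial(1, r, 5_000) for r in rates])

observed_p = best_p_value(variations, responses)

# Shuffle the responses to break any link between variation and outcome, then
# record the best (smallest) p-value from each shuffled dataset.
n_shuffles = 1_000
shuffled_ps = np.array(
    [best_p_value(variations, rng.permutation(responses)) for _ in range(n_shuffles)]
)

# Adjusted p-value: the share of shuffled results at least as "interesting"
# (i.e., with a p-value as small or smaller) as the real test.
adjusted_p = np.mean(shuffled_ps <= observed_p)
print(f"Observed best p-value: {observed_p:.4f}; adjusted p-value: {adjusted_p:.3f}")
```

With only 1,000 shuffles the adjusted p-value is a coarse estimate; in practice you would typically run many more simulations to stabilize it.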
Using target shuffling to calculate
an “adjusted” p-value produces an improved measure of the actual relationship
between each of the multiple challengers and the response. By considering
increased random variation stemming from multiple challengers, the “adjusted”
p-value better highlights the true relationships and reduces Type I errors.
This all culminates in an intuitive validation process that improves the reliability and repeatability of recommendations made from significance testing.
Originally posted at BrooksBell.com.