What you’ll get from this post: A deep exploration of significance testing, with
useful recommendations for avoiding statistical error.
Estimated reading time: 3 minutes; approximately 630 words.
When we run an A/B test, we are
evaluating the performance of a page and its variations against one another. In
statistical terms, we classify such an experiment as a hypothesis test—at the
end of which, we can hopefully determine whether there’s a relationship between
a change to the page and an increase in performance. To estimate the strength
of such a relationship, we use significance testing.
Significance testing measures the effect of the independent variable (the test variation) on the dependent variable (whether someone purchased a product, for example) and determines the likelihood that the relationship observed in the trial could be attributable to random chance. The more tests you run, the more likely you are to surface relationships that look significant but are actually false.
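To make this concrete, here is a minimal sketch of such a test for a single control and a single challenger, using a chi-square test on conversion counts. The visitor and conversion numbers below are made up purely for illustration.

```python
# A minimal sketch of a significance test for one control page and one
# challenger, using a chi-square test on a 2x2 table of conversion counts.
# The visitor and conversion numbers here are hypothetical.
from scipy.stats import chi2_contingency

control_conversions, control_visitors = 480, 10_000
challenger_conversions, challenger_visitors = 540, 10_000

table = [
    [control_conversions, control_visitors - control_conversions],
    [challenger_conversions, challenger_visitors - challenger_conversions],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests the lift is unlikely to be pure chance
```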
Read more: Does it matter how we measure the significance of test
results?
Imagine flipping 10,000 coins 10
times each to determine if each is evenly weighted. The more coins you flip,
the more likely you are to see “interesting” results—like certain coins landing
all heads or all tails. Does this mean that these interesting coins are truly
biased? Are the observations simply an effect caused by running multiple
trials? Or are the results simply a manifestation of random variation,
something that could be expected when looking at a multitude of data and
retroactively developing a hypothesis?
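A quick simulation makes the point. The sketch below (the counts and random seed are arbitrary) flips 10,000 fair coins 10 times each and counts how many look "interesting" purely by chance.

```python
# Simulate the thought experiment: flip 10,000 fair coins 10 times each and
# count how many land all heads or all tails purely by chance.
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
flips = rng.integers(0, 2, size=(10_000, 10))  # 0 = tails, 1 = heads
heads_per_coin = flips.sum(axis=1)

extreme_coins = np.sum((heads_per_coin == 0) | (heads_per_coin == 10))
print(f"Coins landing all heads or all tails: {extreme_coins}")
# Expected count is roughly 10,000 * 2 * 0.5**10, or about 20, even though every coin is fair.
```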
Any challenger that you identify as having a significant lift may in fact be a false positive. As the coin example demonstrates, the false positive rate, set by alpha (α, or 1 minus the significance level), depends on how many variations (k, the number of challengers) are included in any given test. If you are testing at 95% significance with only one challenger, you have the expected 5% chance of a false positive. If you are testing at 95% significance with more than one challenger, the simple formula 1 - (1 - α)^k gives your actual false positive rate. In the case of testing at 95% significance with 3 challengers, your actual false positive rate increases to approximately 14%. As a result, you are more likely to call the lift from any one challenger significant when in fact it is not. When this occurs, you have committed what is called a Type I error.
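The arithmetic behind that 14% figure is easy to verify; the short sketch below simply evaluates 1 - (1 - α)^k for a few values of k.

```python
# Family-wise false positive rate for k challengers tested at 95% significance
# (alpha = 0.05), using the formula 1 - (1 - alpha)^k from above.
alpha = 0.05

for k in (1, 2, 3, 5, 10):
    family_wise_rate = 1 - (1 - alpha) ** k
    print(f"{k} challenger(s): {family_wise_rate:.1%} chance of at least one false positive")
# With 3 challengers this comes out to about 14.3%, matching the figure above.
```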
To limit Type I errors in an
environment with multiple challengers, you can calculate “adjusted” p-values
using a variety of methods. Here at Brooks Bell, we are particularly fond of a
method called target shuffling, which the analytically minded folks at Elder Research developed in the '90s. Target shuffling reveals how likely it is that statistical significance could have occurred by chance, given the number of challengers, the sample size, and the observed response proportion. It is used to verify whether observed relationships are truly causal or merely statistical anomalies.
Read more: Understanding data discrepancies across testing and
analytics tools
Target shuffling randomly “shuffles”
the results of your test to break any association between the test variation
and the actual response. After randomizing the response data across variations,
you perform a significance test for each challenger and record the p-value for
the most significant result. This randomized simulation is repeated numerous times, and you evaluate at what percentile the p-value from the actual test falls among the cumulative results of all simulations. The resulting
percentile becomes your “adjusted” p-value and represents the proportion of
results from the random simulation that were more “interesting” than your
original significance test. In effect, target shuffling measures how truly
significant your original test was in the context of your specific testing
environment.
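To show how this could look in practice, here is a rough sketch of the shuffling loop described above, for one control and three challengers with a binary (converted / did not convert) response. The data, conversion rates, and helper function are hypothetical; this illustrates the general idea rather than Elder Research's exact implementation.

```python
# A rough sketch of target shuffling for one control (variation 0) and three
# challengers with a binary response. All data below is simulated.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def best_p_value(variations, responses):
    """Test each challenger against the control and return the smallest p-value."""
    control = responses[variations == 0]
    p_values = []
    for v in np.unique(variations):
        if v == 0:
            continue
        challenger = responses[variations == v]
        table = [
            [control.sum(), control.size - control.sum()],
            [challenger.sum(), challenger.size - challenger.sum()],
        ]
        p_values.append(chi2_contingency(table)[1])
    return min(p_values)

# Hypothetical test data: 5,000 visitors per variation, with made-up conversion rates.
rates = [0.048, 0.050, 0.055, 0.049]
variations = np.repeat(np.arange(4), 5_000)
responses = np.concatenate([rng.binomial(1, r, 5_000) for r in rates])

observed_p = best_p_value(variations, responses)

# Shuffle the responses to break any link between variation and outcome, then
# record the best (smallest) p-value from each shuffled dataset.
n_shuffles = 1_000
shuffled_ps = np.array(
    [best_p_value(variations, rng.permutation(responses)) for _ in range(n_shuffles)]
)

# Adjusted p-value: the share of shuffled results at least as "interesting"
# (i.e., with a p-value as small or smaller) as the real test.
adjusted_p = np.mean(shuffled_ps <= observed_p)
print(f"Observed best p-value: {observed_p:.4f}; adjusted p-value: {adjusted_p:.3f}")
```

With only 1,000 shuffles the adjusted p-value is a coarse estimate; in practice you would typically run many more simulations to stabilize it.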
Using target shuffling to calculate
an “adjusted” p-value produces an improved measure of the actual relationship
between each of the multiple challengers and the response. By considering
increased random variation stemming from multiple challengers, the “adjusted”
p-value better highlights the true relationships and reduces Type I errors.
This all culminates in an intuitive validation process that improves the reliability and repeatability of recommendations made from significance testing.
Originally posted at BrooksBell.com.