Stop Early Stopping
Why you shouldn't peek at significance levels to decide when to stop an experiment
What happens when there's truly no difference between A and B in an A/B test? We'll simulate 625 tests where both variants are identical. Each test has 500 visitors/day, and both variants have a conversion rate of 10%.
Each simulation can result in correctly finding no difference, or in a false positive.
The simulation compares two strategies: "Waiting for the full 30 days" vs. "Stopping when we see a 'winner'".
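Here is a minimal sketch of this A/A simulation in Python, assuming a two-sided two-proportion z-test, daily checks, and 500 visitors per variant per day; the interactive version above may differ in the exact test statistic and random details:

```python
# Sketch of the A/A simulation: 625 tests, 30 days, 500 visitors/day per variant,
# both variants converting at a true rate of 10%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_TESTS = 625           # number of simulated A/A experiments
DAYS = 30               # planned experiment length
VISITORS_PER_DAY = 500  # visitors per variant per day (assumption)
P = 0.10                # true conversion rate of both variants
ALPHA = 0.05

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * stats.norm.sf(abs(z))

fp_wait, fp_peek = 0, 0
for _ in range(N_TESTS):
    # daily conversions for each variant, accumulated over the 30 days
    conv_a = rng.binomial(VISITORS_PER_DAY, P, size=DAYS).cumsum()
    conv_b = rng.binomial(VISITORS_PER_DAY, P, size=DAYS).cumsum()
    n = VISITORS_PER_DAY * np.arange(1, DAYS + 1)

    daily_p = np.array([p_value(conv_a[d], n[d], conv_b[d], n[d]) for d in range(DAYS)])

    fp_wait += daily_p[-1] < ALPHA      # look only once, at day 30
    fp_peek += (daily_p < ALPHA).any()  # stop as soon as any day looks "significant"

print(f"False positive rate, waiting 30 days: {fp_wait / N_TESTS:.1%}")
print(f"False positive rate, peeking daily:   {fp_peek / N_TESTS:.1%}")
```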
When running A/B tests, it's tempting to check results frequently and stop the test as soon as you see a "significant" winner. This practice is a form of p-hacking that dramatically increases your false positive rate: how often you conclude there's a real difference when there isn't.
Why does this happen?
Because the p-value fluctuates over time:
The more often you check the p-value, the more likely you are to catch it below 0.05 at some point purely by chance, a false positive.
With a significance level of 0.05, you expect a 5% false positive rate when checking just once at the end. Much better than the rate you get by stopping at the first "winner", no?
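To see the fluctuation for yourself, here is a small sketch that prints the day-by-day p-value of a single simulated A/A test (same assumed setup as above; whether and when it dips below 0.05 depends on the random seed):

```python
# Sketch: track the p-value of one A/A test day by day.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
conv_a = rng.binomial(500, 0.10, size=30).cumsum()  # cumulative conversions, variant A
conv_b = rng.binomial(500, 0.10, size=30).cumsum()  # cumulative conversions, variant B
n = 500 * np.arange(1, 31)                          # cumulative visitors per variant

for day in range(30):
    p_pool = (conv_a[day] + conv_b[day]) / (2 * n[day])
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n[day])
    z = (conv_a[day] - conv_b[day]) / (n[day] * se)
    p = 2 * stats.norm.sf(abs(z))
    flag = "  <-- 'significant' by chance" if p < 0.05 else ""
    print(f"day {day + 1:2d}: p = {p:.3f}{flag}")
```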
Solutions
- Don't peek. Calculate your required sample size (you can use my A/B/n Test Duration Calculator) and wait until you have collected the full sample before looking at the p-value. A sample-size sketch follows this list.
- Group Sequential Methods, such as the O'Brien-Fleming or Pocock boundaries, allow for a few pre-planned "peeks" at the data by adjusting significance thresholds at each look while keeping the desired false positive rate.
- Fully Sequential Methods: techniques like mSPRT (mixture Sequential Probability Ratio Test) or methods yielding "Always Valid p-values" allow for continuous monitoring, with statistical tests designed so that error rates are controlled regardless of when you stop.
- Bayesian A/B testing methods don't require pre-calculating the sample size and can provide more intuitive results, though they are not immune to peeking. See the Beta-Binomial sketch after this list.
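To illustrate the first option, here is a sketch of the standard sample-size formula for comparing two proportions; the baseline rate, minimum detectable effect, significance level, and power below are placeholder assumptions, not values from any particular calculator:

```python
# Sketch: visitors needed per variant for a two-sided two-proportion z-test.
# Baseline, minimum detectable effect, alpha, and power are placeholder values.
import math
from scipy import stats

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """n per variant to distinguish conversion rates p1 and p2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

baseline = 0.10  # current conversion rate
mde = 0.01       # absolute lift we want to detect (10% -> 11%)
n = sample_size_per_variant(baseline, baseline + mde)
print(f"Required visitors per variant: {n}")
print(f"At 500 visitors/day per variant, that's about {n / 500:.0f} days")
```

With a 10% baseline and a one-point absolute lift, this works out to roughly 15,000 visitors per variant, which at 500 visitors per variant per day is roughly the 30-day horizon used in the simulation above.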
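And for the Bayesian option, a minimal sketch using Beta-Binomial posteriors and Monte Carlo sampling to estimate the probability that B beats A; the observed counts and the uniform prior are placeholder assumptions:

```python
# Sketch: Bayesian A/B comparison with Beta-Binomial posteriors.
# Counts and the uniform Beta(1, 1) prior are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)

# observed data: (conversions, visitors) per variant
conv_a, n_a = 1450, 15000
conv_b, n_b = 1530, 15000

# posterior of each conversion rate under a Beta(1, 1) prior
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

print(f"P(B > A) = {(post_b > post_a).mean():.1%}")
print(f"Expected absolute lift: {(post_b - post_a).mean():.4f}")
```

The Beta(1, 1) prior keeps the posterior a simple Beta distribution, so plain random draws suffice and no MCMC is needed; as the last bullet notes, though, monitoring "P(B > A)" continuously still carries its own peeking risk.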