Óscar’s A/B Testing Toolkit

Stop Early Stopping

Why you shouldn't peek at significance levels to decide when to stop an experiment

What happens when there's truly no difference between A and B in an A/B test? We'll simulate 625 tests where both variants are identical. Each test has 500 visitors/day, and both variants have a conversion rate of 10%.
Each simulation can result in correctly finding no difference, or in a false positive.

[Interactive simulation: false positives when "Waiting for the full 30 days" vs. "Stopping when we see a 'winner'"]
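
The interactive counters aren't reproduced here, but the experiment is easy to sketch in code. Below is a minimal Python version of the same setup; the pooled two-proportion z-test, the function names, and the fixed random seed are illustrative choices, not necessarily what powers the simulation above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

DAYS = 30
VISITORS_PER_DAY = 500
CONVERSION_RATE = 0.10  # identical for A and B: there is truly no difference
ALPHA = 0.05
N_TESTS = 625

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * norm.sf(abs(z))

def simulate_test():
    """Cumulative conversions per day for two identical variants."""
    daily_a = rng.binomial(VISITORS_PER_DAY, CONVERSION_RATE, DAYS)
    daily_b = rng.binomial(VISITORS_PER_DAY, CONVERSION_RATE, DAYS)
    return np.cumsum(daily_a), np.cumsum(daily_b)

wait_fp = 0  # false positives when checking once, at day 30
peek_fp = 0  # false positives when stopping at the first "significant" day

for _ in range(N_TESTS):
    conv_a, conv_b = simulate_test()
    visitors = VISITORS_PER_DAY * np.arange(1, DAYS + 1)

    # Strategy 1: wait the full 30 days and check once.
    if p_value(conv_a[-1], visitors[-1], conv_b[-1], visitors[-1]) < ALPHA:
        wait_fp += 1

    # Strategy 2: check every day; stopping at the first "winner" is
    # equivalent to asking whether the p-value ever dips below ALPHA.
    daily_p = [p_value(a, n, b, n) for a, b, n in zip(conv_a, conv_b, visitors)]
    if min(daily_p) < ALPHA:
        peek_fp += 1

print(f"Waiting 30 days: {wait_fp}/{N_TESTS} false positives (~{wait_fp / N_TESTS:.0%})")
print(f"Stopping early:  {peek_fp}/{N_TESTS} false positives (~{peek_fp / N_TESTS:.0%})")
```

Run as-is, the waiting strategy should land near the nominal 5%, while the peeking strategy flags noticeably more of these identical-variant tests as having a "winner".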

When running A/B tests, it's tempting to check results frequently and stop the test as soon as you see a "significant" winner. This practice is a form of p-hacking that dramatically increases your false positive rate: the rate at which you conclude there's a real difference when there isn't.

Why does this happen?

Because the p-value fluctuates over time:

[Chart: p-value of a single simulated A/A test by day (days 0–30, p-value from 0 to 1), wandering up and down over the course of the test]
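
To see this fluctuation concretely, here is a short sketch that recomputes the p-value each day for one simulated A/A test; the pooled z-test is the same illustrative choice as in the sketch above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

visitors = 500 * np.arange(1, 31)                # cumulative visitors per variant, days 1..30
conv_a = np.cumsum(rng.binomial(500, 0.10, 30))  # cumulative conversions, variant A
conv_b = np.cumsum(rng.binomial(500, 0.10, 30))  # cumulative conversions, variant B

pooled = (conv_a + conv_b) / (2 * visitors)
se = np.sqrt(pooled * (1 - pooled) * 2 / visitors)
p = 2 * norm.sf(np.abs((conv_a - conv_b) / visitors / se))

for day, p_day in enumerate(p, start=1):
    note = '  <- would look "significant"' if p_day < 0.05 else ""
    print(f"day {day:2d}: p = {p_day:.3f}{note}")
```

Even though A and B are identical, the p-value drifts from day to day, and on some days it can dip below 0.05 purely by chance.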

The more often you check the p-value, the more likely you are to catch it below 0.05 at some point and declare a false positive.
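
A rough back-of-the-envelope number: if the 30 daily looks were independent tests (they aren't; consecutive p-values are highly correlated, so the real inflation is smaller but still large), the chance of at least one spurious "significant" result would be:

```python
alpha, looks = 0.05, 30
print(1 - (1 - alpha) ** looks)  # ~0.79 if the looks were independent, vs 0.05 for a single look
```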

With a significance level of 0.05, you expect a 5% false positive rate when checking just once at the end. Much better than the rate for the early-stopping strategy shown above, no?

Solutions

Inspired by Evan Miller’s article How Not to Run an A/B Test