Stop Early Stopping
Why you shouldn't peek at significance levels to decide when to stop an experiment
What happens when there's truly no difference between A and B in an A/B test? We'll simulate 625 tests where both variants are identical. Each test has 500 visitors/day, and both variants have a conversion rate of 10%.
Each simulation can result in correctly finding no difference, or in a false positive.
The simulation compares two strategies: "Waiting for the full 30 days" vs. "Stopping when we see a 'winner'".
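Here is a minimal sketch of this A/A simulation in Python, assuming a two-sided two-proportion z-test, daily checks, and 500 visitors per variant per day; the interactive version above may differ in the exact test statistic and random details:

```python
# Sketch of the A/A simulation: 625 tests, 30 days, 500 visitors/day per variant,
# both variants converting at a true rate of 10%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_TESTS = 625           # number of simulated A/A experiments
DAYS = 30               # planned experiment length
VISITORS_PER_DAY = 500  # visitors per variant per day (assumption)
P = 0.10                # true conversion rate of both variants
ALPHA = 0.05

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * stats.norm.sf(abs(z))

fp_wait, fp_peek = 0, 0
for _ in range(N_TESTS):
    # daily conversions for each variant, accumulated over the 30 days
    conv_a = rng.binomial(VISITORS_PER_DAY, P, size=DAYS).cumsum()
    conv_b = rng.binomial(VISITORS_PER_DAY, P, size=DAYS).cumsum()
    n = VISITORS_PER_DAY * np.arange(1, DAYS + 1)

    daily_p = np.array([p_value(conv_a[d], n[d], conv_b[d], n[d]) for d in range(DAYS)])

    fp_wait += daily_p[-1] < ALPHA      # look only once, at day 30
    fp_peek += (daily_p < ALPHA).any()  # stop as soon as any day looks "significant"

print(f"False positive rate, waiting 30 days: {fp_wait / N_TESTS:.1%}")
print(f"False positive rate, peeking daily:   {fp_peek / N_TESTS:.1%}")
```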
When running A/B tests, it's tempting to check results frequently and stop the test as soon as you see a "significant" winner. This practice is a form of p-hacking that dramatically increases your false positive rate: how often you conclude there's a real difference when there isn't.
Why does this happen?
Because the p-value fluctuates over time:
The more often you check the p-value, the more likely you are to catch it below 0.05 at some point purely by chance, a false positive.
With a significance level of 0.05, you expect a 5% false positive rate when checking just once at the end. Much better than the rate you get by stopping at the first "winner", no?
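To see the fluctuation for yourself, here is a small sketch that prints the day-by-day p-value of a single simulated A/A test (same assumed setup as above; whether and when it dips below 0.05 depends on the random seed):

```python
# Sketch: track the p-value of one A/A test day by day.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
conv_a = rng.binomial(500, 0.10, size=30).cumsum()  # cumulative conversions, variant A
conv_b = rng.binomial(500, 0.10, size=30).cumsum()  # cumulative conversions, variant B
n = 500 * np.arange(1, 31)                          # cumulative visitors per variant

for day in range(30):
    p_pool = (conv_a[day] + conv_b[day]) / (2 * n[day])
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n[day])
    z = (conv_a[day] - conv_b[day]) / (n[day] * se)
    p = 2 * stats.norm.sf(abs(z))
    flag = "  <-- 'significant' by chance" if p < 0.05 else ""
    print(f"day {day + 1:2d}: p = {p:.3f}{flag}")
```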
Solutions
- Don't peek. Calculate your required sample size (you can use my A/B/n Test Duration Calculator) and wait until you have collected the full sample before looking at the p-value. A sample-size sketch follows this list.
- Group Sequential Methods, such as the O'Brien-Fleming or Pocock boundaries, allow for a few pre-planned "peeks" at the data by adjusting significance thresholds at each look while keeping the desired false positive rate.
- Fully Sequential Methods: techniques like mSPRT (mixture Sequential Probability Ratio Test) or methods yielding "Always Valid p-values" allow for continuous monitoring, with statistical tests designed so that error rates are controlled regardless of when you stop.
- Bayesian A/B testing methods don't require pre-calculating the sample size and can provide more intuitive results, though they are not immune to peeking. See the Beta-Binomial sketch after this list.
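To illustrate the first option, here is a sketch of the standard sample-size formula for comparing two proportions; the baseline rate, minimum detectable effect, significance level, and power below are placeholder assumptions, not values from any particular calculator:

```python
# Sketch: visitors needed per variant for a two-sided two-proportion z-test.
# Baseline, minimum detectable effect, alpha, and power are placeholder values.
import math
from scipy import stats

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """n per variant to distinguish conversion rates p1 and p2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

baseline = 0.10  # current conversion rate
mde = 0.01       # absolute lift we want to detect (10% -> 11%)
n = sample_size_per_variant(baseline, baseline + mde)
print(f"Required visitors per variant: {n}")
print(f"At 500 visitors/day per variant, that's about {n / 500:.0f} days")
```

With a 10% baseline and a one-point absolute lift, this works out to roughly 15,000 visitors per variant, which at 500 visitors per variant per day is roughly the 30-day horizon used in the simulation above.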
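And for the Bayesian option, a minimal sketch using Beta-Binomial posteriors and Monte Carlo sampling to estimate the probability that B beats A; the observed counts and the uniform prior are placeholder assumptions:

```python
# Sketch: Bayesian A/B comparison with Beta-Binomial posteriors.
# Counts and the uniform Beta(1, 1) prior are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)

# observed data: (conversions, visitors) per variant
conv_a, n_a = 1450, 15000
conv_b, n_b = 1530, 15000

# posterior of each conversion rate under a Beta(1, 1) prior
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

print(f"P(B > A) = {(post_b > post_a).mean():.1%}")
print(f"Expected absolute lift: {(post_b - post_a).mean():.4f}")
```

The Beta(1, 1) prior keeps the posterior a simple Beta distribution, so plain random draws suffice and no MCMC is needed; as the last bullet notes, though, monitoring "P(B > A)" continuously still carries its own peeking risk.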