Have you ever gotten an A/B test result that looks statistically significant, but doesn't actually boost performance IRL?
That’s a false discovery, and it’s a pretty common occurrence.
Up to 25% of test results significant at the 5% level are false discoveries, according to a new paper co-authored by two marketing professors at the University of Pennsylvania’s Wharton School, Ron Berman and Christophe Van den Bulte.
Berman came away from the research — based on about 5,000 effects from 2,700 experiments run on Optimizely’s platform in 2014 — with some testing tips for growth marketers.
He broke three of them down for us below.
Run two-stage A/B tests.
The best way to avoid false discoveries, according to Berman, is a two-stage A/B test.
Let’s say you’re testing four options on 10,000 people. Instead of showing each version to a quarter of your sample, do this:
- Step 1: Show each of the four options to 2,000 people. Pick the top performer.
- Step 2: Show the winner to the final 2,000 people, and compare its performance to whatever’s live on your site. This second stage acts as “a defense against the false discovery,” Berman said.
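To make the two steps concrete, here is a minimal Python sketch. It is not the method from the paper: the function names, the conversion counts, and the choice of a one-sided pooled z-test are all assumptions made for illustration. It also assumes you measure the live version on a comparable slice of traffic during Step 2.

```python
import math


def pick_winner(stage1_results):
    """Step 1: return the option with the highest conversion rate.

    stage1_results maps an option name to (conversions, visitors).
    """
    return max(
        stage1_results,
        key=lambda name: stage1_results[name][0] / stage1_results[name][1],
    )


def one_sided_p_value(control_conv, control_n, winner_conv, winner_n):
    """Step 2: pooled two-proportion z-test.

    Returns the one-sided p-value for "the winner converts better than
    the live version". Choosing this particular test is an assumption.
    """
    p_control = control_conv / control_n
    p_winner = winner_conv / winner_n
    pooled = (control_conv + winner_conv) / (control_n + winner_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / winner_n))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to detect
    z = (p_winner - p_control) / se
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical Step 1 data: four options, 2,000 visitors each.
stage1 = {
    "A": (100, 2000),
    "B": (120, 2000),
    "C": (95, 2000),
    "D": (110, 2000),
}
winner = pick_winner(stage1)

# Hypothetical Step 2 data: the winner on the final 2,000 visitors,
# compared with the live version on an equally sized sample.
p_value = one_sided_p_value(
    control_conv=100, control_n=2000, winner_conv=130, winner_n=2000
)
print(winner, round(p_value, 3))
```

With these made-up numbers, option B wins Step 1 and clears the 5% bar again in Step 2, with a p-value of about 0.02. If the Step 1 win had been a fluke, the fresh sample in Step 2 is where it would most likely fall apart.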
The more (unique!) options you try in Step 1, Berman and his co-author found, the more robust your test.
“The smaller sample per variation makes the statistics worse, but picking the best out of [several options] makes it better,” Berman said.
Worry more about small wins than big flops.
“If you see very, very, very small improvements,” Berman said, “it's potentially just because you're not courageous enough in the experiments you're doing.”
The distribution of the Optimizely data suggested that the vast majority of A/B test results — 70%! — are true nulls. (In other words, A and B don’t perform differently.)
Berman suspects that’s because “people are too afraid” of big flops, so they test tiny tweaks and cut themselves off from big wins.
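A back-of-the-envelope calculation shows how a high share of true nulls feeds false discoveries. The sketch below uses the textbook expected-value formula, not the estimation method from the paper. The function name is hypothetical, and the power values (the chance a test detects a real effect) are illustrative guesses, not figures from the research.

```python
def false_discovery_rate(null_share, alpha, power):
    """Expected share of significant results that are false discoveries.

    null_share: fraction of tests where A and B truly perform the same
    alpha:      significance threshold (0.05 for the 5% level)
    power:      chance that a test with a real effect comes out significant
                (an assumed input, not a number from the paper)
    """
    false_positives = null_share * alpha
    true_positives = (1 - null_share) * power
    return false_positives / (false_positives + true_positives)


# 70% true nulls, 5% threshold, and a hypothetical 35% power.
low_power = false_discovery_rate(null_share=0.70, alpha=0.05, power=0.35)

# Same inputs, but with a hypothetical 80% power.
high_power = false_discovery_rate(null_share=0.70, alpha=0.05, power=0.80)

print(round(low_power, 2), round(high_power, 2))
```

In this simplified model, 70% true nulls and a 5% threshold produce a false discovery rate of 25% when power is around 35%, and roughly 13% when power is 80%. Bolder experiments with bigger real effects are easier to detect, which is one way to read Berman's advice about courage.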
Companies with more A/B testing experience see a lower rate of false discoveries, Berman and his co-author found.
It’s hard to know why this is — are they testing differently? Hiring differently? — but it means that “experience helps.”
When you’re A/B testing, you can’t avoid false discoveries entirely. But you can minimize the damage they do to performance with two-stage tests, big creative swings and persistence.