Why Don’t A/B Test Results Add Up?

You’re working on a new growth or marketing tactic to increase some conversion or retention number. The idea is simple: make small incremental changes by picking the winners of successive A/B tests; those changes will stack, and you’ll have a better product in all the measured ways that matter: customer retention, new customer acquisition, product utilization, etc.

This is a familiar practice for (tech) Growth, Marketing, and Product Marketing teams, who make changes to landing pages (Growth), art assets and copy (Marketing), or in-product messaging (Product Marketing), to name a few examples.

Invariably, there is a finding.

The commentary goes something like this: “Hey, the A/B test shows that if we move this button over there, we’ve improved top-of-the-funnel metrics by X%”.

Or “When the artwork has a person looking directly at the user that results in a better clickthrough rate than when the person is in quarter-profile.”

Or “When we place banner ads in-app, they lead to a higher upgrade rate than dismissible pop-ups do.”

All of this can sound reasonable in the moment. There’s enough data to back up the statements. There are p-values and statistical significance reports that point to outperformance against the control. All of this has the air of science around it. So it must be good, right?

But then another reality hits. You review the state of affairs from several experiments ago; surely by now all those 1% gains should be showing up in some report somewhere. But the numbers don’t add up like they should.
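As a back-of-the-envelope yardstick (with made-up numbers): if ten shipped experiments each reported a 1% relative lift, and those lifts were real, persistent, and independent, the headline metric should be up roughly 10% by now. A quick sketch of that arithmetic:

```python
# Back-of-the-envelope check, assuming (hypothetically) ten shipped
# experiments that each reported a +1% relative lift, and that the lifts
# are real, persistent, and independent of one another.
reported_lifts = [0.01] * 10

expected = 1.0
for lift in reported_lifts:
    expected *= 1 + lift        # lifts compound multiplicatively

print(f"expected cumulative lift: {expected - 1:+.2%}")   # about +10.46%
```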

And now come the funambulists with explanations of “Well, the macroeconomic environment has shifted a bit”; “Well, the gains are there AcTuAllY and what you’re seeing is that if we had done nothing we would be 10% worse. All our experiments have kept things stable.”; “Well, there were a lot of product changes that didn’t go through the A/B testing regime and we broke some of the flows, so we can’t really compare the results from before, but everything is working.”

What makes things harder to disentangle is that some tactics do work and their impact is persistent. And it’s this existence of success that makes the existence of non-success (I’m not calling it failure on purpose) harder to accept. It’s one thing if there had never been a success: we could more easily believe the broad statements like “the macroeconomic environment has shifted” or “we made lots of product changes that didn’t go through proper A/B testing”.

So what is actually happening? Several things, sometimes individually, sometimes compounded. Let’s review in no particular order.

  • Your primary metric of concern keeps changing from one experiment to the next, and you end up making tradeoffs. So even if the “net” business effect of each experiment is positive on a per-experiment basis, in the aggregate things offset each other. A kind of Simpson’s paradox, if you will. (A small worked example follows this list.)
  • False positive and false negative results are not being rigorously monitored. You’re running statistical tests! A p-value less than 0.05 isn’t a sure thing; there’s still a chance of error. Sometimes you incorrectly reject the null (i.e., you conclude the control underperformed when it didn’t). Sometimes you incorrectly retain the null (i.e., you conclude the control held its ground against the alternatives when it didn’t). And these errors are not “symmetric” in the sense that one oops merely cancels out the gains of one other experiment: a single bad call could negate the gains of many experiments. (A simulation sketch after this list shows how this plays out over a year of tests.)
  • You’re being hoodwinked by the unseen / uncaptured variables. Your demographics are changing and you just don’t know it. None of the data you are capturing about users shows a change, but what’s lurking underneath is a pernicious correlation that you’re unaware of. You may have a “latent variable problem”. That growth tactic that changed the color of a button from pastel green to fluorescent peridot, improving the clickthrough rate by 10%? Maybe that correlates to “18-35yo males who have cats” and that demographic has either been saturated, or their preferences have changed, or you’re getting more of a new demographic “18-35yo females who don’t have cats” who don’t like fluorescent peridot. This doesn’t matter so much for fleeting, in-the-moment marketing campaigns because you’re just trying to capture the zeitgeist and once it’s gone, you’ll update your campaigns to reflect the new trends. But it does matter with product development (a sketch of this mix-shift effect follows the list). See the next bullet.
  • Your Marketing, Product Marketing, and Growth teams are uncoordinated. Marketing is now targeting “18-35yo ambitious go-getters who have money but no time for cats”, Product Marketing is still trying to upsell the “cat toy”, and Growth is leaning into the “fluorescent peridot” theme. Marketing sees good health in top-of-the-funnel metrics. But the “cat toy” banner ad isn’t working like it used to. And the fluorescent peridot theme’s clickthrough rate has fallen back to the pastel green theme’s. Commence finger-pointing and mental gymnastics.
  • Your attribution system is probably screwed up. If you’re relying on “touch”-based attribution, you’re likely getting a lot of bad reads.
  • You did release new product features without testing. This one is hard to “fix” since experimentation is expensive and can really slow down the release of “obvious” features. But these are the gambles you have to make as a product team and as a business. There’s also a disconnect between how newly acquired users and tenured users take to a new feature. And running multiple versions of the product is no small challenge if you haven’t built (or technically couldn’t have built) the product to be A/B tested.
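
To make the tradeoff bullet concrete, here’s a small sketch with entirely made-up lift numbers: every experiment is a legitimate +3% win on its own primary metric, yet once the side effects on the other metrics are tallied, two of the three business metrics end up flat or slightly down.

```python
# Hypothetical per-experiment relative lifts (all numbers invented).
# Each experiment is declared a win on its own primary metric; the side
# effects on the other metrics are ignored at decision time.
experiments = [
    {"name": "exp_1", "primary": "signup",
     "lifts": {"signup": +0.03, "retention": -0.02, "upgrade": 0.00}},
    {"name": "exp_2", "primary": "retention",
     "lifts": {"signup": -0.02, "retention": +0.03, "upgrade": -0.01}},
    {"name": "exp_3", "primary": "upgrade",
     "lifts": {"signup": -0.01, "retention": -0.02, "upgrade": +0.03}},
]

cumulative = {"signup": 1.0, "retention": 1.0, "upgrade": 1.0}
for exp in experiments:
    win = exp["lifts"][exp["primary"]]
    print(f"{exp['name']}: shipped, primary metric {exp['primary']} {win:+.0%}")
    for metric, lift in exp["lifts"].items():
        cumulative[metric] *= 1 + lift

print("cumulative effect across all shipped experiments:")
for metric, factor in cumulative.items():
    print(f"  {metric}: {factor - 1:+.1%}")
```

Every status update truthfully reports a +3% win, yet the cumulative numbers tell a different story.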
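
The false positive / false negative bullet can be simulated too. This is a minimal Monte Carlo sketch with entirely hypothetical parameters (100 experiments a year, 20,000 users per arm, a 10% baseline conversion rate, and only a quarter of ideas carrying a real +5% lift); the pattern to look for is that some shipped “winners” are pure noise and the sum of reported lifts overstates the sum of true lifts.

```python
# Monte Carlo sketch of a year of experiments run at p < 0.05, with all
# parameters being hypothetical assumptions: 100 experiments, 20,000 users
# per arm, a 10% baseline conversion rate, and only 25% of ideas carrying
# a real +5% relative lift.
import math
import numpy as np

rng = np.random.default_rng(11)
n_experiments, n, baseline = 100, 20_000, 0.10

shipped_reported, shipped_true = [], []
for _ in range(n_experiments):
    true_lift = 0.05 if rng.random() < 0.25 else 0.0
    c = rng.binomial(n, baseline) / n                    # control conversion
    v = rng.binomial(n, baseline * (1 + true_lift)) / n  # variant conversion
    pooled = (c + v) / 2
    z = (v - c) / math.sqrt(2 * pooled * (1 - pooled) / n)
    p = math.erfc(abs(z) / math.sqrt(2))                 # two-sided p-value
    if p < 0.05 and v > c:                               # declare a winner, ship it
        shipped_reported.append(v / c - 1)
        shipped_true.append(true_lift)

print(f"winners shipped:                {len(shipped_reported)}")
print(f"  ...that were false positives: {sum(t == 0 for t in shipped_true)}")
print(f"sum of reported lifts: {sum(shipped_reported):+.1%}")
print(f"sum of true lifts:     {sum(shipped_true):+.1%}")
```

The inflation is structural: an experiment tends to clear the significance bar precisely when the noise breaks in its favor, so shipped winners arrive with optimistic estimates attached.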
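
And the latent-variable bullet, sketched with made-up numbers: suppose the fluorescent peridot button genuinely lifts clickthrough, but only for one segment that nobody is tracking. When that segment’s share of traffic shrinks, the blended “win” quietly shrinks with it, even though neither the button nor the original test report has changed.

```python
# Sketch of the latent-variable / mix-shift problem (all numbers invented).
# Hypothetical assumption: the fluorescent peridot button lifts clickthrough
# only for segment A; segment B is indifferent. Only the blended clickthrough
# rate is being tracked, not the segment mix.

def blended_ctr(share_a: float, peridot: bool) -> float:
    """Blended clickthrough rate for a given traffic mix."""
    ctr_a = 0.08 * (1.10 if peridot else 1.00)  # segment A: +10% lift from peridot
    ctr_b = 0.05                                # segment B: no effect either way
    return share_a * ctr_a + (1 - share_a) * ctr_b

for label, share_a in [("at test time, segment A = 60% of traffic", 0.60),
                       ("six months later, segment A = 20% of traffic", 0.20)]:
    control, variant = blended_ctr(share_a, False), blended_ctr(share_a, True)
    print(f"{label}: {control:.2%} -> {variant:.2%} "
          f"(relative lift {variant / control - 1:+.1%})")
```

Nothing in the standard report flags this; the only tell is a once-reliable lift that keeps shrinking.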

These are a handful of reasons, beyond just “we really did break some flow”, that can turn an experimentation practice into a Sisyphean endeavor. Solving for this is unique to each organization, but a common thread is to be unified on the metrics that matter to the business, and on which tradeoffs are acceptable and by how much.

Get in touch if you are taking a critical look at your experimentation practices. I can help organize this into a cohesive unified practice that speaks to the business.