A:B Testing

5 Ways Your A:B Tests Are Going Wrong

What are A:B Tests?

You test two versions of marketing content, A and B, and measure how each performs. When you've gathered enough data to know which is better, you use that version for your future marketing, so you continue to get better results.

Other names for this type of testing include Split Testing and Bucket Testing; Multivariate Testing is a closely related technique that varies several elements at once.
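To make "measure how each performs" concrete, here is a minimal sketch in Python of the usual comparison, assuming you have simple conversion counts for each version and are happy with a two-proportion z-test; all the numbers are invented for illustration.

```python
# Minimal A:B comparison: did version B convert better than version A,
# and is the difference big enough to be unlikely to be chance?
from statsmodels.stats.proportion import proportions_ztest

conversions = [230, 270]    # conversions for version A, version B (made up)
visitors    = [5000, 5000]  # visitors shown each version (made up)

z_stat, p_value = proportions_ztest(conversions, visitors)

rate_a = conversions[0] / visitors[0]
rate_b = conversions[1] / visitors[1]
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  p-value: {p_value:.3f}")

# A p-value below your chosen significance level suggests the difference is
# unlikely to be pure chance, so you would roll out the winning version.
if p_value < 0.05:
    print("Significant at the 95% level - use the winner going forward.")
else:
    print("Not significant yet - keep collecting data.")
```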

A:B Testing is much better than guessing, but it has problems. And unfortunately most people don't tell you about them. This blog post lists some of the biggies, so you can avoid them.

1. Is Your Setup OK?

Is there a mistake in how you run your testing? Is the software broken? Are the samples of people not comparable? If you've got problems like these, then your test results reflect your setup, not the marketing you think you're testing.

The solution is to run occasional A:A tests. This is the most basic type of A:B test: you simply create two identical versions of the same marketing and test them against each other. Read More at: 6 Common Ways Your Split Test Results Could Be Off - The Daily Egg

If you get significantly different results - one of your identical marketing versions wins - then something in your setup may be wrong.
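One way to calibrate your suspicion: even a perfectly healthy setup will declare a "winner" in roughly 5% of A:A tests at 95% significance, purely by chance. The sketch below simulates this, assuming two identical arms with a 5% conversion rate; if your real A:A tests flag winners much more often than the simulation, suspect the tooling or the traffic split.

```python
# Simulate many A:A tests on identical variants to see how often pure chance
# produces a "significant" winner. Rates and sample sizes are made up.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n_tests, visitors, true_rate = 1000, 5000, 0.05

false_positives = 0
for _ in range(n_tests):
    conv_a = rng.binomial(visitors, true_rate)
    conv_b = rng.binomial(visitors, true_rate)  # same true rate as A
    _, p = proportions_ztest([conv_a, conv_b], [visitors, visitors])
    if p < 0.05:
        false_positives += 1

print(f"{false_positives / n_tests:.1%} of identical A:A tests looked significant")
# Expect roughly 5%. A setup that flags winners far more often than this,
# or flags huge differences, is telling you about itself, not the marketing.
```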

Don't feel bad. Doing experiments properly is hard. For example, "There’s an unspoken rule in the pharmaceutical industry that half of all academic biomedical research will ultimately prove false, and in 2011 a group of researchers at Bayer decided to test it. Looking at sixty-seven recent drug discovery projects based on preclinical cancer biology research, they found that in more than 75 percent of cases the published data did not match up with their in-house attempts to replicate" - Scientific Regress by William Wilson. If that's how Big Science fares, I hate to think what the percentage of bad experiments in marketing must be.

2. Are You Testing The Wrong Thing?

For example:

  • If testing the effect of tone-of-voice in email subject lines, did the lengths vary so that one didn't fully show on mobile?
  • If trying different product images, did you accidentally vary page size and load time?
  • Did your change break some script?
  • Did you push some other marketing below the fold?

The point is not that your A:B test failed - the winner really did beat the loser - but that you will draw the wrong conclusion about why, and apply it incorrectly to future marketing.

The solution is to treat A:B tests much more carefully if you plan to use the results to guide your marketing policy. At the very least you must run an independent test to check your conclusions.

3. Novelty Effect

The Novelty Effect is that people pay special attention to recent changes. Basically, we humans evolved over millions of years to notice changes in the world around us, because changes may indicate new food that we can eat, or new predators that could eat us. And we still pay attention to novelty on the internet.

"Some of your experienced visitors (people who have been to your site in the past) will try out your new feature simply because it is new. So a new design change or feature will likely give a short-term advantage to the variant," - WhichTestWon.

So when you are A:B testing a new version of your marketing vs the existing one, be aware that the new version may do better purely because it's new. This doesn't mean don't use the winner! What it does mean is that the winner may not be better for long: your test results are only valid for a short while.

So be careful when drawing general conclusions. Suppose you are doing email marketing for a travel company and you find that including symbols like an aircraft and a palm tree in the subject line increases opens. This does NOT mean that it will still work next month.

There are two possible solutions. You can repeat your A:B test after a few weeks have passed and the novelty effect has worn off, to see if the winner still does better. Or keep regularly refreshing your marketing, so that the novelty effect is constantly working to your benefit.
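If you want to check for a fading novelty effect in your own data, one rough approach is to track the winner's uplift week by week and see whether it shrinks. The sketch below assumes a hypothetical DataFrame with one row per visitor and the columns week, variant ('A' or 'B') and converted (0/1); the column names and the data are invented.

```python
# Rough novelty-effect check: does B's uplift over A shrink as the test ages?
import pandas as pd

def uplift_by_week(df: pd.DataFrame) -> pd.Series:
    """Return B's conversion-rate uplift over A for each week of the test."""
    rates = df.groupby(["week", "variant"])["converted"].mean().unstack("variant")
    return (rates["B"] - rates["A"]) / rates["A"]

# Tiny made-up example: B looks great in week 1, identical to A by week 4.
df = pd.DataFrame({
    "week":      [1, 1, 1, 1, 4, 4, 4, 4],
    "variant":   ["A", "A", "B", "B", "A", "A", "B", "B"],
    "converted": [0, 1, 1, 1, 0, 1, 0, 1],
})
print(uplift_by_week(df))
# A steadily shrinking uplift suggests much of the early win was novelty,
# so don't bake the result into long-term marketing policy.
```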

4. Data Dredging

This happens when you analyze your data to spot effects without deciding in advance what you are looking for. If you search for long enough, you are bound to find coincidences that look like real effects but are just chance.

For example, if you try enough action button colours, you are bound to find one that appears significantly different from what you use now, even if it isn't really. Read More at: 6 Ways Marketers Fool You With Statistics - Fresh Relevance and scroll down to the XKCD cartoon.

The solution is not to trust this kind of exploratory analysis unless you've done additional work to confirm the specific result you think you found. When results are important to you, always run an independent test to check your conclusions.
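To see how quickly dredging produces spurious winners, the sketch below simulates twenty button colours that are all genuinely identical to the control and counts how many look significant at 95%; it also applies a Bonferroni-style correction, one standard piece of "additional work" for multiple comparisons. All the numbers are invented.

```python
# Multiple comparisons: test 20 colours that make no real difference and see
# how many look like "winners" at 95% significance anyway.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
visitors, true_rate, n_colours, alpha = 5000, 0.05, 20, 0.05

control = rng.binomial(visitors, true_rate)
p_values = []
for _ in range(n_colours):
    variant = rng.binomial(visitors, true_rate)  # same true rate as the control
    _, p = proportions_ztest([control, variant], [visitors, visitors])
    p_values.append(p)

print("Spurious 'winners' at 95%:", sum(p < alpha for p in p_values))
# For independent tests the chance of at least one false winner is about
# 1 - 0.95**20, roughly 64%. A Bonferroni correction demands p < alpha/20
# before believing any single result fished out of a big trawl.
print("Survivors after Bonferroni:", sum(p < alpha / n_colours for p in p_values))
```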

5. Your Sample Sizes Are Too Big

A:B testing calculators that work out your sample size are great. I like this basic one from Evan Miller.

They all ask for a significance level - how certain do you want to be that your test result is "correct"? 90%? 95%? 99%?

There's a big incentive to choose a high percentage, because we all want to avoid mistakes, and sometimes that's the right thing. For example, if you're running tests to choose the font for your site, that decision will have a permanent impact, so you want to be very certain.

But usually picking a high significance level is wrong. For example, if you're choosing the subject line for tomorrow's newsletter, that's not so important, and a lower significance level means a much smaller sample size, so fewer people are sent the "bad" version during testing.

The solution is to use a significance level of 80% for little decisions. The main value of testing in such cases is for the A:B test to act as a "back stop" and prevent very bad copy going out, and a significance level of 80% will do that handily.
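As a rough illustration of how much the significance level affects sample size, the sketch below uses statsmodels' power calculations (the same kind of arithmetic behind calculators like Evan Miller's). The 5% baseline conversion rate, the hoped-for lift to 6% and the 80% power target are all invented for illustration.

```python
# How many people per variant do you need at different significance levels?
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.06, 0.05)  # hoped-for 6% vs baseline 5%
power_calc = NormalIndPower()

for significance in (0.99, 0.95, 0.80):
    alpha = 1 - significance
    n = power_calc.solve_power(effect_size=effect, alpha=alpha, power=0.8,
                               alternative="two-sided")
    print(f"{significance:.0%} significance -> about {n:,.0f} people per variant")

# Dropping from 99% to 80% significance cuts the required sample size by well
# over half, so far fewer people are sent the losing version of a low-stakes
# email while the test runs.
```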

Another way of looking at this is that levels like 99% significance are calculated on the assumption that you've run your experiment perfectly - which probably isn't the case. An apparent significance of, say, 99% is very unlikely to really mean 99% unless a genuine expert is running the tests, so using a lower significance level for minor decisions has little downside.

Update: how Jeff Bezos of Amazon approaches heavyweight vs lightweight decisions: There are 2 types of decisions to make, and don't confuse them

Want to find out more?

Contact Us →  
