As I delve deeper into the abyss of numbers while preparing for my interactive panel on validity at Optimization Summit, I’m coming across more fallacies about validity in marketing tests. Here’s one I recently heard…
“We were told that if we send each treatment to 4,000 people on our list we would have a valid test.”
I changed the number to protect the innocent, but this is a common misconception.
That’s why they play the game
Now I’ll give them credit. The misconception above is thinking along the correct lines. A large enough sample size is necessary to ensure you have validity.
However, while you can take an educated guess, it is impossible to know the minimum sample size before the test is actually run. Just ask a Las Vegas bookie.
That's because an important factor in determining sample size is the difference in results between the treatments. If the treatments return very different results, it’s much easier to say with confidence that you really do have two (or however many) emails that will perform differently. You don’t need as many samples to do that.
However, if the treatments have very similar results, you want many more observations to see if there really is a difference.
Think of it this way. I recently went to Disney World, and while I was waiting for a ride, the line split. I was curious to see if more people would go to the left or the right.
If I saw nine people go to the left, and one go to the right, I’d feel pretty confident that people tend to favor the left.
But what if the split was six to the left and four to the right? I would want way more observations to feel confident that there is a real difference, whether people really do favor one side over the other, or if what I’m seeing is just random chance. Maybe for the next ten people, six will go to the right and four to the left.
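To put rough numbers on that intuition, here is a minimal Python sketch. It assumes that, if people had no real preference, each person would pick left or right like a fair coin flip, and it asks how often a split at least this lopsided would show up by chance alone. The function name is made up for illustration, and the exact binomial test is my choice here, not something taken from Phillip's toolkit.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Chance of a split at least as lopsided as k vs. n - k,
    assuming (hypothetically) a true 50/50 preference."""
    extreme = max(k, n - k)
    # Probability of seeing `extreme` or more on one side out of n
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    # Double it to count lopsided splits in either direction
    return min(1.0, 2 * tail)

nine_to_one = two_sided_binomial_p(9, 10)
six_to_four = two_sided_binomial_p(6, 10)

print(f"9 vs. 1: {nine_to_one:.3f}")
print(f"6 vs. 4: {six_to_four:.3f}")
```

Under that coin-flip assumption, a nine-to-one split is a rare accident (roughly 2% of the time), while a six-to-four split would turn up about three times out of four. Ten people is plenty for the first case and nowhere near enough for the second.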
And that’s why it’s impossible to determine the exact sample size you need for every test. You would, essentially, need to know the response you would get for each treatment before you tested. And, after all, that’s why we run the tests. Because it’s nearly impossible to guess on an outcome. Again, just ask a Las Vegas bookie. The house often wins. But not always.
I asked Phillip Porter, a data analyst here at MECLABS, for a more official sounding explanation than my Mickey Mouse example above; an explanation that you can use verbatim to sound smart and win any internal debate. Here’s what he said…
“Significance is based on sample size and effect size. The larger the sample you have, the smaller effect size you can find to be significant. The larger the effect size present, the smaller sample size you need to find significance. Larger sample sizes are generally better, however any difference, no matter how small, can be found to be significant with a large enough sample size.”
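Phillip's last point is easy to see with a quick sketch. The Python below assumes a pooled two-proportion z-test, which is one common way to compare two conversion rates and not necessarily the test MECLABS uses. The conversion counts are invented for illustration: the same 5.0% vs. 5.5% gap, once with 1,000 recipients per treatment and once with 100,000.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under "no real difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal curve
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical results: 5.0% vs. 5.5% conversion at two list sizes
p_small = two_proportion_p_value(50, 1_000, 55, 1_000)
p_large = two_proportion_p_value(5_000, 100_000, 5_500, 100_000)

print(f"1,000 per treatment:   p = {p_small:.3f}")
print(f"100,000 per treatment: p = {p_large:.2e}")
```

The half-point difference is nowhere near significant at the smaller size and overwhelmingly significant at the larger one, even though the effect itself has not changed.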
“Imagine the thrill of getting your weight guessed by a professional.”
However, much like Navin R. Johnson’s carnival barker in “The Jerk,” it doesn’t hurt to take a guess as long as you know what you’re doing. And if you truly are a professional, you don’t have anything to lose by guessing (except maybe some Chiclets).
“You can guesstimate before you run the test, but the actual numbers may not match what you were expecting. If you could guesstimate correctly every time before running the test, why would you need to run the test?” Phillip said. That proves not only that great minds think alike, but also that it doesn’t hurt to try to get a sense of how much traffic or how many email recipients you need for a valid test.
In fact, we include a pre-test estimation tool in our Fundamentals of Online Testing course. There’s one in the MECLABS Test Protocol as well, which our researchers use for all of their experimentation.
But a true professional will still run the final numbers…
How to determine if you have a big enough sample size
I don’t want to get all Matt Damon writing on a chalkboard in “Good Will Hunting” on you, but (and read this in your best Boston accent) it comes down to the math. I had a tough time figuring out how I could truly serve you with this blog post. The last thing I wanted to do was raise a problem and not give you a solution (or only provide a paid solution).
I think, in the end, the best thing to do is just provide you with the equation. The sample size calculation that we use internally can be found in Cochran, W. G. (1977), Sampling Techniques, 3rd ed., Wiley, New York.
Phillip explained the formula…
“This formula provides us with the minimum sample size needed to detect significant differences when Z is determined by the acceptable likelihood of error (the abscissa of the normal curve). The value of Z is generally set to 1.96, representing a level (likelihood) of error of 5%. We want the highest accuracy possible, with the smallest sample size. This level of error, 5%, gives us the best tradeoff between these two goals.
p is the conversion rate we expect to see (estimate of the true conversion rate in the population), and d is the minimum absolute size difference we wish to detect (margin of error, half of the confidence interval).”
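Written out with the symbols Phillip defines, the standard form of Cochran's sample size formula for a proportion looks like this. Treat the exact notation as my assumption, pieced together from his definitions of Z, p and d. The worked numbers (a 5% expected conversion rate and a one-point margin) are hypothetical.

```latex
n = \frac{Z^{2}\, p\,(1 - p)}{d^{2}}
\qquad\text{e.g.}\qquad
n = \frac{1.96^{2} \times 0.05 \times 0.95}{0.01^{2}}
  = \frac{0.182476}{0.0001}
  \approx 1825
```

In other words, with those made-up inputs you would want at least 1,825 observations.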
We are working on a dead simple validity tool (the iPod of validity tools, if you will) to pass out at Optimization Summit. But for now, you can try putting the above formula in your own Excel spreadsheet.
Even Phillip admitted, “If you are trying to calculate this by hand it can look intimidating, but if you build the formula in Excel it is pretty simple.”
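If you would rather use a script than a spreadsheet, here is a minimal Python sketch of the same calculation. It assumes the formula is the standard Cochran form, n = Z² × p × (1 − p) / d², using Phillip's definitions of Z, p and d. The function name and the example rates are made up for illustration.

```python
import math

def min_sample_size(p, d, z=1.96):
    """Minimum sample size, assuming the standard Cochran form
    n = z^2 * p * (1 - p) / d^2.

    p: expected conversion rate (e.g. 0.05 for 5%)
    d: smallest absolute difference to detect (e.g. 0.01 for one point)
    z: 1.96 corresponds to a 5% level of error
    """
    if not 0 < p < 1:
        raise ValueError("p must be between 0 and 1")
    if d <= 0:
        raise ValueError("d must be positive")
    # Round up: you cannot send to a fraction of a recipient
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# Hypothetical inputs: 5% expected conversion, two different margins
n_one_point = min_sample_size(p=0.05, d=0.01)
n_half_point = min_sample_size(p=0.05, d=0.005)

print(n_one_point, n_half_point)
```

Notice that halving the difference you want to detect roughly quadruples the sample you need, which is the math behind the earlier point that similar-performing treatments call for many more observations.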
Related Resources
Optimization Summit 2011 – June 1-3
Email Marketing Tests: What to do when a radical change produces negligible results
Online Testing and Optimization: ROI your test results by considering effect size
Online Marketing Tests: How could you be so sure?
Testing Madness: What the odds of picking a perfect NCAA Tournament bracket can teach us about running valid tests