Monday, 13 August 2007
Topic: Optimization Testing Tested: Validity Threats Beyond Sample Size
It’s the nightmare scenario for any analyst or executive: Making what seems to be the right decision, only to find out it was based on false data.
Through online optimization testing, we try to discover which webpage or email message will perform best by trying each version with a random sample of target prospects.
During test design, we decide how large a difference we must see to warrant making a change and, using statistical methods, what level of confidence is required to consider the data valid.
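As a rough illustration of how those two design choices translate into a required sample size, the sketch below uses the standard normal-approximation formula for comparing two proportions. The 20% and 15% bounce rates are hypothetical, and the 80% power setting is an assumption; the brief itself mentions only the required difference and the confidence level.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(p_control, p_treatment, confidence=0.95, power=0.80):
    """Approximate visitors needed per page for a two-sided two-proportion z-test.

    `power` is an assumed design parameter that the brief does not specify.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical design: detect a drop in bounce rate from 20% to 15%.
n_large_effect = sample_size_per_variation(0.20, 0.15)
# A smaller difference (20% to 18%) needs far more visitors per page.
n_small_effect = sample_size_per_variation(0.20, 0.18)
print(n_large_effect, n_small_effect)
```

A figure like this only addresses sampling; it says nothing about whether the traffic, the timing, or the measurement tools were sound.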
But is sample size the only factor that should be considered when assessing the validity of test results?
In this research brief, we will identify the four greatest threats to test validity, use recent case studies to illustrate the nature and danger of each threat, and provide you with 10 key ways to avoid making bad business decisions on the basis of a “statistically valid” sample size alone.
Editor’s Note: We recently released the audio recording of our clinic on this topic. You can listen to a recording of this clinic here:
Case Study 1
We conducted a 7-day test for a large industrial parts company whose market is predominantly midsize to large businesses. The company’s highest-volume paid search traffic source is Google AdWords.
Our primary goal for this test was to reduce the bounce rate for the company’s Home page.
We tested the control against a product-oriented page and a directory-style page.
What you need to understand: The directory-style page yielded a 9.7% lower bounce rate than the control page, a 75.60% relative reduction.
Given the outcome of this statistically validated test, it would have been natural to stop gathering data and begin implementing Treatment 2 and/or follow-on research.
However, while post-test analysis and planning continued, we kept the test running for an additional week…
Case Study 1 – Week 2
After a second week, before closing out the test, we extracted a fresh set of reports, expecting further confirmation of the prior outcome.
What we found instead was quite different.
What you need to understand: In week 2, the Control and Treatment 1 pages performed much as they had in week 1. But the Treatment 2 page bounce rate soared from 12.9% to 20.7%, exceeding both of the other pages. The 7.8-point rise amounts to nearly 38% of the week 2 rate, or a relative increase of about 60% over the week 1 rate.
What possible causes can you think of for such a dramatic change in only one week?
Since only the “radical redesign” treatment was significantly impacted, we first speculated that many returning visitors, who were accustomed to seeing the familiar Control page, may have “bounced” by clicking back to the search results page to verify they were on the “right” page or by manually entering the URL.
To test this hypothesis, we extracted performance data separately for “New” and “Returning” visitors.
Results — New vs. Returning Visitors
What you need to understand: When the returning visitors are filtered out, and only new visitors are included, the results are similar. The bounce rate for the control page changed by only 2.4%, while that of the Treatment 2 page soared from 13.5% to 24.0%. The relative increase in bounce rate from week 1 to week 2 was nearly 79%.
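Because the brief quotes both point changes and relative changes, it helps to keep the two calculations apart. The minimal sketch below applies them to the rounded new-visitor rates quoted above for Treatment 2 (13.5% and 24.0%). With those rounded inputs the relative increase comes out a little under 78%; the brief's "nearly 79%" presumably reflects unrounded report data, which is an assumption.

```python
def absolute_change(before, after):
    """Change in percentage points between two rates given in percent."""
    return after - before

def relative_change(before, after):
    """Change expressed as a fraction of the starting rate."""
    return (after - before) / before

week1_rate = 13.5  # Treatment 2, new visitors, week 1 (percent)
week2_rate = 24.0  # Treatment 2, new visitors, week 2 (percent)

points = absolute_change(week1_rate, week2_rate)
relative = relative_change(week1_rate, week2_rate)
print(f"{points:.1f} points, {relative:.1%} relative increase")
```

Dividing by the starting rate rather than the ending rate is what makes this a relative increase "from week 1 to week 2".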
This investigation is still underway.
Test Validity Threats
When conducting optimization testing, there are four primary threats to test validity:
History Effects: The effect on a test variable of an extraneous variable associated with the passage of time.
Instrumentation Effects: The effect on a test variable of a variable external to the experiment that is associated with a change in the measurement instrument.
Selection Effects: The effect on a test variable of an extraneous variable associated with different types of subjects not being evenly distributed between experimental treatments.
Sampling Distortion Effects: The effect on the test outcome of failing to collect a sufficient number of observations.
Now, let’s take a look at an example of History Effects.
Case Study 2
We conducted a 7-day experiment for a subscription-based site that provides search and mapping services to a nationwide database of registered sex offenders.
The objective was to determine which ad headline would have the highest click-through rate.
During the test period, the nationally syndicated NBC television program Dateline aired a special called “To Catch a Predator.”
This program was viewed by approximately 10 million individuals, many of them concerned parents. Throughout the program, sex offenders were referred to as “predators.”
This word was used in some, but not all, of the headlines being tested.
We found the following:
What you need to understand: In the 48 hours following the Dateline special, there was a dramatic spike in overall click-through to the site. The click-through rates aligned in descending order of the prominence of the word “predator.” The best performing ad performed 133% better than the headline without the word “predator.”
In effect, an event external to the experiment that occurred during the test period (the Dateline special) caused a significant, transient change in the nature and magnitude of the arriving traffic. Thus, the test was invalidated by the History Effects validity threat.
Ask: “Did anything happen in the external environment during the test that could significantly influence results?”
Now, let’s take a look at an example of Instrumentation Effects.
Case Study 3
We conducted a multivariable landing page optimization experiment for a large subscription-based site.
The goal was to increase conversion by finding the optimal combination of page elements.
The test treatments were rendered by rotating the different values of each of the test variables evenly among arriving visitors.
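One way to picture an even rotation is a round-robin over every combination of variable values. The sketch below is only an illustration: the variable names and values are hypothetical, only three variables are used to keep the output small, and the brief does not describe how the testing software actually assigned visitors. Production tools typically randomize rather than cycle in a fixed order.

```python
from collections import Counter
from itertools import cycle, product

# Hypothetical page elements and values; the brief does not list the real ones.
VARIABLES = {
    "headline": ["A", "B"],
    "hero_image": ["A", "B"],
    "button_text": ["A", "B"],
}

def treatment_rotation(variables):
    """Yield every combination of variable values in turn, forever."""
    names = list(variables)
    combos = [dict(zip(names, values)) for values in product(*variables.values())]
    return cycle(combos)

rotation = treatment_rotation(VARIABLES)
assignments = [next(rotation) for _ in range(800)]  # 800 simulated visitors

# Count how many visitors saw each combination.
counts = Counter(tuple(sorted(a.items())) for a in assignments)
for combo, visitors in counts.items():
    print(dict(combo), visitors)
```

With 8 combinations and 800 simulated visitors, each combination is served the same number of times, which is the property an even rotation is meant to guarantee.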
We discovered that in the testing software, a “fail safe” feature was enabled specifying that, if for any reason the test was not running correctly, the page would default to the Control page.
The testing system (instrumentation) would deliver to the browser both the treatment page values and the control page values, but the browser would render only the page corresponding to the test condition. Delivering both sets of values added considerably to page load time.
Figure: Control page, with the 5 rotating test variables circled.
Test Validity Impact
Page Load Time
This chart shows load times for the control web page compared to one of the “treatment” pages. The extra 53 KB is significant, especially for anyone using dial-up or other low-speed access. At 56K modem speed, it’s an extra 9.56 seconds.
What you need to understand: The artificially long load times caused the “user experience” to be both asymmetric among the treatments, and significantly different from any of the treatments in production, thereby threatening test validity.
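The load-time figure can be sanity-checked with simple transfer arithmetic. The sketch below assumes the extra payload is 53 kilobytes of 1,024 bytes each and that a 56K modem has a nominal rate of 56,000 bits per second. The brief does not say how its 9.56-second figure was derived, so the implied effective throughput computed here is an inference, not a reported value.

```python
def transfer_seconds(size_kilobytes, link_bits_per_second):
    """Time to move a payload over a link, ignoring latency and protocol overhead."""
    bits = size_kilobytes * 1024 * 8
    return bits / link_bits_per_second

EXTRA_KB = 53             # extra payload quoted in the brief
NOMINAL_56K_BPS = 56_000  # nominal modem rate (assumption)
REPORTED_SECONDS = 9.56   # extra load time quoted in the brief

nominal_seconds = transfer_seconds(EXTRA_KB, NOMINAL_56K_BPS)
implied_bps = EXTRA_KB * 1024 * 8 / REPORTED_SECONDS

print(f"At the nominal rate: {nominal_seconds:.2f} s")
print(f"Throughput implied by 9.56 s: {implied_bps:,.0f} bits per second")
```

The nominal rate gives a somewhat shorter delay than the quoted one, which would be consistent with a calculation based on real-world dial-up throughput below 56,000 bits per second. Either way, the extra delay runs to several seconds.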
Ask: “Did anything happen to the technical environment or the measurement tools or instrumentation that could significantly influence results?”
Now, let’s return to Case Study 1 to see how these principles have been applied to date in investigating the possible causes of test invalidation:
Applying Each Effect
History Effects: Did anything happen during either week to dramatically influence results?
Instrumentation Effects: Did anything happen to the technical environment or the measurement tools or instrumentation that could significantly influence results?
Selection Effects: Did anything happen to the testing environment that may have caused the nature of the incoming traffic to be different from treatment to treatment or from week to week?
We saw that the ratio of new vs. returning visitors was not a factor. But did the nature of the incoming traffic between these two weeks differ in any other way?
We are extracting and analyzing stratified reports to identify characteristics that may have changed significantly from week 1 to week 2.
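A stratified report of this kind can be sketched as follows: for each week and traffic segment, compute the segment's share of that week's visits and its bounce rate, then compare the weeks side by side. The visit records and segment names below are entirely hypothetical and are not data from the case study.

```python
from collections import defaultdict

# Hypothetical visit records as (week, segment, bounced); not case-study data.
visits = (
    [(1, "paid search", True)] * 2 + [(1, "paid search", False)] * 8
    + [(1, "organic", True)] * 1 + [(1, "organic", False)] * 9
    + [(2, "paid search", True)] * 4 + [(2, "paid search", False)] * 12
    + [(2, "organic", True)] * 1 + [(2, "organic", False)] * 3
)

def stratified_report(visits):
    """Return {(week, segment): {"share": ..., "bounce_rate": ...}}."""
    bounces = defaultdict(int)
    counts = defaultdict(int)
    week_totals = defaultdict(int)
    for week, segment, bounced in visits:
        counts[(week, segment)] += 1
        bounces[(week, segment)] += int(bounced)
        week_totals[week] += 1
    return {
        key: {
            "share": counts[key] / week_totals[key[0]],  # segment's share of the week
            "bounce_rate": bounces[key] / counts[key],
        }
        for key in counts
    }

report = stratified_report(visits)
for (week, segment), stats in sorted(report.items()):
    print(f"week {week} {segment:12s} "
          f"share={stats['share']:.0%} bounce={stats['bounce_rate']:.0%}")
```

A large shift in a segment's share from one week to the next, as in this made-up data, would be the kind of change in incoming traffic that a Selection Effects check is looking for.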
In summary, even though we haven’t yet established with certainty what caused the anomaly in Case Study 1, we continue to investigate and we’re confident we’ll establish what the key factors were.
Avoiding threats to test validity: 10 ways
While it would be a mistake to hastily assume that you know what caused a discrepancy in test results, it is an even bigger mistake to inadvertently act on bad data because you failed to thoroughly check for validity threats.
Following are 10 key ways you can recognize and avoid the most common, most damaging threats to test validity:
Using the methods and principles outlined in this brief, you can avoid being blindsided by test-spoiling validity threats and making a bad business decision based on a “statistically valid” sample size alone.
Editor(s) — Frank Green
Writer(s) — Peg Davis
Contributor(s) — Adam Lapp
HTML Designer — Cliff Rainer