Site Optimization
Landing Page Conversion
Thursday, 23 August 2007
Topic: Landing Page Conversion: Getting Significant Improvements Even When You Can’t Complete Your Tests
We have all probably designed a test Webpage or offer email that we expected to dramatically outperform the control, and then been stunned when performance was poor or the results came back inconclusive.
What can you do to get significant improvements, or at least to learn something, in situations where you cannot complete your test or where you have a validity issue, particularly one connected to the size of your sample? When the differences in conversion between the control and the experimental treatments are so small that the test results don’t validate, is all that time and energy a total loss?
When we interpret test results, we can arrive at a conclusion in two ways: induction, in which the conclusion is supported by the premises but does not follow from them necessarily, or deduction, in which the conclusion follows necessarily from the premises, so that it cannot be false if the premises are true.
In this clinic, we will use induction to scrutinize recent research findings to determine when it is possible to draw valid and valuable conclusions from tests that are not statistically valid.
Editor’s Note: We recently released the audio recording of our clinic on this topic. You can listen to a recording of this clinic here: Landing Page Conversion: Getting Significant Improvements Even When You Can’t Complete Your Tests
Case Study 1
Test Design
We conducted a 26-day experiment for a non-profit foundation that raises money for Alzheimer’s research. The goal was to increase conversion and consequently increase the total donation amount.
Treatments
Which donation page will convert better?
Control (two-step process)
Treatment (one-step process)
Results
However . . . the Treatment page had a substantially higher average donation.
So, if we eliminate “outliers” by filtering out all donations beyond two standard deviations from the mean, can we then establish validity using dollar amounts?
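As a rough illustration of the filtering step described above, here is a minimal sketch that drops values more than two standard deviations from the mean. The function name `remove_outliers`, the use of the sample standard deviation, and the donation amounts are all assumptions made for illustration; they are not the study's data or its exact procedure.

```python
from statistics import mean, stdev


def remove_outliers(amounts, k=2.0):
    """Keep only values within k sample standard deviations of the mean."""
    if len(amounts) < 2:
        # A standard deviation needs at least two observations.
        return list(amounts)
    m = mean(amounts)
    s = stdev(amounts)
    return [a for a in amounts if abs(a - m) <= k * s]


# Hypothetical donation amounts in dollars (illustrative only).
donations = [25, 50, 50, 100, 25, 75, 50, 40, 60, 5000]
filtered = remove_outliers(donations)

print("mean with outliers:   ", round(mean(donations), 2))
print("mean without outliers:", round(mean(filtered), 2))
```

A single very large gift can dominate the average, so comparing the mean before and after filtering shows how much of a page's apparent advantage depends on a handful of donors.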
Validity
With outliers removed, the Treatment page still had a much higher average donation:
Question
Can we wisely substitute dollars for conversion rate as the success measurement? For example:
If we substitute average contribution per donor ($/donor) in place of the number of donors who contributed as a success measure, the validity test passes. If we can establish validity using dollar amounts, then we might come to these conclusions:
Unfortunately, it is NOT valid to simply substitute dollars for donors. Even so, deeper analysis of the data does appear to indicate a consistent pattern of higher contribution amounts for the Treatment-2 page than the Control page, and this insight is a central factor as we design the next round of tests. Remember, these are just informed inductive conclusions.
The key methods for gaining insights for this case:
Case Study 2
Test Design
We conducted a 25-day experiment for an e-commerce site that sells wheelchairs and medical equipment to both businesses and consumers. Which offer page will convert best?
Results
Since the test results were inconclusive, we should:
Remember, you are trying to challenge the Control with a Treatment that results in a significant (hopefully positive) difference.
Just a second . . . Before we simply discard the data and write off the test as a failure, is there ANYTHING of value we can learn from this test? Might we gain some insights from attributes beyond simply the overall test-long conversion rate?
Revenue per order
Both with and without outliers, the Control page had a much higher average revenue per order than the other two pages, especially the page with the highest conversion rate in the test sample.
Key point: Supplemental analysis through removal of outliers does not change the level of test validity, but rather offers additional insights about the test subjects and the testing conditions.
Question
How could treatments with such similar conversion rates vary so widely in revenue per order? When we analyzed the test results on a product level, we noticed two subtle but very important patterns:
Observations and Insights for follow-on testing:
The key methods for gaining insights for this case:
We analyzed the amount of variation in each variable by looking at measures of variance. Greater variance translates to a need for larger samples. It is also an indicator of possible hidden sources of difference during the test period.
We looked at the page performance data for each of the attributes measured and looked for connections between page design attributes and performance.
Case Study 3
Test Design
If you were in our last clinic on validation, you’ll remember the case study in which a test validated the first week, but by the end of the 2nd week, it did not. What if the test had simply run both weeks and at the end of the test period, it did not validate?
Stratifying the data
One way of gaining additional insights into test data is to “disaggregate” it by different attributes. Here, if you split the data into two distinct weeks, you see that the picture looks quite different:
The Control and Treatment 1 pages performed very consistently throughout the two-week test period. But the Treatment 2 page bounce rate changed dramatically, soaring from 12.9% in the first week to 20.7% in week 2. The cause of this change is still under investigation.
What can cause performance differences through time?
Seasonality:
Validity threats:
Questions:
Are there recognizable patterns when the relative performance of the Treatments varies significantly? Could identifying them result in validation for a specific set of conditions? Were there any “interesting” (i.e., test-threatening) events during the test period? Unless you stratify your data, you will not know.
The key methods for gaining insights for this case:
Case Study 4
Student Submitted Question
| Page | CTR |
|---|---|
| Red car—Square | 17.79% |
| Family in car | 17.84% |
| Men falling off log | 17.52% |
| Red car—Round | 17.01% |
What you need to understand: The sample size was big enough to say that there is a significant difference among the Click-Through Rates (CTR) performance of the four Treatments. However, if the first Treatment is the Control, then the sample is NOT big enough to say that the 2nd (“Family in car”) Treatment is significantly better than the Control. In fact, it would take more than 5 years to accumulate a sufficient sample to conclude this with a 95% level of confidence.
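To see why such a small gap needs so much traffic, here is a minimal sketch of a standard two-proportion sample-size estimate (normal approximation). The function name `sample_size_per_arm` and the 80% power setting are assumptions; the text above specifies only the 95% confidence level, and the "5 years" figure depends on the student's traffic volume, which is not given here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p1, p2, confidence=0.95, power=0.80):
    """Approximate visitors needed per page to detect p1 vs. p2.

    Uses the normal-approximation formula for a two-sided comparison
    of two proportions. The power value is an assumed setting.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# CTRs taken from the table above.
n_close = sample_size_per_arm(0.1779, 0.1784)  # Red car (Square) vs. Family in car
n_box = sample_size_per_arm(0.1779, 0.1701)    # Red car (Square) vs. Red car (Round)

print("per-page sample, 17.79% vs 17.84%:", n_close)
print("per-page sample, 17.79% vs 17.01%:", n_box)
```

Under these assumptions the 0.05-point gap calls for millions of visitors per page, while the 0.78-point gap needs only tens of thousands, which is consistent with the point that some comparisons in the same test can be resolved and others cannot.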
Test Interpretation
This means that we cannot conclusively say that any of the first three Treatments are “better” than another, using the data the student has collected to date. But, taking a closer look at Treatment 4 reveals a small but very important difference between it and the other Treatments.
[Image: start box for Treatments 1, 2, and 3]
[Image: start box for Treatment 4]
Presuming that the (red car) image is the same in Treatment 1 and Treatment 4, the shape of the start box (rounded vs. square) caused a significant drop in quote-starts.
So, perhaps the square start box is superior.
You may gain other, similar insights for subsequent testing through this form of analysis.
Research Findings: Observations and Principles
Here are some principles and methods for judiciously gaining insights from an invalid test.
- “The Null Conclusion.” First, record and make note of the fact that changing the variables in the ways that you did had little effect on performance. This may be the single most valuable insight you can gain. In subsequent tests, you should consider testing different variables or experimenting with more “radical” changes.
- Look for patterns of performance among treatments that share similar attributes, even those that identify things that you definitely DON’T want to do, such as those in the insurance quotes case study.
- Compare secondary or non-test-design measures of success, such as dollars in the donation site case study.
- Look for connections between treatment attributes and patterns of purchase behavior, such as the big difference in featured accessories sales vs. others in the medical company case study.
- Look for patterns of “seasonal” performance difference among the treatments based upon time. For example, does one version perform much better on weekends than on weekdays? How about morning vs. afternoon or nighttime? These are all examples of “seasonality.”
- Look for validity threats. When test results are surprising, look for hidden sources of difference. Look to ensure that your metrics are accurate and you have not suffered from History or Instrumentation effects. (See Test Validity brief for more on this subject.)
- Keep careful historical records of your tests and test results. You may be able to achieve conclusive results in an abbreviated time by exploiting statistical methods intended to consider prior knowledge or test results. MarketingExperiments is currently conducting ongoing research on the use of Bayesian methodology and other methods for improving the effectiveness and reducing the cost of testing.
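The time-based disaggregation suggested in the list above can be sketched as a simple grouping step. This is a minimal illustration, not the tooling used in the case studies: the function name `rate_by_stratum`, the record layout, and all visit counts are hypothetical, chosen so that the Treatment 2 rates match the 12.9% and 20.7% bounce rates quoted in Case Study 3.

```python
from collections import defaultdict


def rate_by_stratum(records):
    """Compute an event rate for each (treatment, stratum) pair.

    records: iterable of (treatment, stratum, visits, events) tuples,
    where stratum could be a week, a weekday/weekend flag, or a daypart.
    """
    totals = defaultdict(lambda: [0, 0])
    for treatment, stratum, visits, events in records:
        totals[(treatment, stratum)][0] += visits
        totals[(treatment, stratum)][1] += events
    return {
        key: events / visits
        for key, (visits, events) in totals.items()
        if visits  # skip strata with no traffic
    }


# Hypothetical counts (illustrative only).
records = [
    ("Control", "week 1", 1000, 150),
    ("Control", "week 2", 1000, 152),
    ("Treatment 2", "week 1", 1000, 129),
    ("Treatment 2", "week 2", 1000, 207),
]

rates = rate_by_stratum(records)
for (treatment, week), rate in sorted(rates.items()):
    print(f"{treatment:12s} {week}: {rate:.1%}")
```

Pooled over both weeks, the hypothetical Treatment 2 rate would be 16.8%, which hides the jump between weeks. The same function works for weekday vs. weekend or morning vs. afternoon by changing the stratum label.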
Remember, it is more important to conduct a useful test than a successful one.
To learn more about our key methods for gaining insight into invalid or incomplete tests, consider enrolling in the MarketingExperiments Professional Certification course in Online Testing.
RELATED MARKETING EXPERIMENTS REPORTS:
- Optimization Testing Tested: Validity Threats–Beyond Sample Size
- Landing Page Optimization—Big Conversion Gains from a Little Scissors and Grease?
- Landing Page Optimization Tested—How to Create “Sticky” Landing Pages
- Optimizing Landing Pages
As part of our research, we have prepared a review of the best Internet resources on this topic.
Rating System
These sites were rated for usefulness and clarity, but alas, the rating is purely subjective.
* = Decent | ** = Good | *** = Excellent | **** = Indispensable
- How To Measure the Success of Your Web App **
- Learning to Classify Incomplete Examples **
- Combining Dependent Tests with Incomplete Repeated Measurements **
- Validity (statistics) **
- Populations, Samples, and Validity **
Credits:
Editor(s) — Frank Green
Writer(s) — Bob Kemper
Peg Davis
Contributor(s) — Jeremy Brookins
Peg Davis
Jimmy Ellis
Bob Kemper
Flint McGlaughlin
HTML Designer — Cliff Rainer
Holly Hicks














