Here at Gawker, we’ve dabbled with A/B testing before, but 2015 was the first year we started taking testing seriously. A/B tests are now an integral part of our product workflow, helping us make smarter, better-informed decisions.
We’ve run 30 tests for our product team over the course of the year, on features ranging from our site redesign to how we display comments to the algorithm powering our recommendation module. We started running tests in April, and have averaged about one experiment per week since then.
Here’s what we’ve learned:
To Change or Not to Change
A/B tests allow us to look at the effect a new product feature will have on key user metrics, giving us an indication of how it affects the user experience: does changing the position of the sidebar make users more likely to click on another story? Does it change how long they stay on our sites, or how many pages they visit?
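The post doesn’t describe the statistics behind these comparisons, but a common way to check whether a click-through-rate difference between two variants is real or just noise is a two-proportion z-test. The sketch below is a generic illustration with invented numbers, not Gawker’s actual data or tooling:

```python
from math import sqrt

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Z-score for the difference in click-through rate between two variants."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both variants share one CTR.
    p = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / se

# Hypothetical example: variant B (520 clicks / 10,000 views)
# vs. control A (450 clicks / 10,000 views).
z = two_proportion_z(450, 10_000, 520, 10_000)
```

With a z-score above roughly 1.96, the difference would be significant at the conventional 5% level; below that, the "winner" may just be random variation.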
We ran experiments where the data told us exactly what we thought it would, and ones that told us the opposite of what we expected. Here’s the breakdown of the tests we did this year and their outcomes:
- 50% of tests led to product features going live (and, conversely, 50% resulted in us deciding not to make a change):
  - 37% of tests had clear, positive results in favor of the change and led to a product feature going live.
  - 13% of tests did not have strong results in favor of the change (and occasionally had results specifically against it), but still led to a product feature going live after internal consensus.
- 20% of tests led to additional experiments (to confirm unexpected test results, test revised versions, etc.)
Most of the time, if an experiment does not show a clear improvement in user experience, we try to gather as much information as we can on what went wrong and go back to working on the feature. There were a few cases, though, where we decided to set a feature live even with not-so-stellar results.
For example, our new homepage design, in which we removed insets and cleaned up some visual elements, had no significant effect on pages per session or session duration, and had a negative effect on homepage article click-through rate. But the new design helped overall page performance and simplified our code, both of which were important goals for our Tech team this year, so we decided to move forward with it.
There are also times when a change may improve one metric but harm another: if you get users to click more often on links to other stories, that may also mean they finish posts less often. When that happens, we have to decide which metric to prioritize, a decision that usually weighs what we think is best for the user, our business needs, and overall company strategy.
Data Don’t Lie
An unexpected perk of our foray into A/B testing is how many bugs we’ve caught. We tend to pull our test results by device category or browser width, which means it’s often very clear when something is wrong and where.
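Breaking results down by segment is what surfaces these bugs. The post doesn’t say how Gawker computes these breakdowns; the sketch below is a minimal, hypothetical version using invented session records, showing how a per-device split can expose a problem that an overall average would hide:

```python
from collections import defaultdict

# Hypothetical session records: (device, variant, session duration in seconds).
sessions = [
    ("desktop", "control", 310), ("desktop", "redesign", 340),
    ("tablet",  "control", 280), ("tablet",  "redesign", 305),
    ("mobile",  "control", 260), ("mobile",  "redesign", 190),
    ("mobile",  "control", 250), ("mobile",  "redesign", 185),
]

def mean_duration_by_segment(rows):
    """Average session duration per (device, variant) bucket."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [running sum, count]
    for device, variant, duration in rows:
        totals[(device, variant)][0] += duration
        totals[(device, variant)][1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}

means = mean_duration_by_segment(sessions)
# If one device bucket drops sharply while the others improve (here, mobile),
# that pattern points at a bug in that environment rather than a real effect.
```

A lopsided split like this is exactly the signal described above: the redesign "wins" on desktop and tablet but something is broken on mobile.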
We tested and released a large site redesign over the summer, in which we introduced larger images, improved our typography, and made some other layout changes. Things were looking good after our first test (average session duration was up on desktop and tablet!), but session duration was mysteriously lower on mobile. We scratched our heads coming up with theories: maybe the font was too big and it was too hard to read on a phone?