We love A/B testing here at Fusion Media Group. We’ve come a long way since 2015, when we first started testing: in the past three years we’ve grown the testing culture from none at all to A/B tests being a key and necessary part of the product development process. Experiments allow us to measure and be confident in the effect new features are having on user experience, and provide product with data to prove out the success (or failure) of their projects.
We actually ended up running fewer A/B tests this year compared to last: we ran 61 tests in 2017, compared to 74 in 2016. I think this was a result of product’s focus on migrating a bunch of great sites to Kinja: these projects took a lot of time and manpower, but in general didn’t involve anything that could be A/B tested. Hopefully in 2018, with most of the migrations under our belt, product can focus more on iterating on Kinja and A/B testing will ramp up again.
Here are some key stats on 2017 experiments:
- We ran 61 tests, which averages out to more than 1 per week
- 54% of tests led to a feature going live
- 21% of tests resulted in a feature going live despite some negative results
- 36% of tests led to further tests
- 11% of tests failed, usually due to analytics/tracking bugs caught too late
- 34% of tests uncovered bugs in either tracking or feature functionality
- 54% of tests were monetization/revenue focused
Over a third of the tests we ran led to additional tests, which was often by design. Most of the projects the product team works on aren’t simple enough to fit into a traditional A/B test. Product is usually working on complex projects that involve making a lot of decisions along the way, so we end up breaking projects up into smaller, testable chunks.
For example, we launched infinite scroll earlier this year, functionality that loads additional articles for the reader when they reach the bottom of the story they’re on. This was a huge project with a lot of considerations. What types of articles should we load: related stories, popular stories, or stories from other blogs in our network? How many articles should we load: should it actually be ‘infinite’, or should we stop after a certain number? How should we handle the comments section: does the design of that module need to change, and should we truncate it so it’s clearer there’s another story below?
Trying to test all of these options in a single test would be far too much to untangle, and poor experiment design besides, so we ended up running four separate tests to build this product out iteratively.
Over half of all tests this year were monetization/revenue focused. This is probably in part due to a shift in our product team and company’s priorities, which I won’t opine on here (#UCIEmployee), but it’s also in large part because we have a robust, streamlined process for running monetization tests, so getting an ad-related test up and analyzing the results requires little overhead from our team.
Experiments usually involve scoping out, implementing and QAing tracking for the feature we want to test, but for revenue tests, we can often skip these steps. We’ve built out the ability to pull ad data from DFP by experiment branch using key-values (which are simultaneously my favorite thing and the bane of my existence), so setting up a monetization test usually just involves pushing some new values to DFP’s admin.
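Once the report comes back from DFP broken out by an experiment key-value, the analysis is mostly simple aggregation. Here’s a minimal Python sketch with made-up numbers and an assumed key-value name (`exp_branch`), just to illustrate the shape of it:

```python
from collections import defaultdict

# Hypothetical report rows as they might come back from DFP, broken out
# by an experiment key-value ("exp_branch" is an assumed name, and the
# numbers are purely illustrative).
report_rows = [
    {"exp_branch": "control", "impressions": 120_000, "revenue": 96.00},
    {"exp_branch": "control", "impressions": 80_000,  "revenue": 70.40},
    {"exp_branch": "variant", "impressions": 118_000, "revenue": 108.56},
    {"exp_branch": "variant", "impressions": 82_000,  "revenue": 77.90},
]

def rpm_by_branch(rows):
    """Aggregate revenue per thousand impressions (RPM) per branch."""
    totals = defaultdict(lambda: {"impressions": 0, "revenue": 0.0})
    for row in rows:
        t = totals[row["exp_branch"]]
        t["impressions"] += row["impressions"]
        t["revenue"] += row["revenue"]
    return {
        branch: round(t["revenue"] / t["impressions"] * 1000, 2)
        for branch, t in totals.items()
    }

print(rpm_by_branch(report_rows))  # → {'control': 0.83, 'variant': 0.93}
```

With the branch baked into the ad request as a key-value, a comparison like this falls out of a standard DFP report with no extra tracking work on our side.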
Ad tests can also generally run for shorter periods of time, since we serve ads frequently enough to reach significance in the results quickly. This means we can run a few of these types of tests in the amount of time it might take us to run one for a more specific Kinja feature.
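To see why volume matters so much here, consider a standard two-proportion z-test (a sketch with illustrative numbers, not our actual analysis pipeline): the same lift that is decisive at ad-level traffic is indistinguishable from noise at a small fraction of it.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a difference between two proportions (pooled z-test)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative: a 1.1% vs 1.0% click-through rate.
# At ~2M impressions per branch the difference is decisively significant...
print(two_proportion_z(22_000, 2_000_000, 20_000, 2_000_000))  # effectively 0
# ...while the identical lift at 20k impressions per branch is not.
print(two_proportion_z(220, 20_000, 200, 20_000))  # ≈ 0.33, not significant
```

Ad impressions pile up orders of magnitude faster than interactions with most individual Kinja features, so the required sample size is reached in days rather than weeks.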
A third of our tests this year uncovered previously missed bugs, in either feature functionality or how features were being tracked. In about a third of these cases, the bugs found were bad enough that we couldn’t accurately pull data for the test and we had to mark it as a failure, but a lot of times tests uncovered smaller bugs that didn’t nullify the test results.
We’re glad our tests caught issues before features made it to production, but we also don’t want A/B tests to become or replace standard QA. A/B tests take too much time and too many resources to run and analyze to be the place in the funnel where bugs are caught. A goal of ours this year is to find a solution for more comprehensive QA to catch more bugs prior to testing, so we can use A/B tests for what they are designed for: measuring how new features affect the user and site experience.
We use the server-side implementation of Google Content Experiments to run our A/B tests, which gives us more flexibility to run tests often and on a large scale. But as A/B testing has grown over the past few years, we’ve realized some weak points in our setup, which we’re hoping to improve this year. Here are some of our 2018 goals:
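With a server-side setup, the site itself has to assign each user to a branch before reporting the chosen variation back to Google. A common approach, and a reasonable sketch of how this kind of assignment can work (not necessarily our exact implementation), is deterministic hashing of a stable user ID:

```python
import hashlib

def assign_branch(user_id: str, experiment_id: str, weights: dict) -> str:
    """Deterministically bucket a user: same user + experiment -> same branch.

    `weights` maps branch name -> percentage of traffic; values must sum to 100.
    Hashing the experiment ID in alongside the user ID keeps assignments for
    different experiments independent of each other.
    """
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable number in [0, 100)
    threshold = 0
    for branch, weight in weights.items():
        threshold += weight
        if bucket < threshold:
            return branch
    raise ValueError("weights must sum to 100")

# The same inputs always land in the same branch, with no session
# state to store:
assert assign_branch("user-123", "infinite-scroll", {"control": 50, "variant": 50}) \
    == assign_branch("user-123", "infinite-scroll", {"control": 50, "variant": 50})
```

Because the assignment is a pure function of the IDs, any server in the fleet computes the same answer, which is part of what makes server-side testing scale well.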
- Switch to Google Optimize from Google Content Experiments
- Streamline the test set up process
- Build out the ability to run multiple experiments at once
Another blog post to come, if/when we accomplish these!