The Kinja team relies on A/B and multivariate testing for numerous reasons: weighing the effect of changes to the platform, deciding which projects deserve further investment, and uncovering issues as part of the QA process. We noted last year how important these experiments are to our product culture, and this past year was no different.
In 2018, we ran more experiments than ever, invested in our underlying test architecture and had some fun along the way!
A new record
We ran 90 A/B tests in 2018, a new annual record since we added the practice to our tool belt in 2015. That's up from 61 experiments in 2017, a year in which the effort required to migrate a handful of wonderful sites to Kinja left less time for experimentation.
Here are some other notable stats from 2018 (a single test can fall into more than one category, so the percentages sum to more than 100%):
- 36% of tests led to a feature going live following positive or neutral results
- 4% of tests led to a feature going live despite less than ideal results
- 37% of tests led to additional experiments
- 20% of tests led to the discovery of bugs
- Nearly 7% of tests failed outright, due to either a showstopping bug or a tracking problem
- 16% of tests led to a feature being rethought, punted to the future or canned
Fewer tests led to features launching in the face of negative results, speaking to both the iterative nature of our process and the Kinja team's data-driven culture. In 2017, 21% of tests resulted in a feature going live in these situations; this year saw improvement, as we more often responded to a poor result by discovering the underlying cause and testing again until we were confident the change improved our platform in a meaningful way.
Additionally, sometimes a feature requires going back to the drawing board or depends on the launch of another feature. For example, this year we tested adding a new recirculation module between infinite-scroll posts on the article page. While this led to several experiments, we did not see the results we wanted and shelved the project.
Long-term projects lead to lots of tests
Many of our display advertising tests stemmed from a change in our ad code library: migrating from Kinja's traditional system to the ad code The Onion used prior to their move onto our platform. Twenty-six percent of the display tests (or about 7% of all 2018 tests) were the result of this ad library migration.
Longer-term projects often involve several A/B tests, a result of the iterative nature of the team's development process. Sometimes, however, these tests aren't tied to a specific feature at all, but are instead an attempt to understand how existing features affect our key metrics. For example, we ran several multivariate experiments in Q3 and Q4 to better understand the impact our currently live programmatic partners have on our ad inventory fill rate. These experiments didn't necessarily lead to the immediate development of products, but they gave us the information we needed to prioritize certain fixes or features.
What works for some, but not for all
Broken out by track, the areas we tested most frequently were related to monetization (display and video tracks) and the user experience (discovery):
Does this mean that we only developed features that were user-facing or focused on generating more revenue? Certainly not. Instead, it shows that some features are easier to test than others.
Due to the size of our network and the frequency with which we show display ads, monetization-related tests require little time to gather enough data to reach statistical significance. Because we measure metrics like viewability, which is based on ad impressions that occur frequently, we can run these tests for about a day, while experiments on features that see fewer interactions need to run longer. Some features don't lend themselves to A/B testing at all: many of our interactives team's projects, like the hit PatriotHole game, Protect Your Gold From Barack Obama, or The Root 100, are built from scratch and have no control variant to compare against.
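To make the significance point concrete, here's a minimal sketch of the kind of two-proportion z-test used to decide whether a difference in a rate metric like viewability is real. This isn't our production tooling, and the impression counts are made up; it just illustrates why a high-volume metric reaches significance in about a day:

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions
    (e.g. viewable impressions / total impressions per variant)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# A day of display traffic yields hundreds of thousands of impressions
# per variant, so even a small lift in viewability is detectable.
z, p = two_proportion_z_test(140_000, 250_000, 141_500, 250_000)
```

With sample sizes this large, a 0.6-point difference in viewability is already highly significant; a feature that sees a few hundred interactions a day would need weeks to distinguish the same relative lift from noise.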
Other tracks face similar constraints. The performance team, which formed this year, focused at first on rewriting code for lower-traffic pages, where results can't reach statistical significance in the amount of time we realistically allot to an experiment given how frequently we A/B test. The publishing team primarily develops CMS features used only by our writers, which likewise limits our ability to reach significance.
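The traffic constraint can be quantified with a back-of-the-envelope duration estimate. This sketch uses Lehr's rule of thumb (roughly 80% power at a 5% significance level) with invented traffic numbers, not figures from our network:

```python
def days_to_significance(base_rate, relative_lift, daily_visitors_per_variant):
    """Rough experiment duration via Lehr's rule of thumb:
    n ~= 16 * p * (1 - p) / delta^2 visitors per variant,
    where delta is the absolute difference we want to detect."""
    delta = base_rate * relative_lift
    n_per_variant = 16 * base_rate * (1 - base_rate) / delta ** 2
    return n_per_variant / daily_visitors_per_variant

# High-traffic pages: 5% conversion rate, detect a 5% relative lift.
fast = days_to_significance(0.05, 0.05, 500_000)  # a fraction of a day
# A low-traffic page or CMS feature: same effect, far fewer daily users.
slow = days_to_significance(0.05, 0.05, 2_000)    # a couple of months
```

The required sample size is the same in both cases; only the daily traffic changes, which is why a test that takes a day on the article page would have to run for months against a page only our writers see.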
What this shows is that while A/B testing can be an important tool for evaluating a feature, it is just one of many in our set for sculpting Kinja.
Investing in the A/B testing system
Last year, we noted our goals for 2018 were to switch from Google Content Experiments to Google Optimize, streamline our process for setting up experiments and add the ability to run concurrent tests. As anyone who has developed features before knows, sometimes things do not go according to plan!
We decided to stick with Google Content Experiments because Google Optimize didn't offer support for the GA Management API, which we rely upon to automate our test setup. We also ran into challenges implementing concurrent experiments, but that doesn't mean we went without wins.
This past year, we changed our underlying A/B test system—which relies on a server-side implementation of Content Experiments—to remove a dependency on our codebase. This lets us start experiments without a pull request, a benefit when our codebase is frozen, and eliminates the wait for a deploy before starting a test. We also reduced the time required to start a test by consolidating several separate manual steps into one nifty CLI tool.
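For readers unfamiliar with server-side experimentation, the core idea can be illustrated with a common pattern: hashing a stable user ID into a weighted bucket, so assignment is sticky across requests without storing any state. This is an illustrative sketch, not Kinja's actual implementation; the function and IDs are hypothetical:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, weights: dict) -> str:
    """Deterministically bucket a user into a variant server-side.
    Hashing the user and experiment IDs together keeps assignment
    consistent for a user yet independent across experiments."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits onto [0, 1).
    point = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # guard against floating-point rounding

weights = {"control": 0.5, "treatment": 0.5}
# Same inputs always produce the same bucket.
assert assign_variant("reader-123", "exp-42", weights) == \
       assign_variant("reader-123", "exp-42", weights)
```

Because the assignment is a pure function of the IDs, no deploy or database write is needed to keep users in their bucket, which is what makes a codebase-independent experiment system possible.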
So what’s next for us?
In 2019, we’ll continue to focus on integrating A/B testing into Kinja by developing an API that allows the product and data teams to start and monitor experiments within Kinja. We also hope to reduce the amount of time required to analyze results by automating some of the common data pulls from BiqQuery. And we also plan on revisiting our multiple experiments system to see if we can get everything working to our standards.
We’ll be certain to keep you posted as we make progress on these various projects.