Here at Gawker, we’ve dabbled with A/B testing before, but 2015 was the first year we started taking testing seriously. A/B tests are now an integral part of our product workflow, helping us make smarter, better-informed decisions.

We’ve run 30 tests for our product team over the course of the year, on features ranging from our site redesign to how we display comments to the algorithm powering our recommendation module. We started running tests in April and have averaged about one experiment per week since then.

Here’s what we’ve learned:

To Change or Not to Change

A/B tests allow us to look at the effect a new product feature has on key user metrics, giving us an indication of how it affects the user experience: does changing the position of the sidebar make users more likely to click on another story? Does it change how long they stay on our sites, or how many pages they visit?

We ran experiments where the data told us exactly what we thought it would, and ones that told us the opposite of what we expected. Here’s the breakdown of the tests we did this year and their outcomes:

  • 50% of tests led to product features going live (and, conversely, 50% resulted in us deciding not to make a change).
  • 37% of tests had clear, positive results in favor of the change and led to a product feature going live.
  • 13% of tests did not have strong results in favor of the change (and occasionally had results specifically against doing so), but still led to a product feature going live after internal consensus.
  • 20% of tests led to additional experiments (to confirm unexpected test results, test revised versions, etc.).

Most of the time, if an experiment does not show a clear improvement in user experience, we try to gather as much information as we can on what went wrong and go back to working on it. There were a few cases, though, where we still decided to set a feature live despite not-so-stellar results.

For example, our new homepage design, in which we removed insets and cleaned up some visual elements, had no significant effect on pages per session or session duration, and had a negative effect on homepage article click-through rate. But the new design helped overall page performance and simplified our code, both of which were important goals for our Tech team this year, so we decided to move forward with it.

There are also times when a change may improve one metric but harm another - if you get users to click more often on links to other stories, that may also mean they finish posts less often. When that happens, we have to decide which metric to prioritize, a decision that usually comes down to a combination of what we think is best for the user, our business needs, and overall company strategy.

Data Don’t Lie

An unexpected perk of our foray into A/B testing is how many bugs we’ve caught. We tend to pull our test results by device category or browser width, which means it’s often very clear when something is wrong and where.

We tested and released a large site redesign over the summer, in which we introduced larger images, improved our typography and made some other layout changes. Things were looking good after our first test - average session duration was up on desktop and tablet! - but session duration was mysteriously lower on mobile. We scratched our heads coming up with theories: maybe the font was too big and it was too hard to read on a phone?

We presented our findings to the design team, letting them know that we couldn’t fully explain the mobile results. They looked into it further, QA-ing the new design on a few different devices, and finally found the answer: there was an issue with how the font displayed on very small browsers that made it fairly unreadable. Once the issue was fixed, we re-ran the test, and the results came back positive across all devices.

Experiment data has helped us expose multiple easily-missed bugs and fix them before going live.

Oops, We Were Tracking That Wrong

This year’s focus on running experiments has made us experts on the inner workings of Google Analytics Premium reporting, and it has also brought to light a few errors in our GA tagging setup.

Our site is responsive, so some features only show up on browsers above a certain width, or appear differently on different devices. That makes it important for us to be able to pull test results by browser width.

We had set up a custom metric to track this so that we could capture the specific width of a user’s browser (e.g. 700px vs. 701px) instead of grouping sessions into buckets (as we would need to do with a custom dimension). As we started running more experiments and pulling the data by this custom metric, we found that a lot of our mobile traffic was coming from 600px+ browsers, which seemed oddly large for a phone.
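
For context, the tagging looked something like the sketch below (a simplified illustration using analytics.js; the metric index and tracker ID are placeholders, not our exact configuration):

    // analytics.js exposes a global `ga` command queue; declare it for TypeScript.
    declare function ga(...args: unknown[]): void;

    // Original approach (simplified): attach the exact viewport width in CSS pixels
    // to every hit as a hit-level custom metric.
    ga('create', 'UA-XXXXXXX-1', 'auto');
    ga('set', 'metric1', window.innerWidth); // e.g. 320 on a small phone
    ga('send', 'pageview');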

Our design team tipped us off that something was wrong: our site’s breakpoints are based on points (logical pixels), not physical pixels (a concept I still don’t fully understand, but you can read about here), so our users couldn’t have been viewing our sites on a phone that was more than 600 logical pixels wide.
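
The gist, as far as I can tell (a rough sketch, assuming a standard browser environment; the example numbers are for a typical recent phone):

    // Logical (CSS) pixels are what layouts, media queries and breakpoints are
    // measured in; physical pixels are the actual dots on the screen.
    const logicalWidth = window.innerWidth;                        // e.g. 375 on a recent iPhone
    const physicalWidth = logicalWidth * window.devicePixelRatio;  // e.g. 750 at a devicePixelRatio of 2

    // Breakpoints are defined in logical pixels, so a phone should never report
    // a browser anywhere near 600 logical pixels wide, however dense its screen.
    console.log(logicalWidth, physicalWidth);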

After lots of digging and general confusion, we finally realized that browser width was set up as a hit-level custom metric, which meant it was being set on each pageview and then summed up across pageviews in reports. If a user viewed 3 pages of our site on a 320px-wide phone, their session was reported as having a browser width of 960px. We’ve since updated this to be a custom dimension that groups sessions based on our current site’s breakpoints, and our data makes a lot more sense!
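
The fix looks roughly like this (again a simplified analytics.js sketch; the dimension index and the breakpoint thresholds are illustrative, not our actual configuration):

    declare function ga(...args: unknown[]): void;

    // Map the viewport width (in CSS pixels) to a named breakpoint bucket.
    // These thresholds are placeholders, not our real breakpoints.
    function breakpointBucket(width: number): string {
      if (width < 600) return 'small';
      if (width < 900) return 'medium';
      return 'large';
    }

    // New approach: send the bucket as a custom dimension. Dimensions are
    // reported as labels, so nothing gets summed across pageviews.
    ga('set', 'dimension1', breakpointBucket(window.innerWidth));
    ga('send', 'pageview');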

The Future of A/B Testing at Gawker

Needless to say, we love A/B tests now. Testing has helped us roll out lots of great product features this year, in a well-researched, deliberate way. It’s been a huge step forward for us - gone are the days of setting things live and hoping for the best.

But even with our more data-driven approach, we always make sure to put the numbers into context: any decision we make takes into consideration our staff, the company at large, and our vision. Regardless of what the numbers say, we make sure that we don’t infringe on our editorial voice or freedom, and that the changes we make help the company evolve in the right direction.