[Image: an actual picture of me after finding a long-term bug in our A/B testing infrastructure]

We’ve been working to improve our A/B testing infrastructure so that we can run multiple tests at once, and through this work we discovered a long-standing bug in our system that affected our ability to split users into random, persistent groups.

A/B testing relies on the assumption that the groups in an experiment are randomized and distinct, so this was not good! In this post, I’ll walk through how we updated our infrastructure, what we found, and how we’ve been trying to fix it.

Updated Segmentation

The big blocker to handling multiple experiments was just that we were storing the experiment ID as a single, hardcoded parameter in our frontend codebase, so our system had no way of handling more than one experiment ID.

As explained in the last post, we split our users into random groups on the Fastly level and assign these groups to experiment variants. To handle multiple experiments, we updated the Fastly configuration to map IP buckets to both a branch and an experiment ID.
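
As a rough sketch of what that mapping does (the real assignment lives in our Fastly configuration; the bucket count, ranges, experiment IDs, and variant names below are made up for illustration):

```python
import hashlib

# Illustrative sketch only: the real assignment happens in our Fastly (VCL)
# configuration. Bucket ranges, experiment IDs, and variant names are made up.
NUM_BUCKETS = 100

EXPERIMENT_MAP = [
    # (bucket range, experiment ID, variant)
    (range(0, 10),  "experiment_1", "control"),
    (range(10, 20), "experiment_1", "variant_1"),
    (range(20, 30), "experiment_2", "control"),
    (range(30, 40), "experiment_2", "variant_1"),
    # Buckets 40-99 are not enrolled in any experiment.
]


def ip_bucket(ip: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash an IP address into one of `num_buckets` stable buckets."""
    digest = hashlib.md5(ip.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def assign(ip: str):
    """Return (experiment_id, variant) for an IP, or None if not enrolled."""
    bucket = ip_bucket(ip)
    for bucket_range, experiment_id, variant in EXPERIMENT_MAP:
        if bucket in bucket_range:
            return experiment_id, variant
    return None


print(assign("203.0.113.7"))
```

The property the whole scheme relies on is that the same input always lands in the same bucket - which, as we’ll see, is where the trouble starts.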

So now, our system should be able to handle more than one distinct experiment. Or so we thought...

What went wrong

To test our new multiple-experiment setup, we ran two concurrent A/A tests (tests where we split users into groups but don’t introduce any change, so if everything is working correctly we expect the groups to behave identically).

To our surprise, though, a fair number of users appeared to be placed into both experiments, even though our mapping should have meant that each IP placed a user in a single branch of a single experiment.

In the sessions with both experiment IDs, each pageview in the session was assigned both experiment IDs (and different variants).

Why did it go wrong?

After a lot of head scratching and poring over the data, we found the issue: it turns out that a lot of users visit our network more than once a day.

Why does this matter? Because we have no guarantee that a user’s IP address will stay the same from session to session. On mobile especially (where we were seeing the highest rate of sessions with multiple experiment IDs), users are likely connecting via 4G and a succession of different wifi networks, so their IP address can easily change throughout the day.

The bad news is that this isn’t an issue specific to running multiple experiments - it’s an issue that has existed, under the radar, in our experiment infrastructure all along.

We built our A/B testing system on the assumption that a user’s IP address was a stable, consistent value, but this turned out to be a faulty assumption - one that resulted in some users being exposed to multiple variants during an experiment.

Users being exposed to more than one variant is bad: A/B tests are dependent on the assumption that we can break users up into distinct groups and control exactly what those groups are seeing. If a user comes to our site and sees a blue button, and then comes back and sees a green button, they may be more likely to click on the green one because they notice a difference, adding an unforeseen bias to our results.

Some further, very GA-specific information

You may be thinking, ‘Wait, that doesn’t answer why the sessions were being placed in both experiments, though!’ Wow, true. We need to get deep into some Google Analytics dimension scoping to answer that part.

Google’s documentation on custom dimensions and their scopes: https://support.google.com/analytics/answer/2709828

If you don’t want to get into the nitty-gritty of GA, feel free to skip this part. The short version is that users aren’t actually being placed into multiple experiment IDs/variants in a single visit - that’s just an oddity of how GA reports experiment data when a user’s experiment ID/variant changes from visit to visit.

Dimensions in GA can have a hit-level scope, a session-level scope or a user-level scope:

  • Hit-level - the dimension value sent along with the pageview will only be applied to that single ‘hit’ (i.e. pageview). Something like Article Headline is hit-level scoped, so that we can send distinct values for each page you visit.
  • Session-level - the most recent dimension value sent along with a pageview is applied to all previous pageviews in that session. We use this for logged-in status: if I’m logged out on my first pageview, then log in, and my second pageview reports that I’m logged in, GA will overwrite all the pageviews in my session to report ‘logged in.’
  • User-level - the most recent dimension value sent along with a pageview is applied to all previous and future pageviews and sessions for that user. We used this to track adblock recovery status: if a user saw the adblock recovery module, we set a user-level dimension saying so, which lets us track the longer-term behavior of the users who were exposed to that module.

So session- and user-scoped dimensions overwrite old values for previous pageviews/sessions with the most recent value, whereas a hit-scoped dimension is confined to a single pageview (or event).
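
To make the three scopes concrete, here’s a toy Python model of how reporting behaves under each scope as described above - it mimics the reporting rules, not GA’s internals, and the dimension names and values are invented:

```python
# Toy model of GA dimension scoping as described above (not GA's internals).
# Each hit is a pageview carrying some made-up custom dimension values.
sessions = [
    # Session 1: two pageviews; the user logs in between them.
    [
        {"headline": "Story A", "logged_in": "no"},
        {"headline": "Story B", "logged_in": "yes"},
    ],
    # Session 2: one pageview; the user sees the adblock recovery module.
    [
        {"headline": "Story C", "logged_in": "yes", "adblock_recovery": "seen"},
    ],
]

for i, session in enumerate(sessions, start=1):
    # Hit scope: each pageview keeps its own value.
    headlines = [hit["headline"] for hit in session]

    # Session scope: the most recent value in the session is applied to the
    # whole session (every session here sends at least one logged_in value).
    logged_in = [hit["logged_in"] for hit in session if "logged_in" in hit][-1]

    # User scope: the most recent value across all sessions is applied to all
    # of the user's previous and future sessions.
    adblock = None
    for s in sessions:
        for hit in s:
            adblock = hit.get("adblock_recovery", adblock)

    print(f"session {i}: headlines={headlines}, "
          f"logged_in={logged_in}, adblock_recovery={adblock}")
```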

Turns out experiment dimensions don’t follow any of these patterns, but instead have their own unique scopes:

  1. The GA experiment ID dimension is user-scoped for the duration of the experiment, but its values are not overwritten
  2. The GA experiment variant dimension is also user-scoped for the duration of the experiment, but its value can be overwritten

We can see what this means in practice by walking through a sample user. In this example, the user comes to our site four times over the course of the experiment. In their first two sessions they’re placed in the same experiment ID and variant (maybe they’re at work on their office’s wifi), but by session 3 their IP address has changed (maybe they went home and are now on their home wifi) and they’re placed into a different branch of the experiment.

Because the experiment ID dimension is user-scoped but doesn’t get overwritten, any experiment ID the user has been exposed to over the course of the experiment will persist throughout all of their sessions while the experiment(s) is live. Session 3 in this example is what we were seeing in the data that confused us so much - it turns out users aren’t being placed into multiple experiments in the same session; it’s just that GA ‘remembers’ all the experiment IDs that user has seen.

Once the experiments are over, though, GA resets - the experiment ID values don’t persist after the experiments have ended (otherwise, some of our users would have hundreds of experiment IDs reported with their sessions by now...).

The reason we didn’t notice this before is that the experiment variant dimension functions slightly differently: it’s user-scoped but can be overwritten.

In the same example, if we had only one experiment live (which has been the case up until now), the first variant the user was in (variant 0) doesn’t persist through the sessions - it’s instead overwritten and replaced by the next variant (variant 2).

This nuance is why the issue went undetected for so long - it was only once we tried running two experiments at once, and saw two experiment IDs being set on a single session when there should have been one, that the problem became apparent.
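
Putting the two behaviors together, here’s a toy model of what we’d see reported for a sample user like the one above (the experiment IDs and variants are illustrative, and the logic mimics the behavior we observed rather than anything GA documents):

```python
# Toy model of the reporting behavior we observed (not GA's documented internals).
# The sample user's four visits: their IP bucket changes between sessions 2 and 3.
visits = [
    {"experiment_id": "experiment_1", "variant": "0"},  # session 1: office wifi
    {"experiment_id": "experiment_1", "variant": "0"},  # session 2: office wifi
    {"experiment_id": "experiment_2", "variant": "2"},  # session 3: home wifi, new bucket
    {"experiment_id": "experiment_2", "variant": "2"},  # session 4: home wifi
]

reported_ids = set()     # experiment ID: user-scoped, never overwritten, so it accumulates
reported_variant = None  # experiment variant: user-scoped, but overwritten by the latest value

for i, visit in enumerate(visits, start=1):
    reported_ids.add(visit["experiment_id"])
    reported_variant = visit["variant"]
    print(f"session {i}: experiment_ids={sorted(reported_ids)}, variant={reported_variant}")

# From session 3 onward, the report shows *both* experiment IDs against the
# session, which is what made it look like users were in two experiments at once.
```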

Yikes, so have all of our experiments been a lie?

Finding a bug in the system we’ve been using for years to help product make data-backed decisions is obviously not great. Our biggest fear was that this bug would have biased our test data in a way that led us to draw false conclusions from our experiments.

We have access to our hit-level GA data - i.e. the ‘raw’ dataset of every user’s individual pageviews and events, rather than the aggregated versions of the data we access in the GA UI. We were able to use this data (which is stored in BigQuery, a Google data warehousing product) to identify and filter out the users who were exposed to multiple experiment IDs or variants.
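
As a rough sketch of that filtering step - here in pandas with made-up column names rather than our actual BigQuery schema:

```python
import pandas as pd

# Illustrative only: column names and values are made up, not our actual schema.
# Each row is a single hit (pageview/event) with the experiment dimensions attached.
hits = pd.DataFrame([
    {"user_id": "u1", "experiment_id": "experiment_1", "variant": "0"},
    {"user_id": "u1", "experiment_id": "experiment_2", "variant": "2"},  # exposed twice
    {"user_id": "u2", "experiment_id": "experiment_1", "variant": "1"},
    {"user_id": "u3", "experiment_id": "experiment_1", "variant": "0"},
])

# Count the distinct experiment IDs and variants each user was exposed to.
exposure = hits.groupby("user_id").agg(
    n_experiments=("experiment_id", "nunique"),
    n_variants=("variant", "nunique"),
)

# Keep only users who saw exactly one experiment and one variant.
clean_users = exposure[
    (exposure["n_experiments"] == 1) & (exposure["n_variants"] == 1)
].index
clean_hits = hits[hits["user_id"].isin(clean_users)]

print(clean_hits)
```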

We re-pulled a lot of our past A/B test results with these potentially biased users removed, and were relieved to find that the vast majority of the results and trends remained the same. We know we may also be biasing our results by removing these users - since the users exposed to multiple variants are our most engaged group, coming to the site multiple times a day - but the percentage of users removed from each test is small enough (and this issue doesn’t affect every user who visits our sites multiple times a day) that we feel okay about the results.

Next Steps

We’ve been brainstorming ways to improve our user segmentation, but it’s a non-trivial problem. Because we use a CDN and heavily cache our pages, we need to segment users on the Fastly level in order to serve them different versions of our sites.

We are testing out introducing a cookie that stores a user’s bucket value, computed from the IP address they use on their first visit to the site. We can then group users by this persistent bucket value rather than recalculating it on every visit. As long as we randomize which buckets are assigned to which variants (so certain buckets aren’t always assigned to the control group, or always to the experiment group), our variants should still be random, but hopefully now distinct and unchanging for the duration of the experiment.
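
A minimal sketch of the idea, assuming a hypothetical cookie name (in practice the cookie would be set and read at the CDN edge rather than in Python):

```python
import hashlib

# Illustrative sketch: in practice the cookie is set/read at the CDN edge.
BUCKET_COOKIE = "ab_bucket"  # hypothetical cookie name
NUM_BUCKETS = 100


def ip_bucket(ip: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash an IP into a stable bucket (same scheme as before)."""
    return int(hashlib.md5(ip.encode("utf-8")).hexdigest(), 16) % num_buckets


def get_or_assign_bucket(cookies: dict, ip: str) -> tuple[int, bool]:
    """Return (bucket, needs_cookie): reuse the cookie if present, otherwise
    derive the bucket from the first-visit IP and flag that a cookie should be set."""
    if BUCKET_COOKIE in cookies:
        return int(cookies[BUCKET_COOKIE]), False
    return ip_bucket(ip), True


# First visit: no cookie, so the bucket is derived from the IP and the cookie gets set.
bucket, needs_cookie = get_or_assign_bucket({}, "203.0.113.7")
# Later visit from a different IP: the cookie wins, so the bucket stays the same.
bucket_later, _ = get_or_assign_bucket({BUCKET_COOKIE: str(bucket)}, "198.51.100.42")
assert bucket == bucket_later
```

Because the bucket is fixed after the first visit, re-randomizing which buckets map to which variants for each new experiment keeps assignment random without letting a user’s group shift mid-experiment.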

For now, we’re pulling all A/B test results from BigQuery, filtering out users who were exposed to multiple variants. As mentioned before, this isn’t a great solution, but it’s an alright band-aid for the time being.