We’ve always been big fans of A/B testing, and over the past year we’ve done a lot of work to build on and improve our experimentation infrastructure. We’ll be publishing a few blog posts about that work, but to start off we wanted to post about the basics of how our system works now!
For the past few years, A/B testing has been an important part of our product development process, as it allows us to isolate and quantify the effect product changes have on our users.
We split users up into two or more random groups, which should be basically identical in composition - the same proportion of mobile users, logged-in users, users from NYC, users visiting Deadspin, etc. - so we can expect them to behave the same way as well.
We then introduce a change to one of the groups (the experiment group) and compare this group’s behavior to the group without the change (the control group). If we observe any differences in behavior, we can be confident - with the help of statistical testing - that they were due to the change we introduced, as we’re keeping all other variables constant.
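To make the statistical-testing step concrete, here is a minimal sketch of one common choice, a two-proportion z-test on a conversion-style metric. The post doesn't say which test is actually used, so the test itself, the function name, and the counts below are all assumptions for illustration.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_experiment, n_experiment):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_control = conv_control / n_control
    p_experiment = conv_experiment / n_experiment
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (conv_control + conv_experiment) / (n_control + n_experiment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_experiment))
    z = (p_experiment - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value means a difference this large would be unlikely if the change had no effect, which is what lets us attribute the difference to the change rather than to chance.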
We have a server-side implementation of Google Content Experiments within Google Analytics, so we can handle the user segmentation ourselves, but still collect and use the data within GA.
The alternative to a server-side implementation is a client-side one, which often involves reloading the page on a user’s first pageview: the browser registers which group the user is supposed to be in and then redirects them to the correct version of the site. This can negatively affect user experience and mess with internal reporting, so we handle the segmentation and versioning on the server side instead.
We use Fastly as our Content Delivery Network (CDN) and cache our pages pretty heavily. The downside is that a user in an experiment group could be served a cached copy of a page that was rendered for a different group, and so never see the version of the site they were supposed to.
So we built our experiment infrastructure into Fastly: we do the user segmentation there and then vary the cache on these segments in order to serve the correct version of our site to each group.
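The idea of varying the cache on segments can be sketched with a toy cache in Python. This is not Fastly configuration; `EdgeCache` and `render_page` are hypothetical names, and the only point is that the variant becomes part of the cache key.

```python
class EdgeCache:
    """Toy stand-in for a CDN cache whose key includes the experiment variant."""

    def __init__(self, render_page):
        self._render_page = render_page  # origin call: (url, variant) -> body
        self._store = {}
        self.origin_hits = 0

    def get(self, url, variant):
        # Keying on (url, variant) keeps each group's pages separate.
        key = (url, variant)
        if key not in self._store:
            self.origin_hits += 1
            self._store[key] = self._render_page(url, variant)
        return self._store[key]


def render_page(url, variant):
    # Hypothetical origin renderer.
    return f"{url} rendered for {variant}"


cache = EdgeCache(render_page)
cache.get("/story", "control")
cache.get("/story", "control")     # served from cache
cache.get("/story", "experiment")  # different variant, separate cache entry
print(cache.origin_hits)
```

Each variant still gets the benefit of caching; the cost is one cached copy of a page per variant rather than one copy overall.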
In Fastly, we take a hashed version of a user’s IP address and then take the last digit of this hash - which is one of 16 possible hexadecimal values, 0-9 or a-f. We then assign each of these 16 buckets to an experiment variant, setting the bucket-variant assignments as key-value pairs in an Edge dictionary. We could technically run a 16-arm A/B test, with one IP bucket per variant, but typically our tests are simpler than that and we’ll assign multiple buckets to a variant.
IP address: 188.8.131.52
Last digit: 5
Placed in Experiment Branch 1
What the actual dictionary looks like in Fastly:
Note: we’re not storing a user’s IP in any way, only running it through the hash algorithm and keeping the last digit of the output.
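The bucketing described above can be sketched in a few lines of Python. The post doesn't name the hash algorithm, so SHA-256 here is an assumption, and the dictionary contents, variant names, and function names are hypothetical stand-ins for the real Edge dictionary.

```python
import hashlib

# Hypothetical bucket-to-variant assignments, standing in for the
# key-value pairs in the edge dictionary: 8 buckets per variant.
BUCKET_TO_VARIANT = {
    **{digit: "control" for digit in "01234567"},
    **{digit: "experiment_1" for digit in "89abcdef"},
}


def bucket_for_ip(ip_address):
    """Hash the IP and keep only the last hex digit (one of 16 buckets)."""
    digest = hashlib.sha256(ip_address.encode("utf-8")).hexdigest()
    return digest[-1]  # only this single digit is used; the IP is not stored


def variant_for_ip(ip_address):
    """Look up the variant assigned to this IP's bucket."""
    return BUCKET_TO_VARIANT[bucket_for_ip(ip_address)]


# 203.0.113.7 is a documentation-only address, used purely as an example.
print(bucket_for_ip("203.0.113.7"), variant_for_ip("203.0.113.7"))
```

Because the hash is deterministic, the same IP always lands in the same bucket, and changing the split between variants only means editing the dictionary.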
Our development team is a continuous integration shop and generally pushes all code and new features live behind feature flags. We use an internal feature-flag management service to specify which flags should be turned on for which variants. Fastly then serves each user group the requested page with the feature flags we configured turned on.
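As a rough sketch of how variants map to feature flags, the lookup could look like the Python below. The internal service isn't described in the post, so the flag names, variant names, and functions are all hypothetical.

```python
# Hypothetical mapping from experiment variant to the flags turned on for it.
FLAGS_BY_VARIANT = {
    "control": set(),
    "experiment_1": {"new_nav"},
    "experiment_2": {"new_nav", "larger_headlines"},
}


def is_enabled(flag, variant):
    """Return True if the flag is turned on for the given variant."""
    # Unknown variants fall back to no flags, i.e. the current site.
    return flag in FLAGS_BY_VARIANT.get(variant, set())


def render_header(variant):
    """Pick which version of a feature to render for this variant."""
    if is_enabled("new_nav", variant):
        return "new navigation"
    return "current navigation"


print(render_header("control"))
print(render_header("experiment_1"))
```

Keeping the mapping in configuration rather than in code means an experiment can be started or stopped without a deploy.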
Then, of course, we make sure to send the data on which variant a user is in to GA, where we do the bulk of our analysis. The variant comes from Fastly, and the experiment ID was coming from a hard-coded parameter in our frontend codebase (more on that later).
We also send this information to Google Ad Manager, in a custom key-value, so we can measure the impact site changes have on ad metrics, such as viewability.
That’s it! We set this all up a few years ago (shout out to Claire Neveu, Josh Laurito, Istvan Bodnar, Balasz Keki, and probably other people I’m forgetting, for designing and implementing this) and have been happily testing ever since.
The downside of this implementation, though, is that we can only run one test at a time, as it only allows for one experiment ID to be set up in the system. So next up, we’ll talk about what we did to improve this system to handle multiple concurrent experiments, and also the surprise issue we found when doing so!