Here at Gawker, we run a lot of A/B tests. A focus for us lately has been testing the new recommended post module, as we move away from our basic “popular stories” recommendations to a more personalized algorithm. During the analysis of this test - and after a long conversation with the Google Analytics support team - I realized that there was a lot I didn’t understand about how GA handles report filters and segments.

For this experiment, the change we made to the recommendation module only affected article pages, not the homepage. However, our current A/B framework turns on experiments blog-by-blog, not page-by-page. Because of this, I wanted to make sure that any sessions where a user just went to the homepage and then left, or went to the homepage and clicked “view more stories” a few times but never read an article, wouldn’t be included in the analysis since they were never actually exposed to the change.

I created a custom report with a filter to isolate sessions that had visited at least one article page (and exclude those that had only visited the homepage), using a regular expression to match our URL structure. But when I ran the report to filter out homepage-only traffic, Google Analytics was excluding a full 30% of our sessions! I know our homepage design is great, but that couldn’t be right. So I opened a support ticket.

What started as a GA support ticket about how we could accurately filter out the homepage-only sessions for our A/B test analysis turned into a very comprehensive lesson on the differences between filters and segments, and on how they affect the session metric in particular. Props to the people over at Google Support for giving us a lot of insight into the often black-box workings of GA.

So, what was my filter doing?

Well, before we can answer that, let’s go over some of the nitty-gritty of how filters, and the session metric in particular, work:

  • Using a filter is equivalent to adding an additional dimension to your report and filtering on that column.
  • When you pull “Sessions” as a metric, your report will return only the count of sessions in which the first hit of the session contained the dimensions you’ve set in your report. For example, in the case of pulling experiment results, if the report has the dimension “Variation” and metric “Sessions,” it will only show sessions where the first pageview of the session was part of the experiment. For someone who visited our site right before the experiment started, and then on their 3rd pageview became part of the experiment, their session would not show up in the report, since the “Variation” dimension was not present on the first pageview of their session.
  • Since we’re pulling sessions, adding the hit-scope dimension “Page” doesn’t really make any sense. If I added “Page” as a dimension to my report, what would sessions by page even mean? To deal with this ambiguity, if you set “Page” as a dimension or filter for a report with sessions as a metric, it will use the first page of the session.
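To make the first-hit rule concrete, here is a minimal Python sketch of the behavior described in the bullets above. It is a toy model, not how GA is actually implemented: the `sessions_by_dimension` helper, the hit dictionaries, and the page paths are all made up for illustration.

```python
# Toy model only: a session is a list of hits, and a hit is a dict of dimensions.
# This mimics the behavior described in the post; it is not GA's implementation.

def sessions_by_dimension(sessions, dimension):
    """Count sessions per value of `dimension`, looking only at the first hit."""
    counts = {}
    for hits in sessions:
        value = hits[0].get(dimension)
        if value is None:
            continue  # dimension absent on the first hit: session is not counted
        counts[value] = counts.get(value, 0) + 1
    return counts

# Hypothetical sessions with hypothetical page paths.
sessions = [
    # In the experiment from the first pageview
    [{"page": "/", "variation": "A"}, {"page": "/article-1", "variation": "A"}],
    # Started before the experiment, joined it on the 3rd pageview
    [{"page": "/"}, {"page": "/article-2"}, {"page": "/article-3", "variation": "B"}],
    # Landed on an article, already in the experiment
    [{"page": "/article-4", "variation": "B"}],
]

by_variation = sessions_by_dimension(sessions, "variation")
by_page = sessions_by_dimension(sessions, "page")

print(by_variation)  # {'A': 1, 'B': 1}
print(by_page)       # {'/': 2, '/article-4': 1}
```

The second session never shows up under “Variation” even though it ended up in the experiment, and “sessions by Page” turns out to mean “sessions by first page.”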

And here’s what my custom report setup looked like:

First off, because I had Experiment ID and Variation as dimensions for my report, I was excluding any session that started before the experiment began, even if it eventually became part of the experiment. This is actually okay, and probably what we want - now we know that every session present in the report had the experiment from the beginning of the session.

As for my filter, by filtering on “Page does not match my homepage regex string”, I was getting the sessions that began on non-homepage pages (aka article pages), and filtering out those that originated from the homepage. This filter obviously did not accomplish what I had intended, but now I knew why we were getting such a high percentage of “homepage-only traffic.” It wasn’t homepage-only traffic, but the percentage of sessions that entered our sites on our homepages, which made a lot more sense!
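Here is a small Python sketch of that mismatch, under the same toy assumption that a filter on “Page” only ever sees the first page of the session. The regex is the homepage regex used in this post; the sessions and paths (including `/?page=2` for “view more stories”) are hypothetical.

```python
import re

# Homepage regex from this post: exactly "/" or a path starting with "/?"
HOMEPAGE = re.compile(r"(^/$|^/[?]{1})")

def is_homepage(page):
    return HOMEPAGE.search(page) is not None

# Hypothetical sessions, each a list of page paths in visit order.
sessions = [
    ["/", "/article-a"],               # entered on homepage, then read an article
    ["/", "/?page=2"],                 # homepage only ("view more stories")
    ["/article-b"],                    # entered on an article
    ["/article-c", "/"],               # entered on an article, then homepage
    ["/", "/article-d", "/article-e"], # entered on homepage, read two articles
]

# What my filter effectively did: test only the first page of each session.
kept_by_filter = [s for s in sessions if not is_homepage(s[0])]
dropped_by_filter = len(sessions) - len(kept_by_filter)

# What I actually wanted to drop: sessions where every page is a homepage.
homepage_only = [s for s in sessions if all(is_homepage(p) for p in s)]

print(f"dropped by filter: {dropped_by_filter} of {len(sessions)}")      # 3 of 5
print(f"truly homepage-only: {len(homepage_only)} of {len(sessions)}")   # 1 of 5
```

In this made-up data the filter drops three sessions, but only one of them never reached an article.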

(Side note: to find total pageviews on article pages vs. the homepage, this filter works. The issue here is only with session reporting.)

So now that we understand what we did wrong, how do we actually exclude homepage-only traffic?

Segments!

We know now that a session is only counted if the first hit in the session matches the dimensions and filters of the report. Segments, on the other hand, allow us to match sessions based on any hit in the session, rather than just the first hit. We can use the segment logic to get the results that we want.

We’re not going to be able to exclude “sessions where all pages equal a homepage” directly, because the segment logic will include or exclude a session if any of its hits match the criteria given. “Sessions include Page matching regex (^/$|^/[?]{1})” will give us any session that includes at least one homepage view, which isn’t what we want. If a user went to the homepage, and then went to an article page, we want to include them in the report since they were exposed to the experiment change. If they went to the homepage and then left, we don’t.

Let’s approach the logic here from another angle. If “Sessions include Page matching regex (^/$|^/[?]{1})” gives us any session that includes at least one homepage view, then “Sessions include Page NOT matching regex (^/$|^/[?]{1})” will give us any session that includes at least one non-homepage view, aka at least one article view. “Page not matching regex (^/$|^/[?]{1})” will match a non-homepage pageview, so as long as one of the pages in the session is a non-homepage page, the session is included in this segment and we have accomplished our goal!

So “Sessions include Page not matching regex (^/$|^/[?]{1})” gets us what we want - it filters out any homepage-only sessions from the report. On the flip side, if we wanted to look exclusively at homepage-only sessions, we need to use the “exclude” criteria option and do some double-negative thinking: “Sessions EXCLUDE Page not matching regex (^/$|^/[?]{1})” will exclude all sessions that had at least one non-homepage pageview, thus giving us only sessions with all homepage pageviews.
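The include/exclude logic is easier to check with a few lines of code. This is a rough Python sketch of how I think about the two segments, not GA’s actual implementation; `include_segment`, `exclude_segment`, the session IDs, and the page paths are all hypothetical names for illustration.

```python
import re

# Homepage regex from this post
HOMEPAGE = re.compile(r"(^/$|^/[?]{1})")

def is_not_homepage(page):
    """The hit-level condition: Page NOT matching the homepage regex."""
    return HOMEPAGE.search(page) is None

def include_segment(sessions, condition):
    """Keep sessions where ANY hit meets the condition."""
    return [sid for sid, pages in sessions.items()
            if any(condition(p) for p in pages)]

def exclude_segment(sessions, condition):
    """Drop sessions where ANY hit meets the condition."""
    return [sid for sid, pages in sessions.items()
            if not any(condition(p) for p in pages)]

# Hypothetical sessions keyed by a made-up session ID.
sessions = {
    "s1": ["/", "/article-a"],  # homepage, then an article
    "s2": ["/", "/?page=2"],    # homepage only
    "s3": ["/article-b"],       # article only
}

at_least_one_article = include_segment(sessions, is_not_homepage)
homepage_only = exclude_segment(sessions, is_not_homepage)

print(at_least_one_article)  # ['s1', 's3']
print(homepage_only)         # ['s2']
```

Excluding “any hit is not a homepage” is the same as keeping “every hit is a homepage,” which is the double negative at work.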

Conclusions

The key takeaways I learned from this process:

  • Filters on a report with sessions as the metric will return only sessions where the first hit of the session matched the filter.
  • Segments allow you to filter sessions based on any hit in a session meeting some criteria.
  • Beware of which dimensions you use when reporting on sessions. If the dimension is not present in the first hit of the session, the session will not be counted in your report. To avoid this (which you may or may not want to do, depending on the case), use only dimensions that are present on all hits (ex: device category, source/medium) and move any other filtering criteria into segments instead.
  • When creating segments with rules based on a hit-based dimension (like page), mutual exclusivity is only possible through use of the Include and Exclude switches, since a session may have some hits that satisfy the rule and other hits that do not. This is what we did with the “At least one article view” segment and the “Homepage-only traffic” segment.
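The mutual-exclusivity point can be sanity-checked with sets. The sketch below is a toy Python model with made-up session IDs and paths, assuming segments match on any hit as described in this post.

```python
import re

# Homepage regex from this post
HOMEPAGE = re.compile(r"(^/$|^/[?]{1})")

def is_homepage(page):
    return HOMEPAGE.search(page) is not None

# Hypothetical sessions keyed by a made-up session ID.
sessions = {
    "s1": ["/", "/article-a"],
    "s2": ["/"],
    "s3": ["/article-b", "/article-c"],
}

# Two "include" segments built from opposite hit-level rules
has_homepage_hit = {sid for sid, pages in sessions.items()
                    if any(is_homepage(p) for p in pages)}
has_article_hit = {sid for sid, pages in sessions.items()
                   if any(not is_homepage(p) for p in pages)}

# They overlap: a session with both kinds of hits lands in both segments.
overlap = has_homepage_hit & has_article_hit

# Include vs. Exclude on the SAME rule splits the sessions cleanly.
homepage_only = set(sessions) - has_article_hit

print(sorted(overlap))        # ['s1']
print(sorted(homepage_only))  # ['s2']
```

Only the Include/Exclude pair on the same rule covers every session exactly once.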

Google Analytics is a very powerful but also very complex tool. Think carefully about the dimensions, filters and segments you’re adding to a report to make sure you’re getting the right slice of data. Also, email support: they’re great!
