Last year, we started ingesting Google Ad Manager’s data transfer files in an ongoing effort to enhance our advertising reporting capabilities. Ingesting this data and getting it into a usable and accurate format ended up being a much more complex and labor intensive task than we had anticipated.
The documentation is a good starting point, but we ended up needing to supplement it with conversations with GAM support plus our own trial and error. We’re hoping to share what we’ve learned so others can get up to speed with data transfer files more quickly!
Ad Manager data transfer reports are complete logs of event-level data from Ad Manager campaigns. The reports are delivered to a Google Cloud Storage bucket as CSV files. Different CSV files correspond to different events including requests, responses (code serves), impressions, and clicks.
Google Ad Manager offers an API and a UI for running ad hoc queries on event data; however, there are a few things only accessible through the data transfer files.
This is the big one - the data transfer files log the URL on which the event occurred. In plain English: with the data transfer files, we can finally pull impressions by article!
We’ve been working recently on building out more granular revenue reporting capabilities, mainly an RPM (revenue per mille, i.e. revenue per 1,000 pageviews) report. But to get revenue by page we first need how many impressions served on said page, which you can’t do with GAM’s out of the box reporting tools alone. We could attempt to use custom key-values to send through an article ID and then pull that from the UI/API, but given that we have 13 sites producing over a thousand articles a week, we’d hit up on the key-value system limits pretty quickly.
But all the data transfer files contain a dimension called RefererURL, which is defined as “The URL of the page that requested and that displayed the ad.” So, including this field in our queries will give us how many impressions served on each individual page on our site. Then we can layer on our revenue data by joining on campaign specific details (order name, programmatic partner etc) to figure out how much money we made on each page.
We are big into GAM’s custom key-values here at GMG and have borderline abused them in order to build out a robust ad reporting system. But there are some limitations to reporting on key-values that we can’t work around. With the data transfer logs though, we have full access to any and all key-values we send through in the ad request, allowing us to slice and dice our data by these dimensions freely. These queries can be slow and expensive (hence GAM’s limits in the UI), but have been helpful when we need to pull reporting the UI/API can’t produce.
There are currently 9 data transfer files available, each housing a different metric:
The pricing is per file per month, so the more files you ask for the more it’s going to cost you. We ended up signing up for five files: Requests, Code Serves, Impressions, Clicks and ActiveViews.
Most file types have two files included: one backfill file and one non-backfill. Backfill files contain data from exchange partners: AdX, EBDA, etc., and the non-backfill files have everything else. The data in the files is distinct and needs to be summed together to get metric totals (i.e. to get total impressions, query both non-backfill and backfill files and sum the results.)
Getting the raw reports into a usable format is a complex task.
Since we are using AWS Redshift as our data warehouse, we first need to move the CSV files from Google Cloud Storage into S3. The CSV file names include timestamps, and those can be used to place the files in folders by date.
It is important to note the timestamps in the CSV file names do not correspond exactly to the time of the events in the file. Events from a particular date may not be delivered in the report files for a number of days. We have found, however, that almost all of the events for a particular date can be fetched by combining the files with timestamps between one day before and four days after the date in question. So, in order to gather the events for 12/15/2018, we combine data transfer files timestamped between 12/14/2018 and 12/19/2018 inclusive.
Once the files are in S3, we use the Redshift COPY command to copy the data into temporary tables in Redshift. The data for each file type is copied to a separate temporary table in Redshift. So we’ll have one table for requests, another for impressions, and so on.
The folder structure of the data in S3 makes it simple to select only the data delivered within a specified date range. For our regular daily ingestion, we copy the data needed to process the previous four days. Once we have copied the data into temporary event-type tables, we join the raw data to create aggregated data.
We do two joins, one for requests, and the other for delivery-specific metrics. For each join operation, we aggregate metrics over the set of dimensions that interest us. All dimensions used when aggregating requests are also used when aggregating deliveries, but some dimensions used when aggregating delivery metrics are not used when aggregating requests. This is because some dimensions (e.g. Line Item ID) are only available after an ad is served.
These two join operations are ultimately used to populate two persistent tables in Redshift. The table with delivery metrics may be joined to the table with request metrics via the set of shared dimensions.
It took us a while to work through how to query the data transfer files to calculate the metrics we wanted - requests, code serves, impressions, clicks and viewability metrics - and getting those metrics to match what we see in the UI.
At first we were seeing discrepancies up to 10% for certain metrics, which almost always ended up being a missing WHERE clause on our SQL. The documentation could be better at explaining some of these intricacies, but here’s what we’ve figured out.
All of the metrics that we want to track are derived from counts of the number of unique events. Each metric except impressions (explained below) is determined by taking the count of distinct pairs of KeyPart and TimeUsec2 values. In many cases, we also need to apply special filtering rules to omit some rows of data. Assuming that we have one table representing each type of Data Transfer CSV file, we can represent the formula for each metric in SQL form. Here they are:
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2 FROM Clicks)
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2, ImpressionId FROM Impressions WHERE Backfill OR CreativeSize != '')
As mentioned earlier, each file type contains a backfill and non-backfill version, so the ‘Backfill’ expression in the above SQL refers to the backfill subset of Data Transfer files.
There are three weird things about the impression files that our SQL is accounting for:
- Unfilled impressions are included in these files (although there are plans to remove them eventually.) They show up in the non-backfill file with null values for many fields including: OrderID, LineitemID, and CreativeSize.
- In the backfill file, there are also rows with null values for OrderId, LineItemId and CreativeSize, but these are not unfilled impressions, but in fact EBDA impressions. So we want to count the rows with null values from the backfill file, but not the non-backfill file.
- There may be multiple impressions with the same KeyPart and TimeUsec2, but they will be differentiated by the ImpressionId field, so we need include that dimension in our query or else we’ll undercount impressions.
Viewability metrics are a bit tricky: to pull Active View Viewable Impressions and Active View Measurable Impressions (the two metrics needed to calculate % viewability), you need to write three queries and pull from two different files (Active Views and Impressions.) The documentation actually does a pretty good job at walking you through all the steps.
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2 FROM ActiveViews WHERE ViewableImpression)
The Active View file only reports on viewable impressions and non-measurable impressions; so to determine measurable impressions, we have to calculate unmeasurable impressions and then subtract these from eligible impressions (which exist in the impressions file.)
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2 FROM ActiveViews WHERE NOT MeasurableImpression
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2, ImpressionId FROM Impressions WHERE ActiveViewEligibleImpression)
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2 FROM CodeServes WHERE VideoFallbackPosition = 0)
We need to include the
VideoFallbackPosition = 0 filter, because a single video ad event can also include fallback video ads. Each fallback video is represented by a row in the Data Transfer files, but we don’t want to count those as separate code serves, so we filter out all but the first video in the set (where this fallback position is zero).
We need to apply analogous filtering in the case of requests:
SELECT COUNT(*) FROM (SELECT DISTINCT KeyPart, TimeUsec2 FROM Requests) AS r LEFT JOIN (SELECT DISTINCT KeyPart, TimeUsec2 FROM CodeServes) AS c ON (r.KeyPart = c.KeyPart AND r.TimeUsec2 = c.TimeUsec2) WHERE c.KeyPart IS NULL OR c.VideoFallbackPosition = 0
We’re at a place now where the data we’re ingesting and aggregating from the data transfer files matches what we get from the GAM API/UI usually within 1%! This is likely as good as it’s going to get: support told us that both the API/UI and the DT files pull from the same underlying data, but there are differences in how that data gets ingested, aggregated and delivered that will cause some level of discrepancy between the different reporting options.
It took us months to work through ingesting, processing and aggregating this data, but we did it! We’re super excited to work on building out RPM and other highly granular ad reports that wouldn’t have been possible without this data.