Attached are the slides for the presentation I gave at the Predictive Analytics Summit in Chicago today.
Hi, my name is Josh Laurito: I run the data team at Gawker Media. Thanks for coming to my talk. Since I’m speaking right before lunch, I’ve decided to make sure my talk is extra long and full of pictures of food to torture you with.
Maciej Ceglowski of Pinboard gave an excellent talk last year and made the point that anyone who works with computers learns to fear their capacity to forget.
Like so many things with computers, memory is strictly binary. There is either perfect recall or total oblivion, with nothing in between. It doesn’t matter how important or trivial the information is. The computer can forget anything in an instant. If it remembers, it remembers for keeps.
Every [analyst] has firsthand experience of accidentally deleting something important. Our folklore as programmers is filled with stories of lost data, failed backups, inadvertently clobbering some vital piece of information, undoing months of work with a single keystroke. We learn to be afraid.
And because we live in a time when storage grows ever cheaper, we learn to save everything, log everything, and keep it forever. You never know what will come in useful. Deleting is dangerous. There are no horror stories—yet—about keeping too much data for too long.
So why do I make the case that we should throw out data? I could appeal to your innate sense of right and wrong, but I imagine most of you think the ends justify the means of data collection. And you’re probably right about a lot of it! I could also try to scare you about the coming doom of regulations and privacy backlash should we continue on our current path, especially in the web industry, but the tragedy of the commons is a tragedy for a reason: the incentives right now are structured against individual actors being responsible in their data collection.
I’m going to try to make the case for deleting some of your data in its own right. I have three main arguments: your data is costing you more than you think, it’s worth less than what people are telling you, and it’s putting you and your users at risk.
Before I start, let me tell you about how we’ve come to these conclusions at Gawker: a medium-sized company in a competitive space. Gawker is a $45 million/year (revenue) business with about 280 employees.
There are two distinct but related sides of the business. The publisher runs 8 blogs (Deadspin, Gawker, Gizmodo, io9, Jezebel, Jalopnik, Kotaku, and Lifehacker) which together generate on the order of 400-500 million pageviews each month. Our publishing platform, called Kinja, is open for public use and is home to one of the most vibrant discussion communities on the web: every month about 100,000 starter posts receive about 1.3 million comments and responses.
Not too long ago, the idea of throwing out data on purpose didn’t make sense as a question to ask. The default case was that data got lost or was never stored to begin with. We were working with morsels of data, and much of the data that we had came from surveys and was subject to severe biases. We were working with the best data that we could muster.
But times have changed, and now we’re often struggling with the opposite problem: too much data. The default state is no longer one of data loss, but data being recorded. The oft-cited and probably dubious stat that 90% of all data was created in the last two years is not because people started doing more things in 2013, it’s because the tech to record this information became more broadly available. You could see it in the first day of this conference: in general, yesterday’s talks took for granted that your issue is having too much data that’s difficult to pull insights out of.
The existing narrative around data is that it’s something akin to a precious resource, like salt. It’s mined and processed, and insight is extracted like a metal from an ore, and we are in a constant land grab in order to refine more and more. In previous jobs, when I’ve started projects, I’ve asked what data we should collect and what we’re hoping to do with it. When managers start saying ‘save all the data’, or ‘save everything’, without any thought around what we could do with or infer from the data, I know they’re in this resource mindset.
This shows, I think, a massive misunderstanding of the cost structure and value of analytics.
While data collection and storage is cheap, data cleaning and analysis is expensive. It’s massively expensive, and quality cleaning and analysis is getting more expensive at least in the short term (and hopefully for those of us in this room, the long term). The data science team is probably as expensive as any part of your company outside of sales, at least per-person. The numbers differ according to the source, but a data scientist will probably cost your company at least 30% more than a developer. That’s a lot of money for reading Big Data Borat tweets.
We like to joke that data science is 80% about cleaning data. That’s something of an exaggeration, but not a terrible one.
Thanks to O’Reilly’s data science surveys we have some visibility, if flawed, into time spent on data cleaning. Over 80% of data scientists report spending substantial time on data cleaning each week. Of those, about half are spending time every day dealing with cleaning messy data, and that’s not counting contributions from non-data-science team members. I should note this excludes ETL, which is accounted for separately.
We data scientists complain that data cleaning and processing take up the vast majority of our time without meaningful discussions of what we can do to be more efficient. This isn’t an “everyone talks about the weather but nobody ever does anything about it” situation. Instead, this is a willful blindness towards optimizations. It seems likely that if cleaning the data takes up the majority of your time, you have too much data or data that’s too dirty relative to what you’re doing with it.
Of course we can add on additional systems to manage our data, but think about what we’re doing here: we’re adding additional points of failure, additional complexity, and additional places for inaccuracies to be layered into the data.
The cost is not just limited to our labor: we have real impacts on our organization’s ability to provide our primary service. In my industry, we know there’s a clear relationship between page speed and how likely users are to click on another page or come back to our site. Loading up additional events slows us down. I’m picking on the Tribune here, though they aren’t even a particularly bad offender: we struggle to keep the number of third-party calls we make, inclusive of what our advertisers use, under 40.
When do we see places where our data costs too much to maintain? For Gawker, this has manifested itself when we look at our historical web traffic data: this is data from Quantcast, with whom we have a strong partnership. This is a chart of our historical numbers: we’ve had a number of discrete events that have caused meaningful dislocations in our traffic. In February of 2011, we launched a total redesign of our sites, leading to a large drop in traffic as we underestimated the impact of load time. In early 2012, Quantcast made an important change in their measurement scheme, breaking out mobile traffic against desktop and causing slight discontinuities. That’s where the dark blue starts. In late 2012, Hurricane Sandy caused outages at both of our data centers, causing all our sites to go down for an extended period of time. And in March of this year, we had a terrible search engine de-indexing bug that caused us to lose a huge percentage of our search engine juice, from which we’re still recovering.
In each of these cases, a meaningful discontinuity or change to our business caused either data loss or a change of the context under which the data was collected.
In each case, we’re faced with a decision: we can impute the missing data, we can report the data as-is and note the discrepancies and reasons it’s not an apples-to-apples comparison, or we can just toss it out.
Imputation, or replacing missing data, is fairly straightforward, but we’ll be bringing our biases into our analysis. Any trends we expect to find in the data, we’ll be codifying with our algorithms. Any predictions we try to make or models we try to build on future data will only reflect the assumptions that we’ve baked in.
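To see how imputation bakes in assumptions, here is a minimal sketch that fills a gap by linear interpolation. The function name and the traffic numbers are hypothetical, chosen only for illustration; this is not a description of any method we actually ran.

```python
def impute_linear(series):
    """Fill interior None gaps with a straight line between the nearest known values."""
    filled = list(series)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        # The per-step change is fixed by the two endpoints alone.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            # Every filled value sits on that line, so any "trend" inside
            # the gap is our assumption, not an observation.
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical monthly pageviews (millions) with a two-month hole.
traffic = [400, 410, None, None, 440, 450]
print(impute_linear(traffic))
```

A model later trained on the filled series will "discover" smooth growth across the gap, because that is exactly what was put there.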
Reporting data while noting context may be a reasonable solution, but it doesn’t scale. We create difficult-to-chart errors and local, informal knowledge structures that are difficult for new analysts and data scientists to grok. We effectively are setting traps in our data sets for our future selves and our colleagues. You can already start making out some spurious seasonality in our data: 3 of the 4 changes happened in February or March: an analyst without context would assume that’s a slow time of year.
Increasingly, I find that eliminating this data from our databases is the best solution. In our own logs and databases, we only have traffic data since February, and even that’s of limited use: we’ll toss some of it before the end of the year. While we’ll keep sources like Google Analytics and Quantcast, we’d rather have cleaner data in our own database. There’s just a limit to what we can get out of our old, dirty data. And this brings me to the question of value.
To a large extent, we believe our own hype about what we’re capable of doing with our data, and how useful it is. Those of us who work in companies that market the value of our data, either to clients or to investors, are incentivized to show the largest possible sets of data and to state the maximal potential of the data to justify the amount of money that’s spent on us.
I come from a finance background, and it’s axiomatic in finance that the value of any asset is the net cashflow you expect to receive over the life of that asset, discounted by how risky it is. The value of an apple tree is somehow related to how many apples it produces and how much people will pay for apples.
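As a rough sketch of that finance view applied to a data set: value is the sum of expected net cash flows, each discounted back to today. The function name, the yearly figures, and the 10% rate below are all hypothetical assumptions for illustration.

```python
def present_value(net_cashflows, discount_rate):
    """Discount each year's expected net cash flow back to today and sum them."""
    return sum(
        cashflow / (1 + discount_rate) ** year
        for year, cashflow in enumerate(net_cashflows, start=1)
    )


# Hypothetical data set: the benefit halves each year as the data ages,
# while the cost of storing and cleaning it stays flat.
benefits = [1000, 500, 250]
upkeep = [400, 400, 400]
net = [b - c for b, c in zip(benefits, upkeep)]  # [600, 100, -150]

print(round(present_value(net, discount_rate=0.10), 2))
```

Once the yearly net turns negative, each additional year of keeping the data lowers its value, which is the point at which tossing it starts to look rational.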
While there’s certainly much truth to the value of data, there’s lots of data that has little, if any, value. Older data with limited predictive power may cost more to store than we’ll ever see in returns, even if the cost for keeping it and maintaining it is low. How much predictive power can I reasonably expect from my data on traffic to gawker.com in 2003, before Facebook or Twitter? How much can a stock analyst expect from looking at the Dow in the 1800’s, when it was calculated based on a few trades happening under a buttonwood tree?
I think all of us, to some extent, believe in the unreasonable effectiveness of data. But no less an authority than Peter Norvig, head of research at Google and co-author of The Unreasonable Effectiveness of Data, recognizes that there is a point where the returns of including additional data have diminished. He’s used this slide in the context of solving problems like image reconstruction and language translation. And these are relatively stable problems, where a long-term data set is likely to be an accurate reflection of future needs.
This isn’t just a call to throw out old data, either. Not all of your data is useful in calculating what will happen in the future.
For Gawker, this has come up most recently in our article recommendation algorithms. We support an article recommendation feature which recommends articles to our visitors based on a number of different algorithms. Here’s an image of what our articles look like: we have a Kotaku post about bugs in a new video game, Fallout 4, and on the left, there’s a rotating carousel that’s suggesting that you look at other stories about the game.
Before building a machine learning solution, we naively populated this module with the most popular stories across our network. This worked reasonably well, but introduced a number of problems. One was that there was no connection between user interest graphs and the topic that’s being suggested.
There’s a lot of literature around building different recommendation algorithms, so we took a few different approaches: One was doing a straightforward collaborative filter: looking at users who looked at the same article that you’re on and how likely they are to look at other articles.
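A minimal sketch of that kind of collaborative filter, assuming a simple dictionary of user-to-viewed-articles as input. The function name, user IDs, and article slugs are hypothetical, and a real system would work on a sparse matrix rather than Python sets.

```python
from collections import Counter


def recommend(views, current_article, top_n=3):
    """Rank other articles by the share of this article's readers who also viewed them."""
    readers = [user for user, seen in views.items() if current_article in seen]
    if not readers:
        return []
    co_views = Counter()
    for user in readers:
        for article in views[user]:
            if article != current_article:
                co_views[article] += 1
    scored = [(article, count / len(readers)) for article, count in co_views.items()]
    # Highest share first; ties broken alphabetically so the order is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical pageview log: user -> set of article slugs they viewed.
views = {
    "u1": {"fallout-bugs", "fallout-review", "ps4-sale"},
    "u2": {"fallout-bugs", "fallout-review"},
    "u3": {"fallout-bugs", "ps4-sale", "car-recall"},
    "u4": {"car-recall"},
}
print(recommend(views, "fallout-bugs"))
```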
A second approach we took was filtering down this data to users who have logged in to our platform. Now, we apply these predictions to all users, so this is a less good representation of the readers we’ll be making recommendations for, and we only have 10% of the data. But the matrix of pageviews is much denser, so it’s computationally much more straightforward.
Finally, we also tried a content filter, which looks at the declared topics of each story. This has the advantage of being something we can immediately calculate for all our posts.
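One simple way to sketch a content filter is to score stories by how much their declared topics overlap. The Jaccard measure, the function names, and the topic tags below are assumptions for illustration, not necessarily what we ran.

```python
def jaccard(a, b):
    """Overlap between two tag sets: shared tags divided by all tags."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_by_topic(topics, current, top_n=3):
    """Rank other stories by topic overlap with the current one, dropping zero matches."""
    scored = [
        (other, jaccard(topics[current], tags))
        for other, tags in topics.items()
        if other != current
    ]
    scored = [(other, score) for other, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical declared topics per story.
topics = {
    "fallout-bugs": {"fallout", "video games", "bugs"},
    "fallout-review": {"fallout", "video games", "reviews"},
    "ps4-sale": {"video games", "deals"},
    "car-recall": {"cars", "recalls"},
}
print(similar_by_topic(topics, "fallout-bugs"))
```

Because it needs only the tags, this can score a story the moment it is published, before anyone has read it.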
We ran all of these in parallel to one another, all being calculated on our Spark cluster. While we continue to test these algorithms, so far it looks like our content filtering is worse than our baseline, and our collaborative filters are better. But the collaborative filter with less data, just logged-in users, performs about the same as the one using much more data.
We were somewhat curious about this, so we looked into what worked and didn’t. It turns out that the freshness of the post is incredibly important. The item-based filtering initially included no penalty in terms of the recency of the post, and returned old posts. I suspect the noise from the non-logged-in users was primarily related to selection bias. These users typically don’t go directly to our homepages, so they only see stories after they’ve become popular on Facebook or Twitter. As a result, they’re likely to miss articles that may be relevant or interesting, but don’t get traffic on social networks. We have an implicit assumption in our model that all the stories a reader can see are laid out on a grid, but obviously this isn’t the case. Logged-in users get closest to this true article-selection process, and as such give us good results with less data.
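A recency penalty can be sketched as an exponential decay on whatever score the filter produces. The function name and the 24-hour half-life are hypothetical; in practice the half-life would have to be tuned against real click data.

```python
def decayed_score(score, age_hours, half_life_hours=24.0):
    """Halve a recommendation score for every half_life_hours since publication."""
    return score * 0.5 ** (age_hours / half_life_hours)


# Hypothetical candidates: (slug, raw similarity score, hours since publication).
candidates = [
    ("old-classic", 0.5, 240.0),
    ("fresh-post", 0.3, 2.0),
]
ranked = sorted(
    candidates,
    key=lambda c: decayed_score(c[1], c[2]),
    reverse=True,
)
print([slug for slug, _, _ in ranked])
```

With the penalty applied, a fresh post with a weaker raw score can outrank a ten-day-old post with a stronger one.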
At this point you may be saying to yourself “Alright, Josh, your 20 minutes of complaints about your job and bizarre photos of food and garbage have convinced me that the economics of working with data are not perfect. I already know that. But why does this mean I need to throw out data?”
Unfortunately, your data is not only an asset, it’s a liability. You very likely hold information that can be embarrassing or incriminating to people, either now or in the future. And your ability to prevent others from accessing this data is limited.
This is one of my favorite comment threads on our platform: it was on a Gawker post about Iggy Azalea trying to book her concert tour. If you don’t know who Iggy Azalea is, my heartfelt congratulations: she’s a musician/vocalist. Blueberry Jones is a reader and books concerts in the DC area and gave her professional opinion of Iggy’s booking and pricing strategy, and a ton of context to musician booking and arena pricing. It’s great inside baseball. It also would almost certainly get Blueberry Jones fired if anyone found out her real identity.
We actually do have something of a history with hacking: in late 2010, a group called Gnosis was able to obtain access to our database. Our user information was securely hashed, but obviously this was a defining moment for the tech team at large.
Since then, we’ve had a much stronger security culture, including mandatory multi-factor authentication, a bug bounty program and ongoing monitoring and testing. We also only store identifiable user data on our own servers, not with Amazon or other partners. Additionally, we’ve consciously limited the amount of data that we store, which has sometimes been difficult.
And while we’re always concerned about traditional information security, government regulation and monitoring have emerged over the last few years as a primary concern.
Facebook, to their credit, provides pretty good reporting about their responses to government data requests. These requests are only going in one direction: in 2014, requests were up 23% over the previous year, which is twice as fast as Facebook is growing in the US.
Earlier this year, an important test case relating to speech on web platforms, Elonis v. United States, went to the Supreme Court. A Pennsylvania man wrote rap lyrics that were ostensibly threatening to his ex-wife: think Eminem but maybe a notch scarier. He posted these to Facebook, and when confronted about them by law enforcement, claimed they were just artistic expression.
The Supreme Court came back with a ruling on June 1 that such communications could be considered threats under the law if there was a ‘true threat’ presented, and therefore not protected by the First Amendment. Unfortunately, this ruling pretty much made everyone unhappy: people looking for strict protections now need to prove the intent of the threat, which is difficult, and website owners are now potentially at risk for additional subpoenas.
In fact, the next day, June 2, Reason.com, the site for the libertarian magazine, was subpoenaed in connection with some comments on their website in reaction to the sentencing of the founder of the Silk Road marketplace, Ross Ulbricht. Commenters were upset with the judge in the trial, and commented things from as simple as ‘I hope there’s a special place in hell reserved for that woman’ to as bad as ‘Send her through a woodchipper’. The government asked for the account information for each of these commenters, down to the IP address, which Reason was compelled to respond to.
Now regardless of how you feel about active intervention in this case, this is undeniably bad for Reason Magazine. But it’s worse: the law under which this subpoena was issued doesn’t just concern threats of physical harm: it includes threats to reputation as well. Title 18 Section 875, the law in question, covers ‘threat to injure the property or reputation’ of another. Under this, the US Government could come to us asking for, and make public, information about Blueberry Jones, the commenter who talked about Iggy Azalea I mentioned earlier.
Obviously, such risk for our users is absolute poison for the open conversation that we want to foster on our platforms. This data puts our entire business at risk: based on Quantcast data, it looks like Reason’s traffic is down about 15% from last year, and the drop appears to coincide with the subpoena. (I reached out to Reason but didn’t hear back.)
So what can we do to protect our contributors? One thing is to make sure that we only put our contributors’ data in our own data centers, preventing partners from being a weak spot in our security. Another is to avoid collecting identifiable information. Finally, we consciously avoid storing ‘grey-area’ data like IP addresses, which may not directly lead to identifiable information but can be a key part in tracking identity online.
Unfortunately, this has real data product implications. IP data is commonly used in blacklisting and spam detection engines. It’s also used in geo-targeting for advertising or content recommendations.
As a result, we have to find smart work-arounds. For spam, we’ve learned to reduce the incentive for creating spam, as well as relying on user flagging in a more meaningful way. There are also other features that we’ve developed to track whether or not accounts are likely spam based on rules systems and simple decision-tree models. And for geo-targeting, we work with partners so that there’s no unified view of a reader or commenter, but we can still meet our ad team’s requirements.
While these restrictions might seem like serious deficiencies, I think that the real issue is in how we think about data. As long as we’re thinking of data as a raw material, the logic to collect more data will always be inescapable.
I think the better metaphor for our use of data now is food. As a species, we once were in a struggle to obtain as much food as needed to survive. We built tools and structures and taboos around that reality. But the reality today is different: food is not scarce for most of us now. It’s having too much that’s the disease.
Data is our food. It’s the fuel that feeds our algorithms and our projections and our beautiful new features. But too much of it can be bad for us and cause problems, especially when the data isn’t good enough to draw meaningful or useful conclusions from. We need to be more focused on the quality of the data than the quantity. The unreasonable effectiveness of data is, in fact, unreasonable.
We need data and we can do great things with data, but while you’re doing great things with your data, please think about the costs and the risks.
I’ll be posting most of this talk on the gawkerdata Kinja page, so please come check it out, and feel free to comment about what you disagree with. You can also reach me at joshlaurito on Twitter and Kinja.
Have a great lunch.