This is the second installment of my Strata Conference journal. Part 1 is here.

After the Privacy talk, it was time for lunch: I sat with Miklos and Trey from Databricks. They spend a lot of their time tuning Spark applications (Databricks is the main author of the MLlib library), and I’ve spent a lot of my time recently tinkering with a Spark app, so I got to pick their brains a little about what’s new in 1.5, etc.



After lunch, Claudia Perlich spoke about how clicks and CTR have been exposed as poor metrics, as most clicks are accidental or the result of fumbling. She put up this chart of apps with the highest ad CTR:

The punchline here is that ‘Flashlight’ apps have the best CTRs, because people are fumbling in the dark with them. That’s what your clicks are on mobile!

Anyways, I’ve heard Dr. Perlich speak before, so I tuned out for much of the talk and caught up on email. I poked my head in to hear Lauralea Banks Edwards talk about queer theory and data, which had a few interesting points about bias in data collection. I wish her slides were a little more detailed so I could reconstruct what I missed.

Listening to her did inspire me to start counting the gender breakdowns at each session I was at (I know, that’s diametrically opposed to queer theory. What can I say, I missed the point). I sat through two short sessions on a new streaming session from Huawei and on customer journey modeling from Apigee. Each audience was 70-90% male, by my quick counts. The talks were fine: the speaker from Apigee (Jagdish Chand) in particular had interesting things to say about the differences in modeling behavior graphs and social graphs, and modeling sequences of customer journeys like n-grams in documents, which I found quite compelling.
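To make the n-gram analogy concrete, here is a minimal sketch of treating each customer journey as a "document" of events and counting its n-grams. The event names and the `journey_ngrams` helper are made up for illustration; none of this comes from the Apigee talk itself.

```python
from collections import Counter

def journey_ngrams(journey, n):
    """Return every run of n consecutive events in one journey, as tuples."""
    return [tuple(journey[i:i + n]) for i in range(len(journey) - n + 1)]

# Hypothetical journeys: each one is an ordered list of events for one customer.
journeys = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product", "exit"],
]

# Count bigrams across all journeys, the same way you would across documents.
bigram_counts = Counter(
    gram for journey in journeys for gram in journey_ngrams(journey, 2)
)
```

Frequent n-grams then play the role of common phrases: short paths that many customers share.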


But only after this did I get to the meat of the day: media presentations!

Juan Huerta talked about data science at the Wall Street Journal.


Huerta talked about a number of goals that his organization has, as well as some of the architectural components. In particular, he mentioned that most of the data that’s used to personalize offers is accessed from a separate API.

Huerta’s three primary use cases were segmentation, churn, and LTV (basically upsell). It struck me again how similar media economics are to SaaS. When I worked in the SaaS world, that was pretty much all I was concerned about.

One thing I loved about the WSJ process was the way they used their qualitative market research to inform their quantitative work. They did a survey of several thousand subscribers and non-subscribers to understand the potential market, including groups they were doing a poor job converting. Then they were able to tie each survey respondent to that person’s user logs and train a classifier based on each segment. Here are their segments, if you are interested:

It’s a little fuzzy, but the segments are


- Print Traditionalists
- Conservative Retirees
- Mobile Movers
- Career-Driven Leaders
- Eclectic Intellects
- Unengaged Essentialists (I assume this is marketing speak for people they have no info on).

One side benefit of this analysis was that it gave the tech team and the business/marketing arms a common language to describe their user segments. I thought that made a lot of sense: I’ve struggled to build funnels that fit into pre-existing conceptions of user activity before, so having a qualitative research-driven approach to segmentation seems like a good middle ground between the business team’s instincts and a K-means cluster trained on whatever features I can think of off the top of my head.


I was a little bit bummed out that Huerta only spoke briefly about non-logged-in users. I understand he’s a data guy, and the most interesting thing is to look where he has the best data. But I have to imagine conversion across their paywall is relatively low, and this is a place where there’s some low-hanging fruit.

After Huerta, Adam Kelleher of BuzzFeed gave a fairly technical talk on measuring virality, using The Dress as a case study. Adam talked primarily about how they implemented their measures of content propagation. Effectively, the first time a person clicks on a post, they give full credit for that propagation to whoever recommended it, and try to tie that share back to the original promotion of the post. Technically, this enforces that the propagation trees are acyclic, which gives them nice properties for running analysis. BuzzFeed has open-sourced some of their work on this.
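Here is a rough sketch of how first-click attribution could produce acyclic propagation trees. This is not BuzzFeed's code: the `(timestamp, visitor, referrer)` click format and the function names are assumptions made purely for illustration.

```python
from collections import Counter

def first_click_parents(clicks):
    """Map each visitor to whoever referred their first click (None for a root).

    clicks: iterable of (timestamp, visitor, referrer) tuples, a hypothetical
    format where referrer is None when the visit came straight from a promotion.
    """
    parent = {}
    for _, visitor, referrer in sorted(clicks, key=lambda c: c[0]):
        if visitor in parent:
            continue  # only the first click earns anyone credit
        # Accept a referrer only if they were already seen. Every edge then
        # points to an earlier visitor, so no cycle can form.
        parent[visitor] = referrer if referrer in parent else None
    return parent

def root_of(visitor, parent):
    """Walk up the tree to the visitor who started it."""
    while parent[visitor] is not None:
        visitor = parent[visitor]
    return visitor

def visitors_per_root(parent):
    """Count how many visitors (root included) each tree accounts for."""
    return Counter(root_of(v, parent) for v in parent)

# Toy data: bob's second click (via cat) is ignored, which is exactly the
# click that would otherwise have closed a cycle bob -> cat -> bob.
clicks = [
    (1, "ann", None),
    (2, "bob", "ann"),
    (3, "cat", "bob"),
    (4, "bob", "cat"),
    (5, "dan", None),
]
parents = first_click_parents(clicks)
tree_sizes = visitors_per_root(parents)
```

In this toy data, ann's tree accounts for three visitors and dan's for one. Tagging each root with the channel it arrived from would give per-channel totals of the kind described in the talk.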


The findings on the case study were interesting: in particular, of the 8.6 million users who saw the post in the time period (8.6 million!), only about 200k came directly from Twitter, but about 1 million came from propagation trees that started on Twitter. Basically, for every 1 visitor from Twitter, almost 4 additional visitors were generated, just through different channels.
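A quick sanity check of that multiplier, using the rounded figures above and assuming the 1 million total includes the 200k direct visitors (that reading is an assumption, not something stated in the talk):

```python
# Rounded figures quoted from the talk.
direct_from_twitter = 200_000
total_in_twitter_trees = 1_000_000  # assumed to include the direct visitors

# Visitors who arrived through other channels but trace back to Twitter.
additional = total_in_twitter_trees - direct_from_twitter
multiplier = additional / direct_from_twitter
```

With these rounded inputs, `multiplier` works out to 4 additional visitors per direct Twitter visitor, in line with the "almost 4" quoted from the unrounded numbers.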

Adam talked a lot about how to visualize and communicate this information, which I really appreciated. They implemented Tamara Munzner’s H3 in Python, which I think is super cool. I’m a big fan of Munzner: she really deserves to be mentioned alongside Tufte, Bostock, and Bret Victor as one of the leaders in visualization today.