Text classification is an important task for publishers. It allows editorial teams to identify overperforming and underperforming topics, spot changes in readership preferences, and manage overall content production in a strategic, long-term way.
Historically, at Fusion Media Group we have labeled our posts manually, a common practice among publishers (to be fair, until recently it was the only option). This approach has the advantage of bringing human insight into the classification process. But it also has a number of flaws, the most significant being a lack of consistency and the time it demands from editorial teams.
Over the last few months, thanks to the joint efforts of the Data Science and Data Engineering teams here at FMG, we implemented an automated way to classify our content. In this article, I want to walk you through our process and show some results. Our approach proved practical and not very difficult to implement; it didn’t require large Data or Tech teams or substantial financial investments. What’s important, though, as with any innovation, is to have a supportive management team.
There are many vendors offering content classification capabilities. We decided to go with Google NLP. Our main reasons were:
- Google’s taxonomy is robust, covering areas from tech to news to business. This was important for us because, across the network, we write about pretty much every topic on Earth.
- Google NLP is a newer tool, available since September 2017, targeted specifically at media and publishing organizations.
- Because we were already using many other Google services, we believed our previous experience with Google products would minimize our implementation time and costs.
Estimating the cost of running the Google NLP API on a regular basis was also part of our process. With production of about 22K stories per month across the network, we projected the annual cost of content classification to be about $1K, a manageable budget for us.
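For readers who want to try something similar, here is a minimal sketch of the two building blocks: the JSON body that Google’s `documents:classifyText` REST endpoint expects, and the back-of-the-envelope cost arithmetic. The per-1K-request price below is a placeholder for illustration, not a quote from Google’s pricing page — check the current Cloud Natural Language pricing before budgeting.

```python
def build_classify_request(text: str) -> dict:
    """JSON body for POST https://language.googleapis.com/v1/documents:classifyText.

    The response contains a `categories` list of {name, confidence} objects.
    """
    return {"document": {"type": "PLAIN_TEXT", "content": text}}


def annual_cost(stories_per_month: int, price_per_1k_requests: float) -> float:
    """One classification request per story; the price is a placeholder, not a quote."""
    return stories_per_month * 12 / 1000 * price_per_1k_requests


# With ~22K stories/month, a per-1K price of around $3.80 lands near $1K/year.
print(round(annual_cost(22_000, 3.80)))  # → 1003
```

In production you would send the request body with your API key or service-account credentials; the sketch only shows the shape of the payload and the arithmetic.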
After we assigned categories to all our historical content (automatically!), it was time to do some analysis. Below are a few examples of what we did, along with some use cases.
For the category analysis, I looked at all the stories published in 2017 across our sites. After filtering out posts in Spanish, I was left with 76K+ stories published by 17 different sites and subdomains in our network.
Overall, as a network, we covered 190 of Google’s 620 categories, i.e. 31%. Below is a graph showing our top 25% of categories by pageviews. The size of a bubble represents the share of total pageviews the category received.
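The bubble sizes boil down to a share-of-total computation per category. A minimal sketch, with invented categories and pageview counts (not our real numbers):

```python
from collections import defaultdict

# (category, pageviews) per story — toy data for illustration only
stories = [
    ("News", 1200), ("News", 800),
    ("Arts & Entertainment", 600),
    ("Computers & Electronics", 400),
]

# Sum pageviews per category
totals = defaultdict(int)
for category, pageviews in stories:
    totals[category] += pageviews

# Each bubble's size is the category's share of all pageviews
grand_total = sum(totals.values())
share = {cat: pv / grand_total for cat, pv in totals.items()}
print(round(share["News"], 3))  # → 0.667
```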
One of the interesting and practical questions we wanted to answer was whether we could identify topics that perform better on social and topics that perform better in search. If we could tell them apart, we would promote the first group, our ‘social’ topics, on our social channels, while optimizing the second group, the topics readers tend to search for, for search engines. Our analysis shows that, yes, there is indeed a difference among categories in how they perform on social and in search. The difference, of course, reflects differences in readers’ behavior, specifically in how they consume stories on various topics. For example, people tend to engage more on social with News and Politics stories, particularly with stories about Sensitive Subjects such as ‘sexual assault and sexual harassment’, ‘crime’, ‘murder’, ‘abortion’, and ‘domestic violence’. On the other hand, stories about Movies and TV Show reviews or Computers and Video Games tend to bring more visitors to our sites from search engines.
Here are two more specific examples from our sites Skillet, which writes about food, and Two Cents, which writes about personal finance. Stories on Beverages (alcoholic beverages) and Desserts (baked goods) proved to be more ‘social’ topics: our visitors came to Skillet following links shared by their friends on social media. Food and Food Recipes, on the contrary, are among the topics people actually search for. At Two Cents, stories related to Tax Preparation and Credit Reporting & Monitoring perform better in search, while topics related to longer-term financial security and well-being, such as salary and raise negotiation, saving money, retirement, and educational loans, perform better on social.
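One way to tell the two groups apart is to compare, per category, pageviews arriving from social referrers against pageviews arriving from search referrers. The sketch below uses made-up rows and a deliberately simple, hypothetical rule (social pageviews outnumber search pageviews); our actual analysis looked at the distributions, not a single cutoff.

```python
from collections import defaultdict

# (category, referrer_bucket, pageviews) — toy rows; in practice these come
# from analytics data with referrers bucketed into "social" / "search"
rows = [
    ("News & Politics", "social", 900), ("News & Politics", "search", 300),
    ("Food & Recipes", "social", 200), ("Food & Recipes", "search", 700),
]

by_cat = defaultdict(lambda: {"social": 0, "search": 0})
for cat, source, pv in rows:
    by_cat[cat][source] += pv

def label(cat: str) -> str:
    """Hypothetical rule: call a topic 'social' if social pageviews dominate."""
    s = by_cat[cat]
    return "social" if s["social"] > s["search"] else "search"

print(label("News & Politics"))  # → social
print(label("Food & Recipes"))   # → search
```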
Another question we wanted to answer was about viral posts. Do some categories tend to produce a higher percentage of viral posts, or do viral posts spread equally among the categories? The answer turned out to be yes: some categories produce a higher share of outliers than others. Two such categories, not surprisingly, are Sensitive Subjects and News.
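The post doesn’t pin down a definition of ‘viral’; one common convention, used here purely as an illustration and not necessarily our exact rule, is to flag a story as an outlier when its pageviews exceed Q3 + 1.5 × IQR for its category:

```python
import statistics

def outliers(pageviews: list[int]) -> list[int]:
    """Flag values above Q3 + 1.5*IQR — a common 'viral' cutoff, used as an example."""
    q1, _, q3 = statistics.quantiles(pageviews, n=4)
    cutoff = q3 + 1.5 * (q3 - q1)
    return [pv for pv in pageviews if pv > cutoff]

# Toy pageview counts for one category, with one clear runaway hit
print(outliers([100, 120, 90, 110, 105, 5000]))  # → [5000]
```

Running this per category and comparing the outlier share across categories gives the kind of answer described above.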
I was also curious to explore how, if at all, the topics we write about have changed over time. I picked three years from the past decade (2008, 2013, and 2018) and analyzed all posts published by one of our major sites, Gizmodo, to find the top 10 categories from each year. I then merged them into a final list, where some categories appeared in all three years and others in only one. As you can see, Gizmodo still writes a lot about Computers & Electronics and Arts & Entertainment. Yet, in recent years, we have also started covering political news and, most importantly, addressing social issues.
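The year-over-year comparison is a top-k per group followed by a union. A sketch with toy data (top 2 instead of top 10, and invented story counts):

```python
from collections import Counter

# category -> story count, per year — invented numbers for illustration
by_year = {
    2008: Counter({"Computers & Electronics": 50, "Science": 30, "Autos": 10}),
    2018: Counter({"Computers & Electronics": 40, "Law & Government": 35, "Science": 5}),
}

TOP_K = 2  # the article used the top 10 per year

# Top categories per year, then the union across years for the final list
top_per_year = {yr: [cat for cat, _ in c.most_common(TOP_K)] for yr, c in by_year.items()}
merged = sorted(set().union(*top_per_year.values()))
print(merged)  # → ['Computers & Electronics', 'Law & Government', 'Science']
```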
Even though the following analysis doesn’t belong to the topic of this post, my team asked me to write about it, so here we go, with pleasure.
Exploring the textual data a society produces gives pretty accurate insight into that society, its norms, and its values. Gender images are among the top characteristics differentiating one society from another in time and space. In the US, the social images of ‘man’ and ‘woman’ have been changing fast in recent years. I wanted to check where we stood, in 2017, on portraying gender. A very simple analysis researchers usually do to scratch the surface of a question like gender is to look at the words that most frequently follow ‘she’ and ‘he’. I did the same with the 76K+ posts in my database: all the posts in English we published in 2017 across the entire network. Here is the result.
The green columns in the graph show the words that most often follow both ‘she’ and ‘he’. The blue columns show the words that follow ‘she’ but almost never ‘he’. As you can see, ‘wears’ and ‘wore’ are still attached to women much more often than to men. ‘Hit’ and ‘shot’, on the other hand, remain firmly part of the male image.
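The underlying computation is a bigram count: for every occurrence of the pronoun, take the word immediately after it and tally it. A minimal sketch on an invented two-line corpus:

```python
import re
from collections import Counter

def next_words(text: str, pronoun: str) -> Counter:
    """Count the word immediately following `pronoun` (case-insensitive).

    \\b keeps 'he' from matching inside 'she' or 'the'.
    """
    pattern = rf"\b{pronoun}\s+(\w+)"
    return Counter(m.lower() for m in re.findall(pattern, text, flags=re.IGNORECASE))

corpus = "She wore a dress. He said nothing. She said hi. He wrote code."
print(next_words(corpus, "she"))  # counts: wore → 1, said → 1
print(next_words(corpus, "he"))   # counts: said → 1, wrote → 1
```

On a real corpus you would then compare the two counters, e.g. keep the words with a high count after ‘she’ and a near-zero count after ‘he’ to get the ‘blue column’ list.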
A funny observation: the verbs that follow ‘he’ are mainly in the past tense, while the verbs that follow ‘she’ are more often in the present tense. Hmm... :)