Using Data Science to Understand Content Relevance

The importance of content relevance 

Digital Marketing and SEO Managers have been exploring Content Relevance for some time now. Recognising how relevant a website is in relation to a search term or subject is an incredibly important part of your SEO strategy. Put simply, you aren’t going to rank for a topic or term that your content doesn’t relate to.

Relevance doesn’t only impact organic search; it can have a significant impact on your Paid Search strategy too. If you send users from ads to relevant landing pages, you improve both your users’ experience and the landing page relevancy aspect of your Quality Score.

As part of some research and development into a new Content Relevance algorithm we are developing for Ayima Labs, we applied some data analysis to the newspapers’ coverage at the time of Dominic Cummings’ not-so-little adventure.

Applying data science techniques

Using Natural Language Processing (NLP) on early articles about Boris Johnson’s Chief Advisor, and his controversial trip to the North of England, we wanted to see if we could spot any trends.

We crawled the original articles on Cummings’ trip from all of the (non-paywalled) major papers. We then calculated the TF-IDF (Term Frequency – Inverse Document Frequency) for every term in these articles. TF-IDF is one of the foundations of NLP: it measures how important a word is within a single text, relative to how common it is across the whole body of work.

Quite simply, the lower a term’s TF-IDF, the less it stands out within a text. Terms like “Cummings”, “Durham” and “the”, for example, would have a TF-IDF of 0 under the standard (unsmoothed) formula, since they appear in every article, no matter how frequently they appear in each one. This means the terms that are distinctive to each article stand out.
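As a minimal sketch of the idea (not the code used in our analysis), the calculation can be written in a few lines of Python. It assumes the textbook unsmoothed formula, where term frequency is the term's share of the words in a document and inverse document frequency is ln(N / df); many libraries apply smoothing, which would stop common terms reaching exactly 0. The function name tf_idf and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: tf = count / document length, idf = ln(N / df), no smoothing.
    """
    # Naive tokenisation: lower-case and split on whitespace.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))

    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up toy sentences, not the crawled articles.
docs = [
    "cummings drove to durham during the lockdown",
    "cummings defended the durham trip",
    "the durham trip by cummings angered ministers",
]
weights = tf_idf(docs)

for term in ("cummings", "the", "lockdown"):
    print(term, round(weights[0][term], 3))
```

In this toy corpus, “cummings” and “the” appear in every sentence, so their inverse document frequency is ln(1) = 0 and they drop out, while “lockdown”, which appears in only one sentence, keeps a positive weight.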

If we then calculate the pairwise distances between the articles’ TF-IDF values, we can create a distance matrix showing how similar the articles are to one another in terms of both language and sentiment. Similar techniques are used to calculate sentiment on social media, for instance.

Comparing the articles in this matrix led to some interesting findings:

In the above chart, the closer a number is to 1, the more similar the model judges the two pieces to be in language and sentiment. The diagonal line of 1s appears because, of course, each article is identical to itself! A 0 would mean that two pieces share absolutely no words, which is an unlikely proposition!
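A matrix with these properties can be sketched as follows. The metric here is an assumption: cosine similarity is a common choice for comparing TF-IDF vectors and behaves as described (1 for identical pieces, 0 for pieces with no weighted terms in common), but it may not be the exact measure behind the chart. The corresponding distance is simply 1 minus the similarity. The vectors and their weights are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # A vector with no weighted terms is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise similarities; distance would be 1 - similarity."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Invented TF-IDF weights for three hypothetical pieces.
vectors = [
    {"lockdown": 0.5, "trip": 0.2},
    {"trip": 0.3, "defended": 0.4},
    {"chicken": 0.6, "tomato": 0.3},
]
matrix = similarity_matrix(vectors)

for row in matrix:
    print([round(score, 2) for score in row])
```

The first two vectors share the term “trip”, so they score somewhere between 0 and 1, while the third shares nothing with either and scores 0 against both.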

The methodology is by no means perfect: it doesn’t take into account ambiguities of language, such as the context of words within the wider article. Even so, some interesting insights can be garnered from the language a writer chooses to use.

Validating our approach

By including a Guardian recipe on how to make the perfect Chicken Cacciatore, we were able not just to take a pleasant break from online vitriol with some heart-warming Italian comfort food, but also to calibrate the findings and make sure the model was working as planned. The recipe came out a bit closer to the Guardian’s piece on Cummings than to the BBC’s or Sky News’, which reflects language and a writing style that carry across the same paper, whether the subject is Cacciatore or Cummings.
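One way to turn this kind of calibration into an automatic check is sketched below. It assumes a similarity matrix has already been computed and simply confirms that the off-topic control is less similar to every on-topic piece than any two on-topic pieces are to each other. The function name control_is_outlier, the labels and the scores are all hypothetical, not figures from our analysis.

```python
def control_is_outlier(labels, matrix, control):
    """True if the control is less similar to every other piece than
    any pair of non-control pieces is to each other."""
    c = labels.index(control)
    others = [i for i in range(len(labels)) if i != c]
    control_scores = [matrix[c][i] for i in others]
    topic_scores = [matrix[i][j] for i in others for j in others if i < j]
    return max(control_scores) < min(topic_scores)


# Hypothetical labels and made-up similarity scores, for illustration only.
labels = ["article_a", "article_b", "article_c", "control"]
matrix = [
    [1.00, 0.42, 0.38, 0.05],
    [0.42, 1.00, 0.35, 0.09],
    [0.38, 0.35, 1.00, 0.04],
    [0.05, 0.09, 0.04, 1.00],
]

print(control_is_outlier(labels, matrix, "control"))
```

If the check fails, the model is treating the control as on-topic, which suggests the weighting or tokenisation needs revisiting before any conclusions are drawn.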

If we put the above into a heatmap, the relationships become even more stark: 

In this, the darker the colour, the less similar the two pieces are. The lighter the colour, the more similar the two pieces are in language and style.

We (sadly) removed the cacciatore recipe from the final analysis, leading to the below:

The lighter band in the Daily Mail’s relationships, perhaps surprisingly at the time, indicated that their original piece was the most middle-of-the-road in terms of sentiment. This similarity made the revelation on Sunday that the Daily Mail would join the other papers in condemning Cummings’ actions much less surprising. The BBC, by contrast, stood out here as an outlier from all of the papers, taking a generally more conciliatory tone than its peers.

How Ayima use data science

Using methods like this to explore relevance gives us an insight into the types of data science that likely go into Google’s algorithms. We are built on technology: our first-ever tool was an enterprise-level crawler, which we still use today.

Working with Ayima is about Performance, Technology & Control. We win because we use technology where we can and people where we have to.

You can learn more about our publicly available tools here, or get in touch with us to find out more about the ways in which we can utilise our next-generation technology to drive and inform your digital strategy.