Fri, May 13, 2016

Attribyte's News Algorithm

Facebook’s algorithms have been in the news lately because it appears they are being influenced by humans. This is as good a time as any to clear the air: Attribyte’s news algorithm may also be biased by its controllers! In this post, I’ll try to describe how the algorithm works and where the bias is injected.

Attribyte’s applications cover specific topics, like tech news or the upcoming election, so each starts with a selection of sites relevant to the subject. These sources are culled manually; it is likely there are some great sites that won’t be considered. Attribyte finds articles by crawling RSS feeds. If a site doesn’t have a feed that can be discovered automatically, or found by a cursory look for an RSS icon, it won’t be included. The content of the feed also matters. If the feed’s posts don’t contain markup, outbound links can’t be extracted for use by the algorithm. The selection of sites may reflect some personal bias. Technical choices made by publishers may bias the algorithm as well.
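To make the dependence on feed markup concrete, here is a minimal sketch of pulling outbound links from a post body. It is not Attribyte's actual code: the names `LinkExtractor` and `outbound_links` are hypothetical, and treating "outbound" as "a different host than the post" is an assumption made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, resolved against the post URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def outbound_links(post_html, post_url):
    """Return http(s) links in post_html that point away from the post's host."""
    parser = LinkExtractor(post_url)
    parser.feed(post_html)
    parser.close()
    own_host = urlsplit(post_url).netloc.lower()
    return [
        url for url in parser.links
        if urlsplit(url).scheme in ("http", "https")
        and urlsplit(url).netloc.lower() != own_host  # assumption: same host means not outbound
    ]
```

A feed that publishes plain-text summaries gives this function nothing to find, which is exactly the publisher-side bias described above.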

Short “quick link” posts that once appeared on personal blogs now mostly appear on Twitter. In addition to tweeting top stories every so often, @attribyte follows a few writers (and friends). For the tech site, their links are added to the mix, mostly as a test. (It isn’t clear if using Twitter as input makes the results better.)

Now that we have our human-biased input, let’s run the algorithm:

  1. Extract all the outbound links from posts made in the past day.
  2. Canonicalize them.
  3. Count the number of times each canonicalized link appears.
  4. Filter out those that don’t seem to be links to an article or tweet. (Bias!)
  5. Filter out links that appear to be “self” links within a blog network or related properties. (Bias?)
  6. Filter out those that appear fewer than N times. (Bias!)
  7. Match the top links to entries already in the system. If there’s no match, attempt to crawl the page to build an entry from metadata.
  8. Remove near-duplicate entries. (Bias?)
  9. Sort the remaining entries in chronological order to build the news page.
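
Steps 2 through 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not the production algorithm: the canonicalization rules (dropping `www.`, `utm_` parameters, trailing slashes, and fragments), the "has a non-empty path" article test, and the function names are all hypothetical. The self-link filter here compares hosts only and runs before counting, because it needs to know which post a link came from; a real version would also have to recognize blog networks and related properties.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: query parameters to discard


def canonicalize(url):
    """Normalize a URL so trivially different forms count as one link."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if not k.startswith(TRACKING_PREFIXES)]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def looks_like_article(url):
    """Crude stand-in for step 4: reject bare home pages."""
    return any(segment for segment in urlsplit(url).path.split("/"))


def top_links(posts, min_count):
    """posts: iterable of (source_host, links); source_host is lowercase, no 'www.'."""
    counts = Counter()
    for source_host, links in posts:
        for link in links:
            url = canonicalize(link)                   # step 2
            if urlsplit(url).netloc == source_host:    # step 5: "self" link
                continue
            if not looks_like_article(url):            # step 4
                continue
            counts[url] += 1                           # step 3
    # Step 6: keep links seen at least min_count times, most frequent first.
    return [(url, n) for url, n in counts.most_common() if n >= min_count]
```

Each hard-coded choice in the sketch (the tracking prefixes, the article test, the threshold passed as `min_count`) is a place where a human decision shapes the output, matching the "Bias!" annotations in the list.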

Note that the output results don’t have to be part of the input entry set. For example, the two entries shown in the image above are not from sites that Attribyte follows for “politics.” This provides a level of serendipity and counteracts, I think, some of the human bias. It is also a good way for Attribyte to discover sites that should be included in the input set.