How to group / compare similar news articles

Submitted by 蓝咒 on 2019-11-30 04:00:06

This problem breaks down into a few subproblems from a machine learning standpoint.

First, decide which properties of the news stories you want to group on. A common technique is the "bag of words": simply the list of words that appear in the body of the story or in its title. You can add further processing, such as removing common English "stop words" that carry little meaning ("the", "because"). You can also apply Porter stemming to collapse plurals and word endings such as "-ion" into a common stem. The resulting word list is the feature vector of each document and is what you use to measure similarity. You may first have to strip out HTML markup.
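As a rough illustration of the feature-extraction step, here is a minimal sketch in Python. The stop-word list is a tiny made-up sample, the HTML stripping is a crude regular expression, and stemming is left out entirely; the function name `bag_of_words` is only for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "on", "because"}

def bag_of_words(text):
    """Return word counts for a story, ignoring markup and stop words."""
    text = re.sub(r"<[^>]+>", " ", text)         # crude removal of HTML tags
    words = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return Counter(w for w in words if w not in STOP_WORDS)

bag = bag_of_words("<p>The markets fell because the markets feared rates.</p>")
```

The resulting `Counter` maps each remaining word to how often it occurs, which is one simple form of the feature vector described above.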

Second, you have to define a similarity metric: a function under which similar stories score high. Going along with the bag-of-words approach, two stories are similar if they share many of the same words (I'm being vague here because there are tons of things you can try, and you'll have to see which works best).
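One of the many things you can try is Jaccard similarity: the number of words two stories share divided by the number of distinct words across both. The sketch below is just one candidate metric, not a recommendation, and the function name is made up for this example.

```python
def jaccard_similarity(words_a, words_b):
    """Shared distinct words divided by all distinct words; ranges from 0.0 to 1.0."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0  # two empty stories: treat as not similar
    return len(a & b) / len(a | b)

score = jaccard_similarity(["fed", "raises", "rates"], ["fed", "cuts", "rates"])
```

Here the two stories share 2 of 4 distinct words, so `score` is 0.5.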

Finally, you can use a classic clustering algorithm, such as k-means, to group the stories together based on the similarity metric.
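To make the clustering step concrete, here is a bare-bones k-means sketch in plain Python. It assumes each story has already been turned into a numeric vector of equal length, uses squared Euclidean distance, and picks the initial centroids at random from the data; the parameter names and defaults are assumptions for illustration. In practice you would more likely reach for a library implementation.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on equal-length numeric vectors; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Four toy "stories" as 2-dimensional vectors: two near (1, 0), two near (0, 1).
labels = kmeans([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], k=2)
```

The first two vectors end up in one cluster and the last two in the other. Note that k-means requires you to choose `k` up front, which is often the awkward part for news.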

In summary: convert news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.

Check out Google Scholar; there have probably been papers on this specific topic in the recent literature. Much of what I just discussed is implemented in natural language processing and machine learning libraries for most major languages.

The problem can be broken down into:

  • How to represent articles (features, usually a bag of words with TF-IDF)
  • How to calculate similarity between two articles (cosine similarity is the most popular)
  • How to cluster articles together based on the above
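The first two items can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: documents are already tokenized, term frequency is the raw count divided by document length, and inverse document frequency is `log(N / df)` with no smoothing. Library implementations typically use slightly different weighting, and the function names here are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["fed", "raises", "rates"],
    ["fed", "cuts", "rates"],
    ["team", "wins", "final"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # two stories about the Fed
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # no words in common
```

The two Fed stories get a positive similarity, while the unrelated pair scores exactly 0.0 because they share no terms.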

There are two broad groups of clustering algorithms: batch and incremental. Batch is great if you've got all your articles ahead of time. Since you're clustering news, you've probably got your articles coming in incrementally, so you can't cluster them all at once. You'll need an incremental (aka sequential) algorithm, and these tend to be complicated.
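For a feel of what the incremental variant looks like, here is a sketch of about the simplest one there is, a single-pass "leader" scheme: each new article joins the most similar existing cluster if the similarity clears a threshold, and otherwise starts a new cluster. The class name, the Jaccard metric, and the threshold of 0.3 are all assumptions chosen for illustration; real systems usually update cluster representatives and tune the threshold.

```python
def jaccard(a, b):
    """Shared words divided by all distinct words in two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class IncrementalClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.leaders = []  # word set of the first article in each cluster
        self.members = []  # article ids in each cluster

    def add(self, article_id, words):
        """Assign one incoming article to a cluster and return the cluster index."""
        words = set(words)
        best, best_sim = None, 0.0
        for idx, leader in enumerate(self.leaders):
            sim = jaccard(words, leader)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= self.threshold:
            self.members[best].append(article_id)
            return best
        # Nothing similar enough: this article becomes the leader of a new cluster.
        self.leaders.append(words)
        self.members.append([article_id])
        return len(self.leaders) - 1

clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("a1", ["fed", "raises", "rates"])
second = clusterer.add("a2", ["fed", "cuts", "rates"])
third = clusterer.add("a3", ["team", "wins", "final"])
```

The first two articles land in the same cluster and the third starts its own. The result depends on arrival order, which is one reason more robust incremental algorithms get complicated.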

You can also try http://www.similetrix.com. A quick Google search turned them up, and they claim to offer this service via an API.

One approach would be to add tags to the articles when they are listed. One tag would be XYZ. Other tags might describe the article subject.

You can do that in a database. You can have an unlimited number of tags for each article. Then, the "groups" could be identified by one or more tags.
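A minimal sketch of that database layout, using Python's built-in `sqlite3` module and an in-memory database. The table names, column names, sample rows, and the helper `articles_with_tags` are all hypothetical; the point is only the many-to-many link between articles and tags.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- One row per (article, tag) pair, so an article can carry any number of tags.
    CREATE TABLE article_tags (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        tag TEXT NOT NULL,
        PRIMARY KEY (article_id, tag)
    );
""")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(1, "Fed raises rates"), (2, "Fed holds rates"), (3, "Cup final recap")],
)
conn.executemany(
    "INSERT INTO article_tags VALUES (?, ?)",
    [(1, "economy"), (1, "fed"), (2, "economy"), (2, "fed"), (3, "sports")],
)

def articles_with_tags(conn, tags):
    """Return titles of articles that carry every tag in `tags`."""
    placeholders = ",".join("?" for _ in tags)
    rows = conn.execute(
        f"""SELECT a.title FROM articles a
            JOIN article_tags t ON t.article_id = a.id
            WHERE t.tag IN ({placeholders})
            GROUP BY a.id
            HAVING COUNT(DISTINCT t.tag) = ?
            ORDER BY a.id""",
        [*tags, len(tags)],
    ).fetchall()
    return [title for (title,) in rows]

fed_group = articles_with_tags(conn, ["economy", "fed"])
```

Here `fed_group` contains the two Fed articles, so the pair of tags acts as the "group".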

This approach is heavily dependent upon human beings assigning appropriate tags, so that the right articles are returned from the search, but not too many articles. It isn't easy to do really well.
