How to check uniqueness (non-duplication) of a post in an RSS feed

When retrieving posts from an RSS feed and caching or saving them in a database, how can I determine that:

it is the same post (for example, when typos are fixed in the feed, or the title or the date changes, etc.)
different feeds talk about the same topic (for example, the same story from different sources)

Are there any best practices for these things?

Thanks a lot.

Some RSS feeds have a guid element as an identifier. Posts with a shared guid are probably duplicates. Some feeds simply put the post's URL in the guid to indicate that a post's uniqueness is tied to its URL. Note that if the URL matches but the guid does not, the posts may not be duplicates: a feed that does not maintain an archive might keep the same URL from one post to the next. This situation is probably pretty rare.
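As a rough illustration of the guid-first approach, here is a minimal Python sketch using only the standard library. The function names `item_key` and `new_items` and the in-memory `seen` set are assumptions made for this example; in practice the set would be a unique column in your database. It prefers the guid and falls back to the link only when no guid is present.

```python
import xml.etree.ElementTree as ET


def item_key(item):
    """Return a dedup key for an RSS <item>: the guid if present, else the link."""
    guid = (item.findtext("guid") or "").strip()
    link = (item.findtext("link") or "").strip()
    if guid:
        return ("guid", guid)
    if link:
        return ("link", link)
    return None  # nothing reliable to identify the post by


def new_items(rss_xml, seen):
    """Return the <item> elements whose key is not yet in `seen`, and record them."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        key = item_key(item)
        if key is None or key in seen:
            # Skipped in this sketch: either a duplicate or an item with
            # neither guid nor link, which needs some other heuristic.
            continue
        seen.add(key)
        fresh.append(item)
    return fresh
```

With this, a post whose title was corrected but whose guid stayed the same is reported only once.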

The URL would be a good start. How to handle different versions of a post when people make changes would depend on implementation details.

If pubDate is used in the item element of the feed, it could perhaps serve as a version.
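One way to read "pubDate as a version" is sketched below in Python. The `classify` function and the `store` dictionary are hypothetical names for this example, standing in for a database table keyed by guid or URL. It assumes the feed emits RFC 822 style dates with an explicit time zone such as GMT, which `email.utils.parsedate_to_datetime` can parse; whether a given feed actually bumps pubDate when it edits a post is also an assumption.

```python
from email.utils import parsedate_to_datetime


def classify(store, key, pub_date, payload):
    """Record a post and return 'new', 'updated' or 'unchanged'.

    `store` maps a post key (guid or URL) to (published datetime, payload).
    A later pubDate for a known key is treated as a newer version.
    """
    published = parsedate_to_datetime(pub_date)
    if key not in store:
        store[key] = (published, payload)
        return "new"
    old_published, _ = store[key]
    if published > old_published:
        store[key] = (published, payload)  # keep the most recent version
        return "updated"
    return "unchanged"
```

If a feed edits posts without changing pubDate, comparing a hash of the content would be an alternative to comparing dates.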

Refer: http://cyber.law.harvard.edu/rss/rss.html#sampleFiles

Take a look at the clustering algorithms used by Google News. Your requirements are not as demanding, but they are loosely related to what Google News does: it clusters stories about the same event from different sources into one group. They use high-level algorithms combined with NLP, but you can start by matching the keywords in the title and URL.
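A very simple starting point for keyword matching is sketched below in Python. This is not what Google News does; it is plain Jaccard similarity over title keywords. The stopword list, the 0.5 threshold, and the function names are all assumptions for this example and would need tuning on real feeds.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for",
             "and", "is", "at", "by", "with"}


def keywords(title):
    """Lowercase the title and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower())
            if w not in STOPWORDS}


def jaccard(title_a, title_b):
    """Share of keywords the two titles have in common (0.0 to 1.0)."""
    ka, kb = keywords(title_a), keywords(title_b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def same_topic(title_a, title_b, threshold=0.5):
    """Guess whether two titles cover the same story."""
    return jaccard(title_a, title_b) >= threshold
```

Exact word overlap misses inflected forms such as "unveils" and "unveiled", so stemming would be a natural next refinement.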

Source: https://stackoverflow.com/questions/3656107/how-to-check-uniqueness-non-duplication-of-a-post-in-an-rss-feed

Tags

sql-server

rss

feeds
