When retrieving posts from an RSS feed and caching or saving them in a database, how can I:
- determine that an item is the same post as one already stored (for example, when typos are fixed in the feed, or the title or date changes)
- find posts that talk about the same topic (for example, the same story from different sources)
Are there any best practices for these things?
Thanks a lot.
Some RSS feeds include a guid element as an identifier. Posts that share a guid are probably duplicates. Some feeds simply put the URL in the guid to indicate that a post's uniqueness is tied to its URL. Note that if the URL matches but the guid does not, the posts may not be duplicates: a feed that does not maintain an archive might reuse the same URL for new posts. This situation is probably pretty rare.
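As a rough sketch of that idea, the snippet below keys each stored post on its guid and falls back to the link when no guid is present. The names `item_key`, `upsert_items` and the plain dict used as `store` are made up for illustration; a real application would use its own database layer.

```python
import xml.etree.ElementTree as ET


def item_key(item):
    """Return the guid of an RSS <item>, or its link if there is no guid."""
    guid = (item.findtext("guid") or "").strip()
    link = (item.findtext("link") or "").strip()
    return guid or link or None


def upsert_items(rss_xml, store):
    """Insert or overwrite posts in `store` (a hypothetical dict cache)."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        key = item_key(item)
        if key is None:
            continue  # nothing reliable to identify this post by
        # Same key seen again: treat it as the same post and refresh it.
        store[key] = {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "pubDate": (item.findtext("pubDate") or "").strip(),
        }
    return store
```

With this approach a post whose title was corrected between two fetches overwrites the earlier copy instead of being stored twice, as long as its guid (or link) stayed the same.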
The URL would be a good start. How to handle different versions when people make changes depends on implementation details.
If pubDate is used in the item element of the feed, it could perhaps serve as a version.
Refer: http://cyber.law.harvard.edu/rss/rss.html#sampleFiles
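One way to treat pubDate as a version, sketched under the assumption that the feed writes RFC 822 style dates as in the RSS 2.0 samples. The function names `parse_pub_date` and `is_newer_version` are hypothetical, and treating a date without a timezone as UTC is my own assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(text):
    """Parse an RFC 822 style pubDate; return None if it cannot be parsed."""
    if not text:
        return None
    try:
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Assumption: treat dates without a timezone as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_newer_version(stored_pub_date, incoming_pub_date):
    """True if the incoming item has a later pubDate than the stored one."""
    old = parse_pub_date(stored_pub_date)
    new = parse_pub_date(incoming_pub_date)
    if old is None or new is None:
        return False  # cannot compare, so do not claim it is newer
    return new > old
```

Not every feed updates pubDate when a post is edited, so this only helps for feeds that do.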
Take a look at the clustering algorithms used by Google News. Your requirements are not as demanding, but they are loosely related to what Google News does: it clusters stories about the same event from different sources into one group. It uses high-level algorithms combined with NLP, but you can start by matching the keywords in the title and URL.
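A very simple starting point for the keyword matching could look like the sketch below. It compares the keyword sets of two titles with Jaccard similarity, which is my own choice of measure and not what Google News uses. The stopword list, the function names and the 0.5 threshold are all assumptions you would need to tune, and the same comparison could be applied to words taken from the URL.

```python
import re

# Assumed, deliberately tiny stopword list; extend it for real use.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "at", "by", "with"}
)


def keywords(title):
    """Lowercase the title and return its words minus stopwords, as a set."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {word for word in words if word not in STOPWORDS}


def title_similarity(title_a, title_b):
    """Jaccard similarity of the two keyword sets, between 0.0 and 1.0."""
    keys_a, keys_b = keywords(title_a), keywords(title_b)
    if not keys_a or not keys_b:
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def same_topic(title_a, title_b, threshold=0.5):
    """Guess whether two titles cover the same story (threshold is arbitrary)."""
    return title_similarity(title_a, title_b) >= threshold
```

This will miss stories that are worded differently and will not handle word forms such as "passes" versus "passed", which is where stemming and the heavier NLP techniques come in.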
Source: https://stackoverflow.com/questions/3656107/how-to-check-uniqueness-non-duplication-of-a-post-in-an-rss-feed