问题
I am working on a project which shows rss feeds from different sites. I keep them in the database, every 3 hours my program fetches and inserts them into sql database. I want unique records for providers not to show duplicate content.
But problem is some providers do not give GUID field, and some others gives GUID field but not pubdate.. And some others does not even give GUID or PubDate just title and link.
So to keep rss feeds uniqe in sql server what would be the best way?
Should I check for first guid, then pubbdate, then link, then title? Will it be to good practice to compare link fields in SQL to check uniqueness?
Thanks.
回答1:
I would develop a routine that takes certain key parameters like the title, source and body and then combines them to create a CRC hash. Then store the hash as an attribute with the feed and check for a matching hash before adding a new feed.
I'm not sure what your environment contraints are but here is an example for calculating CRC-32 in C#: http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
来源:https://stackoverflow.com/questions/11953807/best-practice-to-keep-rss-feeds-unique-in-sql-database