How to programmatically determine whether an RSS feed is a full feed or a partial feed

一生所求 2021-02-08 10:52

I would need to programmatically determine whether an RSS feed exposes the full content of its articles or just extracts of them. How would you do it?

3 Answers
  • 2021-02-08 11:05

    Why not follow each item's URL from the RSS feed and check whether the linked page contains more text than the feed entry does? You would need an HTML parser and a few general rules for extracting the text.
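
A minimal sketch of that length comparison, using only Python's standard `html.parser`. The names `TextExtractor`, `visible_text`, `looks_partial_by_length` and the `ratio` threshold of 0.5 are illustrative assumptions, not part of the answer above. Fetching the feed and the page is left out; both are passed in as HTML strings.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_partial_by_length(feed_html, page_html, ratio=0.5):
    """Guess 'partial' when the feed entry is much shorter than the page.

    The threshold is an assumption and needs tuning per site.
    """
    feed_len = len(visible_text(feed_html))
    page_len = len(visible_text(page_html))
    if page_len == 0:
        return False  # nothing to compare against
    return feed_len / page_len < ratio
```

Real pages also carry navigation, sidebars and comments, so the raw ratio overstates how much article text the page has. In practice you would probably restrict `page_html` to the main article container before comparing.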

  • 2021-02-08 11:14

    Look for a link at the end of each entry that says "More", "Continued", "Full article", "..." or similar. Alternatively, you could follow every link and check whether the target page contains the text from the feed plus extra content.

  • 2021-02-08 11:16

    I don't think there is a very clean way of doing this, but here are two "hacky" ones:

    I'd parse the text of each RSS item and look for any links in it. Granted, there could be multiple links (some to other blog posts), but if you focus on the last one and come up with a few heuristic words for the link text (e.g. "more", "read full", etc.), you should be able to catch a lot of them. For more confidence, you can look only at the links that point back to the original blog.
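
The link heuristic could be sketched roughly as follows. The keyword pattern, the class `LinkCollector`, and the function `has_read_more_link` are hypothetical names chosen for illustration. The sketch keeps only links that resolve to the blog's own host and then tests the text of the last one.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed keyword list; extend it for the feeds you actually see.
MORE_PATTERN = re.compile(
    r"\b(more|continued?|read full|full (article|story|post))\b|\.\.\.|…",
    re.IGNORECASE,
)


class LinkCollector(HTMLParser):
    """Records (href, link text) for every <a href=...> in a fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open = None  # [href, text] of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, ""]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.links.append(tuple(self._open))
            self._open = None


def has_read_more_link(item_html, site_url):
    """True if the last same-site link in the item looks like 'read more'."""
    collector = LinkCollector()
    collector.feed(item_html)
    collector.close()
    site_host = urlparse(site_url).netloc.lower()
    same_site = [
        (href, text)
        for href, text in collector.links
        if urlparse(urljoin(site_url, href)).netloc.lower() == site_host
    ]
    if not same_site:
        return False
    _, last_text = same_site[-1]
    return bool(MORE_PATTERN.search(last_text))
```

A word like "more" also appears in ordinary link text ("more details on X"), so expect some false positives. Feeds whose links go through a redirect or tracking host would fail the same-site check.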

    A more rigorous method would have you follow all the links and check whether the RSS fragment is a subset of the page that comes back, or whether there is substantial overlap. This may not help when the site uses a true summary as opposed to a fragment of the full post, though.
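
One way to approximate the subset/overlap test is to compare word n-grams ("shingles") of the two texts. This is a swapped-in technique for illustration, not something the answer specifies; the function names and both thresholds are assumptions. Inputs are plain text, with HTML already stripped.

```python
import re


def shingles(text, n=3):
    """Set of lowercase word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_scores(feed_text, page_text, n=3):
    """Return (containment, coverage).

    containment: share of the feed's shingles that occur in the page.
    coverage:    share of the page's shingles that occur in the feed.
    """
    feed_set = shingles(feed_text, n)
    page_set = shingles(page_text, n)
    if not feed_set or not page_set:
        return 0.0, 0.0
    common = len(feed_set & page_set)
    return common / len(feed_set), common / len(page_set)


def classify(feed_text, page_text, min_containment=0.8, full_coverage=0.6):
    """Rough guess: 'full', 'partial', or 'unknown'."""
    containment, coverage = overlap_scores(feed_text, page_text)
    if containment < min_containment:
        # Feed text is not found on the page: a hand-written summary
        # or the wrong page, so this method cannot decide.
        return "unknown"
    return "full" if coverage >= full_coverage else "partial"
```

The "unknown" outcome corresponds to the caveat above: a true summary shares little wording with the post, so overlap alone cannot tell it apart from an unrelated page.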
