Question:
I want my website to be able to pull up information about a web page when the user pastes a link into the post box, similar to Facebook.
I was wondering how sites like Google, Reddit and Facebook are able to retrieve thumbnails, titles and descriptions with just a URL.
Anyone know how they do this?
Answer 1:
The basic algorithm is rather simple: fetch the page, analyze the content, extract the text, images, title and whatever else you need, then build the preview. However, there are a lot of difficulties in particular cases. Menus, banners and ads, text structure - plenty of details require very scrupulous processing. AFAIK there is no algorithm that solves this task in 100% of cases (yes, Google's and others' algorithms aren't perfect either).
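To illustrate the fetch-and-extract steps, here is a minimal Python sketch that reads Open Graph tags and falls back to plain HTML equivalents. It assumes the `requests` and `beautifulsoup4` packages are available, and it deliberately skips all the hard heuristics mentioned above (this is not what Facebook or Google actually run):

```python
import requests
from bs4 import BeautifulSoup

def fetch_preview(url):
    """Fetch a page and pull title, description and thumbnail from its meta tags."""
    resp = requests.get(url, timeout=10,
                        headers={"User-Agent": "preview-bot/0.1"})  # hypothetical UA
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    def og(prop):
        # Open Graph tags look like <meta property="og:title" content="...">
        tag = soup.find("meta", property=f"og:{prop}")
        return tag["content"] if tag and tag.has_attr("content") else None

    # Prefer Open Graph tags; fall back to standard HTML where one exists.
    title = og("title") or (soup.title.string if soup.title else None)
    desc_tag = soup.find("meta", attrs={"name": "description"})
    description = og("description") or (
        desc_tag["content"] if desc_tag and desc_tag.has_attr("content") else None
    )
    image = og("image")  # no simple HTML fallback; a real scraper would
                         # score <img> candidates on the page instead

    return {"title": title, "description": description, "image": image}

if __name__ == "__main__":
    print(fetch_preview("https://stackoverflow.com"))
```

When the page has no Open Graph tags, this returns whatever the plain `<title>` and description meta tag give you, which is exactly where the scrupulous content-analysis part of the algorithm has to take over.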
About Reddit: since it's open source, you can see exactly how they do it. Here is the code you're looking for: https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Yandex also provides an API that does the same.
Source: https://stackoverflow.com/questions/16750127/how-to-read-open-graph-and-meta-tags-from-a-webpage-with-a-url