How can I scrape text and images from a random web page?

Submitted by 北城以北 on 2021-02-07 12:39:22

Question


I need a way to visually represent a random web page on the internet.

Let's say for example this web page.

Currently, these are the standard assets I can use:

  • Favicon: Too small, too abstract.
  • Title: Very specific but poor visual aesthetics.
  • URL: Nobody cares to read.
  • Icon: Too abstract.
  • Thumbnail: Hard to get, too ugly (many elements crammed in a small space).

I need to visually represent a random website in a way that is very meaningful and inviting for others to click on it.

I need something like what Facebook does when you share a link:

It scrapes the link for images and then creates a beautiful, meaningful tile which is inviting to click on.

Is there any way I can scrape the images and text from websites? I'm primarily interested in an Objective-C/JavaScript combo, but anything will do and will be selected as the approved answer.

Edit: Re-wrote the post and changed the title.


Answer 1:


Websites will often provide meta information for user-friendly social media sharing, such as Open Graph protocol tags. In fact, in your own example, the Reddit page has Open Graph tags which make up the information in the Link Preview (look for meta tags with og: properties).
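As a rough illustration of reading those tags, here is a minimal Python sketch using only the standard library. The names `OpenGraphParser` and `extract_open_graph` and the sample HTML are made up for this example; it assumes the page follows the usual `<meta property="og:..." content="...">` convention and that you already have the HTML as a string.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, content)


def extract_open_graph(html):
    """Return a dict of og:* properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Made-up sample page, for illustration only.
sample = """
<html><head>
<meta property="og:title" content="Example title">
<meta property="og:image" content="https://example.com/cover.png">
<meta name="description" content="Not an Open Graph tag">
</head><body></body></html>
"""
print(extract_open_graph(sample))
```

A preview tile would then typically be built from `og:title`, `og:description` and `og:image` when they are present.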

A fallback approach would be to implement site specific parsing code for most popular websites that don't already conform to a standardized format or to try and generically guess what the most prominent content on a given website is (for example, biggest image above the fold, first few sentences of the first paragraph, text in heading elements etc).
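The generic guessing could look something like the sketch below. It is only a crude stand-in for the heuristics described: without rendering the page you cannot know what is above the fold, so it assumes the image with the largest declared `width` × `height` attributes (or else the first image) is the most prominent one, and takes the first non-empty `<h1>` and `<p>` as title and text. All names and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


def _declared_area(attr):
    """Area from width/height attributes, or 0 when missing or non-numeric."""
    try:
        return int(attr.get("width")) * int(attr.get("height"))
    except (TypeError, ValueError):
        return 0


class FallbackPreviewParser(HTMLParser):
    """Guesses a title, a text snippet and an image from plain HTML."""

    _WANTED = {"h1": "title", "p": "text"}

    def __init__(self):
        super().__init__()
        self.preview = {"title": None, "text": None, "image": None}
        self._best_area = -1
        self._capturing = None  # "h1" or "p" while its text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "img" and attr.get("src"):
            area = _declared_area(attr)
            if area > self._best_area:  # first image wins ties
                self._best_area, self.preview["image"] = area, attr["src"]
        elif tag in self._WANTED and self._capturing is None:
            if self.preview[self._WANTED[tag]] is None:
                self._capturing, self._buffer = tag, []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capturing:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.preview[self._WANTED[tag]] = text
            self._capturing = None


def guess_preview(html):
    parser = FallbackPreviewParser()
    parser.feed(html)
    return parser.preview


# Made-up sample page, for illustration only.
sample_page = """
<html><body>
<h1>Sample  headline</h1>
<img src="/icon.png" width="16" height="16">
<img src="/hero.jpg" width="800" height="400">
<p>First paragraph of the <b>article</b>.</p>
<p>Second paragraph.</p>
</body></html>
"""
print(guess_preview(sample_page))
```

Any of the three fields can come back as `None`, which is exactly the unreliability this approach has to live with.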

The problem with the former approach is that you have to maintain the parsers as those websites change and evolve, and with the latter that you simply cannot reliably predict what's important on a page, and you can't expect to always find what you're looking for either (images for the thumbnail, for example).

Since you will never be able to generate meaningful previews for 100% of websites, it boils down to a simple question: what's an acceptable rate of successful link previews? If it's close to what you can get by parsing standard meta information, I'd stick with that and save myself a lot of headache. If not, as an alternative to the libraries shared above, you can also have a look at paid services/APIs, which will likely cover more use cases than you could on your own.




Answer 2:


This is what the Open Graph standard is for. For instance, if you go to the Reddit post in the example, you can view the page information provided by HTML <meta /> tags (all the ones whose property attribute starts with og:).

However, you generally cannot get this data from script running inside a web browser: the same-origin policy blocks reading the response from another site's URL unless that site opts in via CORS headers. What Facebook seems to do instead is send the URL to their servers, have them perform the request to get the required information, and send it back.
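The server-side half of that idea might be sketched as follows in Python. This is not how Facebook does it, just a minimal illustration under assumptions: `fetch_html`, `USER_AGENT` and `MAX_BYTES` are names invented here, the `opener` parameter exists only so the network call can be swapped out, and a real service would also need rate limiting and protection against requests to internal addresses.

```python
import urllib.request
from urllib.parse import urlsplit

USER_AGENT = "link-preview-sketch/0.1"  # hypothetical identifier
MAX_BYTES = 512 * 1024  # meta tags live in <head>, so a prefix is usually enough


def fetch_html(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch a page on the server and return (a prefix of) its HTML as text."""
    if urlsplit(url).scheme not in ("http", "https"):
        raise ValueError("only http(s) URLs are fetched")
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with opener(request, timeout=timeout) as response:
        raw = response.read(MAX_BYTES)
        charset = response.headers.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # server announced a charset Python does not know
        return raw.decode("utf-8", errors="replace")
```

Your client (browser or app) would send the URL to your own endpoint, the server would call something like `fetch_html(url)`, extract the Open Graph data from the result, and return it as JSON. Since the browser is then talking to your own server, the cross-origin restriction no longer applies.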




Answer 3:


You can develop your own Link Preview plugin or use one of the existing third-party plugins.

Posting a full example here is not possible, but I can point you to popular Link Preview plugins, which may be free or paid.

  • Link Preview

  • jQuery Preview:

You can test your URL in the demo here, which returns the response as JSON and raw data. You can also use the API.

Hope it helps.



Source: https://stackoverflow.com/questions/49341918/how-can-i-scrape-text-and-images-from-a-random-web-page
