Scrape web pages in real time with Node.js

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query

8 Answers

孤独总比滥情好

2020-11-29 16:19
All aforementioned solutions presume running the scraper locally. This means you will be severely limited in performance (due to running them in sequence or in a limited set of threads). A better approach, imho, is to rely on an existing, albeit commercial, scraping grid.

Here is an example:
```
var bobik = new Bobik("YOUR_AUTH_TOKEN");
bobik.scrape({
  urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
  queries:  ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
}, function (scraped_data) {
  if (!scraped_data) {
    console.log("Data is unavailable");
    return;
  }
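  // Iterate over the URLs themselves; a for...in loop over the keys array
  // would yield array indices ("0", "1", ...) instead of the URLs.
  Object.keys(scraped_data).forEach(function (url) {
    console.log("Results from " + url + ": " + scraped_data[url]);
  });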
});
```
Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).

You can download Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk
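
For comparison, the "running them in sequence" limitation mentioned above can be partly eased in a local Node.js scraper by keeping several requests in flight at once. The following is only a rough sketch of that idea, not part of Bobik: `mapWithConcurrency` is a hypothetical helper name, and the `worker` callback stands in for whatever fetch-and-parse step you use.

```javascript
// Minimal sketch: run an async task over many inputs with bounded concurrency.
// `mapWithConcurrency` is a hypothetical helper, not part of any library.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start at most `limit` runners (at least one) and wait for all of them.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results; // same order as `items`
}

// Example usage (assumes Node.js 18+ for the global fetch; `urls` is your own list):
// const pages = await mapWithConcurrency(urls, 5, async (url) => (await fetch(url)).text());
```

Results come back in input order regardless of which request finishes first. A single process is still bounded by its own network and CPU, so this does not replace a remote grid for very large jobs.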

没有蜡笔的小新

2020-11-29 16:21

You don't always need jQuery. If you work with the DOM returned from jsdom, for example, you can easily extract what you need yourself (and you don't have to worry about cross-browser issues). See: https://gist.github.com/1335009. That's not taking anything away from node.io, just saying you might be able to do it yourself depending...
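
As a rough illustration of the "do it yourself" approach (not the code from the gist), here is a minimal sketch that pulls links out of a page using only standard DOM calls. The helper names `extractLinks` and `linksFromHtml` are made up for this example, and it assumes jsdom has been installed with `npm install jsdom`.

```javascript
// Minimal sketch: pull links out of a page with plain DOM calls, no jQuery.
// `extractLinks` and `linksFromHtml` are hypothetical helper names.

// Works on any object exposing the standard DOM `querySelectorAll` method,
// such as the `document` that jsdom builds.
function extractLinks(document) {
  return Array.from(document.querySelectorAll("a[href]")).map(function (a) {
    return { text: a.textContent.trim(), href: a.getAttribute("href") };
  });
}

// Parses an HTML string with jsdom (assumes `npm install jsdom`).
// The require sits inside the function so this file loads even without jsdom.
function linksFromHtml(html) {
  const { JSDOM } = require("jsdom");
  const dom = new JSDOM(html);
  return extractLinks(dom.window.document);
}

// Example usage:
// console.log(linksFromHtml('<a href="/flights">Flights</a>'));
```

Keeping the extraction logic separate from the parser means the same function should also work on a document produced some other way, as long as it supports `querySelectorAll`.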


小鲜肉

2020-11-29 16:22

Check out https://github.com/rc0x03/node-promise-parser:

Fast: uses libxml C bindings
Lightweight: no dependencies like jQuery, cheerio, or jsdom
Clean: promise-based interface, no more nested callbacks
Flexible: supports both CSS and XPath selectors

夕颜

2020-11-29 16:34

I see most answers are on the right path with cheerio and so forth. However, once you get to the point where you need to parse and execute JavaScript (à la SPAs and more), I'd check out https://github.com/joelgriffith/navalia (I'm the author). Navalia is built to support scraping in a headless-browser context, and it's pretty quick. Thanks!

北海茫月

2020-11-29 16:37

Node.io seems to take the cake :-)

北海茫月

2020-11-29 16:37

I've been doing research myself, and https://npmjs.org/package/wscraper describes itself as:

"a web scraper agent based on cheerio.js, a fast, flexible, and lean implementation of core jQuery; built on top of request.js; inspired by http-agent.js"

Very low usage (according to npmjs.org) but worth a look for any interested parties.
