Scrape web pages in real time with Node.js


What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query


8 Answers

梦毁少年i

2020-11-29 16:37
The new way using ES7/promises

Usually when you're scraping, you want some method to
1. Get the resource from the web server (usually an HTML document)
2. Read that resource and work with it either as
  1. a DOM/tree structure that you can navigate, or
  2. a stream of tokens, parsed with something like SAX.
Both tree and token parsing have advantages, but the tree approach is usually substantially simpler, so we'll do that. Check out request-promise; here is how it works:
```
const rp = require('request-promise');
const cheerio = require('cheerio'); // Basically jQuery for node.js 

const options = {
    uri: 'http://www.google.com',
    transform: function (body) {
        return cheerio.load(body);
    }
};

rp(options)
    .then(function ($) {
        // Process html like you would with jQuery... 
    })
    .catch(function (err) {
        // Crawling failed or Cheerio failed to parse the body
        console.error(err);
    });
```
This uses cheerio, which is essentially a lightweight, server-side, jQuery-esque library (it doesn't need a window object or jsdom).

Because you're using promises, you can also write this as an async function. The code looks synchronous, but it still runs asynchronously (async/await was standardized in ES2017):
```
async function parseDocument() {
    let $;
    try {
      $ = await rp(options);
    } catch (err) {
      console.error(err);
      return; // nothing to parse if the request failed
    }

    console.log( $('title').text() ); // prints just the text in the <title>
}
```
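Since every request is a promise, several pages can also be fetched concurrently, which is probably what you want for a kayak-style search that fans out to many sites. Here is a minimal sketch of that idea. `scrapeAll`, `fetchPage`, and `parse` are hypothetical names of mine, not part of request-promise or cheerio: `fetchPage(url)` is assumed to return a promise for the HTML string, and `parse(html, url)` is whatever extraction you need.

```javascript
// Hypothetical helper: fetch every URL concurrently and parse each body.
// `fetchPage(url)` is assumed to resolve to the page's HTML as a string.
// `parse(html, url)` turns that HTML into whatever result you need.
async function scrapeAll(urls, fetchPage, parse) {
    // Start all requests at once; allSettled waits for every one to finish,
    // whether it succeeded or failed.
    const settled = await Promise.allSettled(
        urls.map(async (url) => parse(await fetchPage(url), url))
    );

    // Report successes and failures separately so one bad page
    // does not sink the whole query.
    return settled.map((outcome, i) =>
        outcome.status === 'fulfilled'
            ? { url: urls[i], ok: true, value: outcome.value }
            : { url: urls[i], ok: false, error: outcome.reason }
    );
}
```

With the libraries above, a call could look something like `scrapeAll(urls, (url) => rp(url), (html) => cheerio.load(html)('title').text())`. The total time is then roughly that of the slowest page rather than the sum of all of them.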
野趣味

2020-11-29 16:38

This is my easy-to-use, general-purpose scraper, https://github.com/harish2704/html-scrapper, written for Node.js. It extracts information based on predefined schemas. A schema definition includes a CSS selector and a data extraction function. It currently uses cheerio for DOM parsing.
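To illustrate the general idea of a schema (a CSS selector plus an extraction function per field), here is a rough sketch. This is not html-scrapper's actual API; `productSchema`, `applySchema`, the field names, and the selectors are all hypothetical. `$` is assumed to be any jQuery-style selector function, such as the one returned by `cheerio.load(html)`.

```javascript
// Hypothetical schema format (NOT html-scrapper's real API):
// each field pairs a CSS selector with a function that pulls a value
// out of the matched selection.
const productSchema = {
    name: {
        selector: 'h1.name',
        extract: (el) => el.text().trim(),
    },
    price: {
        selector: 'span.price',
        // Strip currency symbols and other non-numeric characters.
        extract: (el) => parseFloat(el.text().replace(/[^0-9.]/g, '')),
    },
};

// `$` is a jQuery-style selector function, e.g. cheerio.load(html).
function applySchema($, schema) {
    const record = {};
    for (const [field, { selector, extract }] of Object.entries(schema)) {
        record[field] = extract($(selector));
    }
    return record;
}
```

Keeping the selectors in data rather than in code means a new site only needs a new schema object, not a new scraping function.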
