Scrape web pages in real time with Node.js

闹比i 2020-11-29 15:43

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query

8 answers
  • 2020-11-29 16:37

    The new way, using promises and ES2017 async/await

    Usually when you're scraping, you want some method to

    1. Get the resource from the web server (usually an HTML document)
    2. Read that resource and work with it as either
      1. A DOM/tree structure that you can navigate, or
      2. A stream of tokens, parsed with something like SAX.
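
To make the token-based option concrete, here is a minimal sketch of event-style scanning that uses only built-in JavaScript. The `scanTokens` and `extractTitle` functions and the handler names are hypothetical, and the regular expression is a toy that ignores comments, scripts, and malformed markup; a real project would reach for an actual SAX-style parser instead.

```javascript
// Toy token scanner: walks the markup once and fires callbacks,
// without ever building a tree. Hypothetical helper, not a real SAX parser.
function scanTokens(html, handlers) {
  // Matches either a tag (<name ...> or </name>) or a run of text.
  const tokenPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)/g;
  let match;
  while ((match = tokenPattern.exec(html)) !== null) {
    const [, slash, tagName, text] = match;
    if (text !== undefined) {
      handlers.onText(text);
    } else if (slash === '/') {
      handlers.onCloseTag(tagName.toLowerCase());
    } else {
      handlers.onOpenTag(tagName.toLowerCase());
    }
  }
}

// Collect the text inside <title> by tracking state between events.
function extractTitle(html) {
  let insideTitle = false;
  let title = '';
  scanTokens(html, {
    onOpenTag(name) { if (name === 'title') insideTitle = true; },
    onCloseTag(name) { if (name === 'title') insideTitle = false; },
    onText(text) { if (insideTitle) title += text; },
  });
  return title.trim();
}

// Made-up sample document for illustration.
const sample = '<html><head><title>Cheap flights</title></head><body><p>Hi</p></body></html>';
console.log(extractTitle(sample));
```

Note how the caller has to carry state (`insideTitle`) between events. That bookkeeping is the main cost of the token approach, and it is what a tree structure spares you.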

    Both tree and token parsing have advantages, but a tree is usually substantially simpler, so that's what we'll do. Check out request-promise; here is how it works:

    const rp = require('request-promise');
    const cheerio = require('cheerio'); // Basically jQuery for node.js 
    
    const options = {
        uri: 'http://www.google.com',
        transform: function (body) {
            return cheerio.load(body);
        }
    };
    
    rp(options)
        .then(function ($) {
            // Process html like you would with jQuery... 
        })
        .catch(function (err) {
            // Crawling failed or Cheerio could not parse the response
            console.error(err);
        });
    

    This uses cheerio, which is essentially a lightweight, server-side, jQuery-esque library (it doesn't need a window object or jsdom).

    Because you're using promises, you can also write this in an asynchronous function. It'll look synchronous, but it'll still be asynchronous, using async/await (ES2017):

    async function parseDocument() {
        let $;
        try {
            $ = await rp(options);
        } catch (err) {
            // Stop here: $ is undefined if the request failed
            console.error(err);
            return;
        }

        console.log($('title').text()); // prints just the text in the <title>
    }
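
Because each request is just a promise, several pages can also be requested at the same time instead of one after another. The following is a minimal sketch under stated assumptions: `fetchPage` is a hypothetical stand-in for `rp(options)` that returns canned markup after a short delay (so the example runs without network access), and the URLs are placeholders.

```javascript
// Hypothetical stand-in for rp(options): resolves with a page body after a
// short delay, or rejects for one URL, so the sketch runs without a network.
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (url.includes('broken')) {
        reject(new Error(`request failed: ${url}`));
      } else {
        resolve(`<title>Results from ${url}</title>`);
      }
    }, 10);
  });
}

// Start every request before awaiting any of them, then wait for all to settle.
// Promise.allSettled keeps one failed site from discarding the other results.
async function scrapeAll(urls) {
  const settled = await Promise.allSettled(urls.map((url) => fetchPage(url)));
  return settled.map((outcome, i) => ({
    url: urls[i],
    ok: outcome.status === 'fulfilled',
    body: outcome.status === 'fulfilled' ? outcome.value : null,
    error: outcome.status === 'rejected' ? outcome.reason.message : null,
  }));
}

// Placeholder URLs for illustration only.
const urls = ['http://a.example', 'http://broken.example', 'http://b.example'];
scrapeAll(urls).then((results) => {
  for (const r of results) {
    console.log(r.url, r.ok ? r.body : r.error);
  }
});
```

With this shape the total wait is roughly that of the slowest request, not the sum of all of them. Each `body` could then be handed to `cheerio.load` as in the examples above.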
    
  • 2020-11-29 16:38

    This is my easy-to-use, general-purpose scraper, https://github.com/harish2704/html-scrapper, written for Node.js. It can extract information based on predefined schemas. A schema definition includes a CSS selector and a data extraction function. It currently uses cheerio for DOM parsing.
