What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites at once.
Usually when you're scraping you want to use some method to parse the HTML you've downloaded, either by building a tree (a DOM) or by token/stream parsing.
Both tree and token parsing have their advantages, but tree parsing is usually substantially simpler, so we'll do that. Check out request-promise; here is how it works:
const rp = require('request-promise');
const cheerio = require('cheerio'); // Basically jQuery for node.js

const options = {
    uri: 'http://www.google.com',
    transform: function (body) {
        return cheerio.load(body);
    }
};

rp(options)
    .then(function ($) {
        // Process html like you would with jQuery...
    })
    .catch(function (err) {
        // Crawling failed or Cheerio could not parse the HTML...
    });
This uses cheerio, which is essentially a lightweight, server-side, jQuery-esque library (one that doesn't need a window object or jsdom).
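For example, once you have the loaded $, you can pull data out with familiar jQuery-style selectors. A minimal sketch (the page structure and selectors here are assumptions for illustration, not tied to any real site):

rp(options)
    .then(function ($) {
        // Collect every link's text and href from the page
        const links = [];
        $('a').each(function () {
            links.push({
                text: $(this).text().trim(),
                href: $(this).attr('href')
            });
        });
        console.log(links);
    })
    .catch(function (err) {
        console.error(err);
    });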
Because you're using promises, you can also write this in an asynchronous function. It will look synchronous, but it runs asynchronously thanks to ES7's async/await:
async function parseDocument() {
    let $;
    try {
        $ = await rp(options);
    } catch (err) {
        console.error(err);
        return; // Bail out so we don't call an undefined $ below
    }
    console.log($('title').text()); // Prints just the text inside <title>
}
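To get the kayak.com-style behaviour from the original question, you can dispatch one of these requests per site concurrently and wait on all of them with Promise.all. A minimal sketch, assuming hypothetical site URLs and a made-up '.result' selector (both are assumptions for illustration only):

const sites = [
    'http://example.com/search?q=flights', // hypothetical URLs
    'http://example.org/search?q=flights'
];

async function searchAll() {
    // Fire one request per site in parallel, each transformed into a cheerio document
    const pages = await Promise.all(
        sites.map(uri => rp({ uri, transform: body => cheerio.load(body) }))
    );

    // Extract a result from each loaded document
    return pages.map($ => $('.result').first().text().trim());
}

searchAll()
    .then(results => console.log(results))
    .catch(err => console.error(err));

Because the requests run concurrently, the total time is roughly that of the slowest site rather than the sum of all of them.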
This is my easy-to-use, general-purpose scraper, https://github.com/harish2704/html-scrapper, written for Node.js. It can extract information based on predefined schemas. A schema definition includes a CSS selector and a data extraction function. It currently uses cheerio for DOM parsing.
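To illustrate the general idea of selector-plus-extractor schemas (this is a hedged sketch of the concept built directly on cheerio, not html-scrapper's actual API, which may differ), each output field can map to a CSS selector and a function that turns the matched elements into a value:

const cheerio = require('cheerio');

// Hypothetical schema: field name -> { selector, extract }
const schema = {
    title:   { selector: 'title', extract: el => el.text().trim() },
    heading: { selector: 'h1',    extract: el => el.first().text().trim() },
    links:   { selector: 'a',     extract: el => el.map((i, a) => a.attribs.href).get() }
};

// Apply every schema entry to one HTML document and return a plain object
function applySchema(html, schema) {
    const $ = cheerio.load(html);
    const out = {};
    for (const [field, { selector, extract }] of Object.entries(schema)) {
        out[field] = extract($(selector));
    }
    return out;
}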