Is it possible to use jQuery selectors/DOM manipulation on the server-side using Node.js?
This is my formula for making a simple crawler in Node.js. It is the main reason for wanting to do DOM manipulation on the server side, and it is probably the reason you got here.
First, use request to download the page to be parsed. When the download is complete, hand it to cheerio and start manipulating the DOM just like with jQuery.
Working example:
var
    request = require('request'),
    cheerio = require('cheerio');

function parse(url) {
    // Download the page...
    request(url, function (error, response, body) {
        // ...then hand the body to cheerio and query it like jQuery
        var
            $ = cheerio.load(body);

        $('.question-summary .question-hyperlink').each(function () {
            console.info($(this).text());
        });
    });
}

parse('http://stackoverflow.com/');
This example will print to the console all the top questions shown on the SO home page. This is why I love Node.js and its community. It couldn't get easier than that :-)
Install dependencies:
npm install request cheerio
And run (assuming the script above is in a file named crawler.js):
node crawler.js
Some pages will have non-English content encoded in a certain encoding, and you will need to decode it to UTF-8. For instance, a page in Brazilian Portuguese (or any other language of Latin origin) will likely be encoded in ISO-8859-1 (a.k.a. "latin1"). When decoding is needed, I tell request not to interpret the content in any way, and instead use iconv-lite to do the job.
Working example:
var
    request = require('request'),
    iconv = require('iconv-lite'),
    cheerio = require('cheerio');

var
    PAGE_ENCODING = 'utf-8'; // change to match page encoding

function parse(url) {
    request({
        url: url,
        encoding: null // do not interpret content yet
    }, function (error, response, body) {
        var
            $ = cheerio.load(iconv.decode(body, PAGE_ENCODING));

        $('.question-summary .question-hyperlink').each(function () {
            console.info($(this).text());
        });
    });
}

parse('http://stackoverflow.com/');
Before running, install dependencies:
npm install request iconv-lite cheerio
And then finally:
node crawler.js
The next step would be to follow links. Say you want to list all posters from each top question on SO. You first have to list all the top questions (as in the example above) and then follow each link, parsing each question's page to get the list of involved users.
When you start following links, callback hell can set in. To avoid it, you should use some kind of promises, futures, or the like. I always keep async in my toolbelt. So, here is a full example of a crawler using async:
var
    url = require('url'),
    request = require('request'),
    async = require('async'),
    cheerio = require('cheerio');

var
    baseUrl = 'http://stackoverflow.com/';

// Gets a page and returns a callback with a $ object
function getPage(url, parseFn) {
    request({
        url: url
    }, function (error, response, body) {
        parseFn(cheerio.load(body));
    });
}

getPage(baseUrl, function ($) {
    var
        questions;

    // Get list of questions
    questions = $('.question-summary .question-hyperlink').map(function () {
        return {
            title: $(this).text(),
            url: url.resolve(baseUrl, $(this).attr('href'))
        };
    }).get().slice(0, 5); // limit to the top 5 questions

    // For each question
    async.map(questions, function (question, questionDone) {
        getPage(question.url, function ($$) {
            // Get list of users
            question.users = $$('.post-signature .user-details a').map(function () {
                return $$(this).text();
            }).get();

            questionDone(null, question);
        });
    }, function (err, questionsWithPosters) {
        // This function is called by async when all questions have been parsed
        questionsWithPosters.forEach(function (question) {
            // Prints each question along with its user list
            console.info(question.title);
            question.users.forEach(function (user) {
                console.info('\t%s', user);
            });
        });
    });
});
Before running:
npm install request async cheerio
Run a test:
node crawler.js
Sample output:
Is it possible to pause a Docker image build?
conradk
Thomasleveil
PHP Image Crop Issue
Elyor
Houston Molinar
Add two object in rails
user1670773
Makoto
max
Asymmetric encryption discrepancy - Android vs Java
Cookie Monster
Wand Maker
Objective-C: Adding 10 seconds to timer in SpriteKit
Christian K Rider
And those are the basics you need to know to start making your own crawlers :-)
Not that I know of. The DOM is a client-side thing (jQuery doesn't parse the HTML, but the DOM).
Here are some current Node.js projects:
https://github.com/ry/node/wiki (https://github.com/nodejs/node)
And SimonW's djangode is pretty damn cool...
First of all, install it:
npm install jquery -S
After installing it, you can use it as shown below:
import $ from 'jquery';
window.jQuery = window.$ = $;
$(selector).hide();
You can check out a full tutorial that I wrote here: https://medium.com/fbdevclagos/how-to-use-jquery-on-node-df731bd6abc7
In 2016 things are way easier. Install jQuery for Node.js from your console:
npm install jquery
Bind it to the variable $ (for example; that's what I am used to) in your Node.js code:
var $ = require("jquery");
do stuff:
$.ajax({
    url: 'gimme_json.php',
    dataType: 'json',
    method: 'GET',
    data: { "now": true }
});
This also works for gulp, as it is based on Node.js.
I believe the answer to this is now yes.
https://github.com/tmpvar/jsdom
var navigator = { userAgent: "node-js" };
var jQuery = require("./node-jquery").jQueryInit(window, navigator);
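If you are on a current setup, the same idea can be sketched with jsdom and the jquery npm package (a minimal sketch, assuming both packages are installed via npm; the sample HTML is only illustrative):
// Build a server-side DOM with jsdom and attach jQuery to it
var JSDOM = require('jsdom').JSDOM;
var window = new JSDOM('<ul><li>one</li><li>two</li></ul>').window;

// The jquery package exports a factory when there is no global window
var $ = require('jquery')(window);

console.info($('li').length);         // 2
console.info($('li').first().text()); // "one"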
An alternative is to use Underscore.js. It should provide what you might have wanted server-side from jQuery.
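For the utility side of jQuery ($.each, $.map, $.extend and friends), a quick sketch of what that looks like, assuming Underscore has been installed with npm install underscore (the sample data is made up):
// Underscore covers the utility helpers that jQuery also provides
var _ = require('underscore');

var users = [{ name: 'ana', rep: 120 }, { name: 'bob', rep: 45 }];

console.info(_.pluck(users, 'name'));       // [ 'ana', 'bob' ]
console.info(_.extend({ a: 1 }, { b: 2 })); // { a: 1, b: 2 }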