Incremental and non-incremental URLs in Node.js with cheerio and request

Question

I am trying to scrape data from a page using cheerio and request in the following way:

1) go to url 1a (http://example.com/0)
2) extract url 1b (http://example2.com/52)
3) go to url 1b
4) extract some data and save
5) go to url 1a+1 (http://example.com/1, let's call it 2a)
6) extract url 2b (http://example2.com/693)
7) go to url 2b
8) extract some data and save etc...

I am struggling to work out how to do this. (Note: I am only familiar with Node.js and cheerio/request for this task, even though they are likely not the most elegant choice, so I am not looking for alternative libraries or languages, sorry.) I think I am missing something, because I can't even think how this could work.

EDIT

Let me try this another way. Here is the first part of the code:

    var request = require('request'),
        cheerio = require('cheerio');

    request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=0', function(error, response, html) {

        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html, {
                xmlMode: true
            });

            // id of the single <work> on this result page
            var id = $('work').attr('id');
            // total number of results, read from the <records> element
            var total = $('records').attr('total');
        }
    });

The first returned page looks like this:

<response>
  <query>date:[2000 TO 2014]</query>
  <zone name="book">
    <records s="0" n="1" total="69977" next="/result?l-advformat=Thesis&sortby=dateDesc&q=+date%3A%5B2000+TO+2014%5D&l-availability=y&l-australian=y&n=1&zone=book&s=1">
      <work id="189231549" url="/work/189231549">
        <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
        <title>
        Design of physiological control and magnetic levitation systems for a total artificial heart
        </title>
        <contributor>Greatrex, Nicholas Anthony</contributor>
        <issued>2014</issued>
        <type>Thesis</type>
        <holdingsCount>1</holdingsCount>
        <versionCount>1</versionCount>
        <relevance score="0.001961126">vaguely relevant</relevance>
        <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
      </work>
    </records>
  </zone>
</response>

The URL above needs to increase incrementally (s=0, s=1, etc.) 'total' number of times. The 'id' needs to be fed into the URL below in a second request:

request('http://api.trove.nla.gov.au/work/' + id + '?key=6k6oagt6ott4ohno&reclevel=full', function(error, response, html) {

    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html, {
          xmlMode: true
        });

        //extract data here etc.

    }
});
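
To make the two URL patterns concrete, here is a minimal sketch of how both could be built from `s` and `id`. The helper names `searchUrl` and `workUrl` are made up for illustration; the URLs themselves are the ones shown above.

```javascript
// Hypothetical helpers (names are illustrative, not part of any library).
var API_KEY = '6k6oagt6ott4ohno';
var BASE = 'http://api.trove.nla.gov.au';

// URL of the s-th search result page (one record per page, n=1).
function searchUrl(s) {
  return BASE + '/result?key=' + API_KEY +
    '&zone=book&l-advformat=Thesis&sortby=dateDesc' +
    '&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1' +
    '&s=' + s;
}

// URL of the full record for a work id taken from a search result page.
function workUrl(id) {
  return BASE + '/work/' + id + '?key=' + API_KEY + '&reclevel=full';
}
```

What is still missing is how to chain the two requests for every value of `s`.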

For example, when using id="189231549" returned by the first request, the second returned page looks like this:

<work id="189231549" url="/work/189231549">
  <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
  <title>
    Design of physiological control and magnetic levitation systems for a total artificial heart
  </title>
  <contributor>Greatrex, Nicholas Anthony</contributor>
  <issued>2014</issued>
  <type>Thesis</type>
  <subject>Total Artificial Heart</subject>
  <subject>Magnetic Levitation</subject>
  <subject>Physiological Control</subject>
  <abstract>
    Total Artificial Hearts are mechanical pumps which can be used to replace the failing natural heart. This novel study developed a means of controlling a new design of pump to reproduce physiological flow bringing closer the realisation of a practical artificial heart. Using a mathematical model of the device, an optimisation algorithm was used to determine the best configuration for the magnetic levitation system of the pump. The prototype device was constructed and tested in a mock circulation loop. A physiological controller was designed to replicate the Frank-Starling like balancing behaviour of the natural heart. The device and controller provided sufficient support for a human patient while also demonstrating good response to various physiological conditions and events. This novel work brings the design of a practical artificial heart closer to realisation.
  </abstract>
  <language>English</language>
  <holdingsCount>1</holdingsCount>
  <versionCount>1</versionCount>
  <tagCount>0</tagCount>
  <commentCount>0</commentCount>
  <listCount>0</listCount>
  <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>

So my question now is: how do I tie these two parts (loops) together to achieve the result (download and parse about 70,000 pages)?

I have no idea how to code this in JavaScript for Node.js. I am new to JavaScript.

Answer 1:

You can find out how to do it by studying existing well-known website copiers (closed source or open source).

For example, use a trial copy of http://www.tenmax.com/teleport/pro/home.htm to scrape your pages, then try the same with http://www.httrack.com, and you should get a quite clear idea of how they did it (and how you can do it).

The key programming concepts are a lookup cache and a task queue.

Recursion is not a good fit here if your solution needs to scale to several Node.js worker processes and to many pages.
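
As a rough illustration of those two concepts, here is a minimal single-process sketch, not how Teleport or HTTrack actually work. Everything in it is an assumption for illustration: `createCrawler`, `fetchPage(url, cb)` (any callback-style fetcher, for example a thin wrapper around request) and `handlePage(url, body, enqueue)` (your parsing code, which may enqueue follow-up URLs).

```javascript
// Hypothetical sketch: a task queue of pending URLs plus a lookup cache of
// URLs that were already queued, so no page is fetched twice.
function createCrawler(fetchPage, handlePage, concurrency) {
  var queue = [];        // task queue: URLs waiting to be fetched
  var seen = new Set();  // lookup cache: every URL ever queued
  var active = 0;        // number of requests currently in flight
  var onDrain = null;    // called once when all work is finished

  // Add a URL unless it is already known; returns true if it was added.
  function enqueue(url) {
    if (seen.has(url)) return false;
    seen.add(url);
    queue.push(url);
    return true;
  }

  function runTask(url) {
    active++;
    fetchPage(url, function (err, body) {
      active--;
      // Failed pages are simply skipped in this sketch.
      if (!err) handlePage(url, body, enqueue);
      pump();
    });
  }

  // Start as many queued tasks as the concurrency limit allows.
  function pump() {
    while (active < concurrency && queue.length > 0) {
      runTask(queue.shift());
    }
    if (active === 0 && queue.length === 0 && onDrain) {
      var done = onDrain;
      onDrain = null;
      done();
    }
  }

  return {
    enqueue: enqueue,
    start: function (done) { onDrain = done; pump(); }
  };
}
```

For the task in the question, `handlePage` would enqueue the work URL when it sees a search result page and save the extracted data when it sees a work page.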

EDIT: after clarifying comments

Before you start reworking your scraping engine into a more scalable architecture, as a new Node.js developer you can start simply with a synchronous-style alternative to Node.js callback hell, as provided by the wait.for package created by @lucio-m-tato.

The code below worked for me with the links you provided:

var request = require('request');
var cheerio = require('cheerio');
var wait = require("wait.for");

// Adapts request() to the (err, result) callback convention that wait.for expects.
function requestWaitForWrapper(url, callback) {
  request(url, function(error, response, html) {
    if (error)
      callback(error, response);
    else if (response.statusCode == 200)
      callback(null, html);
    else
      callback(new Error("Status not 200 OK"), response);
  });
}

function readBookInfo(baseUrl, s) {
  var html = wait.for(requestWaitForWrapper, baseUrl + '&s=' + s.toString());
  var $ = cheerio.load(html, {
    xmlMode: true
  });

  return {
    s: s,
    id: $('work').attr('id'),
    total: parseInt($('records').attr('total'), 10)
  };
}

function readWorkInfo(id) {
  var html = wait.for(requestWaitForWrapper, 'http://api.trove.nla.gov.au/work/' + id.toString() + '?key=6k6oagt6ott4ohno&reclevel=full');
  var $ = cheerio.load(html, {
    xmlMode: true
  });

  return {
    title: $('title').text(),
    contributor: $('contributor').text()
  }
}

function main() {
  var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1';
  var baseInfo = readBookInfo(baseBookUrl, 0);

  // Walk every result page in turn; each call blocks inside the fiber until its request finishes.
  for (var s = 0; s < baseInfo.total; s++) {
    var bookInfo = readBookInfo(baseBookUrl, s);
    var workInfo = readWorkInfo(bookInfo.id);
    console.log(bookInfo.id + ";" + workInfo.contributor + ";" + workInfo.title);
  }
}

wait.launchFiber(main);

Answer 2:

You could use the async module to handle multiple requests and iterate through several pages. Read more about async here: https://github.com/caolan/async.
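
A minimal sketch of what that could look like, using async's `timesSeries` to run the two requests for s = 0, 1, ... one page at a time. The function `scrapeAll` and the helpers `fetchId(s, cb)` and `fetchWork(id, cb)` are hypothetical: they stand for callback-style wrappers around the two request/cheerio calls from the question. The library is passed in as `asyncLib` only so the sketch does not depend on how it is loaded.

```javascript
// Sketch only: `asyncLib` is expected to be require('async').
// `fetchId` and `fetchWork` are hypothetical helpers that call back with
// (err, id) and (err, work) respectively.
function scrapeAll(asyncLib, fetchId, fetchWork, total, done) {
  // timesSeries calls the iteratee for s = 0 .. total - 1, strictly in order,
  // and collects whatever each iteration passes to next().
  asyncLib.timesSeries(total, function (s, next) {
    fetchId(s, function (err, id) {
      if (err) return next(err);
      fetchWork(id, function (err2, work) {
        if (err2) return next(err2);
        next(null, { s: s, id: id, work: work });
      });
    });
  }, done);
}
```

With the real library this would be called as `scrapeAll(require('async'), fetchId, fetchWork, total, callback)`. For about 70,000 pages it is probably better to save each result inside the iteratee than to collect them all in memory.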

Source: https://stackoverflow.com/questions/25102561/incremental-and-non-incremental-urls-in-node-js-with-cheerio-and-request

Tags

node.js

url

cheerio