Question
I was happy doing my scraping with R but found its limits. While trying to scrape the case summaries of Argentina's Supreme Court, I ran into a problem for which I cannot find an answer. It is probably a consequence of learning by doing, so please do point out where my code works but follows bad practice. Anyway, I managed to:
- Access the search page.
- Enter a relevant taxonomy term (e.g. 'DECRETO DE NECESIDAD Y URGENCIA') in #voces, click search, and scrape the .datosSumario elements, which hold the information I need (case name, date, reporter, and so on).
The code is below:
const puppeteer = require('puppeteer');
let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
  // wait until element ready
  await Promise.all([
    page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
    page.waitForSelector('.ui-menu-item')
  ]);
  await page.click('.ui-menu-item');
  await Promise.all([
    page.click('.glyphicon-search'),
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
  ]);
  // Here we are in the place we want to be, and then capture what we need:
  const result = await page.evaluate(() => {
    let data = []; // Create an empty array that will store our data
    let elements = document.querySelectorAll('.row'); // Select all rows
    for (var element of elements){ // Loop through each row
      let title = document.querySelector('.datosSumario').innerText;
      data.push({title}); // Push an object with the data onto our array
    }
    return data; // Return our data array
  });
  //review ->
  await page.click('#paginate_button2')
  browser.close();
  return result;
};
scrape().then((value) => {
  console.log(value); // Success!
});
What I can't seem to do is go through the different pages. If you follow the link you'll see that the pagination is rather unusual: there is no "next page" button, only a set of page-number buttons. I can click them, but I cannot repeat the scraping section of the code above for each page. I tried a loop but did not manage to make it work. I've reviewed a few pagination tutorials but could not find one that deals with this particular kind of problem.
# Update
I was able to solve the pagination part, but I still can't get the function that scrapes the text I need to work inside the pagination loop (it works outside it, on a single page). Sharing in case someone can point out the obvious mistake I am probably making.
const puppeteer = require('puppeteer');
const fs = require('fs');
let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');
  // wait until element ready
  await Promise.all([
    page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
    page.waitForSelector('.ui-menu-item')
  ]);
  await page.click('.ui-menu-item');
  await Promise.all([
    page.click('.glyphicon-search'),
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
  ]);
  var results = []; // variable to hold the "sumarios" I need
  var lastPageNumber = 2; // I am using 2 to test, but I can choose any number and it works (in this case, the 31 pages I need to scrape)
  for (let index = 0; index < lastPageNumber; index++) {
    // wait 5 seconds for the page to load
    await page.waitFor(5000);
    // call and wait for the extraction function and concatenate results every iteration.
    // You can use results.push, but you will get a collection of collections at the end of the iteration
    results = results.concat(await MyFunction); // I call my function but the function does not work, see below
    if (index != lastPageNumber - 1) {
      await page.click('li.paginate_button.active + li a[onclick]'); // This does the trick
      await page.waitFor(5000);
    }
  }
  browser.close();
  return results;
};
async function MyFunction() {
  const data = await page.evaluate( () => // This bit works outside of the async function environment and I get the text I need in a single page
    Array.from(
      document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'), element => element.textContent)
  );
}
scrape().then((results) => {
  console.log(results); // Success!
});
Answer 1:
You can try document.querySelector('li.paginate_button.active + li a[onclick]') as a next page button equivalent. After clicking it, you can wait for a response whose URL starts with 'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex='.
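The suggestion above could be sketched roughly as follows. This is a minimal illustration rather than tested scraper code: the helper names `isPaginationUrl` and `goToNextPage` are hypothetical, and it assumes that on the last page no element matches the selector, which has not been checked against the site.

```javascript
// Prefix of the pagination request URL, as quoted in the answer.
const PAGINATION_URL_PREFIX =
  'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex=';

// Link inside the <li> that immediately follows the active page button.
const NEXT_PAGE_SELECTOR = 'li.paginate_button.active + li a[onclick]';

// Hypothetical helper: true when a URL belongs to a pagination request.
function isPaginationUrl(url) {
  return url.startsWith(PAGINATION_URL_PREFIX);
}

// Hypothetical helper: clicks the next page number and waits for its data.
// Resolves to false when no following page button is found (assumed to
// mean that the current page is the last one).
async function goToNextPage(page) {
  const nextLink = await page.$(NEXT_PAGE_SELECTOR);
  if (nextLink === null) {
    return false;
  }
  // Start waiting before the click so that a fast response is not missed.
  await Promise.all([
    page.waitForResponse((response) => isPaginationUrl(response.url())),
    nextLink.click(),
  ]);
  return true;
}
```

Inside the scraping loop, `await goToNextPage(page)` would then stand in for the fixed five-second pause after each click, and its return value could end the loop instead of a hard-coded page count.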
# For update
At first glance, there are some issues:
- MyFunction is not called: you need await MyFunction() instead of await MyFunction.
- You need to pass page into the MyFunction() scope:
results = results.concat(await MyFunction(page));
//...
async function MyFunction(page) {
// ...
}
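Put together, the two fixes might look like the sketch below. It also returns the extracted array from the function, which the question's `MyFunction` does not do (it assigns `data` but never returns it). The names `extractSummaries` and `scrapeSummaries` and the `delayMs` parameter are hypothetical. The pause is a plain timer here, so the sketch does not depend on which wait helper a given Puppeteer version offers.

```javascript
// Selectors taken from the question's code.
const SUMMARY_SELECTOR = 'div[class="col-sm-8 col-lg-9 datosSumario"]';
const NEXT_BUTTON_SELECTOR = 'li.paginate_button.active + li a[onclick]';

// Receives `page` as a parameter and returns the text of every summary.
async function extractSummaries(page) {
  // The selector is passed in as an argument because the callback runs
  // in the browser and cannot see variables from the Node.js side.
  return page.evaluate(
    (selector) =>
      Array.from(
        document.querySelectorAll(selector),
        (element) => element.textContent
      ),
    SUMMARY_SELECTOR
  );
}

// Hypothetical loop: collects summaries from `pageCount` consecutive pages.
async function scrapeSummaries(page, pageCount, delayMs = 5000) {
  let results = [];
  for (let index = 0; index < pageCount; index++) {
    // Note the parentheses and the argument: the function is actually
    // called, and it receives the page it should read from.
    results = results.concat(await extractSummaries(page));
    if (index !== pageCount - 1) {
      await page.click(NEXT_BUTTON_SELECTOR);
      // Fixed pause after the click, as in the question.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return results;
}
```

In the question's `scrape` function, the whole `for` loop could then be replaced by `const results = await scrapeSummaries(page, lastPageNumber);` placed after the search navigation has finished.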
Source: https://stackoverflow.com/questions/62492290/pagination-when-there-is-no-next-page-button-but-bunch-of-page-numbers-pages