Puppeteer not behaving like in Developer Console

Submitted by 孤人 on 2021-01-27 21:57:19

Question


I am trying to use Puppeteer to extract the title of this page: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106

I have the code below:

          (async () => {
            const browser = await puppet.launch({ headless: true });
            const page = await browser.newPage();
            await page.goto(req.params[0]); //this is the url
            title = await page.evaluate(() => {
              Array.from(document.querySelectorAll("meta")).filter(function (
                el
              ) {
                return (
                  (el.attributes.name !== null &&
                    el.attributes.name !== undefined &&
                    el.attributes.name.value.endsWith("title")) ||
                  (el.attributes.property !== null &&
                    el.attributes.property !== undefined &&
                    el.attributes.property.value.endsWith("title"))
                );
              })[0].attributes.content.value ||
                document.querySelector("title").innerText;
            });

I have tested this in the browser console, and even with Puppeteer's { headless: false } option. It works as expected in the browser, but when I actually run it with Node it gives me the following error:

10:54:21 AM web.1 |  (node:10288) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'attributes' of undefined
10:54:21 AM web.1 |      at __puppeteer_evaluation_script__:14:20

When I run the same Array.from ...querySelectorAll("meta")... query in the browser, I get the expected string:

"Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom"

I'm starting to think I'm doing something wrong with the async promises, as that is the part that is different. Can anyone point me in the right direction?

EDIT: As suggested, I tested using document.title, which should be there, but it also came back empty. See code and log below:

          console.log(
            "testing the return",
            (async () => {
              const browser = await puppet.launch({ headless: true });
              const page = await browser.newPage();
              await page.goto(req.params[0]); //this is the url
              try {
                title = await page.evaluate(() => {
                  const title = document.title;
                  const isTitleThere = title == null ? false : true;
                  //recently read that this checks for undefined as well as null but not an
                  //undeclared var
                  return {
                    title: title,
                    titleTitle: title.title,
                    isTitleThere: isTitleThere,
                  };
                });
              } catch (error) {
                console.log(error, "There was an error");
              }
11:54:11 AM web.1 |  testing the return Promise { <pending> }
11:54:13 AM web.1 |  { title: '', isTitleThere: true }

Does this have to do with single-page application bs? I thought puppeteer handled that because it loads everything first.

EDIT: I have added the networkidle option and an 8000-millisecond wait, as suggested. The title is still empty. Code and log below:

            await page.goto(req.params[0], { waitUntil: "networkidle2" });
            await page.waitFor(8000);
            console.log("done waiting");
            title = await page.$eval("title", (el) => el.innerText);
            console.log("title: ", title);
            console.log("done retrieving");
12:36:39 PM web.1 |  done waiting
12:36:39 PM web.1 |  title:  
12:36:39 PM web.1 |  done retrieving

EDIT: PROGRESS!! Thank you to theDavidBarton. It seems headless has to be false for it to work? Does anyone know why?


Answer 1:


If you only need the innerText of the title element, you can get the same result with Puppeteer's page.$eval method:

const title = await page.$eval('title', el => el.innerText)
console.log(title)

Output:

Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom

page.$eval(selector, pageFunction[, ...args])

The page.$eval method runs document.querySelector(selector) within the page and passes the matched element as the first argument to pageFunction.


However, your main problem is that the page you are visiting is a single-page app (SPA) built with React, and its title is filled in dynamically by the JavaScript bundle. Puppeteer therefore finds a valid title element in the <head>, but at that moment its content is just "" (an empty string).

Normally, for SPAs you should use waitUntil: 'networkidle0' so that the JS framework has populated the DOM and the page is fully functional:

await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle0'
  })

Unfortunately, with this specific website that throws a timeout error, because the network connections stay open past the default 30000 ms timeout. Something seems off on the page's frontend side (web worker handling?).

As a workaround, you can make Puppeteer sleep for 8 seconds with await page.waitFor(8000) before you try to retrieve the title; by then it will be populated. This is also why your script works in the DevTools console: you are not running it immediately, so by the time you do, the page is fully loaded and the DOM is populated.
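A fixed sleep is either longer than needed or too short on a slow connection. As a possible alternative, Puppeteer's page.waitForFunction can poll until the title is actually non-empty. This is only a sketch: the helper name waitForTitle and the 15-second default timeout are assumptions made for illustration, and it has not been checked against this particular site.

```javascript
// Hypothetical helper (not from the original answer): resolves with the
// document title once the page's JavaScript has filled it in, and rejects
// if `timeout` milliseconds pass first.
async function waitForTitle(page, timeout = 15000) {
  // The callback runs inside the browser, so `document` is the page's DOM.
  await page.waitForFunction(() => document.title.trim().length > 0, { timeout });
  return page.title();
}
```

In the script, const title = await waitForTitle(page) would take the place of the page.waitFor(8000) and page.$eval pair.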

This script will return the expected title:

async function fn() {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle2'
  })
  await page.waitFor(8000)

  const title = await page.$eval('title', el => el.innerText)
  console.log(title)

  await browser.close()
}
fn()

Launching with const browser = await puppeteer.launch({ headless: false }) may affect the result as well.
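One possible explanation, offered here as an assumption rather than something this thread confirms: in headless mode Chrome's default user agent contains the token HeadlessChrome, and some sites respond differently to it. A minimal sketch for testing that theory follows; the helper name stripHeadlessMarker is made up for illustration.

```javascript
// Hypothetical helper: rewrite a headless user agent string so that it
// looks like the one regular Chrome sends.
function stripHeadlessMarker(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Intended use inside the script, before page.goto (not executed here):
//   const ua = await browser.userAgent();
//   await page.setUserAgent(stripHeadlessMarker(ua));
```

If the title is still empty in headless mode after this change, the user agent is not the cause and the difference lies elsewhere.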




Answer 2:


When navigating to the page, wait until the page is loaded:

await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url

Could you try this:

try {
  title = await page.evaluate(() => {
    const title = document.title;
    const isTitleThere = title == null ? false : true;
    // recently read that this checks for undefined as well as null but not an
    // undeclared var
    return { title: title, isTitleThere: isTitleThere };
  });
} catch (error) {
  console.log(error, "There was an error");
}

Or this:

try {
  title = await page.evaluate(() => {
    // Return the content attribute as a string; a DOM element itself does not
    // survive serialization out of page.evaluate
    const meta = document.querySelector('meta[property="og:title"]');
    const title = meta == null ? null : meta.content;
    const isTitleThere = title == null ? false : true;
    // recently read that this checks for undefined as well as null but not an
    // undeclared var
    return { title: title, isTitleThere: isTitleThere };
  });
} catch (error) {
  console.log(error, "There was an error");
}
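The error in the question came from indexing [0] on a filter result that was empty. A sketch of a more defensive lookup is below, written as a plain function over { name, property, content } records so that it can be checked outside the browser. The function name pickTitle and the record shape are assumptions for illustration.

```javascript
// Hypothetical helper: return the content of the first meta record whose
// name or property ends with "title" and whose content is non-empty;
// otherwise return the fallback.
function pickTitle(metas, fallback = '') {
  const endsWithTitle = (v) => typeof v === 'string' && v.endsWith('title');
  const hit = metas.find(
    (m) => (endsWithTitle(m.name) || endsWithTitle(m.property)) && m.content
  );
  return hit ? hit.content : fallback;
}

// Intended use with Puppeteer (not executed here):
//   const metas = await page.$$eval('meta', (els) => els.map((el) => ({
//     name: el.getAttribute('name'),
//     property: el.getAttribute('property'),
//     content: el.getAttribute('content'),
//   })));
//   const title = pickTitle(metas, await page.title());
```

Because a missing match yields the fallback instead of undefined, the "Cannot read property 'attributes' of undefined" failure cannot occur here, although the result can still be empty if the page has not populated its head yet.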


Source: https://stackoverflow.com/questions/63817148/puppeteer-not-behaving-like-in-developer-console
