问题

Puppeteer and PhantomJS are similar. The issue I'm having is happening for both, and the code is also similar.

I'd like to catch some informations from a website, which needs authentication for viewing those informations. I can't even access home page because it's detected like a "suspicious activity", like the SS: https://i.imgur.com/p69OIjO.png

I discovered that the problem doesn't happen when I tested on Postman using a header named Cookie and the value of it's cookie caught on browser, but this cookie expires after some time. So I guess Puppeteer/PhantomJS both are not catching cookies, because this site is denying the headless browser access.

What could I do for bypass this?

// Simple Javascript example
var page = require('webpage').create();
var url = 'https://www.expertflyer.com';

page.open(url, function (status) {
    if( status === "success") {
        page.render("home.png");
        phantom.exit();
    }
});

回答1:

The website you are trying to visit uses Distil Networks to prevent web scraping.

People have had success in the past bypassing Distil Networks by substituting the $cdc_ variable found in Chromium's call_function.js (which is used in Puppeteer).

For example:

function getPageCache(opt_doc, opt_w3c) {
  var doc = opt_doc || document;
  var w3c = opt_w3c || false;
  // var key = '$cdc_asdjflasutopfhvcZLmcfl_';    <-- This is the line that is changed.
  var key = '$something_different_';
  if (w3c) {
    if (!(key in doc))
      doc[key] = new CacheWithUUID();
    return doc[key];
  } else {
    if (!(key in doc))
      doc[key] = new Cache();
    return doc[key];
  }
}

Note: According to this comment, if you have been blacklisted before you make this change, you face another set of challenges, so you must "implement fake canvas fingerprinting, disable flash, change IP, and change request header order (swap language and Accept headers)."

回答2:

Things that can help in general :

Headers should be similar to common browsers, including :
- User-Agent : use a recent one (see https://developers.whatismybrowser.com/useragents/explore/), or better, use a random recent one if you make multiple requests (see https://github.com/skratchdot/random-useragent)
- Accept-Language : something like "en,en-US;q=0,5" (adapt for your language)
- Accept: a standard one would be like "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8"
If you make multiple request, put a random timeout between them
If you open links found in a page, set the Referer header accordingly
Images should be enabled
Javascript should be enabled
- Check that "navigator.plugins" and "navigator.language" are set in the client javascript page context
Use proxies

回答3:

If you think from the websites perspective, you are indeed doing suspicious work. So whenever you want to bypass something like this, make sure to think how they are thinking.

Set cookie properly

Puppeteer and PhantomJS etc will use real browsers and the cookies used there are better than when using via postman or such. You just need to use cookie properly.

You can use page.setCookie(...cookies) to set the cookies. Cookies are serialized, so if cookies is an array of object, you can simply do this,

const cookies = [{name: 'test', value: 'foo'}, {name: 'test2', value: 'foo'}]; // just as example, use real cookies here;
await page.setCookie(...cookies);

Try to tweak the behaviors

Turn off the headless mode and see the behavior of the website.

await puppeteer.launch({headless: false})

Try proxies

Some websites monitor based on Ip address, if multiple hits are from same IP, they blocks the request. It's best to use rotating proxies on that case.

来源：https://stackoverflow.com/questions/51731848/how-to-avoid-being-detected-as-bot-on-puppeteer-and-phantomjs

标签

node.js