I'm using Puppeteer for web scraping and I have just noticed that sometimes the website I'm trying to scrape asks for a captcha due to the amount of visits I'm doing fro
Proxy servers can be used so that the destination site does not see a large number of requests coming from a single IP address.
(Translated with Google Translate)
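A minimal sketch of the proxy idea, assuming a small pool of proxies that you rotate through per request. The proxy URLs and the helper names (`pickProxy`, `launchOptionsFor`, `openWithProxy`) are hypothetical; `--proxy-server` is a standard Chromium command-line switch that Puppeteer passes through via `args`.

```javascript
// Hypothetical proxy pool: replace with proxies you are allowed to use.
const PROXIES = [
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
];

// Round-robin selection: request 0 uses the first proxy, request 1 the second, and so on.
function pickProxy(proxies, requestIndex) {
  return proxies[requestIndex % proxies.length];
}

// Build launch options that route all browser traffic through the given proxy.
function launchOptionsFor(proxyUrl) {
  return { headless: true, args: [`--proxy-server=${proxyUrl}`] };
}

// Open a URL through the proxy chosen for this request.
// The caller is responsible for calling browser.close() when done.
async function openWithProxy(url, requestIndex) {
  const puppeteer = require('puppeteer'); // loaded lazily so the helpers above work without it
  const proxy = pickProxy(PROXIES, requestIndex);
  const browser = await puppeteer.launch(launchOptionsFor(proxy));
  const page = await browser.newPage();
  await page.goto(url);
  return { browser, page };
}
```

Because the proxy is set at launch time, this sketch starts one browser per proxy. Whether rotation is enough to avoid the captcha depends on the site.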
This is a reCAPTCHA (version 2, check out demos here), which is shown to you because the owner of the page does not want you to crawl the page automatically.
Your options are the following:
As the owner of the page does not want you to crawl that page, you could simply respect that decision and stop crawling. Maybe there is a documented API that you can use.
There is an entire industry which has people (often in developing countries) filling out captchas for other people's bots. I will not link to any particular site, but you can check out the other answer from Md. Abu Taher for more information on the topic or search for captcha solver.
You can also solve the challenge yourself. For this, let me explain how reCAPTCHA works and what happens when you visit a page that uses it.
Each site has an ID (the site key), which you can find by looking at the page source, for example:
<div class="g-recaptcha form-field" data-sitekey="ID_OF_THE_WEBSITE_LONG_RANDOM_STRING"></div>
When the reCAPTCHA code is loaded it will add a response textarea to the form with no value. It will look like this:
<textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="... display: none;"></textarea>
After you have solved the challenge, reCAPTCHA adds a very long string to this text field. When the form is submitted, the server can then check that string against the reCAPTCHA service in the backend.
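As a rough sketch, a script could inspect these two elements as follows. The helper names `getSiteKey` and `isSolved` are hypothetical. `page` is assumed to be a Puppeteer `Page`, and only its `$eval` method is used.

```javascript
// Sketch only: these helpers are illustrative, not part of Puppeteer.

// Returns the data-sitekey of the first reCAPTCHA widget, or null if none is found.
async function getSiteKey(page) {
  try {
    return await page.$eval('.g-recaptcha', el => el.getAttribute('data-sitekey'));
  } catch (err) {
    // page.$eval rejects when no element matches the selector
    return null;
  }
}

// Returns true once the hidden response textarea holds a non-empty token.
async function isSolved(page) {
  try {
    return await page.$eval('#g-recaptcha-response', el => el.value.length > 0);
  } catch (err) {
    // The textarea does not exist yet (reCAPTCHA not loaded) or the page has no captcha
    return false;
  }
}
```

Both helpers treat any failure as "not found", which keeps the sketch short but also hides unrelated errors such as a closed page.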
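By copying the value of the textarea field, you can transfer the "solved challenge" from one browser to another (this is also what the solving services do for you). The full process looks like this:
1. Detect whether the page uses reCAPTCHA (for example, check for .g-recaptcha) in the "crawling" browser
2. Solve the challenge in another browser (or have a solving service do it)
3. Read the value there: document.querySelector('#g-recaptcha-response').value
4. Put that value into the "crawling" browser: document.querySelector('#g-recaptcha-response').value = '...'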
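The read and write steps above can be sketched with Puppeteer's `page.$eval`. The function names and the two-page setup (`solverPage`, `crawlerPage`) are assumptions for illustration. How the challenge gets solved in the first place is left out.

```javascript
// Sketch only: both arguments are assumed to be Puppeteer Page objects
// that have the same reCAPTCHA-protected URL open.
const RESPONSE_TEXTAREA = '#g-recaptcha-response';

// Read the token from the browser in which the challenge was solved.
async function readToken(solverPage) {
  return solverPage.$eval(RESPONSE_TEXTAREA, el => el.value);
}

// Write the token into the hidden textarea of the crawling browser.
async function writeToken(crawlerPage, token) {
  await crawlerPage.$eval(
    RESPONSE_TEXTAREA,
    (el, value) => { el.value = value; },
    token
  );
}

// Copy a solved challenge from one page to the other and return the token.
async function transferToken(solverPage, crawlerPage) {
  const token = await readToken(solverPage);
  if (!token) {
    throw new Error('Challenge has not been solved yet');
  }
  await writeToken(crawlerPage, token);
  return token;
}
```

After `transferToken` resolves, the crawling browser still has to submit the form. Whether the site accepts a token obtained in a different browser or from a different IP is not guaranteed.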
Google publishes little information on exactly how reCAPTCHA works, as this is a cat-and-mouse game between bot creators and Google's detection algorithms, but some online resources offer more information.
You should use a combination of the following:
Disclaimer: Do not use anti-captcha plugins/services to misuse resources. Resources are expensive.
Basically, the idea is to use an anti-captcha service such as 2captcha to deal with persistent reCAPTCHAs.
You can use the plugin puppeteer-extra-plugin-recaptcha by berstend.
// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra')

// add recaptcha plugin and provide it your 2captcha token
// 2captcha is the builtin solution provider but others work as well.
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha')
puppeteer.use(
  RecaptchaPlugin({
    provider: { id: '2captcha', token: 'XXXXXXX' },
    visualFeedback: true // colorize reCAPTCHAs (violet = detected, green = solved)
  })
)
Afterwards you can run the browser as usual. The plugin will pick up any captcha on the page and attempt to solve it. You then have to find the submit button, if one exists; it varies from site to site.
// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  await page.goto('https://www.google.com/recaptcha/api2/demo')

  // That's it, a single line of code to solve reCAPTCHAs