Scraping large amount of Google Scholar pages with url

问题

I'm trying to get full author list of all publications from an author on Google scholar using BeautifulSoup. Since the home page for the author only has a truncated list of authors for each paper, I have to open the link of the paper to get full list. As a result, I ran into CAPTCHA every few attempts.

Is there a way to avoid CAPTCHA (e.g. pause for 3 secs after every request)? Or make the original Google Scholar profile page to show full author list?

回答1:

Recently I faced similar issue. I at least eased my collection process with an easy workaround by implementing a random and rather longlasting sleep like this:

import time
import numpy as np

time.sleep((30-5)*np.random.random()+5) #from 5 to 30 seconds

If you have enough time (let's say launch your parser at night), you can make even bigger pause (3+ times bigger) to assure you won't get captcha.

Furthermore, you can randomly change user-agents in your requests to site, that will mask you even more.

来源：https://stackoverflow.com/questions/45193277/scraping-large-amount-of-google-scholar-pages-with-url

标签

web-scraping

beautifulsoup

captcha

google-scholar

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!