Scraping a large number of Google Scholar pages by URL

Submitted by 不羁的心 on 2020-07-18 02:52:08

Question


I'm trying to get the full author list for every publication by a given author on Google Scholar, using BeautifulSoup. The author's profile page shows only a truncated author list for each paper, so I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few requests.

Is there a way to avoid the CAPTCHA (e.g. pausing for 3 seconds after every request)? Or a way to make the original Google Scholar profile page show the full author list?


Answer 1:


I recently faced a similar issue. I at least eased my collection process with a simple workaround: a random and fairly long sleep between requests, like this:

import time
import numpy as np

# Sleep for a random interval between 5 and 30 seconds before the next request
time.sleep(np.random.uniform(5, 30))

If you have enough time (say you launch your parser at night), you can make the pause even longer (3 or more times longer) to make a CAPTCHA less likely.
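As a rough sketch of the longer-pause idea, the delay can be wrapped in a small helper with a multiplier. The names `random_pause`, `scale`, and `sleep` are made up for illustration and are not part of the original answer; the standard-library `random` module stands in for NumPy here.

```python
import random
import time


def random_pause(min_s=5.0, max_s=30.0, scale=1.0, sleep=time.sleep):
    """Sleep for a random time between min_s*scale and max_s*scale seconds.

    Hypothetical helper: `scale` stretches the whole interval (e.g. 3 for an
    overnight run), and `sleep` can be swapped out so the function is testable
    without actually waiting. Returns the delay that was used.
    """
    if min_s < 0 or max_s < min_s or scale <= 0:
        raise ValueError("need 0 <= min_s <= max_s and scale > 0")
    delay = random.uniform(min_s, max_s) * scale
    sleep(delay)
    return delay
```

Calling `random_pause()` between requests reproduces the 5 to 30 second pause above, and `random_pause(scale=3)` stretches it to 15 to 90 seconds for an unattended overnight run.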

Furthermore, you can randomly change the User-Agent header in your requests to the site, which will mask you even more.
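A minimal sketch of the User-Agent rotation, using only the standard library's `urllib.request`. The `USER_AGENTS` list, `build_request`, and `fetch` are hypothetical names, and the header strings are illustrative placeholders rather than values from the original answer. This does not guarantee that a CAPTCHA is avoided.

```python
import random
import urllib.request

# Illustrative User-Agent strings (assumptions, not from the original answer);
# replace them with current browser strings of your own choosing.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url, user_agents=USER_AGENTS):
    """Build a Request for `url` with a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=30):
    """Download `url` with a random User-Agent and return the body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The text returned by `fetch` can then be passed to BeautifulSoup as in the question, with a random sleep between successive calls.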



Source: https://stackoverflow.com/questions/45193277/scraping-large-amount-of-google-scholar-pages-with-url
