Avoiding Google Scholar block for crawling [closed]

Submitted by 元气小坏坏 on 2021-02-07 10:57:17

Question


I have used the following Python script to crawl Google Scholar:

import urllib.request

# urllib.urlopen() is Python 2; in Python 3 it moved to urllib.request.urlopen()
filehandle = urllib.request.urlopen('http://www.techyupdates.blogspot.com')

for line in filehandle.readlines():
    print(line)

filehandle.close()

but I am doing it repeatedly, so Google Scholar is blocking me with the message:

This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving ....

Is there an easy way to avoid this? Any suggestions?


Answer 1:



Put some kind of throttling into your script so that you load Google Scholar lightly (wait 60, 600, or 6000 seconds between queries, for example).
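A minimal sketch of such a throttle, written as a generator that pauses between items; the delay value and the fetch step are placeholders, not part of the original answer:

```python
import time

def throttled(iterable, delay_seconds):
    """Yield items from iterable, pausing delay_seconds between them."""
    for i, item in enumerate(iterable):
        if i > 0:
            time.sleep(delay_seconds)
        yield item

# Hypothetical usage: fetch each query URL with a 60-second pause between requests.
# for url in throttled(query_urls, 60):
#     html = fetch(url)  # fetch() is a placeholder for your download code
```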

And I do mean lightly load Google Scholar. If caching the Google Scholar results is possible, that would also reduce the load on Google Scholar.
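One way to sketch such a cache is to store each fetched page on disk keyed by a hash of its URL, so repeated queries never hit Google Scholar again; the directory name and the `fetch` callable are assumptions for illustration:

```python
import hashlib
import os

def cached_fetch(url, fetch, cache_dir='scholar_cache'):
    """Return the cached body for url, calling fetch(url) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode('utf-8')).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return f.read()
    body = fetch(url)  # fetch() is a placeholder for your download code
    with open(path, 'wb') as f:
        f.write(body)
    return body
```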

You should also look at batch processing, so you can run your crawl overnight at a steady but slow speed.

The goal is that Google Scholar should not care about your additional queries, thereby fulfilling the spirit of the ToS if not the letter. But if you can fulfill both, that would be the Right Thing to Do.




Answer 2:


Store the file locally? You can also write a quick Python web server to serve the file, in case you need the HTTP connection. And yes, I agree: reading and trying to understand the error message helps, too...



Source: https://stackoverflow.com/questions/14530019/avoiding-google-scholar-block-for-crawling
