Question
I have been using the following Python script to crawl Google Scholar:
import urllib

filehandle = urllib.urlopen('http://www.techyupdates.blogspot.com')
for line in filehandle.readlines():
    print line
filehandle.close()
But since I am doing this repeatedly, I am getting blocked by the site (Google Scholar), which says:
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving ....
Is there an easy way to avoid this? Any suggestions?
Answer 1:
Put some kind of throttling into your script so that you load Google Scholar only lightly (for example, wait 60s, 600s, or even 6000s between queries).
And I do mean lightly. If caching the Google Scholar results is possible, that would also reduce the load on Google Scholar; a sketch combining both ideas follows below.
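A minimal sketch of throttling plus caching, written in Python 2 to match the question's script; fetch_cached, CACHE_DIR, and the 60-second delay are placeholder names and values, not anything Google-specific:

import hashlib
import os
import time
import urllib

CACHE_DIR = 'scholar_cache'  # hypothetical cache directory
DELAY_SECONDS = 60           # pause between live requests; raise it if you still get blocked

def fetch_cached(url):
    # Return the page body, hitting the network only on a cache miss.
    if not os.path.exists(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    path = os.path.join(CACHE_DIR, hashlib.md5(url).hexdigest())  # one cache file per URL
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    body = urllib.urlopen(url).read()
    with open(path, 'w') as f:
        f.write(body)
    time.sleep(DELAY_SECONDS)  # only fresh requests pay the delay
    return body

Repeated runs then reuse the cache, so only queries you have never made before touch Google Scholar at all.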
You should also look at batch processing, so you can run your crawl overnight at a steady but slow speed.
The goal is that Google Scholar should not care about your additional queries, thereby fulfilling the spirit of the ToS if not the letter. But if you can fulfill both, that would be the Right Thing to Do.
Answer 2:
Store the file locally? If you still need an HTTP connection, you can also write a quick Python web server to serve the file (see the sketch below). And yes, I agree: reading and trying to understand the error message helps, too...
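A minimal sketch of that approach, again in Python 2 for consistency with the question; the file name page.html and port 8000 are arbitrary choices:

import urllib
import SimpleHTTPServer
import SocketServer

# Fetch the page once and save a local copy.
body = urllib.urlopen('http://www.techyupdates.blogspot.com').read()
with open('page.html', 'w') as f:
    f.write(body)

# Serve the current directory over HTTP; the saved copy is then
# reachable at http://localhost:8000/page.html for the rest of your code.
PORT = 8000
handler = SimpleHTTPServer.SimpleHTTPRequestHandler
httpd = SocketServer.TCPServer(('', PORT), handler)
httpd.serve_forever()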
Source: https://stackoverflow.com/questions/14530019/avoiding-google-scholar-block-for-crawling