Question
I am web scraping Google Scholar search results page by page. After a certain number of pages, a captcha pops up and interrupts my code. I read that Google limits the requests that I can make per hour. Is there any way around this limit? I read something about APIs, but I'm not sure if that is helpful.
Answer 1:
I feel your pain, since I have scraped Google in the past. I tried the following things to get the job done. The list is ordered from the easiest to the hardest technique.
- Throttle your requests per second: Google and many other websites will identify a large number of requests per second coming from the same machine and block them automatically as a defence against denial-of-service attacks. All you need to do is be gentle and make only one request every 1-5 seconds, for instance, to avoid being banned quickly.
- Randomize your sleep time: Making your code sleep for exactly 1 second is too easy to detect as a script. Make it sleep for a random amount of time at every iteration. This StackOverflow answer shows an example of how to randomize it (see also the sketch after this list).
- Use a web scraping library with cookies enabled: If you write scraping code from scratch, Google will notice that your requests do not send back the cookies it sets. Use a good library, such as Scrapy, to avoid this issue.
- Use multiple IP addresses: Throttling will definitely reduce your scraping throughput. If you really need to scrape your data fast, you will need several IP addresses to avoid being banned. There are several companies providing this kind of service on the Internet for a fee. I have used ProxyMesh and really liked their quality, documentation, and customer support.
- Use a real browser: Some websites will recognize your scraper if it doesn't process JavaScript or render a graphical interface. Using a real browser through Selenium, for instance, solves this problem (see the Selenium sketch below).
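To make the throttling, randomized-sleep, cookie, and proxy points concrete, here is a minimal sketch using the requests library. The search URL, query parameters, delay bounds, User-Agent string, and proxy address are placeholder assumptions for illustration, not values taken from this answer:

import random
import time

import requests

SEARCH_URL = "https://scholar.google.com/scholar"  # assumed endpoint, for illustration only
PROXIES = None  # e.g. {"http": "http://user:pass@proxy.example.com:31280"} for a paid proxy service

session = requests.Session()  # a Session stores cookies and re-sends them on every request
session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})  # placeholder User-Agent

def fetch_results_page(query, start=0):
    # Randomized 1-5 second delay, so the request pattern does not look scripted.
    time.sleep(random.uniform(1.0, 5.0))
    response = session.get(
        SEARCH_URL,
        params={"q": query, "start": start},  # "start" is the result offset (assumed pagination parameter)
        proxies=PROXIES,
        timeout=30,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = fetch_results_page('"policy shaping"')
    print(len(html))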
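And a minimal sketch of the "real browser" option with Selenium, assuming Chrome and a matching chromedriver are installed locally; the URL is again a placeholder:

import random
import time

from selenium import webdriver

driver = webdriver.Chrome()  # needs a chromedriver binary that matches your Chrome version
try:
    driver.get("https://scholar.google.com/scholar?q=%22policy+shaping%22")  # placeholder query URL
    time.sleep(random.uniform(2.0, 5.0))  # randomized pause while the page renders
    html = driver.page_source  # HTML after JavaScript has run, as a normal browser would see it
    print(len(html))
finally:
    driver.quit()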
You can also take a look at my crawler project, written for the Web Search Engines course at New York University. It does not scrape Google per se, but it uses some of the techniques above, such as throttling and randomizing the sleep time.
Answer 2:
From personal experience scraping Google Scholar: a delay of 45 seconds between requests is enough to avoid CAPTCHA and bot detection. I have had a scraper running for more than 3 days without being detected. If you do get flagged, waiting about 2 hours is enough to start again. Here is an extract from my code.
import logging
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

# ConfigFile is the answerer's own configuration helper (not included in this extract);
# it supplies the 'bot_avoidance_time' delay (45 seconds) and a browser-like 'user_agent' string.
logger = logging.getLogger(__name__)


class ScholarScrape():
    def __init__(self):
        self.page = None
        self.last_url = None
        self.last_time = time.time()
        self.min_time_between_scrape = int(ConfigFile.instance().config.get('scholar', 'bot_avoidance_time'))
        self.header = {'User-Agent': ConfigFile.instance().config.get('scholar', 'user_agent')}
        self.session = requests.Session()

    def search(self, query=None, year_lo=None, year_hi=None, title_only=False,
               publication_string=None, author_string=None,
               include_citations=True, include_patents=True):
        url = self.get_url(query, year_lo, year_hi, title_only, publication_string,
                           author_string, include_citations, include_patents)
        while True:
            # Wait until at least bot_avoidance_time seconds have passed since the last request.
            wait_time = self.min_time_between_scrape - (time.time() - self.last_time)
            if wait_time > 0:
                logger.info("Delaying search by {} seconds to avoid bot detection.".format(wait_time))
                time.sleep(wait_time)
            self.last_time = time.time()
            logger.info("SCHOLARSCRAPE: " + url)
            self.page = BeautifulSoup(self.session.get(url, headers=self.header).text, 'html.parser')
            self.last_url = url
            if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")
            return

    def get_url(self, query=None, year_lo=None, year_hi=None, title_only=False,
                publication_string=None, author_string=None,
                include_citations=True, include_patents=True):
        base_url = "https://scholar.google.com.au/scholar?"
        url = base_url + "as_q=" + urllib.parse.quote(query)
        if year_lo is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_lo))):
            url += "&as_ylo=" + str(year_lo)
        if year_hi is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_hi))):
            url += "&as_yhi=" + str(year_hi)
        if title_only:
            url += "&as_occt=title"  # restrict matching to article titles
        else:
            url += "&as_occt=any"
        if publication_string is not None:
            url += "&as_publication=" + urllib.parse.quote('"' + str(publication_string) + '"')
        if author_string is not None:
            url += "&as_sauthors=" + urllib.parse.quote('"' + str(author_string) + '"')
        if include_citations:
            url += "&as_vis=0"
        else:
            url += "&as_vis=1"
        if include_patents:
            url += "&as_sdt=0"
        else:
            url += "&as_sdt=1"
        return url

    def get_results_count(self):
        e = self.page.findAll("div", {"class": "gs_ab_mdw"})
        try:
            item = e[1].text.strip()
        except IndexError as ex:
            if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")
            else:
                raise ex
        if self.has_numbers(item):
            return self.get_results_count_from_soup_string(item)
        for item in e:
            item = item.text.strip()
            if self.has_numbers(item):
                return self.get_results_count_from_soup_string(item)
        return 0

    @staticmethod
    def get_results_count_from_soup_string(element):
        # The results header reads either "About 1,234 results" or "1,234 results".
        if "About" in element:
            num = element.split(" ")[1].strip().replace(",", "")
        else:
            num = element.split(" ")[0].strip().replace(",", "")
        return num

    @staticmethod
    def has_numbers(input_string):
        return any(char.isdigit() for char in input_string)


class BotDetectionException(Exception):
    pass


if __name__ == "__main__":
    s = ScholarScrape()
    s.search(**{
        "query": "\"policy shaping\"",
        # "publication_string": "JMLR",
        "author_string": "gilboa",
        "year_lo": "1995",
        "year_hi": "2005",
    })
    x = s.get_results_count()
    print(x)
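The extract above fetches and checks a single results page. Since the question is about going through results page by page, here is a hedged sketch of how the same class could be driven across several pages; it assumes Google Scholar accepts a "start" offset of 10 results per page appended to the URL from get_url, which is my assumption rather than part of the original answer:

def search_all_pages(max_pages=5, **search_kwargs):
    # Hypothetical helper, not in the original answer: walk several result pages
    # by appending a "start" offset (assumed to be 10 results per page).
    s = ScholarScrape()
    pages = []
    base_url = s.get_url(**search_kwargs)
    for page_index in range(max_pages):
        url = base_url + "&start=" + str(page_index * 10)
        # Respect the same bot-avoidance delay that search() uses.
        wait_time = s.min_time_between_scrape - (time.time() - s.last_time)
        if wait_time > 0:
            time.sleep(wait_time)
        s.last_time = time.time()
        s.page = BeautifulSoup(s.session.get(url, headers=s.header).text, 'html.parser')
        if "Our systems have detected unusual traffic from your computer network" in str(s.page):
            raise BotDetectionException("Google blocked the scraper while paging through results.")
        pages.append(s.page)
    return pages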
Source: https://stackoverflow.com/questions/60535351/web-scraping-google-search-results