Extract Google Scholar results using Python (or R)

给你一囗甜甜゛ · submitted on 2019-12-03 02:41:08

I suggest you not use libraries built for crawling specific websites, but instead use general-purpose HTML libraries that are well tested and have well-formed documentation, such as BeautifulSoup.

To access websites while presenting browser-like identification, you can use a URL opener class with a custom user agent:

# Python 2: FancyURLopener lives in urllib (in Python 3 it is urllib.request.FancyURLopener)
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    # identify as a desktop Chrome browser instead of the default Python user agent
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
openurl = MyOpener().open

Then download the required URL as follows:

openurl(url).read()

For retrieving Scholar results, just use the URL http://scholar.google.se/scholar?hl=en&q=${query}.
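
For example, here is a minimal sketch that builds such a URL from a plain search string (the search string below is just an example) and fetches it with the opener defined above:

from urllib import quote_plus   # Python 2; in Python 3: from urllib.parse import quote_plus

query = 'albert einstein'       # example search string
url = 'http://scholar.google.se/scholar?hl=en&q=' + quote_plus(query)
html = openurl(url).read()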

To extract pieces of information from the retrieved HTML, you could use code like this:

from bs4 import SoupStrainer, BeautifulSoup

# parse only the <div id="gs_ab_md"> element (the bar that shows the result count)
page = BeautifulSoup(openurl(url).read(), parse_only=SoupStrainer('div', id='gs_ab_md'))

This extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
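
From there, getting the visible text out is simple; a small sketch, assuming the page variable from above:

results_div = page.find('div', id='gs_ab_md')
if results_div is not None:
    print(results_div.get_text())   # e.g. "About 12,300 results (0.05 sec)" -- exact wording is whatever Scholar renders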

0x90

Google will block you... as it will be apparent you aren't a browser. Namely, they will detect the same request signature occurring too frequently for human activity....

You can do:
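
A minimal sketch of one possibility, using Python 2's urllib2 with a browser-like User-Agent header (the header value and query below are just examples):

import urllib2

query = 'albert+einstein'   # example query, already URL-encoded
url = 'http://scholar.google.se/scholar?hl=en&q=' + query
request = urllib2.Request(url, headers={
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/33.0.1750.152 Safari/537.36'),
})
html = urllib2.urlopen(request).read()

Spacing out requests (e.g. with time.sleep between queries) also helps with the rate-based detection described above.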

y-i_guy

It looks like scraping with Python and R runs into the problem where Google Scholar sees your request as a robot query, due to the lack of a user-agent in the request. There is a similar question on Stack Exchange about downloading all PDFs linked from a web page, and the answer leads the user to wget on Unix and the BeautifulSoup package in Python.

Curl also seems to be a more promising direction.

COPython's answer looks correct, but here's a bit of an explanation by example...

Consider f:

def f(a,b,c=1):
    pass

f expects values for a and b no matter what. You can leave c blank.

f(1,2)     #executes fine
f(a=1,b=2) #executes fine
f(1,c=1)   #TypeError: f() takes at least 2 arguments (2 given)

The fact that you are being blocked by Google is probably due to the user-agent setting in your headers... I am unfamiliar with R, but I can give you the general algorithm for fixing this:

  1. Use a normal browser (Firefox or whatever) to access the URL while monitoring HTTP traffic (I like Wireshark).
  2. Take note of all headers sent in the appropriate HTTP request.
  3. Try running your script and also note the headers it sends.
  4. Spot the difference.
  5. Set your R script to use the headers you saw when examining the browser traffic (a sketch of this step follows the list).
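
As a concrete sketch of step 5 in Python (the same idea carries over to R), here is how recorded headers could be attached to a request; every value below is only an example of what you might have copied from the browser traffic:

import urllib2

# example values only -- replace with the headers you actually recorded
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}
request = urllib2.Request('http://scholar.google.se/scholar?hl=en&q=albert+einstein',
                          headers=browser_headers)
html = urllib2.urlopen(request).read()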

Here is the call signature of query()...

def query(searchstr, outformat, allresults=False)

Thus you need to specify at least a searchstr AND an outformat, while allresults is an optional flag/argument.
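
So a call would look something like the sketch below ('bibtex' is only a placeholder; the accepted outformat values depend on the script that defines query()):

query('albert einstein', 'bibtex')                    # allresults defaults to False
query('albert einstein', 'bibtex', allresults=True)   # explicitly ask for every result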

You may want to use Greasemonkey for this task. The advantage is that Google will fail to detect you as a bot, provided you also keep the request frequency down. You can additionally watch the script working in your browser window.

You can learn to code it yourself or use a script from one of these sources.
