Downloading all PDF files from Google Scholar search results using wget

Asked 2021-02-06 05:59

I'd like to write a simple web spider, or just use wget, to download PDF results from Google Scholar. That would actually be quite a spiffy way to get papers for research.

1 Answer
Answered 2021-02-06 06:38
    wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf "http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23"
    

    A few things to note:

    1. Use of filetype:pdf in the search query
    2. -r -l 1 for one level of recursion
    3. -A pdf to accept only PDFs
    4. -H to span hosts, since the PDFs live on servers other than scholar.google.com
    5. -e robots=off together with --user-agent gives the best results. Google Scholar rejects a blank user agent, and PDF repositories are likely to disallow robots. Note also that the URL is quoted so the shell does not interpret the & characters in the query string.

    The limitation, of course, is that this will only hit the first page of results. You could increase the recursion depth, but the crawl will run wild and take forever. I would recommend combining something like Beautiful Soup with wget subprocesses, so that you can parse and traverse the search results strategically; a sketch of that approach follows.
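
    As a rough starting point, here is a minimal sketch of that combination in Python with requests and Beautiful Soup. The example query, the page offsets, and the ".pdf" suffix heuristic for spotting direct PDF links are all assumptions, and Scholar may throttle or CAPTCHA automated traffic.

    # Minimal sketch: parse one page of Scholar results with Beautiful Soup,
    # then hand each direct PDF link to wget. The ".pdf" suffix check is a
    # heuristic assumption, not something guaranteed by Scholar's markup.
    import subprocess

    import requests
    from bs4 import BeautifulSoup

    UA = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) "
          "Gecko/2008092416 Firefox/3.0.3")

    def scholar_pdf_links(query, start=0):
        """Return direct PDF URLs found on one page of Scholar results."""
        resp = requests.get(
            "https://scholar.google.com/scholar",
            params={"q": query, "start": start},
            headers={"User-Agent": UA},  # Scholar rejects a blank user agent
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].lower().endswith(".pdf")]

    def download(url):
        # Reuse wget for the transfer itself, as in the command above.
        subprocess.run(
            ["wget", "-nd", "-e", "robots=off", f"--user-agent={UA}", url],
            check=True,
        )

    if __name__ == "__main__":
        for start in (0, 10, 20):  # first three result pages, 10 hits each
            for url in scholar_pdf_links("filetype:pdf liquid films", start):
                download(url)

    Driving the pagination yourself via the start parameter keeps the crawl bounded, instead of letting recursive wget wander across every linked host.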
