Downloading pdf files using mechanize and urllib

Submitted by 吃可爱长大的小学妹 on 2019-12-08 02:41:32

Question


I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain web pages and downloads them. Here is my current approach (just for one sample URL):

import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)

url = "http://www.xyz.com"

try:
    mech.open(url, timeout=30.0)
except mechanize.HTTPError as e:  # HTTPError lives on the mechanize module; bare HTTPError is undefined here
    sys.exit("%d: %s" % (e.code, e.msg))

links = mech.links()

for l in links:
    #Some are relative links
    path = str(l.base_url[:-1])+str(l.url)
    if path.find(".pdf") > 0:
        urllib.urlretrieve(path)

The program runs without any errors, but I don't see the PDF being saved anywhere. I can access the PDF and save it through my browser. Any ideas what's going on? I am using PyDev (Eclipse-based) as my development environment, if that makes any difference.

Another question: if I want to give the PDF a specific name when saving it, how can I do that? Is this approach correct? Do I have to create the file 'filename' before I can save the PDF?

urllib.urlretrieve(path, filename) 

Thanks in advance.


Answer 1:


The documentation for urllib says this about the urlretrieve function:

The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).

The function's return value has the location of the file:

Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached).

So, change this line:

urllib.urlretrieve(path)

to this:

(filename, headers) = urllib.urlretrieve(path)

and filename will hold the location of the saved file. Alternatively, pass the filename argument to urlretrieve to choose the location yourself.
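Both forms can be sketched with Python 3's urllib.request (where urlretrieve now lives); all the file and directory names below are invented, and a local file:// URL stands in for the remote server so the example runs without network access:

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# A stand-in for a remote PDF: a small local file reached via a file:// URL.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "report.pdf")
with open(src, "wb") as f:
    f.write(b"%PDF-1.4 dummy content")
url = Path(src).as_uri()

# Form 1: no filename argument -- urlretrieve reports where it put the data.
filename, headers = urllib.request.urlretrieve(url)
assert os.path.exists(filename)  # the file exists, just not where you were looking

# Form 2: explicit filename -- the copy lands exactly where you asked.
dest = os.path.join(tmpdir, "saved.pdf")
urllib.request.urlretrieve(url, dest)
assert open(dest, "rb").read() == b"%PDF-1.4 dummy content"
```

This is the crux of the original problem: the question's code discards the return value, so the downloads go to generated temporary paths and appear to vanish.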




Answer 2:


I've never used mechanize, but from the documentation for urllib at http://docs.python.org/library/urllib.html:

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().

As you can see, the urlretrieve function saves to a temporary file if you don't specify one, so try specifying the filename as you suggested in your second piece of code. Otherwise you could call urlretrieve like this:

    saved_filename, headers = urllib.urlretrieve(path)

and then use saved_filename later on.
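As for the naming half of the question: no, you do not need to create the file beforehand; urlretrieve creates it for you. One common way to pick a name (a sketch; the URL below is invented) is to take the last segment of the URL's path:

```python
import os
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def pdf_filename(url):
    """Return the final path segment of a URL, e.g. the .pdf file name."""
    return os.path.basename(urlparse(url).path)

print(pdf_filename("http://www.example.com/docs/report.pdf"))  # report.pdf
```

The result can then be passed as the second argument, e.g. urllib.urlretrieve(path, pdf_filename(path)).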



Source: https://stackoverflow.com/questions/6931364/downloading-pdf-files-using-mechanize-and-urllib
