Downloading pdf files using mechanize and urllib

Submitted by 吃可爱长大的小学妹 on 2019-12-08 02:41:32

Question


I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain web pages and downloads them. Here is my current approach (just for one sample URL):

import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)

url = "http://www.xyz.com"

try:
    mech.open(url, timeout=30.0)
except mechanize.HTTPError as e:  # HTTPError lives on the mechanize module; bare HTTPError is undefined here
    sys.exit("%d: %s" % (e.code, e.msg))

links = mech.links()

for l in links:
    #Some are relative links
    path = str(l.base_url[:-1])+str(l.url)
    if path.find(".pdf") > 0:
        urllib.urlretrieve(path)

The program runs without any errors, but I don't see the PDF being saved anywhere. I can access the PDF and save it through my browser. Any ideas what's going on? I am using PyDev (Eclipse-based) as my development environment, if that makes any difference.

Another question: if I want to give the PDF a specific name when saving it, how can I do that? Is this approach correct? Do I have to create the file 'filename' before I can save the PDF?

urllib.urlretrieve(path, filename) 

Thanks in advance.


Answer 1:


The documentation for urllib says this about the urlretrieve function:

The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).

The function's return value has the location of the file:

Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached).

So, change this line:

urllib.urlretrieve(path)

to this:

(filename, headers) = urllib.urlretrieve(path)

and filename will hold the location of the saved file. Alternatively, pass the filename argument to urlretrieve to choose the location yourself.
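Both forms can be sketched with Python 3's urllib.request (where urlretrieve now lives); all the file and directory names below are invented, and a local file:// URL stands in for the remote server so the example runs without network access:

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# A stand-in for a remote PDF: a small local file reached via a file:// URL.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "report.pdf")
with open(src, "wb") as f:
    f.write(b"%PDF-1.4 dummy content")
url = Path(src).as_uri()

# Form 1: no filename argument -- urlretrieve reports where it put the data.
filename, headers = urllib.request.urlretrieve(url)
assert os.path.exists(filename)  # the file exists, just not where you were looking

# Form 2: explicit filename -- the copy lands exactly where you asked.
dest = os.path.join(tmpdir, "saved.pdf")
urllib.request.urlretrieve(url, dest)
assert open(dest, "rb").read() == b"%PDF-1.4 dummy content"
```

This is the crux of the original problem: the question's code discards the return value, so the downloads go to generated temporary paths and appear to vanish.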




Answer 2:


I've never used mechanize, but from the documentation for urllib at http://docs.python.org/library/urllib.html:

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().

As you can see, the urlretrieve function saves to a temporary file if you don't specify one, so try specifying the filename as you suggested in your second piece of code. Otherwise you could call urlretrieve like this:

    saved_filename, headers = urllib.urlretrieve(path)

and then use saved_filename later on.
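As for the naming half of the question: no, you do not need to create the file beforehand; urlretrieve creates it for you. One common way to pick a name (a sketch; the URL below is invented) is to take the last segment of the URL's path:

```python
import os
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def pdf_filename(url):
    """Return the final path segment of a URL, e.g. the .pdf file name."""
    return os.path.basename(urlparse(url).path)

print(pdf_filename("http://www.example.com/docs/report.pdf"))  # report.pdf
```

The result can then be passed as the second argument, e.g. urllib.urlretrieve(path, pdf_filename(path)).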



Source: https://stackoverflow.com/questions/6931364/downloading-pdf-files-using-mechanize-and-urllib
