Question
I have a Python script that fetches a webpage and mirrors it. It works fine for one specific page, but I can't get it to work for more than one. I assumed I could put multiple URLs into a list and then feed that to the function, but I get this error:
Traceback (most recent call last):
  File "autowget.py", line 46, in <module>
    getUrl()
  File "autowget.py", line 43, in getUrl
    response = urllib.request.urlopen(url)
  File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 361, in open
    req.timeout = timeout
AttributeError: 'tuple' object has no attribute 'timeout'
Here's the offending code:
url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*url):
    response = urllib.request.urlopen(url)
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

getUrl()
I've exhausted Google trying to find how to open a list of URLs with urlopen(). I found one way that sort of works: it takes a .txt document and goes through it line by line, feeding each line as a URL. But I'm writing this in Python 3 and, for whatever reason, twillcommandloop won't import. Plus, that method is unwieldy and requires (supposedly) unnecessary work.
Anyway, any help would be greatly appreciated.
Answer 1:
There are a couple of errors in your code:
- You define getUrl() with a variable-length argument list (*url), so inside the function url is a tuple; that is the tuple mentioned in your error;
- You then treat that argument as if it were the single list you defined, and pass it straight to urlopen(), which only accepts one URL (a string or a Request object) at a time.
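To see why the traceback complains about a tuple, here is a minimal sketch (the URL argument is only illustrative) of what the star-args signature actually receives:
import urllib.request

def getUrl(*url):
    # The star packs every positional argument into a tuple, so here
    # url is ('https://www.example.org/',) rather than a string.
    print(type(url))  # <class 'tuple'>
    try:
        urllib.request.urlopen(url)  # urlopen() expects a str or Request object
    except AttributeError as exc:
        print(exc)  # 'tuple' object has no attribute 'timeout', as in the traceback

getUrl('https://www.example.org/')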
You can try this code instead (using urllib.request, since you are on Python 3):
import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(urls):
    for url in urls:
        # Build a file name from the URL string
        file_name = url.replace('https://', '').replace('http://', '').replace('.', '_').replace('/', '_')
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
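If the chained replace() calls feel fragile, a sketch of deriving the file name from the parsed URL instead; make_file_name is a hypothetical helper, not part of the answer above:
from urllib.parse import urlparse

def make_file_name(url):
    # Illustrative helper: build a name like "www_example_org.html"
    # from the host and path of the parsed URL.
    parts = urlparse(url)
    raw = (parts.netloc + parts.path).strip('/')
    return (raw.replace('.', '_').replace('/', '_') or 'index') + '.html'

print(make_file_name('https://www.example.org/'))  # www_example_org.html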
Answer 2:
urlopen() does not accept a tuple; from the documentation:
urllib.request.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
Your call is also incorrect. It should be:
getUrl(url[0], url[1], url[2])
Then, inside the function, use a loop like "for u in url" to go over all the URLs; see the sketch below.
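Put together, a minimal sketch of what this answer suggests, keeping the *url signature from the question (the file-name logic is just borrowed from answer 1):
import shutil
import urllib.request

url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']

def getUrl(*url):
    # The star packs all positional arguments into a tuple,
    # so iterate over it and fetch one URL at a time.
    for u in url:
        # crude file name derived from the URL string
        file_name = u.replace('https://', '').replace('http://', '').replace('.', '_').replace('/', '_')
        with urllib.request.urlopen(u) as response, open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(url[0], url[1], url[2])  # or simply: getUrl(*url)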
Answer 3:
You should just iterate over your URLs using a for loop:
import shutil
import urllib.request

urls = ['https://www.example.org/', 'https://www.foo.com/']

def fetch_urls(urls):
    for i, url in enumerate(urls):
        # One output file per URL: page-0.html, page-1.html, ...
        file_name = "page-%s.html" % i
        response = urllib.request.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

fetch_urls(urls)
I assume you want the content saved to separate files, so I used enumerate here to create a unique file name, but you can obviously use anything from a hash or the uuid module to creating slugs.
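For example, a sketch of two such naming schemes, using hashlib for a stable digest (the built-in hash() is randomized between runs for strings) and uuid for a random one; the page-%s.html pattern is just carried over from the answer:
import hashlib
import uuid

url = 'https://www.example.org/'

# Stable name: the same URL always maps to the same file.
hashed_name = 'page-%s.html' % hashlib.md5(url.encode('utf-8')).hexdigest()

# Random name: unique per run, independent of the URL.
random_name = 'page-%s.html' % uuid.uuid4().hex

print(hashed_name)
print(random_name)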
Source: https://stackoverflow.com/questions/23278879/using-urlopen-to-open-list-of-urls