问题
I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.
When to retrieve it with urrlib
and wget
, it turns out that the URL leads to a Blob and the file downloaded is only 289 bytes and unreadable.
http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx
I'm entirely unfamiliar with Blobs and have these questions:
Can the file "behind the Blob" be successfully retrieved using Python?
If so, is it necessary to uncover the "true" URL behind the Blob – if there is such a thing – and how? My concern here is that the link above won't be static but actually change often.
回答1:
That 289 byte long thing might be a HTML code for 403 forbidden
page. This happen because the server is smart and rejects if your code does not specify a user agent.
Python 3
# python3
import urllib.request as request
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)
# print or write
print(f.read())
Python 2
# python2
import urllib2
url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
f = urllib2.urlopen(r)
print(f.read())
回答2:
from bs4 import BeautifulSoup
import requests
import re
url='http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
html=requests.get(url)
page=BeautifulSoup(html.content)
reg=re.compile('Master data')
find=page.find('span',text=reg) #find the file url
file_url='http://www.xetra.com'+find.parent['href']
file=requests.get(file_url)
with open(r'C:\\Users\user\Downloads\file.xlsx','wb') as ff:
ff.write(file.content)
recommend requests and BeautifulSoup,both good lib
来源:https://stackoverflow.com/questions/39517522/download-file-from-blob-url-with-python