Question
I'm running a Scrapy spider in Python to scrape images from a website. One of the images fails to download (it fails even when I try to download it normally through the site), which appears to be an internal error on the site's end. That's fine; I don't care about getting that image, I just want to skip it when it fails and move on to the other images, but I keep getting a 10054 error.
    Traceback (most recent call last):
      File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 137, in parse_photo_page
        self.retrievePhoto(base_url_photo + url[0], url_text)
      File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 49, in wrapped_f
        return Retrying(*dargs, **dkw).call(f, *args, **kw)
      File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 212, in call
        raise attempt.get()
      File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 247, in get
        six.reraise(self.value[0], self.value[1], self.value[2])
      File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 200, in call
        attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
      File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 216, in retrievePhoto
        code.write(f.read())
      File "c:\python27\lib\socket.py", line 355, in read
        data = self._sock.recv(rbufsize)
      File "c:\python27\lib\httplib.py", line 612, in read
        s = self.fp.read(amt)
      File "c:\python27\lib\socket.py", line 384, in read
        data = self._sock.recv(left)
    error: [Errno 10054] An existing connection was forcibly closed by the remote host
Here is my parse function that looks at the photo page and finds the important URLs:
    def parse_photo_page(self, response):
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            url = sel.xpath('td/font/a/@href').extract()
            table_fields = sel.xpath('td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                url_text = string.replace(url_text, " ","")
                url_text = string.replace(url_text," ","")
                self.retrievePhoto(base_url_photo + url[0], url_text)
Here is my download function with the retry decorator:
    from retrying import retry

    @retry(stop_max_attempt_number=5, wait_fixed=2000)
    def retrievePhoto(self, url, filename):
        fullPath = self.saveLocation + "/" + filename
        urllib.urlretrieve(url, fullPath)
It retries the download five times, but then it raises the 10054 error and does not continue to the next image. How can I get the spider to continue after the retries are exhausted? Again, I don't care about downloading the problem image, I just want to skip over it.
Answer 1:
It's correct that you shouldn't use urllib inside Scrapy, because it blocks everything. Try to read resources related to "scrapy twisted" and "scrapy asynchronous". Anyway... I don't believe that your main problem is "continue after retrying" but rather not using relative XPaths in your expressions. Here is a version that works for me (note the ./ in './td/font/a/@href'):
    import scrapy
    import string
    import urllib
    import os

    from retrying import retry


    class MyspiderSpider(scrapy.Spider):
        name = "myspider"
        start_urls = (
            'file:index.html',
        )
        saveLocation = os.getcwd()

        def parse(self, response):
            for sel in response.xpath('//table[@id="tblData"]/tr'):
                url = sel.xpath('./td/font/a/@href').extract()
                table_fields = sel.xpath('./td/font/text()').extract()
                if url:
                    base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                    url_text = table_fields[3]
                    url_text = string.replace(url_text, " ","")
                    url_text = string.replace(url_text," ","")
                    self.retrievePhoto(base_url_photo + url[0], url_text)

        @retry(stop_max_attempt_number=5, wait_fixed=2000)
        def retrievePhoto(self, url, filename):
            fullPath = self.saveLocation + "/" + filename
            urllib.urlretrieve(url, fullPath)
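That said, if you also want the spider to survive an image that keeps failing, the idiomatic route is to let Scrapy download the file itself and skip failures through an errback; Scrapy's built-in RetryMiddleware (the RETRY_TIMES setting) already covers the retry-then-give-up part, so the retrying decorator becomes unnecessary. Here is a minimal sketch of that idea (PhotoSpider, save_photo and on_photo_error are hypothetical names, not part of the code above):

    import os
    import scrapy


    class PhotoSpider(scrapy.Spider):
        name = "photospider"
        start_urls = (
            'file:index.html',
        )
        saveLocation = os.getcwd()

        def parse(self, response):
            for sel in response.xpath('//table[@id="tblData"]/tr'):
                url = sel.xpath('./td/font/a/@href').extract()
                if url:
                    full_url = "http://www-nrd.nhtsa.dot.gov/" + url[0]
                    # Scrapy fetches the image asynchronously; a failed
                    # download calls the errback instead of killing the crawl.
                    yield scrapy.Request(full_url,
                                         callback=self.save_photo,
                                         errback=self.on_photo_error)

        def save_photo(self, response):
            # Write the raw body to disk; the filename here is just the
            # last URL segment, for illustration only.
            path = os.path.join(self.saveLocation, response.url.split('/')[-1])
            with open(path, 'wb') as f:
                f.write(response.body)

        def on_photo_error(self, failure):
            # Log and move on -- the remaining requests are unaffected.
            self.logger.warning("Skipping image: %s", failure.request.url)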
And here's a (much better) version that follows your patterns but uses the ImagesPipeline that @paul trmbrth mentioned.
    import scrapy
    import string
    import os


    class MyspiderSpider(scrapy.Spider):
        name = "myspider2"
        start_urls = (
            'file:index.html',
        )
        saveLocation = os.getcwd()
        custom_settings = {
            "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
            "IMAGES_STORE": saveLocation,
        }

        def parse(self, response):
            image_urls = []
            image_texts = []
            for sel in response.xpath('//table[@id="tblData"]/tr'):
                url = sel.xpath('./td/font/a/@href').extract()
                table_fields = sel.xpath('./td/font/text()').extract()
                if url:
                    base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                    url_text = table_fields[3]
                    url_text = string.replace(url_text, " ","")
                    url_text = string.replace(url_text," ","")
                    image_urls.append(base_url_photo + url[0])
                    image_texts.append(url_text)
            return {"image_urls": image_urls, "image_texts": image_texts}
The demo file I use is this:
    $ cat index.html
    <table id="tblData"><tr>
      <td><font>hi <a href="img/2015/cav.jpg"> foo </a> <span /> <span /> green.jpg </font></td>
    </tr><tr>
      <td><font>hi <a href="img/2015/caw.jpg"> foo </a> <span /> <span /> blue.jpg </font></td>
    </tr></table>
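To try either spider against this file, run it from the directory that contains index.html, e.g. (assuming the second version is saved as myspider2.py):

    $ scrapy runspider myspider2.py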
Source: https://stackoverflow.com/questions/35852744/scrapy-error-10054-after-retrying-image-download