Question
So I made a crawl spider which crawls this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products), follows every link on that page, and scrapes the title from each one. It is supposed to download the files as well, but this does not happen, and there is no error indication in the log!
LOG SAMPLE
2018-11-19 18:20:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sciencebase.gov/catalog/item/5a1492c3e4b09fc93dcfd574>
{'date': [datetime.datetime(2018, 11, 19, 18, 20, 12, 209865)],
 'file': ['https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__d7%2F26%2Fdb%2Fd726dbd9030e7554a4ef13cb56f53983f407eb7d',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__d7%2F26%2Fdb%2Fd726dbd9030e7554a4ef13cb56f53983f407eb7d&transform=1',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__72%2F6b%2F7d%2F726b7dd547ce9805a97e2464dc1f4646b2a16cfb',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__d4%2F87%2F6b%2Fd4876b385bc9ac2af3c9221aee4ff7a5a88f201a',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__12%2Fd9%2F4f%2F12d94f844998c4a4eaf1cedd80b70f36ed960a2c',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__12%2Fd9%2F4f%2F12d94f844998c4a4eaf1cedd80b70f36ed960a2c',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__e3%2Ff0%2F95%2Fe3f0958d05c1240724b58709196a87492b85d8d4',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__e3%2Ff0%2F95%2Fe3f0958d05c1240724b58709196a87492b85d8d4',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?facet=USGS_TopoMineSymbols_ver2_mapservice.sd',
          'https://www.sciencebase.gov/catalog/file/get/5a1492c3e4b09fc93dcfd574?f=__disk__b0%2F64%2Fd3%2Fb064d3465149780209ef624db57830e40edb9115'],
 'name': ['Prospect- and Mine-Related Features from U.S. Geological Survey 7.5- and 15-Minute Topographic Quadrangle Maps of the United States'],
 'project': ['us_deposits'],
 'server': ['DESKTOP-9CUE746'],
 'spider': ['deposits'],
 'url': ['https://www.sciencebase.gov/catalog/item/5a1492c3e4b09fc93dcfd574']}
2018-11-19 18:20:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7312,
'downloader/request_count': 23,
'downloader/request_method_count/GET': 23,
'downloader/response_bytes': 615330,
'downloader/response_count': 23,
'downloader/response_status_count/200': 13,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 9,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 19, 17, 20, 12, 397317),
'item_scraped_count': 9,
'log_count/DEBUG': 34,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 2,
'request_depth_max': 1,
'response_received_count': 13,
'scheduler/dequeued': 19,
'scheduler/dequeued/memory': 19,
'scheduler/enqueued': 19,
'scheduler/enqueued/memory': 19,
'start_time': datetime.datetime(2018, 11, 19, 17, 20, 7, 541186)}
2018-11-19 18:20:12 [scrapy.core.engine] INFO: Spider closed (finished)
SPIDER
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from us_deposits.items import DepositsusaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from urllib.parse import urlparse
from urllib.parse import urljoin


class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse_x'),
    )

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_xpath('file', '//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url',
                    MapCompose(lambda i: urljoin(response.url, i))
                    )
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()
SETTINGS
BOT_NAME = 'us_deposits'
SPIDER_MODULES = ['us_deposits.spiders']
NEWSPIDER_MODULE = 'us_deposits.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'us_deposits.pipelines.UsDepositsPipeline': 1,
}
FILES_STORE = {
    'C:/Users/User/Documents/Python WebCrawling Learning Projects'
}
Any ideas?
Answer 1:
Take a close look at the Files Pipeline documentation:
"In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field."
You need to store the URLs of the files to download in a field named file_urls, not file.
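Note that if you scrape into a scrapy.Item subclass (as your DepositsusaItem does) rather than a plain dict, the Files Pipeline also expects a files field where it writes the download results. A minimal sketch of the fields involved; the extra fields are just the ones your item already loads (the minimal spider below sidesteps this by yielding plain dicts):

import scrapy

class DepositsusaItem(scrapy.Item):
    file_urls = scrapy.Field()  # input: URLs for the Files Pipeline to download
    files = scrapy.Field()      # output: filled in by the pipeline with download results
    name = scrapy.Field()
    url = scrapy.Field()
    project = scrapy.Field()
    spider = scrapy.Field()
    server = scrapy.Field()
    date = scrapy.Field()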
This minimal spider works for me:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'usgs.gov'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products']

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE': '/my/valid/path/',
    }

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@id="products"]/p/a'), callback='parse_x'),
    )

    def parse_x(self, response):
        yield {
            'file_urls': [response.urljoin(u) for u in response.xpath('//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url').extract()],
        }
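Applied to your own spider, the change is essentially just the field name in the loader, plus enabling the FilesPipeline and making FILES_STORE a plain string (your settings currently define it as a one-element set). A sketch, assuming DepositsusaItem declares the file_urls and files fields as above and reusing the imports from your spider:

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        # 'file_urls' is the field the Files Pipeline reads, not 'file'
        i.add_xpath('file_urls',
                    '//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url',
                    MapCompose(lambda u: urljoin(response.url, u)))
        i.add_value('url', response.url)
        return i.load_item()

and in settings.py:

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'us_deposits.pipelines.UsDepositsPipeline': 300,
}
# a string path, not a set
FILES_STORE = 'C:/Users/User/Documents/Python WebCrawling Learning Projects'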
Source: https://stackoverflow.com/questions/53380142/scrapy-crawl-spider-does-not-download-files