Question
Unfortunately I don't have enough reputation to leave a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a database, so I want to get the start URLs from my DB. So far that is not a big problem. However, I don't want the MySQL code inside the spider, and when I keep it in the pipeline I run into a problem. If I try to hand the pipeline object over to my spider as in the referenced question, I only get an AttributeError with the message

'NoneType' object has no attribute 'getUrl'

I think the actual problem is that the function spider_opened never gets called (I also inserted a print statement that never showed its output in the console). Does somebody have an idea how to get the pipeline object inside the spider?
MySpider.py
def __init__(self):
    self.pipe = None

def start_requests(self):
    url = self.pipe.getUrl()
    scrapy.Request(url, callback=self.parse)
Pipeline.py
@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

def spider_opened(self, spider):
    spider.pipe = self

def getUrl(self):
    ...
Answer 1:
Scrapy pipelines already have the expected methods open_spider and close_spider.
Taken from docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider
open_spider(self, spider)
    This method is called when the spider is opened.
    Parameters: spider (Spider object) – the spider which was opened

close_spider(self, spider)
    This method is called when the spider is closed.
    Parameters: spider (Spider object) – the spider which was closed
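So if a pipeline needs per-spider setup, such as opening a database connection, it can do that in open_spider directly, with no manual signal wiring. A minimal sketch, where sqlite3 and the items.db path are just illustrative assumptions, not something from your code:

import sqlite3

class MyPipeline:

    def open_spider(self, spider):
        # Called automatically when the spider is opened.
        self.conn = sqlite3.connect('items.db')  # illustrative path

    def close_spider(self, spider):
        # Called automatically when the spider is closed.
        self.conn.close()

    def process_item(self, item, spider):
        # Items just pass through in this sketch.
        return item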
However, your original approach doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
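For what it's worth, there are two concrete problems in your snippet. First, from_crawler never returns the pipeline instance, and Scrapy uses that return value as the pipeline object. Second, because your start_requests is a plain method (no yield), its body runs as soon as Scrapy calls it, which happens before the spider_opened signal is sent, so spider.pipe would most likely still be None at that point even with the signal connected. Fixing the first problem would look roughly like this:

from scrapy import signals

@classmethod
def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    # Scrapy keeps the object returned here as the pipeline instance.
    return pipeline

But the ordering problem remains, which is why reading the urls in the spider itself is the cleaner approach.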
What you should do is open up db and read urls in your spider itself.
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = spider.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...    # get your db cursor here
        urls = ...  # use the cursor to pop your urls
        return urls
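To make the placeholders concrete, here is one way get_urls_from_db could be filled in. This is only a sketch: pymysql, the connection parameters, and the urls table/column names are assumptions for illustration, not part of the answer.

import pymysql

def get_urls_from_db(self):
    # Connection parameters and the `urls` table are hypothetical.
    conn = pymysql.connect(host='localhost', user='user',
                           password='secret', db='mydb')
    try:
        with conn.cursor() as cursor:
            cursor.execute('SELECT url FROM urls')
            # One url per row; flatten to a plain list of strings.
            urls = [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()
    return urls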
Source: https://stackoverflow.com/questions/46339263/scrapy-get-start-urls-from-database-by-pipeline