Calling the same spider programmatically


Question


I have a spider which crawls links for the websites passed to it. I want to start the same spider again, with a different set of data, when its execution is finished. How can I restart the same crawler? The websites are passed through a database, and I want the crawler to run in an endless loop until all the websites are crawled. Currently I have to start the crawler with scrapy crawl first every time. Is there any way to start the crawler once and have it stop when all the websites are crawled?

I searched for this and found a solution that handles the crawler once it is closed/finished. But I don't know how to call the spider from the closed_handler method programmatically.

The following is my code:

    from scrapy import signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.signalmanager import SignalManager
    from scrapy.spiders import CrawlSpider
    from scrapy.xlib.pydispatch import dispatcher
    from twisted.internet import reactor

    class MySpider(CrawlSpider):
        name = "first"

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            SignalManager(dispatcher.Any).connect(
                self.closed_handler, signal=signals.spider_closed)

        def closed_handler(self, spider):
            reactor.stop()
            settings = Settings()
            crawler = Crawler(settings)
            crawler.signals.connect(spider.spider_closing, signal=signals.spider_closed)
            crawler.configure()
            crawler.crawl(MySpider())
            crawler.start()
            reactor.run()

        # code for getting the websites from the database
        def parse_url(self, response):
            ...

I am getting this error:

Error caught on signal handler: <bound method ?.closed_handler of <MySpider 'first' at 0x40f8c70>>

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "c:\python27\lib\site-packages\scrapy\xlib\pydispatch\robustapply.py", line 57, in robustApply
    return receiver(*arguments, **named)
  File "G:\Scrapy\web_link_crawler\web_link_crawler\spiders\first.py", line 72, in closed_handler
    crawler = Crawler(settings)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 32, in __init__
    self.spidercls.update_settings(self.settings)
AttributeError: 'Settings' object has no attribute 'update_settings'

Is this the right way to get this done? Or is there any other way? Please help!

Thank you


Answer 1:


Another way to do it would be to make a new script that selects the links from the database and saves them to a file, and then calls the Scrapy script like this:

os.system("scrapy crawl first")

Then load the links from the file in your spider and work from there.
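For illustration, here is a minimal sketch of how the spider could pick those links up in start_requests, assuming the first script writes one URL per line to a file called links.txt (the file name and format are assumptions, not part of the original answer):

    import scrapy

    class FirstSpider(scrapy.Spider):
        name = "first"

        def start_requests(self):
            # links.txt is written by the controller script before each run;
            # the name and one-URL-per-line format are assumptions.
            with open("links.txt") as f:
                for line in f:
                    url = line.strip()
                    if url:
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # extract and follow links as before
            ...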

If you want to check the database for new links constantly, just have the first script query the database from time to time in an infinite loop and make the Scrapy call whenever there are new links.
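A minimal sketch of that controller loop, assuming a SQLite database with a links table that has url and crawled columns (the schema, file names, and polling interval are all assumptions):

    import os
    import sqlite3
    import time

    DB_PATH = "links.db"      # hypothetical database file
    LINKS_FILE = "links.txt"  # file the spider reads its start URLs from

    while True:
        # Fetch links that have not been crawled yet (schema is assumed).
        conn = sqlite3.connect(DB_PATH)
        rows = conn.execute("SELECT url FROM links WHERE crawled = 0").fetchall()
        conn.close()

        if rows:
            # Dump the pending links to a file for the spider to pick up.
            with open(LINKS_FILE, "w") as f:
                f.write("\n".join(url for (url,) in rows))

            # Blocks until the crawl finishes; each run is a fresh process,
            # which sidesteps the restarted-reactor problem entirely.
            os.system("scrapy crawl first")

        time.sleep(60)  # poll the database again after a minute

Because each crawl runs in its own process, the Twisted reactor is started and stopped cleanly every time, which is what makes this approach simpler than restarting a spider inside a single process.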



Source: https://stackoverflow.com/questions/37002742/calling-the-same-spider-programmatically
