Running multiple spiders in Scrapy

滥情空心 2021-01-05 04:27
  In Scrapy, for example, if I have two URLs that contain different HTML, I want to write two individual spiders, one for each, and run both spiders at once.

4 Answers
  • 2021-01-05 05:10

    You should use scrapyd to handle multiple crawlers: http://doc.scrapy.org/en/latest/topics/scrapyd.html
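
    For example, once your project is deployed to a running Scrapyd instance, each spider can be started through the schedule.json endpoint. A minimal sketch, assuming Scrapyd listens on the default localhost:6800 and the project was deployed as "myproject" with spiders named "spider1" and "spider2" (all of these names are placeholders):

    import urllib.parse
    import urllib.request

    SCRAPYD_URL = "http://localhost:6800/schedule.json"  # default Scrapyd address (assumed)

    def schedule(spider_name):
        # POST project and spider names to Scrapyd's schedule.json endpoint
        data = urllib.parse.urlencode({"project": "myproject", "spider": spider_name}).encode()
        with urllib.request.urlopen(SCRAPYD_URL, data=data) as response:
            print(response.read().decode())

    for name in ("spider1", "spider2"):
        schedule(name)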

  • 2021-01-05 05:23

    Here is code that lets you run multiple spiders in Scrapy. Save it in the same directory as scrapy.cfg (my Scrapy version is 1.3.3 and it works):

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess

    setting = get_project_settings()
    process = CrawlerProcess(setting)

    # process.spiders works on Scrapy 1.3.x; it is deprecated on newer
    # versions, where process.spider_loader.list() does the same thing.
    for spider_name in process.spiders.list():
        print("Running spider %s" % spider_name)
        # query="dvh" is a custom argument passed through to each spider
        process.crawl(spider_name, query="dvh")

    process.start()
    

    You can then schedule this Python program to run from a cron job.
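
    For reference, spider arguments passed to process.crawl() arrive as keyword arguments in the spider's __init__. A minimal sketch of a spider that uses the query argument (the spider name, URL, and parse logic below are made up for illustration):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"  # hypothetical spider name

        def __init__(self, query=None, *args, **kwargs):
            super(ExampleSpider, self).__init__(*args, **kwargs)
            self.query = query  # receives query="dvh" from process.crawl()

        def start_requests(self):
            # build the start URL from the custom argument (placeholder URL)
            url = "http://example.com/search?q=%s" % self.query
            yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {"title": response.css("title::text").extract_first()}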

  • 2021-01-05 05:27

    It would probably be easiest to just run the two Scrapy spiders at once from the OS level. They should both be able to save to the same database. Create a shell script that calls both spiders so they run at the same time:

    scrapy runspider foo &
    scrapy runspider bar
    

    Be sure to make this script executable with chmod +x script_name

    To schedule a cronjob every 6 hours, type crontab -e into your terminal, and edit the file as follows:

    0 */6 * * * path/to/shell/script_name >> path/to/file.log
    

    The first field is minutes, then hours, and so on, and an asterisk is a wildcard. So 0 */6 says run the script at minute 0 of any hour divisible by 6, i.e. every six hours.

  • 2021-01-05 05:27

    You can try using CrawlerProcess

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    
    from myproject.spiders import spider1, spider2
    
    process = CrawlerProcess(get_project_settings())
    # Pass your spider classes to crawl(). Names starting with a digit
    # (1Spider, 2Spider) are not valid Python identifiers, so the classes
    # are assumed here to be called Spider1 and Spider2.
    process.crawl(spider1.Spider1)
    process.crawl(spider2.Spider2)
    process.start()
    

    If you want to see the full log of the crawl, set LOG_FILE in your settings.py.

    LOG_FILE = "logs/mylog.log"
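
    If you want the spiders to run one after another instead of concurrently, Scrapy's CrawlerRunner can chain the crawls. A sketch under the same assumptions about the spider modules and class names (spider1.Spider1 and spider2.Spider2 are placeholders):

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    from myproject.spiders import spider1, spider2

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        # each yield waits for the previous crawl to finish
        yield runner.crawl(spider1.Spider1)
        yield runner.crawl(spider2.Spider2)
        reactor.stop()

    crawl()
    reactor.run()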
    