How to run several versions of a single spider at the same time with Scrapy?

Asked 2020-12-22 00:54

My problem is the following:

To save time, I would like to run several versions of one single spider. The process (parsing definitions) is the same.

2 Answers
  • 2020-12-22 01:27

    So, I found a solution inspired by scrapy crawl -a variable=value.

    The spider concerned, in the "spiders" folder, was transformed as follows:

    import scrapy

    class MySpider(scrapy.Spider):
        name = "arg"
        allowed_domains = ['www.website.com']

        def __init__(self, lo_lim=None, up_lim=None, type_of_race=None):  # e.g. lo_lim=2017, up_lim=2019, type_of_race=pmu
            # -a arguments arrive as strings, so the limits must be converted to int
            year  = range(int(lo_lim), int(up_lim))  # lower limit (inclusive) to upper limit (exclusive)
            month = range(1, 13)  # 12 months
            day   = range(1, 32)  # 31 days
            url = []
            for y in year:
                for m in month:
                    for d in day:
                        url.append("https://www.website.com/details/{}-{}-{}/{}/meeting".format(y, m, d, type_of_race))

            self.start_urls = url  # e.g. ["https://www.website.com/details/2017-1-1/pmu/meeting",
                                   #       "https://www.website.com/details/2017-1-2/pmu/meeting",
                                   #       ...
                                   #       "https://www.website.com/details/2018-12-31/pmu/meeting"]

        def parse(self, response):
            ...
    

    This answers my problem: I keep one single spider, and I can run several versions of it with several commands at the same time without trouble.
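
    For example, each version gets its own terminal and its own argument values (the values here are the illustrative ones from the comments in the code above):

        scrapy crawl arg -a lo_lim=2017 -a up_lim=2018 -a type_of_race=pmu
        scrapy crawl arg -a lo_lim=2018 -a up_lim=2019 -a type_of_race=pmu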

    Without a def __init__ it didn't work for me. I tried a lot of approaches; this code could be improved, but it works for me.
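
    One caveat that is not in the original answer, but follows Scrapy's documented recommendation when overriding __init__: also accept *args/**kwargs and forward them to the base class, so Scrapy's own initialization still runs (Python 2 style super, matching the versions below):

        def __init__(self, lo_lim=None, up_lim=None, type_of_race=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)  # keep Scrapy's base Spider initialization
            # ... build start_urls as above ...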

    Scrapy version: 1.5.0, Python version: 2.7.9, MongoDB version: 3.6.4, PyMongo version: 3.6.1

  • 2020-12-22 01:53

    Scrapy supports spider arguments. Weirdly enough, there's no straightforward documentation, but I'll try to fill in the gap:

    When you run a crawl command you may provide -a NAME=VALUE arguments, and these will be set as instance attributes on your spider. For example:

    from scrapy import Request, Spider

    class MySpider(Spider):
        name = 'arg'
        # we will set these below when running the crawler
        foo = None
        bar = None

        def start_requests(self):
            url = f'http://example.com/{self.foo}/{self.bar}'
            yield Request(url)


    And if we run it:

    scrapy crawl arg -a foo=1 -a bar=2
    # will crawl example.com/1/2
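
    If you would rather launch the parameterized runs from one script than from several terminals, a minimal sketch using Scrapy's CrawlerProcess (reusing the MySpider, foo and bar names from the example above) could look like this:

        from scrapy.crawler import CrawlerProcess

        process = CrawlerProcess()
        # each crawl() call schedules a separate instance of the same spider class;
        # keyword arguments reach the spider exactly like -a NAME=VALUE does
        process.crawl(MySpider, foo='1', bar='2')
        process.crawl(MySpider, foo='3', bar='4')
        process.start()  # blocks until both crawls have finished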
    