My problem is the following: to save time, I would like to run several versions of one single spider at the same time. The process (the parse definitions) is the same for each of them.
So I found a solution inspired by scrapy crawl -a variable=value.
The spider concerned, in the "spiders" folder, was transformed like this:
import scrapy


class MySpider(scrapy.Spider):
    name = "arg"
    allowed_domains = ['www.website.com']

    # e.g. lo_lim = 2017, up_lim = 2019, type_of_race = pmu
    def __init__(self, lo_lim=None, up_lim=None, type_of_race=None, **kwargs):
        super(MySpider, self).__init__(**kwargs)  # Python 2 super() syntax; passes any extra -a args to the base class
        # -a arguments arrive as strings, so the limits must be converted to integers
        year = range(int(lo_lim), int(up_lim))  # lower limit, upper limit
        month = range(1, 13)  # 12 months
        day = range(1, 32)    # 31 days
        url = []
        for y in year:
            for m in month:
                for d in day:
                    url.append("https://www.website.com/details/{}-{}-{}/{}/meeting".format(y, m, d, type_of_race))
        self.start_urls = url
        # where url = ["https://www.website.com/details/2017-1-1/pmu/meeting",
        #              "https://www.website.com/details/2017-1-2/pmu/meeting",
        #              ...
        #              "https://www.website.com/details/2017-12-31/pmu/meeting",
        #              "https://www.website.com/details/2018-1-1/pmu/meeting",
        #              "https://www.website.com/details/2018-1-2/pmu/meeting",
        #              ...
        #              "https://www.website.com/details/2018-12-31/pmu/meeting"]

    def parse(self, response):
        ...
This answers my problem: I keep one single spider and run several versions of it with several commands at the same time, without trouble. Without a def __init__, it did not work for me. I tried many approaches; this is the perfectible code that works for me.
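For example (a sketch using the argument names from the spider above; the year ranges are just illustrative), two instances covering different periods can be launched from two terminals at the same time:

    scrapy crawl arg -a lo_lim=2016 -a up_lim=2017 -a type_of_race=pmu
    scrapy crawl arg -a lo_lim=2017 -a up_lim=2019 -a type_of_race=pmu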
Scrapy version: 1.5.0, Python version: 2.7.9, MongoDB version: 3.6.4, PyMongo version: 3.6.1
Scrapy supports spider arguments. Weirdly enough, there's no straightforward documentation for them, but I'll try to fill in the gap:
When you run the crawl command you may provide -a NAME=VALUE arguments, and these will be set as instance variables on your spider class. For example:
from scrapy import Spider, Request


class MySpider(Spider):
    name = 'arg'
    # we will set these below when running the crawler
    foo = None
    bar = None

    def start_requests(self):
        # note: f-strings require Python 3.6+; with Python 2 use .format() instead
        url = f'http://example.com/{self.foo}/{self.bar}'
        yield Request(url)
And if we run it:
scrapy crawl arg -a foo=1 -a bar=2
# will crawl example.com/1/2
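One thing worth noting (and the reason the spider in the question calls int() on its limits): values passed with -a always reach the spider as strings, so convert them yourself when you need numbers. A minimal sketch, assuming hypothetical lo_lim/up_lim arguments and an example.com URL pattern like the ones above:

    from scrapy import Spider, Request


    class RangeSpider(Spider):
        name = 'range'

        def start_requests(self):
            # -a lo_lim=2017 -a up_lim=2019 arrive as the strings '2017' and '2019'
            lo = int(self.lo_lim)
            up = int(self.up_lim)
            for year in range(lo, up):
                yield Request('http://example.com/details/{}/meeting'.format(year))

        def parse(self, response):
            self.logger.info('Got %s', response.url)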