I am very new to Scrapy, and I haven't used regular expressions before.
The following is my spider.py code:
class ExampleSpider(BaseSp
If I understand you correctly, you want many start URLs that all follow a certain pattern.
If so, you can override the BaseSpider.start_requests method:
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        # Generate one request per page number instead of listing start_urls by hand.
        for i in xrange(1000):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)
...
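To check which URLs that loop produces without running a crawl, the same string formatting can be tried on its own. This is only a sketch: build_start_urls and URL_TEMPLATE are hypothetical names, not part of Scrapy, and range stands in for xrange so that it runs on Python 3.

```python
# Hypothetical helper, not part of Scrapy: rebuilds the URLs that the
# start_requests loop yields, so the pattern can be inspected offline.
URL_TEMPLATE = "http://www.example.com/bookstore/new/%d?filter=bookstore"


def build_start_urls(count):
    """Return `count` URLs, numbered from 0 as in the loop."""
    return [URL_TEMPLATE % i for i in range(count)]


# Print the first few URLs to confirm the pattern looks right.
for url in build_start_urls(3):
    print(url)
```

If the numbering on the real site starts at 1 or skips values, adjust the range accordingly.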
If you are using CrawlSpider, it is usually not a good idea to override the parse method.
A Rule object can separate the URLs you are interested in from the ones you do not care about.
See CrawlSpider in the docs for reference.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/bookstore']

    rules = (
        # Extract links whose URL contains /new/<digits>? and send each response
        # to parse_bookstore. [0-9]+ also accepts IDs with more than one digit.
        Rule(SgmlLinkExtractor(allow=(r'/new/[0-9]+\?',)), callback='parse_bookstore'),
    )

    def parse_bookstore(self, response):
        hxs = HtmlXPathSelector(response)
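Since you are new to regular expressions, it may help to try the allow pattern by hand with Python's re module before putting it in a Rule. The sketch below compares a single-digit pattern ([0-9]) with a one-or-more-digits variant ([0-9]+). The candidate URLs are made up for illustration, and it assumes the link extractor searches for the pattern anywhere in the URL, which is what re.search does.

```python
import re

# Single digit after /new/, then a literal "?" (the backslash escapes it).
single_digit = re.compile(r'/new/[0-9]\?')
# One or more digits after /new/, then a literal "?".
multi_digit = re.compile(r'/new/[0-9]+\?')

# Made-up URLs for illustration only.
candidates = [
    "http://www.example.com/bookstore/new/1?filter=bookstore",
    "http://www.example.com/bookstore/new/42?filter=bookstore",
    "http://www.example.com/bookstore/used/1?filter=bookstore",
    "http://www.example.com/bookstore/new/",
]

# Keep the URLs in which each pattern is found somewhere.
single_hits = [url for url in candidates if single_digit.search(url)]
multi_hits = [url for url in candidates if multi_digit.search(url)]

print(single_hits)
print(multi_hits)
```

Only the first URL matches the single-digit pattern, while the first two match the [0-9]+ variant, so the quantifier matters once the page numbers go past 9.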