I am very new to Scrapy and I haven't used regular expressions before.
The following is my spider.py code:
class ExampleSpider(BaseSp
If you are using CrawlSpider, it's not usually a good idea to override the parse method.
Rule objects can filter the URLs you are interested in from the ones you do not care about.
See CrawlSpider in the docs for reference.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/bookstore']

    rules = (
        # Follow only the links whose URL matches this regular expression,
        # and hand each matching response to parse_bookstore.
        Rule(SgmlLinkExtractor(allow=('\/new\/[0-9]\?',)), callback='parse_bookstore'),
    )

    def parse_bookstore(self, response):
        hxs = HtmlXPathSelector(response)
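As a side note, the allow pattern is just a Python regular expression, so you can sanity-check it against your site's URLs before running the crawl. A minimal sketch, using made-up URLs purely for illustration:

import re

# The same pattern used in the Rule's allow argument above.
pattern = re.compile(r'\/new\/[0-9]\?')

# Made-up URLs, only to show what the link extractor would keep.
urls = [
    'http://www.example.com/bookstore/new/1?page=2',   # contains /new/<digit>? -> followed
    'http://www.example.com/bookstore/new/23',         # no '?' right after the digit -> skipped
    'http://www.example.com/bookstore/old/1?page=2',   # /old/ instead of /new/ -> skipped
]

for url in urls:
    print(url, '->', bool(pattern.search(url)))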