The script (below) from this tutorial contains two start_urls.
from scrapy.spider import Spider
from scrapy.selector import Selector
from dirb
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it. Look at http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
The class does not have a rules property. Have a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.
If you use a rule to follow links (that is already implemented in Scrapy), the spider will scrape them too. I hope this helps...
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class Spider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']
    # follow every extracted link recursively (rules only work on CrawlSpider)
    rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]
...
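With follow=True alone the spider only follows the links; to scrape the followed pages as well, the Rule also needs a callback. A minimal sketch, assuming a hypothetical parse_item callback and the same placeholder domain:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class FollowAndScrapeSpider(CrawlSpider):
    name = 'follow_and_scrape_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']
    # every extracted link is followed and its response is passed to parse_item
    rules = [Rule(SgmlLinkExtractor(allow=[]), callback='parse_item', follow=True)]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # placeholder extraction logic - adapt the XPath to the target site
        self.log('Scraped page: %s' % response.url)

Note that a CrawlSpider must not override parse() itself, because CrawlSpider uses parse() internally to apply the rules.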
The start_urls class attribute contains the start URLs, nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback with [another] callback:
import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = [
        'http://www.domain.com/'
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse main page and extract categories links.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse category page and extract links of the items.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...
If you still want to customize start request creation, override the BaseSpider.start_requests() method.
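A minimal sketch of such an override, using the same placeholder domain (the two page URLs are made up for illustration):

from scrapy.http import Request
from scrapy.spider import BaseSpider

class CustomStartSpider(BaseSpider):
    name = 'custom_start_spider'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # build the initial requests yourself instead of relying on start_urls
        urls = [
            'http://www.domain.com/page1',
            'http://www.domain.com/page2',
        ]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.log('Downloaded start page: %s' % response.url)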
If you use BaseSpider, you have to extract your desired URLs yourself inside the callback and return Request objects for them.
If you use CrawlSpider, link extraction is taken care of by the rules and the SgmlLinkExtractor associated with them.
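For the BaseSpider case, a minimal sketch (the domain and the //a/@href XPath are placeholders, and parse_page is a hypothetical callback):

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class ManualLinkSpider(BaseSpider):
    name = 'manual_link_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # build the Request objects yourself and hand them back to the engine
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse_page)

    def parse_page(self, response):
        self.log('Scraped %s' % response.url)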
You didn't write a function to handle the URLs you want to get, so there are two ways to resolve this: 1. use rules (CrawlSpider), or 2. write a function to handle the new URLs and set it as the callback of the requests you yield.