Question
I am very new to scrapy. I need to follow the href links from the homepage of a URL to multiple depths. Inside those href links there are again multiple hrefs, and I need to keep following them until I reach the page I want to scrape. The sample HTML of my pages is:
Initial Page

<div class="page-categories">
    <a class="menu" href="/abc.html">
    <a class="menu" href="/def.html">
</div>

Inside abc.html

<div class="cell category">
    <div class="cell-text category">
        <p class="t">
            <a id="cat-24887" href="fgh.html"/>
        </p>
    </div>
I need to scrape the contents of this fgh.html page. Could anyone please suggest where to start? I read about LinkExtractors but could not find a suitable reference to begin with. Thank you.
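For reference, a minimal sketch of how the links in the sample HTML above could be followed level by level with a plain scrapy.Spider and response.follow (an alternative to the LinkExtractor approach in the answer below); the spider name, start URL, and CSS selectors are assumptions based on the sample markup, not part of the original question:

import scrapy

class CategorySpider(scrapy.Spider):
    name = "category_sketch"                  # placeholder name
    start_urls = ["http://www.example.com/"]  # placeholder homepage

    def parse(self, response):
        # First level: follow every <a class="menu"> link on the homepage
        for href in response.css("div.page-categories a.menu::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Second level: follow the links inside <p class="t"> on each category page
        # (response.follow resolves relative URLs such as fgh.html)
        for href in response.css("div.cell-text p.t a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_target)

    def parse_target(self, response):
        # fgh.html reached: extract whatever fields are needed here
        yield {"url": response.url, "title": response.css("title::text").get()}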
Answer 1:
From what I see, I can say that:
- URLs to product categories always end with .kat
- URLs to products contain id_ followed by a set of digits

Let's use this information to define our spider rules:
# scrapy.spiders and scrapy.linkextractors are the current import paths
# (older Scrapy versions used scrapy.contrib.spiders / scrapy.contrib.linkextractors)
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CodeCheckspider(CrawlSpider):
    name = "code_check"
    allowed_domains = ["www.codecheck.info"]
    start_urls = ['http://www.codecheck.info/']

    rules = [
        # Keep following category links (URLs ending in ".kat") deeper into the site
        Rule(LinkExtractor(allow=r'\.kat$'), follow=True),
        # Product pages (URLs containing "id_" plus digits) are handed to parse_product
        Rule(LinkExtractor(allow=r'/id_\d+/'), callback='parse_product'),
    ]

    def parse_product(self, response):
        title = response.xpath('//title/text()').extract()[0]
        print(title)
In other words, we ask the spider to follow every category link and to call parse_product whenever it crawls a link containing id_, which for us means we have found a product. In this case, for the sake of an example, I'm printing the page title to the console. This should give you a good starting point.
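If you would rather store the scraped data than print it, the callback could yield an item instead (a plain dict is enough) and let Scrapy's feed export write it out; the field names here are just placeholders:

    def parse_product(self, response):
        # Yield the scraped fields so the feed exporter can serialize them
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }

Run it with, for example:

scrapy crawl code_check -o products.json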
Source: https://stackoverflow.com/questions/28390593/scrapy-crawl-and-follow-links-within-href