Question
I am using Scrapy to parse an HTML page. My question is: why does Scrapy sometimes return the data I want, but sometimes return nothing? Is it my fault? Here is my parsing function:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
# AmazonScrapyItem is the Item subclass defined in the project's items module


class AmazonSpider(BaseSpider):
    name = "amazon"
    allowed_domains = ["amazon.org"]
    start_urls = [
        "http://www.amazon.com/s?rh=n%3A283155%2Cp_n_feature_browse-bin%3A2656020011"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[contains(@class, "result")]')
        items = []
        titles = {'titles': sites[0].xpath('//a[@class="title"]/text()').extract()}
        for title in titles['titles']:
            item = AmazonScrapyItem()
            item['title'] = title
            items.append(item)
        return items
Answer 1:
I believe you are just not using the most appropriate XPath expression.
Amazon's HTML is kind of messy and not very uniform, and therefore not easy to parse. But after some experimenting I could extract all 12 titles from a couple of search results with the following parse function:
def parse(self, response):
    sel = Selector(response)
    # anchors holding the product titles in the search results
    p = sel.xpath('//div[@class="data"]/h3/a')
    # a title is either wrapped in a <span> or is the anchor's own text node
    titles = p.xpath('span/text()').extract() + p.xpath('text()').extract()
    items = []
    for title in titles:
        item = AmazonScrapyItem()
        item['title'] = title
        items.append(item)
    return items
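A quick way to sanity-check the expression before wiring it into the spider is scrapy shell; a minimal session sketch (assuming the same old-style Selector API used above) would be:

scrapy shell "http://www.amazon.com/s?rh=n%3A283155%2Cp_n_feature_browse-bin%3A2656020011"
>>> from scrapy.selector import Selector
>>> sel = Selector(response)
>>> p = sel.xpath('//div[@class="data"]/h3/a')
>>> p.xpath('span/text()').extract() + p.xpath('text()').extract()  # prints the extracted titles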
If you care about the actual order of the results, the above parse function might not be appropriate, but I believe that is not the case.
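If order did matter, a minimal per-node sketch (assuming the same Selector API and AmazonScrapyItem as above) would extract each title relative to its own anchor so the results stay in document order:

def parse(self, response):
    sel = Selector(response)
    items = []
    # walk the anchors one by one so titles keep their on-page order
    for a in sel.xpath('//div[@class="data"]/h3/a'):
        # prefer the <span> text, fall back to the anchor's own text node
        title = a.xpath('span/text()').extract() or a.xpath('text()').extract()
        if title:
            item = AmazonScrapyItem()
            item['title'] = title[0]
            items.append(item)
    return items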
Source: https://stackoverflow.com/questions/20289450/python-scrapy-not-always-downloading-data-from-website