Scrapy CrawlSpider Crawls Nothing


Question


I am trying to crawl Booking.com. The spider opens and closes without crawling the URL (output screenshot: https://i.stack.imgur.com/9hDt6.png). I am new to Python and Scrapy. Here is the code I have written so far. Please point out what I am doing wrong.

import scrapy
import urllib
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.loader import ItemLoader
from CinemaScraper.items import CinemascraperItem


class trip(CrawlSpider):
    name = "tripadvisor"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        self.pageNumber = 1
        return scrapy.Request(url, callback=self.parse_reviews)

    def parse_reviews(self, response):
        for rev in response.xpath('//li[starts-with(@class,"review_item")]'):
            item = CinemascraperItem()
            #sometimes the title is empty because of some reason, not sure when it happens but this works
            title = rev.xpath('.//*[@class="review_item_header_content"]/span[@itemprop="name"]/text()')
            if title:
                item['title'] = title[0].extract()
                positive_content = rev.xpath('.//p[@class="review_pos"]//span/text()')
                if positive_content:
                    item['positive_content'] = positive_content[0].extract()
                negative_content = rev.xpath('.//p[@class="review_neg"]/span/text()')
                if negative_content:
                    item['negative_content'] = negative_content[0].extract()
                item['score'] = rev.xpath('./*[@class="review_item_header_score_container"]/span')[0].extract()
                #tags are separated by ;
                item['tags'] = ";".join(rev.xpath('.//ul[@class="review_item_info_tags/text()').extract())
                yield item

        next_page = response.xpath('//a[@id="review_next_page_link"]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_reviews)

Answer 1:


I'd like to point out that in your question you mention booking.com, but your spider actually contains links to quotes.toscrape.com, the site used in Scrapy's official tutorial. I'll continue to use the quotes site for the sake of explanation.

Okay, here we go. In your snippet you are using a CrawlSpider, and it is worth mentioning that the parse method is already part of the logic behind CrawlSpider. Rename your callback to something else, such as parse_item (the default callback name when you generate a CrawlSpider from a template), although you can name it whatever you want. Doing so should let the spider actually crawl the site, provided the rest of your code is correct.

In a nutshell, the difference between a generic spider and a CrawlSpider is that with a CrawlSpider you use a LinkExtractor and Rules, which set parameters so that links matching a pattern are followed from the start URL, with various helpful arguments to fine-tune just that. The rule with a callback is the one whose matching pages get parsed. In other words, the CrawlSpider builds the request logic needed to navigate the site as desired.

Notice that in the rules below I use the pattern page/.*; the .* is a regular expression that says: from the page I'm on, any link matching .../page/... will be followed AND sent to the parse_item callback.

This is a super simple example, as you can set a pattern to just follow, or to just call back to your item-parsing function.

With a normal spider you have to work out the site navigation manually to get the content you want.


CrawlSpider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from quotes.items import QuotesItem

class QcrawlSpider(CrawlSpider):
    name = 'qCrawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'page/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = QuotesItem()
        item['quote'] = response.css('span.text::text').extract()
        item['author'] = response.css('small.author::text').extract()
        yield item

Generic Spider

import scrapy
from quotes.items import QuotesItem

class QspiSpider(scrapy.Spider):
    name = "qSpi"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css("div.quote"):
            item = QuotesItem()
            item['quote'] = quote.css('span.text::text').extract()
            item['author'] = quote.css('small.author::text').extract()
            item['tags'] = quote.css("div.tags > a.tag::text").extract()
            yield item

        for nextPage in response.css('li.next a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(nextPage))
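
For completeness, both examples assume an items.py along these lines. This is just a sketch inferred from the fields the spiders above populate, not code from the original project:

# items.py -- minimal item definition assumed by the two quote spiders above
import scrapy


class QuotesItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()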



EDIT: Additional info at request of OP

"...I cannot understand how to add arguments to the Rule parameters"

Okay, let's look at the official documentation just to reiterate the CrawlSpider's definition.

CrawlSpiders create the logic for following links through their rules set. Now let's say I want to crawl craigslist with a CrawlSpider, but only for household items for sale. Take note of two things about the URLs involved.

First, when I'm on the craigslist household-items listing pages, the URLs look like:

  • https://columbia.craigslist.org/search/hsh

  • https://columbia.craigslist.org/search/hsh?s=120

So we gather that anything under search/hsh... will be a listing page for household items, starting from the landing page.

Second, when we are on the actual posted items, all of them seem to have .../hsh/... in the URL, so any link on the previous page matching that pattern is one I want to follow and scrape. So my spider would be something like:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from craigListCrawl.items import CraiglistcrawlItem

class CcrawlexSpider(CrawlSpider):
    name = 'cCrawlEx'
    allowed_domains = ['columbia.craigslist.org']
    start_urls = ['https://columbia.craigslist.org/']

    rules = (
        Rule(LinkExtractor(allow=r'search/hsa.*'), follow=True),
        Rule(LinkExtractor(allow=r'hsh.*'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = CraiglistcrawlItem()
        item['title'] = response.css('title::text').extract()
        item['description'] = response.xpath("//meta[@property='og:description']/@content").extract()
        item['followLink'] = response.xpath("//meta[@property='og:url']/@content").extract()
        yield item

Think of it as the steps you take to get from the landing page to the page with the content. We land on the page that is our start_url, and we said the household-items listings follow a pattern, as you can see in the first rule:

Rule(LinkExtractor(allow=r'search/hsa.*'), follow=True)

This says: follow any link matching the regular expression search/hsa.*; remember that .* matches anything after search/hsa, in this case at least.

The logic then continues: any link matching the pattern hsh.* has its response passed to my parse_item callback.

It may help to think of it as the number of "clicks" it takes to get from one page to another. CrawlSpiders are perfectly acceptable, but generic spiders give you the most control over the resources your Scrapy project ends up using, meaning that a well-written generic spider can be more precise and far faster.
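
To make the "arguments to the Rule parameters" point concrete: LinkExtractor accepts arguments such as allow, deny, restrict_xpaths and restrict_css, and Rule takes callback and follow among others. Here is a rough sketch reusing the craigslist patterns from above; the deny pattern and the restrict_css selector are purely illustrative, not taken from the real page:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Follow pagination of the listing pages, but don't parse them.
    Rule(LinkExtractor(allow=r'search/hsa.*', deny=r'format=rss'), follow=True),
    # Extract posting links only from the results list (illustrative selector)
    # and send each matching page to parse_item.
    Rule(LinkExtractor(allow=r'hsh.*', restrict_css='ul.rows'),
         callback='parse_item', follow=False),
)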




Answer 2:


You are overriding the parse method on a CrawlSpider subclass, which is not recommended per the documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

That said, I don't see any Rules in your spider, so I'd recommend simply switching to scrapy.spiders.Spider instead of scrapy.spiders.CrawlSpider. Just inherit from the Spider class and run it again; it should work as you expect.
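
For illustration, here is a minimal sketch of that change applied to the question's spider. The selectors and item class are taken from the question as-is and have not been run against the real site:

import scrapy
from CinemaScraper.items import CinemascraperItem  # item class from the question


class trip(scrapy.Spider):  # plain Spider instead of CrawlSpider
    name = "tripadvisor"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    # With a plain Spider, parse() is simply the default callback for the
    # start URLs, so overriding it is fine.
    def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        if reviewsurl:
            url = response.urljoin(reviewsurl[0].extract())
            yield scrapy.Request(url, callback=self.parse_reviews)

    def parse_reviews(self, response):
        # ... same review-parsing logic as in the question ...
        for rev in response.xpath('//li[starts-with(@class,"review_item")]'):
            item = CinemascraperItem()
            item['title'] = rev.xpath(
                './/*[@class="review_item_header_content"]'
                '/span[@itemprop="name"]/text()').extract_first()
            yield item

You would still run it the same way, with scrapy crawl tripadvisor.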



Source: https://stackoverflow.com/questions/44620722/scrapy-crawlspider-crawls-nothing
