Scrapy CrawlSpider output

走远了吗 · Submitted on 2019-12-11 22:24:55

Question


I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine but I'm having trouble getting it to output to a CSV file (or anything really).

So, my question is can I use this:

scrapy crawl dmoz -o items.csv

or do I have to create an Item Pipeline?

UPDATED, now with code!:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem

class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()  # this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract()    # this pulled down data in scrapy shell
        return item
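
(For reference: the TargetsItem imported above is not shown in the question, but for the spider to work it is assumed to be declared roughly like this in targets/item.py — a sketch, not the actual file:)

import scrapy

class TargetsItem(scrapy.Item):
    # Fields populated by parse_item above.
    title = scrapy.Field()
    link = scrapy.Field()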

Answer 1:


Rules are the mechanism CrawlSpider uses to follow links. Those links are defined with a LinkExtractor, which indicates which links should be extracted from the crawled pages (starting with the ones in the start_urls list) so they can be followed. You can then pass a callback that will be called for each extracted link, or more precisely, for the page downloaded by following that link.

Your rule must name parse_item as its callback. So, replace:

Rule(LinkExtractor(allow=('ididntuseexample.com', ))),

with:

Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),

This rule says you want to call parse_item on every link whose URL matches ididntuseexample.com. I suspect, though, that what you want in the link extractor is not the domain, but a pattern for the specific links you want to follow and scrape.
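
As an illustration only, a more targeted rule might look like the sketch below; the '/products/.+' pattern is a hypothetical placeholder rather than anything from your site, so adjust it to the URLs you actually want to scrape:

rules = (
    # Follow links whose URL matches the (placeholder) product pattern
    # and scrape each matched page with parse_item.
    Rule(LinkExtractor(allow=(r'/products/.+', )), callback='parse_item', follow=True),
)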

Here is a basic example that crawls Hacker News to retrieve the title and the first lines of the first comment for each story on the front page.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()

class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]
    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item
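
With the callback wired up, the feed export command from your question should work here as well; no item pipeline is required just to write the scraped items to CSV. For example, from inside the Scrapy project:

scrapy crawl hackernews -o items.csv

Item pipelines are only needed for extra post-processing, such as cleaning or validating items or storing them in a database.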


Source: https://stackoverflow.com/questions/26528794/scrapy-crawlspider-output
