Question
I have this scrapy spider that runs well:
```python
# -*- coding: utf-8 -*-
import scrapy

class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = ['http://www.exampleregelwiki.de/index.php/categoryA.html',
                  'http://www.exampleregelwiki.de/index.php/categoryB.html',
                  'http://www.exampleregelwiki.de/index.php/categoryC.html']

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }
```
With `scrapy runspider spider.py -o dat.json` it saves all info to dat.json.
I would like to have one output file per start URL: categoryA.json, categoryB.json, and so on.
A similar question has been left unanswered; I cannot reproduce that answer and was not able to learn from the suggestions there.
How do I achieve the goal of having several output files, one per start URL? I would like to run only a single command/shell script/file to achieve this.
Answer 1:
You didn't use real URLs in your code, so I used my own page for testing. I had to change the CSS selectors and I used different fields.
I save the output as CSV because it is easier to append data; with JSON you would have to read all items from the file, add the new item and save all items again to the same file.
I create an extra field `Category` to use later as the filename in the pipeline.
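A quick illustration of that point (my own sketch, not part of the original answer, with made-up helper names): appending one record to a CSV file is a single write, while keeping a JSON array valid means reloading and rewriting the whole file each time.

```python
import csv
import json
import os

def append_csv(filename, row):
    # CSV: appending one item is just writing one more line at the end
    with open(filename, 'a', newline='') as f:
        csv.writer(f).writerow(row)

def append_json(filename, item):
    # JSON: to keep the file a valid list, read everything, add the item, rewrite the file
    items = []
    if os.path.exists(filename):
        with open(filename) as f:
            items = json.load(f)
    items.append(item)
    with open(filename, 'w') as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
```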
items.py
```python
import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field, used later as the filename in the pipeline
    Category = scrapy.Field()
```
In the spider I get the category from the URL and send it to `parse_details` using `meta` in the `Request`. In `parse_details` I add the category to the item.
spiders/example.py
```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html',
                  'http://blog.furas.pl/category/html.html',
                  'http://blog.furas.pl/category/linux.html']

    def parse(self, response):
        # get category from url
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to the subpages
        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to the callback function
                yield scrapy.Request(url=url, callback=self.parse_details, meta={'category': category})

    def parse_details(self, response):
        # get category
        category = response.meta['category']

        # get only the first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only the first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }
```
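Side note (my addition, not part of the original answer): on Scrapy 1.7 or newer the same category can be passed with `cb_kwargs` instead of `meta`, so it arrives in the callback as a normal keyword argument. A minimal sketch with a hypothetical spider name:

```python
import scrapy

class ExampleCbKwargsSpider(scrapy.Spider):
    # hypothetical variant of the spider above, assuming Scrapy >= 1.7
    name = 'example_cb_kwargs'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html']

    def parse(self, response):
        category = response.url.split('/')[-1][:-5]
        for url in response.css('article a::attr(href)').extract():
            if ('/tag/' not in url) and ('/category/' not in url):
                # cb_kwargs delivers 'category' as a keyword argument to the callback
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_details,
                                     cb_kwargs={'category': category})

    def parse_details(self, response, category):
        yield {'Category': category}
```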
In the pipeline I get the category and use it to open a file for appending and save the item.
pipelines.py
```python
import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):
        # get category and use it as the filename
        filename = item['Category'] + '.csv'

        # open the file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in one row
            # warning: item is a dictionary, so item.values() is not guaranteed to return values in the same order
            #writer.writerow(item.values())

        return item
```
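A possible refinement (my own sketch, not from the original answer): instead of reopening the file for every single item, the pipeline can keep one open file per category and close them all when the spider finishes, using the standard `open_spider`/`close_spider` hooks.

```python
import csv

class CategoryFilesPipeline(object):
    # hypothetical variant that keeps one open file handle per category

    def open_spider(self, spider):
        self.files = {}
        self.writers = {}

    def process_item(self, item, spider):
        category = item['Category']
        if category not in self.writers:
            # open the per-category file once and reuse the writer
            f = open(category + '.csv', 'a', newline='')
            self.files[category] = f
            self.writers[category] = csv.writer(f)
        self.writers[category].writerow([item['Title'], item['Date']])
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```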
In the settings I have to uncomment `ITEM_PIPELINES` to activate the pipeline.
settings.py
```python
ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}
```
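With the pipeline enabled in the project settings, you start the spider from inside the project, e.g. with `scrapy crawl example`; the `-o dat.json` option from the question is no longer needed, because the pipeline itself writes one CSV file per category.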
Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files
BTW: I think you could also write to the files directly in `parse_details`.
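A minimal sketch of that idea (my addition, untested, with a hypothetical spider name): the callback appends the row straight to the per-category CSV file, so no item pipeline is needed, at the cost of mixing scraping and output code.

```python
import csv
import scrapy

class DirectWriteSpider(scrapy.Spider):
    # hypothetical spider that writes CSV rows directly in the callback
    name = 'direct_write'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html']

    def parse(self, response):
        category = response.url.split('/')[-1][:-5]
        for url in response.css('article a::attr(href)').extract():
            if ('/tag/' not in url) and ('/category/' not in url):
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_details,
                                     meta={'category': category})

    def parse_details(self, response):
        category = response.meta['category']
        title = response.css('h1.entry-title a::text').extract_first('').strip()
        date = response.css('.published::text').extract_first('').strip()
        # append the row straight to '<category>.csv' instead of yielding an item
        with open(category + '.csv', 'a', newline='') as f:
            csv.writer(f).writerow([title, date])
```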
Source: https://stackoverflow.com/questions/47361396/scrapy-seperate-output-file-per-starurl