Question
I have this scrapy spider that runs well:
```python
# -*- coding: utf-8 -*-
import scrapy

class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = ['http://www.exampleregelwiki.de/index.php/categoryA.html',
                  'http://www.exampleregelwiki.de/index.php/categoryB.html',
                  'http://www.exampleregelwiki.de/index.php/categoryC.html']

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }
```
With `scrapy runspider spider.py -o dat.json` it saves all info to dat.json.
I would like to have one output file per start URL: categoryA.json, categoryB.json, and so on.
A similar question has been left unanswered; I cannot reproduce that answer and was not able to learn from the suggestions there.
How do I achieve the goal of having several output files, one per start URL? I would like to run only a single command/shell script/file to achieve this.
Answer 1:
You didn't use real URLs in your code, so I used my own page for testing. I had to change the CSS selectors and I used different fields.
I save the output as CSV because it is easier to append data; with JSON you would have to read all items from the file, add the new item and save all items again to the same file.
I create an extra field `Category` to use later as the filename in the pipeline.
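A quick illustration of that point (my own sketch, not part of the original answer, with made-up helper names): appending one record to a CSV file is a single write, while keeping a JSON array valid means reloading and rewriting the whole file each time.

```python
import csv
import json
import os

def append_csv(filename, row):
    # CSV: appending one item is just writing one more line at the end
    with open(filename, 'a', newline='') as f:
        csv.writer(f).writerow(row)

def append_json(filename, item):
    # JSON: to keep the file a valid list, read everything, add the item, rewrite the file
    items = []
    if os.path.exists(filename):
        with open(filename) as f:
            items = json.load(f)
    items.append(item)
    with open(filename, 'w') as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
```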
items.py
```python
import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field, used later as the filename in the pipeline
    Category = scrapy.Field()
```
In the spider I get the category from the URL and send it to `parse_details` using `meta` in the `Request`. In `parse_details` I add the category to the item.
spiders/example.py
```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html',
                  'http://blog.furas.pl/category/html.html',
                  'http://blog.furas.pl/category/linux.html']

    def parse(self, response):
        # get category from url
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to the subpages
        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to the callback function
                yield scrapy.Request(url=url, callback=self.parse_details, meta={'category': category})

    def parse_details(self, response):
        # get category
        category = response.meta['category']

        # get only the first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only the first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }
```
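Side note (my addition, not part of the original answer): on Scrapy 1.7 or newer the same category can be passed with `cb_kwargs` instead of `meta`, so it arrives in the callback as a normal keyword argument. A minimal sketch with a hypothetical spider name:

```python
import scrapy

class ExampleCbKwargsSpider(scrapy.Spider):
    # hypothetical variant of the spider above, assuming Scrapy >= 1.7
    name = 'example_cb_kwargs'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html']

    def parse(self, response):
        category = response.url.split('/')[-1][:-5]
        for url in response.css('article a::attr(href)').extract():
            if ('/tag/' not in url) and ('/category/' not in url):
                # cb_kwargs delivers 'category' as a keyword argument to the callback
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_details,
                                     cb_kwargs={'category': category})

    def parse_details(self, response, category):
        yield {'Category': category}
```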
In the pipeline I get the category and use it to open a file for appending and save the item.
pipelines.py
```python
import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):
        # get category and use it as the filename
        filename = item['Category'] + '.csv'

        # open the file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in one row
            # warning: item is a dictionary, so item.values() is not guaranteed to return values in the same order
            #writer.writerow(item.values())

        return item
```
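A possible refinement (my own sketch, not from the original answer): instead of reopening the file for every single item, the pipeline can keep one open file per category and close them all when the spider finishes, using the standard `open_spider`/`close_spider` hooks.

```python
import csv

class CategoryFilesPipeline(object):
    # hypothetical variant that keeps one open file handle per category

    def open_spider(self, spider):
        self.files = {}
        self.writers = {}

    def process_item(self, item, spider):
        category = item['Category']
        if category not in self.writers:
            # open the per-category file once and reuse the writer
            f = open(category + '.csv', 'a', newline='')
            self.files[category] = f
            self.writers[category] = csv.writer(f)
        self.writers[category].writerow([item['Title'], item['Date']])
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```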
In the settings I have to uncomment `ITEM_PIPELINES` to activate the pipeline.
settings.py
```python
ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}
```
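With the pipeline enabled in the project settings, you start the spider from inside the project, e.g. with `scrapy crawl example`; the `-o dat.json` option from the question is no longer needed, because the pipeline itself writes one CSV file per category.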
Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files
BTW: I think you could also write to the files directly in `parse_details`.
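A minimal sketch of that idea (my addition, untested, with a hypothetical spider name): the callback appends the row straight to the per-category CSV file, so no item pipeline is needed, at the cost of mixing scraping and output code.

```python
import csv
import scrapy

class DirectWriteSpider(scrapy.Spider):
    # hypothetical spider that writes CSV rows directly in the callback
    name = 'direct_write'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html']

    def parse(self, response):
        category = response.url.split('/')[-1][:-5]
        for url in response.css('article a::attr(href)').extract():
            if ('/tag/' not in url) and ('/category/' not in url):
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_details,
                                     meta={'category': category})

    def parse_details(self, response):
        category = response.meta['category']
        title = response.css('h1.entry-title a::text').extract_first('').strip()
        date = response.css('.published::text').extract_first('').strip()
        # append the row straight to '<category>.csv' instead of yielding an item
        with open(category + '.csv', 'a', newline='') as f:
            csv.writer(f).writerow([title, date])
```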
Source: https://stackoverflow.com/questions/47361396/scrapy-seperate-output-file-per-starurl