Portia/Scrapy - how to replace or add values to output JSON

Posted by 时光总嘲笑我的痴心妄想 on 2020-01-03 05:28:04

Question


Two quick questions:

1- I want my final JSON file to replace the extracted text. For example, the text extracted is ADD TO CART, but I want it to read IN STOCK in my final JSON. Is that possible?

2- I would also like to add some custom data to my final JSON file that is not on the website, for example "Store name", so that every product I scrape has the store name attached to it. Is that possible?

I am using both Portia and Scrapy, so suggestions for either platform are welcome.

My Scrapy spider code is below:

from __future__ import absolute_import
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule
from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
from ..utils.processors import Item, Field, Text, Number, Price, Date, Url, Image, Regex
from ..items import PortiaItem


class Advent(BasePortiaSpider):
    name = "advent"
    allowed_domains = [u'www.adventgames.com.au']
    start_urls = [u'http://www.adventgames.com.au/c/4504822/1/all-games-a---k.html',
                  {u'url': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page=[1-5]',
                   u'fragments': [{u'valid': True,
                                   u'type': u'fixed',
                                   u'value': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page='},
                                  {u'valid': True,
                                   u'type': u'range',
                                   u'value': u'1-5'}],
                   u'type': u'generated'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=('.*'),
                deny=()
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                PortiaItem,
                None,
                u'.DataViewCell > form > table',
                [
                    Field(
                        u'Title',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text',
                        []),
                    Field(
                        u'Price',
                        'tr:nth-child(1) > td > .DataViewItemOurPrice *::text',
                        []),
                    Field(
                        u'Img_src',
                        'tr:nth-child(1) > td > .DataViewItemThumbnailImage > div > a > img::attr(src)',
                        []),
                    Field(
                        u'URL',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a::attr(href)',
                        []),
                    Field(
                        u'Stock',
                        'tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)',
                        [])])]]

Answer 1:


I have never used the items class variable; to me it looks hard to read and difficult to understand.

I would suggest writing a callback method and parsing the page yourself, like this:

def my_callback_func(self, response):
    # Each product sits in its own table inside a .DataViewCell form.
    for row in response.css(".DataViewCell > form > table"):
        # Build a fresh item per product instead of reusing one instance.
        item = PortiaItem()
        item['Title'] = row.css('tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text').extract_first()
        item['Stock'] = row.css('tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)').extract_first()
        # Add a derived value based on the scraped button text.
        if item['Stock'] == "ADD TO CART":
            item['is_available'] = "YES"
        # ... and so on for the remaining fields
        yield item
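
Another way to cover both points of the question, without touching the spider, might be a Scrapy item pipeline. The following is only a minimal sketch under some assumptions: the class name StockAndStorePipeline, the STOCK_LABELS mapping, the Store_name key and the STORE_NAME value are all illustrative and do not come from the original project. It relies only on the standard process_item(self, item, spider) pipeline hook.

```python
# Hypothetical names throughout: STOCK_LABELS, STORE_NAME and
# StockAndStorePipeline are illustrative, not part of the original project.

# Scraped button text -> label wanted in the final JSON.
STOCK_LABELS = {"ADD TO CART": "IN STOCK"}

# Constant value attached to every item (placeholder, adjust as needed).
STORE_NAME = "Advent Games"


class StockAndStorePipeline:
    """Rewrites the Stock value and adds a constant store name to each item."""

    def process_item(self, item, spider):
        stock = item.get("Stock")
        # A field may hold a single string or a list of strings,
        # so both shapes are handled here.
        if isinstance(stock, (list, tuple)):
            item["Stock"] = [STOCK_LABELS.get(s.strip(), s) for s in stock]
        elif isinstance(stock, str):
            item["Stock"] = STOCK_LABELS.get(stock.strip(), stock)
        # Custom data that is not present on the website.
        item["Store_name"] = STORE_NAME
        return item
```

To try it, the class would need to be registered in the project's ITEM_PIPELINES setting, for example under a path such as 'myproject.pipelines.StockAndStorePipeline' (a hypothetical module path). If the item class declares a fixed set of fields, Store_name would have to be declared there as well, otherwise the assignment raises a KeyError.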


Source: https://stackoverflow.com/questions/49731142/portia-scrapy-how-to-replace-or-add-values-to-output-json
