Question
Just two quick questions:
1- I want my final JSON file to replace the extracted text (for example, the extracted text is ADD TO CART, but I want it to read IN STOCK in my final JSON). Is it possible?
2- I would also like to add some custom data to my final JSON file that is not on the website, for example "Store name", so that every product I scrape has the store name after it. Is it possible?
I am using both Portia and Scrapy, so suggestions for either platform are welcome.
My Scrapy spider code is below:
# __future__ imports must come before any other import
from __future__ import absolute_import
import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule
from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
from ..utils.processors import (Item, Field, Text, Number, Price, Date, Url,
                                Image, Regex)
from ..items import PortiaItem

class Advent(BasePortiaSpider):
    name = "advent"
    allowed_domains = [u'www.adventgames.com.au']
    start_urls = [u'http://www.adventgames.com.au/c/4504822/1/all-games-a---k.html',
                  {u'url': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page=[1-5]',
                   u'fragments': [{u'valid': True,
                                   u'type': u'fixed',
                                   u'value': u'http://www.adventgames.com.au/Listing/Category/?categoryId=4504822&page='},
                                  {u'valid': True,
                                   u'type': u'range',
                                   u'value': u'1-5'}],
                   u'type': u'generated'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=('.*'),
                deny=()
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                PortiaItem,
                None,
                u'.DataViewCell > form > table',
                [
                    Field(
                        u'Title',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text',
                        []),
                    Field(
                        u'Price',
                        'tr:nth-child(1) > td > .DataViewItemOurPrice *::text',
                        []),
                    Field(
                        u'Img_src',
                        'tr:nth-child(1) > td > .DataViewItemThumbnailImage > div > a > img::attr(src)',
                        []),
                    Field(
                        u'URL',
                        'tr:nth-child(1) > td > .DataViewItemProductTitle > a::attr(href)',
                        []),
                    Field(
                        u'Stock',
                        'tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)',
                        [])])]]
Answer 1:
I have never used the items class variable; it looks hard to read and understand.
I would suggest writing a callback method and parsing the response there, like this:
    def my_callback_func(self, response):
        # Each matching table is one product; build a separate item for each.
        for row in response.css(".DataViewCell > form > table"):
            item = PortiaItem()
            item['Title'] = row.css('tr:nth-child(1) > td > .DataViewItemProductTitle > a *::text').extract_first()
            item['Stock'] = row.css('tr:nth-child(2) > td > .DataViewItemAddToCart > .wButton::attr(value)').extract_first()
            # Derive a custom value from the scraped button text.
            if item['Stock'] == "ADD TO CART":
                item['is_available'] = "YES"
            # ... and so on for the remaining fields
            yield item
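
If you would rather keep the spider as Portia generated it, another option that should cover both points of the question is a Scrapy item pipeline, which is just a plain class with a process_item(self, item, spider) method that runs on every scraped item before it is exported. The following is only a rough sketch: the class name StockAndStorePipeline, the label mapping, the store name "Advent Games" and the field name Store_name are all made up for illustration, and it assumes the item behaves like a dict and that Stock may arrive either as a string or as a list of strings.

```python
class StockAndStorePipeline:
    """Hypothetical pipeline: rewrites Stock labels and adds a store name."""

    # Assumed mapping from the scraped button text to the wanted output text.
    STOCK_LABELS = {"ADD TO CART": "IN STOCK"}
    # Assumed constant value; replace with the real store name.
    STORE_NAME = "Advent Games"

    def _relabel(self, text):
        # Unknown labels are passed through unchanged.
        return self.STOCK_LABELS.get(text.strip(), text)

    def process_item(self, item, spider):
        stock = item.get("Stock")
        if isinstance(stock, str):
            item["Stock"] = self._relabel(stock)
        elif isinstance(stock, list):
            # Extracted fields may be lists of strings; relabel each entry.
            item["Stock"] = [self._relabel(s) for s in stock]
        # Custom data that is not present on the website.
        item["Store_name"] = self.STORE_NAME
        return item
```

To use something like this, the class would go in the project's pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, with the dotted path adjusted to your own project name. If your item class only accepts declared fields, Store_name would also need to be declared on it.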
Source: https://stackoverflow.com/questions/49731142/portia-scrapy-how-to-replace-or-add-values-to-output-json