Scrapy - Creating nested JSON Object

落花浮王杯 提交于 2021-02-10 18:15:16

问题


I'm learning how to work with Scrapy while refreshing my knowledge in Python?/Coding from school.

Currently, I'm playing around with imdb top 250 list but struggling with a JSON output file.

My current code is:

 # -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

and my output file is giving me this JSON composition:

[
  {
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Shawshank Redemption"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Tim Robbins","Morgan Freeman",...]]}
    },{
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Godfather"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]}
  }
]

but I'm looking to achieve this:

{
  "movies": [{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Shawshank Redemption",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Tim Robbins"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Morgan Freeman"},...
    ]
  },{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Godfather",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Marlon Brando"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Al Pacino"},...
    ]
  }]
}

in my items.py file I have the following:

import scrapy


class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Items from actors.py
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()
    pass

I'm aware of the following things:

  1. My results are not coming out in order, the 1st movie on web page list is always the first movie on my output file but the rest is not. I'm still working on that.

  2. I can do the same thing but working with Top250ImdbItem(), still browsing around how that is done in a more detailed way.

  3. This might not be the perfect layout for my JSON, suggestions are welcomed or if it is, let me know, even though I know there is no perfect way or "the only way".

  4. Some actors don't have a photo and it actually loads a different CSS selector. For now, I would like to avoid reaching for the "no picture thumbnail" so it's ok to leave those items empty.

example:

{"photo": "", "name": "Al Pacino"}

回答1:


Question: ... struggling with a JSON output file


Note: Can't use your ActorsSpider, get Error: Pseudo-elements are not supported.

# Define a `dict` **once**
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    poster = response.css(...
    title = response.css(...
    photos = response.css(...
    actors = response.css(...

    # Assuming List of Actors are in sync with List of Photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                }

    # Append One Movie to Top250 'movies' List
    top250ImdbItem['movies'].append(one_movie)


来源:https://stackoverflow.com/questions/45158117/scrapy-creating-nested-json-object

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!