I'm learning how to work with Scrapy while refreshing my Python/coding knowledge from school.

Currently, I'm playing around with the IMDb Top 250 list, but I'm struggling with the JSON output file.

My current code is:

# -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem

class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = [""]
    start_urls = ['']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = '' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()
        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
        return updated_list

and my output file is giving me this JSON composition:

    "movie": {
      "poster": [""], 
      "title": ["The Shawshank Redemption"], 
      "photo": [[""]], 
      "actors": [["Tim Robbins","Morgan Freeman",...]]}
    "movie": {
      "poster": [""], 
      "title": ["The Godfather"], 
      "photo": [[""]], 
      "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]}

but I'm looking to achieve this:

  "movies": [{
    "poster": "",
    "title": "The Shawshank Redemption",
    "actors": [
      {"photo": "",
      "name": "Tim Robbins"},
      {"photo": "",
      "name": "Morgan Freeman"},...
    "poster": "",
    "title": "The Godfather",
    "actors": [
      {"photo": "",
      "name": "Marlon Brando"},
      {"photo": "",
      "name": "Al Pacino"},...

in my file I have the following:

import scrapy

class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Items from
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()

I'm aware of the following things:

  1. My results don't come out in order: the first movie on the web page is always the first movie in my output file, but the rest are not. I'm still working on that.

  2. I could do the same thing using Top250ImdbItem(); I'm still looking into how that is done in more detail.

  3. This might not be the ideal layout for my JSON. Suggestions are welcome, or let me know if it's fine as is; I know there is no perfect or "only" way.

  4. Some actors don't have a photo, and in that case the page uses a different CSS selector. For now I'd like to skip the "no picture thumbnail", so it's OK to leave those entries empty.
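
On point 1: Scrapy schedules requests asynchronously, so the detail pages need not come back in list order. A minimal sketch of one possible workaround, assuming each scraped movie is a plain dict that also carries a `rank` key. That key is hypothetical and not part of the code above; the spider would have to record it itself, for instance with `enumerate` in `parse`, passing it along in the request's `meta`. The results are then sorted once the crawl has finished:

```python
def sort_by_rank(movies):
    """Return the movies ordered by their position in the original list.

    `rank` is a hypothetical key that the spider would have to store itself.
    """
    return sorted(movies, key=lambda movie: movie["rank"])


# Responses can arrive in arbitrary order during a crawl
scraped = [
    {"rank": 2, "title": "The Godfather"},
    {"rank": 1, "title": "The Shawshank Redemption"},
]

ordered = sort_by_rank(scraped)
print([movie["title"] for movie in ordered])
```

Sorting after the crawl keeps the spider itself simple; the price is that the ordering is only correct once all items have been collected.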


Question: ... struggling with a JSON output file

Note: I can't run your ActorsSpider; I get the error: Pseudo-elements are not supported.

# Define a `dict` **once**
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    poster = response.css(...
    title = response.css(...
    photos = response.css(...
    actors = response.css(...

    # Assuming the list of actors is in sync with the list of photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list}

    # Append one movie to the Top250 'movies' list
    top250ImdbItem['movies'].append(one_movie)
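
The pairing step can be tried outside Scrapy with plain lists. The following is a minimal sketch, not the spider itself: `build_movie` and the placeholder values are hypothetical names chosen for illustration, and the lists stand in for whatever the selectors extract. It uses `zip()` instead of indexing, so if there are fewer photos than actors the surplus actors are dropped rather than raising an `IndexError`.

```python
import json


def build_movie(poster, title, actor_names, photo_urls):
    """Pair each actor name with a photo URL and wrap them in one movie dict.

    zip() stops at the shorter list, so this assumes names and photos
    are in the same order; actors beyond the last photo are dropped.
    """
    actors = [
        {"photo": photo, "name": name}
        for name, photo in zip(actor_names, photo_urls)
    ]
    return {"poster": poster, "title": title, "actors": actors}


# Placeholder values standing in for what the selectors would extract
top250 = {"movies": []}
top250["movies"].append(
    build_movie(
        poster="poster.jpg",
        title="The Shawshank Redemption",
        actor_names=["Tim Robbins", "Morgan Freeman"],
        photo_urls=["robbins.jpg", "freeman.jpg"],
    )
)

print(json.dumps(top250, indent=2))
```

Note that `title` is a single string here, which matches the target layout; in the spider that would correspond to `extract_first()` rather than `extract()`.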

