Scrapy store returned items in variables to use in main script

Submitted by 江枫思渺然 on 2019-12-22 14:05:03

Question


I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in variables, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start() # the script will block here until the crawling is finished

print(title) # Verify if it works and do some other actions later on...

This works so far, but I am pretty sure it is not good style, and may even have bad side effects, if I define the title variable as global. If I skip that line, I get the "undefined variable" error, of course :/ So I am looking for a way to return the variable and use it in my main script.

I have read about item pipelines, but I was not able to make them work.

Any help/ideas are greatly appreciated :) Thanks in advance!


Answer 1:


Using global, as you know, is not good style, especially when you later need to extend your code.

My suggestion is to store the title in a file or a list and use it in your main process; or, if you want to handle the title in another script, just open the file and read the title from there (see the sketch after the code below).

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored during the current session
file_append = open(namefile, 'a', encoding='utf-8')

# Load titles from previous sessions; start with an empty list
# if the file does not exist yet.
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': False,
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # Append only titles not seen in a previous session
        # or earlier in this one.
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
    file_append.close()
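
For the second option the answer mentions (handling the title from another script), a minimal sketch of a main script that reads the collected titles back from namefile.txt, the file the spider above writes:

main.py

# Read the titles the spider stored; assumes spider.py has
# already run and created namefile.txt.
with open('namefile.txt', 'r', encoding='utf-8') as f:
    titles = [line.strip() for line in f if line.strip()]

print(titles)  # use the extracted titles here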



Answer 2:


Making a variable global should work for what you need, but, as you mentioned, it is not good style.

I would actually recommend using a different service for communication between processes, something like Redis, so you won't have conflicts between your spider and any other process.

It is very simple to set up and use; the documentation has a very simple example.

Instantiate the Redis connection inside the spider and again in the main process (think of them as separate processes). The spider sets the variables and the main process reads (or gets) the information.
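
The answer gives no code, but here is a minimal sketch of the idea, assuming a Redis server running on localhost:6379, the redis-py package, and an arbitrary key name 'quotes:title' (all three are assumptions, not part of the original answer):

import redis
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    custom_settings = {'LOG_ENABLED': False}

    def parse(self, response):
        # Spider side: set the value in Redis instead of a global variable.
        # 'quotes:title' is an arbitrary key name chosen for this sketch.
        r = redis.Redis(host='localhost', port=6379)
        r.set('quotes:title', response.css('title::text').extract_first())


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until crawling is finished

    # Main-process side: get the value back once crawling is done.
    r = redis.Redis(host='localhost', port=6379)
    title = r.get('quotes:title')  # returns bytes, or None if unset
    print(title.decode('utf-8') if title else None)

Because Redis lives outside both sides, the same pattern keeps working even when the spider runs in a completely separate process from the code that consumes the result.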



Source: https://stackoverflow.com/questions/47993380/scrapy-store-returned-items-in-variables-to-use-in-main-script
