Question
I am quite new to Scrapy and want to try the following: extract some values from a webpage, store them in a variable, and use them in my main script. Therefore I followed their tutorial and changed the code for my purposes:
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title  # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()  # the script will block here until the crawling is finished

print(title)  # Verify if it works and do some other actions later on...
This works so far, but I am pretty sure it is not good style, and may even have bad side effects, since I define the title variable as global. If I skip that line, I get the "undefined variable" error, of course :/ Therefore I am searching for a way to return the variable and use it in my main script.
I have read about item pipelines, but I was not able to make them work.
Any help/ideas are greatly appreciated :) Thanks in advance!
Answer 1:
Using global, as you know, is not good style, especially when you later need to extend your code.
My suggestion is to store the title in a file or a list and use it in your main process; or, if you want to handle the title in another script, just open the file and read the title there.
spider.py
import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored in the current session
file_append = open(namefile, 'a', encoding='utf-8')  # 'a' creates the file if it doesn't exist
try:
    title_in_file = open(namefile, 'r', encoding='utf-8').readlines()
except FileNotFoundError:
    title_in_file = []  # no titles stored yet

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # only write the title if it is in neither the file nor the current session
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
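In your main script you can then read the collected titles back from the file after the crawl has finished. A minimal sketch of that step (assuming namefile.txt is the file the spider above writes):

with open('namefile.txt', 'r', encoding='utf-8') as f:
    titles = [line.strip() for line in f]
print(titles)  # continue with your other actions here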
Answer 2:
Making a variable global should work for what you need, but as you mentioned, it isn't good style.
I would actually recommend using a different service for communication between processes, something like Redis, so you won't have conflicts between your spider and any other process.
It is very simple to set up and use; the documentation has a very simple example.
Instantiate the Redis connection inside the spider and again in the main process (think of them as separate processes). The spider sets the variables and the main process reads (or gets) the information.
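A minimal sketch of that idea using the redis-py client (this assumes a Redis server running on localhost:6379; the key name scraped:title is made up for illustration):

import scrapy
from scrapy.crawler import CrawlerProcess
import redis  # pip install redis

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    custom_settings = {'LOG_ENABLED': 'False'}

    def parse(self, response):
        # the spider sets the value in Redis instead of a global variable
        r = redis.Redis(decode_responses=True)
        r.set('scraped:title', response.css('title::text').extract_first())

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl is finished

    # the main process gets the value back after the crawl
    r = redis.Redis(decode_responses=True)
    print(r.get('scraped:title'))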
Source: https://stackoverflow.com/questions/47993380/scrapy-store-returned-items-in-variables-to-use-in-main-script