Question
I'm trying to scrape some URLs with Scrapy and Selenium. Some of the URLs are processed by Scrapy directly, and the others are handled with Selenium first.
The problem is that while Selenium is handling a URL, Scrapy does not process the others in parallel. It waits for the webdriver to finish its work.
I have tried running multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems I don't have enough experience to get it right.
In the example below, all the URLs are printed only after the webdriver is closed. Please advise: is there any way to make it run "in parallel"?

import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)

Answer 1:
It seems that I've found a solution.
I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.
I had tried this approach before, but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the start() method of the CrawlerProcess multiple times in each process, which is incorrect because a Twisted reactor cannot be restarted once it has stopped. I found a simple and clear example of running a spider in a loop using callbacks.
So I split my tasks list between the processes. Then, inside the crawl(tasks) function, I build a chain of callbacks that runs my spider multiple times, passing a different task as its init parameter each time.

import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    # One CrawlerProcess (and therefore one reactor) per worker process.
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        # Schedule the next crawl only after the previous one has finished.
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()  # Called exactly once per process.


def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        # Each worker receives its own share of the tasks.
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()

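np.array_split works here, but it first converts the list of dicts into a NumPy object array. If NumPy is not otherwise needed, a plain-Python helper can do the same split. The sketch below is only an assumed alternative: split_tasks is a hypothetical name, not part of Scrapy or of the code above. Like np.array_split, it gives the first chunks one extra item when the list does not divide evenly.

```python
def split_tasks(tasks, n_chunks):
    """Split a list into n_chunks contiguous lists of near-equal size.

    Hypothetical helper (not part of Scrapy or NumPy).
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # The first `extra` chunks get one item more than the rest.
    base, extra = divmod(len(tasks), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(tasks[start:start + size])
        start += size
    return chunks


tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]
chunks = split_tasks(tasks, 2)  # one single-task list per process
```

With this helper, the call in main() would read pool.map(crawl, split_tasks(tasks, processes)). Note that a chunk can be empty when there are fewer tasks than processes.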
The TestSpider code from my question must be modified accordingly to accept a task as an init parameter.

class TestSpider(scrapy.Spider):
    name = 'test_spider'

    def __init__(self, task):
        super().__init__()
        self.task = task  # Passed in via process.crawl('test_spider', task=...)

    def start_requests(self):
        yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)

Source: https://stackoverflow.com/questions/61194207/how-can-i-make-selenium-run-in-parallel-with-scrapy