Using phantomjs for dynamic content with scrapy and selenium possible race condition

Submitted by 笑着哭i on 2019-12-31 22:57:14

Question


First off, this is a follow-up question from here: Change number of running spiders scrapyd

I used PhantomJS and Selenium to create a downloader middleware for my Scrapy project. It works well and hasn't really slowed things down when I run my spiders one at a time locally.

But just recently I put a scrapyd server up on AWS. I noticed a possible race condition that seems to be causing errors and performance issues when more than one spider is running at once. I feel like the problem stems from two separate issues.

1) Spiders trying to use phantomjs executable at the same time.

2) Spiders trying to log to phantomjs's ghostdriver log file at the same time.

Guessing here, the performance issue may come from spiders waiting until those resources become available (it could also be related to a race condition I had on an SQLite database).

Here are the errors I get:

exceptions.IOError: [Errno 13] Permission denied: 'ghostdriver.log' (log file race condition?)

selenium.common.exceptions.WebDriverException: Message: 'Can not connect to GhostDriver' (executable race condition?)
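
If the log file collision is part of the problem, one thing that might help is giving every spider process its own ghostdriver log file in a directory the scrapyd user can write to. The sketch below only builds such a path; the helper name `ghostdriver_log_path` and the choice of the system temp directory are assumptions for illustration, not part of the original setup.

```python
import os
import tempfile


def ghostdriver_log_path(spider_name, log_dir=None):
    """Return a log file path that is unique to the current spider process.

    Hypothetical helper: the name and the temp-dir default are assumptions.
    """
    log_dir = log_dir or tempfile.gettempdir()
    # The process id keeps concurrently running spiders from sharing one file.
    filename = "ghostdriver-{0}-{1}.log".format(spider_name, os.getpid())
    return os.path.join(log_dir, filename)


# Intended use inside the middleware (not executed here):
#   driver = webdriver.PhantomJS(
#       service_log_path=ghostdriver_log_path(spider.name))
```

This would not explain the 'Can not connect to GhostDriver' error, so it addresses at most the first of the two messages.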

My questions are:

Does my analysis of what the problem(s) are seem correct?

Are there any known solutions to this problem other than limiting the number of spiders that can be run at a time?

Is there some other way I should be handling javascript? (if you think I should create an entirely new question to discuss the best way to handle javascript with scrapy let me know and I will)

Here is my downloader middleware:

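from sys import platform as _platform

from scrapy.http import HtmlResponse
from selenium import webdriver

# check_spider_middleware and settings are defined elsewhere in the project.
class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        if _platform == "linux" or _platform == "linux2":
            driver = webdriver.PhantomJS(service_log_path='/var/log/scrapyd/ghost.log')
        else:
            driver = webdriver.PhantomJS(executable_path=settings.PHANTOM_JS_PATH)
        try:
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
        finally:
            driver.quit()  # otherwise every request leaves a phantomjs process running
        return HtmlResponse(request.url, encoding='utf-8', body=body)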

Note: the _platform code is a temporary workaround until I get this source code deployed into a static environment.

I found solutions on SO for the JavaScript problem, but they were spider-based. This bothered me because it meant every request had to be made once in the downloader handler and again in the spider. That is why I decided to implement mine as a downloader middleware.
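
Since the middleware starts a new PhantomJS process for every request, one possible way to reduce contention is to start a single driver per spider process and reuse it. The sketch below is a minimal holder for that idea; the class name `LazyDriver` and the injected `factory` callable are assumptions for illustration, and the factory would be something like `lambda: webdriver.PhantomJS(...)` in the real middleware.

```python
class LazyDriver(object):
    """Create one driver on first use and reuse it until close() is called.

    Hypothetical helper: `factory` is any zero-argument callable that
    returns an object with a quit() method, such as a Selenium driver.
    """

    def __init__(self, factory):
        self._factory = factory
        self._driver = None

    def get(self):
        # Start the browser only when the first request needs it.
        if self._driver is None:
            self._driver = self._factory()
        return self._driver

    def close(self):
        # Shut the browser down and allow a fresh one to be created later.
        if self._driver is not None:
            self._driver.quit()
            self._driver = None
```

In a Scrapy middleware, `get()` would presumably be called from `process_request`, and `close()` could be connected to the `spider_closed` signal so the browser is shut down when the spider finishes.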


Answer 1:


Try using webdriver to interface with PhantomJS: https://github.com/brandicted/scrapy-webdriver



Source: https://stackoverflow.com/questions/24962520/using-phantomjs-for-dynamic-content-with-scrapy-and-selenium-possible-race-condi
