Scrapy, Celery and multiple spiders


Question


I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider has started and I then start a second spider, the first spider executes twice.

See my code here:

ProcessJob.py

# Imports needed by this module; MysqlConnector, CrawlerSettings and the
# spider module are the poster's own project modules.
from multiprocessing import Process

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

import CrawlerSettings
import MysqlConnector
import spider


class ProcessJob():
    def processJob(self, job):
        # Mark the job as "in progress" in the database
        mysql = MysqlConnector.Mysql()
        db = mysql.getConnection()
        cur = db.cursor()
        job.status = 1
        update = "UPDATE job SET status=1 WHERE id=" + str(job.id)
        cur.execute(update)
        db.commit()
        db.close()

        # Start a new crawler for this job
        configure_logging()
        webspider = spider.MySpider
        if job.ajax == 1:
            webspider.custom_settings = CrawlerSettings.ajax_settings
        else:
            webspider.custom_settings = CrawlerSettings.normal_settings
        crawler = UrlCrawlerScript(webspider, job)
        crawler.start()
        crawler.join()
        reactor.stop(0)


class UrlCrawlerScript(Process):
    def __init__(self, spider, job):
        Process.__init__(self)
        self.crawler = CrawlerRunner()
        self.crawler.crawl(spider, job=job)

    def run(self):
        # Wait for every crawl scheduled on the runner, then stop the reactor
        d = self.crawler.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run(0)
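
The Celery wiring itself is not part of the snippet; presumably the task just hands a job to ProcessJob. A minimal sketch of what that might look like, purely as an assumption for context (the app configuration, the task name and the load_job_from_db helper are illustrative, not the original code):

from celery import Celery

app = Celery('crawler', broker='redis://localhost:6379/0')  # assumed broker URL


@app.task
def process_job_task(job_id):
    # load_job_from_db is a hypothetical helper that rebuilds the Job object
    job = load_job_from_db(job_id)
    ProcessJob().processJob(job)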

And the Spider:

def __init__(self, job):
    self.job = job

    # Derive the allowed domains from the job's comma-separated URL list
    allowedDomainsPre = job.url.split(",")
    allowedDomains = []
    for domains in allowedDomainsPre:
        parsed_uri = urlparse(domains)
        domain = '{uri.netloc}'.format(uri=parsed_uri)
        print(domain)
        allowedDomains.append(domain)

    self.allowed_domains = allowedDomains
    self.start_urls = allowedDomainsPre

    # Split the job's processing patterns into allow/deny lists
    # (a leading '-' marks a denied pattern)
    jobPatterns = job.processing_patterns.split(",")
    allowedPatterns = []
    deniedPatterns = []
    for pattern in jobPatterns:
        if '-' in pattern:
            deniedPatterns.append(pattern.replace("-", ""))
        else:
            allowedPatterns.append(pattern)

    self._rules = [
        Rule(LinkExtractor(allow=allowedPatterns, deny=deniedPatterns),
             callback=self.parse_items, follow=True)
    ]
    self.name = job.id

def parse_items(self, response):
    # Store the raw HTML, the URL and the owning job id
    item = Item()
    item['html'] = response.body
    item['url'] = response.url
    item['job_id'] = self.job.id
    return item
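
Neither the spider's class declaration nor the Item class appears in the snippet. Given the Rule/LinkExtractor setup in __init__ and the three fields assigned in parse_items, they presumably look roughly like this sketch (class names, module paths and imports are assumptions):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urlparse import urlparse  # Python 2; urllib.parse on Python 3


class Item(scrapy.Item):
    # Fields matching what parse_items assigns
    html = scrapy.Field()
    url = scrapy.Field()
    job_id = scrapy.Field()


class MySpider(CrawlSpider):
    name = "myspider"  # overwritten with job.id in __init__

    # __init__ and parse_items as shown above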

What this does: I retrieve new jobs from my database (not shown in this code, it would be a bit too much). Then I want to process them, i.e. run a spider for each job. As I said, the problem is that when I execute two jobs at a time, the first spider is "doubled", i.e. it executes twice in parallel.

Any suggestions on how to fix this? It's possibly a problem with the reactor again :(
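
For reference, a minimal sketch of the usual way to avoid restarting a shared reactor: run each crawl in its own child process with a fresh CrawlerProcess, so nothing is shared between jobs. This is only an illustration of that idea, not the poster's code; run_job and its arguments are assumptions.

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess


def _crawl(spider_cls, job, settings):
    # Runs in the child process, which gets its own, fresh Twisted reactor
    process = CrawlerProcess(settings)
    process.crawl(spider_cls, job=job)
    process.start()  # blocks until this crawl finishes


def run_job(spider_cls, job, settings):
    # One child process per job; no reactor state is shared between crawls
    p = Process(target=_crawl, args=(spider_cls, job, settings))
    p.start()
    p.join()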

Source: https://stackoverflow.com/questions/33737153/scrapy-celery-and-multiple-spiders
