Using one Scrapy spider for several websites


I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- this will instead be configurable in a GUI.

4 Answers
  • 2020-12-08 12:19

    Now it is extremely easy to configure Scrapy for these purposes:

    1. For the first URLs to visit, you can pass them as attributes on the spider call with -a, and use the start_requests method to set up how the spider starts.

    2. You don't need to set the allowed_domains variable on the spider. If you don't include that class variable, the spider will allow every domain.

    It ends up looking something like this:

    from scrapy import Spider, Request

    class MySpider(Spider):

        name = "myspider"

        def start_requests(self):
            # start_url is set from the command line via -a start_url=...
            yield Request(self.start_url, callback=self.parse)

        def parse(self, response):
            ...


    and you should call it with:

    scrapy crawl myspider -a start_url="http://example.com"
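
    If you need several entry points instead of one, the same mechanism works: pass them in a single -a argument and split it in start_requests. This is only a sketch building on the answer above; the comma-separated urls attribute is my own convention, not something Scrapy defines:

    from scrapy import Spider, Request

    class MyMultiSpider(Spider):

        name = "mymultispider"

        def start_requests(self):
            # urls is set via -a urls="http://a.example,http://b.example"
            for url in self.urls.split(","):
                yield Request(url, callback=self.parse)

        def parse(self, response):
            ...

    and call it with:

    scrapy crawl mymultispider -a urls="http://example.com,http://example.org"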
    
  • 2020-12-08 12:22

    WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then.

    Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.

    in mybot/settings.py:

    SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
    

    in mybot/spidermanager.py:

    from mybot.spider import MyParametrizedSpider
    
    class MySpiderManager(object):
        loaded = True
    
        def fromdomain(self, name):
            start_urls, extra_domain_names, regexes = self._get_spider_info(name)
            return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)
    
        def close_spider(self, spider):
            # Put here code you want to run before the spider is closed
            pass
    
        def _get_spider_info(self, name):
            # query your backend (maybe a sqldb) using `name` as primary key, 
            # and return start_urls, extra_domains and regexes
            ...
            return (start_urls, extra_domains, regexes)
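
    The _get_spider_info stub is where your own storage plugs in. Purely as a hedged illustration, assuming a hypothetical SQLite table spiders(name, start_urls, extra_domains, regexes) with the list columns stored as comma-separated strings, the method body could look like:

    import sqlite3

    def _get_spider_info(self, name):
        # Method of MySpiderManager; assumes `name` exists in the table
        conn = sqlite3.connect("spiders.db")
        try:
            row = conn.execute(
                "SELECT start_urls, extra_domains, regexes FROM spiders WHERE name = ?",
                (name,),
            ).fetchone()
        finally:
            conn.close()
        start_urls = row[0].split(",")
        extra_domains = row[1].split(",")
        regexes = row[2].split(",")
        return (start_urls, extra_domains, regexes)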
    

    and now your custom spider class, in mybot/spider.py:

    from scrapy.spider import BaseSpider

    class MyParametrizedSpider(BaseSpider):

        def __init__(self, name, start_urls, extra_domain_names, regexes):
            # All parameters come from the spider manager's backend lookup
            self.domain_name = name
            self.start_urls = start_urls
            self.extra_domain_names = extra_domain_names
            self.regexes = regexes

        def parse(self, response):
            ...

    

    Notes:

    • You can extend CrawlSpider too if you want to take advantage of its Rules system (a sketch follows these notes)
    • To run a spider use: ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system
    • Because this solution overrides the default SpiderManager, coding a classic spider (one Python module per spider) doesn't work, but I think this is not an issue for you. For more info, see the default spider manager, TwistedPluginSpiderManager
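
    Regarding the first note, a parametrized CrawlSpider could build its Rules from the configured regexes. The following is only a rough sketch against a modern Scrapy API (CrawlSpider, Rule, LinkExtractor), not the v0.7 code of this answer:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MyParametrizedCrawlSpider(CrawlSpider):

        def __init__(self, name, start_urls, allowed_domains, regexes, *args, **kwargs):
            self.name = name
            self.start_urls = start_urls
            self.allowed_domains = allowed_domains
            # Build one crawling Rule per configured URL regex
            self.rules = [
                Rule(LinkExtractor(allow=regex), callback="parse_item", follow=True)
                for regex in regexes
            ]
            super().__init__(*args, **kwargs)

        def parse_item(self, response):
            ...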
  • 2020-12-08 12:37

    What you need is to dynamically create spider classes: subclass your favorite generic spider class supplied by Scrapy (a CrawlSpider subclass with your rules added, XMLFeedSpider, or whatever) and add domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc.) as you get or deduce them from your GUI (or config file, or whatever).

    Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:

    from scrapy import spider

    def makespider(domain_name, start_urls,
                   basecls=spider.BaseSpider):
        # Build a new spider class on the fly from configuration values
        return type(domain_name + 'Spider',
                    (basecls,),
                    {'domain_name': domain_name,
                     'start_urls': start_urls})

    allspiders = []
    # listofdomainurlpairs: your configuration, an iterable of (domain, [urls]) pairs
    for domain, urls in listofdomainurlpairs:
        allspiders.append(makespider(domain, urls))

    

    This gives you a list of very bare-bones spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste... ;-).
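
    For example, a parse callback can be handed to the factory the same way; parse_page below is just a hypothetical placeholder for your own callback, following the same (old) API as the snippet above:

    from scrapy import spider

    def makespider(domain_name, start_urls, parse_func,
                   basecls=spider.BaseSpider):
        # The supplied callable becomes the class's parse method
        return type(domain_name + 'Spider',
                    (basecls,),
                    {'domain_name': domain_name,
                     'start_urls': start_urls,
                     'parse': parse_func})

    def parse_page(self, response):
        # placeholder callback -- extract whatever you need here
        ...

    ExampleSpider = makespider('example.com', ['http://example.com/'], parse_page)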

  • 2020-12-08 12:44

    Shameless self promotion on domo! You'll need to instantiate the crawler for your project as shown in its examples.

    You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding the settings at runtime whenever the configuration changes.
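
    In a current Scrapy version, making a crawl configurable at runtime can also be done from a plain script: start from the project settings, override the values you want, and pass the spider arguments programmatically. This is a generic sketch only (the setting names and config keys below are examples, not taken from this answer or from domo):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_configured_crawl(spider_cls, user_config):
        # Start from the project settings and override per-run values
        settings = get_project_settings().copy()
        settings.set("DOWNLOAD_DELAY", user_config.get("download_delay", 0.5))
        settings.set("USER_AGENT", user_config.get("user_agent", "mybot"))

        process = CrawlerProcess(settings)
        # Extra keyword arguments become spider attributes, like -a on the CLI
        process.crawl(spider_cls, start_url=user_config["start_url"])
        process.start()  # blocks until the crawl finishes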
