How to give URL to scrapy for crawling?

终归单人心 2020-11-29 01:42

I want to use scrapy for crawling web pages. Is there a way to pass the start URL from the terminal itself?

It is given in the documentation that either the name of

6 Answers
  • 2020-11-29 01:56

    Use the scrapy parse command. It fetches the URL you pass on the command line and parses it with the spider you name.

    $ scrapy parse http://www.example.com/ --spider=spider-name
    

    http://doc.scrapy.org/en/latest/topics/commands.html#parse

  • 2020-11-29 02:13

    An even easier way to allow multiple URL arguments than what Peter suggested is to pass them as a single string with the URLs separated by commas, like this:

    -a start_urls="http://example1.com,http://example2.com"
    

    In the spider you then simply split the string on ',' to get a list of URLs:

    self.start_urls = kwargs.get('start_urls').split(',')
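
    The split above can be made a little more forgiving. The following is only a sketch: `parse_start_urls` is a hypothetical helper name, not part of Scrapy, and it assumes that no URL itself contains a comma.

```python
def parse_start_urls(raw):
    """Split a comma-separated string into a list of trimmed, non-empty URLs.

    `raw` is the value received for `-a start_urls=...`; it may be None
    when the argument was not given on the command line.
    """
    if not raw:
        return []
    # Strip whitespace around each piece and drop empty pieces, so that
    # "a, b," yields two URLs instead of three.
    return [url.strip() for url in raw.split(",") if url.strip()]
```

    Inside the spider this would be used as self.start_urls = parse_start_urls(kwargs.get('start_urls')), which also avoids calling split on None when the argument is omitted.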
    
  • 2020-11-29 02:15

    Sjaak Trekhaak has the right idea, and here is how to allow multiple URLs:

    import scrapy

    class MySpider(scrapy.Spider):
        """
        This spider will try to crawl whatever is passed in `start_urls` which
        should be a comma-separated string of fully qualified URIs.
    
        Example: start_urls=http://localhost,http://example.com
        """
        def __init__(self, name=None, **kwargs):
            # Remove start_urls from kwargs so the base class does not
            # overwrite the list with the raw string.
            if 'start_urls' in kwargs:
                self.start_urls = kwargs.pop('start_urls').split(',')
            # super() must name this class (MySpider), not Spider.
            super(MySpider, self).__init__(name, **kwargs)
    
  • 2020-11-29 02:17

    This is an extension of the approach given by Sjaak Trekhaak in this thread. As it stands, that approach only works if you provide exactly one URL. If you try to provide more than one, for instance like this:

    -a start_url=http://url1.com,http://url2.com
    

    then Scrapy (I'm using the current stable version 0.14.4) will terminate with the following exception:

    error: running 'scrapy crawl' with more than one spider is no longer supported
    

    However, you can circumvent this problem by choosing a different variable for each start url, together with an argument that holds the number of passed urls. Something like this:

    -a start_url1=http://url1.com 
    -a start_url2=http://url2.com 
    -a urls_num=2
    

    You can then do the following in your spider:

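    # Import path for the Scrapy 0.14 series used in this answer.
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
    
        name = 'my_spider'
    
        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
    
            # Number of URLs passed as start_url1 .. start_urlN.
            urls_num = int(kwargs.get('urls_num', 0))
    
            start_urls = []
            # The upper bound is urls_num + 1 so the last URL is included.
            for i in range(1, urls_num + 1):
                start_urls.append(kwargs.get('start_url{0}'.format(i)))
    
            self.start_urls = start_urls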
    

    This is a somewhat ugly hack, but it works. Of course, it is tedious to write out a separate command-line argument for each URL by hand. It therefore makes sense to launch the scrapy crawl command from Python through subprocess and generate the arguments in a loop.
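
    A minimal sketch of such a wrapper follows. The function names `build_crawl_command` and `run_crawl` are hypothetical, and the sketch assumes the `scrapy` executable is on the PATH and that it is run from inside the Scrapy project directory.

```python
import subprocess


def build_crawl_command(spider_name, urls):
    """Build the argument list for `scrapy crawl` with one -a per URL.

    Produces start_url1 .. start_urlN plus urls_num=N, matching the
    argument names the spider in this answer expects.
    """
    command = ["scrapy", "crawl", spider_name]
    for index, url in enumerate(urls, start=1):
        command += ["-a", "start_url{0}={1}".format(index, url)]
    command += ["-a", "urls_num={0}".format(len(urls))]
    return command


def run_crawl(spider_name, urls):
    """Run the crawl as a child process and return its exit code."""
    # Passing a list (not a single string) avoids shell quoting problems
    # with characters such as '&' or '?' in the URLs.
    return subprocess.call(build_crawl_command(spider_name, urls))
```

    Calling run_crawl('my_spider', ['http://url1.com', 'http://url2.com']) would then start the same crawl as the three -a arguments shown above.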

    Hope it helps. :)

  • 2020-11-29 02:18

    I'm not really sure about the commandline option. However, you could write your spider like this.

    class MySpider(BaseSpider):
    
        name = 'my_spider'
    
        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
    
            # start_url comes from "-a start_url=..." on the command line.
            self.start_urls = [kwargs.get('start_url')]
    

    And start it like: scrapy crawl my_spider -a start_url="http://some_url"

  • 2020-11-29 02:18

    You can also try this:

    $ scrapy view http://www.sitename.com
    

    This opens the requested URL in a browser window, showing the page as Scrapy fetched it.
