Pass Scrapy Spider a list of URLs to crawl via .txt file

无人及你 2020-12-24 11:16

I'm a little new to Python and very new to Scrapy.

I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of U

4 Answers
  • 2020-12-24 11:42

    If your URLs are separated by lines:

    def get_urls(filename):
        # split() breaks on any whitespace, so one URL per line works
        # and stray blank lines are ignored.
        with open(filename) as f:
            return f.read().split()
    

    These lines of code will then give you the URLs.
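
As a rough sketch of the same idea with a little more tolerance for messy input, the helper below reads one URL per line and skips blank lines and lines starting with '#'. The name load_urls and the '#' comment convention are assumptions for illustration, not part of the answer above.

```python
from pathlib import Path


def load_urls(filename):
    """Return the non-empty, non-comment lines of a text file as a list."""
    urls = []
    # read_text() opens and closes the file; splitlines() drops the newlines.
    for line in Path(filename).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skipping blank lines and '#' comments is an assumed file convention.
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

The returned list can be assigned to start_urls in the same way as the result of get_urls.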

  • 2020-12-24 11:57

    You could simply read in the .txt file:

    with open('your_file.txt') as f:
        start_urls = f.readlines()
    

    readlines() keeps the trailing newline character on each line, so to strip it, try:

    with open('your_file.txt') as f:
        start_urls = [url.strip() for url in f.readlines()]
    

    Hope this helps
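
To see why the strip() call matters, here is a small sketch that uses an in-memory io.StringIO object in place of your_file.txt. The example.com URLs are made up for illustration.

```python
import io

# An in-memory stand-in for your_file.txt (hypothetical contents).
fake_file = io.StringIO("http://example.com/a\nhttp://example.com/b\n")

# readlines() keeps the newline at the end of every line.
raw = fake_file.readlines()

# Rewind and read again, stripping surrounding whitespace from each line.
fake_file.seek(0)
cleaned = [url.strip() for url in fake_file]

print(raw)      # ['http://example.com/a\n', 'http://example.com/b\n']
print(cleaned)  # ['http://example.com/a', 'http://example.com/b']
```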

  • 2020-12-24 12:03

    Run your spider with the -a option, like this:

    scrapy crawl myspider -a filename=text.txt
    

    Then read the file in the __init__ method of the spider and define start_urls:

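    import scrapy

    class MySpider(scrapy.Spider):  # BaseSpider is the older, deprecated name
        name = 'myspider'
    
        def __init__(self, filename=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            if filename:
                with open(filename, 'r') as f:
                    # strip() removes the trailing newline on each line
                    self.start_urls = [url.strip() for url in f if url.strip()]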
    

    Hope that helps.

  • 2020-12-24 12:06
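    import scrapy

    class MySpider(scrapy.Spider):
        name = 'nameofspider'
    
        def __init__(self, filename=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            if filename:
                # open the file passed with -a, not a hard-coded name
                with open(filename) as f:
                    self.start_urls = [url.strip() for url in f.readlines()]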
    

    This will be your code. It picks up the URLs from the .txt file if they are on separate lines: url1 on the first line, url2 on the next, and so on.

    After this, run the command:

    scrapy crawl nameofspider -a filename=filename.txt
    

    Let's say your filename is 'file.txt'. Then run the command:

    scrapy crawl nameofspider -a filename=file.txt
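
One thing that may be worth checking before handing the list to the spider: Scrapy raises a ValueError for request URLs that have no scheme, so a line like example.com/page would fail when its request is created. Below is a minimal sketch of a pre-check, assuming only http and https URLs are wanted. split_valid_urls is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urlparse


def split_valid_urls(lines):
    """Separate lines that look like absolute http(s) URLs from the rest."""
    valid, rejected = [], []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        parts = urlparse(candidate)
        # Assumption: only absolute http/https URLs should be crawled.
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(candidate)
        else:
            rejected.append(candidate)
    return valid, rejected
```

Inside __init__, the valid list could be assigned to self.start_urls and the rejected list logged, so that a bad line in the file is easy to spot.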
    