Problem
My spider relies on a .txt file that contains the URLs the spider visits. I placed that file in the same directory as the spider code, and in every directory above it (a Hail Mary approach); the end result is this:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/app/__main__.egg/CCSpider1/spiders/cc_1_spider.py", line 41, in start_requests
    for line in fileinput.input({url_file}):
  File "/usr/local/lib/python2.7/fileinput.py", line 237, in next
    line = self._readline()
  File "/usr/local/lib/python2.7/fileinput.py", line 339, in _readline
    self._file = open(self._filename, self._mode)
IOError: [Errno 2] No such file or directory: 'url_list_20171028Z.txt'
Question
How do I ensure that url_list_20171028Z.txt is always found when I run my spider? This URL text file updates every day (a new one is stamped with the next day, e.g. url_list_20171029Z.txt, and so on).
Background
Thank you for taking a crack at my issue. I am new to Python (started learning in June 2017) and I am taking on this scraping project for fun and as a learning experience. I only started using Scrapy recently (October 2017), so apologies for any blatant simplicity passing over my head.
This project has been uploaded to the Scraping Hub website. This issue pops up when I try to run my spider from the Scraping Hub dashboard. The deployment of the spider was successful, and I made a requirements.txt file to download the Pandas package used in my spider.
My Code
The code below is where the URL text file is read. I reworked the default spider that is generated when a new project is started. When I run the spider on my own computer, it operates as desired. Here is the portion of code that reads the 'url_list_20171028Z.txt' file to get the URLs to scrape data from:
import fileinput
import scrapy
from time import gmtime, strftime

def start_requests(self):
    # Build today's file name, e.g. 'url_list_20171028Z.txt'
    s_time = strftime("%Y%m%d", gmtime())
    url_file = 'url_list_{0}Z.txt'.format(s_time)
    for line in fileinput.input({url_file}):  # {url_file} is a one-element set of file names
        url = str.strip(line)
        yield scrapy.Request(url=url, callback=self.parse)
Thank you very much for taking the time to help me out with this issue. If you need me to add any more information, let me know! Thank you!
Answer 1:
You need to declare the files in the package_data section of your setup.py file.
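The underlying issue is that a bare filename such as 'url_list_20171028Z.txt' is resolved against the process's current working directory, not against the spider's source file, and on Scraping Hub the code runs out of a zipped egg (note the /app/__main__.egg/ path in the traceback), so the file is never on disk next to the code. A small diagnostic sketch (not part of the fix) that shows where a relative name actually points:

import os

# A relative filename is joined to the current working directory at open() time,
# which is not necessarily the directory containing your spider module.
print(os.getcwd())
print(os.path.abspath('url_list_20171028Z.txt'))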
For example, if your Scrapy project has the following structure:
myproject/
    __init__.py
    settings.py
    resources/
        cities.txt
scrapy.cfg
setup.py
You would use the following in your setup.py to include the cities.txt file:
from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt'],
    },
    entry_points={
        'scrapy': ['settings = myproject.settings']
    },
    zip_safe=False,
)
Note that the zip_safe flag is set to False, as this may be needed in some cases.
Now you can access the cities.txt file content from settings.py like this:
import pkgutil

# get_data() reads the resource out of the package, even when the
# package is deployed as a zipped egg
data = pkgutil.get_data("myproject", "resources/cities.txt")
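Putting the two pieces together, here is a minimal sketch of how the spider's start_requests could read the day-stamped file from inside the deployed egg. It assumes the project package is named CCSpider1 (as in the traceback) and that the .txt files have been moved into a resources/ directory declared in package_data, e.g. 'CCSpider1': ['resources/*.txt']:

import pkgutil
from time import gmtime, strftime

import scrapy


class CC1Spider(scrapy.Spider):
    name = 'cc_1_spider'

    def start_requests(self):
        # Build today's file name, e.g. 'url_list_20171028Z.txt'
        s_time = strftime('%Y%m%d', gmtime())
        url_file = 'resources/url_list_{0}Z.txt'.format(s_time)
        # pkgutil.get_data reads the file out of the package, so this
        # works both locally and on Scraping Hub
        data = pkgutil.get_data('CCSpider1', url_file)
        for line in data.splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url=url, callback=self.parse)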
Source: https://stackoverflow.com/questions/46994553/url-text-file-not-found-when-deployed-to-scraping-hub-and-spider-run