Problem
My spider relies on a .txt file that contains the URLs the spider visits. I placed that file in the same directory as the spider code, and in every directory above it (a Hail Mary approach); the end result is this:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/app/__main__.egg/CCSpider1/spiders/cc_1_spider.py", line 41, in start_requests
    for line in fileinput.input({url_file}):
  File "/usr/local/lib/python2.7/fileinput.py", line 237, in next
    line = self._readline()
  File "/usr/local/lib/python2.7/fileinput.py", line 339, in _readline
    self._file = open(self._filename, self._mode)
IOError: [Errno 2] No such file or directory: 'url_list_20171028Z.txt'
Question
How do I ensure that url_list_20171028Z.txt is always found when I run my spider? This URL text file updates every day (a new one is stamped with the next day, e.g. url_list_20171029Z.txt, and so on).
Background
Thank you for taking a crack at my issue. I am new to Python (started learning in June 2017) and I am taking on this scraping project for fun and as a learning experience. I only started using Scrapy recently (October 2017), so apologies for any blatant simplicity passing over my head.
This project has been uploaded to the Scraping Hub website. This issue pops up when I try to run my spider from the Scraping Hub dashboard. The deployment of the spider was successful, and I made a requirements.txt file to download the Pandas package used in my spider.
My Code
The code below is where the URL text file is read. I reworked the default spider that is generated when a new project is started. When I run the spider on my own computer, it operates as desired. Here is the portion of code that reads the 'url_list_20171028Z.txt' file to get the URLs to scrape data from:
import fileinput
import scrapy
from time import gmtime, strftime

def start_requests(self):
    # Build today's file name, e.g. 'url_list_20171028Z.txt'
    s_time = strftime("%Y%m%d", gmtime())
    url_file = 'url_list_{0}Z.txt'.format(s_time)
    for line in fileinput.input({url_file}):  # {url_file} is a one-element set of file names
        url = str.strip(line)
        yield scrapy.Request(url=url, callback=self.parse)
Thank you very much for taking the time to help me out with this issue. If you need me to add any more information, let me know! Thank you!
Answer 1:
You need to declare the files in the package_data section of your setup.py file.
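The underlying issue is that a bare filename such as 'url_list_20171028Z.txt' is resolved against the process's current working directory, not against the spider's source file, and on Scraping Hub the code runs out of a zipped egg (note the /app/__main__.egg/ path in the traceback), so the file is never on disk next to the code. A small diagnostic sketch (not part of the fix) that shows where a relative name actually points:

import os

# A relative filename is joined to the current working directory at open() time,
# which is not necessarily the directory containing your spider module.
print(os.getcwd())
print(os.path.abspath('url_list_20171028Z.txt'))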
For example, if your Scrapy project has the following structure:
myproject/
    __init__.py
    settings.py
    resources/
        cities.txt
scrapy.cfg
setup.py
You would use the following in your setup.py to include the cities.txt file:
from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt'],
    },
    entry_points={
        'scrapy': ['settings = myproject.settings']
    },
    zip_safe=False,
)
Note that the zip_safe flag is set to False, as this may be needed in some cases.
Now you can access the cities.txt file content from settings.py like this:
import pkgutil

# get_data() reads the resource out of the package, even when the
# package is deployed as a zipped egg
data = pkgutil.get_data("myproject", "resources/cities.txt")
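Putting the two pieces together, here is a minimal sketch of how the spider's start_requests could read the day-stamped file from inside the deployed egg. It assumes the project package is named CCSpider1 (as in the traceback) and that the .txt files have been moved into a resources/ directory declared in package_data, e.g. 'CCSpider1': ['resources/*.txt']:

import pkgutil
from time import gmtime, strftime

import scrapy


class CC1Spider(scrapy.Spider):
    name = 'cc_1_spider'

    def start_requests(self):
        # Build today's file name, e.g. 'url_list_20171028Z.txt'
        s_time = strftime('%Y%m%d', gmtime())
        url_file = 'resources/url_list_{0}Z.txt'.format(s_time)
        # pkgutil.get_data reads the file out of the package, so this
        # works both locally and on Scraping Hub
        data = pkgutil.get_data('CCSpider1', url_file)
        for line in data.splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url=url, callback=self.parse)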
Source: https://stackoverflow.com/questions/46994553/url-text-file-not-found-when-deployed-to-scraping-hub-and-spider-run