I'm a little new to Python and very new to Scrapy.
I've set up a spider to crawl and extract all the information I need. However, I need to pass a .txt file of URLs to the spider to use as its start_urls.
If your urls are line separated, then these lines of code will give you the urls:

def get_urls(filename):
    # read the whole file and split on whitespace, so one url per line works
    with open(filename) as f:
        return f.read().split()
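As a quick sanity check of the function above (the file name and urls here are made up), writing a throwaway file and reading it back shows the whitespace splitting in action:

```python
import os
import tempfile

def get_urls(filename):
    # read the whole file and split on whitespace (one url per line also works)
    with open(filename) as f:
        return f.read().split()

# write a throwaway file with two made-up urls, one per line
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('http://example.com/a\nhttp://example.com/b\n')
    path = tmp.name

urls = get_urls(path)
os.unlink(path)
print(urls)  # ['http://example.com/a', 'http://example.com/b']
```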
You could simply read in the .txt file:

with open('your_file.txt') as f:
    start_urls = f.readlines()

If you end up with trailing newline characters, try:

with open('your_file.txt') as f:
    start_urls = [url.strip() for url in f.readlines()]
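To make the difference concrete (the urls are made up), readlines() keeps the trailing '\n' on each element, and strip() removes it:

```python
text = 'http://example.com/a\nhttp://example.com/b\n'

# readlines() on a file behaves like splitlines(keepends=True) on its text:
with_newlines = text.splitlines(keepends=True)
print(with_newlines)   # ['http://example.com/a\n', 'http://example.com/b\n']

cleaned = [url.strip() for url in with_newlines]
print(cleaned)         # ['http://example.com/a', 'http://example.com/b']
```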
Hope this helps
Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            with open(filename, 'r') as f:
                self.start_urls = f.readlines()
Hope that helps.
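Scrapy forwards each -a name=value pair to the spider's constructor as a keyword argument. A minimal sketch of that mechanism, using a plain class so it runs without Scrapy installed (the file name and urls are made up):

```python
import os
import tempfile

# stand-in for the spider above; no Scrapy dependency
class MySpider(object):
    name = 'myspider'

    def __init__(self, filename=None):
        self.start_urls = []
        if filename:
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f]

# write a throwaway url file, one url per line
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('http://example.com/1\nhttp://example.com/2\n')
    path = tmp.name

# `scrapy crawl myspider -a filename=...` amounts to MySpider(filename=...)
spider = MySpider(filename=path)
os.unlink(path)
print(spider.start_urls)  # ['http://example.com/1', 'http://example.com/2']
```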
import scrapy

class MySpider(scrapy.Spider):
    name = 'nameofspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # open the file passed via -a, not a hard-coded name
            with open(filename) as f:
                self.start_urls = [url.strip() for url in f.readlines()]

This will be your code. It will pick up the urls from the .txt file passed as filename, as long as they are separated by newlines, i.e. one url per line.
After this, run the command:

scrapy crawl nameofspider -a filename=filename.txt

Let's say your filename is 'file.txt'; then run:

scrapy crawl nameofspider -a filename=file.txt