Question
I have a bunch of pages to scrape, about 200,000. I usually use Tor and a Polipo proxy to hide my spider's behavior, even though it is polite; you never know. If I log in, it is useless to use one account while changing IPs. That is why I create several accounts on the website and set up my spider with arguments, like the following:
import scrapy

class ASpider(scrapy.Spider):
    name = "spider"
    start_urls = ['https://www.a_website.com/compte/login']

    def __init__(self, username=None, password=None, *args, **kwargs):
        super(ASpider, self).__init__(*args, **kwargs)
        self.username = username
        self.password = password

    def parse(self, response):
        # grab the CSRF token from the login form
        token = response.css('[name="_csrf_token"]::attr(value)').get()
        data_log = {
            '_csrf_token': token,
            '_username': self.username,
            '_password': self.password
        }
        # submit the login form; no matter the rest
        yield scrapy.FormRequest.from_response(response, formdata=data_log, callback=self.after_login)
And then I run several instances of the same spider, like:
scrapy crawl spider -a username=Bidule -a password=TMTC #cmd1
scrapy crawl spider -a username=Truc -a password=TMTC #cmd2
one command per account, since I have several accounts.
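Side note: the same spiders can also be launched from a single script with Scrapy's CrawlerProcess instead of separate shell commands. A minimal sketch, where the accounts list is a made-up placeholder:

# minimal sketch: one spider instance per account, launched from one script
# (the accounts list below is a hypothetical placeholder)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

accounts = [('Bidule', 'TMTC'), ('Truc', 'TMTC')]

process = CrawlerProcess(get_project_settings())
for user, pwd in accounts:
    process.crawl(ASpider, username=user, password=pwd)
process.start()  # blocks until all crawls finish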
I managed to check the IP with the following code at the end of spider.py:
yield scrapy.Request('http://checkip.dyndns.org/', meta={'item': item_cheval}, callback=self.checkip)

def checkip(self, response):
    # extract the first IPv4 address from the page body
    print('IP: {}'.format(response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]))
It returns the same IP for both launched commands, so my proxy does not manage to give a different IP to each spider.
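That is expected: both commands point at the same Polipo instance on 127.0.0.1:8123, which funnels everything through the same Tor client, so at any given time they share one exit IP. A common workaround is to run one Tor+Polipo pair per spider, each listening on its own port, and pass the proxy URL as a spider argument. A minimal sketch; the proxy_url argument, the middleware name, and the second port are made up for the example:

# hedged sketch: each spider gets its own proxy via a hypothetical
# 'proxy_url' argument, e.g.
#   scrapy crawl spider -a username=Bidule -a password=TMTC -a proxy_url=http://127.0.0.1:8123
#   scrapy crawl spider -a username=Truc -a password=TMTC -a proxy_url=http://127.0.0.1:8124
class PerSpiderProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = getattr(spider, 'proxy_url', None)
        if proxy:
            request.meta['proxy'] = proxy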
Someone told me about bindaddress, but I have no idea how it works or whether it really gives me what I expect.
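For reference, bindaddress is a Scrapy Request.meta key that binds the outgoing connection to one of the machine's own local IP addresses; it only helps when the host actually has several outgoing IPs, and it does not change the Tor exit IP. A minimal sketch, assuming the (host, port) tuple form that Twisted's bindAddress expects; the address and the middleware name are made up:

# hedged sketch: bind outgoing sockets to a local IP the machine owns
# (192.0.2.10 is a made-up example address)
class BindAddressMiddleware(object):
    def process_request(self, request, spider):
        request.meta['bindaddress'] = ('192.0.2.10', 0)  # (local host, port)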
Note: I use this in my middlewares.py:
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # read HTTP_PROXY from the project settings via the spider
        request.meta['proxy'] = spider.settings.get('HTTP_PROXY')
and this in my settings.py:
# proxy for polipo
HTTP_PROXY = 'http://127.0.0.1:8123'
...
DOWNLOADER_MIDDLEWARES = {
    'folder.middlewares.RandomUserAgentMiddleware': 400,
    'folder.middlewares.ProxyMiddleware': 410,  # here for the proxy
    # disable the built-in user-agent middleware (modern path; 'scrapy.contrib' is the pre-1.0 location)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
These are patterns I copied into my code and they work, but I have not mastered this skill.
Scrapy version: 1.5.0, Python version: 2.7.9, Tor version: 0.3.4.8, Vidalia: 0.2.21
Answer 1:
If you have a proxy list, you can use 'scrapy_proxies.RandomProxy' in DOWNLOADER_MIDDLEWARES to choose a random proxy from the list for every new page.
In the settings of your spider:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = 'path/proxylist.txt'
PROXY_MODE = 0  # 0 = pick a different random proxy for every request
With this method there is nothing to add to the spider script itself.
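The proxy list is a plain text file with one proxy URL per line; scrapy_proxies also accepts credentials embedded in the URL. A hypothetical example file (all hosts and credentials are made up):

# path/proxylist.txt (made-up entries)
http://192.0.2.1:8080
http://192.0.2.2:3128
http://user:password@192.0.2.3:8080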
Source: https://stackoverflow.com/questions/54635927/how-to-set-different-ip-according-to-different-commands-of-one-single-scrapy-spi