问题
I have created a custom Middleware in Scrapy by overriding the RetryMiddleware which changes both Proxy and User-Agent before retrying. It looks like this
class CustomRetryMiddleware(RetryMiddleware):
def _retry(self, request, reason, spider):
retries = request.meta.get('retry_times', 0) + 1
if retries <= self.max_retry_times:
Proxy_UA_Middleware.switch_proxy()
Proxy_UA_Middleware.switch_ua()
logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
{'request': request, 'retries': retries, 'reason': reason},
extra={'spider': spider})
retryreq = request.copy()
retryreq.meta['retry_times'] = retries
retryreq.dont_filter = True
retryreq.priority = request.priority + self.priority_adjust
return retryreq
else:
logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
{'request': request, 'retries': retries, 'reason': reason},
extra={'spider': spider})
The Proxy_UA_Middlware class is quite long. Basically it contains methods that change proxy and user agent. I have both these middlewares configured properly in my settings.py file. The proxy part works okay but the User Agent doesn't change. The code I've used to changed User Agent looks like this
request.headers.setdefault('User-Agent', self.user_agent)
where self.user_agent is a random value taken from an array of user agents. This doesn't work. However, if I do this
request.headers['User-Agent'] = self.user_agent
then it works just fine and the user agent changes successfully for each retry. But I haven't seen anyone use this method to change the User Agent. My question is if changing the User Agent this way is okay and if not what am I doing wrong?
回答1:
If you always want to control which user-agent to use on that middleware, then it is ok, what setdefault
does is to check if there is no User-Agent
assigned before, which is possible because other middlewares could be doing it, or even assigning it from the spider.
Also I think you should also disable the default UserAgentMiddleware
or even set a higher priority to your middleware, check that UserAgentMiddleware
priority is 400, so set yours to be before (some number before 400).
回答2:
First, you are overriding a function with _
(an underscore) in the front which should be a "private" function in Python. The function might change in the different version of Scrapy and your overriding will hinder the upgrade/downgrade. It's risky for you to override it. It's better to change the user agent in another function wrapping _retry
.
I've made a function for that but using Scrapy fake user agent module. I found two functions calling _retry
. So, retry happens on exception and on retry statuses. We need to change the user agent on both functions in the request before it is retried. So this is the code:
from scrapy.downloadermiddlewares.retry import *
from scrapy.spidermiddlewares.httperror import *
from fake_useragent import UserAgent
class Retry500Middleware(RetryMiddleware):
def __init__(self, settings):
super(Retry500Middleware, self).__init__(settings)
fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
self.ua = UserAgent(fallback=fallback)
self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')
def get_ua(self):
'''Gets random UA based on the type setting (random, firefox…)'''
return getattr(self.ua, self.ua_type)
def process_response(self, request, response, spider):
if request.meta.get('dont_retry', False):
return response
if response.status in self.retry_http_codes:
reason = response_status_message(response.status)
request.headers['User-Agent'] = self.get_ua()
return self._retry(request, reason, spider) or response
return response
def process_exception(self, request, exception, spider):
if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
and not request.meta.get('dont_retry', False):
request.headers['User-Agent'] = self.get_ua()
return self._retry(request, exception, spider)
Don't forget to enable the middleware via settings.py
and disable the standard retry and user agent middleware.
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
'my_project.middlewares.Retry500Middleware': 401,
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
FAKEUSERAGENT_FALLBACK = "<your favorite user agent>"
来源:https://stackoverflow.com/questions/34400307/scrapy-correct-way-to-change-user-agent-in-request