Scrapy - Correct way to change User Agent in Request

Question

I have created a custom middleware in Scrapy by subclassing RetryMiddleware so that it changes both the proxy and the User-Agent before retrying. It looks like this:

class CustomRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            Proxy_UA_Middleware.switch_proxy()
            Proxy_UA_Middleware.switch_ua()
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})

The Proxy_UA_Middleware class is quite long. Basically, it contains methods that change the proxy and the user agent. Both middlewares are configured properly in my settings.py file. The proxy part works, but the User-Agent doesn't change. The code I use to change the User-Agent looks like this:

request.headers.setdefault('User-Agent', self.user_agent)

where self.user_agent is a random value taken from an array of user agents. This doesn't work. However, if I do this

request.headers['User-Agent'] = self.user_agent

then it works just fine and the user agent changes successfully on each retry. But I haven't seen anyone use this method to change the User-Agent. Is it okay to change the User-Agent this way, and if not, what am I doing wrong?

Answer 1:

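If you always want that middleware to control which User-Agent is used, then direct assignment is fine. setdefault only sets the header when no User-Agent has been assigned yet, and one may already be present because another middleware, or the spider itself, set it earlier. A retried request is a copy of the original, so it typically already carries the User-Agent from the first attempt and setdefault leaves it unchanged.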
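
The difference between the two calls can be seen with a plain dict standing in for request.headers. This is a minimal sketch rather than Scrapy code: the header values are made up, and it assumes Scrapy's Headers object follows the usual dict setdefault contract.

```python
# A plain dict stands in for request.headers (assumption for illustration).
headers = {"User-Agent": "first-attempt-agent"}  # set during the first attempt

# setdefault only writes when the key is missing, so the old value survives.
headers.setdefault("User-Agent", "retry-agent")
after_setdefault = headers["User-Agent"]

# Direct assignment always overwrites the existing value.
headers["User-Agent"] = "retry-agent"
after_assignment = headers["User-Agent"]

print(after_setdefault)   # first-attempt-agent
print(after_assignment)   # retry-agent
```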

I also think you should disable the default UserAgentMiddleware, or at least give your middleware a lower order number so that it runs first. UserAgentMiddleware's order is 400, so set yours to some number below 400. (The default may differ between Scrapy versions; check DOWNLOADER_MIDDLEWARES_BASE for yours.)
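
As a sketch of that configuration in settings.py, assuming the custom middleware lives at the hypothetical path my_project.middlewares.Proxy_UA_Middleware and that 390 is a suitable order for it:

```python
# settings.py (sketch; the module path and the order 390 are assumptions)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in user agent middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical path to the custom middleware, ordered before 400.
    "my_project.middlewares.Proxy_UA_Middleware": 390,
}
```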

Answer 2:

First, you are overriding a method whose name starts with an underscore (_retry), which by Python convention marks it as "private". Such a method may change between Scrapy versions, and your override can then get in the way of upgrading or downgrading, so overriding it is risky. It's better to change the user agent in a method that wraps _retry.

I've written a middleware that does this, using the fake_useragent module. Two methods call _retry: process_exception, which retries on exceptions, and process_response, which retries on the configured HTTP status codes. The user agent has to be changed on the request in both methods before it is retried. Here is the code:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

from fake_useragent import UserAgent

class Retry500Middleware(RetryMiddleware):

    def __init__(self, settings):
        super(Retry500Middleware, self).__init__(settings)

        fallback = settings.get('FAKEUSERAGENT_FALLBACK', None)
        self.ua = UserAgent(fallback=fallback)
        self.ua_type = settings.get('RANDOM_UA_TYPE', 'random')

    def get_ua(self):
        '''Gets random UA based on the type setting (random, firefox…)'''
        return getattr(self.ua, self.ua_type)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, reason, spider) or response
        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            request.headers['User-Agent'] = self.get_ua()
            return self._retry(request, exception, spider)
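
If you would rather not depend on fake_useragent, the same idea works with a hand-maintained list, as the question describes. The following is a minimal sketch under that assumption: USER_AGENTS, pick_user_agent and apply_user_agent are hypothetical names, the agent strings are placeholders, and a plain dict stands in for request.headers.

```python
import random

# Placeholder values; replace with real user agent strings.
USER_AGENTS = [
    "example-agent/1.0",
    "example-agent/2.0",
    "example-agent/3.0",
]


def pick_user_agent(current=None, rng=random):
    """Return a random agent, avoiding the current one when possible."""
    candidates = [ua for ua in USER_AGENTS if ua != current] or USER_AGENTS
    return rng.choice(candidates)


def apply_user_agent(headers, rng=random):
    """Overwrite the User-Agent entry in a dict-like headers mapping."""
    headers["User-Agent"] = pick_user_agent(headers.get("User-Agent"), rng)
    return headers


headers = {"User-Agent": "example-agent/1.0"}
apply_user_agent(headers)
print(headers["User-Agent"])  # one of the other two placeholder agents
```

In a middleware, pick_user_agent() could take the place of self.get_ua(). Note that Scrapy stores header values as bytes, so comparing against the current value of a real request would need a decode step first.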

Don't forget to enable the middleware in settings.py and to disable the standard retry and user agent middlewares.

DOWNLOADER_MIDDLEWARES = {
  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
  'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
  'my_project.middlewares.Retry500Middleware': 401,
  'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

FAKEUSERAGENT_FALLBACK = "<your favorite user agent>"

Source: https://stackoverflow.com/questions/34400307/scrapy-correct-way-to-change-user-agent-in-request

Tags

python

scrapy

screen-scraping

user-agent