How to write a DownloadHandler for scrapy that makes requests through socksipy?

天命终不由人 · 2021-02-02 04:14

I'm trying to use Scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for Scrapy that uses socksipy connections.

Scrapy's HTTP11DownloadHandler…

2 Answers
  •  死守一世寂寞 · 2021-02-02 04:46

    Try txsocksx, which comes from one of the Twisted developers.

    Thanks to the flexibility of Twisted and Scrapy, you can easily use SOCKS as a proxy:


    In downloader.py:

    import scrapy.core.downloader.handlers.http11 as handler
    from twisted.internet import reactor
    from txsocksx.http import SOCKS5Agent
    from twisted.internet.endpoints import TCP4ClientEndpoint
    from scrapy.core.downloader.webclient import _parse
    
    
    class TorScrapyAgent(handler.ScrapyAgent):
        _Agent = SOCKS5Agent

        def _get_agent(self, request, timeout):
            proxy = request.meta.get('proxy')

            if proxy:
                proxy_scheme, _, proxy_host, proxy_port, _ = _parse(proxy)

                # Route socks5:// proxies through a SOCKS5Agent; any other
                # scheme falls through to the default HTTP(S) agent below.
                if proxy_scheme == 'socks5':
                    endpoint = TCP4ClientEndpoint(reactor, proxy_host, proxy_port)

                    return self._Agent(reactor, proxyEndpoint=endpoint)

            return super(TorScrapyAgent, self)._get_agent(request, timeout)
    
    
    class TorHTTPDownloadHandler(handler.HTTP11DownloadHandler):
        def download_request(self, request, spider):
            agent = TorScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
                                   maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
                                   warnsize=getattr(spider, 'download_warnsize', self._default_warnsize))
    
            return agent.download_request(request)
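    Note that `scrapy.core.downloader.webclient._parse` is a private helper, so it may move or change between Scrapy versions. The scheme/host/port split it performs for the code above can be sketched with the standard library instead — `endpoint_parts` below is a hypothetical stand-in, not Scrapy API:

    ```python
    from urllib.parse import urlparse

    def endpoint_parts(proxy_url):
        """Split a proxy URL into (scheme, host, port) for building an endpoint.

        A minimal stand-in for the parsing that Scrapy's private
        webclient._parse helper does; it only returns the pieces
        TorScrapyAgent actually uses.
        """
        parsed = urlparse(proxy_url)
        # Default to Tor's standard SOCKS port when none is given.
        port = parsed.port if parsed.port is not None else 9050
        return parsed.scheme, parsed.hostname, port
    ```

    For example, `endpoint_parts('socks5://127.0.0.1:9050')` yields `('socks5', '127.0.0.1', 9050)`.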
    

    Register the new handler in settings.py:

    DOWNLOAD_HANDLERS = {
        'http': 'crawler.downloader.TorHTTPDownloadHandler',
        'https': 'crawler.downloader.TorHTTPDownloadHandler'
    }
    

    Now, you only have to tell the crawlers to use the proxy. I recommend doing it via a middleware:

    class ProxyDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # Tor listens for SOCKS connections on port 9050 by default.
            request.meta['proxy'] = 'socks5://127.0.0.1:9050'
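    For the middleware to actually run, it must also be enabled in settings.py. A minimal sketch, assuming the class lives in a `crawler.middlewares` module (the module path and the priority value here are assumptions, adjust them to your project layout):

    ```python
    # settings.py — enable the proxy middleware; lower numbers run earlier.
    DOWNLOADER_MIDDLEWARES = {
        'crawler.middlewares.ProxyDownloaderMiddleware': 100,
    }
    ```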
    
