How to write a DownloadHandler for scrapy that makes requests through socksipy?

Asked 2021-02-02 04:14 by 天命终不由人 · 2 answers · 825 views

I'm trying to use scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for scrapy that uses socksipy connections.

Scrapy's HTTP11DownloadHandler …

2 Answers
  • 2021-02-02 04:39

    I was able to make this work with https://github.com/habnabit/txsocksx.

    After doing a pip install txsocksx, I needed to replace scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.

    I simply copied the code for HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http.py, subclassed them and wrote this code:

    # Imports needed by the snippet below (the handler and agent classes
    # live in scrapy.core.downloader.handlers.http11 in recent Scrapy versions):
    from twisted.internet import reactor
    from twisted.internet.endpoints import TCP4ClientEndpoint
    from txsocksx.http import SOCKS5Agent

    from scrapy.core.downloader.handlers.http11 import (
        HTTP11DownloadHandler, ScrapyAgent)
    from scrapy.core.downloader.webclient import _parse


    class TorProxyDownloadHandler(HTTP11DownloadHandler):

        def download_request(self, request, spider):
            """Return a deferred for the HTTP download."""
            agent = ScrapyTorAgent(contextFactory=self._contextFactory,
                                   pool=self._pool)
            return agent.download_request(request)


    class ScrapyTorAgent(ScrapyAgent):

        def _get_agent(self, request, timeout):
            bindaddress = request.meta.get('bindaddress') or self._bindAddress
            proxy = request.meta.get('proxy')
            if proxy:
                _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
                scheme = _parse(request.url)[0]
                omitConnectTunnel = proxyParams.find('noconnect') >= 0
                if scheme == 'https' and not omitConnectTunnel:
                    # HTTPS still goes through a CONNECT tunnel, as in stock Scrapy
                    proxyConf = (proxyHost, proxyPort,
                                 request.headers.get('Proxy-Authorization', None))
                    return self._TunnelingAgent(reactor, proxyConf,
                        contextFactory=self._contextFactory, connectTimeout=timeout,
                        bindAddress=bindaddress, pool=self._pool)
                else:
                    # Plain HTTP: let txsocksx's SOCKS5Agent speak SOCKS5
                    # to the proxy endpoint (e.g. a local Tor instance)
                    proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                        timeout=timeout, bindAddress=bindaddress)
                    return SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)

            return self._Agent(reactor, contextFactory=self._contextFactory,
                connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
    

    In settings.py, something like this is needed:

    DOWNLOAD_HANDLERS = {
        'http': 'crawler.http.TorProxyDownloadHandler'
    }
    

    Now Scrapy requests will be proxied through a SOCKS proxy such as Tor.
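    Note that the settings above register the handler for plain http only. Since ScrapyTorAgent also covers the https tunnel case, the same class can be registered for both schemes (a sketch; `crawler.http` is a placeholder for whatever module actually contains TorProxyDownloadHandler):

```python
# settings.py -- 'crawler.http' is a placeholder module path
DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler',
    'https': 'crawler.http.TorProxyDownloadHandler',
}
```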

  • 2021-02-02 04:46

    Try txsocksx, which comes from one of the Twisted developers.

    Thanks to the greatness of Twisted and Scrapy, you can easily use a SOCKS proxy:


    In downloader.py:

    import scrapy.core.downloader.handlers.http11 as handler
    from twisted.internet import reactor
    from txsocksx.http import SOCKS5Agent
    from twisted.internet.endpoints import TCP4ClientEndpoint
    from scrapy.core.downloader.webclient import _parse
    
    
    class TorScrapyAgent(handler.ScrapyAgent):

        def _get_agent(self, request, timeout):
            proxy = request.meta.get('proxy')

            if proxy:
                proxy_scheme, _, proxy_host, proxy_port, _ = _parse(proxy)

                # Route socks5:// proxies through txsocksx; everything else
                # falls back to Scrapy's stock agent below. (Overriding the
                # _Agent class attribute with SOCKS5Agent would break that
                # fallback, since SOCKS5Agent takes different arguments.)
                if proxy_scheme == 'socks5':
                    endpoint = TCP4ClientEndpoint(reactor, proxy_host, proxy_port)

                    return SOCKS5Agent(reactor, proxyEndpoint=endpoint)
    
            return super(TorScrapyAgent, self)._get_agent(request, timeout)
    
    
    class TorHTTPDownloadHandler(handler.HTTP11DownloadHandler):
        def download_request(self, request, spider):
            agent = TorScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
                                   maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
                                   warnsize=getattr(spider, 'download_warnsize', self._default_warnsize))
    
            return agent.download_request(request)
    

    Register the new handler in settings.py:

    DOWNLOAD_HANDLERS = {
        'http': 'crawler.downloader.TorHTTPDownloadHandler',
        'https': 'crawler.downloader.TorHTTPDownloadHandler'
    }
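    The scheme dispatch inside `_get_agent` can be illustrated with just the standard library (a sketch; `pick_agent` is a hypothetical stand-in for the real method, and `urlparse` stands in for Scrapy's internal `_parse` helper):

```python
from urllib.parse import urlparse

def pick_agent(proxy_url):
    """Mimic TorScrapyAgent._get_agent's dispatch: socks5:// proxies take
    the SOCKS5Agent path, anything else falls back to the default agent."""
    parts = urlparse(proxy_url)
    if parts.scheme == 'socks5':
        return ('socks5', parts.hostname, parts.port)
    return ('default', None, None)

print(pick_agent('socks5://127.0.0.1:9050'))  # ('socks5', '127.0.0.1', 9050)
print(pick_agent('http://127.0.0.1:8080'))    # ('default', None, None)
```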
    

    Now, you only have to tell crawlers to use the proxy. I recommend doing it via a downloader middleware:

    class ProxyDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # 9050 is Tor's default SOCKS port
            request.meta['proxy'] = 'socks5://127.0.0.1:9050'
    
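    For the middleware to take effect, it also has to be enabled in settings.py (a sketch; `crawler.middlewares` is a placeholder for wherever ProxyDownloaderMiddleware is defined, and 100 is an arbitrary priority):

```python
# settings.py -- module path and priority value are placeholders
DOWNLOADER_MIDDLEWARES = {
    'crawler.middlewares.ProxyDownloaderMiddleware': 100,
}
```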