How to write a DownloadHandler for scrapy that makes requests through socksipy?

天命终不由人 2021-02-02 04:14

I'm trying to use scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for scrapy that uses socksipy connections.

Scrapy's HTTP11Dow

2 Answers

感情败类 (original poster)

2021-02-02 04:39

I was able to make this work with https://github.com/habnabit/txsocksx.

After doing a pip install txsocksx, I needed to replace scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.

I simply copied the code for HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http.py, subclassed them and wrote this code:

from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent

# These import paths match the Scrapy layout I copied from; they may differ in other versions.
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse


class TorProxyDownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        # Same as the stock handler, but using the SOCKS-aware agent defined below.
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapyTorAgent(ScrapyAgent):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                # https through a proxy: unchanged from ScrapyAgent (CONNECT tunnel).
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                # Everything else with a proxy set: connect to it as a SOCKS5 server.
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        # No proxy in request.meta: plain direct agent, as in ScrapyAgent.
        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
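
To make the branching in _get_agent easier to follow, here is a rough stdlib-only sketch of which route a request takes. It is only my reading of the code above, not part of Scrapy or txsocksx: choose_route and its return labels are made-up names, and the 'noconnect' check is an approximation of what _parse returns as its last element.

```python
from urllib.parse import urlparse


def choose_route(url, proxy=None):
    """Illustrative mirror of the branches in _get_agent.

    Returns one of 'direct', 'connect-tunnel' or 'socks5'.
    """
    if not proxy:
        # No request.meta['proxy']: the plain agent is used.
        return "direct"

    parsed_proxy = urlparse(proxy)
    # Approximation: look for 'noconnect' anywhere after the host:port part.
    rest = parsed_proxy.path + ";" + parsed_proxy.params + "?" + parsed_proxy.query
    omit_connect_tunnel = "noconnect" in rest

    scheme = urlparse(url).scheme
    if scheme == "https" and not omit_connect_tunnel:
        # This branch still uses the tunneling agent rather than SOCKS5Agent.
        return "connect-tunnel"
    return "socks5"


if __name__ == "__main__":
    for target in ("http://example.com/", "https://example.com/"):
        print(target, "->", choose_route(target, "http://127.0.0.1:9050"))
```

One thing this makes visible: as far as I can tell, an https request with a proxy set still goes down the tunneling branch unless the proxy URL carries 'noconnect', so only the other branch actually reaches SOCKS5Agent.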

In settings.py, something like this is needed:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}

Now proxying with Scrapy will work through a SOCKS proxy like Tor.
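
The handler only switches to SOCKS when request.meta['proxy'] is set, so something has to set it. A minimal sketch of one way to do that is a small downloader middleware. The class name TorProxyMiddleware is hypothetical, and the address assumes Tor's SOCKS listener is on its usual default of 127.0.0.1:9050. The http:// prefix is only there so the URL parses; the code above uses just the host and port.

```python
from types import SimpleNamespace

# Assumption: Tor is listening for SOCKS connections on 127.0.0.1:9050.
TOR_PROXY = "http://127.0.0.1:9050"


class TorProxyMiddleware:
    """Hypothetical downloader middleware that tags requests with the proxy."""

    def process_request(self, request, spider):
        # setdefault keeps any proxy a spider already chose for this request.
        request.meta.setdefault("proxy", TOR_PROXY)
        return None  # None lets Scrapy continue handling the request


if __name__ == "__main__":
    # Stand-in for a scrapy Request, just to show the effect without Scrapy installed.
    fake_request = SimpleNamespace(meta={})
    TorProxyMiddleware().process_request(fake_request, spider=None)
    print(fake_request.meta["proxy"])
```

This would still need to be enabled through DOWNLOADER_MIDDLEWARES in settings.py, under whatever module path your project uses. Setting meta={'proxy': ...} directly on each Request in the spider should work just as well.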
