I'm trying to use Scrapy over Tor. I've been trying to get my head around how to write a DownloadHandler for Scrapy that uses socksipy connections.
Scrapy's HTTP11Dow
Try txsocksx, which comes from one of the Twisted developers.
Thanks to the greatness of Twisted and Scrapy, you can easily use SOCKS as a proxy:
In downloader.py:
import scrapy.core.downloader.handlers.http11 as handler
from twisted.internet import reactor
from txsocksx.http import SOCKS5Agent
from twisted.internet.endpoints import TCP4ClientEndpoint
from scrapy.core.downloader.webclient import _parse


class TorScrapyAgent(handler.ScrapyAgent):
    _Agent = SOCKS5Agent

    def _get_agent(self, request, timeout):
        proxy = request.meta.get('proxy')
        if proxy:
            proxy_scheme, _, proxy_host, proxy_port, _ = _parse(proxy)
            if proxy_scheme == 'socks5':
                # Connect to the SOCKS5 proxy and tunnel the request through it.
                endpoint = TCP4ClientEndpoint(reactor, proxy_host, proxy_port)
                return self._Agent(reactor, proxyEndpoint=endpoint)
        # No proxy, or a non-SOCKS5 proxy: fall back to Scrapy's default agent.
        return super(TorScrapyAgent, self)._get_agent(request, timeout)


class TorHTTPDownloadHandler(handler.HTTP11DownloadHandler):
    def download_request(self, request, spider):
        agent = TorScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
                               maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
                               warnsize=getattr(spider, 'download_warnsize', self._default_warnsize))
        return agent.download_request(request)
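To make the branch on the proxy scheme easier to follow, here is a rough standard-library sketch of the three pieces the agent needs from the proxy URL: scheme, host, and port. The helper name split_proxy_url is hypothetical and is not part of Scrapy; it only illustrates the idea and does not reproduce what Scrapy's private _parse helper does internally.

```python
from urllib.parse import urlsplit


def split_proxy_url(proxy_url):
    """Return (scheme, host, port) for a proxy URL.

    Hypothetical helper for illustration only; not part of Scrapy.
    """
    parts = urlsplit(proxy_url)
    return parts.scheme, parts.hostname, parts.port


# Example URL; 9050 is assumed here because it is Tor's default SOCKS port.
scheme, host, port = split_proxy_url('socks5://127.0.0.1:9050')
print(scheme, host, port)  # prints: socks5 127.0.0.1 9050
```

Only a URL whose scheme is socks5 takes the SOCKS5Agent path in the handler above; anything else goes through Scrapy's regular agent.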
Register the new handler in settings.py:
DOWNLOAD_HANDLERS = {
    'http': 'crawler.downloader.TorHTTPDownloadHandler',
    'https': 'crawler.downloader.TorHTTPDownloadHandler',
}
Now you only have to tell the crawlers to use the proxy. I recommend doing it via a middleware:
class ProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # 9050 is Tor's default SOCKS port; adjust it if your Tor listens elsewhere.
        request.meta['proxy'] = 'socks5://127.0.0.1:9050'
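The middleware also has to be enabled in settings.py before Scrapy will call it. A minimal sketch of one way to do that follows, with a variant of the middleware that leaves alone any request that already carries its own proxy. The class name DefaultSocksProxyMiddleware, the module path crawler.middlewares, and the priority value 100 are all assumptions made for illustration, and the proxy address assumes a local Tor listening on its default SOCKS port.

```python
class DefaultSocksProxyMiddleware:
    """Assign a SOCKS5 proxy only to requests that do not already have one."""

    # Assumed address of a local Tor SOCKS listener (9050 is Tor's default).
    proxy_url = 'socks5://127.0.0.1:9050'

    def process_request(self, request, spider):
        # setdefault keeps any proxy that a spider set explicitly.
        request.meta.setdefault('proxy', self.proxy_url)
        return None  # returning None lets Scrapy continue with the request


# Hypothetical registration in settings.py; module path and priority are
# illustrative values, not taken from the original answer.
DOWNLOADER_MIDDLEWARES = {
    'crawler.middlewares.DefaultSocksProxyMiddleware': 100,
}
```

With this variant, a spider can still route individual requests elsewhere by setting request.meta['proxy'] itself.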