I am attempting to scrape data from a website that requires authentication.
I have been able to log in successfully using requests and HttpNtlmAuth with the following:
I was able to figure out what was going on.
1: This is a downloader middleware, not a spider middleware, so it has to be registered under DOWNLOADER_MIDDLEWARES rather than SPIDER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }
2: The middleware which I was trying to use needed to be modified significantly. Here is what works for me:
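from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth


class NTLM_Middleware(object):
    def process_request(self, request, spider):
        url = request.url
        # Credentials are read from attributes on the spider; default to ''.
        pwd = getattr(spider, 'http_pass', '')
        usr = getattr(spider, 'http_user', '')
        # Fetch the page with requests, letting HttpNtlmAuth handle the NTLM handshake.
        s = requests.Session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # Returning a Response here makes Scrapy skip its own download of this request.
        return Response(url, status=response.status_code, headers={}, body=response.content)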
Within the spider, all you need to do is set these variables:
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
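To make the lookup concrete, here is a minimal sketch of how the middleware's getattr calls pick up those attributes. ExampleSpider, AnonymousSpider and read_credentials are hypothetical names used only for illustration, and plain classes stand in for scrapy.Spider subclasses so the snippet runs without Scrapy installed.

```python
class ExampleSpider:
    """Hypothetical stand-in for a scrapy.Spider subclass."""
    name = "example"
    http_user = "DOMAIN\\USER"  # escaped backslash: one literal backslash
    http_pass = "PASS"


class AnonymousSpider:
    """Hypothetical spider that defines no credentials."""
    name = "anonymous"


def read_credentials(spider):
    # Same lookup as in process_request: fall back to '' when unset.
    usr = getattr(spider, "http_user", "")
    pwd = getattr(spider, "http_pass", "")
    return usr, pwd


print(read_credentials(ExampleSpider()))
print(read_credentials(AnonymousSpider()))
```

A spider that leaves the attributes out still works with the middleware, but the request is then sent with empty credentials, so the server will most likely reject it.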