Question
This question is an extension of the resolved question Crawling LinkedIn while authenticated with Scrapy, by @Gates.
I keep the base of the script the same, only adding my own session_key and session_password, and changing the start URL for my use-case, as below:
class LinkedPySpider(InitSpider):
    name = 'Linkedin'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/nhome/"]
    # Also tried with this start URL:
    # start_urls = ["http://www.linkedin.com/profile/view?id=38210724&trk=nav_responsive_tab_profile"]
I also tried changing start_urls to the second URL (commented above) to see if I could start scraping from my own profile page, but was unable to do so.
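For context, the rest of the base script, which I kept unchanged from the linked answer, follows the InitSpider authentication pattern - roughly as sketched below (a sketch from memory: credentials are placeholders and the imports assume Scrapy 0.14-era module paths):

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LinkedPySpider(InitSpider):
    name = 'Linkedin'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/nhome/"]

    def init_request(self):
        # Begin with the login page instead of start_urls.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit LinkedIn's login form with the session credentials.
        return FormRequest.from_response(response,
            formdata={'session_key': 'user@example.com',
                      'session_password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        # Heuristic success check; on success hand control back to
        # InitSpider, which then schedules start_urls.
        if 'Sign Out' in response.body:
            self.log('Successfully logged in.')
            return self.initialized()
        else:
            self.log('Login failed.')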
**Error that I get:**

scrapy crawl Linkedin
2013-07-29 11:37:10+0530 [Linkedin] DEBUG: Retrying <GET http://www.linkedin.com/nhome/> (failed 1 times): DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname.
**To check whether the name resolved, I tried:**

nslookup www.linkedin.com           # works
nslookup www.linkedin.com/uas/login # fails - but nslookup resolves hostnames, not URL paths, so that is normal, right?
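The same distinction can be seen from Python's standard library - a minimal illustration, nothing Scrapy-specific:

import socket

# Resolving a bare hostname works and returns an IP address.
print socket.gethostbyname('www.linkedin.com')

# Including a URL path makes DNS treat the whole string as a hostname,
# so this raises socket.gaierror - the same class of failure nslookup shows.
try:
    socket.gethostbyname('www.linkedin.com/uas/login')
except socket.gaierror as e:
    print 'lookup failed:', e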
Then I checked whether the error could be due to the name server not resolving, and appended name servers as below.
echo $http_proxy   # gives http://username:password@your.proxy.com:80
sudo vi /etc/resolv.conf

and appended the IP addresses of free, fast DNS name servers to this file:

nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 202.51.5.52
I am not well versed in name-server conflicts and DNS lookup failures, but could this be because I am in a VM? Other scraping projects seemed to work just fine.
My base use-case is to extract connections and the list of companies they worked at, plus a bunch of other attributes. So I want to crawl/paginate from "Connections" (All) on the main profile page, which does NOT show up if I use a public profile in start_urls, i.e. scrapy shell http://www.linkedin.com/in/ektagrover. Passing a legitimate XPath via hxs.select works there, but NOT when used with the spider, since the public profile does not meet my base use-case (as below).
Question: Is there something wrong with my start_urls, or am I wrong in assuming that after authentication the start page could be potentially ANY page on the site, once I am redirected post-authentication from https://www.linkedin.com/uas/login?
Work environment: Oracle VM VirtualBox running Ubuntu 12.04 LTS, with Python 2.7.3 and Scrapy 0.14.4.
What worked / Answer: my proxy server was incorrectly set. echo $http_proxy gave http://username:password@your.proxy.com:80, so I unset the environment variable by running http_proxy= and confirmed with echo $http_proxy, which then printed nothing. After that, scrapy crawl Linkedin worked through the authentication module. I am still getting stuck here and there on Selenium, but that's for another question. Thank you, @warwaruk.
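If you prefer clearing the variable from the crawl script itself rather than the shell, a minimal sketch: Scrapy's HttpProxyMiddleware picks the proxy up from the process environment, so removing it there has the same effect; os.environ.pop is plain standard library, nothing Scrapy-specific.

import os

# Drop the stale proxy setting for this process and its children;
# the second argument avoids a KeyError if the variable is absent.
os.environ.pop('http_proxy', None)
assert os.environ.get('http_proxy') is None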
Answer 1:
> **Error that I get:**
>
> scrapy crawl Linkedin
> 2013-07-29 11:37:10+0530 [Linkedin] DEBUG: Retrying <GET http://www.linkedin.com/nhome/> (failed 1 times): DNS lookup failed: address 'your.proxy.com' not found: [Errno -5] No address associated with hostname.
>
> **To check whether the name resolved, I tried:**
>
> nslookup www.linkedin.com           # works
> nslookup www.linkedin.com/uas/login # fails - nslookup resolves hostnames, not URL paths
>
> Then I checked whether the error could be due to the name server not resolving, and appended name servers as below.
>
> echo $http_proxy   # gives http://username:password@your.proxy.com:80
You have a proxy set: http://username:password@your.proxy.com:80.
Obviously, it doesn't exist on the Internet:
$ nslookup your.proxy.com
Server: 127.0.1.1
Address: 127.0.1.1#53
** server can't find your.proxy.com: NXDOMAIN
Either unset the environment variable $http_proxy, or set up a working proxy and change the environment variable accordingly.
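For the second option, a sketch of pointing Scrapy at an explicit, reachable proxy per request - HttpProxyMiddleware honors request.meta['proxy']; the proxy URL and spider name below are placeholders:

from scrapy.spider import BaseSpider
from scrapy.http import Request

class ProxiedSpider(BaseSpider):
    name = 'proxied'
    start_urls = ['http://www.linkedin.com/nhome/']

    def start_requests(self):
        for url in self.start_urls:
            # Route each request through an explicit proxy instead of
            # relying on the http_proxy environment variable.
            yield Request(url, meta={'proxy': 'http://user:password@real.proxy.example:80'})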
Source: https://stackoverflow.com/questions/17917292/dns-lookup-failed-address-your-proxy-com-not-found-errno-5-no-address-ass