Question
Here is a related question, but I could not figure out how to apply its answer to mechanize/urllib2: how to force python httplib library to use only A requests
Basically, given this simple code:
#!/usr/bin/python
import urllib2
print urllib2.urlopen('http://python.org/').read(100)
This results in wireshark saying the following:
0.000000 10.102.0.79 -> 8.8.8.8 DNS Standard query A python.org
0.000023 10.102.0.79 -> 8.8.8.8 DNS Standard query AAAA python.org
0.005369 8.8.8.8 -> 10.102.0.79 DNS Standard query response A 82.94.164.162
5.004494 10.102.0.79 -> 8.8.8.8 DNS Standard query A python.org
5.010540 8.8.8.8 -> 10.102.0.79 DNS Standard query response A 82.94.164.162
5.010599 10.102.0.79 -> 8.8.8.8 DNS Standard query AAAA python.org
5.015832 8.8.8.8 -> 10.102.0.79 DNS Standard query response AAAA 2001:888:2000:d::a2
That's a 5 second delay!
I don't have IPv6 enabled anywhere in my system (Gentoo compiled with USE=-ipv6), so I don't think that Python has any reason to even try an IPv6 lookup.
The question referenced above suggested explicitly setting the socket family to AF_INET, which sounds great. However, I have no idea how to force urllib or mechanize to use sockets that I create.
EDIT: I know that the AAAA queries are the issue because other apps had the delay as well and as soon as I recompiled with ipv6 disabled, the problem went away... except for in python which still performs the AAAA requests.
Answer 1:
I was suffering from the same problem. Here is an ugly hack (use at your own risk) based on the information given by J.J.
This basically forces the family parameter of socket.getaddrinfo(..) to socket.AF_INET instead of socket.AF_UNSPEC (zero, which is what seems to be used in socket.create_connection). It applies not only to calls from urllib2 but should apply to all calls to socket.getaddrinfo(..):
#--------------------
# do this once at program startup
#--------------------
import socket
origGetAddrInfo = socket.getaddrinfo
def getAddrInfoWrapper(host, port, family=0, socktype=0, proto=0, flags=0):
    return origGetAddrInfo(host, port, socket.AF_INET, socktype, proto, flags)
# replace the original socket.getaddrinfo by our version
socket.getaddrinfo = getAddrInfoWrapper
#--------------------
import urllib2
print urllib2.urlopen("http://python.org/").read(100)
This works for me at least in this simple case.
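The hack above is written for Python 2 and stays in effect for the life of the process. A minimal Python 3 sketch of the same idea, wrapped in a context manager so the patch can be undone, might look like the following. The name ipv4_only is hypothetical, and the sketch assumes that the code being patched looks up socket.getaddrinfo at call time (as socket.create_connection does); libraries that resolve names some other way would not be affected.

```python
import contextlib
import socket


@contextlib.contextmanager
def ipv4_only():
    """Temporarily make socket.getaddrinfo request IPv4 addresses only."""
    original = socket.getaddrinfo

    def ipv4_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore the caller's family argument and always pass AF_INET.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = ipv4_getaddrinfo
    try:
        yield
    finally:
        # Restore the original function even if the body raised.
        socket.getaddrinfo = original


with ipv4_only():
    # A numeric address keeps this demo free of DNS traffic.
    for info in socket.getaddrinfo("127.0.0.1", 80, type=socket.SOCK_STREAM):
        print(info[0], info[4])
```

Any urllib.request.urlopen(...) call placed inside the with block should then go through the patched function, under the assumption stated above.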
Answer 2:
No answer, but a few data points. The DNS resolution appears to originate from httplib.py in HTTPConnection.connect() (line 670 in my Python 2.5.4 stdlib).
The code flow is roughly:
for res in socket.getaddrinfo(self.host, self.port, 0, socket.SOCK_STREAM):
    af, socktype, proto, canonname, sa = res
    self.sock = socket.socket(af, socktype, proto)
    try:
        self.sock.connect(sa)
    except socket.error, msg:
        continue
    break
A few comments on what's going on:
- The third argument to socket.getaddrinfo() limits the socket families, i.e., IPv4 vs. IPv6. Passing zero returns all families. Zero is hardcoded into the stdlib.
- Passing a hostname into getaddrinfo() will cause name resolution. On my OS X box with IPv6 enabled, both A and AAAA queries go out, both answers come right back, and both are returned.
- The rest of the connect loop tries each returned address until one succeeds.
For example:
>>> socket.getaddrinfo("python.org", 80, 0, socket.SOCK_STREAM)
[
(30, 1, 6, '', ('2001:888:2000:d::a2', 80, 0, 0)),
( 2, 1, 6, '', ('82.94.164.162', 80))
]
>>> help(socket.getaddrinfo)
getaddrinfo(...)
getaddrinfo(host, port [, family, socktype, proto, flags])
-> list of (family, socktype, proto, canonname, sockaddr)
Some guesses:
- Since the socket family in getaddrinfo() is hardcoded to zero, you won't be able to override the A vs. AAAA lookups through some supported API interface in urllib. Unless mechanize does its own name resolution for some other reason, mechanize can't either. From the construction of the connect loop, this is By Design.
- Python's socket module is a thin wrapper around the POSIX socket APIs; I expect they're resolving every family available and configured on the system. Double-check Gentoo's IPv6 configuration.
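To make the role of the third argument concrete, here is a rough Python 3 sketch of the same connect loop with the family pinned to socket.AF_INET instead of zero. This is not the stdlib's code: connect_ipv4 and its timeout parameter are hypothetical names for illustration, and whether the AAAA query is actually suppressed depends on the system resolver.

```python
import socket


def connect_ipv4(host, port, timeout=10.0):
    """Try each IPv4 address of host in turn; return the first connected socket."""
    last_error = None
    # AF_INET instead of 0 (AF_UNSPEC): only IPv4 addresses are requested.
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            host, port, socket.AF_INET, socket.SOCK_STREAM):
        sock = socket.socket(af, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sa)
            return sock
        except OSError as exc:
            # Remember the failure and move on to the next address.
            last_error = exc
            sock.close()
    raise last_error or OSError("getaddrinfo returned no IPv4 addresses")
```

A call such as connect_ipv4("python.org", 80) would return a plain connected socket. Wiring such a socket into urllib or mechanize would still require a custom connection class, which is not shown here.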
Answer 3:
The DNS server 8.8.8.8 (Google DNS) replies immediately when asked about the AAAA record of python.org. Therefore, the fact that we do not see this reply in the trace you posted probably indicates that the packet did not come back (which happens with UDP). If this loss is random, it is normal. If it is systematic, it means there is a problem in your network setup, maybe a broken firewall which prevents the first AAAA reply from coming back.
The 5-second delay comes from your stub resolver, which waits that long before retrying. If the loss is random, it is probably bad luck and not related to IPv6: the reply for the A record could have been lost as well.
Disabling IPv6 seems a very strange move, only two years before the last IPv4 address is distributed!
% dig @8.8.8.8 AAAA python.org
; <<>> DiG 9.5.1-P3 <<>> @8.8.8.8 AAAA python.org
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50323
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;python.org. IN AAAA
;; ANSWER SECTION:
python.org. 69917 IN AAAA 2001:888:2000:d::a2
;; Query time: 36 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sat Jan 9 21:51:14 2010
;; MSG SIZE rcvd: 67
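One way to check from Python whether the delay is tied to the AAAA lookup is to time the resolver call with and without a family restriction. The following is only a sketch: time_lookup is a hypothetical helper, and the numbers it reports depend entirely on the local resolver and network.

```python
import socket
import time


def time_lookup(host, family=socket.AF_UNSPEC, port=80):
    """Return (seconds spent in getaddrinfo, whether resolution succeeded)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        ok = True
    except socket.gaierror:
        ok = False
    return time.monotonic() - start, ok


# A numeric address needs no DNS query, so this demo call returns quickly.
print(time_lookup("127.0.0.1", socket.AF_INET))
```

On an affected machine, comparing time_lookup("python.org") with time_lookup("python.org", socket.AF_INET) should show whether the unrestricted lookup is the slow one.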
Answer 4:
Most likely cause of this is a broken egress firewall. Juniper firewalls can cause this, for instance, though they have a workaround available.
If you can't get your network admins to fix the firewall, you can try the host-based workaround. Add this line to your /etc/resolv.conf:
options single-request-reopen
The man page explains it well:
The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly only sends back one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.
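To confirm that the option really is present in the file, the options lines can be read back with a few lines of Python. This is a small sketch rather than a full resolv.conf parser: resolver_options is a hypothetical helper, and it only skips comment lines that start with "#" or ";".

```python
def resolver_options(resolv_conf_text):
    """Collect every token that follows an 'options' keyword in resolv.conf text."""
    options = set()
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        fields = line.split()
        if fields[0] == "options":
            options.update(fields[1:])
    return options


sample = "nameserver 8.8.8.8\noptions single-request-reopen\n"
print("single-request-reopen" in resolver_options(sample))
```

For the real file, the text could be read with open("/etc/resolv.conf").read() and passed to the same function.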
Source: https://stackoverflow.com/questions/2014534/force-python-mechanize-urllib2-to-only-use-a-requests