So I\'m looking into urllib3 because it has connection pooling and is thread safe (so performance is better, especially for crawling), but the documentation is... minimal to
You're correct, there's no immediately better way to do this right now. I would be more than happy to accept a patch if you have a congruent improvement.
One thing to keep in mind, urllib3's HTTPConnectionPool is intended to be a "pool of connections" to a specific host, as opposed to a stateful client. In that context, it makes sense to keep the tracking of cookies outside of the actual pool.
You need to set 'Cookie'
not 'Set-Cookie'
, 'Set-Cookie'
set by web server.
And Cookies are one of headers, so its nothing wrong with doing that way.
You should use the requests library. It uses urllib3 but makes things like adding cookies trivial.
https://github.com/kennethreitz/requests
import requests
r1 = requests.get(url, cookies={'somename':'somevalue'})
print(r1.content)
You can use a code like this:
def getHtml(url):
http = urllib3.PoolManager()
r = http.request('GET', url, headers={'User-agent':'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.16 Safari/537.36','Cookie':'cookie_name=cookie_value'})
return r.data #HTML
You should replace cookie_name and cookie_value
Is there not a problem with multiple cookies?
Some servers return multiple Set-Cookie headers, but urllib3 stores the headers in a dict and a dict does not allow multiple entries with the same key.
httplib2 has a similar problem.
Or maybe not: it turns out that the readheaders method of the HTTPMessage class in the httplib package -- which both urllib3 and httplib2 use -- has the following comment:
If multiple header fields with the same name occur, they are combined according to the rules in RFC 2616 sec 4.2:
Appending each subsequent field-value to the first, each separated
by a comma. The order in which header fields with the same field-name
are received is significant to the interpretation of the combined
field value.
So no headers are lost.
There is, however, a problem if there are commas within a header value. I have not yet figured out what is going on here, but from skimming RFC 2616 ("Hypertext Transfer Protocol -- HTTP/1.1") and RFC 2965 ("HTTP State Management Mechanism") I get the impression that any commas within a header value are supposed to be quoted.