Python urllib3 and how to handle cookie support?

野趣味 2020-12-06 16:55

So I'm looking into urllib3 because it has connection pooling and is thread safe (so performance is better, especially for crawling), but the documentation is... minimal to

5 Answers
  • 2020-12-06 17:32

    You're correct, there's no immediately better way to do this right now. I would be more than happy to accept a patch if you have a congruent improvement.

    One thing to keep in mind, urllib3's HTTPConnectionPool is intended to be a "pool of connections" to a specific host, as opposed to a stateful client. In that context, it makes sense to keep the tracking of cookies outside of the actual pool.

    • shazow (the author of urllib3)
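
To illustrate what keeping cookie tracking outside the pool might look like, here is a minimal sketch that uses only the standard library. `CookieTracker`, `store`, and `header` are hypothetical names invented for this example, not part of urllib3, and the sketch ignores domain, path, and expiry rules that a real cookie jar would enforce.

```python
from http.cookies import SimpleCookie


class CookieTracker:
    """Hypothetical helper: remembers cookies outside the connection pool."""

    def __init__(self):
        self._cookies = {}

    def store(self, set_cookie_values):
        """Record cookies from an iterable of Set-Cookie header values."""
        for value in set_cookie_values:
            parsed = SimpleCookie()
            parsed.load(value)
            for name, morsel in parsed.items():
                # Keep only name/value; attributes such as Path are dropped here.
                self._cookies[name] = morsel.value

    def header(self):
        """Return the value to send in an outgoing Cookie header."""
        return "; ".join(f"{k}={v}" for k, v in self._cookies.items())


tracker = CookieTracker()
tracker.store(["session=abc123; Path=/", "theme=dark"])
print(tracker.header())  # session=abc123; theme=dark
```

With urllib3 one would presumably pass `headers={'Cookie': tracker.header()}` on each request and feed the response's Set-Cookie values back into `store`. As far as I know, recent urllib3 versions expose them via `r.headers.getlist('Set-Cookie')`, but check that against the version you have installed.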
  • 2020-12-06 17:35

    You need to set the 'Cookie' header, not 'Set-Cookie'. 'Set-Cookie' is the header the web server sends back to the client.

    Cookies are just another request header, so there is nothing wrong with setting them that way.
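
As a small illustration of the request side, the sketch below formats several cookies into a single 'Cookie' header value, with pairs separated by "; ". The function name `cookie_header` and the sample cookie names and values are made up for this example.

```python
def cookie_header(cookies):
    """Format a dict of cookies as the value of a request 'Cookie' header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


headers = {
    "User-Agent": "example-client/0.1",  # placeholder value
    "Cookie": cookie_header({"sessionid": "abc123", "lang": "en"}),
}
print(headers["Cookie"])  # sessionid=abc123; lang=en
```

The resulting `headers` dict is the kind of mapping you would hand to the `headers` argument of a urllib3 request.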

  • 2020-12-06 17:41

    You should use the requests library. It uses urllib3 but makes things like adding cookies trivial.

    https://github.com/kennethreitz/requests

    import requests

    url = 'http://example.com/'  # placeholder; replace with the URL you want to fetch
    r1 = requests.get(url, cookies={'somename': 'somevalue'})
    print(r1.content)
    
  • 2020-12-06 17:45

    You can use code like this:

    import urllib3

    def getHtml(url):
        http = urllib3.PoolManager()
        # Cookies are sent as an ordinary 'Cookie' request header
        r = http.request('GET', url, headers={'User-agent':'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.16 Safari/537.36','Cookie':'cookie_name=cookie_value'})
        return r.data  # response body (HTML) as bytes
    

    Replace cookie_name and cookie_value with your own values.

  • 2020-12-06 17:49

    Is there not a problem with multiple cookies?

    Some servers return multiple Set-Cookie headers, but urllib3 stores the headers in a dict and a dict does not allow multiple entries with the same key.

    httplib2 has a similar problem.

    Or maybe not: it turns out that the readheaders method of the HTTPMessage class in the httplib package -- which both urllib3 and httplib2 use -- has the following comment:

    If multiple header fields with the same name occur, they are combined according to the rules in RFC 2616 sec 4.2:

        Appending each subsequent field-value to the first, each separated
        by a comma. The order in which header fields with the same field-name
        are received is significant to the interpretation of the combined
        field value.
    

    So no headers are lost.

    There is, however, a problem if there are commas within a header value. I have not yet figured out what is going on here, but from skimming RFC 2616 ("Hypertext Transfer Protocol -- HTTP/1.1") and RFC 2965 ("HTTP State Management Mechanism") I get the impression that any commas within a header value are supposed to be quoted.
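
To make the comma problem concrete, here is a small sketch using Python 3's `http.client` (the successor of `httplib`). It is an illustration under my own assumptions, not what urllib3 does internally: the header bytes are invented sample data, and `http.client.parse_headers` is used simply because it parses raw header lines without a network connection. A cookie's Expires date contains a comma, so joining the values with commas and splitting them again does not give back the original headers.

```python
import io
import http.client

# Invented sample response headers: two separate Set-Cookie fields.
raw = (
    b"Set-Cookie: a=1; Expires=Wed, 09 Jun 2021 10:18:14 GMT\r\n"
    b"Set-Cookie: b=2; Path=/\r\n"
    b"\r\n"
)

msg = http.client.parse_headers(io.BytesIO(raw))

# get_all keeps each field separate, so nothing is lost or merged.
separate = msg.get_all("Set-Cookie")

# Folding the values RFC 2616 style, then naively splitting on the comma.
folded = ", ".join(separate)
naive = folded.split(", ")

print(len(separate))  # 2
print(len(naive))     # 3
```

The extra piece comes from the comma inside the Expires date, which is why it is safer to read the Set-Cookie fields one by one than to split a folded value.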
