Which is best in Python: urllib2, PycURL or mechanize?

一个人的身影 2021-01-29 17:48

OK, so I need to download some web pages using Python, and I did a quick investigation of my options.

Included with Python:

urllib - seems to me that I should use ur

8 Answers
  • 2021-01-29 17:51

    Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces: 1) Grab, for creating network requests and parsing the retrieved data, and 2) Spider, for creating bulk site scrapers.

    Under the hood, Grab uses pycurl and lxml, but it is possible to use other network transports (for example, the requests library). The requests transport is not yet well tested.
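    A rough sketch of the request/parse interface, assuming Grab is installed and going from memory of its API (the URL and XPath below are placeholders):

        from grab import Grab

        g = Grab()
        resp = g.go('https://example.com')       # perform a GET request
        print(resp.code)                         # HTTP status code of the response
        print(g.doc.select('//title').text())    # query the parsed body via the lxml-backed selector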

  • 2021-01-29 17:58

    Every Python library that speaks HTTP has its own advantages.

    Use the one with the minimum set of features necessary for the task at hand.

    Your list is missing at least urllib3 - a cool third-party HTTP library which can reuse an HTTP connection, thus greatly speeding up the process of retrieving multiple URLs from the same site.
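    A minimal sketch of that connection reuse, assuming urllib3 is installed (the site and paths are placeholders):

        import urllib3

        # A PoolManager keeps connections alive and reuses them across requests
        http = urllib3.PoolManager()

        for path in ('/', '/about', '/contact'):                   # hypothetical paths on one site
            r = http.request('GET', 'https://example.com' + path)
            print(r.status, len(r.data))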

  • 2021-01-29 18:00

    Python requests is also a good candidate for HTTP work. It has a nicer API, IMHO. An example HTTP request from their official documentation:

    >>> import requests
    >>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
    >>> r.status_code
    204
    >>> r.headers['content-type']
    'application/json'
    >>> r.content
    ...
    
  • 2021-01-29 18:03
    • urllib2 is found in every Python installation, so it is a good base upon which to start.
    • PycURL is useful for people already used to libcurl; it exposes more of the low-level details of HTTP, and it picks up any fixes or improvements applied to libcurl (see the sketch after this list).
    • mechanize is used to persistently drive a connection much like a browser would.

    It's not a matter of one being better than the other; it's a matter of choosing the appropriate tool for the job.
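    A minimal PycURL sketch, assuming pycurl is installed (the URL is a placeholder), showing the lower-level, libcurl-style interface mentioned above:

        import pycurl
        from io import BytesIO

        buf = BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, 'https://example.com')      # target URL (placeholder)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)        # collect the response body into a buffer
        c.perform()                                      # perform the transfer
        print(c.getinfo(pycurl.RESPONSE_CODE))           # HTTP status code
        c.close()

        body = buf.getvalue()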

    0 讨论(0)
  • 2021-01-29 18:06

    Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)

    urllib2 is best (as it's built in), then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2 - they have similar methods, etc. Using Firefox cookies means you can get things from sites (like, say, Stack Overflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).
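    A rough sketch of that approach, assuming mechanize is installed (the cookie-file path and URL are placeholders, and exporting a Netscape-format cookies.txt from Firefox is left to a browser extension):

        import mechanize
        import cookielib

        # Load cookies exported from the browser (hypothetical cookies.txt path)
        cj = cookielib.MozillaCookieJar('cookies.txt')
        cj.load()

        br = mechanize.Browser()
        br.set_cookiejar(cj)                               # reuse the browser's cookies
        response = br.open('https://stackoverflow.com/')   # behaves much like urllib2.urlopen
        html = response.read()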

    PycURL is for people who need all the low-level stuff in libcurl. I would try the other libraries first.

  • 2021-01-29 18:07

    To "get some webpages", use requests!

    From http://docs.python-requests.org/en/latest/ :

    Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

    Things shouldn’t be this way. Not in Python.

    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    u'{"type":"User"...'
    >>> r.json()
    {u'private_gists': 419, u'total_private_repos': 77, ...}
    