urllib

Python standard library to POST multipart/form-data encoded data

Submitted by 旧街凉风 on 2019-12-17 15:43:27
Question: I would like to POST multipart/form-data encoded data. I have found an external module that does it: http://atlee.ca/software/poster/index.html, but I would rather avoid this dependency. Is there a way to do this using the standard library? Thanks.

Answer 1: The standard library does not currently support that. There is a cookbook recipe that includes a fairly short piece of code you may want to copy, along with a long discussion of alternatives.

Answer 2: It's an old thread but
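Hand-rolling the encoding is not much code. Below is a minimal sketch along the lines of the cookbook approach, limited to text-only fields (file uploads additionally need a filename and a per-part Content-Type); the function name and field layout are illustrative, not taken from the original answers.

    import uuid
    import urllib.request

    def post_multipart(url, fields):
        # The boundary must not occur anywhere in the field data; a random hex string is a safe bet.
        boundary = uuid.uuid4().hex
        lines = []
        for name, value in fields.items():
            lines.append('--' + boundary)
            lines.append('Content-Disposition: form-data; name="%s"' % name)
            lines.append('')
            lines.append(value)
        lines.append('--' + boundary + '--')
        lines.append('')
        body = '\r\n'.join(lines).encode('utf-8')

        req = urllib.request.Request(url, data=body)
        req.add_header('Content-Type', 'multipart/form-data; boundary=' + boundary)
        return urllib.request.urlopen(req)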

Python 3 - urllib, HTTP Error 407: Proxy Authentication Required

Submitted by 一笑奈何 on 2019-12-17 15:39:32
Question: I'm trying to open a website (I am behind a corporate proxy) using urllib.request.urlopen(), but I am getting the error:

    urllib.error.HTTPError: HTTP Error 407: Proxy Authentication Required

I can find the proxy in urllib.request.getproxies(), but how do I specify a username and password to use for it? I couldn't find the solution in the official docs.

Answer 1:

    import urllib.request as req
    proxy = req.ProxyHandler({'http': r'http://username:password@url:port'})
    auth = req.HTTPBasicAuthHandler()
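The answer is cut off at this point. A plausible completion of the same approach (build an opener from the ProxyHandler and install it globally), with placeholder proxy host, port and credentials standing in for real values:

    import urllib.request as req

    # Placeholder proxy address and credentials; substitute your own.
    proxy = req.ProxyHandler({
        'http': r'http://username:password@proxyhost:8080',
        'https': r'http://username:password@proxyhost:8080',
    })
    auth = req.HTTPBasicAuthHandler()
    opener = req.build_opener(proxy, auth, req.HTTPHandler)
    req.install_opener(opener)          # every later urlopen() call now goes through the proxy

    with req.urlopen('http://www.example.com') as resp:
        print(resp.status)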

Only add to a dict if a condition is met

Submitted by 旧巷老猫 on 2019-12-17 15:38:19
Question: I am using urllib.urlencode to build web POST parameters, but there are a few values I only want to add if a value other than None exists for them.

    apple = 'green'
    orange = 'orange'
    params = urllib.urlencode({
        'apple': apple,
        'orange': orange
    })

That works fine, but if I make the orange variable optional, how can I prevent it from being added to the parameters? Something like this (pseudocode):

    apple = 'green'
    orange = None
    params = urllib.urlencode({
        'apple': apple,
        if orange:
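One common way to do this is to build the dict first and filter out the None values before encoding; a small sketch (dict comprehensions work in Python 2.7 as well as Python 3):

    import urllib  # Python 2; in Python 3 the function is urllib.parse.urlencode

    apple = 'green'
    orange = None

    # Drop any keys whose value is None before encoding.
    data = {'apple': apple, 'orange': orange}
    params = urllib.urlencode({k: v for k, v in data.items() if v is not None})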

In Python, how do I use urllib to see if a website is 404 or 200?

Submitted by 狂风中的少年 on 2019-12-17 15:20:18
Question: How do I get the HTTP status code of a response through urllib?

Answer 1: The getcode() method (added in Python 2.6) returns the HTTP status code that was sent with the response, or None if the URL is not an HTTP URL.

    >>> a = urllib.urlopen('http://www.google.com/asdfsf')
    >>> a.getcode()
    404
    >>> a = urllib.urlopen('http://www.google.com/')
    >>> a.getcode()
    200

Answer 2: You can use urllib2 as well:

    import urllib2
    req = urllib2.Request('http://www.python.org/fish.html')
    try:
        resp = urllib2.urlopen(req)
    except urllib2
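The urllib2 answer breaks off at the except clause. For reference, the same idea carried over to Python 3, where a 404 raises HTTPError and the status code lives on the exception; this is a sketch, not part of the original answers:

    import urllib.request
    import urllib.error

    def status_of(url):
        """Return the HTTP status code for url."""
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.getcode()        # e.g. 200 on success
        except urllib.error.HTTPError as e:
            return e.code                    # e.g. 404 when the page does not exist

    print(status_of('http://www.google.com/'))        # expected: 200
    print(status_of('http://www.google.com/asdfsf'))  # expected: 404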

Why can't I get Python's urlopen() method to work on Windows?

Submitted by 旧巷老猫 on 2019-12-17 14:48:08
Question: Why isn't this simple Python code working?

    import urllib
    file = urllib.urlopen('http://www.google.com')
    print file.read()

This is the error that I get:

    Traceback (most recent call last):
      File "C:\workspace\GarchUpdate\src\Practice.py", line 26, in <module>
        file = urllib.urlopen('http://www.google.com')
      File "C:\Python26\lib\urllib.py", line 87, in urlopen
        return opener.open(url)
      File "C:\Python26\lib\urllib.py", line 206, in open
        return getattr(self, name)(url)
      File "C:\Python26\lib\urllib.py

Python爬虫小实践:爬取任意CSDN博客所有文章的文字内容(或可改写为保存其他的元素),间接增加博客访问量

Submitted by 萝らか妹 on 2019-12-17 13:22:00
Python is not my main line of work. I originally picked it up to learn web scraping, because being able to pull things off the web struck me as both fascinating and genuinely useful: it lets you collect data, or whatever else you need, for all sorts of purposes. With a couple of idle days on my hands, I wrote this crawler mostly to unwind. In the previous post I used BeautifulSoup in a rough way to scrape the basic statistics of a CSDN blog (http://blog.csdn.net/hw140701/article/details/55048364). This time the idea is: given the home-page address of any CSDN blog, collect the links to all of its articles and then extract an element from each one; here I extract the text of every post.

I. Overall approach

Looking at the page source of a CSDN blog, we find that when we open a blog's home page, e.g. http://blog.csdn.net/hw140701, the page lists several articles together with their links, 15 by default. At the bottom of the home page there are pagination links (screenshot omitted here); in this example there are 65 articles spread over 5 pages, and each page again contains links to 15 articles.

So the overall plan (see the sketch after this list) is:

1. Take the blog's home-page address and collect the links of all articles on the current page.
2. Collect the link of every pagination page.
3. Follow each pagination link and collect the article links on that page.
4. Follow each article link and extract its content, until every article on the blog has been crawled.

II
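A rough sketch of steps 1 through 4, using urllib and BeautifulSoup as in the previous post. The listing-page URL pattern (/article/list/N) and the link filter are assumptions about CSDN's markup at the time and may need adjusting:

    import urllib.request
    from bs4 import BeautifulSoup

    BLOG_HOME = 'http://blog.csdn.net/hw140701'

    def links_on_page(page_url):
        # Collect every article link on one listing page (the filter is an assumption).
        html = urllib.request.urlopen(page_url).read()
        soup = BeautifulSoup(html, 'html.parser')
        return [a['href'] for a in soup.find_all('a', href=True)
                if '/article/details/' in a['href']]

    def article_text(article_url):
        # Fetch one article and return its plain text.
        html = urllib.request.urlopen(article_url).read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup.get_text()

    # Listing pages -> article links -> article text (5 listing pages in the example).
    for page in range(1, 6):
        for link in links_on_page('%s/article/list/%d' % (BLOG_HOME, page)):
            text = article_text(link)
            # ... save `text` however you like ...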

How to save “complete webpage” not just basic html using Python

Submitted by 半腔热情 on 2019-12-17 09:33:42
Question: I am using the following code to save a webpage with Python:

    import urllib
    import sys
    from bs4 import BeautifulSoup
    url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
    f = urllib.urlretrieve(url, 'test.html')

Problem: this code saves only the basic HTML, without JavaScript, images, etc. I want to save the webpage completely (like the "save complete page" option in a browser).

Update: I am now using the following code to save all the js/images/css files of the webpage so that it can be saved as complete
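For reference, a rough sketch of the usual approach, sticking with Python 2 urllib/urlparse to match the question: parse the page, download each referenced asset into a local folder, and rewrite the references so the saved page points at the local copies. The 'assets' folder name and the tag/attribute list are my own choices, and anything the page builds with JavaScript at runtime still won't be captured (that needs a real browser engine such as Selenium):

    import os
    import urllib
    import urlparse
    from bs4 import BeautifulSoup

    url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)

    if not os.path.isdir('assets'):
        os.mkdir('assets')

    # Download every image, script and linked resource next to the page.
    for tag, attr in (('img', 'src'), ('script', 'src'), ('link', 'href')):
        for node in soup.find_all(tag):
            ref = node.get(attr)
            if not ref:
                continue
            asset_url = urlparse.urljoin(url, ref)
            name = os.path.basename(urlparse.urlsplit(asset_url).path) or 'index'
            filename = os.path.join('assets', name)
            urllib.urlretrieve(asset_url, filename)
            node[attr] = filename  # point the saved page at the local copy

    with open('test.html', 'w') as f:
        f.write(str(soup))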

multiprocessing.pool.MaybeEncodingError: 'TypeError(“cannot serialize '_io.BufferedReader' object”,)'

Submitted by 筅森魡賤 on 2019-12-17 07:54:28
Question: Why does the code below work only with multiprocessing.dummy, but not with plain multiprocessing?

    import urllib.request
    #from multiprocessing.dummy import Pool  # this works
    from multiprocessing import Pool

    urls = ['http://www.python.org', 'http://www.yahoo.com',
            'http://www.scala.org', 'http://www.google.com']

    if __name__ == '__main__':
        with Pool(5) as p:
            results = p.map(urllib.request.urlopen, urls)

Error:

    Traceback (most recent call last):
      File "urlthreads.py", line 31, in <module>
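The likely cause: a real multiprocessing Pool has to pickle whatever the workers return, and the HTTPResponse returned by urlopen wraps an _io.BufferedReader, which cannot be pickled; multiprocessing.dummy uses threads and never pickles anything, so it works. A sketch of one common workaround: read the body inside the worker and return plain bytes instead of the response object.

    import urllib.request
    from multiprocessing import Pool

    urls = ['http://www.python.org', 'http://www.yahoo.com',
            'http://www.scala.org', 'http://www.google.com']

    def fetch(url):
        # Read the response inside the worker process and return bytes,
        # which pickle without trouble, instead of the response object itself.
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    if __name__ == '__main__':
        with Pool(5) as p:
            results = p.map(fetch, urls)
        print([len(body) for body in results])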

urllib.error.URLError: <urlopen error unknown url type: 'https>

Submitted by 南笙酒味 on 2019-12-17 07:48:27
Question: (Python 3.4.2) I get a weird error when I run urllib.request.urlopen(url) inside a script. If I run it directly in the Python interpreter it works fine, but not when I run the script through a bash shell (Linux). I'm guessing it has something to do with the url string, maybe because I'm building the string with the join method:

    import urllib.request
    url = "".join((baseurl, other_string, midurl, query))
    response = urllib.request.urlopen(url)

The 'url' string
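Judging by the stray quote character inside the error message ('https instead of https), the joined string most likely starts with a literal quote, perhaps picked up from the shell or an input file. A quick way to check, using hypothetical placeholder values for the asker's variables:

    import urllib.request

    # Placeholders standing in for the asker's baseurl / other_string / midurl / query;
    # note the stray quote at the start of baseurl, mimicking the suspected problem.
    baseurl, other_string, midurl, query = "'https://www.python.org", "/", "", ""

    url = "".join((baseurl, other_string, midurl, query))
    print(repr(url))   # repr() exposes stray quotes or whitespace that a plain print hides

    # Strip surrounding quote characters and whitespace before opening.
    response = urllib.request.urlopen(url.strip("'\" \n"))
    print(response.status)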