urllib

Get file size before downloading: HTTP header not matching the one retrieved from urlopen

坚强是说给别人听的谎言 submitted on 2019-12-04 19:13:18
Why is the content-length different when using requests versus urlopen(url).info()?

>>> url = 'http://pymotw.com/2/urllib/index.html'
>>> requests.head(url).headers.get('content-length', None)
'8176'
>>> urllib.urlopen(url).info()['content-length']
'38227'
>>> len(requests.get(url).content)
38274

I was going to check the file size in bytes so I could split the download buffer across multiple threads based on Range in urllib2, but if I do not have the actual file size in bytes it won't work. Only len(requests.get(url).content) gives 38274, which is the closest but still not correct, and moreover it is
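A plausible explanation, though the excerpt does not confirm it: requests sends Accept-Encoding: gzip, deflate by default, so the server's Content-Length describes the compressed body, while urllib asks for the identity encoding and gets the uncompressed size. A minimal sketch that asks explicitly for the uncompressed length, assuming the same pymotw.com URL, could look like this:

import requests

url = 'http://pymotw.com/2/urllib/index.html'

# Ask the server not to compress the response, so any Content-Length header
# it returns should describe the raw file size in bytes.
resp = requests.head(url, headers={'Accept-Encoding': 'identity'})
print(resp.headers.get('content-length'))

# Fallback: download once and measure; this is the size of the decoded body
# that was actually transferred.
print(len(requests.get(url).content))

If the server honours the identity encoding, that header value is the one to use when splitting Range requests across threads.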

Web crawling with Urllib

痞子三分冷 submitted on 2019-12-04 18:18:37
urllib is Python's built-in HTTP request library. It includes the following modules:
  urllib.request - request module
  urllib.error - exception handling module
  urllib.parse - URL parsing module
  urllib.robotparser - robots.txt parsing module

urlopen

The parameters of urllib.request.urlopen:
urllib.request.urlopen(url, data=None, [timeout])

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
html = response.read().decode('utf-8')  # decode() returns a new string, so its result must be kept
print(html)  # prints the Baidu homepage

When the network is poor or the server misbehaves, requests can become slow or fail outright, so in those cases we should give the request a timeout instead of letting the program wait for a result forever. For example:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())  # if the timeout is exceeded, the request is aborted

Now for a small exercise.
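When the timeout above is exceeded, urlopen does not simply stop; it raises an exception. A small sketch of catching it, using the same httpbin URL, might look like this:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    # urlopen wraps the low-level timeout in a URLError whose reason
    # is a socket.timeout instance
    if isinstance(e.reason, socket.timeout):
        print('request timed out')
    else:
        raise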

python3 — web crawling with the urllib module

孤街醉人 submitted on 2019-12-04 17:44:30
urllib

The urllib module is Python 3's URL handling package. It contains:

1. urllib.request, mainly for opening and reading URLs. The things I use most often:
Open a URL: urllib.request.urlopen(url)
Use urllib.request.build_opener([handler, ...]) to masquerade as a particular browser:

import urllib.request

# the browser to masquerade as (I am using Chrome here)
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36')
url = 'http://hotels.ctrip.com/'
opener = urllib.request.build_opener()
# add the fake browser identity to the HTTP headers
opener.addheaders = [headers]
# read the URL
data = opener.open(url).read()
# decode the returned HTML as utf-8
data = data.decode('utf-8')
print(data)

2. urllib.parse, mainly for parsing URLs. Main methods: urllib.parse
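As a quick illustration of what urllib.parse offers, a minimal sketch of its most commonly used helpers (not taken from the original post) might look like this:

from urllib import parse

# urlparse splits a URL into its components
parts = parse.urlparse('http://hotels.ctrip.com/search?city=beijing#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urlencode builds a query string from a dict (useful for GET and POST data)
query = parse.urlencode({'city': 'beijing', 'page': 2})
print(query)  # city=beijing&page=2

# quote / unquote handle percent-encoding of non-ASCII or unsafe characters
print(parse.quote('北京'))
print(parse.unquote(parse.quote('北京')))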

python 3 Login form on webpage with urllib and cookiejar

只谈情不闲聊 submitted on 2019-12-04 17:39:59
I've been trying to make a Python script log in to my reddit account, but it doesn't seem to work. Could anybody tell me what's wrong with my code? It runs fine, it just doesn't log in.

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
authentication_url = 'https://ssl.reddit.com/post/login'
payload = {
    'op': 'login',
    'user_name': 'username',
    'user_pass': 'password'
}
data = urllib.parse.urlencode(payload)
binary_data = data.encode('UTF-8')
req = urllib
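A complete POST with this opener and cookie jar would usually look something like the sketch below; the URL and form fields are taken from the question itself, though whether reddit still accepts this form endpoint is not something the sketch can guarantee.

import http.cookiejar
import urllib.parse
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

authentication_url = 'https://ssl.reddit.com/post/login'
payload = {'op': 'login', 'user_name': 'username', 'user_pass': 'password'}
binary_data = urllib.parse.urlencode(payload).encode('UTF-8')

# POST the form; the cookie processor stores any session cookies in cj
req = urllib.request.Request(authentication_url, binary_data)
response = urllib.request.urlopen(req)

# A follow-up request through the same opener sends the stored cookies,
# which is how you can check whether the login actually took effect.
print([cookie.name for cookie in cj])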

Urlretrieve and User-Agent? - Python

拈花ヽ惹草 submitted on 2019-12-04 16:45:28
Question: I'm using urlretrieve from the urllib module. I cannot seem to find how to add a User-Agent description to my requests. Is it possible with urlretrieve, or do I need to use another method?

Answer 1: First, set the version:

urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'

Then:

filename, headers = urllib.urlretrieve(url)

Answer 2: You can use the URLopener or FancyURLopener classes. The 'version' argument
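In Python 2's urllib, version is a class attribute that becomes the User-Agent header, so the subclassing approach from Answer 2 could be sketched as follows (Python 2 API; the class name and target URL are illustrative only):

import urllib

class BrowserOpener(urllib.FancyURLopener):
    # the 'version' class attribute is sent as the User-Agent header
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'

opener = BrowserOpener()
filename, headers = opener.retrieve('http://example.com/file.zip', 'file.zip')
print(headers)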

Print code from web page with python and urllib

两盒软妹~` submitted on 2019-12-04 16:05:44
I'm trying to use Python and urllib to look at the code of a certain web page. I've tried this and succeeded at other web pages using the code:

from urllib import *
url =
code = urlopen(url).read()
print code

But it returns nothing at all. My guess is it's because the page has a lot of JavaScript? What should I do?

Niclas Nilsson

Dynamic client-side generated pages (JavaScript)

You cannot use urllib alone to see code that has been rendered dynamically on the client side (JavaScript). The reason is that urllib only fetches the response from the server, which is the headers and the body (the actual code). Because of
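To make that concrete, a minimal Python 3 fetch with urllib is sketched below; example.com stands in for the question's URL, which the excerpt does not include. Whatever it prints is exactly what the server sent, before any JavaScript has run.

from urllib.request import urlopen

url = 'http://example.com/'  # stand-in for the page from the question

raw = urlopen(url).read().decode('utf-8', errors='replace')
# Only server-rendered HTML appears here; content built later by
# client-side JavaScript will not be part of this string.
print(raw)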

Multithreading for faster downloading

青春壹個敷衍的年華 submitted on 2019-12-04 15:57:45
Question: How can I download multiple links simultaneously? My script below works, but it only downloads one at a time and it is extremely slow. I can't figure out how to incorporate multithreading into my script.

The Python script:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse
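One common way to parallelize such a loop is to collect the URLs first and hand them to a small pool of worker threads. A rough Python 2 sketch, matching the question's urllib2/urlparse imports and assuming the links.html structure from the question, could look like this:

import os
import urllib2
import urlparse
from multiprocessing.dummy import Pool  # a thread pool, despite the module name
import lxml.html as html

def download(url):
    # derive a local file name from the last component of the URL path
    name = os.path.basename(urlparse.urlsplit(url).path) or 'index.html'
    data = urllib2.urlopen(url).read()
    with open(name, 'wb') as f:
        f.write(data)
    return name

root = html.parse(open('links.html'))
urls = [link.get('href') for link in root.findall('//a') if link.get('href')]

pool = Pool(8)  # keep 8 downloads in flight at a time
results = pool.map(download, urls)
pool.close()
pool.join()
print(results)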

Connect to FTP server through http proxy

爱⌒轻易说出口 submitted on 2019-12-04 15:03:52
My code below gives me the error socket.gaierror: [Errno 11001] getaddrinfo failed when calling the method ftp.connect(). My question is: why can I connect to google.com, but connecting to an FTP server gives me an error? And how can I connect to the FTP server from behind an HTTP proxy?

import ftplib
import urllib.request

# ftp settings
ftpusername = 'abc'
ftppassword = 'xyz'
ftp_host = 'host'
ftp_port = 1234

proxy_url = 'http://username:password@host:port'
proxy_support = urllib.request.ProxyHandler({'http': proxy_url})
opener = urllib.request.build_opener(proxy_support)
urllib.request
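One thing worth noting, as an observation rather than part of the original excerpt: urllib's ProxyHandler only affects requests made through urllib openers, while ftplib.FTP.connect() opens a raw socket directly, so the HTTP proxy is never consulted and the FTP host cannot be resolved from behind it. If the proxy can fetch ftp:// URLs over HTTP (many corporate proxies can, but that is an assumption), a sketch that goes through urllib instead of ftplib might look like this:

import urllib.request

proxy_url = 'http://username:password@host:port'  # placeholder values from the question

# route both http:// and ftp:// URLs through the HTTP proxy
proxy_support = urllib.request.ProxyHandler({'http': proxy_url, 'ftp': proxy_url})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

# fetch a file from the FTP server via the proxy; credentials go in the URL
# (the path is illustrative only)
ftp_file_url = 'ftp://abc:xyz@host:1234/path/to/file.txt'
data = urllib.request.urlopen(ftp_file_url).read()
print(len(data))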

Logging into quora using python

别来无恙 submitted on 2019-12-04 14:59:52
I tried logging into Quora using Python, but it gives me the following error:

urllib2.HTTPError: HTTP Error 500: Internal Server Error

This is my code so far. I also work behind a proxy.

import urllib2
import urllib
import re
import cookielib

class Quora:
    def __init__(self):
        '''Initialising and authentication'''
        auth = 'http://name:password@proxy:port'
        cj = cookielib.CookieJar()
        logindata = urllib.urlencode({'email': 'email', 'password': 'password'})
        handler = urllib2.ProxyHandler({'http': auth})
        opener = urllib2.build_opener(handler, urllib2.HTTPCookieProcessor(cj))
        urllib2.install
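Installing the opener and issuing the login POST would usually look like the sketch below (Python 2, matching the question's imports). The login URL and form field names here are assumptions, since Quora's real endpoint expects form keys and tokens the excerpt does not show, and a mismatch there is one plausible cause of the HTTP 500.

import urllib
import urllib2
import cookielib

auth = 'http://name:password@proxy:port'  # proxy placeholder from the question
cj = cookielib.CookieJar()
logindata = urllib.urlencode({'email': 'email', 'password': 'password'})

handler = urllib2.ProxyHandler({'http': auth})
opener = urllib2.build_opener(handler, urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

login_url = 'https://www.quora.com/login'  # hypothetical URL, for illustration only
request = urllib2.Request(login_url, logindata, {'User-Agent': 'Mozilla/5.0'})
try:
    response = urllib2.urlopen(request)
    print(response.read()[:200])
except urllib2.HTTPError as e:
    # a 500 here usually means the server rejected the submitted form payload
    print(e.code, e.read()[:200])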

Requests, Mechanize, urllib fails but cURL works

限于喜欢 submitted on 2019-12-04 14:44:22
Question: Whilst attempting to access this site through requests, I receive:

('Connection aborted.', error(54, 'Connection reset by peer'))

I have also tried to access the site through mechanize and urllib; both failed. However, cURL works fine (see the end for code). I have tried requests.get() with combinations of the parameters verify=True and stream=True, and I have also tried a request with the cURL headers. I tried to move to urllib / mechanize as alternatives, but both gave the same error. My code for
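A connection reset before any HTTP response often points at the TLS handshake rather than the headers: cURL may simply be negotiating a protocol version or cipher that the Python stack is not offering. One commonly used workaround, sketched below under that assumption and with example.com standing in for the unnamed site, is to mount a requests adapter that pins the TLS version:

import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager

class TLS12Adapter(HTTPAdapter):
    # force the connection pool to negotiate TLS 1.2
    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1_2)

session = requests.Session()
session.mount('https://', TLS12Adapter())
response = session.get('https://example.com/')  # stand-in URL; the question's site is not named in the excerpt
print(response.status_code)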