urllib

Get file size before downloading: HTTP header not matching the one retrieved from urlopen

坚强是说给别人听的谎言 submitted on 2019-12-04 19:13:18
Why is the content-length different when using requests versus urlopen(url).info()?

>>> url = 'http://pymotw.com/2/urllib/index.html'
>>> requests.head(url).headers.get('content-length', None)
'8176'
>>> urllib.urlopen(url).info()['content-length']
'38227'
>>> len(requests.get(url).content)
38274

I was going to check the file size in bytes so I could split the download buffer across multiple threads based on Range in urllib2, but if I do not have the actual file size in bytes it won't work. Only len(requests.get(url).content) gives 38274, which is the closest but still not correct, and moreover it is
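A plausible explanation, though the excerpt does not confirm it: requests sends Accept-Encoding: gzip, deflate by default, so the server's Content-Length describes the compressed body, while urllib asks for the identity encoding and gets the uncompressed size. A minimal sketch that asks explicitly for the uncompressed length, assuming the same pymotw.com URL, could look like this:

import requests

url = 'http://pymotw.com/2/urllib/index.html'

# Ask the server not to compress the response, so any Content-Length header
# it returns should describe the raw file size in bytes.
resp = requests.head(url, headers={'Accept-Encoding': 'identity'})
print(resp.headers.get('content-length'))

# Fallback: download once and measure; this is the size of the decoded body
# that was actually transferred.
print(len(requests.get(url).content))

If the server honours the identity encoding, that header value is the one to use when splitting Range requests across threads.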

Web crawling with Urllib

痞子三分冷 submitted on 2019-12-04 18:18:37
urllib is Python's built-in HTTP request library. It includes the following modules:
  urllib.request - request module
  urllib.error - exception handling module
  urllib.parse - URL parsing module
  urllib.robotparser - robots.txt parsing module

urlopen

The parameters of urllib.request.urlopen:
urllib.request.urlopen(url, data=None, [timeout])

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
html = response.read().decode('utf-8')  # decode() returns a new string, so its result must be kept
print(html)  # prints the Baidu homepage

When the network is poor or the server misbehaves, requests can become slow or fail outright, so in those cases we should give the request a timeout instead of letting the program wait for a result forever. For example:

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())  # if the timeout is exceeded, the request is aborted

Now for a small exercise.
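When the timeout above is exceeded, urlopen does not simply stop; it raises an exception. A small sketch of catching it, using the same httpbin URL, might look like this:

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    # urlopen wraps the low-level timeout in a URLError whose reason
    # is a socket.timeout instance
    if isinstance(e.reason, socket.timeout):
        print('request timed out')
    else:
        raise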

python3 — web crawling with the urllib module

孤街醉人 submitted on 2019-12-04 17:44:30
urllib

The urllib module is Python 3's URL handling package. It contains:

1. urllib.request, mainly for opening and reading URLs. The things I use most often:
Open a URL: urllib.request.urlopen(url)
Use urllib.request.build_opener([handler, ...]) to masquerade as a particular browser:

import urllib.request

# the browser to masquerade as (I am using Chrome here)
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36')
url = 'http://hotels.ctrip.com/'
opener = urllib.request.build_opener()
# add the fake browser identity to the HTTP headers
opener.addheaders = [headers]
# read the URL
data = opener.open(url).read()
# decode the returned HTML as utf-8
data = data.decode('utf-8')
print(data)

2. urllib.parse, mainly for parsing URLs. Main methods: urllib.parse
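As a quick illustration of what urllib.parse offers, a minimal sketch of its most commonly used helpers (not taken from the original post) might look like this:

from urllib import parse

# urlparse splits a URL into its components
parts = parse.urlparse('http://hotels.ctrip.com/search?city=beijing#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urlencode builds a query string from a dict (useful for GET and POST data)
query = parse.urlencode({'city': 'beijing', 'page': 2})
print(query)  # city=beijing&page=2

# quote / unquote handle percent-encoding of non-ASCII or unsafe characters
print(parse.quote('北京'))
print(parse.unquote(parse.quote('北京')))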

python 3 Login form on webpage with urllib and cookiejar

只谈情不闲聊 submitted on 2019-12-04 17:39:59
I've been trying to make a Python script log in to my reddit account, but it doesn't seem to work. Could anybody tell me what's wrong with my code? It runs fine, it just doesn't log in.

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
authentication_url = 'https://ssl.reddit.com/post/login'
payload = {
    'op': 'login',
    'user_name': 'username',
    'user_pass': 'password'
}
data = urllib.parse.urlencode(payload)
binary_data = data.encode('UTF-8')
req = urllib
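A complete POST with this opener and cookie jar would usually look something like the sketch below; the URL and form fields are taken from the question itself, though whether reddit still accepts this form endpoint is not something the sketch can guarantee.

import http.cookiejar
import urllib.parse
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

authentication_url = 'https://ssl.reddit.com/post/login'
payload = {'op': 'login', 'user_name': 'username', 'user_pass': 'password'}
binary_data = urllib.parse.urlencode(payload).encode('UTF-8')

# POST the form; the cookie processor stores any session cookies in cj
req = urllib.request.Request(authentication_url, binary_data)
response = urllib.request.urlopen(req)

# A follow-up request through the same opener sends the stored cookies,
# which is how you can check whether the login actually took effect.
print([cookie.name for cookie in cj])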

Urlretrieve and User-Agent? - Python

拈花ヽ惹草 submitted on 2019-12-04 16:45:28
Question: I'm using urlretrieve from the urllib module. I cannot seem to find how to add a User-Agent description to my requests. Is it possible with urlretrieve, or do I need to use another method?

Answer 1: First, set the version:

urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'

Then:

filename, headers = urllib.urlretrieve(url)

Answer 2: You can use the URLopener or FancyURLopener classes. The 'version' argument
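In Python 2's urllib, version is a class attribute that becomes the User-Agent header, so the subclassing approach from Answer 2 could be sketched as follows (Python 2 API; the class name and target URL are illustrative only):

import urllib

class BrowserOpener(urllib.FancyURLopener):
    # the 'version' class attribute is sent as the User-Agent header
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'

opener = BrowserOpener()
filename, headers = opener.retrieve('http://example.com/file.zip', 'file.zip')
print(headers)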

Print code from web page with python and urllib

两盒软妹~` submitted on 2019-12-04 16:05:44
I'm trying to use Python and urllib to look at the code of a certain web page. I've tried this and succeeded at other web pages using the code:

from urllib import *
url =
code = urlopen(url).read()
print code

But it returns nothing at all. My guess is it's because the page has a lot of JavaScript? What should I do?

Niclas Nilsson

Dynamic client-side generated pages (JavaScript)

You cannot use urllib alone to see code that has been rendered dynamically on the client side (JavaScript). The reason is that urllib only fetches the response from the server, which is the headers and the body (the actual code). Because of
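To make that concrete, a minimal Python 3 fetch with urllib is sketched below; example.com stands in for the question's URL, which the excerpt does not include. Whatever it prints is exactly what the server sent, before any JavaScript has run.

from urllib.request import urlopen

url = 'http://example.com/'  # stand-in for the page from the question

raw = urlopen(url).read().decode('utf-8', errors='replace')
# Only server-rendered HTML appears here; content built later by
# client-side JavaScript will not be part of this string.
print(raw)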

Multithreading for faster downloading

青春壹個敷衍的年華 submitted on 2019-12-04 15:57:45
Question: How can I download multiple links simultaneously? My script below works, but it only downloads one at a time and it is extremely slow. I can't figure out how to incorporate multithreading into my script.

The Python script:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse
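One common way to parallelize such a loop is to collect the URLs first and hand them to a small pool of worker threads. A rough Python 2 sketch, matching the question's urllib2/urlparse imports and assuming the links.html structure from the question, could look like this:

import os
import urllib2
import urlparse
from multiprocessing.dummy import Pool  # a thread pool, despite the module name
import lxml.html as html

def download(url):
    # derive a local file name from the last component of the URL path
    name = os.path.basename(urlparse.urlsplit(url).path) or 'index.html'
    data = urllib2.urlopen(url).read()
    with open(name, 'wb') as f:
        f.write(data)
    return name

root = html.parse(open('links.html'))
urls = [link.get('href') for link in root.findall('//a') if link.get('href')]

pool = Pool(8)  # keep 8 downloads in flight at a time
results = pool.map(download, urls)
pool.close()
pool.join()
print(results)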

Connect to FTP server through http proxy

爱⌒轻易说出口 submitted on 2019-12-04 15:03:52
My code below gives me the error socket.gaierror: [Errno 11001] getaddrinfo failed when calling the method ftp.connect(). My question is: why can I connect to google.com, but connecting to an FTP server gives me an error? And how can I connect to the FTP server from behind an HTTP proxy?

import ftplib
import urllib.request

# ftp settings
ftpusername = 'abc'
ftppassword = 'xyz'
ftp_host = 'host'
ftp_port = 1234

proxy_url = 'http://username:password@host:port'
proxy_support = urllib.request.ProxyHandler({'http': proxy_url})
opener = urllib.request.build_opener(proxy_support)
urllib.request
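One thing worth noting, as an observation rather than part of the original excerpt: urllib's ProxyHandler only affects requests made through urllib openers, while ftplib.FTP.connect() opens a raw socket directly, so the HTTP proxy is never consulted and the FTP host cannot be resolved from behind it. If the proxy can fetch ftp:// URLs over HTTP (many corporate proxies can, but that is an assumption), a sketch that goes through urllib instead of ftplib might look like this:

import urllib.request

proxy_url = 'http://username:password@host:port'  # placeholder values from the question

# route both http:// and ftp:// URLs through the HTTP proxy
proxy_support = urllib.request.ProxyHandler({'http': proxy_url, 'ftp': proxy_url})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

# fetch a file from the FTP server via the proxy; credentials go in the URL
# (the path is illustrative only)
ftp_file_url = 'ftp://abc:xyz@host:1234/path/to/file.txt'
data = urllib.request.urlopen(ftp_file_url).read()
print(len(data))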

Logging into quora using python

别来无恙 submitted on 2019-12-04 14:59:52
I tried logging into Quora using Python, but it gives me the following error:

urllib2.HTTPError: HTTP Error 500: Internal Server Error

This is my code so far. I also work behind a proxy.

import urllib2
import urllib
import re
import cookielib

class Quora:
    def __init__(self):
        '''Initialising and authentication'''
        auth = 'http://name:password@proxy:port'
        cj = cookielib.CookieJar()
        logindata = urllib.urlencode({'email': 'email', 'password': 'password'})
        handler = urllib2.ProxyHandler({'http': auth})
        opener = urllib2.build_opener(handler, urllib2.HTTPCookieProcessor(cj))
        urllib2.install
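Installing the opener and issuing the login POST would usually look like the sketch below (Python 2, matching the question's imports). The login URL and form field names here are assumptions, since Quora's real endpoint expects form keys and tokens the excerpt does not show, and a mismatch there is one plausible cause of the HTTP 500.

import urllib
import urllib2
import cookielib

auth = 'http://name:password@proxy:port'  # proxy placeholder from the question
cj = cookielib.CookieJar()
logindata = urllib.urlencode({'email': 'email', 'password': 'password'})

handler = urllib2.ProxyHandler({'http': auth})
opener = urllib2.build_opener(handler, urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

login_url = 'https://www.quora.com/login'  # hypothetical URL, for illustration only
request = urllib2.Request(login_url, logindata, {'User-Agent': 'Mozilla/5.0'})
try:
    response = urllib2.urlopen(request)
    print(response.read()[:200])
except urllib2.HTTPError as e:
    # a 500 here usually means the server rejected the submitted form payload
    print(e.code, e.read()[:200])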

Requests, Mechanize, urllib fails but cURL works

限于喜欢 submitted on 2019-12-04 14:44:22
Question: Whilst attempting to access this site through requests, I receive:

('Connection aborted.', error(54, 'Connection reset by peer'))

I have also tried to access the site through mechanize and urllib; both failed. However, cURL works fine (see the end for code). I have tried requests.get() with combinations of the parameters verify=True and stream=True, and I have also tried a request with the cURL headers. I tried to move to urllib / mechanize as alternatives, but both gave the same error. My code for
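A connection reset before any HTTP response often points at the TLS handshake rather than the headers: cURL may simply be negotiating a protocol version or cipher that the Python stack is not offering. One commonly used workaround, sketched below under that assumption and with example.com standing in for the unnamed site, is to mount a requests adapter that pins the TLS version:

import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager

class TLS12Adapter(HTTPAdapter):
    # force the connection pool to negotiate TLS 1.2
    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1_2)

session = requests.Session()
session.mount('https://', TLS12Adapter())
response = session.get('https://example.com/')  # stand-in URL; the question's site is not named in the excerpt
print(response.status_code)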