urllib

Python requests returns a different web page from the browser or urllib

依然范特西╮ submitted at 2020-01-14 04:01:05
Question: I use requests to scrape a webpage for some content. When I use

```python
import requests
requests.get('example.org')
```

I get a different page from the one I get when I use my browser or

```python
import urllib.request
urllib.request.urlopen('example.org')
```

I tried using urllib but it was really slow; in a comparison test I ran, it was 50% slower than requests! How do you solve this?

Answer 1: After a lot of investigation I found that the site passes a cookie in the header attached to the first visitor to the
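The answer is cut off above, but its gist is that the server sets a cookie on the first response and serves different content once that cookie is sent back. A minimal sketch of that approach with requests, using a Session so cookies from the first response are replayed automatically (the URL and header values are placeholders, not from the original answer):

```python
import requests

# A Session persists cookies across requests, so the cookie the site
# sets on the first visit is sent back on the second request.
session = requests.Session()
session.headers.update({
    # Placeholder User-Agent; some sites vary content by browser signature.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
})

# First request: receives the cookie the server attaches to new visitors.
session.get('http://example.org')

# Second request: the stored cookie is sent automatically, so the server
# should now return the same page a browser sees.
response = session.get('http://example.org')
print(response.text[:500])
```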

requests vs. urllib.request

柔情痞子 submitted at 2020-01-14 02:37:01
requests obviously differs from urllib.request in spelling: the former has an extra "s". When importing: `import requests` / `import urllib.request`.

urllib.request is the request module, used to open and read URLs (a sketch of these calls follows below):

- `urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)`
- `response.read()` returns the content of the page, i.e. the response body.
- The `timeout` parameter: when the network is poor or the server misbehaves, a request can hang or raise an exception; `timeout` is also sometimes used against anti-crawling measures, to control crawl speed.
- `response.status` and `response.getheaders()` (also `response.headers` or `response.info()`) return the status code and header information.
- `urlopen()` is only suitable for simple requests; it cannot add header information. For that, use `urllib.request.Request(url)` — note the capital R.
- Using the `data` parameter: `data = urllib.parse.urlencode(dict).encode('utf-8')`. If you pass `data`, it must be of type bytes; if it is a dict
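A short sketch pulling these pieces together — urlopen with a timeout, then a Request object carrying headers and an encoded data payload (the httpbin.org URLs and form fields are placeholders, not from the original post):

```python
import urllib.parse
import urllib.request

# Simple GET with a timeout (in seconds); raises URLError if it expires.
response = urllib.request.urlopen('http://httpbin.org/get', timeout=5)
print(response.status)           # status code
print(response.getheaders())     # header information
print(response.read()[:200])     # response body (bytes)

# urlopen() alone cannot set headers; build a Request object for that.
data = urllib.parse.urlencode({'name': 'value'}).encode('utf-8')  # must be bytes
req = urllib.request.Request(
    'http://httpbin.org/post',
    data=data,  # the presence of data turns this into a POST
    headers={'User-Agent': 'Mozilla/5.0'},
)
with urllib.request.urlopen(req, timeout=5) as resp:
    print(resp.read().decode('utf-8'))
```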

urllib.urlopen works but urllib2.urlopen doesn't

家住魔仙堡 submitted at 2020-01-12 07:11:38
Question: I have a simple website I'm testing. It's running on localhost and I can access it in my web browser. The index page is simply the word "running". urllib.urlopen will successfully read the page but urllib2.urlopen will not. Here's a script which demonstrates the problem (this is the actual script and not a simplification of a different test script):

```python
import urllib, urllib2
print urllib.urlopen("http://127.0.0.1").read()   # prints "running"
print urllib2.urlopen("http://127.0.0.1").read()  #
```
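The post is truncated before the actual failure is shown. A first diagnostic step (my own suggestion, not from the original question) is to catch and print the urllib2 error, since urllib2 raises on HTTP error statuses where urllib quietly returns the error page body — this is Python 2 code, matching the question:

```python
import urllib2

try:
    print urllib2.urlopen("http://127.0.0.1", timeout=5).read()
except urllib2.HTTPError as e:
    # The server answered, but with an error status; urllib.urlopen
    # would have returned this body silently instead of raising.
    print "HTTP error:", e.code, e.msg
    print e.read()  # the error page body
except urllib2.URLError as e:
    # Connection-level failure: refused, timed out, DNS, proxy, ...
    print "URL error:", e.reason
```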

Python Crawler Series 1: Using the urllib.request and chardet Packages

狂风中的少年 submitted at 2020-01-09 00:56:32
I. References

1. Web Scraping with Python (《Python网络数据采集》), Turing Industry Press
2. Learning Scrapy (《精通Python爬虫框架Scrapy》), Posts & Telecom Press
3. [Scrapy official tutorial](http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html)
4. [Python 3 web crawlers](http://blog.csdn.net/c406495762/article/details/72858983)

II. Prerequisites

URLs, the HTTP protocol, web front end (HTML/CSS/JS), Ajax, re, XPath, XML.

III. Basics

1. Introduction to crawlers. Definition: a web crawler (also called a web spider or web robot, and in the FOAF community more often a web wanderer) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.
2. Two defining characteristics: (1) it downloads the data or content the author asks for; (2) it moves across the network automatically.
3. Three main steps (a sketch of the first step follows below): (1) download a page; (2) extract the right information; (3) follow certain rules to jump to another page and repeat the first two steps.
4. Crawler types: (1) general-purpose crawlers; (2) special-purpose (focused) crawlers.
5. Overview of Python networking packages. Python 2: urllib, urllib2, urllib3, httplib, httplib2, requests. Python 3.x: urllib
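Since the series title mentions chardet, here is a minimal sketch of the usual pairing (my own illustration; the target URL is a placeholder): download a page's raw bytes with urllib.request, let chardet guess the encoding, then decode.

```python
import urllib.request

import chardet  # third-party: pip install chardet

# Step 1: download the page as raw bytes.
with urllib.request.urlopen('https://www.python.org') as response:
    raw = response.read()

# Step 2: guess the encoding from the bytes themselves.
guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess)

# Step 3: decode using the guessed encoding, falling back to utf-8.
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')
print(text[:200])
```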

Python crawler basics: the urllib library and urlopen (Part 1)

我的梦境 submitted at 2020-01-08 01:58:10
urllib is Python's built-in HTTP request library. It contains four main modules:

- request: the most basic HTTP request module, used to simulate sending a request.
- error: the exception-handling module; if a request fails, the exception can be caught and handled so the program does not terminate unexpectedly.
- parse: a utility module offering many URL-handling methods, such as splitting, parsing, and joining.
- robotparser: mainly used to parse a site's robots.txt file and determine which pages may be crawled.

Using urlopen from urllib's request module to fetch the official Python site, we can then extract whatever we want from it:

```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))  # the response object's type
print(response.read().decode('utf-8'))
```

Source: https://www.cnblogs.com/u-damowang1/p/12164500.html
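The error and parse modules listed above can be sketched just as briefly (my own illustration, not from the original post): error for catching failed requests, parse for URL manipulation. The URLs are placeholders.

```python
import urllib.error
import urllib.parse
import urllib.request

# error: catch failures instead of crashing.
try:
    urllib.request.urlopen('https://www.python.org/nonexistent', timeout=5)
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code, e.reason)    # e.g. 404 Not Found
except urllib.error.URLError as e:
    print('connection problem:', e.reason)

# parse: split, join, and inspect URLs.
parts = urllib.parse.urlsplit('https://www.python.org/downloads/?os=linux')
print(parts.netloc, parts.path, parts.query)
print(urllib.parse.urljoin('https://www.python.org/a/', 'b.html'))
```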

Content-Length should be specified for iterable data of type <class 'dict'>

£可爱£侵袭症+ submitted at 2020-01-06 12:48:13
Question:

```python
import urllib.request
url = 'site'
headers = {'Authorization' : 'Basic emVkMHg6WWJyYm5mMDA='}
req = urllib.request.Request(url, headers)
response = urllib.request.urlopen(req).getcode()
```

I wanted to write multi-threaded authorization on a remote server, but I see this:

```
Traceback (most recent call last):
  File "C:\Program Files\Python33\lib\urllib\request.py", line 1186, in do_request_
    mv = memoryview(data)
TypeError: memoryview: dict object does not have the buffer interface

During handling of
```
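The post is cut off, but the traceback already points at the cause: `Request(url, headers)` passes the headers dict as the second positional parameter, which is `data`, so urllib tries to send the dict as the request body. A likely fix (my reading of the traceback, not a quoted answer) is to pass the dict by keyword:

```python
import urllib.request

url = 'http://example.org'  # placeholder; 'site' in the question is not a valid URL
headers = {'Authorization': 'Basic emVkMHg6WWJyYm5mMDA='}

# Request's signature is Request(url, data=None, headers={}, ...), so
# headers must be given by keyword or the dict lands in the data slot.
req = urllib.request.Request(url, headers=headers)
status = urllib.request.urlopen(req).getcode()
print(status)
```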

Python: download images with alternating variables

邮差的信 submitted at 2020-01-05 05:36:22
Question: I was trying to download images whose URLs change, but got an error.

```python
url_image = "http://www.joblo.com/timthumb.php?src=/posters/images/full/" + str(title_2) + "-poster1.jpg&h=333&w=225"
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
headers = {'User-Agent': user_agent}
req = urllib.request.Request(url_image, None, headers)
print(url_image)
#image, h = urllib.request.urlretrieve(url_image)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
    #print (the_page)
    with
```
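The question breaks off mid-script, but the pattern it is reaching for can be sketched as follows (the title list and output filenames are my placeholders): build each URL from the varying title, percent-encode the title so spaces and punctuation do not break the request, and write the response bytes to a file.

```python
import urllib.parse
import urllib.request

titles = ['some-movie', 'another-movie']  # placeholder titles
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

for title in titles:
    # quote() percent-encodes characters that are invalid in a URL path.
    src = '/posters/images/full/' + urllib.parse.quote(title) + '-poster1.jpg'
    url_image = 'http://www.joblo.com/timthumb.php?src=' + src + '&h=333&w=225'

    req = urllib.request.Request(url_image, None, headers)
    with urllib.request.urlopen(req) as response:
        data = response.read()

    # Write the raw bytes out as an image file.
    with open(title + '-poster1.jpg', 'wb') as f:
        f.write(data)
```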

Logging in to a web site with Python (urllib, urllib2, cookielib): how does one find the necessary information for submission?

╄→尐↘猪︶ㄣ submitted at 2020-01-04 13:47:19
Question: Preface: I understand that there are many responses to similar questions on Stack Overflow. However, I haven't found anything relating to ASPX logins, nor an exact case such as this. Problem: I need to determine what information is necessary in order to log in to https://cableone.net/login.aspx so that I can scrape information from there. Progress: Thus far I have found the input fields in the source of login.aspx and have cobbled together a script in Python with urllib, urllib2, and
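The question is truncated, but the usual recipe for an ASP.NET (.aspx) login is: fetch the login page with a cookie jar attached, pull the hidden __VIEWSTATE and __EVENTVALIDATION fields out of the form, and POST them back along with the credentials. A sketch in Python 3 terms (http.cookiejar replaces the question's cookielib; the username and password field names are guesses based on typical ASP.NET forms, not taken from cableone.net):

```python
import http.cookiejar
import re
import urllib.parse
import urllib.request

login_url = 'https://cableone.net/login.aspx'

# Opener with a cookie jar so the ASP.NET session cookie persists.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Step 1: GET the login page and extract the hidden form fields.
html = opener.open(login_url).read().decode('utf-8', errors='replace')

def hidden_field(name):
    m = re.search(r'id="%s" value="([^"]*)"' % name, html)
    return m.group(1) if m else ''

form = {
    '__VIEWSTATE': hidden_field('__VIEWSTATE'),
    '__EVENTVALIDATION': hidden_field('__EVENTVALIDATION'),
    # Field names below are hypothetical; read the real ones from the page source.
    'txtUsername': 'my_user',
    'txtPassword': 'my_pass',
}

# Step 2: POST the form back; the session cookie rides along automatically.
data = urllib.parse.urlencode(form).encode('utf-8')
response = opener.open(login_url, data)
print(response.getcode())
```

A real page may also need __EVENTTARGET/__EVENTARGUMENT or a submit-button field, and an HTML parser such as BeautifulSoup is more robust than the regex used here.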

Redirect with no auth

笑着哭i submitted at 2020-01-04 07:42:45
Question: According to the docs, it should be as simple as:

```python
data = self.http_pool.urlopen('GET', file_url, preload_content=False, retries=max_download_retries)
request.add_unredirected_header(key, header)
```

The docs say add_unredirected_header will "add a header that will not be added to a redirected request", but I cannot seem to find any examples of how this can be achieved. I am using PyUpdater to download updates from Bitbucket and launch the newest version of the exe. I am using this library to create a script that connects to Bitbucket fine,
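The question is cut off, but note that the two quoted lines come from different libraries: `http_pool.urlopen(..., preload_content=..., retries=...)` is urllib3, while `add_unredirected_header` is a method of urllib.request.Request in the standard library. A minimal standard-library sketch of the latter (the URL and token are placeholders): the header is sent on the first request only, so credentials are not forwarded to whatever host a redirect points at.

```python
import urllib.request

req = urllib.request.Request('https://bitbucket.org/some/file')  # placeholder URL
# Sent with the initial request, but NOT copied onto any redirected request.
req.add_unredirected_header('Authorization', 'Bearer <token>')

with urllib.request.urlopen(req) as response:
    data = response.read()
print(len(data))
```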