I. urllib
1. Making a request
urllib.request.urlopen()
Parameters:
- url: the URL to fetch
- timeout: sets the wait time; an exception is raised if no response arrives within the specified time
```python
# Import the module
import urllib.request

url = "http://www.baidu.com/"
# Send a request to Baidu and get a response object
html = urllib.request.urlopen(url)

print(html.read().decode("utf-8"))  # Page source, as a str
print(html.status)                  # Response status code
```
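The example above does not exercise the timeout parameter. Here is a minimal sketch of using it (the 0.01-second value is deliberately tiny, chosen only to force the failure):

```python
import socket
import urllib.error
import urllib.request

try:
    res = urllib.request.urlopen("http://www.baidu.com/", timeout=0.01)
except urllib.error.URLError as e:
    # A timeout usually surfaces as a URLError wrapping socket.timeout
    print("Request failed:", e.reason)
except socket.timeout:
    # Depending on where the timeout hits, it may also be raised directly
    print("Request timed out")
```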
2. Response methods
```python
bytes = response.read()                   # 1. Raw page source (bytes)
string = response.read().decode("utf-8")  # 2. Page source decoded to str
url = response.geturl()                   # 3. URL of the fetched resource
code = response.getcode()                 # 4. Response status code
string.encode()                           # 5. str  -> bytes
bytes.decode()                            # 6. bytes -> str
```
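A quick round trip between str and bytes to illustrate the last two conversions (a minimal sketch):

```python
s = "美女"
b = s.encode("utf-8")     # str -> bytes: b'\xe7\xbe\x8e\xe5\xa5\xb3'
print(b.decode("utf-8"))  # bytes -> str: prints 美女
```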
3. Wrapping the request
3.1 User-Agent
urllib.request.Request()
Purpose: create a request object (wrap the request and override the User-Agent so the program looks more like a normal human visitor)
Parameters:
- url: the URL to request
- headers: request headers to add
Usage:
- Build the request object (overriding the User-Agent)
- Send the request and get the response object (urlopen)
- Read the response content
```python
from urllib import request

url = "http://www.httpbin.org/get"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

# Build the request object, send it to httpbin, and get the response object
req = request.Request(url, headers=header)
res = request.urlopen(req)
html = res.read().decode("utf-8")
print(html)
```
In addition, Python has a third-party library, fake_useragent, for generating random User-Agent strings:

```python
from fake_useragent import UserAgent

ua = UserAgent()
# Each access to ua.random picks a fresh random User-Agent string
print({"User-Agent": ua.random})
print({"User-Agent": ua.random})
```
3.2 URL encoding
urllib.parse.urlencode()
Before encoding: https://www.baidu.com/s?&wd=美女
After encoding: https://www.baidu.com/s?&wd=%E7%BE%8E%E5%A5%B3
```
In [1]: from urllib import parse

In [2]: query_string = {"wd": "美女"}

In [3]: result = parse.urlencode(query_string)

In [4]: print(result)
wd=%E7%BE%8E%E5%A5%B3
```
**Requesting Baidu**

```python
from urllib import parse, request
from fake_useragent import UserAgent

header = {"User-Agent": UserAgent().random}
query_string = {"wd": "美女"}
url = "https://www.baidu.com/s?" + parse.urlencode(query_string)

req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode("utf-8")
print(html)
```
Encoding a single string (quote() encodes a bare string, while urlencode() above takes a dict):
```
In [10]: import urllib.parse

In [11]: string = urllib.parse.quote("美女")

In [13]: "https://www.baidu.com/s?wd=" + string
Out[13]: 'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'
```
Decoding a string:
```
In [14]: import urllib.parse

In [15]: str = "%E7%BE%8E%E5%A5%B3"

In [16]: urllib.parse.unquote(str)
Out[16]: '美女'
```
3.3 Multiple parameters
```
In [1]: from urllib import parse

In [2]: query_string_dic = {
   ...:     "wd": "美女",
   ...:     "pn": "50"
   ...: }

In [3]: r = parse.urlencode(query_string_dic)

In [4]: r
Out[4]: 'wd=%E7%BE%8E%E5%A5%B3&pn=50'
```
Notice that urlencode() automatically joins the two parameters with an &, so the encoded string can be used directly to build the request URL.
3.4 Building the URL
String concatenation
'https://www.baidu.com/s?' + urlencode({"wd" : "美女", "pn" : "50"})
String formatting (placeholder)
'https://www.baidu.com/s?%s' % urlencode({"wd" : "美女", "pn" : "50"})
format()
'https://www.baidu.com/s?{}'.format(urlencode({"wd" : "美女", "pn" : "50"}))
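All three forms produce the same URL; a quick sanity check (a sketch reusing the query from above):

```python
from urllib.parse import urlencode

qs = urlencode({"wd": "美女", "pn": "50"})
u1 = "https://www.baidu.com/s?" + qs
u2 = "https://www.baidu.com/s?%s" % qs
u3 = "https://www.baidu.com/s?{}".format(qs)
assert u1 == u2 == u3
print(u1)  # https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3&pn=50
```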
Request Baidu search results for a keyword and save the response to a local file:
```python
from urllib import request, parse
from fake_useragent import UserAgent

wd = input("Enter a keyword: ")
url = "https://www.baidu.com/s?{}".format(parse.urlencode({"wd": wd}))
header = {"User-Agent": UserAgent().random}

req = request.Request(url=url, headers=header)
res = request.urlopen(req)
html = res.read().decode("utf-8")

filename = "{}.html".format(wd)
with open(filename, "w") as f:
    f.write(html)
```
Encoding problems come up frequently when reading and writing files. On Windows the default file encoding is gbk, so writing the UTF-8 page content above without an explicit encoding raises:

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 30904: illegal multibyte sequence

The fix is to pass the encoding parameter when opening the file:

with open("demo.txt", "w", encoding="gb18030") as f

gb18030 is the Chinese national-standard codec and recognizes far more characters than gbk.
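A minimal sketch of the fix in full (the filename demo.txt and the sample text are illustrative only):

```python
# gb18030 is a superset of gbk that can represent every Unicode character,
# so characters such as '»' (\xbb) no longer raise UnicodeEncodeError
text = "sample content »"
with open("demo.txt", "w", encoding="gb18030") as f:
    f.write(text)

with open("demo.txt", encoding="gb18030") as f:
    print(f.read())  # sample content »
```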
4. Example
Scraping Baidu Tieba
Requirements
- Prompt for the tieba (forum) name
- Prompt for the start page
- Prompt for the end page
- Save each page to a local file: page-1.html, page-2.html ...
Implementation
```python
from urllib import parse, request
from fake_useragent import UserAgent
import time
import random


class BaiduSpider(object):
    def __init__(self):
        self.url = "http://tieba.baidu.com/f?kw={}&pn={}"
        self.headers = {"User-Agent": UserAgent().random}

    # Fetch one response page
    def get_page(self, url):
        req = request.Request(url=url, headers=self.headers)
        res = request.urlopen(req)
        html = res.read().decode("utf-8")
        return html

    # Extract data (left as a stub in the original)
    def parse_page(self):
        pass

    # Save data
    def write_page(self, filename, html):
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)

    # Main entry point
    def main(self):
        name = input("Enter the tieba name: ")
        start = int(input("Enter the start page: "))
        end = int(input("Enter the end page: "))
        # Build the URL for each page and send the request
        for page in range(start, end + 1):
            pn = (page - 1) * 50
            kw = parse.quote(name)
            url = self.url.format(kw, pn)
            # Fetch the response and save it
            html = self.get_page(url)
            filename = "{}-page-{}.html".format(name, page)
            self.write_page(filename, html)
            # Throttle: sleep a random 1-3 seconds between requests
            time.sleep(random.randint(1, 3))


if __name__ == '__main__':
    spider = BaiduSpider()
    spider.main()
```
Source: https://www.cnblogs.com/chancey/p/11494260.html