I. urllib
1. Making a request
urllib.request.urlopen()
Parameters:
- url: the URL to fetch
- timeout: sets the wait time; an exception is raised if no response arrives within the specified time
```python
# Import the module
import urllib.request

url = "http://www.baidu.com/"
# Send a request to Baidu and get a response object
html = urllib.request.urlopen(url)

print(html.read().decode("utf-8"))  # Page source, as a str
print(html.status)                  # Response status code
```
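The example above does not exercise the timeout parameter. Here is a minimal sketch of using it (the 0.01-second value is deliberately tiny, chosen only to force the failure):

```python
import socket
import urllib.error
import urllib.request

try:
    res = urllib.request.urlopen("http://www.baidu.com/", timeout=0.01)
except urllib.error.URLError as e:
    # A timeout usually surfaces as a URLError wrapping socket.timeout
    print("Request failed:", e.reason)
except socket.timeout:
    # Depending on where the timeout hits, it may also be raised directly
    print("Request timed out")
```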
2. Response methods
```python
bytes = response.read()                   # 1. Raw page source (bytes)
string = response.read().decode("utf-8")  # 2. Page source decoded to str
url = response.geturl()                   # 3. URL of the fetched resource
code = response.getcode()                 # 4. Response status code
string.encode()                           # 5. str  -> bytes
bytes.decode()                            # 6. bytes -> str
```
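A quick round trip between str and bytes to illustrate the last two conversions (a minimal sketch):

```python
s = "美女"
b = s.encode("utf-8")     # str -> bytes: b'\xe7\xbe\x8e\xe5\xa5\xb3'
print(b.decode("utf-8"))  # bytes -> str: prints 美女
```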
3. Wrapping the request
3.1 User-Agent
urllib.request.Request()
Purpose: create a request object (wrap the request and override the User-Agent so the program looks more like a normal human visitor)
Parameters:
- url: the URL to request
- headers: request headers to add
Usage:
- Build the request object (overriding the User-Agent)
- Send the request and get the response object (urlopen)
- Read the response content
```python
from urllib import request

url = "http://www.httpbin.org/get"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

# Build the request object, send it to httpbin, and get the response object
req = request.Request(url, headers=header)
res = request.urlopen(req)
html = res.read().decode("utf-8")
print(html)
```
In addition, Python has a third-party library, fake_useragent, for generating random User-Agent strings:

```python
from fake_useragent import UserAgent

ua = UserAgent()
# Each access to ua.random picks a fresh random User-Agent string
print({"User-Agent": ua.random})
print({"User-Agent": ua.random})
```
3.2 URL encoding
urllib.parse.urlencode()
Before encoding: https://www.baidu.com/s?&wd=美女
After encoding: https://www.baidu.com/s?&wd=%E7%BE%8E%E5%A5%B3
```
In [1]: from urllib import parse

In [2]: query_string = {"wd": "美女"}

In [3]: result = parse.urlencode(query_string)

In [4]: print(result)
wd=%E7%BE%8E%E5%A5%B3
```
**Requesting Baidu**

```python
from urllib import parse, request
from fake_useragent import UserAgent

header = {"User-Agent": UserAgent().random}
query_string = {"wd": "美女"}
url = "https://www.baidu.com/s?" + parse.urlencode(query_string)

req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode("utf-8")
print(html)
```
Encoding a single string (quote() encodes a bare string, while urlencode() above takes a dict):
```
In [10]: import urllib.parse

In [11]: string = urllib.parse.quote("美女")

In [13]: "https://www.baidu.com/s?wd=" + string
Out[13]: 'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'
```
Decoding a string:
```
In [14]: import urllib.parse

In [15]: str = "%E7%BE%8E%E5%A5%B3"

In [16]: urllib.parse.unquote(str)
Out[16]: '美女'
```
3.3 Multiple parameters
```
In [1]: from urllib import parse

In [2]: query_string_dic = {
   ...:     "wd": "美女",
   ...:     "pn": "50"
   ...: }

In [3]: r = parse.urlencode(query_string_dic)

In [4]: r
Out[4]: 'wd=%E7%BE%8E%E5%A5%B3&pn=50'
```
Notice that urlencode() automatically joins the two parameters with an &, so the encoded string can be used directly to build the request URL.
3.4 Building the URL
String concatenation
'https://www.baidu.com/s?' + urlencode({"wd" : "美女", "pn" : "50"})
String formatting (placeholder)
'https://www.baidu.com/s?%s' % urlencode({"wd" : "美女", "pn" : "50"})
format()
'https://www.baidu.com/s?{}'.format(urlencode({"wd" : "美女", "pn" : "50"}))
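All three forms produce the same URL; a quick sanity check (a sketch reusing the query from above):

```python
from urllib.parse import urlencode

qs = urlencode({"wd": "美女", "pn": "50"})
u1 = "https://www.baidu.com/s?" + qs
u2 = "https://www.baidu.com/s?%s" % qs
u3 = "https://www.baidu.com/s?{}".format(qs)
assert u1 == u2 == u3
print(u1)  # https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3&pn=50
```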
Request Baidu search results for a keyword and save the response to a local file:
```python
from urllib import request, parse
from fake_useragent import UserAgent

wd = input("Enter a keyword: ")
url = "https://www.baidu.com/s?{}".format(parse.urlencode({"wd": wd}))
header = {"User-Agent": UserAgent().random}

req = request.Request(url=url, headers=header)
res = request.urlopen(req)
html = res.read().decode("utf-8")

filename = "{}.html".format(wd)
with open(filename, "w") as f:
    f.write(html)
```
Encoding problems come up frequently when reading and writing files. On Windows the default file encoding is gbk, so writing the UTF-8 page content above without an explicit encoding raises:

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 30904: illegal multibyte sequence

The fix is to pass the encoding parameter when opening the file:

with open("demo.txt", "w", encoding="gb18030") as f

gb18030 is the Chinese national-standard codec and recognizes far more characters than gbk.
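A minimal sketch of the fix in full (the filename demo.txt and the sample text are illustrative only):

```python
# gb18030 is a superset of gbk that can represent every Unicode character,
# so characters such as '»' (\xbb) no longer raise UnicodeEncodeError
text = "sample content »"
with open("demo.txt", "w", encoding="gb18030") as f:
    f.write(text)

with open("demo.txt", encoding="gb18030") as f:
    print(f.read())  # sample content »
```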
4. Example
Scraping Baidu Tieba
Requirements
- Prompt for the tieba (forum) name
- Prompt for the start page
- Prompt for the end page
- Save each page to a local file: page-1.html, page-2.html ...
Implementation
```python
from urllib import parse, request
from fake_useragent import UserAgent
import time
import random


class BaiduSpider(object):
    def __init__(self):
        self.url = "http://tieba.baidu.com/f?kw={}&pn={}"
        self.headers = {"User-Agent": UserAgent().random}

    # Fetch one response page
    def get_page(self, url):
        req = request.Request(url=url, headers=self.headers)
        res = request.urlopen(req)
        html = res.read().decode("utf-8")
        return html

    # Extract data (left as a stub in the original)
    def parse_page(self):
        pass

    # Save data
    def write_page(self, filename, html):
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)

    # Main entry point
    def main(self):
        name = input("Enter the tieba name: ")
        start = int(input("Enter the start page: "))
        end = int(input("Enter the end page: "))
        # Build the URL for each page and send the request
        for page in range(start, end + 1):
            pn = (page - 1) * 50
            kw = parse.quote(name)
            url = self.url.format(kw, pn)
            # Fetch the response and save it
            html = self.get_page(url)
            filename = "{}-page-{}.html".format(name, page)
            self.write_page(filename, html)
            # Throttle: sleep a random 1-3 seconds between requests
            time.sleep(random.randint(1, 3))


if __name__ == '__main__':
    spider = BaiduSpider()
    spider.main()
```
Source: https://www.cnblogs.com/chancey/p/11494260.html