Warning: the code in this post is for learning purposes only; illegal or commercial use is prohibited.
The popular novel 《黎明之剑》 is used as the example here. If you enjoy 《黎明之剑》, please support the official release.
The same approach works for essentially any other book on the 笔趣阁 (biqiuge) site, and other pirate sites can usually be handled with much the same idea.
With a few small changes the script could walk the book's table-of-contents page and download the whole novel. I haven't bothered to do that here, but a rough sketch of the idea follows the main script below for anyone who wants to try.
# -*- coding:UTF-8 -*-
# Author's blog: https://www.cnblogs.com/Raine/
# 2019-06-20
import requests
from bs4 import BeautifulSoup


class TheLatest(object):
    # Test scraper: grab the latest chapter of 《黎明之剑》 from biqiuge
    def __init__(self):
        self.url_dir = 'https://www.biqiuge.com/book/36438/'
        self.bookname = ""      # book title
        self.chaptername = ""   # latest chapter title
        self.url_latest = ""    # URL of the latest chapter
        self.get_download_url()

    def get_download_url(self):
        # Everything we need is already in the <meta> tags of the page's <head>
        r1 = requests.get(self.url_dir)
        # The page is GBK-encoded, so set the encoding before reading .text
        r1.encoding = 'GBK'
        html_1 = r1.text
        bs_div = BeautifulSoup(html_1, 'lxml')
        # Find the relevant tags and pull the values out of their content attributes
        _bookname = bs_div.find('meta', property="og:novel:book_name")
        self.bookname = _bookname.get('content')
        _chaptername = bs_div.find('meta', property='og:novel:latest_chapter_name')
        self.chaptername = _chaptername.get('content')
        _url_latest = bs_div.find('meta', property='og:novel:latest_chapter_url')
        self.url_latest = _url_latest.get('content')

    def get_content(self):
        r2 = requests.get(self.url_latest)
        r2.encoding = 'GBK'
        html_content = r2.text
        bs_div = BeautifulSoup(html_content, 'lxml')
        txt = bs_div.find('div', 'showtxt')
        # Tidy up the text layout: break paragraphs at the indent spaces
        txt = txt.text.replace(' ', '\n ')
        # Swap a mojibake sequence in the decoded text for the middle dot
        txt = txt.replace('�6�1', '·')
        # Keep only what comes before the chapter URL (strips trailing site links, if any)
        out_content = txt.split(self.url_latest)[0]
        return out_content


if __name__ == '__main__':
    txt_content = TheLatest()
    filename = txt_content.bookname + txt_content.chaptername + '.txt'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(txt_content.get_content())
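
As noted at the top, extending this into a full-book downloader mostly means walking the table-of-contents page instead of just reading the latest-chapter meta tag. Below is a rough, untested sketch of that idea using the same requests/BeautifulSoup approach. The chapter-list selector ('div', 'listmain'), the <h1> chapter title, and the download_book() helper itself are my assumptions about biqiuge's page layout, not something verified against the site, so adjust the selectors to whatever the pages actually serve.

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def download_book(url_dir='https://www.biqiuge.com/book/36438/', outfile='book.txt'):
    # Fetch the table-of-contents page (GBK-encoded, like the rest of the site)
    r = requests.get(url_dir)
    r.encoding = 'GBK'
    soup = BeautifulSoup(r.text, 'lxml')
    bookname = soup.find('meta', property='og:novel:book_name').get('content')
    # Assumption: the chapter links sit inside a <div class="listmain"> block
    toc = soup.find('div', 'listmain')
    links = [urljoin(url_dir, a.get('href')) for a in toc.find_all('a') if a.get('href')]
    with open(outfile, 'w', encoding='utf-8') as f:
        f.write(bookname + '\n\n')
        for url in links:
            r2 = requests.get(url)
            r2.encoding = 'GBK'
            page = BeautifulSoup(r2.text, 'lxml')
            # Assumption: the chapter title is in an <h1> tag on the chapter page
            title = page.find('h1').text
            body = page.find('div', 'showtxt').text
            f.write(title + '\n' + body + '\n\n')
            time.sleep(1)  # be polite to the server between requests

Calling download_book() would then write the whole book into a single text file, one request per chapter; the clean-up steps from get_content() (indent handling, mojibake replacement, trimming the trailing links) could be reused on each chapter body.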