Python scraper: the novel 庆余年 and a haphazard word-cloud analysis

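Down to business. First I needed the text itself, so I searched around and found a site called 落霞 (luoxia.com). Without further ado I hit F12 and looked through the page source, which turned out to be very simple.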


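from bs4 import BeautifulSoup
from requests import Session
from re import sub, DOTALL  # used further down, when cleaning the second site

sess = Session()
txt = []  # one string per chapter
url = 'https://www.luoxia.com/qing/48416.htm'

def find(url):
    # Follow the rel="next" link from chapter to chapter until there is none.
    # A loop avoids hitting Python's recursion limit on a long novel.
    while url:
        res = sess.get(url)
        soup = BeautifulSoup(res.content, 'html.parser')
        title = soup.find('title')
        div = soup.find('div', id='nr1')  # container that holds the chapter text
        page = title.text + '\n'
        print(page)
        for p in div.find_all('p'):
            page += p.text + '\n'
        txt.append(page)
        a = soup.find('a', rel='next')
        url = a.get('href') if a else None  # no next link: this was the last chapter

find(url)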

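The page structure is really clean and regular: the chapter title sits in the title tag, the body is in a single div with id "nr1", and every paragraph is wrapped in its own p tag. The chapter URLs are not consecutive numbers, though, so instead of generating them I simply follow the links from one chapter to the next. The link to the next chapter is in an a tag, and it even carries a rel="next" attribute. Kudos to the 落霞 developers. I regretted that almost immediately, though: the site is on the slow side, something like one chapter per second?
So I switched to another site, 书趣阁 (shuquge.com). This one is fast, but its developers do not like labelling their tags.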
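
Both sites are walked the same way: fetch a page, pull out the text and the next-chapter link, repeat. Here is a minimal sketch of just that loop, with the network and the HTML parsing swapped out for stand-ins so it can be tried offline. The names crawl, fetch, parse, delay and max_pages are my own placeholders, not anything from requests or BeautifulSoup.

```python
import time

def crawl(start, fetch, parse, delay=0.0, max_pages=5000):
    """Follow next-chapter links from `start` and return the chapters in order.

    fetch(url) -> raw page; parse(raw) -> (chapter_text, next_url or None).
    Both callables are hypothetical stand-ins for the requests/BeautifulSoup code.
    """
    chapters, seen, url = [], set(), start
    while url and url not in seen and len(chapters) < max_pages:
        seen.add(url)                    # guard against a link cycle
        text, url = parse(fetch(url))
        chapters.append(text)
        if delay:
            time.sleep(delay)            # optional pause between requests
    return chapters

# A fake three-page site whose file names are deliberately not in order.
pages = {
    'a.htm': ('chapter 1', 'c.htm'),
    'c.htm': ('chapter 2', 'b.htm'),
    'b.htm': ('chapter 3', None),
}
chapters = crawl('a.htm', fetch=pages.get, parse=lambda raw: raw)
print(chapters)
```

The seen set and max_pages are only there so that a site whose last chapter links back to an earlier page cannot keep the loop running forever.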


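url = '17754382.html'
shu = []  # one string per chapter

def shuquge(url):
    # Chapter links on this site are relative to the book's directory.
    base = 'http://www.shuquge.com/txt/83203/'
    # Stop once the 下一章 (next chapter) link points at the index page.
    while url and 'index' not in url:
        res = sess.get(base + url)
        soup = BeautifulSoup(res.content, 'html.parser')
        h1 = soup.find('h1')
        div = soup.find('div', id="content")
        page = str(div)
        page = page.replace('<div class="showtxt" id="content">', '')
        page = page.replace('<br/>', '')
        # The trailing ad starts with a URL: drop everything from 'http' onwards.
        page = sub('http.*', '', page, flags=DOTALL)
        shu.append(h1.text + '\n' + page)
        print(h1.text)
        hrefs = [a.get('href') for a in soup.find_all('a') if a.text == '下一章']
        url = hrefs[0] if hrefs else None

shuquge(url)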

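The tags have hardly any useful attributes, and there are ads all over the place, even inside the chapter text, so I have to strip those out myself.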
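
The clean-up step can also be pulled out into a small function and tried on a made-up snippet. This is only a sketch: the sample string and the example.com ad are invented, and unlike my code above it turns br tags into newlines instead of deleting them.

```python
import re

def clean_chapter(raw_html):
    """Reduce the HTML of one content div to plain text (an illustrative sketch)."""
    text = re.sub(r'</?div[^>]*>', '', raw_html)         # drop the wrapper tags
    text = re.sub(r'<br\s*/?>', '\n', text)               # line breaks become newlines
    text = re.sub(r'http.*', '', text, flags=re.DOTALL)   # cut the trailing ad, which starts with a URL
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)      # drop empty lines

# Invented sample in the shape of the site's content div.
sample = ('<div class="showtxt" id="content">First paragraph.<br/><br/>'
          'Second paragraph.<br/>http://www.example.com/ad Read more here!</div>')
print(clean_chapter(sample))
```

One caveat: cutting at the first "http" is crude. If a chapter ever contained a URL in its own text, everything after it would be lost too.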

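import cv2
import jieba
from wordcloud import WordCloud

# Image whose shape the cloud should take; pure white areas are left empty.
img = cv2.imread('c2cec1e832a833ded3f6f9bbc226ae2f.jpeg')
# WordCloud does not segment Chinese itself, so split the text into words with jieba first.
content = ' '.join(jieba.cut(''.join(shu)))
wordshow = WordCloud(background_color='white',
                     width=800,     # width and height are ignored when a mask is given
                     height=800,
                     max_words=800,
                     max_font_size=100,
                     font_path="msyh.ttc",    # Microsoft YaHei, so Chinese characters render
                     mask=img,
                     mode='RGBA'
                     ).generate(content)
wordshow.to_file('word.png')  # save the cloud as an image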
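
If the cloud ends up dominated by filler words, one option is to count the words yourself and filter before drawing. A rough sketch, assuming the tokens come from jieba.cut; the function name word_frequencies, its min_len and stopwords parameters, and the tiny token list are all made up for illustration.

```python
from collections import Counter

def word_frequencies(tokens, stopwords=(), min_len=2):
    """Count tokens, skipping stopwords, whitespace and very short tokens."""
    stop = set(stopwords)
    kept = (t.strip() for t in tokens)
    return Counter(t for t in kept if len(t) >= min_len and t not in stop)

# Hand-made token list standing in for the output of jieba.cut(...)
tokens = ['范闲', '的', '父亲', '范闲', '，', '说', '陛下', '范闲', '陛下', '什么']
freq = word_frequencies(tokens, stopwords={'什么'})
print(freq.most_common(2))
```

The resulting counts can be handed to WordCloud's generate_from_frequencies method in place of calling generate on the joined text.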

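Honestly, I only wanted to download the novel. But with time on my hands and nothing better to do, I figured I might as well write some code. My friends and I are all doing it, earning 0 yuan a day.
Oh right, let's save the text as well.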


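# The pages are UTF-8; on Windows the default gbk codec cannot store all of the text,
# so pass the encoding explicitly to the built-in open().
with open('庆余年.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(shu))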
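
To see why the explicit encoding matters, here is a small self-contained check. The sample string is invented; the non-breaking space in it is the kind of character that tends to show up in scraped pages and that gbk has no slot for.

```python
import os
import tempfile

sample = '第一章\xa0故事'   # \xa0 is a non-breaking space

try:
    sample.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print('gbk can encode the sample:', gbk_ok)

# Writing with an explicit UTF-8 encoding works regardless of the platform default.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)
with open(path, encoding='utf-8') as f:
    round_trip = f.read()
print(round_trip == sample)
```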


Source: CSDN

Author: 1213清心

Link: https://blog.csdn.net/qq_39666130/article/details/104118698
