Using the jieba Library

1. Overview of jieba

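jieba is a popular third-party library for Chinese word segmentation.

Chinese text has no spaces between words, so it must be segmented to obtain individual words.
jieba is not part of the Python standard library and has to be installed separately.
jieba offers three segmentation modes; for the simplest use you only need to learn one function, jieba.lcut.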

2. Installing jieba

(from the command line) pip install jieba

3. How jieba segments text

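jieba uses a Chinese word dictionary to estimate how likely adjacent characters are to form a word.
Character sequences with a high probability are grouped into words, and these words form the segmentation result.
Besides segmenting text, jieba lets users add their own custom words.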
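
The idea above can be illustrated with a toy segmenter. This is only a sketch under stated assumptions, not jieba's actual code: the FREQ table holds made-up frequencies for a handful of words, and segment is a hypothetical helper that picks the split with the highest total probability by dynamic programming. jieba's real implementation is more elaborate (its documentation describes a prefix dictionary, a graph of candidate words, and a separate model for unknown words).

```python
import math

# Made-up frequencies for illustration only; a real dictionary is far larger.
FREQ = {"中国": 50, "国是": 2, "是": 80, "一个": 60, "一": 30, "个": 30,
        "伟大": 20, "的": 100, "国家": 40, "中": 10, "国": 10, "家": 10,
        "伟": 1, "大": 10}
TOTAL = sum(FREQ.values())


def segment(sentence):
    """Split sentence into the word sequence with the highest probability."""
    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[i:],
    #            end index of the first word in that split)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Single characters are always allowed so a split always exists.
            if word in FREQ or j == i + 1:
                logp = math.log(FREQ.get(word, 1) / TOTAL)
                candidates.append((logp + best[j][0], j))
        best[i] = max(candidates)
    # Follow the stored end indices from the start to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("中国是一个伟大的国家"))
```

With these made-up frequencies, "中国" followed by "是" is far more probable than "中" followed by "国是", so the sketch keeps "中国" as one word.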

4. Using jieba

4.1 The three segmentation modes

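Precise mode: cuts the text into words exactly, with no redundant words
Full mode: scans out every possible word in the text, so the result contains redundancy
Search engine mode: starts from precise mode and splits long words again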

4.2 Commonly used functions

jieba.lcut(s) ★: precise mode; returns the segmentation result as a list

>>> import jieba
>>> jieba.lcut("中国是一个伟大的国家")

['中国', '是', '一个', '伟大', '的', '国家']

jieba.lcut(s, cut_all=True): full mode; returns a list that contains redundant words

>>> jieba.lcut("中国是一个伟大的国家", cut_all=True)
['中国', '国是', '一个', '伟大', '的', '国家']

jieba.lcut_for_search(s): search engine mode; returns a list that contains redundant words

>>> jieba.lcut_for_search("中华人民共和国是最伟大的")
['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '最', '伟大', '的']

jieba.add_word(w): adds the new word w to the segmentation dictionary

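Word frequency examples:
English text: Hamlet (English edition)
Key points: remove noise from the text and normalize it, then use a dictionary to hold the word counts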

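def getText():
    # Read the whole file and close it when done
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # normalize case so "The" and "the" count together
    for ch in '!"#$%&()*+,-./:;<=>?@{}[\\]^_|~·':
        txt = txt.replace(ch, " ")  # turn each punctuation mark into a space
    return txt  # return only after every character has been replaced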

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1
items = list(counts.items()) # each key-value pair in the list is a tuple
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i] # unpack the (word, count) tuple
    print("{0:<10}{1:>5}".format(word, count))

Output:

the        1137
and         936
to          728
of          665
a           527
i           515
my          513
in          423
hamlet      407
you         406
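
The same counting can also be sketched with collections.Counter from the standard library. This is an alternative to the dictionary loop above, not the article's method: top_words is a hypothetical helper, it strips the characters in string.punctuation rather than the exact character list used above, and the sample sentence stands in for hamlet.txt.

```python
import string
from collections import Counter

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    words = text.lower().translate(PUNCT_TO_SPACE).split()
    return Counter(words).most_common(n)


sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common replaces the manual conversion to a list, the sort, and the slicing of the first n items.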

Chinese text: Romance of the Three Kingdoms

import jieba
txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:>2}.{1:<10}{2:>5}".format(i+1, word, count))

Output:

 1.曹操          953
 2.孔明          836
 3.将军          772
 4.却说          656
 5.玄德          585
 6.关公          510
 7.丞相          491
 8.二人          469
 9.不可          440
10.荆州          425
11.玄德曰         390
12.孔明曰         390
13.不能          384
14.如此          378
15.张飞          358

Problems encountered along the way:

ValueError: cannot switch from automatic field numbering to manual field specification
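This error means the format string mixes automatically numbered fields such as {} with manually numbered ones such as {0}. Every field in the print format string must use the same style, here explicit indices.
The result is not ideal: entries such as “将军”, “却说”, “玄德” and “孔明曰” still need handling, so the program is refined step by step based on the output while debugging.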
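
The ValueError above can be reproduced in a few lines. This is a minimal sketch; the variable names are chosen here for illustration, and the values are the first row of the output above.

```python
word, count = "曹操", 953  # first row of the output above

error_message = None
try:
    # "{}" is auto-numbered and "{1}" is manually numbered: mixing them fails.
    "{}{1:>5}".format(word, count)
except ValueError as err:
    error_message = str(err)
print(error_message)

# Either style works as long as it is used consistently.
manual = "{0:<10}{1:>5}".format(word, count)
automatic = "{:<10}{:>5}".format(word, count)
print(manual == automatic)
```

Both consistent forms produce the same string, so the fix is simply to number every field or none of them.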

Improved version

import jieba
txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
excludes = {"将军","却说","荆州","二人","不可","不能","如此",\
            "商议","如何","主公","军士","左右","军马","引兵","次日",\
            "大喜","天下","东吴","于是","今日","不敢","魏兵","陛下","一人","都督"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "诸葛亮"  # note: "孔明" itself is not merged here, so it is counted separately
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)  # pop with a default avoids a KeyError if the word never appeared
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:>2}.{1:>3}{2:>10}".format(i+1, word, count))

Output:

 1. 曹操      1451
 2. 刘备      1252
 3. 孔明       836
 4. 关羽       784
 5.诸葛亮       547
 6. 张飞       358
 7. 吕布       300
 8. 赵云       278
 9. 孙权       264
10.司马懿       221
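
As the list of aliases grows, the elif chain can be replaced by a lookup table. The following is a sketch rather than the article's code: ALIASES copies the mappings from the program above, EXCLUDES is shortened to two entries, count_names is a hypothetical helper, and tokens is a hand-made list standing in for the output of jieba.lcut.

```python
from collections import Counter

# Alias -> canonical name, taken from the elif chain above.
ALIASES = {"孔明曰": "诸葛亮", "关公": "关羽", "云长": "关羽",
           "玄德": "刘备", "玄德曰": "刘备", "孟德": "曹操", "丞相": "曹操"}
# Shortened exclusion set for the example.
EXCLUDES = {"将军", "却说"}


def count_names(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count words, skipping single characters and excluded words,
    and merging aliases into one canonical name."""
    counts = Counter()
    for word in words:
        if len(word) == 1 or word in excludes:
            continue
        counts[aliases.get(word, word)] += 1
    return counts


# Hand-made sample tokens, not real segmentation output.
tokens = ["玄德", "曰", "却说", "丞相", "曹操", "玄德曰", "将军"]
print(count_names(tokens).most_common())
```

Filtering excluded words before counting gives the same totals as deleting them afterwards, provided no excluded word is also the target of an alias.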

Source: CSDN

Author: 柳神的迷弟的迷弟的迷弟

Link: https://blog.csdn.net/weixin_42764266/article/details/104595379
