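Introduction

jieba is a widely used Python component for Chinese word segmentation. It supports three segmentation modes: precise mode, full mode, and search engine mode.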

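Installing jieba

In PyCharm, go to File -> Settings -> Project Interpreter -> Add, search for the keyword jieba, and click install. This assumes that a working package source (mirror) has already been configured.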
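
Outside PyCharm, the package is commonly installed from the command line with `pip install jieba`. As a minimal sketch, the check below uses only the standard library to confirm that a module can be found by the current interpreter before running the examples. The helper name `is_installed` is made up for this illustration and is not part of jieba.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be located for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "jieba" is the package this tutorial relies on
    if is_installed("jieba"):
        print("jieba is available")
    else:
        print("jieba not found; try: pip install jieba")
```

If the check reports that jieba is missing even after installation, the usual cause is that the package was installed into a different interpreter than the one the project uses.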

Simple examples of the three modes

# -*- coding: utf-8 -*-
import jieba

seg_str = "好好学习，天天向上。"

print("/".join(jieba.lcut(seg_str)))    # precise mode (the default); lcut returns a list
print("/".join(jieba.lcut(seg_str, cut_all=True)))      # full mode, selected with 'cut_all=True'
print("/".join(jieba.lcut_for_search(seg_str)))     # search engine mode

The segmentation output is as follows:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\j00347382\AppData\Local\Temp\jieba.cache
好好学习/，/天天向上/。
好好/好好学/好好学习/好学/学习/，/天天/天天向上/向上/。
好好/好学/学习/好好学/好好学习/，/天天/向上/天天向上/。
Loading model cost 0.632 seconds.
Prefix dict has been built successfully.

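A simple application of jieba

Task: use jieba to segment a text and find the words that occur most often, using Dream of the Red Chamber (《红楼梦》) as the example.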

# -*- coding: utf-8 -*-
import jieba

# Read the whole novel; the with-block closes the file afterwards
with open(r"C:\Users\Downloads\《红楼梦》作者_曹雪芹_TXT.TXT", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)     # segment the text in precise mode
counts = {}     # maps each word to its number of occurrences

for word in words:
    if len(word) == 1:    # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1    # add 1 each time the word appears

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)    # sort by count, descending

for i in range(5):
    word, count = items[i]
    print("{0:<5}{1:>5}".format(word, count))

The printed result is:

宝玉    3647
什么    1596
----------------------- 1342
一个    1311
贾母    1189
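
The third line of the result is a run of dashes rather than a word, most likely a separator line from the text file that passed the length check. A minimal sketch of one way to filter such tokens is shown here, using `collections.Counter` from the standard library for the counting. The function name `top_words`, its parameters, and the rule "keep a token only if it contains at least one letter or digit" are assumptions made for this illustration, not part of the original script, and the sample token list is invented.

```python
from collections import Counter


def top_words(tokens, n=5, min_len=2):
    """Return the n most frequent tokens as (word, count) pairs.

    Tokens shorter than min_len are skipped, as are tokens that contain
    no letter or digit at all (punctuation, separator lines of dashes).
    Chinese characters count as letters for str.isalnum().
    """
    counts = Counter(
        tok for tok in tokens
        if len(tok) >= min_len and any(ch.isalnum() for ch in tok)
    )
    return counts.most_common(n)


# Invented sample tokens standing in for the output of a segmenter
sample_tokens = ["宝玉", "什么", "宝玉", "，", "-----", "-----", "-----", "贾母", "的", "宝玉"]

for word, count in top_words(sample_tokens, n=3):
    print("{0:<5}{1:>5}".format(word, count))
```

With the real text, the list returned by `jieba.lcut(txt)` would be passed in place of `sample_tokens`.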

Source: oschina

Link: https://my.oschina.net/u/923087/blog/4743134

