基于文本向量空间模型的文本聚类算法

@[vsm|向量空间模型|文本相似度]

本文源地址http://www.houzhuo.net/archives/51.html

vsm概念简单，把对文本内容的处理转化为向量空间中的向量计算，以空间上的相似度来直观表达语义上的相似度。

基于文本向量空间模型的文本聚类算法
- 文本聚类
- 向量空间模型vsm

文本聚类

文本聚类主要依据聚类假设：同类的文档相似度较大，非同类的文档相似度较小。作为一种无监督的机器学习方法，聚类由于不需要训练过程、以及不需要预先对文档手工标注类别，因此具有较高的灵活性和自动化处理能力，成为对文本信息进行有效组织、摘要和导航的重要手段。

向量空间模型vsm

所有的文本都可表现成向量的形式：

向量中的每一维都代表在文档中出现的一个独立词组或单个词，并且我们给每个词组赋予一个权值（最简单就是词频，或者广为人知的tf_idf权重）。所以一个文档就会转换成一个n维的向量。
向量夹角公式

接下来就是利用中学所学的的公式来计算向量之间的夹角，夹角越小即代表较高的相似度。当然，我们比较之前需要将两个向量转化为同一维度（下面的代码中将加以演示）

文本预处理：

__author__ = 'iothz'

import string
from string import *
list_of_all_file =[]
str_of_file1 = ""
str_of_file2 = ""

file1 = open('science.txt', 'r')
for line in file1.readlines():
    nopunc =line.replace(",", "").replace(".", "").replace("?", "").replace("\"", "").replace("\'", "").replace(")", "").replace("(", "").replace("[", " ").replace("]", " ").replace("\n", " ")
    str_of_file1 +=nopunc
list_of_all_file.append(str_of_file1)
file2 = open('science2.txt', 'r')
for line in file2.readlines():
    nopunc =line.replace(",", "").replace(".", "").replace("?", "").replace("\"", "").replace("\'", "").replace(")", "").replace("(", "").replace("[", " ").replace("]", " ").replace("\n", " ")
    str_of_file2 +=nopunc

list_of_all_file.append(str_of_file2)

文本预处理方法各不相同，上面代码去除两个文本的标点，并添加到一个list中方便下面处理

获取每篇文档词频

from collections import Counter
def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon


vocabulary = build_lexicon(list_of_all_file)
print 'the vector of two file is [' + ', '.join(list(vocabulary)) + ']'

the vector of two file is [and, nlp, basketball, love, her, i, baseball, you, cins]

这里引入了一个新的Python对象Counter用来在一个循环中进行计数。结果统计出每个单词出现的次数，但是我们现在还不能比较，因为他们的不在同一词汇空间中。

获得相同长度的向量

def freq(term, document):
    return document.split().count(term)

def tf(term, document):
    return freq(term, document)

doc_term_matrix = []

for doc in list_of_all_file:
    #print '****'
    print 'the doc is "' + doc + ' " '
    tf_vector = [tf(word, doc) for word in vocabulary]
    print tf_vector
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    print 'the tf vector for Document %d is [%s]' % ((list_of_all_file.index(doc)+1), tf_vector_string)

    doc_term_matrix.append(tf_vector)
print 'All combined, here is out master document term matrix: '
print doc_term_matrix

the doc is "i love basketball i love you and nlp " 
[1, 1, 1, 2, 0, 2, 0, 1, 0]
the tf vector for Document 1 is [1, 1, 1, 2, 0, 2, 0, 1, 0]
the doc is "i love baseball i love her and cins " 
[1, 0, 0, 2, 1, 2, 1, 0, 1]
the tf vector for Document 2 is [1, 0, 0, 2, 1, 2, 1, 0, 1]
All combined, here is out master document term matrix: 
[[1, 1, 1, 2, 0, 2, 0, 1, 0], [1, 0, 0, 2, 1, 2, 1, 0, 1]]

根据这段代码我们得到了相同长度的量化结果，量化结果的长度是由语料库决定的。有过接触机器学习经验的人都知道，为了避免个别单词在文档中过于频繁的出现，影响分析结果，我们要对每个词频向量进行比例缩放（归一化）。

归一化

import math
import numpy as np

def normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_normalizer = []
for vec in doc_term_matrix:
    doc_term_matrix_normalizer.append(normalizer(vec))
print 'A regular old document term matrix '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise  norms of 1:'
print np.matrix(doc_term_matrix_normalizer)

A regular old document term matrix 
[[1 1 1 2 0 2 0 1 0]
 [1 0 0 2 1 2 1 0 1]]

A document term matrix with row-wise norms of 1:
[[ 0.28867513  0.28867513  0.28867513  0.57735027  0.          0.57735027
   0.          0.28867513  0.        ]
 [ 0.28867513  0.          0.          0.57735027  0.28867513  0.57735027
   0.28867513  0.          0.28867513]]

我们就这样得到了一个归一化过后的向量，并且没有丢失过多信息。但是比如在一篇文章中，“我”，“的”这类高频词汇对我们的做相似性比较似乎并没有什么作用，因为每篇文章中都会出现，反而会影响结果。所以我们将引入最通用的一种文本权值计算方法tf-idf

idf频率加权

def numDocsContaining(word, doclist):
    docCount = 0
    for doc in doclist:
        if(freq(word, doc) > 0):
            docCount +=1
        return docCount
def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / 1+df)

my_idf_vector = [idf(word, list_of_all_file) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'

Our vocabulary vector is [and, nlp, basketball, love, her, i, baseball, you, cins]
The inverse document frequency vector is
 [1.098612, 1.098612, 1.098612, 1.098612, 0.693147, 1.098612, 0.693147, 1.098612, 0.693147]

tf-idf是一种统计方法，它的主要思想是：如果某个词或短语在一篇文章中出现的频率TF高，并且在其他文章中很少出现，则认为此词或者短语具有很好的类别区分能力，适合用来分类。

TF词频(Term Frequency)易于理解
而IDF逆向文件频率(Inverse Document Frequency) 是一个词语普遍重要性的度量。某一特定词语的IDF，可以由总文件数目除以包含该词语之文件的数目，再将得到的商取对数得到：

其中
- |D|：语料库中的文件总数
- ：包含词语的文件数目（即的文件数目）如果该词语不在语料库中，就会导致分母为零，因此一般情况下使用
  作为分母。

我们快得到想要的结果了。为了得到TF-IDF加权词向量，你必须做一个简单的计算：tf * idf。
如果你用一个AxB的向量乘以另一个AxB的向量，你将得到一个大小为AxA的向量，或者一个标量。我们不会那么做，因为我们想要的是一个具有相同维度(1 x词数量)的词向量，向量中的每个元素都已经被自己的idf权重加权了，所以：

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector),len(idf_vector)))
    np.fill_diagonal(idf_mat,idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)
print my_idf_matrix

[[ 1.09861229  0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.          1.09861229  0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.          0.          1.09861229  0.          0.          0.          0.
   0.          0.        ]
 [ 0.          0.          0.          1.09861229  0.          0.          0.
   0.          0.        ]
 [ 0.          0.          0.          0.          0.69314718  0.          0.
   0.          0.        ]
 [ 0.          0.          0.          0.          0.          1.09861229
 ...]]

这样我们就将IDF向量转化为BxB的矩阵了，矩阵的对角线就是IDF向量。这意味着我们现在可以用逆文档词频矩阵乘以每一个词频向量了。当然，我们在其中还是要做一次归一化操作

tf-idf加权并归一化

doc_term_tfidf__matrix = []
for tf_vector in doc_term_matrix:
    doc_term_tfidf__matrix.append(np.dot(tf_vector, my_idf_matrix))
doc_term_tfidf__matrix_normalizer = []
for tf_vector in doc_term_tfidf__matrix:
    doc_term_tfidf__matrix_normalizer.append(normalizer(tf_vector))
print vocabulary
print np.matrix(doc_term_tfidf__matrix_normalizer)

[[ 0.28867513  0.28867513  0.28867513  0.57735027  0.          0.57735027
   0.          0.28867513  0.        ]
 [ 0.31320094  0.          0.          0.62640189  0.19760779  0.62640189
   0.19760779  0.          0.19760779]]

由此已经计算出了tf-idf权值，最后一步便是计算向量间的夹角了

计算向量间的夹角

x = np.array(doc_term_tfidf__matrix_normalizer[0][:])
y = np.array(doc_term_tfidf__matrix_normalizer[1][:])

Lx = np.sqrt(x.dot(x))
Ly = np.sqrt(y.dot(y))
print Lx, Ly

cos_angle = x.dot(y) / (Lx*Ly)
print  'cos_value: ', cos_angle

angle = np.arccos(cos_angle)
angle2 = angle*360/2/np.pi
print 'angle: ', angle2

similarity = (90-angle2) / 90
print 'similarity is: ' ,similarity

1.0 1.0
cos_value:  0.8137199207
angle:  35.5390135166
similarity is: 0.605122072038

得到最终结果！当然还有最简的方式，就是利用scikit-learn来计算，但是为了夯实基础还是要从最基本的了解起。

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(list_of_all_file)

print tfidf_matrix.todense()

来源：CSDN

作者：Cins侯卓

链接：https://blog.csdn.net/IOThouzhuo/article/details/46592095

标签

向量空间模型

文本分类

文本分析

自然语言处理

vector

tf-idf

replace