[MATLAB] Simple TF-IDF implementation

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-26 16:29:22

TF-IDF (term frequency-inverse document frequency) is one of the most widely used word weighting schemes for normalizing document-term matrices in text mining and information retrieval.

See Wikipedia for details.
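
As a quick reference, the weighting computed by the code in this post can be written as follows. The notation is an assumption introduced here, not taken from the original post: n_{d,t} is the count of term t in document d, and N is the number of documents. This is the plain, unsmoothed variant with a natural logarithm; other references use different normalizations.

```latex
\mathrm{tf}(d,t) = \frac{n_{d,t}}{\sum_{t'} n_{d,t'}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\lvert\{d : n_{d,t} > 0\}\rvert}, \qquad
\mathrm{tfidf}(d,t) = \mathrm{tf}(d,t)\,\mathrm{idf}(t)
```

In the code, a term that occurs in no document gets an IDF of zero, and a document with no terms keeps an all-zero row, so no NaN or Inf values appear.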

 

tfidf

 

function Y = tfidf( X )
% FUNCTION computes TF-IDF weighted word histograms.
%
%   Y = tfidf( X );
%
% INPUT :
%   X        - document-term count matrix (documents in rows, terms in columns)
%
% OUTPUT :
%   Y        - TF-IDF weighted document-term matrix (same size as X)
%

% get term frequencies (each document row is normalized by its total count)
X = tf(X);

% get inverse document frequencies (one weight per term column)
I = idf(X);

% scale each term column by the IDF weight of that term
for j=1:size(X, 2)
    X(:, j) = X(:, j)*I(j);
end
 
Y = X;
 
 
function X = tf(X)
% SUBFUNCTION computes term frequencies within each document

% for every document (row)
for i=1:size(X, 1)
    
    % get the counts of all terms in document i
    x = X(i, :);
    
    % sum all term occurrences in document i
    sumX = sum( x );
    
    % compute the relative frequency of each term in document i
    if sumX ~= 0
        X(i, :) = x / sumX;
    else
        % avoiding NaNs : keep empty documents as all-zero rows
        X(i, :) = 0;
    end
    
end
 
 
function I = idf(X)
% SUBFUNCTION computes inverse document frequencies

% m - number of documents
% n - number of terms or words
[m, n]=size(X);

% allocate space for the per-term idf's
I = zeros(n, 1);

% for every term (column)
for j=1:n
    
    % count the documents in which term j occurs
    nz = nnz( X(:, j) );
    
    % if the term occurs in at least one document, assign a weight:
    if nz
        I(j) = log( m / nz );
    end
    
end
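
The following is a minimal, vectorized sketch of the same computation on a toy matrix, meant only to show what kind of output to expect. It assumes documents are stored in rows and terms in columns, which is the orientation the loops above actually operate on (rows are normalized by their sum, columns are weighted). The matrix values and the variable names `rowSums`, `df`, and `nDocs` are made up for illustration.

```matlab
% Toy count matrix: 3 documents (rows) x 4 terms (columns).
% Assumption: rows are documents, columns are terms.
X = [2 1 0 0;
     0 1 3 0;
     1 1 0 0];

% Term frequency: divide each document row by its total count.
% Empty documents would have a zero sum; divide by 1 so they stay all-zero.
rowSums = sum(X, 2);
rowSums(rowSums == 0) = 1;
TF = bsxfun(@rdivide, X, rowSums);

% Document frequency: number of documents in which each term occurs.
nDocs = size(X, 1);
df = sum(X > 0, 1);

% Inverse document frequency; terms that never occur keep a weight of zero.
I = zeros(1, size(X, 2));
I(df > 0) = log(nDocs ./ df(df > 0));

% Scale every term column by its IDF weight.
Y = bsxfun(@times, TF, I);
disp(Y);
```

In this example the second term occurs in every document, so its IDF is log(3/3) = 0 and its whole column is zeroed out, while the third term, which occurs in only one document, receives the largest weight. The fourth term never occurs and also stays at zero.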
Posted on 2012-08-11 15:30 by Shicai Yang

Reposted from: https://www.cnblogs.com/youth0826/archive/2012/08/11/2633688.html
