Question
Sklearn makes a few tweaks in its implementation of the TF-IDF vectorizer, so to replicate its exact results you would need to add the following to your custom tfidf vectorizer implementation:
- Sklearn's vocabulary is generated from the corpus and sorted in alphabetical order.
- Sklearn's idf formula is different from the standard textbook formula: the constant "1" is added to the numerator and denominator of the idf, as if an extra document were seen containing every term in the collection exactly once, which prevents zero divisions (a small Python sketch of this formula follows the list).
IDF(t) = 1 + ln((1 + total number of documents in the collection) / (1 + number of documents containing term t))
- Sklearn applies L2-normalization to its output matrix.
- The final output of sklearn tfidf vectorizer is a sparse matrix.
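For reference, my reading of that idf formula in plain Python would be (a minimal sketch; the helper name is my own):
import math

def smoothed_idf(n_docs, df):
    # 1 + ln((1 + N) / (1 + df)), with N = total documents, df = documents containing the term
    return 1 + math.log((1 + n_docs) / (1 + df))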
Now given the following corpus:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
Sklearn implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)
print(vectorizer.get_feature_names())
Output: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(skl_output[0])
Output:
(0, 8) 0.38408524091481483
(0, 6) 0.38408524091481483
(0, 3) 0.38408524091481483
(0, 2) 0.5802858236844359
(0, 1) 0.46979138557992045
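To make sense of these numbers, the first row can be reproduced by hand from the corpus above using the smoothed idf and L2 normalization (a rough sanity check I put together; the variable names are my own):
import math
import numpy as np

vocab = sorted({w for doc in corpus for w in doc.split()})
n_docs = len(corpus)

row = []
for word in vocab:
    tf = corpus[0].split().count(word)               # raw count in the first document
    df = sum(word in doc.split() for doc in corpus)  # number of documents containing the word
    idf = 1 + math.log((1 + n_docs) / (1 + df))      # sklearn's smoothed idf
    row.append(tf * idf)

row = np.array(row)
print(row / np.linalg.norm(row))  # the non-zero entries match skl_output[0]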
I need to replicate the above result using a custom implementation, i.e., write the code in plain Python.
I wrote the following code:
from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy
# The fit function helps in creating a vocabulary of all the unique words in the corpus
def fit(dataset):
    storage_set = set()
    if isinstance(dataset, list):
        for document in dataset:
            for word in document.split(" "):
                storage_set.add(word)
    storage_set = sorted(list(storage_set))
    vocab = {j: i for i, j in enumerate(storage_set)}
    return vocab
vocab = fit(corpus)
print(vocab)
Output: {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
This output matches the sklearn output above.
# Returns a sparse matrix of all the non-zero values along with their row and column indices
def transform(dataset, vocab):
    row = []
    col = []
    values = []
    for ibx, document in enumerate(dataset):
        word_freq = dict(Counter(document.split()))
        for word, freq in word_freq.items():
            col_index = vocab.get(word, -1)
            if col_index != -1:
                if len(word) < 2:
                    continue
                col.append(col_index)
                row.append(ibx)
                td = freq / float(len(document))  # the number of times a word occurred in a document
                idf_ = 1 + math.log((1 + len(dataset)) / float(1 + idf(word)))
                values.append(td * idf_)
    return normalize(csr_matrix((values, (row, col)), shape=(len(dataset), len(vocab))), norm='l2')
print(transform(corpus,vocab))
Output:
(0, 1) 0.3989610517704845
(0, 2) 0.602760579899478
(0, 3) 0.3989610517704845
(0, 6) 0.3989610517704845
(0, 8) 0.3989610517704845
As you can see, this output does not match the values in sklearn's output. I went through the logic several times and tried debugging everywhere, but I couldn't figure out why my custom implementation does not match sklearn's output. I would appreciate any insights.
Answer 1:
Can you please check idf() in idf_ = 1 + math.log((1 + len(dataset)) / float(1 + idf(word)))?
While trying to replicate your results, my output matched that of sklearn without any significant change to your transform function. So I think the problem must be in your idf(), which must return the number of documents in which the word w is present in the corpus.
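Since the question does not show idf(), here is a minimal sketch of what it presumably needs to return, i.e. the document frequency of a word (assuming corpus is accessible as a global, as in the transform above; the exact signature in your code may differ):
def idf(word):
    # number of documents in the corpus that contain the word
    return sum(1 for document in corpus if word in document.split())
With such a helper the transform above should line up with sklearn, because dividing a whole row by len(document) is just a constant per-row factor that the L2 normalization cancels out.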
Answer 2:
Here you have to apply L2 normalization, since sklearn does so.
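For example, roughly (using sklearn.preprocessing.normalize, which the question already imports; the toy values are mine):
import numpy as np
from sklearn.preprocessing import normalize

m = np.array([[1.0, 2.0, 2.0]])
print(normalize(m, norm='l2'))                        # [[0.333... 0.666... 0.666...]]
print(m / np.linalg.norm(m, axis=1, keepdims=True))   # same result done by hand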
Source: https://stackoverflow.com/questions/60042735/how-to-build-a-tfidf-vectorizer-given-a-corpus-and-compare-its-results-using-skl