tf-idf

AttributeError: 'int' object has no attribute 'lower' in TFIDF and CountVectorizer

妖精的绣舞 submitted on 2020-08-27 06:44:19
Question: I tried to predict different classes for the incoming messages, working with the Persian language. I used TF-IDF and Naive Bayes to classify my input data. Here is my code:

    import pandas as pd
    df = pd.read_excel('dataset.xlsx')
    col = ['label', 'body']
    df = df[col]
    df.columns = ['label', 'body']
    df['class_type'] = df['label'].factorize()[0]
    class_type_df = df[['label', 'class_type']].drop_duplicates().sort_values('class_type')
    class_type_id = dict(class_type_df.values)
    id_to_class_type = dict(class_type_df[[
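A common cause of this AttributeError is that the text column contains numeric or missing cells, since the vectorizer calls .lower() on every entry. A minimal sketch of that workaround, reusing the question's 'dataset.xlsx' and 'body' column (the cast itself is an assumption about the data, not part of the original code):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_excel('dataset.xlsx')      # same file name as in the question
    # The vectorizer's default preprocessor calls .lower() on every entry, so an
    # integer or NaN cell in the text column raises
    # "AttributeError: 'int' object has no attribute 'lower'".
    # Casting the column to strings (and filling missing values) avoids that.
    df['body'] = df['body'].fillna('').astype(str)

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(df['body'])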

How to make TF-IDF matrix dense?

被刻印的时光 ゝ submitted on 2020-08-17 04:58:22
Question: I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need these two entities to have the same dimension, in my case max_features. Here is what I have:

    tfidf = TfidfVectorizer(max_features=10, strip_accents=
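TfidfVectorizer returns a SciPy sparse matrix; a dense NumPy array of the same shape can be obtained with toarray(). A minimal sketch with a placeholder corpus (note that densifying a large vocabulary over many documents can exhaust memory):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # placeholder corpus
    docs = ["first article text",
            "second article about sports",
            "third article about politics"]

    tfidf = TfidfVectorizer(max_features=10)
    X_sparse = tfidf.fit_transform(docs)   # scipy.sparse matrix, shape (n_docs, n_features)
    X_dense = X_sparse.toarray()           # plain NumPy array of the same shape
    print(X_dense.shape)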

TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams); which accuracy should be considered?

淺唱寂寞╮ submitted on 2020-08-10 19:20:25
Question: Here you can see I am computing the accuracy for Count Vectors, Word-Level TF-IDF, and N-Gram TF-IDF vectors:

    accuracy = train_model(classifier, xtrain_count, train_y, xvalid_count)
    print("NB, Count Vectors: ", accuracy)

    # Naive Bayes on Word Level TF-IDF Vectors
    accuracy = train_model(classifier, xtrain_tfidf, train_y, xvalid_tfidf)
    print("NB, WordLevel TF-IDF: ", accuracy)

    # Naive Bayes on Ngram Level TF-IDF Vectors
    accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
    print("NB
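Each vectorizer defines a separate feature representation, so each accuracy belongs to a different model; the usual practice is to compare them on the same held-out validation set (or via cross-validation) and report the accuracy of the representation you actually deploy. A self-contained sketch of that comparison, with toy data and MultinomialNB standing in for the question's undefined train_model helper:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    def evaluate(vectorizer, train_x, train_y, valid_x, valid_y):
        # fit the vectorizer on training text only, then transform both splits
        xtrain = vectorizer.fit_transform(train_x)
        xvalid = vectorizer.transform(valid_x)
        clf = MultinomialNB().fit(xtrain, train_y)
        return accuracy_score(valid_y, clf.predict(xvalid))

    # hypothetical toy data, just to make the sketch self-contained
    train_x = ["good movie", "bad movie", "great film", "terrible film"]
    train_y = [1, 0, 1, 0]
    valid_x = ["good film", "bad film"]
    valid_y = [1, 0]

    for name, vec in [("Count Vectors", CountVectorizer()),
                      ("WordLevel TF-IDF", TfidfVectorizer()),
                      ("N-Gram TF-IDF", TfidfVectorizer(ngram_range=(2, 3))),
                      ("CharLevel TF-IDF", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)))]:
        print(name, evaluate(vec, train_x, train_y, valid_x, valid_y))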

How to build a TFIDF Vectorizer given a corpus and compare its results using Sklearn?

删除回忆录丶 submitted on 2020-08-09 19:06:50
Question: Sklearn makes a few tweaks in its implementation of the TF-IDF vectorizer, so to replicate the exact results you would need to add the following things to your custom implementation of the tf-idf vectorizer: Sklearn generates its vocabulary sorted in alphabetical order. Sklearn's idf formula is different from the standard textbook formula: here the constant "1" is added to the numerator and denominator of the idf, as if an extra document was seen containing every term in the
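A minimal sketch of a custom implementation that reproduces TfidfVectorizer's defaults on a placeholder corpus: alphabetically sorted vocabulary, smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, raw-count term frequency, and L2 row normalisation:

    import numpy as np
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    # placeholder corpus; lowercase, no single-character tokens, so str.split()
    # matches sklearn's default tokenisation here
    corpus = ["this is the first document",
              "this document is the second document",
              "and this is the third one"]

    # 1. vocabulary sorted alphabetically, as sklearn does
    vocab = sorted({w for doc in corpus for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # 2. sklearn's smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    n = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc.split()))
    idf = np.array([np.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab])

    # 3. raw term counts * idf, then L2-normalise each row (sklearn's default norm='l2')
    tf = np.zeros((n, len(vocab)))
    for i, doc in enumerate(corpus):
        for w, c in Counter(doc.split()).items():
            tf[i, index[w]] = c
    custom = normalize(tf * idf, norm="l2")

    reference = TfidfVectorizer().fit_transform(corpus).toarray()
    print(np.allclose(custom, reference))   # True on this corpus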

Error predicting: X has n features per sample, expecting m

社会主义新天地 submitted on 2020-07-20 04:12:10
Question: I have the following code, where I transform text into TF-IDF features:

    ...
    x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'], dataset['classes'], test_size=test_percentil)
    # Term-document matrix
    count_vect = CountVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary)
    x_train_counts = count_vect.fit_transform(x_train)
    x_test_counts = count_vect.transform(x_test)
    # Inverse document frequency weighting
    tf_transformer = TfidfTransformer(use_idf=True).fit(x_train_counts)
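This error usually means a different vectorizer, with a different vocabulary size, was fitted at prediction time instead of reusing the one fitted on the training data. A sketch of the usual remedy, persisting and reloading the fitted CountVectorizer and TfidfTransformer alongside the model (the toy data and LogisticRegression are stand-ins for the question's dataset and classifier):

    import joblib
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression

    # toy stand-ins for dataset['documents'] / dataset['classes']
    x_train = ["spam spam offer", "meeting at noon", "cheap offer now", "project meeting notes"]
    y_train = [1, 0, 1, 0]

    count_vect = CountVectorizer()
    tfidf = TfidfTransformer(use_idf=True)
    clf = LogisticRegression()

    X_train = tfidf.fit_transform(count_vect.fit_transform(x_train))
    clf.fit(X_train, y_train)

    # Persist the fitted vectorizer and transformer together with the model, and
    # reload all three at prediction time. Fitting a *new* CountVectorizer on the
    # new text yields a different vocabulary size, which is what produces
    # "X has n features per sample, expecting m".
    joblib.dump((count_vect, tfidf, clf), "model.joblib")
    count_vect, tfidf, clf = joblib.load("model.joblib")

    new_docs = ["new cheap offer"]
    X_new = tfidf.transform(count_vect.transform(new_docs))   # transform only, never fit
    print(clf.predict(X_new))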

How to manually calculate TF-IDF score from SKLearn's TfidfVectorizer

坚强是说给别人听的谎言 submitted on 2020-06-27 16:00:17
Question: I have been running the TF-IDF vectorizer from SKLearn but am having trouble recreating the values manually (as an aid to understanding what is happening). To add some context, I have a list of documents from which I have extracted named entities (in my actual data these go up to 5-grams, but here I have restricted this to bigrams). I only want to know the TF-IDF scores for these values and thought that passing these terms via the vocabulary parameter would do this. Here is some dummy data similar
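A minimal sketch of how the fixed-vocabulary scores can be reproduced, with hypothetical bigram terms standing in for the named entities: with smooth_idf (the default), idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, and each row is raw-count tf times idf, then L2-normalised:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # hypothetical documents and a fixed bigram vocabulary, standing in for the
    # named-entity terms described in the question
    docs = ["new york is big", "new york and los angeles", "los angeles is hot"]
    vocab = ["new york", "los angeles"]

    vec = TfidfVectorizer(vocabulary=vocab, ngram_range=(2, 2))
    X = vec.fit_transform(docs)

    # idf_ holds what was learned during fit: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
    print(dict(zip(vocab, vec.idf_)))

    # each row is raw-count tf * idf, then L2-normalised, which is why the middle
    # document (containing both terms once) splits its weight evenly
    print(X.toarray())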

Computing TF-IDF on the whole dataset or only on training data?

人走茶凉 submitted on 2020-06-13 18:45:45
Question: In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during pre-processing to obtain the tf-idf features of the text for training. The author passes all of the text data to the function before separating it into train and test sets. Is this correct, or must we separate the data first and then perform fit_transform on the training set and transform on the test set? Answer 1: I have not read the book and I am not sure whether this is actually a mistake in the book
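The usual recommendation is to split first and fit the vectorizer on the training portion only, so that idf statistics from the test documents never leak into the features. A minimal sketch with placeholder texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # placeholder corpus and labels
    texts = ["cheap offer now", "meeting at noon", "spam spam offer", "project meeting notes"]
    labels = [1, 0, 1, 0]

    x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)

    tfidf = TfidfVectorizer()
    X_train = tfidf.fit_transform(x_train)   # idf statistics learned from training documents only
    X_test = tfidf.transform(x_test)         # test documents reuse those statistics, no leakage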

How does TfidfVectorizer compute scores on test data

自闭症网瘾萝莉.ら submitted on 2020-05-13 05:36:06
Question: In scikit-learn, TfidfVectorizer allows us to fit on training data and later use the same vectorizer to transform our test data. The output of the transformation on the training data is a matrix that represents a tf-idf score for each word in a given document. However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either: the score of a word in a new document is computed by some aggregation of the scores of the same word over documents in the
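The fitted vectorizer stores the idf values learned from the training corpus in idf_; transforming a new document multiplies that document's own term counts by those stored idf values and L2-normalises the result, ignoring terms unseen during fit. A sketch with a toy corpus (normalisation disabled so the product tf * idf is directly visible):

    from sklearn.feature_extraction.text import TfidfVectorizer

    train_docs = ["apple banana apple", "banana cherry", "cherry apple"]   # toy training corpus
    vec = TfidfVectorizer(norm=None)    # normalisation disabled so tf * idf is visible directly
    vec.fit(train_docs)

    # Transforming a *new* document counts its terms and multiplies them by the
    # idf_ values learned during fit; terms unseen at fit time are simply ignored.
    row = vec.transform(["apple apple durian"]).toarray()[0]

    i = vec.vocabulary_["apple"]
    print(row[i], 2 * vec.idf_[i])   # identical: tf (= 2) times the stored idf of 'apple'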

Document similarity: Vector embedding versus Tf-Idf performance?

允我心安 submitted on 2020-04-09 18:37:25
Question: I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches: (1) a vector embedding (word2vec, GloVe or fastText), averaging over the word vectors in a document and using cosine similarity; (2) bag-of-words: tf-idf or its variations such as BM25. Will one of these yield a significantly better result? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec for document
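As a point of reference, a minimal sketch of the tf-idf plus cosine-similarity baseline (the second approach), with placeholder documents:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # placeholder documents; in the question's setting these grow over time, so the
    # matrix would be refit at whatever snapshot similarity is needed
    docs = ["stock market rally continues",
            "markets rally on strong earnings",
            "recipe for chocolate cake"]

    X = TfidfVectorizer().fit_transform(docs)
    sim = cosine_similarity(X)          # n_docs x n_docs similarity matrix
    print(sim.round(2))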