tf-idf

AttributeError: 'int' object has no attribute 'lower' in TFIDF and CountVectorizer

妖精的绣舞 submitted on 2020-08-27 06:44:19
Question: I tried to predict different classes for the incoming messages, working with the Persian language. I used TF-IDF and Naive Bayes to classify my input data. Here is my code:

    import pandas as pd
    df = pd.read_excel('dataset.xlsx')
    col = ['label', 'body']
    df = df[col]
    df.columns = ['label', 'body']
    df['class_type'] = df['label'].factorize()[0]
    class_type_df = df[['label', 'class_type']].drop_duplicates().sort_values('class_type')
    class_type_id = dict(class_type_df.values)
    id_to_class_type = dict(class_type_df[[
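A common cause of this AttributeError is that the text column contains numeric or missing cells, since the vectorizer calls .lower() on every entry. A minimal sketch of that workaround, reusing the question's 'dataset.xlsx' and 'body' column (the cast itself is an assumption about the data, not part of the original code):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_excel('dataset.xlsx')      # same file name as in the question
    # The vectorizer's default preprocessor calls .lower() on every entry, so an
    # integer or NaN cell in the text column raises
    # "AttributeError: 'int' object has no attribute 'lower'".
    # Casting the column to strings (and filling missing values) avoids that.
    df['body'] = df['body'].fillna('').astype(str)

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(df['body'])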

How to make TF-IDF matrix dense?

被刻印的时光 ゝ submitted on 2020-08-17 04:58:22
Question: I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need these two entities to have the same dimension, in my case max_features. Here is what I have:

    tfidf = TfidfVectorizer(max_features=10, strip_accents=
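TfidfVectorizer returns a SciPy sparse matrix; a dense NumPy array of the same shape can be obtained with toarray(). A minimal sketch with a placeholder corpus (note that densifying a large vocabulary over many documents can exhaust memory):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # placeholder corpus
    docs = ["first article text",
            "second article about sports",
            "third article about politics"]

    tfidf = TfidfVectorizer(max_features=10)
    X_sparse = tfidf.fit_transform(docs)   # scipy.sparse matrix, shape (n_docs, n_features)
    X_dense = X_sparse.toarray()           # plain NumPy array of the same shape
    print(X_dense.shape)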

TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams); which accuracy should be considered?

淺唱寂寞╮ submitted on 2020-08-10 19:20:25
Question: Here you can see I am computing the accuracy for Count Vectors, Word-Level TF-IDF, and N-Gram TF-IDF vectors:

    accuracy = train_model(classifier, xtrain_count, train_y, xvalid_count)
    print("NB, Count Vectors: ", accuracy)

    # Naive Bayes on Word Level TF-IDF Vectors
    accuracy = train_model(classifier, xtrain_tfidf, train_y, xvalid_tfidf)
    print("NB, WordLevel TF-IDF: ", accuracy)

    # Naive Bayes on Ngram Level TF-IDF Vectors
    accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
    print("NB
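Each vectorizer defines a separate feature representation, so each accuracy belongs to a different model; the usual practice is to compare them on the same held-out validation set (or via cross-validation) and report the accuracy of the representation you actually deploy. A self-contained sketch of that comparison, with toy data and MultinomialNB standing in for the question's undefined train_model helper:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    def evaluate(vectorizer, train_x, train_y, valid_x, valid_y):
        # fit the vectorizer on training text only, then transform both splits
        xtrain = vectorizer.fit_transform(train_x)
        xvalid = vectorizer.transform(valid_x)
        clf = MultinomialNB().fit(xtrain, train_y)
        return accuracy_score(valid_y, clf.predict(xvalid))

    # hypothetical toy data, just to make the sketch self-contained
    train_x = ["good movie", "bad movie", "great film", "terrible film"]
    train_y = [1, 0, 1, 0]
    valid_x = ["good film", "bad film"]
    valid_y = [1, 0]

    for name, vec in [("Count Vectors", CountVectorizer()),
                      ("WordLevel TF-IDF", TfidfVectorizer()),
                      ("N-Gram TF-IDF", TfidfVectorizer(ngram_range=(2, 3))),
                      ("CharLevel TF-IDF", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)))]:
        print(name, evaluate(vec, train_x, train_y, valid_x, valid_y))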

How to build a TFIDF Vectorizer given a corpus and compare its results using Sklearn?

删除回忆录丶 submitted on 2020-08-09 19:06:50
Question: Sklearn makes a few tweaks in its implementation of the TF-IDF vectorizer, so to replicate the exact results you would need to add the following things to your custom implementation of the tf-idf vectorizer: Sklearn generates its vocabulary sorted in alphabetical order. Sklearn's idf formula is different from the standard textbook formula: here the constant "1" is added to the numerator and denominator of the idf, as if an extra document was seen containing every term in the
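A minimal sketch of a custom implementation that reproduces TfidfVectorizer's defaults on a placeholder corpus: alphabetically sorted vocabulary, smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, raw-count term frequency, and L2 row normalisation:

    import numpy as np
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    # placeholder corpus; lowercase, no single-character tokens, so str.split()
    # matches sklearn's default tokenisation here
    corpus = ["this is the first document",
              "this document is the second document",
              "and this is the third one"]

    # 1. vocabulary sorted alphabetically, as sklearn does
    vocab = sorted({w for doc in corpus for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # 2. sklearn's smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    n = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc.split()))
    idf = np.array([np.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab])

    # 3. raw term counts * idf, then L2-normalise each row (sklearn's default norm='l2')
    tf = np.zeros((n, len(vocab)))
    for i, doc in enumerate(corpus):
        for w, c in Counter(doc.split()).items():
            tf[i, index[w]] = c
    custom = normalize(tf * idf, norm="l2")

    reference = TfidfVectorizer().fit_transform(corpus).toarray()
    print(np.allclose(custom, reference))   # True on this corpus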

Error predicting: X has n features per sample, expecting m

社会主义新天地 submitted on 2020-07-20 04:12:10
Question: I have the following code, where I transform text into TF-IDF features:

    ...
    x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'], dataset['classes'], test_size=test_percentil)
    # Term-document matrix
    count_vect = CountVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary)
    x_train_counts = count_vect.fit_transform(x_train)
    x_test_counts = count_vect.transform(x_test)
    # Inverse document frequency weighting
    tf_transformer = TfidfTransformer(use_idf=True).fit(x_train_counts)
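This error usually means a different vectorizer, with a different vocabulary size, was fitted at prediction time instead of reusing the one fitted on the training data. A sketch of the usual remedy, persisting and reloading the fitted CountVectorizer and TfidfTransformer alongside the model (the toy data and LogisticRegression are stand-ins for the question's dataset and classifier):

    import joblib
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression

    # toy stand-ins for dataset['documents'] / dataset['classes']
    x_train = ["spam spam offer", "meeting at noon", "cheap offer now", "project meeting notes"]
    y_train = [1, 0, 1, 0]

    count_vect = CountVectorizer()
    tfidf = TfidfTransformer(use_idf=True)
    clf = LogisticRegression()

    X_train = tfidf.fit_transform(count_vect.fit_transform(x_train))
    clf.fit(X_train, y_train)

    # Persist the fitted vectorizer and transformer together with the model, and
    # reload all three at prediction time. Fitting a *new* CountVectorizer on the
    # new text yields a different vocabulary size, which is what produces
    # "X has n features per sample, expecting m".
    joblib.dump((count_vect, tfidf, clf), "model.joblib")
    count_vect, tfidf, clf = joblib.load("model.joblib")

    new_docs = ["new cheap offer"]
    X_new = tfidf.transform(count_vect.transform(new_docs))   # transform only, never fit
    print(clf.predict(X_new))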

How to manually calculate TF-IDF score from SKLearn's TfidfVectorizer

坚强是说给别人听的谎言 submitted on 2020-06-27 16:00:17
Question: I have been running the TF-IDF vectorizer from SKLearn but am having trouble recreating the values manually (as an aid to understanding what is happening). To add some context, I have a list of documents from which I have extracted named entities (in my actual data these go up to 5-grams, but here I have restricted this to bigrams). I only want to know the TF-IDF scores for these values and thought that passing these terms via the vocabulary parameter would do this. Here is some dummy data similar
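A minimal sketch of how the fixed-vocabulary scores can be reproduced, with hypothetical bigram terms standing in for the named entities: with smooth_idf (the default), idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, and each row is raw-count tf times idf, then L2-normalised:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # hypothetical documents and a fixed bigram vocabulary, standing in for the
    # named-entity terms described in the question
    docs = ["new york is big", "new york and los angeles", "los angeles is hot"]
    vocab = ["new york", "los angeles"]

    vec = TfidfVectorizer(vocabulary=vocab, ngram_range=(2, 2))
    X = vec.fit_transform(docs)

    # idf_ holds what was learned during fit: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
    print(dict(zip(vocab, vec.idf_)))

    # each row is raw-count tf * idf, then L2-normalised, which is why the middle
    # document (containing both terms once) splits its weight evenly
    print(X.toarray())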

Computing TF-IDF on the whole dataset or only on training data?

人走茶凉 submitted on 2020-06-13 18:45:45
Question: In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during pre-processing to obtain the tf-idf features of the text for training. The author passes all of the text data to the function before separating it into train and test sets. Is this correct, or must we separate the data first and then perform fit_transform on the training set and transform on the test set? Answer 1: I have not read the book and I am not sure whether this is actually a mistake in the book
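The usual recommendation is to split first and fit the vectorizer on the training portion only, so that idf statistics from the test documents never leak into the features. A minimal sketch with placeholder texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # placeholder corpus and labels
    texts = ["cheap offer now", "meeting at noon", "spam spam offer", "project meeting notes"]
    labels = [1, 0, 1, 0]

    x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)

    tfidf = TfidfVectorizer()
    X_train = tfidf.fit_transform(x_train)   # idf statistics learned from training documents only
    X_test = tfidf.transform(x_test)         # test documents reuse those statistics, no leakage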

How does TfidfVectorizer compute scores on test data

自闭症网瘾萝莉.ら submitted on 2020-05-13 05:36:06
Question: In scikit-learn, TfidfVectorizer allows us to fit on training data and later use the same vectorizer to transform our test data. The output of the transformation on the training data is a matrix that represents a tf-idf score for each word in a given document. However, how does the fitted vectorizer compute the score for new inputs? I have guessed that either: the score of a word in a new document is computed by some aggregation of the scores of the same word over documents in the
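The fitted vectorizer stores the idf values learned from the training corpus in idf_; transforming a new document multiplies that document's own term counts by those stored idf values and L2-normalises the result, ignoring terms unseen during fit. A sketch with a toy corpus (normalisation disabled so the product tf * idf is directly visible):

    from sklearn.feature_extraction.text import TfidfVectorizer

    train_docs = ["apple banana apple", "banana cherry", "cherry apple"]   # toy training corpus
    vec = TfidfVectorizer(norm=None)    # normalisation disabled so tf * idf is visible directly
    vec.fit(train_docs)

    # Transforming a *new* document counts its terms and multiplies them by the
    # idf_ values learned during fit; terms unseen at fit time are simply ignored.
    row = vec.transform(["apple apple durian"]).toarray()[0]

    i = vec.vocabulary_["apple"]
    print(row[i], 2 * vec.idf_[i])   # identical: tf (= 2) times the stored idf of 'apple'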

Document similarity: Vector embedding versus Tf-Idf performance?

允我心安 submitted on 2020-04-09 18:37:25
Question: I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches: (1) a vector embedding (word2vec, GloVe or fastText), averaging over the word vectors in a document and using cosine similarity; (2) bag-of-words: tf-idf or its variations such as BM25. Will one of these yield a significantly better result? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec for document
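As a point of reference, a minimal sketch of the tf-idf plus cosine-similarity baseline (the second approach), with placeholder documents:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # placeholder documents; in the question's setting these grow over time, so the
    # matrix would be refit at whatever snapshot similarity is needed
    docs = ["stock market rally continues",
            "markets rally on strong earnings",
            "recipe for chocolate cake"]

    X = TfidfVectorizer().fit_transform(docs)
    sim = cosine_similarity(X)          # n_docs x n_docs similarity matrix
    print(sim.round(2))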