tf-idf

PostgreSQL: Find sentences closest to a given sentence

 ̄綄美尐妖づ submitted on 2019-12-12 09:52:27
Question: I have a table of images with sentence captions. Given a new sentence, I want to find the images that best match it, based on how close the new sentence is to the stored sentences. I know I can use the @@ operator with to_tsquery, but tsquery expects specific words as queries, and I don't know how to convert the given sentence into a meaningful query; the sentence may contain punctuation and numbers. I also suspect that some kind of cosine similarity is what I need.
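Two notes. Within PostgreSQL itself, plainto_tsquery() turns a raw sentence, punctuation and all, into a valid tsquery. And for the cosine-similarity idea, here is a minimal sketch outside the database using scikit-learn; the caption strings are hypothetical stand-ins for rows fetched from the images table:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical captions, standing in for rows fetched from the table
captions = ["a dog running on the beach",
            "two cats sleeping on a sofa",
            "a man riding a bicycle at night"]
new_sentence = "A cat naps on the couch!"

# TfidfVectorizer lowercases and strips punctuation by default,
# so the raw sentence needs no manual cleaning
vectorizer = TfidfVectorizer()
caption_vectors = vectorizer.fit_transform(captions)
query_vector = vectorizer.transform([new_sentence])

# Rank stored captions by cosine similarity to the new sentence
scores = cosine_similarity(query_vector, caption_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(round(scores[i], 3), captions[i])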

How to find out which words are most representative based on their tfidf index and score

本秂侑毒 submitted on 2019-12-12 06:28:03
Question: I have generated tf-idf scores for the words in my corpus and would like to identify which words they are. This is my code and results:

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english')
X_counts = count_vect.fit_transform(X)
X_counts.shape
Out[4]: (26, 3777)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
X_tfidf.shape
Out[73]:
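A sketch of the usual way to map those scores back to words, with a toy corpus in place of the asker's X; get_feature_names_out() is the scikit-learn 1.0+ spelling (older versions use get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X = ["the quick brown fox", "the lazy dog sleeps", "quick quick brown fox"]

count_vect = CountVectorizer(stop_words='english')
X_counts = count_vect.fit_transform(X)
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# Column j of X_tfidf corresponds to feature_names[j]
feature_names = count_vect.get_feature_names_out()
row = X_tfidf[0].toarray().ravel()
for j in row.argsort()[::-1][:5]:
    print(feature_names[j], row[j])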

In general, when does TF-IDF reduce accuracy?

狂风中的少年 submitted on 2019-12-12 03:48:45
Question: I'm training a Naive Bayes model on a corpus of 200,000 reviews labeled positive or negative, and I noticed that applying TF-IDF actually reduced accuracy (testing on a held-out set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF makes any underlying assumptions about the data or model it works with, i.e. whether there are cases where using it reduces accuracy? Answer 1: The IDF component of TF*IDF can harm your classification accuracy in some cases. Let's suppose
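A quick way to check the effect on one's own data is to swap vectorizers and compare cross-validated scores under the same model and folds; the toy reviews below are placeholders:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

reviews = ["great movie loved it", "wonderful acting great plot",
           "terrible movie hated it", "awful acting boring plot",
           "loved the wonderful cast", "great fun loved every minute",
           "boring and terrible ending", "hated the awful pacing"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Compare raw counts vs. tf-idf, everything else held fixed
for name, vec in [("counts", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    X = vec.fit_transform(reviews)
    print(name, cross_val_score(MultinomialNB(), X, labels, cv=2).mean())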

TF-IDF for my documents yields 0

一个人想着一个人 submitted on 2019-12-11 18:11:41
Question: I took this tf-idf implementation from yebrahim, and somehow my output yields 0.0 for every word. Any idea what the problem is? Example output:

hippo 0.0
hipper 0.0
hip 0.0
hint 0.0
hindsight 0.0
hill 0.0
hilarious 0.0

Thanks for the help.

# a list of (word, freq) pairs for each document
global_terms_in_doc = {}
# list to hold occurrences of terms across documents
global_term_freq = {}
num_docs = 0
lang = 'english'
lang_dictionary = {}
top_k = -1
supported_langs = ('english', 'french')
from django
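Without the full script it's hard to be sure, but an all-zero result is the classic symptom of every term appearing in every document: with the plain idf = log(N / df) formula, df = N gives idf = log(1) = 0, which zeroes the product regardless of term frequency. A tiny illustration:

import math

docs = [["hip", "hop"], ["hip", "hint"], ["hip", "hill"]]
N = len(docs)

# Document frequency: in how many documents each term occurs
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# "hip" occurs in all 3 documents, so its idf (and tf-idf) is 0.0
for term in sorted(df):
    print(term, math.log(N / df[term]))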

Dealing with a large number of unique words for text processing/tf-idf etc.

天大地大妈咪最大 submitted on 2019-12-11 07:59:33
Question: I am using scikit-learn for some text processing, such as tf-idf. The number of filenames is handled fine (~40k), but as far as the number of unique words goes, I am not able to deal with the array/matrix, whether it's printing the number of unique words or dumping the numpy array to a file (using savetxt). Below is the traceback. It would be enough to get the top tf-idf values, as I don't need them for every single word in every single document. Or, I could exclude other words
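One workaround, sketched below with illustrative parameter values, is to prune the vocabulary at vectorization time so the unique-word axis never becomes unmanageable:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["some example document text",
        "another example text entirely",
        "a third document, more text"]

# max_features keeps only the N most frequent terms across the corpus;
# min_df drops terms that appear in fewer than 2 documents
vec = TfidfVectorizer(max_features=50000, min_df=2)
X = vec.fit_transform(docs)
print(X.shape, sorted(vec.vocabulary_))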

TF-IDF of strings from CSV file

我们两清 submitted on 2019-12-11 07:25:14
Question: My test.csv file is (without header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want my corpus separated on commas, so that my final DocumentTermMatrix becomes:

docs    very good   very bad   you are great   good restaurent   nice place to visit
doc1    tf-idf      tf-idf     tf-idf          0                 0
doc2    0           tf-idf     0               tf-idf            tf-idf

I am able to produce the above DTM correctly if I don't load the documents from the csv file, like below:

library(tm)
docs <- c(D1 = "very good, very
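The question itself is about R's tm package, but the same comma-as-delimiter idea, transplanted to Python/scikit-learn for consistency with the rest of this page (a sketch, not the tm answer), is a custom tokenizer; get_feature_names_out() is get_feature_names() on older scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["very good, very bad, you are great",
        "very bad, good restaurent, nice place to visit"]

# Split on commas so each whole phrase counts as a single term
vec = TfidfVectorizer(tokenizer=lambda s: [t.strip() for t in s.split(",")],
                      token_pattern=None)
dtm = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(dtm.toarray().round(3))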

Sorting each row of a large sparse matrix & saving top K values & column indices

Deadly submitted on 2019-12-11 06:53:28
Question: I have a large sparse scipy matrix (~40k by 100k). I would like to sort each row in descending order and grab/slice the top K values (~20-50) for every row. I would also like to know the original column index, since each column in the matrix represents a word/feature (in my case, I am running scikit-learn to get tf-idf values). 40k rows by K values won't be as large, and then I can do operations such as .toarray(), but I am not sure what would be the most efficient way of doing the argsort(axis=1) for
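A common pattern for this, sketched on a small random matrix standing in for the 40k by 100k one: argsort only each row's stored nonzeros, which avoids densifying anything and preserves the original column indices:

import numpy as np
import scipy.sparse as sp

X = sp.random(5, 20, density=0.3, format="csr", random_state=0)
K = 3

for i in range(X.shape[0]):
    row = X.getrow(i)
    # Sort only the stored entries of this row, descending
    order = np.argsort(row.data)[::-1][:K]
    top_values = row.data[order]
    top_columns = row.indices[order]  # original column indices
    print(i, top_columns, top_values)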

TfidfVectorizer process shows error

白昼怎懂夜的黑 submitted on 2019-12-11 06:15:00
Question: I am working on non-English corpus analysis and facing several problems. One of those problems is tfidf_vectorizer. After importing the relevant libraries, I ran the following code to get results:

contents = [open("D:\test.txt", encoding='utf8').read()]

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stopwords,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(3,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform
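The traceback is cut off, but one plausible culprit with this exact setup: contents holds a single document, and with one document the proportional max_df=0.8 prunes every term (each term's document frequency is 1.0), so scikit-learn raises an empty-vocabulary ValueError. A self-contained sketch with one toy document per entry, omitting the question's custom stopwords and tokenizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for lines of the non-English corpus file
contents = ["first toy document text here",
            "second toy document text here",
            "third toy document text again"]

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, use_idf=True,
                                   ngram_range=(3, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)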

How can I return accuracy rates for Top N predictions using sklearn's SGDClassifier?

我们两清 submitted on 2019-12-11 02:48:37
Question: I am trying to modify the results in this post (How to get Top 3 or Top N predictions using sklearn's SGDClassifier) to return the accuracy rate, but I get an accuracy rate of zero and I can't figure out why. Any thoughts/edits would be much appreciated! Thank you.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model
arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium
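For reference, a sketch of top-N accuracy with SGDClassifier: rank classes by decision_function and count a hit when the true label appears anywhere in the top N. The first three documents come from the question; the remaining documents and all labels are made up for illustration, and it evaluates on the training data only to keep the example short:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air',
       'dogs lions tigers', 'orange banana apple', 'fire air water']
labels = ['animal', 'fruit', 'element', 'animal', 'fruit', 'element']

X = TfidfVectorizer().fit_transform(arr)
clf = SGDClassifier(random_state=0).fit(X, labels)

# Top-N accuracy: correct if the true label is among the N
# highest-scoring classes for that sample
N = 2
scores = clf.decision_function(X)
top_n = np.argsort(scores, axis=1)[:, ::-1][:, :N]
hits = sum(labels[i] in clf.classes_[top_n[i]] for i in range(len(labels)))
print(hits / len(labels))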

Converting scipy.sparse.csr.csr_matrix to a list of lists

笑着哭i submitted on 2019-12-10 17:03:23
Question: I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn. I am dealing with a text corpus and calculating its tf-idf scores, using the module sklearn.feature_extraction.text. With CountVectorizer and TfidfTransformer I now have my corpus vectorized and a tf-idf score for each vocabulary term. The problem is that I now have a sparse matrix, like:

(0, 47)    0.104275891915
(0, 383)   0.084129133023
. . . .
(4, 308)   0.0285015996586
(4, 199)   0
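Two common ways to get a list of lists out of a csr_matrix, depending on whether the dense form fits in memory; the values below are toy stand-ins:

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0.0, 0.104, 0.0],
                         [0.084, 0.0, 0.028]]))

# Small matrices: densify, then convert (includes all the zeros)
print(X.toarray().tolist())

# Large matrices: stay sparse and keep only (column, value) pairs per row
rows = [list(zip(X[i].indices.tolist(), X[i].data.tolist()))
        for i in range(X.shape[0])]
print(rows)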