tf-idf

PostgreSQL: Find sentences closest to a given sentence

 ̄綄美尐妖づ submitted on 2019-12-12 09:52:27
Question: I have a table of images with sentence captions. Given a new sentence, I want to find the images that best match it, based on how close the new sentence is to the stored sentences. I know I can use the @@ operator with to_tsquery, but tsquery expects specific words as queries, and I don't know how to convert the given sentence into a meaningful query; the sentence may contain punctuation and numbers. I also suspect that some kind of cosine similarity is what I need.
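Two notes. Within PostgreSQL itself, plainto_tsquery() turns a raw sentence, punctuation and all, into a valid tsquery. And for the cosine-similarity idea, here is a minimal sketch outside the database using scikit-learn; the caption strings are hypothetical stand-ins for rows fetched from the images table:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical captions, standing in for rows fetched from the table
captions = ["a dog running on the beach",
            "two cats sleeping on a sofa",
            "a man riding a bicycle at night"]
new_sentence = "A cat naps on the couch!"

# TfidfVectorizer lowercases and strips punctuation by default,
# so the raw sentence needs no manual cleaning
vectorizer = TfidfVectorizer()
caption_vectors = vectorizer.fit_transform(captions)
query_vector = vectorizer.transform([new_sentence])

# Rank stored captions by cosine similarity to the new sentence
scores = cosine_similarity(query_vector, caption_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(round(scores[i], 3), captions[i])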

How to find out which words are most representative based on their tfidf index and score

本秂侑毒 submitted on 2019-12-12 06:28:03
Question: I have generated tf-idf scores for the words in my corpus and would like to identify which words they are. This is my code and results:

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english')
X_counts = count_vect.fit_transform(X)
X_counts.shape
Out[4]: (26, 3777)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
X_tfidf.shape
Out[73]:
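A sketch of the usual way to map those scores back to words, with a toy corpus in place of the asker's X; get_feature_names_out() is the scikit-learn 1.0+ spelling (older versions use get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X = ["the quick brown fox", "the lazy dog sleeps", "quick quick brown fox"]

count_vect = CountVectorizer(stop_words='english')
X_counts = count_vect.fit_transform(X)
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# Column j of X_tfidf corresponds to feature_names[j]
feature_names = count_vect.get_feature_names_out()
row = X_tfidf[0].toarray().ravel()
for j in row.argsort()[::-1][:5]:
    print(feature_names[j], row[j])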

In general, when does TF-IDF reduce accuracy?

狂风中的少年 submitted on 2019-12-12 03:48:45
Question: I'm training a Naive Bayes model on a corpus of 200,000 reviews labeled positive or negative, and I noticed that applying TF-IDF actually reduced accuracy (testing on a held-out set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF makes any underlying assumptions about the data or model it works with, i.e. whether there are cases where using it reduces accuracy? Answer 1: The IDF component of TF*IDF can harm your classification accuracy in some cases. Let's suppose
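A quick way to check the effect on one's own data is to swap vectorizers and compare cross-validated scores under the same model and folds; the toy reviews below are placeholders:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

reviews = ["great movie loved it", "wonderful acting great plot",
           "terrible movie hated it", "awful acting boring plot",
           "loved the wonderful cast", "great fun loved every minute",
           "boring and terrible ending", "hated the awful pacing"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Compare raw counts vs. tf-idf, everything else held fixed
for name, vec in [("counts", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    X = vec.fit_transform(reviews)
    print(name, cross_val_score(MultinomialNB(), X, labels, cv=2).mean())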

TF-IDF for my documents yields 0

一个人想着一个人 submitted on 2019-12-11 18:11:41
Question: I took this tf-idf implementation from yebrahim, and somehow my output yields 0.0 for every word. Any idea what the problem is? Example output:

hippo 0.0
hipper 0.0
hip 0.0
hint 0.0
hindsight 0.0
hill 0.0
hilarious 0.0

Thanks for the help.

# a list of (word, freq) pairs for each document
global_terms_in_doc = {}
# list to hold occurrences of terms across documents
global_term_freq = {}
num_docs = 0
lang = 'english'
lang_dictionary = {}
top_k = -1
supported_langs = ('english', 'french')
from django
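Without the full script it's hard to be sure, but an all-zero result is the classic symptom of every term appearing in every document: with the plain idf = log(N / df) formula, df = N gives idf = log(1) = 0, which zeroes the product regardless of term frequency. A tiny illustration:

import math

docs = [["hip", "hop"], ["hip", "hint"], ["hip", "hill"]]
N = len(docs)

# Document frequency: in how many documents each term occurs
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# "hip" occurs in all 3 documents, so its idf (and tf-idf) is 0.0
for term in sorted(df):
    print(term, math.log(N / df[term]))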

Dealing with a large number of unique words for text processing/tf-idf etc.

天大地大妈咪最大 submitted on 2019-12-11 07:59:33
Question: I am using scikit-learn for some text processing, such as tf-idf. The number of filenames is handled fine (~40k), but as far as the number of unique words goes, I am not able to deal with the array/matrix, whether it's printing the number of unique words or dumping the numpy array to a file (using savetxt). Below is the traceback. It would be enough to get the top tf-idf values, as I don't need them for every single word in every single document. Or, I could exclude other words
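One workaround, sketched below with illustrative parameter values, is to prune the vocabulary at vectorization time so the unique-word axis never becomes unmanageable:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["some example document text",
        "another example text entirely",
        "a third document, more text"]

# max_features keeps only the N most frequent terms across the corpus;
# min_df drops terms that appear in fewer than 2 documents
vec = TfidfVectorizer(max_features=50000, min_df=2)
X = vec.fit_transform(docs)
print(X.shape, sorted(vec.vocabulary_))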

TF-IDF of strings from CSV file

我们两清 submitted on 2019-12-11 07:25:14
Question: My test.csv file is (without header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want my corpus separated on commas, so that my final DocumentTermMatrix becomes:

docs    very good   very bad   you are great   good restaurent   nice place to visit
doc1    tf-idf      tf-idf     tf-idf          0                 0
doc2    0           tf-idf     0               tf-idf            tf-idf

I am able to produce the above DTM correctly if I don't load the documents from the csv file, like below:

library(tm)
docs <- c(D1 = "very good, very
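The question itself is about R's tm package, but the same comma-as-delimiter idea, transplanted to Python/scikit-learn for consistency with the rest of this page (a sketch, not the tm answer), is a custom tokenizer; get_feature_names_out() is get_feature_names() on older scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["very good, very bad, you are great",
        "very bad, good restaurent, nice place to visit"]

# Split on commas so each whole phrase counts as a single term
vec = TfidfVectorizer(tokenizer=lambda s: [t.strip() for t in s.split(",")],
                      token_pattern=None)
dtm = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(dtm.toarray().round(3))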

Sorting each row of a large sparse matrix & saving top K values & column indices

Deadly submitted on 2019-12-11 06:53:28
Question: I have a large sparse scipy matrix (~40k by 100k). I would like to sort each row in descending order and grab/slice the top K values (~20-50) for every row. I would also like to know the original column index, since each column in the matrix represents a word/feature (in my case, I am running scikit-learn to get tf-idf values). 40k rows by K values won't be as large, and then I can do operations such as .toarray(), but I am not sure what would be the most efficient way of doing the argsort(axis=1) for
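A common pattern for this, sketched on a small random matrix standing in for the 40k by 100k one: argsort only each row's stored nonzeros, which avoids densifying anything and preserves the original column indices:

import numpy as np
import scipy.sparse as sp

X = sp.random(5, 20, density=0.3, format="csr", random_state=0)
K = 3

for i in range(X.shape[0]):
    row = X.getrow(i)
    # Sort only the stored entries of this row, descending
    order = np.argsort(row.data)[::-1][:K]
    top_values = row.data[order]
    top_columns = row.indices[order]  # original column indices
    print(i, top_columns, top_values)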

TfidfVectorizer process shows error

白昼怎懂夜的黑 submitted on 2019-12-11 06:15:00
Question: I am working on non-English corpus analysis and facing several problems. One of those problems is tfidf_vectorizer. After importing the relevant libraries, I ran the following code to get results:

contents = [open("D:\test.txt", encoding='utf8').read()]

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stopwords,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(3,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform
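The traceback is cut off, but one plausible culprit with this exact setup: contents holds a single document, and with one document the proportional max_df=0.8 prunes every term (each term's document frequency is 1.0), so scikit-learn raises an empty-vocabulary ValueError. A self-contained sketch with one toy document per entry, omitting the question's custom stopwords and tokenizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for lines of the non-English corpus file
contents = ["first toy document text here",
            "second toy document text here",
            "third toy document text again"]

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, use_idf=True,
                                   ngram_range=(3, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(contents)
print(tfidf_matrix.shape)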

How can I return accuracy rates for Top N predictions using sklearn's SGDClassifier?

我们两清 submitted on 2019-12-11 02:48:37
Question: I am trying to modify the results in this post (How to get Top 3 or Top N predictions using sklearn's SGDClassifier) to return the accuracy rate, but I get an accuracy rate of zero and I can't figure out why. Any thoughts/edits would be much appreciated! Thank you.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model
arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium
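For reference, a sketch of top-N accuracy with SGDClassifier: rank classes by decision_function and count a hit when the true label appears anywhere in the top N. The first three documents come from the question; the remaining documents and all labels are made up for illustration, and it evaluates on the training data only to keep the example short:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air',
       'dogs lions tigers', 'orange banana apple', 'fire air water']
labels = ['animal', 'fruit', 'element', 'animal', 'fruit', 'element']

X = TfidfVectorizer().fit_transform(arr)
clf = SGDClassifier(random_state=0).fit(X, labels)

# Top-N accuracy: correct if the true label is among the N
# highest-scoring classes for that sample
N = 2
scores = clf.decision_function(X)
top_n = np.argsort(scores, axis=1)[:, ::-1][:, :N]
hits = sum(labels[i] in clf.classes_[top_n[i]] for i in range(len(labels)))
print(hits / len(labels))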

Converting scipy.sparse.csr.csr_matrix to a list of lists

笑着哭i submitted on 2019-12-10 17:03:23
Question: I am learning multi-label classification and trying to implement the tf-idf tutorial from scikit-learn. I am dealing with a text corpus and calculating its tf-idf scores, using the module sklearn.feature_extraction.text. With CountVectorizer and TfidfTransformer I now have my corpus vectorized and a tf-idf score for each vocabulary term. The problem is that I now have a sparse matrix, like:

(0, 47)    0.104275891915
(0, 383)   0.084129133023
. . . .
(4, 308)   0.0285015996586
(4, 199)   0
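Two common ways to get a list of lists out of a csr_matrix, depending on whether the dense form fits in memory; the values below are toy stand-ins:

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[0.0, 0.104, 0.0],
                         [0.084, 0.0, 0.028]]))

# Small matrices: densify, then convert (includes all the zeros)
print(X.toarray().tolist())

# Large matrices: stay sparse and keep only (column, value) pairs per row
rows = [list(zip(X[i].indices.tolist(), X[i].data.tolist()))
        for i in range(X.shape[0])]
print(rows)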