tf-idf

String Matching Using TF-IDF, NGrams and Cosine Similarity in Python

倾然丶 夕夏残阳落幕 · Submitted on 2021-02-17 20:59:49
Question: I am working on my first major data science project. I am attempting to match names between a large list of data from one source and a cleansed dictionary from another, using this string-matching blog as a guide, with two different data sets. Unfortunately, I can't seem to get good results, and I think I am not applying the technique appropriately. Code: import pandas as pd, numpy as np, re, sparse_dot_topn.sparse_dot_topn as ct from sklearn.feature_extraction.text import
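A minimal sketch of the technique the question describes — TF-IDF over character n-grams plus cosine similarity for fuzzy name matching. The sample names are illustrative, not from the question:

```python
# Fuzzy string matching: TF-IDF on character trigrams + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

messy_names = ["Jon Smith", "Jonathan Smith", "Acme Corp.", "ACME Corporation"]
clean_names = ["John Smith", "Acme Corporation"]

# Character trigrams are far more forgiving of typos and abbreviations
# than whole-word tokens; char_wb pads each word with spaces.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vectorizer.fit(messy_names + clean_names)
messy_vecs = vectorizer.transform(messy_names)
clean_vecs = vectorizer.transform(clean_names)

# For each messy name, pick the clean name with the highest cosine similarity.
sims = cosine_similarity(messy_vecs, clean_vecs)   # shape: (4, 2)
best = sims.argmax(axis=1)
matches = [(m, clean_names[i]) for m, i in zip(messy_names, best)]
```

The blog the questioner follows uses sparse_dot_topn for speed on large lists; `cosine_similarity` shown here is the simpler, equivalent computation for small data.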

tf-idf on a somewhat large (65k) amount of text files

十年热恋 · Submitted on 2021-02-08 04:45:37
Question: I want to try tf-idf with scikit-learn (or NLTK; I am open to other suggestions). The data I have is a relatively large set of discussion-forum posts (~65k) we have scraped and stored in a MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), the user name, a message ID, and whether it is a child or parent post (in a thread you have the original post, then replies to that OP, or nested replies, forming a tree). I figure
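A minimal sketch of what vectorizing such a corpus with scikit-learn looks like. In practice the post texts would come from a MongoDB cursor; a plain list stands in here:

```python
# Vectorize a corpus of forum posts with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "How do I install the package?",
    "Re: How do I install the package? Try pip.",
    "Unrelated post about something else entirely",
]

# max_features caps the vocabulary size, keeping memory bounded on ~65k posts.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vectorizer.fit_transform(posts)  # sparse matrix: (n_posts, n_terms)
```

`fit_transform` accepts any iterable of strings, so a generator that streams post bodies out of MongoDB works without loading everything into memory at once.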

Sklearn how to get the 10 words from each topic

佐手、 · Submitted on 2021-01-29 14:46:03
Question: I want to get the top 10 most frequent words from each topic. After I use TfidfTransformer I get a result of type scipy.sparse.csr.csr_matrix, but I don't know how to get the highest ten from each row; in the data, (0, ****) refers to row 0, up to (5170, *****) for row 5170. I've tried to convert it into NumPy, but it fails. (0, 19016) 0.024214182003181053 (0, 28002) 0.03661443306612277 (0, 6710) 0.02292100371816788 (0, 27683) 0.013973969726506812 (0, 27104) 0.02236713272585597 (0

Can I use TfidfVectorizer in scikit-learn for non-English language? Also how do I read a non-English text in Python?

Deadly · Submitted on 2021-01-29 05:22:57
Question: I have to read a text document containing both English and non-English (specifically Malayalam) text in Python. I see the following: >>>text_english = 'Today is a good day' >>>text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത' Now, if I write code to extract the first letter using >>>print(text_english[0]) 'T' it works, but when I run >>>print(text_non_english[0]) � To get the first letter, I have to write the following >>>print(text_non_english[0:3]) ആ Why does this happen? My aim is to extract the
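This symptom is characteristic of indexing a byte string rather than a Unicode string: each Malayalam character occupies three bytes in UTF-8, so `[0]` returns one byte of a three-byte sequence. A minimal sketch of the distinction:

```python
# str indexing yields whole code points; bytes indexing yields raw bytes.
text = "ആരാണു"               # a Python 3 str (Unicode), 5 code points
raw = text.encode("utf-8")    # the same text as bytes: 3 bytes per character

first_char = text[0]          # "ആ" — one full code point
first_bytes = raw[0:3]        # the 3 bytes that encode "ആ" in UTF-8

# When reading from disk, state the encoding explicitly so you get str, not bytes:
# with open("document.txt", encoding="utf-8") as f:
#     text = f.read()
```

In Python 3, `text_non_english[0]` on a proper `str` already returns `ആ`; needing `[0:3]` indicates the data was read as bytes (or the code is running under Python 2, whose plain strings are byte strings).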

TfidfVectorizer - How can I check out processed tokens?

♀尐吖头ヾ · Submitted on 2021-01-04 05:40:43
Question: How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() tokenizes strings with some predefined methods. I want to observe how it tokenizes strings so that I can tune my model more easily. from sklearn.feature_extraction.text import TfidfVectorizer corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?'] vectorizer = TfidfVectorizer
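A sketch of how to inspect this: `build_analyzer()` returns the exact preprocessing-plus-tokenizing callable the vectorizer will use, so you can run it on a document yourself:

```python
# Inspect TfidfVectorizer's tokenization via build_analyzer().
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]
vectorizer = TfidfVectorizer()
analyzer = vectorizer.build_analyzer()  # the preprocess + tokenize pipeline

tokens = analyzer(corpus[0])
# The default analyzer lowercases and keeps alphanumeric tokens of 2+ characters
# (token_pattern r"(?u)\b\w\w+\b"), so punctuation and single letters drop out.
```

After fitting, `vectorizer.get_feature_names_out()` shows the resulting vocabulary, which is another quick way to see what survived tokenization.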

how to view tf-idf score against each word

我是研究僧i · Submitted on 2020-12-13 05:56:40
Question: I was trying to find the tf-idf score of each word in my document. However, it only returns values in a matrix, whereas I want a specific kind of representation of the tf-idf score against each word. The code works, but I want to change the way the result is presented: code: from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())