tf-idf

String Matching Using TF-IDF, NGrams and Cosine Similarity in Python

倾然丶 夕夏残阳落幕 · Submitted on 2021-02-17 20:59:49
Question: I am working on my first major data science project. I am attempting to match names between a large list of data from one source and a cleansed dictionary from another, using this string-matching blog as a guide, with two different data sets. Unfortunately, I can't seem to get good results, and I think I am not applying the technique appropriately. Code: import pandas as pd, numpy as np, re, sparse_dot_topn.sparse_dot_topn as ct from sklearn.feature_extraction.text import
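A minimal sketch of the technique the question describes — TF-IDF over character n-grams plus cosine similarity for fuzzy name matching. The sample names are illustrative, not from the question:

```python
# Fuzzy string matching: TF-IDF on character trigrams + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

messy_names = ["Jon Smith", "Jonathan Smith", "Acme Corp.", "ACME Corporation"]
clean_names = ["John Smith", "Acme Corporation"]

# Character trigrams are far more forgiving of typos and abbreviations
# than whole-word tokens; char_wb pads each word with spaces.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vectorizer.fit(messy_names + clean_names)
messy_vecs = vectorizer.transform(messy_names)
clean_vecs = vectorizer.transform(clean_names)

# For each messy name, pick the clean name with the highest cosine similarity.
sims = cosine_similarity(messy_vecs, clean_vecs)   # shape: (4, 2)
best = sims.argmax(axis=1)
matches = [(m, clean_names[i]) for m, i in zip(messy_names, best)]
```

The blog the questioner follows uses sparse_dot_topn for speed on large lists; `cosine_similarity` shown here is the simpler, equivalent computation for small data.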

tf-idf on a somewhat large (65k) amount of text files

十年热恋 · Submitted on 2021-02-08 04:45:37
Question: I want to try tf-idf with scikit-learn (or NLTK; I am open to other suggestions). The data I have is a relatively large set of discussion-forum posts (~65k) we have scraped and stored in a MongoDB. Each post has a post title, the date and time of the post, the text of the post message (or a "re:" if it is a reply to an existing post), the user name, a message ID, and whether it is a child or parent post (in a thread you have the original post, then replies to that OP, or nested replies, forming a tree). I figure
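A minimal sketch of what vectorizing such a corpus with scikit-learn looks like. In practice the post texts would come from a MongoDB cursor; a plain list stands in here:

```python
# Vectorize a corpus of forum posts with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "How do I install the package?",
    "Re: How do I install the package? Try pip.",
    "Unrelated post about something else entirely",
]

# max_features caps the vocabulary size, keeping memory bounded on ~65k posts.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vectorizer.fit_transform(posts)  # sparse matrix: (n_posts, n_terms)
```

`fit_transform` accepts any iterable of strings, so a generator that streams post bodies out of MongoDB works without loading everything into memory at once.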

Sklearn how to get the 10 words from each topic

佐手、 · Submitted on 2021-01-29 14:46:03
Question: I want to get the top 10 most frequent words from each topic. After I use TfidfTransformer I get a result of type scipy.sparse.csr.csr_matrix, but I don't know how to get the highest ten from each row; in the data, (0, ****) refers to row 0, up to (5170, *****) for row 5170. I've tried to convert it into NumPy, but it fails. (0, 19016) 0.024214182003181053 (0, 28002) 0.03661443306612277 (0, 6710) 0.02292100371816788 (0, 27683) 0.013973969726506812 (0, 27104) 0.02236713272585597 (0

Can I use TfidfVectorizer in scikit-learn for non-English language? Also how do I read a non-English text in Python?

Deadly · Submitted on 2021-01-29 05:22:57
Question: I have to read a text document containing both English and non-English (specifically Malayalam) text in Python. I see the following: >>>text_english = 'Today is a good day' >>>text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത' Now, if I write code to extract the first letter using >>>print(text_english[0]) 'T' it works, but when I run >>>print(text_non_english[0]) � To get the first letter, I have to write the following >>>print(text_non_english[0:3]) ആ Why does this happen? My aim is to extract the
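This symptom is characteristic of indexing a byte string rather than a Unicode string: each Malayalam character occupies three bytes in UTF-8, so `[0]` returns one byte of a three-byte sequence. A minimal sketch of the distinction:

```python
# str indexing yields whole code points; bytes indexing yields raw bytes.
text = "ആരാണു"               # a Python 3 str (Unicode), 5 code points
raw = text.encode("utf-8")    # the same text as bytes: 3 bytes per character

first_char = text[0]          # "ആ" — one full code point
first_bytes = raw[0:3]        # the 3 bytes that encode "ആ" in UTF-8

# When reading from disk, state the encoding explicitly so you get str, not bytes:
# with open("document.txt", encoding="utf-8") as f:
#     text = f.read()
```

In Python 3, `text_non_english[0]` on a proper `str` already returns `ആ`; needing `[0:3]` indicates the data was read as bytes (or the code is running under Python 2, whose plain strings are byte strings).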

TfidfVectorizer - How can I check out processed tokens?

♀尐吖头ヾ · Submitted on 2021-01-04 05:40:43
Question: How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() tokenizes strings with some predefined methods. I want to observe how it tokenizes strings so that I can tune my model more easily. from sklearn.feature_extraction.text import TfidfVectorizer corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?'] vectorizer = TfidfVectorizer
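A sketch of how to inspect this: `build_analyzer()` returns the exact preprocessing-plus-tokenizing callable the vectorizer will use, so you can run it on a document yourself:

```python
# Inspect TfidfVectorizer's tokenization via build_analyzer().
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]
vectorizer = TfidfVectorizer()
analyzer = vectorizer.build_analyzer()  # the preprocess + tokenize pipeline

tokens = analyzer(corpus[0])
# The default analyzer lowercases and keeps alphanumeric tokens of 2+ characters
# (token_pattern r"(?u)\b\w\w+\b"), so punctuation and single letters drop out.
```

After fitting, `vectorizer.get_feature_names_out()` shows the resulting vocabulary, which is another quick way to see what survived tokenization.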

how to view tf-idf score against each word

我是研究僧i · Submitted on 2020-12-13 05:56:40
Question: I was trying to find the tf-idf score of each word in my document. However, it only returns values in a matrix, whereas I want a specific kind of representation of the tf-idf score against each word. The code works, but I want to change the way the result is presented: code: from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())