tf-idf

about cosine similarity

╄→尐↘猪︶ㄣ submitted on 2019-12-01 11:08:41
Question: I am finding the cosine similarity between documents. I did it like this: D1 = (8, 0, 0, 1), where 8, 0, 0, 1 are the tf-idf scores of the terms t1, t2, t3, t4, and D2 = (7, 0, 0, 1).

cos(theta) = (56 + 0 + 0 + 1) / (sqrt(64 + 49) * sqrt(1 + 1))

which comes out to cos(theta) = 5. Now what do I evaluate from this value? I don't get what cos(theta) = 5 signifies about the similarity between the documents. Am I doing things right?

Answer 1: The denominator is wrong. The cosine similarity is defined as

sim = (D1 · D2) / (‖D1‖ ‖D2‖)
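A minimal sketch of the correct computation in plain Python, using the two vectors from the question:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine similarity: dot(d1, d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

D1 = (8, 0, 0, 1)
D2 = (7, 0, 0, 1)

# dot = 57, ||D1|| = sqrt(65), ||D2|| = sqrt(50)
sim = cosine_similarity(D1, D2)  # ~0.9998: the documents are nearly identical
```

Since tf-idf scores are non-negative, the cosine similarity always lands in [0, 1], so a value of 5 is a sure sign of an arithmetic slip: both terms of each vector must go into its norm, not one term from each.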

The jieba library

倖福魔咒の submitted on 2019-12-01 05:05:11
jieba — "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Features. Supports three segmentation modes: accurate mode, which tries to cut the sentence into the most precise segmentation, suitable for text analysis; full mode, which scans out every word in the sentence that can form a word — very fast, but it cannot resolve ambiguity; and search-engine mode, which, on top of accurate mode, re-segments long words to improve recall, suitable for search-engine tokenization. Supports traditional Chinese segmentation. Supports custom dictionaries. MIT license.

Algorithm. Performs an efficient word-graph scan based on a prefix dictionary, building a directed acyclic graph (DAG) of every possible word segmentation of the Chinese characters in a sentence; uses dynamic programming to find the maximum-probability path, i.e., the best segmentation by word frequency; for unknown (out-of-vocabulary) words, uses an HMM model of the word-forming capacity of Chinese characters, decoded with the Viterbi algorithm.

Main functions. Segmentation: the jieba.cut method accepts three parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; and the HMM parameter, which controls whether the HMM model is used. The jieba.cut_for_search method accepts two parameters: the string to be segmented, and whether to use the HMM model.
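The DAG-plus-dynamic-programming step described above can be sketched in a few lines of pure Python. This is a toy illustration, not jieba's actual implementation: the tiny frequency dictionary below is made up, whereas jieba's real prefix dictionary comes from a large corpus.

```python
import math

# Made-up frequency dictionary for three characters and their compounds.
FREQ = {"中": 50, "国": 40, "中国": 600, "人": 100, "国人": 30, "中国人": 300}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, list every end index that forms a known word."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def segment(sentence):
    """Dynamic programming over the DAG for the maximum-probability path."""
    dag = build_dag(sentence)
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of word at i)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

words = segment("中国人")  # -> ['中国人']: the compound outweighs the pieces
```

Because the frequency of "中国人" dominates, the DP keeps it whole; with a leading character that only stands alone, e.g. segment("人中国"), it splits into ['人', '中国'].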

Append tfidf to pandas dataframe

戏子无情 submitted on 2019-12-01 03:47:11
Question: I have the following pandas structure:

col1  col2  col3  text
1     1     0     meaningful text
5     9     7     trees
7     8     2     text

I'd like to vectorise it using a tf-idf vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). However, how can I add this info, with labels, to my original df? So the target would look like:

col1  col2  col3  meaningful  text  trees
1     1     0     1           1     0
5     9     7     0           0     1
7     8     2     0           1     0

UPDATE: Solution makes the concatenation wrong even when

pyspark: sparse vectors to scipy sparse matrix

南笙酒味 submitted on 2019-11-30 07:19:27
I have a Spark dataframe with a column of short sentences and a column with a categorical variable. I'd like to perform tf-idf on the sentences and one-hot encoding on the categorical variable, and then output it to a sparse matrix on my driver once it's much smaller in size (for a scikit-learn model). What is the best way to get the data out of Spark in sparse form? It seems like there is only a toArray() method on sparse vectors, which outputs numpy arrays. However, the docs do say that scipy sparse arrays can be used in place of Spark sparse arrays. Keep in mind also that the tf_idf values
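One way to sketch the conversion on the driver side, using SciPy only: collect each row's sparse data and assemble a single CSR matrix. Here plain (size, indices, values) tuples stand in for pyspark.ml.linalg.SparseVector, which exposes the same three fields, so no dense arrays are ever materialised.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-ins for collected Spark SparseVectors: (size, indices, values).
rows = [
    (5, [0, 3], [1.0, 2.0]),
    (5, [1], [4.0]),
    (5, [2, 4], [0.5, 3.0]),
]

def to_csr(rows):
    """Build one CSR matrix from per-row (size, indices, values) triples.

    CSR stores a flat data/indices pair plus row pointers (indptr), so
    appending each row's indices and values is all that's needed."""
    size = rows[0][0]
    indptr = [0]
    indices, data = [], []
    for _, idx, val in rows:
        indices.extend(idx)
        data.extend(val)
        indptr.append(len(indices))
    return csr_matrix(
        (np.array(data), np.array(indices), np.array(indptr)),
        shape=(len(rows), size),
    )

mat = to_csr(rows)
```

The resulting csr_matrix can be fed directly to most scikit-learn estimators without ever densifying.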

Elasticsearch score disable IDF

北慕城南 submitted on 2019-11-30 06:56:56
I'm using ES for searching a huge list of human names, employing fuzzy search techniques. TF is applicable for scoring, but IDF is really not required in my case; it is really diluting the score. I still want TF and the field norm to be applied to the score. How do I disable/suppress IDF for my queries but keep TF and the field norm? I came across the "Disable IDF calculation" thread, but it did not help me. It also seems the constant-score query would not help in this case. Even when creating the index, we can put our own similarity calculation method into the settings part; if you need only
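One shape such a custom similarity can take (a sketch against the scripted-similarity feature available in Elasticsearch 6.x and later; the index, field, and similarity names are placeholders): a script that keeps a square-root TF term and the field-length norm but contains no IDF factor at all.

```
PUT /people
{
  "settings": {
    "index": {
      "similarity": {
        "tf_norm_only": {
          "type": "scripted",
          "script": {
            "source": "double tf = Math.sqrt(doc.freq); double norm = 1.0 / Math.sqrt(doc.length); return query.boost * tf * norm;"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "similarity": "tf_norm_only" }
    }
  }
}
```

Any field mapped with "similarity": "tf_norm_only" then scores by term frequency and length norm only, which matches the requirement in the question.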

TF*IDF for Search Queries

元气小坏坏 submitted on 2019-11-30 06:44:27
Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature Basically, I want to create a search query that searches through multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python. The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured that I'd calculate the TF*IDF scores of each document against each query, find the cosine similarity between them, and then rank them by
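The usual resolution (a sketch assuming scikit-learn; the documents and query below are made up) is that both vectors come from the same fitted vectorizer: fit on the document collection, then transform the query with that same vocabulary, so query and documents live in one shared tf-idf space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "apple pie recipe with cinnamon",
    "how to fix a car engine",
    "baking an apple tart at home",
]
query = "apple baking"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(documents)  # one tf-idf vector per document
query_vec = vec.transform([query])         # same vocabulary as the documents

scores = cosine_similarity(query_vec, doc_matrix).ravel()
ranking = scores.argsort()[::-1]           # best-matching document first
```

Query terms that never appear in the fitted documents simply get zero weight, which is why transform (not a second fit) must be used on the query.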

Lucene custom scoring for numeric fields

社会主义新天地 submitted on 2019-11-30 05:23:21
I would like to have, in addition to standard term search with tf-idf similarity over a full-text content field, scoring based on the "similarity" of numeric fields. This similarity will depend on the distance between the value in the query and in the document (e.g. Gaussian with m = [user input], s = 0.5). I.e., let's say documents represent people, and a person document has two fields: description (full text) and age (numeric). I want to find documents like description:(x y z) age:30, but with age not as a filter but rather as part of the score (for a person of age 30 the multiplier would be 1.0, for a 25-year-old person 0.8, etc.)
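The shape of such a score contribution can be sketched in plain Python as a Gaussian multiplier on the text score. The sigma below is an illustrative choice that happens to reproduce the question's 0.8-at-age-25 example; the s = 0.5 mentioned in the question would decay far too fast for ages.

```python
import math

def age_multiplier(age, target, sigma=7.5):
    """Gaussian decay: 1.0 at the target age, falling off with distance."""
    return math.exp(-((age - target) ** 2) / (2 * sigma ** 2))

def combined_score(text_score, age, target_age):
    """Text relevance (e.g. a tf-idf score) scaled by age proximity."""
    return text_score * age_multiplier(age, target_age)
```

With sigma = 7.5, age_multiplier(30, 30) is exactly 1.0 and age_multiplier(25, 30) comes out to about 0.80, matching the example in the question; in Lucene this multiplier would be applied via a custom score query over the numeric field rather than as a filter.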

How can I create a TF-IDF for Text Classification using Spark?

我只是一个虾纸丫 submitted on 2019-11-30 04:24:08
I have a CSV file with the following format:

product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]

The product_idX is an integer and the product_titleX is a string, for example: 453478692,Apple iPhone 4 8Go. I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes classifier in MLlib. I am using Spark for Scala so far, following the tutorials I have found on the official page and at Berkeley AmpCamp 3 and 4. So I'm reading the file: val file = sc.textFile("offers.csv") Then I'm mapping it
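The parsing and term-frequency steps can be sketched in pure Python. This is a toy version of the hashing-trick TF that MLlib's HashingTF applies (the sample lines and the bucket count are made up for illustration), not the Spark code itself:

```python
def parse_line(line):
    """Split 'product_id,product_title' into (int id, lowercase tokens)."""
    product_id, title = line.split(",", 1)
    return int(product_id), title.lower().split()

def hashing_tf(tokens, num_features=1 << 10):
    """Map each token to a bucket by hash and count occurrences.

    This is the hashing trick: no vocabulary needs to be built, at the
    cost of possible bucket collisions."""
    vec = [0.0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1.0
    return vec

lines = ["453478692,Apple iPhone 4 8Go", "453478693,Apple iPhone 4 16Go"]
parsed = [parse_line(l) for l in lines]
tf_vectors = [hashing_tf(tokens) for _, tokens in parsed]
```

The IDF step would then be computed across all TF vectors; in MLlib that is the separate IDF transformer applied after HashingTF.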

Calculate TF-IDF using sklearn for n-grams in python

给你一囗甜甜゛ submitted on 2019-11-30 04:08:56
Question: I have a vocabulary list that includes n-grams, as follows: myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding'] I want to use these words to calculate TF-IDF values. I also have a dictionary corpus as follows (key = recipe number, value = recipe): corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3:
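A sketch of how this fits TfidfVectorizer (assuming scikit-learn; the vocabulary and two of the recipes come from the question): pass the n-gram vocabulary explicitly, and widen ngram_range so that two-word entries like 'tim tam' are actually extracted from the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
corpus = {
    1: "making chocolates biscuit pudding easy first get your favourite "
       "biscuit chocolates",
    2: "tim tam drink new recipe that yummy and tasty more thicker than "
       "typical milkshake that uses normal chocolates",
}

# ngram_range=(1, 2) is essential: with the default (1, 1), bigram
# vocabulary entries such as 'tim tam' could never match any token.
vec = TfidfVectorizer(vocabulary=myvocabulary, ngram_range=(1, 2))
X = vec.fit_transform(corpus.values())  # shape (n_recipes, len(myvocabulary))
```

With an explicit vocabulary the matrix has exactly one column per vocabulary entry; vec.vocabulary_ maps each term to its column index.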

TF-IDF implementations in python

烈酒焚心 submitted on 2019-11-30 03:28:59
What are the standard tf-idf implementations/APIs available in Python? I've come across the one in NLTK. I want to know what other libraries provide this feature. Gunjan: there is a package called scikit-learn which calculates tf-idf scores; you can refer to my answer to the question "Python: tf-idf-cosine: to find document similarity" and also see the code there. Thanks. Try the libraries which implement the TF-IDF algorithm in Python: http://code.google.com/p/tfidf/ https://github.com/hrs/python-tf-idf Unfortunately, questions asking for a tool or library are off-topic on SO. There are
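For comparison with those libraries, the core computation fits in a few lines of standard-library Python. This toy sketch uses the common log-IDF variant; real implementations (scikit-learn, NLTK, Gensim) differ in smoothing and normalisation details.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: score} dict per doc."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    scored = []
    for doc in corpus:
        counts = Counter(doc)
        scored.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scored

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "far"]]
weights = tf_idf(docs)
```

As expected, a term unique to one document ("mat") outweighs a term shared across documents ("cat"), and a term in every document would score zero.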