tf-idf

Calculating cosine similarity by featurizing the text into vectors using tf-idf

烂漫一生 submitted on 2019-12-06 09:19:50
Question: I'm new to Apache Spark and want to find similar text within a collection of texts. I have tried the following. I have two RDDs. The first RDD contains incomplete addresses:

[0,541 Suite 204, Redwood City, CA 94063]
[1,6649 N Blue Gum St, New Orleans,LA, 70116]
[2,#69, Los Angeles, Los Angeles, CA, 90034]
[3,98 Connecticut Ave Nw, Chagrin Falls]
[4,56 E Morehead Webb, TX, 78045]

The second RDD contains correct addresses:

[0,541 Jefferson Avenue, Suite 204, Redwood City, CA 94063]
[1,6649 N Blue
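A minimal sketch of one way to approach this: featurize records with Spark MLlib's tf-idf and compare two of them by raw cosine similarity. The toy addresses, the feature count, and the choice of the RDD-based pyspark.mllib API are my assumptions, not taken from the question:

import math
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext.getOrCreate()

# Toy corpus standing in for both RDDs, tokenized into lowercase word lists
records = sc.parallelize([
    "541 Suite 204 Redwood City CA 94063".lower().split(),
    "541 Jefferson Avenue Suite 204 Redwood City CA 94063".lower().split(),
    "6649 N Blue Gum St New Orleans LA 70116".lower().split(),
])

# Hash each token list to a sparse term-frequency vector, then reweight by idf
tf = HashingTF(numFeatures=1 << 16).transform(records)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

v1, v2, _ = tfidf.collect()
# Raw cosine similarity between the first two tf-idf vectors
cos = v1.dot(v2) / (math.sqrt(v1.dot(v1)) * math.sqrt(v2.dot(v2)))
print(cos)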

The TF-IDF Algorithm

不想你离开。 submitted on 2019-12-06 09:19:16
wiki: https://zh.wikipedia.org/wiki/Tf-idf
Reference: https://zhuanlan.zhihu.com/p/31197209

tf-idf (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a word is to a document within a document collection or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with its frequency across the corpus. Various forms of tf-idf weighting are widely used by search engines as a measure or ranking of the relevance between a document and a user query. Besides tf-idf, web search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.

In a given document, the term frequency (tf) is the frequency with which a given term appears in that document. This number is a normalization of the raw term count, to prevent a bias toward longer documents (the same term will tend to have a higher raw count in a long document than in a short one, regardless of how important it is). For a term $t_i$ in a particular document $d_j$, its importance can be expressed as

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the term in document $d_j$, and the denominator is the sum of the occurrence counts of all terms in document $d_j$. Inverse document frequency (inverse document
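As a quick illustration of the formulas above, a self-contained sketch that computes tf, idf, and their product by hand; the three-document corpus is my own toy data, not from the article:

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog ate my homework".split(),
    "the cat ate the fish".split(),
]

def tf(term, doc):
    # term frequency: count of the term normalized by document length
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # inverse document frequency: log of (N / number of docs containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

# tf-idf weight of "cat" in the first document
print(tf("cat", docs[0]) * idf("cat", docs))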

TFIDF Vectorizer giving error

淺唱寂寞╮ submitted on 2019-12-06 06:58:06
I am trying to carry out text classification on a set of files using TF-IDF and an SVM. The features are to be selected three words at a time. My data files are already in the format: angel eyes has, each one for, on its own. There are no stop words, and neither lemmatization nor stemming can be applied. I want each feature to be selected as: angel eyes has ... The code that I have written is given below:

import os
import sys
import numpy
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text
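One common way to get exactly three-word features out of scikit-learn is the ngram_range parameter of TfidfVectorizer; a minimal sketch with made-up texts and labels, since the question's actual data and pipeline are truncated above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "angel eyes has each one for on its own",
    "some other short training document goes here",
]
labels = [0, 1]

# ngram_range=(3, 3) makes every feature a sequence of exactly three
# consecutive words, e.g. "angel eyes has"
vectorizer = TfidfVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(texts)

clf = LinearSVC().fit(X, labels)
# on scikit-learn older than 1.0, use get_feature_names() instead
print(vectorizer.get_feature_names_out()[:5])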

How to calculate cosine similarity with tf-idf using Lucene and Java

南笙酒味 submitted on 2019-12-06 06:00:37
I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. Can someone tell me what support Lucene provides for computing this? Which quantities can I calculate directly from Lucene (can I get tf and idf directly through some method in Lucene?), and how do I compute cosine similarity with Lucene (is there a function that directly returns the cosine similarity if I pass it the query vector and the document vector)? Thanks in advance. Answer: Lucene already uses a tweaked version of cosine similarity, so if you need the raw CS itself, it's probably
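For the raw quantity itself, independent of Lucene's scoring internals, a minimal language-agnostic sketch in Python; the two vectors are illustrative values of my own, standing in for tf-idf vectors over a shared vocabulary:

import math

def cosine_similarity(v1, v2):
    # dot(v1, v2) / (||v1|| * ||v2||), the raw cosine of the angle
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# tf-idf vectors for a query and a document over the same vocabulary
query_vec = [0.0, 1.2, 0.4]
doc_vec = [0.3, 0.9, 0.0]
print(cosine_similarity(query_vec, doc_vec))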

TfIdfVectorizer: How does the vectorizer with fixed vocab deal with new words?

大憨熊 submitted on 2019-12-06 05:42:08
Question: I'm working on a corpus of ~100k research papers, considering three fields: plaintext, title, and abstract. I used TfidfVectorizer to get a tf-idf representation of the plaintext field and fed the resulting vocabulary back into the vectorizers for title and abstract, to ensure that all three representations work over the same vocabulary. My idea was that since the plaintext field is much bigger than the other two, its vocabulary will most probably cover all the words in the other fields.
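For reference, a minimal sketch of that fixed-vocabulary pattern with toy documents of my own: when TfidfVectorizer is given a fixed vocabulary, any word outside it is simply dropped at transform time:

from sklearn.feature_extraction.text import TfidfVectorizer

plaintext_vec = TfidfVectorizer()
plaintext_vec.fit([
    "full body text of the first paper",
    "full body text of the second paper",
])

# Reuse the plaintext vocabulary for the title field; any word not in
# the fixed vocabulary is silently ignored at transform time
title_vec = TfidfVectorizer(vocabulary=plaintext_vec.vocabulary_)
X = title_vec.fit_transform(["first paper with an unseen neologism"])
print(X.shape)  # the number of columns matches the plaintext vocabulary size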

AttributeError: getfeature_names not found; using scikit-learn

天涯浪子 submitted on 2019-12-06 04:51:23
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer = vectorizer.fit(word_data)
freq_term_mat = vectorizer.transform(word_data)

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf = tfidf.fit(freq_term_mat)
Ttf_idf_matrix = tfidf.transform(freq_term_mat)
voc_words = Ttf_idf_matrix.getfeature_names()
print "The num of words = ", len(voc_words)

When I run the program containing this piece of code, I get the following error:

Traceback (most recent call last): File "vectorize_text.py", line 87, in voc
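The likely fix: the vocabulary accessor lives on the vectorizer, not on the transformed sparse matrix, and the method name has underscores. A sketch with a placeholder corpus of my own:

from sklearn.feature_extraction.text import CountVectorizer

word_data = ["some sample text", "more sample text here"]  # placeholder corpus
vectorizer = CountVectorizer().fit(word_data)

# The matrix returned by transform() carries no vocabulary; ask the
# vectorizer instead. On scikit-learn >= 1.0 the method is
# get_feature_names_out(); older versions use get_feature_names().
voc_words = vectorizer.get_feature_names_out()
print("The num of words =", len(voc_words))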

Python and tfidf algorithm, make it faster?

落爺英雄遲暮 submitted on 2019-12-06 03:42:37
Question: I am implementing the tf-idf algorithm in a web application using Python, but it runs extremely slow. What I basically do is: 1) Create two dictionaries. First dictionary: key (document id), value (list of all words found in the doc, including repeats). Second dictionary: key (document id), value (set containing the unique words of the doc). Now, when a user requests the tf-idf results for a document d, what I do is: 2) Loop over the unique words in the second dictionary for document d, and
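The usual speedup for this pattern is to precompute the document frequencies once, instead of rescanning the whole corpus on every user request; a sketch under that assumption, with toy data of my own:

import math
from collections import Counter

# Toy corpus: document id -> list of tokens (placeholder data)
docs = {1: "a b a c".split(), 2: "a c d".split(), 3: "b d e".split()}

# Precompute document frequencies and idf once, up front
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))
n_docs = len(docs)
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(doc_id):
    # Per-request work is now linear in the document's length only
    tokens = docs[doc_id]
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}

print(tfidf(1))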

A Text Clustering Algorithm Based on the Text Vector Space Model

一曲冷凌霜 submitted on 2019-12-06 01:51:24
A Text Clustering Algorithm Based on the Text Vector Space Model
@[vsm|vector space model|text similarity]
Original article: http://www.houzhuo.net/archives/51.html

The idea of the VSM is simple: it turns the processing of text content into computations on vectors in a vector space, using spatial similarity to express semantic similarity in an intuitive way.

Contents:
Text clustering
The vector space model (VSM)
Text preprocessing
Obtaining each document's term frequencies
Producing vectors of equal length
Normalization
IDF weighting
TF-IDF weighting and normalization
Computing the angle between vectors

Text clustering rests mainly on the clustering hypothesis: documents of the same class are highly similar, while documents of different classes are not. As an unsupervised machine learning method, clustering needs no training phase and no manual pre-labeling of document categories, so it offers high flexibility and automation and has become an important means of organizing, summarizing, and navigating textual information.

The vector space model (VSM): every text can be represented as a vector. Each dimension of the vector corresponds to an independent phrase or single word that appears in the document, and each term is assigned a weight (the simplest is the term frequency, or the widely known tf-idf weight). A document is thus converted into an n-dimensional vector.

The vector angle formula, $\cos\theta = \frac{\vec a \cdot \vec b}{\|\vec a\|\,\|\vec b\|}$, familiar from secondary school, then gives the angle between two vectors; the smaller the angle, the higher the similarity. Of course, the two vectors must first be brought to the same dimensionality before comparison (demonstrated in the code below).

Text preprocessing: __author__ = 'iothz'
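A compact sketch of the pipeline the contents list describes (shared vocabulary, tf weighting by document length, idf weighting, L2 normalization, cosine of the angle), using a toy corpus of my own rather than the article's code:

import math
from collections import Counter

docs = [
    "文本 聚类 算法 研究".split(),
    "文本 相似 度 计算".split(),
]

# Shared vocabulary so every document maps to a vector of the same length
vocab = sorted({t for d in docs for t in d})
idf = {t: math.log(len(docs) / sum(1 for d in docs if t in d)) for t in vocab}

def tfidf_vector(doc):
    counts = Counter(doc)
    vec = [counts[t] / len(doc) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalized tf-idf vector

v1, v2 = (tfidf_vector(d) for d in docs)
# Cosine of the angle between the two normalized vectors
print(sum(a * b for a, b in zip(v1, v2)))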

How can I implement tf-idf and cosine similarity in Lucene?

喜欢而已 submitted on 2019-12-05 17:44:04
How can I implement tf-idf and cosine similarity in Lucene? I'm using Lucene 4.2. The program that I've created does not use tf-idf or cosine similarity; it only uses TopScoreDocCollector.

import com.mysql.jdbc.Statement;
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriter;
import java.sql.DriverManager;
import java.sql.Connection;
import java.sql.ResultSet;
import

From TF-IDF to LDA clustering in Spark (PySpark)

那年仲夏 submitted on 2019-12-05 16:57:48
I am trying to cluster tweets stored in the format key,listofwords. My first step has been to extract TF-IDF values for the lists of words, using a DataFrame:

dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)

# Define data frame schema
fields = [StructField('key', StringType(), False),
          StructField('content', StringType(), False)]
schema = StructType(fields)

# Data in format <key>,<listofwords>
file_temp = file.map(lambda l: l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)

# Extract TF-IDF, from https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol=
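For completeness, a sketch of the whole TF-IDF-to-LDA chain on a modern Spark (2.x+, where LDA lives in pyspark.ml.clustering; in the 1.5.2 docs the question links to, LDA was still available only in the RDD-based mllib API). The toy rows and the parameter values are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("k1", "some tweet words here"), ("k2", "other tweet words there")],
    ["key", "content"],
)

tokens = Tokenizer(inputCol="content", outputCol="words").transform(df)
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=1 << 16).transform(tokens)
tfidf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf).transform(tf)

# LDA reads its input from the "features" column by default
model = LDA(k=2, maxIter=10).fit(tfidf)
model.describeTopics().show()

One design note: strictly speaking, LDA models raw term counts, so many pipelines feed it the HashingTF or CountVectorizer output directly and keep the IDF reweighting only for similarity tasks.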