tf-idf

NLP2

送分小仙女 Submitted on 2019-12-15 01:13:22
These notes are a personal summary of the Greedy Academy (贪心学院) NLP course. Outline: spell-error correction, word filtering, text representation, text similarity computation.

Spell-error correction (small case study)

Spell correction: given an erroneous user input, produce the output the user presumably intended, e.g. 天起 (input) --> 天气 (output), theris --> theirs, 机器学系 --> 机器学习. Method: compute the edit distance between the user input and candidate words.

Edit distance

Edit distance: the number of insert, delete, and replace operations needed to turn word 2 into word 1, with each operation costing 1.

Example:

    Input   Target   Cost
    therr   there    1 (replace r with e)
    therr   their    1 (replace r with i)
    therr   thesis   3 (replace r, r with s, i; insert an s)

Implementing edit distance: computing the edit distance is a dynamic programming problem (LeetCode: minimum edit distance).

How do we find the word with the minimum edit distance?

Method 1: loop over every word in the dictionary, compute its edit distance to the input, and output the word with the smallest distance. This has a high time complexity of O(V), where V is the number of words in the vocabulary.

Method 2: from the user input, generate the strings at edit distance 1 and 2, then choose the output by filtering: estimate the probability of each generated string and pick the most probable one as the output (see the sketch below).

How are strings at edit distance 1 or 2 generated? Edit distance 1: by insertion, deletion
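A minimal Python sketch of the two building blocks described above; the course post itself contains no code, so the function names and the lowercase alphabet here are illustrative:

    import string

    def edit_distance(a: str, b: str) -> int:
        # Dynamic programming: dp[i][j] = cost of turning a[:i] into b[:j].
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i                              # i deletions
        for j in range(len(b) + 1):
            dp[0][j] = j                              # j insertions
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                               dp[i][j - 1] + 1,        # insert
                               dp[i - 1][j - 1] + sub)  # replace
        return dp[len(a)][len(b)]

    def edits1(word: str):
        # All strings at edit distance 1 via insertion, deletion, replacement.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {l + r[1:] for l, r in splits if r}
        replaces = {l + c + r[1:] for l, r in splits if r for c in string.ascii_lowercase}
        inserts = {l + c + r for l, r in splits for c in string.ascii_lowercase}
        return deletes | replaces | inserts

    assert edit_distance("therr", "there") == 1
    assert edit_distance("therr", "thesis") == 3
    assert "there" in edits1("therr")

Candidates at edit distance 2 can be produced by applying edits1 to every string in edits1(word).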

Spark Data Mining - The TF-IDF Document Matrix

那年仲夏 Submitted on 2019-12-14 20:28:29
Preface

Building a term-document matrix is usually the first step of a text-mining algorithm. Its rows represent the terms that occur in the corpus (in real code, terms are integer-encoded), its columns represent the documents, and each cell holds the weight of the term in that document. Many models exist for weighting terms in documents, but the most widely used is the term frequency-inverse document frequency (TF-IDF) matrix.

TF-IDF

Let's first look at how TF-IDF computes each term's importance in a document. Assume the following values are available:

    termFrequencyInDoc: Int   // number of times the term occurs in the document
    totalTermsInDoc: Int      // total number of terms in the document
    termFreqInCorpus: Int     // number of distinct documents in the corpus containing the term
    totalDocs: Int            // total number of documents in the corpus

With these values, the importance of a term in a document can be computed as follows (the original snippet breaks off mid-function; the remainder is filled in from the standard TF-IDF definition):

    def termDocWeight(termFrequencyInDoc: Int, totalTermsInDoc: Int,
                      termFreqInCorpus: Int, totalDocs: Int): Double = {
      val tf = termFrequencyInDoc.toDouble / totalTermsInDoc
      // completed per the standard definition: idf = log(totalDocs / termFreqInCorpus)
      val docFreq = totalDocs.toDouble / termFreqInCorpus
      val idf = math.log(docFreq)
      tf * idf
    }

UserWarning: Your stop_words may be inconsistent with your preprocessing

不问归期 Submitted on 2019-12-13 15:23:26
Question: I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It is a combined file of 3 other txt files, joined with \n. After creating a tf-idf matrix I received this warning:

    UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid',
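The tokens in the warning ('abov', 'alreadi', ...) are stemmed stop words, which suggests the tutorial's tokenizer stems text while the built-in stop list stays unstemmed. A minimal sketch of how the warning arises, plus one common fix; the tokenizer below is illustrative, not the tutorial's exact code:

    from nltk.stem.snowball import SnowballStemmer
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

    stemmer = SnowballStemmer("english")

    def stemming_tokenizer(text):
        return [stemmer.stem(token) for token in text.split()]

    # Fitting this emits the UserWarning: "above" stems to "abov", which is
    # not in the built-in English stop list, so sklearn flags the mismatch.
    vectorizer = TfidfVectorizer(stop_words="english", tokenizer=stemming_tokenizer)

    # One common fix: run the stop list through the same stemmer.
    stemmed_stop_words = sorted({stemmer.stem(w) for w in ENGLISH_STOP_WORDS})
    vectorizer = TfidfVectorizer(stop_words=stemmed_stop_words,
                                 tokenizer=stemming_tokenizer)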

What does “document” mean in a NLP context?

僤鯓⒐⒋嵵緔 Submitted on 2019-12-13 14:23:03
Question: As I was reading about tf–idf on Wikipedia, I was confused by what it means by the word "document". Does it mean a paragraph? "The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of
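Written out, the quoted passage is the standard IDF formula for a corpus $D$ of $N$ documents:

    \mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}

The denominator counts whatever units the corpus has been split into, which is what the quote calls "documents".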

Vector Space Model - query vector [0, 0.707, 0.707] calculated

浪子不回头ぞ Submitted on 2019-12-13 07:53:43
Question: I'm reading the book "Introduction to Information Retrieval" (Christopher Manning) and I'm stuck on Chapter 6, where it introduces the query "jealous gossip", whose associated unit vector is given as [0, 0.707, 0.707] (https://nlp.stanford.edu/IR-book/html/htmledition/queries-as-vectors-1.html) over the terms affect, jealous and gossip. I tried to calculate it by computing the tf-idf, assuming that: - tf is equal to 1 for jealous and gossip - idf is always equal to 0
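For reference, a worked version of the arithmetic, assuming (as that section of the book does) log-weighted term frequency with no idf factor applied to the query, followed by cosine normalization: each query term occurring once gets weight $1 + \log_{10} 1 = 1$, the absent term affect gets 0, and

    \vec{q} = \frac{(0, 1, 1)}{\sqrt{0^2 + 1^2 + 1^2}} = \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right) \approx (0, 0.707, 0.707)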

Lucene 4.9: Get TF-IDF for a few selected documents from an Index

情到浓时终转凉″ Submitted on 2019-12-13 04:46:56
Question: I've seen this or similar questions a lot on Stack Overflow as well as other online sources. However, the corresponding part of Lucene's API has changed quite a lot, so to sum it up: I did not find any example that works on the latest Lucene version. What I have: a Lucene index + IndexReader + IndexSearcher, and a bunch of documents (and their IDs). What I want: for all terms that occur in at least one of the selected documents, I want to get the TF-IDF for each document. Or to say it

MemoryError when attempting to apply 'fit_transform()' on a TfidfVectorizer containing a Pandas DataFrame column (containing strings)

喜夏-厌秋 Submitted on 2019-12-13 02:34:08
Question: I'm attempting a similar operation as shown here. I begin by reading in two columns from a CSV file that contains 2405 rows, in the format: Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns use strings as data types.

    df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])

I have already pre-cleaned the data, and below is the format of the top 4 rows:

    [IN]  df.head()
    [OUT]    Year                        cleaned
    0        1909  acquaint hous receiv follow
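Since the title mentions a MemoryError inside fit_transform, here is a minimal sketch of the usual pattern and the usual cause; the file and column names follow the question, but the diagnosis is an assumption:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("ukgovClean.csv", encoding="utf-8", usecols=[0, 2])

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(df["cleaned"])   # scipy.sparse matrix, memory-friendly

    print(X.shape)   # (2405, vocabulary size)

    # A MemoryError at this scale usually comes from densifying the result,
    # e.g. X.toarray() or X.todense(), which allocates rows * vocabulary floats.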

Problems using a custom vocabulary for TfidfVectorizer scikit-learn

孤街浪徒 Submitted on 2019-12-12 14:19:14
Question: I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks, and I'm getting very weird results. The program runs fine when not using a custom vocabulary, and I'm satisfied with the clusters it creates. However, I have already identified a group of words (around 24,000) that I would like to use as a custom vocabulary. The words are stored in a SQL Server table. I have tried 2 approaches so far, but I get the same results in the end. The first one is to create a list, the second
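A minimal sketch of how a custom vocabulary is passed to TfidfVectorizer, with two common pitfalls that can produce weird results; the short word list is a stand-in for the 24,000 terms from SQL Server:

    from sklearn.feature_extraction.text import TfidfVectorizer

    custom_vocab = ["Clustering", "vocabulary", "scikit", "cluster"]  # stand-in list

    # Pitfall 1: duplicate entries in the vocabulary raise an error, so deduplicate.
    # Pitfall 2: by default the analyzer lowercases text before matching, so
    # mixed-case vocabulary entries never match anything; lowercase them too.
    vocab = sorted({w.lower() for w in custom_vocab})

    vectorizer = TfidfVectorizer(vocabulary=vocab)
    X = vectorizer.fit_transform(["documents to cluster go here", "..."])
    print(X.shape)   # (n_docs, len(vocab)); columns follow the given vocabulary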

Sparse vector RDD in pyspark

余生长醉 Submitted on 2019-12-12 12:26:31
Question: I have been implementing the TF-IDF method described here with Python/PySpark, using the feature module from MLlib: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html. I have a training set of 150 text documents and a testing set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both training and testing, i.e. bag-of-words representations, called tfidf_train and tfidf_test. The IDF is shared between both and is based solely on the training data. My question
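A minimal sketch of the setup being described, following the pattern in the linked MLlib docs; the toy documents and all names other than tfidf_train/tfidf_test are illustrative:

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext(appName="tfidf-sketch")

    # RDDs of tokenized documents (stand-ins for the 150 train / 80 test docs).
    train_docs = sc.parallelize([["spark", "mllib", "tfidf"], ["sparse", "vector", "rdd"]])
    test_docs = sc.parallelize([["spark", "tfidf", "test"]])

    hashingTF = HashingTF()
    tf_train = hashingTF.transform(train_docs)   # RDD of hashed SparseVectors
    tf_test = hashingTF.transform(test_docs)

    tf_train.cache()
    idf_model = IDF().fit(tf_train)              # IDF fit on training data only
    tfidf_train = idf_model.transform(tf_train)  # the shared IDF is applied to both
    tfidf_test = idf_model.transform(tf_test)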

tf idf similarity

淺唱寂寞╮ Submitted on 2019-12-12 11:48:00
Question: I am using TF/IDF to calculate similarity. For example, if I have the following two docs:

    Doc A => cat dog
    Doc B => dog sparrow

Intuitively their similarity should be 50%, but when I calculate the TF/IDF it comes out as follows.

    TF values for Doc A:   dog tf = 0.5, cat tf = 0.5
    TF values for Doc B:   dog tf = 0.5, sparrow tf = 0.5
    IDF values for Doc A:  dog idf = -0.4055, cat idf = 0
    IDF values for Doc B:  dog idf = -0.4055 (without the +1 formula: 0.6931), sparrow idf = 0
    TF/IDF value for Doc A: 0.5 x -0.4055 + 0.5 x 0 = -0
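A minimal sketch reproducing the question's numbers; the idf variant ln(N / (1 + df)) is an assumption inferred from the -0.4055 value, which equals ln(2/3) for a 2-document corpus:

    import math

    docs = {"A": ["cat", "dog"], "B": ["dog", "sparrow"]}
    N = len(docs)

    def tf(term, doc):
        return doc.count(term) / len(doc)

    def idf(term):
        df = sum(term in doc for doc in docs.values())
        return math.log(N / (1 + df))   # assumed smoothing; ln(2/3) = -0.4055 for "dog"

    for name, doc in docs.items():
        score = sum(tf(t, doc) * idf(t) for t in set(doc))
        print(name, round(score, 4))    # A -> -0.2027, B -> -0.2027

With only two documents, a term appearing in both gets a negative idf under this smoothing, which is why the per-document sums come out negative.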