tf-idf

NLP2

送分小仙女 Submitted on 2019-12-15 01:13:22
These notes are a personal summary of the Greedy Academy (贪心学院) NLP course. Outline: spell-error correction, word filtering, text representation, text similarity computation.

Spell-error correction (small case study)

Spell correction: given an erroneous user input, produce the output the user presumably intended, e.g. 天起 (input) --> 天气 (output), theris --> theirs, 机器学系 --> 机器学习. Method: compute the edit distance between the user input and candidate words.

Edit distance

Edit distance: the number of insert, delete, and replace operations needed to turn word 2 into word 1, with each operation costing 1.

Example:

    Input   Target   Cost
    therr   there    1 (replace r with e)
    therr   their    1 (replace r with i)
    therr   thesis   3 (replace r, r with s, i; insert an s)

Implementing edit distance: computing the edit distance is a dynamic programming problem (LeetCode: minimum edit distance).

How do we find the word with the minimum edit distance?

Method 1: loop over every word in the dictionary, compute its edit distance to the input, and output the word with the smallest distance. This has a high time complexity of O(V), where V is the number of words in the vocabulary.

Method 2: from the user input, generate the strings at edit distance 1 and 2, then choose the output by filtering: estimate the probability of each generated string and pick the most probable one as the output (see the sketch below).

How are strings at edit distance 1 or 2 generated? Edit distance 1: by insertion, deletion
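A minimal Python sketch of the two building blocks described above; the course post itself contains no code, so the function names and the lowercase alphabet here are illustrative:

    import string

    def edit_distance(a: str, b: str) -> int:
        # Dynamic programming: dp[i][j] = cost of turning a[:i] into b[:j].
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i                              # i deletions
        for j in range(len(b) + 1):
            dp[0][j] = j                              # j insertions
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                               dp[i][j - 1] + 1,        # insert
                               dp[i - 1][j - 1] + sub)  # replace
        return dp[len(a)][len(b)]

    def edits1(word: str):
        # All strings at edit distance 1 via insertion, deletion, replacement.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {l + r[1:] for l, r in splits if r}
        replaces = {l + c + r[1:] for l, r in splits if r for c in string.ascii_lowercase}
        inserts = {l + c + r for l, r in splits for c in string.ascii_lowercase}
        return deletes | replaces | inserts

    assert edit_distance("therr", "there") == 1
    assert edit_distance("therr", "thesis") == 3
    assert "there" in edits1("therr")

Candidates at edit distance 2 can be produced by applying edits1 to every string in edits1(word).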

Spark Data Mining - The TF-IDF Document Matrix

那年仲夏 Submitted on 2019-12-14 20:28:29
Preface

Building a term-document matrix is usually the first step of a text-mining algorithm. Its rows represent the terms that occur in the corpus (in real code, terms are integer-encoded), its columns represent the documents, and each cell holds the weight of the term in that document. Many models exist for weighting terms in documents, but the most widely used is the term frequency-inverse document frequency (TF-IDF) matrix.

TF-IDF

Let's first look at how TF-IDF computes each term's importance in a document. Assume the following values are available:

    termFrequencyInDoc: Int   // number of times the term occurs in the document
    totalTermsInDoc: Int      // total number of terms in the document
    termFreqInCorpus: Int     // number of distinct documents in the corpus containing the term
    totalDocs: Int            // total number of documents in the corpus

With these values, the importance of a term in a document can be computed as follows (the original snippet breaks off mid-function; the remainder is filled in from the standard TF-IDF definition):

    def termDocWeight(termFrequencyInDoc: Int, totalTermsInDoc: Int,
                      termFreqInCorpus: Int, totalDocs: Int): Double = {
      val tf = termFrequencyInDoc.toDouble / totalTermsInDoc
      // completed per the standard definition: idf = log(totalDocs / termFreqInCorpus)
      val docFreq = totalDocs.toDouble / termFreqInCorpus
      val idf = math.log(docFreq)
      tf * idf
    }

UserWarning: Your stop_words may be inconsistent with your preprocessing

不问归期 Submitted on 2019-12-13 15:23:26
Question: I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It is a combined file of 3 other txt files, joined with \n. After creating a tf-idf matrix I received this warning:

    UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid',
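The tokens in the warning ('abov', 'alreadi', ...) are stemmed stop words, which suggests the tutorial's tokenizer stems text while the built-in stop list stays unstemmed. A minimal sketch of how the warning arises, plus one common fix; the tokenizer below is illustrative, not the tutorial's exact code:

    from nltk.stem.snowball import SnowballStemmer
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

    stemmer = SnowballStemmer("english")

    def stemming_tokenizer(text):
        return [stemmer.stem(token) for token in text.split()]

    # Fitting this emits the UserWarning: "above" stems to "abov", which is
    # not in the built-in English stop list, so sklearn flags the mismatch.
    vectorizer = TfidfVectorizer(stop_words="english", tokenizer=stemming_tokenizer)

    # One common fix: run the stop list through the same stemmer.
    stemmed_stop_words = sorted({stemmer.stem(w) for w in ENGLISH_STOP_WORDS})
    vectorizer = TfidfVectorizer(stop_words=stemmed_stop_words,
                                 tokenizer=stemming_tokenizer)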

What does “document” mean in a NLP context?

僤鯓⒐⒋嵵緔 Submitted on 2019-12-13 14:23:03
Question: As I was reading about tf–idf on Wikipedia, I was confused by what it means by the word "document". Does it mean a paragraph? "The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of
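Written out, the quoted passage is the standard IDF formula for a corpus $D$ of $N$ documents:

    \mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}

The denominator counts whatever units the corpus has been split into, which is what the quote calls "documents".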

Vector Space Model - query vector [0, 0.707, 0.707] calculated

浪子不回头ぞ Submitted on 2019-12-13 07:53:43
Question: I'm reading the book "Introduction to Information Retrieval" (Christopher Manning) and I'm stuck on Chapter 6, where it introduces the query "jealous gossip", whose associated unit vector is given as [0, 0.707, 0.707] (https://nlp.stanford.edu/IR-book/html/htmledition/queries-as-vectors-1.html) over the terms affect, jealous and gossip. I tried to calculate it by computing the tf-idf, assuming that: - tf is equal to 1 for jealous and gossip - idf is always equal to 0
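For reference, a worked version of the arithmetic, assuming (as that section of the book does) log-weighted term frequency with no idf factor applied to the query, followed by cosine normalization: each query term occurring once gets weight $1 + \log_{10} 1 = 1$, the absent term affect gets 0, and

    \vec{q} = \frac{(0, 1, 1)}{\sqrt{0^2 + 1^2 + 1^2}} = \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right) \approx (0, 0.707, 0.707)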

Lucene 4.9: Get TF-IDF for a few selected documents from an Index

情到浓时终转凉″ Submitted on 2019-12-13 04:46:56
Question: I've seen this or similar questions a lot on Stack Overflow as well as other online sources. However, the corresponding part of Lucene's API has changed quite a lot, so to sum it up: I did not find any example that works on the latest Lucene version. What I have: a Lucene index + IndexReader + IndexSearcher, and a bunch of documents (and their IDs). What I want: for all terms that occur in at least one of the selected documents, I want to get the TF-IDF for each document. Or to say it

MemoryError when attempting to apply 'fit_transform()' on a TfidfVectorizer containing a Pandas DataFrame column (containing strings)

喜夏-厌秋 Submitted on 2019-12-13 02:34:08
Question: I'm attempting a similar operation as shown here. I begin by reading in two columns from a CSV file that contains 2405 rows, in the format: Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns use strings as data types.

    df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])

I have already pre-cleaned the data, and below is the format of the top 4 rows:

    [IN]  df.head()
    [OUT]    Year                        cleaned
    0        1909  acquaint hous receiv follow
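Since the title mentions a MemoryError inside fit_transform, here is a minimal sketch of the usual pattern and the usual cause; the file and column names follow the question, but the diagnosis is an assumption:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("ukgovClean.csv", encoding="utf-8", usecols=[0, 2])

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(df["cleaned"])   # scipy.sparse matrix, memory-friendly

    print(X.shape)   # (2405, vocabulary size)

    # A MemoryError at this scale usually comes from densifying the result,
    # e.g. X.toarray() or X.todense(), which allocates rows * vocabulary floats.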

Problems using a custom vocabulary for TfidfVectorizer scikit-learn

孤街浪徒 Submitted on 2019-12-12 14:19:14
Question: I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks, and I'm getting very weird results. The program runs fine when not using a custom vocabulary, and I'm satisfied with the clusters it creates. However, I have already identified a group of words (around 24,000) that I would like to use as a custom vocabulary. The words are stored in a SQL Server table. I have tried 2 approaches so far, but I get the same results in the end. The first one is to create a list, the second
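A minimal sketch of how a custom vocabulary is passed to TfidfVectorizer, with two common pitfalls that can produce weird results; the short word list is a stand-in for the 24,000 terms from SQL Server:

    from sklearn.feature_extraction.text import TfidfVectorizer

    custom_vocab = ["Clustering", "vocabulary", "scikit", "cluster"]  # stand-in list

    # Pitfall 1: duplicate entries in the vocabulary raise an error, so deduplicate.
    # Pitfall 2: by default the analyzer lowercases text before matching, so
    # mixed-case vocabulary entries never match anything; lowercase them too.
    vocab = sorted({w.lower() for w in custom_vocab})

    vectorizer = TfidfVectorizer(vocabulary=vocab)
    X = vectorizer.fit_transform(["documents to cluster go here", "..."])
    print(X.shape)   # (n_docs, len(vocab)); columns follow the given vocabulary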

Sparse vector RDD in pyspark

余生长醉 Submitted on 2019-12-12 12:26:31
Question: I have been implementing the TF-IDF method described here with Python/PySpark, using the feature module from MLlib: https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html. I have a training set of 150 text documents and a testing set of 80 text documents. I have produced a hashed TF-IDF RDD (of sparse vectors) for both training and testing, i.e. bag-of-words representations, called tfidf_train and tfidf_test. The IDF is shared between both and is based solely on the training data. My question
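A minimal sketch of the setup being described, following the pattern in the linked MLlib docs; the toy documents and all names other than tfidf_train/tfidf_test are illustrative:

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext(appName="tfidf-sketch")

    # RDDs of tokenized documents (stand-ins for the 150 train / 80 test docs).
    train_docs = sc.parallelize([["spark", "mllib", "tfidf"], ["sparse", "vector", "rdd"]])
    test_docs = sc.parallelize([["spark", "tfidf", "test"]])

    hashingTF = HashingTF()
    tf_train = hashingTF.transform(train_docs)   # RDD of hashed SparseVectors
    tf_test = hashingTF.transform(test_docs)

    tf_train.cache()
    idf_model = IDF().fit(tf_train)              # IDF fit on training data only
    tfidf_train = idf_model.transform(tf_train)  # the shared IDF is applied to both
    tfidf_test = idf_model.transform(tf_test)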

tf idf similarity

淺唱寂寞╮ Submitted on 2019-12-12 11:48:00
Question: I am using TF/IDF to calculate similarity. For example, if I have the following two docs:

    Doc A => cat dog
    Doc B => dog sparrow

Intuitively their similarity should be 50%, but when I calculate the TF/IDF it comes out as follows.

    TF values for Doc A:   dog tf = 0.5, cat tf = 0.5
    TF values for Doc B:   dog tf = 0.5, sparrow tf = 0.5
    IDF values for Doc A:  dog idf = -0.4055, cat idf = 0
    IDF values for Doc B:  dog idf = -0.4055 (without the +1 formula: 0.6931), sparrow idf = 0
    TF/IDF value for Doc A: 0.5 x -0.4055 + 0.5 x 0 = -0
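A minimal sketch reproducing the question's numbers; the idf variant ln(N / (1 + df)) is an assumption inferred from the -0.4055 value, which equals ln(2/3) for a 2-document corpus:

    import math

    docs = {"A": ["cat", "dog"], "B": ["dog", "sparrow"]}
    N = len(docs)

    def tf(term, doc):
        return doc.count(term) / len(doc)

    def idf(term):
        df = sum(term in doc for doc in docs.values())
        return math.log(N / (1 + df))   # assumed smoothing; ln(2/3) = -0.4055 for "dog"

    for name, doc in docs.items():
        score = sum(tf(t, doc) * idf(t) for t in set(doc))
        print(name, round(score, 4))    # A -> -0.2027, B -> -0.2027

With only two documents, a term appearing in both gets a negative idf under this smoothing, which is why the per-document sums come out negative.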