TF-IDF

Anonymous (unverified), submitted 2019-12-02 23:36:01
Source: https://blog.csdn.net/lz_peter/article/details/90676146

TF-IDF is an entry-level algorithm that every NLP engineer needs to master. Out of personal interest I had read a few blog posts introducing it before, but only came away with a rough idea. Recently, while reading Wu Jun's The Beauty of Mathematics, I found its treatment of TF-IDF gave me a much deeper understanding of the algorithm. My understanding is summarized as follows.

TF-IDF is a statistical method for evaluating how important a term is to a single document within a document collection or corpus. A term's importance ① increases in proportion to the number of times it appears in the document, but ② decreases in inverse proportion to how frequently it appears across the corpus. Factor ① is measured by the TF part of the algorithm, factor ② by the IDF part.
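The two factors just described can be made concrete with a short, dependency-free sketch (my own illustration, not from the original post; the toy documents and the plain log(N/df) variant of IDF are assumptions):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenized documents.

    TF is the normalized in-document count (factor 1 above);
    IDF is log(N / document frequency) (factor 2 above).
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in counts.items()})
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF -- and hence its TF-IDF -- is 0
```

Note how a term appearing in every document gets an IDF of log(1) = 0, which is exactly the "common words carry no document-specific meaning" intuition from the post.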

NLP: Several Keyword Extraction Algorithms

Anonymous (unverified), submitted 2019-12-02 22:56:40
TextRank

The idea behind PageRank is this: computing the importance of a web page amounts to computing the importance, or weight, of a node in a directed graph. A node's importance depends on its in-degree: the more other nodes link to it, the more important it is. Because some nodes have no in-links at all, a damping factor d is added so that every node keeps an importance greater than 0. Experiments show that with a damping factor of 0.85, a stable value is reached after a little over 100 iterations. The PageRank formula:

    S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj) / |Out(Vj)|

TextRank is adapted from PageRank with one extra parameter, a weight on each edge between nodes; another difference is that TextRank builds an undirected graph. Its formula:

    WS(Vi) = (1 - d) + d * sum over Vj in In(Vi) of ( w_ji / sum over Vk in Out(Vj) of w_jk ) * WS(Vj)

After word segmentation, the words of the text become the nodes of the graph, and the edges between nodes are built from word co-occurrence. Assign the nodes arbitrary initial values, then iteratively propagate weights between nodes until the values converge.

Advantage: keywords can be extracted from a single document on its own; no multi-document training corpus is required.
Drawbacks:

TF-IDF

TF-IDF = TF * IDF

TF (Term Frequency): term frequency. The basic idea of TF is that the more often a word appears in a document, the better it represents that document. Since the same word may have a higher raw count in a long document than in a short one, the term frequency needs to be normalized. Some common words, however, appear many times in every document without conveying the meaning of any particular one, which is why IDF is introduced.

IDF (Inverse Document Frequency): inverse document frequency.
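The TextRank procedure described above (co-occurrence graph, then iterative weight propagation) can be sketched in plain Python — my own illustration, not code from the original post; the window size of 2, damping of 0.85, and fixed iteration count are assumptions:

```python
from collections import defaultdict

def textrank(words, window=2, d=0.85, iterations=100):
    """Rank words of a tokenized document with TextRank.

    Builds an undirected co-occurrence graph, then iterates the
    weighted PageRank-style update until (approximately) stable.
    """
    # edge weights: count co-occurrences within `window` positions
    weight = defaultdict(float)
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            u, v = w, words[j]
            if u == v:
                continue
            weight[(u, v)] += 1.0
            weight[(v, u)] += 1.0
            neighbors[u].add(v)
            neighbors[v].add(u)
    # arbitrary initial score for every node, as the post says
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new = {}
        for u in score:
            s = 0.0
            for v in neighbors[u]:
                out = sum(weight[(v, x)] for x in neighbors[v])
                s += weight[(v, u)] / out * score[v]
            new[u] = (1 - d) + d * s
        score = new
    return sorted(score, key=score.get, reverse=True)

tokens = "graph ranking model graph model keyword extraction".split()
keywords = textrank(tokens)
```

Because the graph is undirected, In(Vi) and Out(Vi) coincide here, which is why the same neighbor set is used for both sums.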

Simple Text Similarity Analysis in Python (important)

Anonymous (unverified), submitted 2019-12-02 22:54:36
Reposted from: https://blog.csdn.net/xiexf189/article/details/79092629

Learning goals: analyze document similarity with the gensim package; use jieba for Chinese word segmentation; understand the TF-IDF model.
Environment: Python 3.6.0 | Anaconda 4.3.1 (64-bit). Tool: jupyter notebook.
Note: to keep the problem simple, this article does not remove stop words; in real applications they should be removed.

First import the segmentation library jieba and the text-similarity library gensim:

    import jieba
    from gensim import corpora, models, similarities

doc0 through doc7 below are a few very simple documents, which we will call the target documents; the task is to measure the similarity between doc_test (the test document) and these eight documents.

    doc0 = "我不喜欢上海"
    doc1 = "上海是一个好地方"
    doc2 = "北京是一个好地方"
    doc3 = "上海好吃的在哪里"
    doc4 = "上海好玩的在哪里"
    doc5 = "上海是好地方"
    doc6 = "上海路和上海人"
    doc7 = "喜欢小吃"
    doc_test = "我喜欢上海的小吃"

Segmentation. First, to simplify the operations, put the target documents into a single list all_doc:

    all_doc = []
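The post breaks off here. The remaining steps of such a pipeline (segment each document, build term vectors, weight by TF-IDF, compare with cosine similarity) can be sketched without gensim or jieba — a library-free illustration of what the tutorial computes, with per-character tokenization standing in for jieba segmentation (my assumption, not the original code):

```python
import math
from collections import Counter

docs = ["我不喜欢上海", "上海是一个好地方", "北京是一个好地方",
        "上海好吃的在哪里", "上海好玩的在哪里", "上海是好地方",
        "上海路和上海人", "喜欢小吃"]
doc_test = "我喜欢上海的小吃"

def tokenize(text):
    # crude stand-in for jieba.cut: one token per character
    return list(text)

def tfidf_vector(tokens, df, n_docs):
    # smoothed idf so unseen terms do not divide by zero
    counts = Counter(tokens)
    return {t: c * math.log((n_docs + 1) / (df.get(t, 0) + 1))
            for t, c in counts.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokenized = [tokenize(d) for d in docs]
df = Counter(t for toks in tokenized for t in set(toks))
vectors = [tfidf_vector(toks, df, len(docs)) for toks in tokenized]
test_vec = tfidf_vector(tokenize(doc_test), df, len(docs))
sims = [cosine(test_vec, v) for v in vectors]  # one score per target doc
```

In the gensim version, corpora.Dictionary, doc2bow, models.TfidfModel, and similarities.SparseMatrixSimilarity play the roles of the dictionary, vectors, weighting, and cosine comparison above.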

How to select stop words using tf-idf? (non-English corpus)

别等时光非礼了梦想., submitted 2019-12-02 22:53:54
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stop words and the best words for each document? I understand that a low tf-idf score for a given word and document means that it is not a good word for selecting that document. Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or some other threshold that you will have to tune. The best (as in
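The 50%-document-frequency filter the answer suggests can be sketched in a few lines (a hypothetical illustration; the toy corpus and the 0.5 threshold are placeholders you would tune):

```python
from collections import Counter

def find_stop_words(tokenized_docs, threshold=0.5):
    """Return terms appearing in more than `threshold` of the documents."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # count each term once per document
    n = len(tokenized_docs)
    return {t for t, c in df.items() if c / n > threshold}

docs = [doc.split() for doc in [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "a bird flew over the house",
]]
stops = find_stop_words(docs)  # "the" appears in all three documents
```

Note that with a 0.5 threshold, "cat" (2 of 3 documents) also gets flagged, which shows why the threshold needs tuning per corpus.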

Trying to get tf-idf weighting working in R

隐身守侯, submitted 2019-12-02 17:33:25
I am trying to do some very basic text analysis with the tm package and get some tf-idf scores; I'm running OS X (though I've tried this on Debian Squeeze with the same result). I've got a directory (which is my working directory) with a couple of text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know). R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3. Relevant bit of code:

    library('tm')
    corpus <- Corpus(DirSource('.'))
    dtm <- DocumentTermMatrix(corpus, control=list(weight=weightTfIdf))
    str

Python Tf idf algorithm

做~自己de王妃, submitted 2019-12-02 11:32:47
Question: I would like to find the most relevant words across a set of documents. I would like to run a TF-IDF algorithm over 3 documents and return a CSV file containing each word and its frequency. After that, I will take only the words with a high score and use them. I found this implementation that does what I need: https://github.com/mccurdyc/tf-idf/. I call that jar using the subprocess library. But there is a huge problem in that code: it makes a lot of mistakes when analyzing words. It mixes
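A dependency-free alternative to calling that jar can be sketched as follows (my own rough illustration, under assumptions not in the question: lowercase letter-run tokenization, raw-count TF, and a placeholder output file name):

```python
import csv
import math
import re
from collections import Counter

def top_words_to_csv(documents, out_path, top_n=20):
    """Score words by TF-IDF across `documents` and write the best to CSV."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # keep the best score each word reaches in any single document
    best = Counter()
    for toks in tokenized:
        counts = Counter(toks)
        for t, c in counts.items():
            score = c * math.log(n / df[t])
            if score > best[t]:
                best[t] = score
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "tfidf"])
        for word, score in best.most_common(top_n):
            writer.writerow([word, score])
    return best.most_common(top_n)

docs = ["apples and oranges", "apples and bananas", "cars and trucks"]
rows = top_words_to_csv(docs, "tfidf_scores.csv")
```

Words appearing in every document (here "and") score zero and drop out, which is exactly the filtering the question is after.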

tfidf.transform() function not returning correct values

浪尽此生, submitted 2019-12-02 03:20:49
Question: I am trying to fit a tfidf vectorizer on a certain text corpus and then use the same vectorizer to find the sum of the tfidf values of new text. However, the sum values are not as expected. Below is the example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    text = ["I am new to python and R , how can anyone help me",
            "why is no one able to crack the python code without help"]
    tf = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
    tf.fit_transform(text)
    zip(tf.get_feature_names(), tf.idf_)

    [(u'able', 1.4054651081081644), (u'code', 1.4054651081081644),
     (u'crack', 1.4054651081081644), (u'help', 1.0),
     (u'new', 1.4054651081081644), (u'python'
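For reference, the 1.4054651081081644 values above are exactly what scikit-learn's default smoothed IDF produces: idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the term's document frequency (this is the documented default with smooth_idf=True; the sketch below just reproduces the numbers, it is not the asker's code):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # scikit-learn's default: idf = ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# two documents; a term appearing in only one of them ('able', 'crack', ...):
idf_rare = smoothed_idf(2, 1)    # -> 1.4054651081081644
# a term appearing in both documents ('help'):
idf_common = smoothed_idf(2, 2)  # -> 1.0
```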

Why is the value of TF-IDF different from IDF_?

半世苍凉, submitted 2019-12-02 02:45:07
Question: Why is the value of the vectorized corpus different from the value obtained through the idf_ attribute? Shouldn't the idf_ attribute just return the inverse document frequency (IDF) in the same form as it appears in the vectorized corpus?

    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = ["This is very strange", "This is very nice"]
    vectorizer = TfidfVectorizer()
    corpus = vectorizer.fit_transform(corpus)
    print(corpus)

Corpus vectorized:

    (0, 2)  0.6300993445179441
    (0, 4)  0.44832087319911734
    (0, 0)  0.44832087319911734
    (0, 3)  0.44832087319911734
    (1, 1)  0.6300993445179441
    (1, 4)  0
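The difference comes from normalization: idf_ holds the raw smoothed IDF values, while fit_transform also multiplies by TF and then L2-normalizes each row (TfidfVectorizer's documented default, norm='l2'). The sketch below reproduces the 0.6300993445179441 entry by hand rather than using the asker's code:

```python
import math

# two documents: "This is very strange" / "This is very nice"
# feature order (alphabetical): is, nice, strange, this, very
n_docs = 2

def smoothed_idf(df):
    # scikit-learn's default smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

# raw tf-idf for row 0 ("This is very strange"): every term occurs once,
# so each entry is just its idf; only 'strange' is unique to this document
row = [smoothed_idf(2), smoothed_idf(1), smoothed_idf(2), smoothed_idf(2)]
norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row

strange_weight = smoothed_idf(1) / norm  # matches (0, 2) 0.6300993445179441
others_weight = smoothed_idf(2) / norm   # matches (0, 4) 0.44832087319911734
```

So idf_ and the vectorized values agree once each row of the TF-IDF matrix is divided by its L2 norm.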
