text-mining

Error in extracting phrases using Gensim

我只是一个虾纸丫 submitted on 2019-12-20 03:50:51
Question: I am trying to get the bigrams in the sentences using Phrases in Gensim as follows:

from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
# print(sentence_stream)
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
for sent in sentence_stream:
    print(bigram_phraser[sent])  # excerpt is cut off here; printing the phrased output is the usual next step

Clustering: how to extract most distinguishing features?

Deadly submitted on 2019-12-19 11:57:38
Question: I have a set of documents that I am trying to cluster based on their vocabulary (that is, first making a corpus and then a sparse matrix with the DocumentTermMatrix command, and so on). To improve the clusters and to better understand what features/words make a particular document fall into a particular cluster, I would like to know what the most distinguishing features for each cluster are. There is an example of this in the Machine Learning with R book by Lantz, if you happen to know it - he
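One common approach, shown here as a minimal sketch rather than anything from the original thread: cluster the rows of the DocumentTermMatrix with k-means and rank terms by each cluster's centroid weights ('dtm' and centers = 3 are assumptions for illustration).

library(tm)

# dtm is assumed to be the DocumentTermMatrix built earlier; tf-idf weighting
# keeps very frequent words from dominating the centroids
m  <- as.matrix(weightTfIdf(dtm))
km <- kmeans(m, centers = 3)

# for each cluster, the terms with the largest centroid weights
# are the most distinguishing
top_terms <- apply(km$centers, 1, function(w) names(sort(w, decreasing = TRUE))[1:10])
top_terms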

Text clustering using SciPy hierarchical clustering in Python

て烟熏妆下的殇ゞ submitted on 2019-12-18 18:27:11
Question: I have a text corpus that contains 1000+ articles, each on a separate line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I used to do the clustering:

# Agglomerative Clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac

tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")
plt.clf()
hac.dendrogram(tree)
plt.show()

and I got this plot. Then I cut off the tree at the third level

R Text mining - how to change texts in an R data frame column into several columns with bigram frequencies?

回眸只為那壹抹淺笑 submitted on 2019-12-18 18:24:12
Question: In addition to the question "R Text mining - how to change texts in R data frame column into several columns with word frequencies?", I am wondering how I can make columns with bigram frequencies instead of just word frequencies. Again, many thanks in advance! This is the example data frame (thanks to Tyler Rinker):

   person sex adult                         state code
1     sam   m     0 Computer is fun. Not too fun.   K1
2    greg   m     0       No it's not, it's dumb.   K2
3 teacher   m     1            What should we do?   K3
4     sam   m     0          You liar, it stinks!
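A minimal sketch of one way to get there (not from the original thread), using the tidytext package; 'dat' is an assumed name for the data frame above:

library(dplyr)
library(tidytext)
library(tidyr)

dat %>%
  mutate(id = row_number()) %>%
  unnest_tokens(bigram, state, token = "ngrams", n = 2) %>%  # split 'state' into bigrams
  count(id, bigram) %>%                                      # bigram frequency per row
  pivot_wider(names_from = bigram, values_from = n,
              values_fill = 0)                               # one column per bigram

Each distinct bigram becomes a column, with zeros filled in for rows where it does not occur.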

Use R to convert PDF files to text files for text mining

廉价感情. submitted on 2019-12-18 10:24:31
Question: I have nearly one thousand PDF journal articles in a folder, and I need to text-mine the abstracts of all the articles in the folder. At the moment I am doing the following:

dest <- "~/A1.pdf"

# set path to pdftotext.exe and convert pdf to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = F)

# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt)

By this, I am converting one PDF file to
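A minimal sketch of batching the same conversion over the whole folder (the folder path is an assumption for illustration; wait = TRUE makes sure each .txt exists before it is read back):

exe  <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
pdfs <- list.files("~/articles", pattern = "\\.pdf$", full.names = TRUE)

for (dest in pdfs) {
  # quote both paths so the spaces in "Program Files (x86)" survive
  system(paste0("\"", exe, "\" \"", dest, "\""), wait = TRUE)
}

# read the converted .txt files back in for mining
txts  <- sub("\\.pdf$", ".txt", pdfs)
texts <- lapply(txts, readLines)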

build word co-occurrence edge list in R

余生颓废 submitted on 2019-12-18 04:18:06
Question: I have a chunk of sentences, and I want to build the undirected edge list of word co-occurrences and see the frequency of every edge. I took a look at the tm package but didn't find a similar function. Is there some package/script I can use? Thanks a lot!

Note: a word doesn't co-occur with itself, and a word that appears twice or more in the same sentence co-occurs with other words only once.

DF:
sentence_id  text
1            a b c d e
2            a b b e
3            b c d
4            a e
5            a
6            a a a

OUTPUT:
word1  word2  freq
a      b      2
a      c
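A minimal sketch of the counting logic in base R ('df' is assumed to hold the sentence_id and text columns above): take the distinct words of each sentence, enumerate the pairs, then tally identical pairs across sentences.

# pairs of distinct words within each sentence, tallied across sentences
edges <- do.call(rbind, lapply(strsplit(df$text, " "), function(words) {
  w <- sort(unique(words))          # no self-edges; repeated words count once per sentence
  if (length(w) < 2) return(NULL)   # single-word sentences yield no edges
  t(combn(w, 2))                    # all unordered word pairs
}))
edge_list <- aggregate(freq ~ word1 + word2,
                       data = data.frame(word1 = edges[, 1],
                                         word2 = edges[, 2],
                                         freq  = 1),
                       FUN  = sum)

Sorting each sentence's words before combn() keeps the edges undirected: "a b" and "b a" both land on the same (a, b) row.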

How do I clean twitter data in R?

醉酒当歌 submitted on 2019-12-17 22:43:11
Question: I extracted tweets from Twitter using the twitteR package and saved them into a text file. I have carried out the following on the corpus:

xx <- tm_map(xx, removeNumbers, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, stripWhitespace, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, removePunctuation, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, strip_retweets, lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, removeWords, stopwords("english"), lazy = TRUE, mc.cores = 1)

(using mc.cores = 1 and lazy = TRUE, as otherwise R on Mac is running
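Beyond the steps above, tweets usually also carry URLs and @mentions; a minimal sketch of stripping them with tm's content_transformer (these two transformers are additions for illustration, not part of the original question):

library(tm)

# custom cleaners built on gsub; content_transformer keeps the corpus structure intact
remove_urls     <- content_transformer(function(x) gsub("http\\S+", "", x))
remove_mentions <- content_transformer(function(x) gsub("@\\w+", "", x))

xx <- tm_map(xx, remove_urls,     lazy = TRUE, mc.cores = 1)
xx <- tm_map(xx, remove_mentions, lazy = TRUE, mc.cores = 1)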

Finding 2 & 3 word Phrases Using R TM Package

时光毁灭记忆、已成空白 submitted on 2019-12-17 06:27:53
Question: I am trying to find code that actually works to find the most frequently used two- and three-word phrases in the R text mining package (maybe there is another package for it that I do not know). I have been trying to use the tokenizer, but seem to have no luck. If you have worked on a similar situation in the past, could you post code that is tested and actually works? Thank you so much!

Answer 1: You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package
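The answer is cut off above. A minimal sketch of the general idea it describes - passing a custom n-gram tokenizer to DocumentTermMatrix - using RWeka's NGramTokenizer (RWeka is an assumption here; the truncated answer may have named a different package):

library(tm)
library(RWeka)  # assumption: the excerpt cuts off before naming a package

docs <- VCorpus(VectorSource(texts))  # 'texts' assumed: a character vector of documents

# custom tokenizer that emits every 2- and 3-word phrase
bigram_trigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))

dtm   <- DocumentTermMatrix(docs, control = list(tokenize = bigram_trigram))
freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freqs, 10)  # the most frequent phrases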

What does “document” mean in an NLP context?

僤鯓⒐⒋嵵緔 submitted on 2019-12-13 14:23:03
Question: As I was reading about tf–idf on Wikipedia, I was confused by what it means by the word "document". Does it mean a paragraph?

"The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of
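Written out, the definition being quoted is the standard inverse document frequency, where N is the total number of documents in the collection D and the denominator counts the documents containing term t:

\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}

In this formula a "document" is simply one unit of the collection D, at whatever granularity the corpus was split: full articles, paragraphs, or individual tweets all qualify.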

no applicable method for 'tm_map' applied to an object of class “character”

半世苍凉 submitted on 2019-12-13 13:11:55
Question: My data looks like this:

1. Good quality, love the taste, the only ramen noodles we buy but they're available at the local Korean grocery store for a bit less so no need to buy on Amazon really.
2. Great flavor and taste. Prompt delivery. We will reorder this and other products from this manufacturer.
3. Doesn't taste good to me.
4. Most delicious ramen I have ever had. Spicy and tasty. Great price too.
5. I have this on my subscription, my family loves this version. The taste is great by
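The error in the title means tm_map was called on a plain character vector, while tm's transformations dispatch on a corpus object. A minimal sketch of the usual remedy ('reviews' is an assumed name for the character vector of review texts above):

library(tm)

# wrap the character vector in a corpus before calling tm_map
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)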