term-document-matrix

R: TermDocumentMatrix - Error while creating

假如想象 submitted on 2019-12-14 02:56:22
Question: I am trying to get Twitter data and create a word cloud, but my code gives an error while creating the TermDocumentMatrix. My code is as follows:

```r
twitter_search_data <- searchTwitter(searchString = text_to_search, n = 500)
twitter_search_text <- sapply(twitter_search_data, function(x) x$getText())
twitter_search_corpus <- Corpus(VectorSource(twitter_search_text))
twitter_search_corpus <- tm_map(twitter_search_corpus, stripWhitespace, lazy = TRUE)
twitter_search_corpus <- tm_map(twitter_search_corpus,
```
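For comparison, the same collect-clean-count pipeline can be sketched in Python with scikit-learn; the tweet texts below are made-up stand-ins for the `searchTwitter()` results, so this illustrates the shape of the pipeline rather than the fix for the tm error:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the texts returned by searchTwitter()
tweets = ["R question about tm", "Building a term document matrix in R",
          "word clouds from tweets"]

# Lowercasing and whitespace handling happen inside the vectorizer
vectorizer = CountVectorizer(lowercase=True)
dtm = vectorizer.fit_transform(tweets)   # documents x terms (sparse)
tdm = dtm.T                              # terms x documents

print(tdm.shape)
```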

How to efficiently compute similarity between documents in a stream of documents

◇◆丶佛笑我妖孽 submitted on 2019-12-09 12:02:38
Question: I gather text documents (in Node.js), where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents keep arriving as a sort of stream? I currently use cosine similarity on the normalized frequency of the words within each document. I don't use TF-IDF (term frequency, inverse document frequency) because of the scalability issue, since I get more and more documents. Initially my
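A minimal Python sketch of the approach the question describes (the original is in Node.js; `norm_freq` and `cosine` are illustrative names). For a stream, each incoming document only needs to be normalized once and then compared against the stored frequency dicts:

```python
from collections import Counter
from math import sqrt

def norm_freq(words):
    """Normalized term frequencies: counts divided by document length."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc1 = norm_freq(["apple", "banana", "apple"])
doc2 = norm_freq(["apple", "cherry"])
print(cosine(doc1, doc2))
```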

R tm package create matrix of N most frequent terms

岁酱吖の submitted on 2019-12-09 11:10:42
Question: I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/data frame that has the 50 most frequently occurring terms. When I try to convert it to a dense matrix I get this error:

```r
> ap.m <- as.matrix(mydata.dtm)
Error: cannot allocate vector of size 2.0 Gb
```

So I tried converting to a sparse matrix using the Matrix package:

```r
> A <- as(mydata.dtm, "sparseMatrix")
Error in as(from, "CsparseMatrix") :
  no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
> B
```
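The memory blow-up comes from densifying the whole matrix just to rank terms; summing frequencies incrementally avoids the dense intermediate entirely. A sketch in Python (the toy documents are hypothetical stand-ins for the corpus behind `mydata.dtm`):

```python
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# Sum term frequencies across documents without building a dense matrix
totals = Counter()
for words in docs:
    totals.update(words)

top = totals.most_common(3)   # use 50 for the question's use case
print(top)
```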

R and tm package: create a term-document matrix with a dictionary of one or two words?

谁说我不能喝 submitted on 2019-12-09 07:01:59
Question: Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords. Web search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

- FAQs on the tm-package website
- finding 2 & 3 word phrases using r tm package
- counter ngram with tm package in r
- findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses
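With scikit-learn the same goal is reached by fixing the vocabulary and widening the n-gram range; a sketch (the dictionary and documents are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dictionary mixing unigrams and a bigram keyword
dictionary = ["oil", "crude oil", "price"]

docs = ["crude oil price rises", "oil price falls", "gas price steady"]

# ngram_range=(1, 2) makes the analyzer emit both single words and bigrams;
# vocabulary= restricts counting to the dictionary terms only
vec = CountVectorizer(vocabulary=dictionary, ngram_range=(1, 2))
tdm = vec.fit_transform(docs).T    # terms x documents

print(tdm.toarray())
```

Rows follow the dictionary order, so the bigram "crude oil" gets its own row alongside the unigrams.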

efficient Term Document Matrix with NLTK

若如初见. submitted on 2019-12-08 22:49:18
Question: I am trying to create a term-document matrix with NLTK and pandas. I wrote the following function:

```python
def fnDTM_Corpus(xCorpus):
    '''Create a term-document matrix from an NLTK corpus.'''
    import pandas as pd
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    return DTM.T
```

To run it:

```python
import nltk
from nltk.corpus import PlaintextCorpusReader
```
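A self-contained sketch of the same idea, using plain token lists in place of a PlaintextCorpusReader (the file names and texts are made up; `Counter` stands in for `nltk.FreqDist`, which is a `Counter` subclass):

```python
from collections import Counter
import pandas as pd

corpus = {
    "doc1.txt": ["the", "cat", "sat"],
    "doc2.txt": ["the", "dog", "ran"],
}

# One frequency distribution per document, then stack into a DataFrame;
# terms absent from a document come out as NaN, hence the fillna(0)
fd_list = [Counter(words) for words in corpus.values()]
dtm = pd.DataFrame(fd_list, index=list(corpus.keys())).fillna(0).astype(int)

print(dtm.T)   # terms as rows, documents as columns
```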

R DocumentTermMatrix loses results less than 100

此生再无相见时 submitted on 2019-12-08 05:58:09
Question: I'm trying to feed a corpus into DocumentTermMatrix (which I shorthand as DTM) to get term frequencies, but I noticed that DTM doesn't keep all terms and I don't know why! Check it out:

```r
A <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
B <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
C <- Corpus(VectorSource(c(A, B)))
inspect(C)
```

which gives:

```
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date
```
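The usual culprit in tm is the default minimum word length (the `wordLengths = c(3, Inf)` control option), which silently drops two-character tokens such as "95". scikit-learn's CountVectorizer has an analogous silent filter: its default `token_pattern` keeps only tokens of two or more word characters. A sketch of seeing and overriding it:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["95 94 89 a 7 102 103"]

# The default token_pattern r"(?u)\b\w\w+\b" silently drops 1-char tokens
default_vec = CountVectorizer()
print(sorted(default_vec.fit(doc).vocabulary_))   # 'a' and '7' are gone

# Override the pattern to keep every token, however short
keep_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
print(sorted(keep_all.fit(doc).vocabulary_))      # '7' and 'a' are kept
```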

R How do I keep punctuation with TermDocumentMatrix()

落花浮王杯 submitted on 2019-12-08 02:58:36
Question: I have a large data frame in which I am identifying patterns in strings and then extracting them. I have provided a small subset to illustrate my task. I generate my patterns by creating a TermDocumentMatrix with multiple words. I use these patterns with stri_extract and str_replace, from the stringi and stringr packages, to search within the 'punct_prob' data frame. My problem is that I need to keep punctuation intact within 'punct_prob$description' to maintain the literal meanings within
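The equivalent knob in Python's CountVectorizer is the tokenizer: the default pattern strips punctuation, but a custom `token_pattern` can keep punctuation attached to tokens. A sketch (the sample text and the whitespace-based pattern are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["approx. 5 mg/kg, i.e. the usual dose."]

# Match runs of non-whitespace so tokens keep their punctuation
vec = CountVectorizer(token_pattern=r"\S+", lowercase=True)
vec.fit(docs)
print(sorted(vec.vocabulary_))
```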

how to calculate term-document matrix?

醉酒当歌 submitted on 2019-12-07 09:05:50
Question: I know that a term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. I am using sklearn's CountVectorizer to extract features from strings (a text file) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:

```python
from sklearn.feature_extraction.text import CountVectorizer
```

Big Text Corpus breaks tm_map

老子叫甜甜 submitted on 2019-12-06 00:30:26
Question: I have been breaking my head over this one for the last few days. I searched all the SO archives and tried the suggested solutions, but just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99, etc., and want to run some basic text-mining operations, such as creating a document-term matrix and a term-document matrix, and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus,
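One common workaround when a big corpus breaks an in-memory pipeline is to fix the vocabulary in one pass and then build the matrix batch by batch, stacking sparse results. A sketch in Python with scikit-learn (the tiny document list stands in for the folders of txt files):

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack

docs = ["a big text corpus", "term document matrix",
        "co location of words", "more documents here"]

vec = CountVectorizer()
vec.fit(docs)   # one pass to fix the vocabulary

# Transform in small batches and stack the sparse pieces,
# so no dense intermediate ever materializes
batches = [docs[i:i + 2] for i in range(0, len(docs), 2)]
dtm = vstack([vec.transform(b) for b in batches])
print(dtm.shape)
```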

how to calculate term-document matrix?

那年仲夏 submitted on 2019-12-05 16:16:16
I know that a term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. I am using sklearn's CountVectorizer to extract features from strings (a text file) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my
```