term-document-matrix

R: TermDocumentMatrix - Error while creating

假如想象 submitted on 2019-12-14 02:56:22
Question: I am trying to get Twitter data and create a word cloud, but my code gives an error while creating the TermDocumentMatrix. My code is as follows:

```r
twitter_search_data <- searchTwitter(searchString = text_to_search, n = 500)
twitter_search_text <- sapply(twitter_search_data, function(x) x$getText())
twitter_search_corpus <- Corpus(VectorSource(twitter_search_text))
twitter_search_corpus <- tm_map(twitter_search_corpus, stripWhitespace, lazy = TRUE)
twitter_search_corpus <- tm_map(twitter_search_corpus,
```
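For comparison, the same collect-clean-count pipeline can be sketched in Python with scikit-learn; the tweet texts below are made-up stand-ins for the `searchTwitter()` results, so this illustrates the shape of the pipeline rather than the fix for the tm error:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the texts returned by searchTwitter()
tweets = ["R question about tm", "Building a term document matrix in R",
          "word clouds from tweets"]

# Lowercasing and whitespace handling happen inside the vectorizer
vectorizer = CountVectorizer(lowercase=True)
dtm = vectorizer.fit_transform(tweets)   # documents x terms (sparse)
tdm = dtm.T                              # terms x documents

print(tdm.shape)
```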

How to efficiently compute similarity between documents in a stream of documents

◇◆丶佛笑我妖孽 submitted on 2019-12-09 12:02:38
Question: I gather text documents (in Node.js), where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents keep arriving as a sort of stream? I currently use cosine similarity on the normalized frequency of the words within each document. I don't use TF-IDF (term frequency, inverse document frequency) because of the scalability issue, since I get more and more documents. Initially my
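A minimal Python sketch of the approach the question describes (the original is in Node.js; `norm_freq` and `cosine` are illustrative names). For a stream, each incoming document only needs to be normalized once and then compared against the stored frequency dicts:

```python
from collections import Counter
from math import sqrt

def norm_freq(words):
    """Normalized term frequencies: counts divided by document length."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc1 = norm_freq(["apple", "banana", "apple"])
doc2 = norm_freq(["apple", "cherry"])
print(cosine(doc1, doc2))
```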

R tm package create matrix of N most frequent terms

岁酱吖の submitted on 2019-12-09 11:10:42
Question: I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/data frame that has the 50 most frequently occurring terms. When I try to convert it to a dense matrix I get this error:

```r
> ap.m <- as.matrix(mydata.dtm)
Error: cannot allocate vector of size 2.0 Gb
```

So I tried converting to a sparse matrix using the Matrix package:

```r
> A <- as(mydata.dtm, "sparseMatrix")
Error in as(from, "CsparseMatrix") :
  no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
> B
```
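The memory blow-up comes from densifying the whole matrix just to rank terms; summing frequencies incrementally avoids the dense intermediate entirely. A sketch in Python (the toy documents are hypothetical stand-ins for the corpus behind `mydata.dtm`):

```python
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# Sum term frequencies across documents without building a dense matrix
totals = Counter()
for words in docs:
    totals.update(words)

top = totals.most_common(3)   # use 50 for the question's use case
print(top)
```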

R and tm package: create a term-document matrix with a dictionary of one or two words?

谁说我不能喝 submitted on 2019-12-09 07:01:59
Question: Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords. Web search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

- FAQs on the tm-package website
- finding 2 & 3 word phrases using r tm package
- counter ngram with tm package in r
- findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses
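With scikit-learn the same goal is reached by fixing the vocabulary and widening the n-gram range; a sketch (the dictionary and documents are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dictionary mixing unigrams and a bigram keyword
dictionary = ["oil", "crude oil", "price"]

docs = ["crude oil price rises", "oil price falls", "gas price steady"]

# ngram_range=(1, 2) makes the analyzer emit both single words and bigrams;
# vocabulary= restricts counting to the dictionary terms only
vec = CountVectorizer(vocabulary=dictionary, ngram_range=(1, 2))
tdm = vec.fit_transform(docs).T    # terms x documents

print(tdm.toarray())
```

Rows follow the dictionary order, so the bigram "crude oil" gets its own row alongside the unigrams.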

efficient Term Document Matrix with NLTK

若如初见. submitted on 2019-12-08 22:49:18
Question: I am trying to create a term-document matrix with NLTK and pandas. I wrote the following function:

```python
def fnDTM_Corpus(xCorpus):
    '''Create a term-document matrix from an NLTK corpus.'''
    import pandas as pd
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    return DTM.T
```

To run it:

```python
import nltk
from nltk.corpus import PlaintextCorpusReader
```
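A self-contained sketch of the same idea, using plain token lists in place of a PlaintextCorpusReader (the file names and texts are made up; `Counter` stands in for `nltk.FreqDist`, which is a `Counter` subclass):

```python
from collections import Counter
import pandas as pd

corpus = {
    "doc1.txt": ["the", "cat", "sat"],
    "doc2.txt": ["the", "dog", "ran"],
}

# One frequency distribution per document, then stack into a DataFrame;
# terms absent from a document come out as NaN, hence the fillna(0)
fd_list = [Counter(words) for words in corpus.values()]
dtm = pd.DataFrame(fd_list, index=list(corpus.keys())).fillna(0).astype(int)

print(dtm.T)   # terms as rows, documents as columns
```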

R DocumentTermMatrix loses results less than 100

此生再无相见时 submitted on 2019-12-08 05:58:09
Question: I'm trying to feed a corpus into DocumentTermMatrix (which I shorthand as DTM) to get term frequencies, but I noticed that DTM doesn't keep all terms and I don't know why! Check it out:

```r
A <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
B <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
C <- Corpus(VectorSource(c(A, B)))
inspect(C)
```

which gives:

```
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date
```
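The usual culprit in tm is the default minimum word length (the `wordLengths = c(3, Inf)` control option), which silently drops two-character tokens such as "95". scikit-learn's CountVectorizer has an analogous silent filter: its default `token_pattern` keeps only tokens of two or more word characters. A sketch of seeing and overriding it:

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["95 94 89 a 7 102 103"]

# The default token_pattern r"(?u)\b\w\w+\b" silently drops 1-char tokens
default_vec = CountVectorizer()
print(sorted(default_vec.fit(doc).vocabulary_))   # 'a' and '7' are gone

# Override the pattern to keep every token, however short
keep_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
print(sorted(keep_all.fit(doc).vocabulary_))      # '7' and 'a' are kept
```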

R How do I keep punctuation with TermDocumentMatrix()

落花浮王杯 submitted on 2019-12-08 02:58:36
Question: I have a large data frame in which I am identifying patterns in strings and then extracting them. I have provided a small subset to illustrate my task. I generate my patterns by creating a TermDocumentMatrix with multiple words. I use these patterns with stri_extract and str_replace, from the stringi and stringr packages, to search within the 'punct_prob' data frame. My problem is that I need to keep punctuation intact within 'punct_prob$description' to maintain the literal meanings within
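The equivalent knob in Python's CountVectorizer is the tokenizer: the default pattern strips punctuation, but a custom `token_pattern` can keep punctuation attached to tokens. A sketch (the sample text and the whitespace-based pattern are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["approx. 5 mg/kg, i.e. the usual dose."]

# Match runs of non-whitespace so tokens keep their punctuation
vec = CountVectorizer(token_pattern=r"\S+", lowercase=True)
vec.fit(docs)
print(sorted(vec.vocabulary_))
```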

how to calculate term-document matrix?

醉酒当歌 submitted on 2019-12-07 09:05:50
Question: I know that a term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. I am using sklearn's CountVectorizer to extract features from strings (a text file) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:

```python
from sklearn.feature_extraction.text import CountVectorizer
```

Big Text Corpus breaks tm_map

老子叫甜甜 submitted on 2019-12-06 00:30:26
Question: I have been breaking my head over this one for the last few days. I searched all the SO archives and tried the suggested solutions, but just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99, etc., and want to run some basic text-mining operations, such as creating a document-term matrix and a term-document matrix, and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus,
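One common workaround when a big corpus breaks an in-memory pipeline is to fix the vocabulary in one pass and then build the matrix batch by batch, stacking sparse results. A sketch in Python with scikit-learn (the tiny document list stands in for the folders of txt files):

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack

docs = ["a big text corpus", "term document matrix",
        "co location of words", "more documents here"]

vec = CountVectorizer()
vec.fit(docs)   # one pass to fix the vocabulary

# Transform in small batches and stack the sparse pieces,
# so no dense intermediate ever materializes
batches = [docs[i:i + 2] for i in range(0, len(docs), 2)]
dtm = vstack([vec.transform(b) for b in batches])
print(dtm.shape)
```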

how to calculate term-document matrix?

那年仲夏 submitted on 2019-12-05 16:16:16
I know that a term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. I am using sklearn's CountVectorizer to extract features from strings (a text file) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my
```