corpus

Split a huge dataframe in many smaller dataframes to create a corpus in r

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-02 09:29:50
Question: I need to create a corpus from a huge dataframe (about 170,000 rows, but only two columns) to mine some text and group by usernames according to the search terms. For example, I start from a dataframe like this:

username  search_term
name_1    "some_text_1"
name_1    "some_text_2"
name_2    "some_text_3"
name_2    "some_text_4"
name_3    "some_text_5"
name_3    "some_text_6"
name_3    "some_text_1"
[...]
name_n    "some_text_n-1"

And I want to obtain:

data frame 1
username  search_term
name_1    "some_text_1"
name_1
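Base R's `split()` produces exactly this kind of per-user collection of smaller dataframes; a minimal sketch, with a toy dataframe standing in for the 170,000-row one (the column names `username` and `search_term` follow the question):

```r
# Toy dataframe shaped like the one in the question
df <- data.frame(
  username    = c("name_1", "name_1", "name_2", "name_2", "name_3"),
  search_term = c("some_text_1", "some_text_2", "some_text_3",
                  "some_text_4", "some_text_5"),
  stringsAsFactors = FALSE
)

# split() returns a named list: one smaller dataframe per username
df_list <- split(df, df$username)

names(df_list)             # "name_1" "name_2" "name_3"
nrow(df_list[["name_1"]])  # 2
```

Keeping the pieces in a named list is usually preferable to creating separate `data frame 1`, `data frame 2`, ... objects in the workspace.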

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

家住魔仙堡 submitted on 2019-12-02 05:44:55
Question: I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only the terms I specify up front are used and included? I know I can subset the resultant TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix to start with, due to a memory size constraint.

Answer 1: You can modify a corpus to keep only the terms you want by
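Independently of the (truncated) answer above, `TermDocumentMatrix()` itself accepts a `dictionary` entry in its `control` list that restricts the result to a pre-specified set of terms, so the full vocabulary never has to be materialized. A minimal sketch, with toy documents made up for illustration:

```r
library(tm)

docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry",
                               "cherry apple date")))

# Only the dictionary terms appear as rows of the resulting matrix
terms_wanted <- c("apple", "cherry")
tdm <- TermDocumentMatrix(docs, control = list(dictionary = terms_wanted))

Terms(tdm)  # "apple" "cherry"
```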

How to transform a list into a corpus in R?

这一生的挚爱 submitted on 2019-12-02 03:52:29
Question: In this question I asked how to split a huge dataframe to create a corpus. Thanks to the answer I was able to create a list from a dataframe. My problem was still obtaining a corpus from the list I created, in order to do some text mining and cluster the data according to the search term.

Answer 1: To solve this problem I just applied the as.VCorpus function of the tm package to the list I created before:

new_corpus <- as.VCorpus(new_list)

Check that the new object is a corpus:

class(new_corpus)
[1]
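A minimal runnable sketch of that answer, assuming the list elements are already tm document objects (here hypothetical `PlainTextDocument`s standing in for the pieces of the split dataframe):

```r
library(tm)

# Hypothetical list standing in for the one built from the split dataframe
new_list <- list(
  name_1 = PlainTextDocument("some_text_1 some_text_2", id = "name_1"),
  name_2 = PlainTextDocument("some_text_3 some_text_4", id = "name_2")
)

new_corpus <- as.VCorpus(new_list)
class(new_corpus)  # "VCorpus" "Corpus"
```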

Split a huge dataframe in many smaller dataframes to create a corpus in r

ε祈祈猫儿з submitted on 2019-12-02 02:48:13
I need to create a corpus from a huge dataframe (about 170,000 rows, but only two columns) to mine some text and group by usernames according to the search terms. For example, I start from a dataframe like this:

username  search_term
name_1    "some_text_1"
name_1    "some_text_2"
name_2    "some_text_3"
name_2    "some_text_4"
name_3    "some_text_5"
name_3    "some_text_6"
name_3    "some_text_1"
[...]
name_n    "some_text_n-1"

And I want to obtain:

data frame 1
username  search_term
name_1    "some_text_1"
name_1    "some_text_2"

data frame 2
username  search_term
name_2    "some_text_3"
name_2    "some_text_4"

And so on. Any

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

假如想象 submitted on 2019-12-02 02:07:19
I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only the terms I specify up front are used and included? I know I can subset the resultant TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix to start with, due to a memory size constraint.

eipi10: You can modify a corpus to keep only the terms you want by building a custom transformation function. See the vignette for the tm package and the help for the
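One way to build such a custom transformation is with `content_transformer()` and `tm_map()`; the whitespace tokenizer below is a deliberately simplistic stand-in used purely for illustration:

```r
library(tm)

docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry date")))

keep_terms <- c("apple", "cherry")

# Custom transformation: drop every token not in `words`
keep_only <- content_transformer(function(x, words) {
  toks <- unlist(strsplit(x, "\\s+"))
  paste(toks[toks %in% words], collapse = " ")
})

docs_small <- tm_map(docs, keep_only, keep_terms)
as.character(docs_small[[1]])  # "apple apple"
```

A TermDocumentMatrix built from `docs_small` then contains only the kept terms.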

How to transform a list into a corpus in R?

筅森魡賤 submitted on 2019-12-02 00:44:48
In this question I asked how to split a huge dataframe to create a corpus. Thanks to the answer I was able to create a list from a dataframe. My problem was still obtaining a corpus from the list I created, in order to do some text mining and cluster the data according to the search term. To solve this problem I just applied the as.VCorpus function of the tm package to the list I created before:

new_corpus <- as.VCorpus(new_list)

Check that the new object is a corpus:

class(new_corpus)
[1] "VCorpus" "Corpus"

I thus created a "volatile corpus". As written in the R documentation: A volatile corpus

How do I tag textfiles with hunpos in nltk?

血红的双手。 submitted on 2019-12-01 14:30:48
Can someone help me with the syntax for hunpos tagging a corpus in nltk? What do I import for the hunpos.HunPosTagger module? How do I HunPosTag the corpus? See the code below.

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

corpus_root = './'
reader = PlaintextCorpusReader(corpus_root, '.*')
ntuen = LazyCorpusLoader('ntumultien', PlaintextCorpusReader, reader)
ntuen.fileids()
isinstance(ntuen, PlaintextCorpusReader)
# So how do I hunpos tag `ntuen`? I can't get the following code to work.
# Please help me to correct my Python syntax.

In R tm package, build corpus FROM Document-Term-Matrix

断了今生、忘了曾经 submitted on 2019-12-01 06:35:30
It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to do the reverse: build a corpus from a document-term matrix. Let M be the number of documents in a document set, and let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V; its entries are the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab
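Since a dtm stores only counts, word order cannot be recovered, but for stemming purposes a bag-of-words reconstruction is enough: repeat each vocabulary word by its count in a row and wrap the resulting strings in a corpus. A minimal sketch with a made-up 2x3 dtm:

```r
library(tm)

# Toy setup: M = 2 documents, V = 3 vocabulary terms
vocab <- c("apple", "banana", "cherry")
dtm <- matrix(c(2, 0, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), vocab))

# Rebuild one bag-of-words "document" per row of the dtm
texts <- apply(dtm, 1, function(counts)
  paste(rep(vocab, times = counts), collapse = " "))

corpus <- VCorpus(VectorSource(texts))
as.character(corpus[[1]])  # "apple apple cherry"
```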

NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

北城余情 submitted on 2019-12-01 05:29:37
Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me and is very good. For this particular program, technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been

In R tm package, build corpus FROM Document-Term-Matrix

て烟熏妆下的殇ゞ submitted on 2019-12-01 05:27:16
Question: It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to do the reverse: build a corpus from a document-term matrix. Let M be the number of documents in a document set, and let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V; its entries are the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to