corpus

Split a huge dataframe in many smaller dataframes to create a corpus in r

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-02 09:29:50
Question: I need to create a corpus from a huge dataframe (about 170,000 rows, but only two columns) to mine some text and group by usernames according to the search terms. For example, I start from a dataframe like this:

username  search_term
name_1    "some_text_1"
name_1    "some_text_2"
name_2    "some_text_3"
name_2    "some_text_4"
name_3    "some_text_5"
name_3    "some_text_6"
name_3    "some_text_1"
[...]
name_n    "some_text_n-1"

And I want to obtain:

data frame 1
username  search_term
name_1    "some_text_1"
name_1
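Base R's `split()` produces exactly this kind of per-user collection of smaller dataframes; a minimal sketch, with a toy dataframe standing in for the 170,000-row one (the column names `username` and `search_term` follow the question):

```r
# Toy dataframe shaped like the one in the question
df <- data.frame(
  username    = c("name_1", "name_1", "name_2", "name_2", "name_3"),
  search_term = c("some_text_1", "some_text_2", "some_text_3",
                  "some_text_4", "some_text_5"),
  stringsAsFactors = FALSE
)

# split() returns a named list: one smaller dataframe per username
df_list <- split(df, df$username)

names(df_list)             # "name_1" "name_2" "name_3"
nrow(df_list[["name_1"]])  # 2
```

Keeping the pieces in a named list is usually preferable to creating separate `data frame 1`, `data frame 2`, ... objects in the workspace.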

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

家住魔仙堡 submitted on 2019-12-02 05:44:55
Question: I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only the terms I specify up front are used and included? I know I can subset the resultant TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix to start with, due to a memory size constraint.

Answer 1: You can modify a corpus to keep only the terms you want by
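Independently of the (truncated) answer above, `TermDocumentMatrix()` itself accepts a `dictionary` entry in its `control` list that restricts the result to a pre-specified set of terms, so the full vocabulary never has to be materialized. A minimal sketch, with toy documents made up for illustration:

```r
library(tm)

docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry",
                               "cherry apple date")))

# Only the dictionary terms appear as rows of the resulting matrix
terms_wanted <- c("apple", "cherry")
tdm <- TermDocumentMatrix(docs, control = list(dictionary = terms_wanted))

Terms(tdm)  # "apple" "cherry"
```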

How to transform a list into a corpus in R?

这一生的挚爱 submitted on 2019-12-02 03:52:29
Question: In this question I asked how to split a huge dataframe to create a corpus. Thanks to the answer I was able to create a list from a dataframe. My problem was still obtaining a corpus from the list I created, in order to do some text mining and cluster the data according to the search term.

Answer 1: To solve this problem I just applied the as.VCorpus function of the tm package to the list I created before:

new_corpus <- as.VCorpus(new_list)

Check that the new object is a corpus:

class(new_corpus)
[1]
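A minimal runnable sketch of that answer, assuming the list elements are already tm document objects (here hypothetical `PlainTextDocument`s standing in for the pieces of the split dataframe):

```r
library(tm)

# Hypothetical list standing in for the one built from the split dataframe
new_list <- list(
  name_1 = PlainTextDocument("some_text_1 some_text_2", id = "name_1"),
  name_2 = PlainTextDocument("some_text_3 some_text_4", id = "name_2")
)

new_corpus <- as.VCorpus(new_list)
class(new_corpus)  # "VCorpus" "Corpus"
```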

Split a huge dataframe in many smaller dataframes to create a corpus in r

ε祈祈猫儿з submitted on 2019-12-02 02:48:13
I need to create a corpus from a huge dataframe (about 170,000 rows, but only two columns) to mine some text and group by usernames according to the search terms. For example, I start from a dataframe like this:

username  search_term
name_1    "some_text_1"
name_1    "some_text_2"
name_2    "some_text_3"
name_2    "some_text_4"
name_3    "some_text_5"
name_3    "some_text_6"
name_3    "some_text_1"
[...]
name_n    "some_text_n-1"

And I want to obtain:

data frame 1
username  search_term
name_1    "some_text_1"
name_1    "some_text_2"

data frame 2
username  search_term
name_2    "some_text_3"
name_2    "some_text_4"

And so on. Any

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

假如想象 submitted on 2019-12-02 02:07:19
I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only the terms I specify up front are used and included? I know I can subset the resultant TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix to start with, due to a memory size constraint.

eipi10: You can modify a corpus to keep only the terms you want by building a custom transformation function. See the vignette for the tm package and the help for the
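One way to build such a custom transformation is with `content_transformer()` and `tm_map()`; the whitespace tokenizer below is a deliberately simplistic stand-in used purely for illustration:

```r
library(tm)

docs <- VCorpus(VectorSource(c("apple banana apple",
                               "banana cherry date")))

keep_terms <- c("apple", "cherry")

# Custom transformation: drop every token not in `words`
keep_only <- content_transformer(function(x, words) {
  toks <- unlist(strsplit(x, "\\s+"))
  paste(toks[toks %in% words], collapse = " ")
})

docs_small <- tm_map(docs, keep_only, keep_terms)
as.character(docs_small[[1]])  # "apple apple"
```

A TermDocumentMatrix built from `docs_small` then contains only the kept terms.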

How to transform a list into a corpus in R?

筅森魡賤 submitted on 2019-12-02 00:44:48
In this question I asked how to split a huge dataframe to create a corpus. Thanks to the answer I was able to create a list from a dataframe. My problem was still obtaining a corpus from the list I created, in order to do some text mining and cluster the data according to the search term. To solve this problem I just applied the as.VCorpus function of the tm package to the list I created before:

new_corpus <- as.VCorpus(new_list)

Check that the new object is a corpus:

class(new_corpus)
[1] "VCorpus" "Corpus"

I thus created a "volatile corpus". As written in the R documentation: A volatile corpus

How do I tag textfiles with hunpos in nltk?

血红的双手。 submitted on 2019-12-01 14:30:48
Can someone help me with the syntax for hunpos tagging a corpus in nltk? What do I import for the hunpos.HunPosTagger module? How do I HunPosTag the corpus? See the code below.

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

corpus_root = './'
reader = PlaintextCorpusReader(corpus_root, '.*')
ntuen = LazyCorpusLoader('ntumultien', PlaintextCorpusReader, reader)
ntuen.fileids()
isinstance(ntuen, PlaintextCorpusReader)
# So how do I hunpos tag `ntuen`? I can't get the following code to work.
# Please help me to correct my Python syntax.

In R tm package, build corpus FROM Document-Term-Matrix

断了今生、忘了曾经 submitted on 2019-12-01 06:35:30
It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to do the reverse: build a corpus from a document-term matrix. Let M be the number of documents in a document set, and let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V; its entries are the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab
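Since a dtm stores only counts, word order cannot be recovered, but for stemming purposes a bag-of-words reconstruction is enough: repeat each vocabulary word by its count in a row and wrap the resulting strings in a corpus. A minimal sketch with a made-up 2x3 dtm:

```r
library(tm)

# Toy setup: M = 2 documents, V = 3 vocabulary terms
vocab <- c("apple", "banana", "cherry")
dtm <- matrix(c(2, 0, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), vocab))

# Rebuild one bag-of-words "document" per row of the dtm
texts <- apply(dtm, 1, function(counts)
  paste(rep(vocab, times = counts), collapse = " "))

corpus <- VCorpus(VectorSource(texts))
as.character(corpus[[1]])  # "apple apple cherry"
```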

NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

北城余情 submitted on 2019-12-01 05:29:37
Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me and is very good. For this particular program, technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been

In R tm package, build corpus FROM Document-Term-Matrix

て烟熏妆下的殇ゞ submitted on 2019-12-01 05:27:16
Question: It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to do the reverse: build a corpus from a document-term matrix. Let M be the number of documents in a document set, and let V be the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V; its entries are the words represented by the column indices of the document-term matrix. From the dtm and vocabulary vector, I'd like to