corpus

R text mining documents from CSV file (one row per doc)

心已入冬 submitted on 2019-11-28 18:58:14
I am trying to work with the tm package in R and have a CSV file of customer feedback, with each line being a different instance of feedback. I want to import all of this feedback into a corpus, but with each line as a separate document within the corpus, so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set. Originally I did the following:

fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")

This creates a corpus with 1 document and >10,000 rows, whereas I want >10,000 documents with 1 row each. I imagine …
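A minimal sketch of the usual fix, assuming the feedback sits in a single text column (the file and column names below are placeholders): VectorSource creates one document per element of a character vector, so passing the column rather than the whole data frame yields one document per row.

```r
library(tm)

# Hypothetical file and column names, for illustration only.
fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)

# Each element of the character vector becomes its own document.
fdbk_corpus <- Corpus(VectorSource(fdbk$comment))

length(fdbk_corpus)                      # one document per CSV row
dtm <- DocumentTermMatrix(fdbk_corpus)   # documents x terms, ready for comparison
```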

Make dataframe of top N frequent terms for multiple corpora using tm package in R

烈酒焚心 submitted on 2019-11-28 17:06:18
I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, to ultimately end up with an output table like:

corpus1     corpus2
"beach"     "city"
"sand"      "sidewalk"
...         ...
[10th most frequent word]

By definition, findFreqTerms(corpus1,N) returns all of the terms that appear N times or more. To do this by hand I could change N until I got about 10 terms back, but the output of findFreqTerms is listed alphabetically, so unless I picked exactly the right N I wouldn't actually know which were the top 10. I suspect that …
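One way to sidestep the guessing of N is to sort the row sums of each matrix and keep the first ten term names. A sketch, assuming tdm1 and tdm2 are the TermDocumentMatrix objects already built:

```r
library(tm)
library(slam)   # tm's sparse-matrix backend; provides row_sums()

# Return the n most frequent terms of a TermDocumentMatrix, highest first.
top_terms <- function(tdm, n = 10) {
  freqs <- sort(row_sums(tdm), decreasing = TRUE)
  names(freqs)[seq_len(min(n, length(freqs)))]
}

# tdm1 / tdm2 are assumed to exist already.
data.frame(corpus1 = top_terms(tdm1), corpus2 = top_terms(tdm2))
```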

More efficient means of creating a corpus and DTM with 4M rows

大憨熊 submitted on 2019-11-28 16:35:24
My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier. Consider the following code:

library(tm)
GetCorpus <- function(textVector) {
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, …
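One commonly suggested speed-up (a sketch under that assumption, not the asker's code) is to drop the repeated tm_map passes, each of which rewrites the whole corpus, and let DocumentTermMatrix apply the same preprocessing in a single scan via its control list:

```r
library(tm)
library(SnowballC)   # needed for stemming = TRUE

# Build the DTM directly; textVector is assumed to be the 4M-element character vector.
GetDTM <- function(textVector) {
  corp <- VCorpus(VectorSource(textVector))
  DocumentTermMatrix(corp, control = list(
    tolower           = TRUE,
    removeNumbers     = TRUE,
    removePunctuation = TRUE,
    stopwords         = stopwords("english"),
    stemming          = TRUE
  ))
}
```

For data at this scale, packages that work on sparse structures from the start, such as quanteda or text2vec, are also frequently recommended.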

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

跟風遠走 submitted on 2019-11-28 16:04:25
My project uses NLTK. How can I list the project's corpus and model requirements so they can be installed automatically? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, is there any way to freeze that same list of requirements (like pip freeze)?

The NLTK site does list a command-line interface for downloading packages and collections at the bottom of this page: http://www.nltk.org/data . The command-line usage varies by which version of Python you are using, but on my Python 2.6 install I noticed I was missing the 'spanish_grammar' model and this worked …

nltk words corpus does not contain “okay”?

社会主义新天地 submitted on 2019-11-28 13:26:14
The NLTK words corpus does not contain "okay", "ok", or "Okay"?

> from nltk.corpus import words
> words.words().__contains__("check")
True
> words.words().__contains__("okay")
False
> len(words.words())
236736

Any ideas why?

alvas: TL;DR

from nltk.corpus import words
from nltk.corpus import wordnet
manywords = words.words() + wordnet.words()

In long: from the docs, nltk.corpus.words is "a list of words from http://en.wikipedia.org/wiki/Words_(Unix) ", which on Unix you can inspect with:

ls /usr/share/dict/

And reading the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
# @(# …

Need free English dictionary or Corpus, ultimately for a MySQL database [closed]

扶醉桌前 submitted on 2019-11-28 01:39:15
I'm trying to find a free downloadable dictionary (or corpus might be the better word) which I can import into MySQL. I need the words to have their type (noun, verb, adjective) associated with them. Any tips on where I can find one? I found one several years ago that worked nicely, but I no longer have it around.

How to “update” an existing Named Entity Recognition model - rather than creating from scratch?

匆匆过客 submitted on 2019-11-27 23:17:10
Please see the tutorial steps for OpenNLP - Named Entity Recognition: Link to tutorial. I am using the "en-ner-person.bin" model found here. In the tutorial, there are instructions on training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data? Say I have a list of 500 additional person names that are otherwise not recognized as persons - how do I generate a new model?

Sorry it took me a while to put together a decent code example... What the code below does is read in your sentences and use the default en-ner-person model to …

R Corpus Is Messing Up My UTF-8 Encoded Text

流过昼夜 submitted on 2019-11-27 22:51:54
I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings correctly. Here is a reproducible example of my problem.

Load in the Russian text:

> data <- c("Renault Logan, 2005", "Складское помещение, 345 м²", "Су-шеф",
            "3-к квартира, 64 м², 3/5 эт.", "Samsung galaxy S4 mini GT-I9190 (чёрный)")

Create a VectorSource:

> vs <- VectorSource(data)
> vs # outputs correctly

Then, create the corpus:

> corp <- …
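A workaround that is often suggested for this symptom (a sketch, not necessarily the accepted answer) is to build a VCorpus explicitly; in recent tm versions, Corpus() on a VectorSource returns a SimpleCorpus, which is the variant usually implicated when non-ASCII text comes back mangled:

```r
library(tm)

data <- c("Renault Logan, 2005", "Складское помещение, 345 м²", "Су-шеф",
          "3-к квартира, 64 м², 3/5 эт.", "Samsung galaxy S4 mini GT-I9190 (чёрный)")

# VCorpus keeps each string as-is instead of taking the SimpleCorpus fast path.
corp <- VCorpus(VectorSource(data))
content(corp[[2]])   # expected: "Складское помещение, 345 м²"
```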

How to create a word cloud from a corpus in Python?

本秂侑毒 submitted on 2019-11-27 17:27:12
From Creating a subset of words from a corpus in R, the answerer can easily convert a term-document matrix into a word cloud. Is there a similar function in Python libraries that takes either a raw text file, an NLTK corpus, or a Gensim MmCorpus and turns it into a word cloud? The result will look somewhat like this:

Here's a blog post which does just that: http://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html
The whole code is here: https://github.com/amueller/word_cloud

HeadAndTail:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
stopwords = set …