corpus

R text mining documents from CSV file (one row per doc)

心已入冬 submitted on 2019-11-28 18:58:14
I am trying to work with the tm package in R and have a CSV file of customer feedback, with each line being a different instance of feedback. I want to import all of this feedback into a corpus, but with each line as a separate document within the corpus, so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set. Originally I did the following:

fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")

This creates a corpus with 1 document and >10,000 rows, whereas I want >10,000 documents with 1 row each. I imagine …
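A minimal sketch of the usual fix, assuming the feedback sits in a single text column (the file and column names below are placeholders): VectorSource creates one document per element of a character vector, so passing the column rather than the whole data frame yields one document per row.

```r
library(tm)

# Hypothetical file and column names, for illustration only.
fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)

# Each element of the character vector becomes its own document.
fdbk_corpus <- Corpus(VectorSource(fdbk$comment))

length(fdbk_corpus)                      # one document per CSV row
dtm <- DocumentTermMatrix(fdbk_corpus)   # documents x terms, ready for comparison
```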

Make dataframe of top N frequent terms for multiple corpora using tm package in R

烈酒焚心 submitted on 2019-11-28 17:06:18
I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, to ultimately end up with an output table like:

corpus1     corpus2
"beach"     "city"
"sand"      "sidewalk"
...         ...
[10th most frequent word]

By definition, findFreqTerms(corpus1,N) returns all of the terms that appear N times or more. To do this by hand I could change N until I got about 10 terms back, but the output of findFreqTerms is listed alphabetically, so unless I picked exactly the right N I wouldn't actually know which were the top 10. I suspect that …
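One way to sidestep the guessing of N is to sort the row sums of each matrix and keep the first ten term names. A sketch, assuming tdm1 and tdm2 are the TermDocumentMatrix objects already built:

```r
library(tm)
library(slam)   # tm's sparse-matrix backend; provides row_sums()

# Return the n most frequent terms of a TermDocumentMatrix, highest first.
top_terms <- function(tdm, n = 10) {
  freqs <- sort(row_sums(tdm), decreasing = TRUE)
  names(freqs)[seq_len(min(n, length(freqs)))]
}

# tdm1 / tdm2 are assumed to exist already.
data.frame(corpus1 = top_terms(tdm1), corpus2 = top_terms(tdm2))
```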

More efficient means of creating a corpus and DTM with 4M rows

大憨熊 submitted on 2019-11-28 16:35:24
My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier. Consider the following code:

library(tm)
GetCorpus <- function(textVector) {
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, …
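One commonly suggested speed-up (a sketch under that assumption, not the asker's code) is to drop the repeated tm_map passes, each of which rewrites the whole corpus, and let DocumentTermMatrix apply the same preprocessing in a single scan via its control list:

```r
library(tm)
library(SnowballC)   # needed for stemming = TRUE

# Build the DTM directly; textVector is assumed to be the 4M-element character vector.
GetDTM <- function(textVector) {
  corp <- VCorpus(VectorSource(textVector))
  DocumentTermMatrix(corp, control = list(
    tolower           = TRUE,
    removeNumbers     = TRUE,
    removePunctuation = TRUE,
    stopwords         = stopwords("english"),
    stemming          = TRUE
  ))
}
```

For data at this scale, packages that work on sparse structures from the start, such as quanteda or text2vec, are also frequently recommended.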

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

跟風遠走 submitted on 2019-11-28 16:04:25
My project uses NLTK. How can I list the project's corpus and model requirements so they can be installed automatically? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, is there any way to freeze that same list of requirements (like pip freeze)?

The NLTK site does list a command-line interface for downloading packages and collections at the bottom of this page: http://www.nltk.org/data . The command-line usage varies by which version of Python you are using, but on my Python 2.6 install I noticed I was missing the 'spanish_grammar' model and this worked …

nltk words corpus does not contain “okay”?

社会主义新天地 submitted on 2019-11-28 13:26:14
The NLTK words corpus does not contain "okay", "ok", or "Okay"?

> from nltk.corpus import words
> words.words().__contains__("check")
True
> words.words().__contains__("okay")
False
> len(words.words())
236736

Any ideas why?

alvas: TL;DR

from nltk.corpus import words
from nltk.corpus import wordnet
manywords = words.words() + wordnet.words()

In long: from the docs, nltk.corpus.words is "a list of words from http://en.wikipedia.org/wiki/Words_(Unix) ", which on Unix you can inspect with:

ls /usr/share/dict/

And reading the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
# @(# …

Need free English dictionary or Corpus, ultimately for a MySQL database [closed]

扶醉桌前 submitted on 2019-11-28 01:39:15
I'm trying to find a free downloadable dictionary (or corpus might be the better word) which I can import into MySQL. I need the words to have their type (noun, verb, adjective) associated with them. Any tips on where I can find one? I found one several years ago that worked nicely, but I no longer have it around.

How to “update” an existing Named Entity Recognition model - rather than creating from scratch?

匆匆过客 submitted on 2019-11-27 23:17:10
Please see the tutorial steps for OpenNLP - Named Entity Recognition: Link to tutorial. I am using the "en-ner-person.bin" model found here. In the tutorial, there are instructions on training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data? Say I have a list of 500 additional person names that are otherwise not recognized as persons - how do I generate a new model?

Sorry it took me a while to put together a decent code example... What the code below does is read in your sentences and use the default en-ner-person model to …

R Corpus Is Messing Up My UTF-8 Encoded Text

流过昼夜 submitted on 2019-11-27 22:51:54
I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings correctly. Here is a reproducible example of my problem.

Load in the Russian text:

> data <- c("Renault Logan, 2005", "Складское помещение, 345 м²", "Су-шеф",
            "3-к квартира, 64 м², 3/5 эт.", "Samsung galaxy S4 mini GT-I9190 (чёрный)")

Create a VectorSource:

> vs <- VectorSource(data)
> vs # outputs correctly

Then, create the corpus:

> corp <- …
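A workaround that is often suggested for this symptom (a sketch, not necessarily the accepted answer) is to build a VCorpus explicitly; in recent tm versions, Corpus() on a VectorSource returns a SimpleCorpus, which is the variant usually implicated when non-ASCII text comes back mangled:

```r
library(tm)

data <- c("Renault Logan, 2005", "Складское помещение, 345 м²", "Су-шеф",
          "3-к квартира, 64 м², 3/5 эт.", "Samsung galaxy S4 mini GT-I9190 (чёрный)")

# VCorpus keeps each string as-is instead of taking the SimpleCorpus fast path.
corp <- VCorpus(VectorSource(data))
content(corp[[2]])   # expected: "Складское помещение, 345 м²"
```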

How to create a word cloud from a corpus in Python?

本秂侑毒 submitted on 2019-11-27 17:27:12
From Creating a subset of words from a corpus in R, the answerer can easily convert a term-document matrix into a word cloud. Is there a similar function in Python libraries that takes either a raw text file, an NLTK corpus, or a Gensim MmCorpus and turns it into a word cloud? The result will look somewhat like this:

Here's a blog post which does just that: http://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html
The whole code is here: https://github.com/amueller/word_cloud

HeadAndTail:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
stopwords = set …