corpus

DocumentTermMatrix error on Corpus argument

浪子不回头ぞ submitted on 2019-11-27 10:33:34

I have the following code:

    # returns string w/o leading or trailing whitespace
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)
    news_corpus <- Corpus(VectorSource(news_raw$text))  # a column of strings
    corpus_clean <- tm_map(news_corpus, tolower)
    corpus_clean <- tm_map(corpus_clean, removeNumbers)
    corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
    corpus_clean <- tm_map(corpus_clean, removePunctuation)
    corpus_clean <- tm_map(corpus_clean, stripWhitespace)
    corpus_clean <- tm_map(corpus_clean, trim)
    news_dtm <- DocumentTermMatrix(corpus_clean)  # errors here

When I run the
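The pipeline above lowercases, strips numbers, drops stopwords, removes punctuation, and normalizes whitespace. To illustrate what those steps compute, here is a minimal plain-Python sketch of the same cleaning sequence; the stopword set is a tiny hypothetical stand-in for stopwords('english'), not the real list.

```python
import re
import string

STOPWORDS = {"the", "a", "an", "and", "of", "to"}  # tiny stand-in for stopwords('english')

def clean(text: str) -> str:
    """Apply the same steps as the tm pipeline: lowercase, remove
    numbers, drop stopwords, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)                          # removeNumbers
    words = [w for w in text.split() if w not in STOPWORDS]  # removeWords
    text = " ".join(words)
    text = text.translate(str.maketrans("", "", string.punctuation))  # removePunctuation
    return re.sub(r"\s+", " ", text).strip()                 # stripWhitespace + trim

print(clean("The 3 Dogs barked, loudly!!"))  # -> dogs barked loudly
```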

Make dataframe of top N frequent terms for multiple corpora using tm package in R

旧巷老猫 submitted on 2019-11-27 10:10:52

Question: I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, ultimately ending up with an output table like:

    corpus1    corpus2
    "beach"    "city"
    "sand"     "sidewalk"
    ...        ...
    [10th most frequent word]

By definition, findFreqTerms(corpus1, N) returns all of the terms which appear N times or more. To do this by hand I could change N until I got 10 or so terms returned, but the output of findFreqTerms is listed alphabetically so
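Outside of tm, "top 10 by frequency rather than by a count threshold" is just a sort over term counts. A minimal Python sketch with collections.Counter, using made-up tokenized corpora (the corpus contents here are invented examples):

```python
from collections import Counter

def top_terms(docs, n=10):
    """Return up to n most frequent terms across a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    return [term for term, _ in counts.most_common(n)]

corpus1 = [["beach", "sand", "beach"], ["sand", "beach", "waves"]]
corpus2 = [["city", "sidewalk", "city"], ["city", "taxi"]]

# One column per corpus, analogous to the desired output table.
table = {"corpus1": top_terms(corpus1), "corpus2": top_terms(corpus2)}
print(table)
```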

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

妖精的绣舞 submitted on 2019-11-27 09:31:29

Question: My project uses NLTK. How can I list the project's corpus and model requirements so they can be installed automatically? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, is there any way to freeze that same list of requirements (like pip freeze)?

Answer 1: The NLTK site lists a command-line interface for downloading packages and collections at the bottom of this page: http://www.nltk.org/data The command-line usage varies by which version of Python you
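One way to pin the requirements is to keep the package ids in a plain Python list and hand them to the `nltk.downloader` command-line module mentioned in the answer. This sketch only builds the command (so it runs even without NLTK installed); the package names are examples, not a claim about any particular project's needs.

```python
import sys

# Example corpus/model ids a project might depend on.
NLTK_REQUIREMENTS = ["punkt", "stopwords", "words"]

def downloader_cmd(packages):
    """Build the argv for `python -m nltk.downloader <pkg...>`,
    which installs NLTK data packages without the GUI."""
    return [sys.executable, "-m", "nltk.downloader"] + list(packages)

print(downloader_cmd(NLTK_REQUIREMENTS))
# To actually download:
# subprocess.run(downloader_cmd(NLTK_REQUIREMENTS), check=True)
```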

R tm package vcorpus: Error in converting corpus to data frame

扶醉桌前 submitted on 2019-11-27 08:05:28

I am using the tm package to clean up some data with the following code:

    mycorpus <- Corpus(VectorSource(x))
    mycorpus <- tm_map(mycorpus, removePunctuation)

I then want to convert the corpus back into a data frame in order to export a text file that contains the data in the original data-frame format. I have tried the following:

    dataframe <- as.data.frame(mycorpus)

But this returns an error:

    Error in as.data.frame.default(mycorpus) :
      cannot coerce class "c("VCorpus", "Corpus")" to a data.frame

How can I convert a corpus into a data frame?

MrFlick: Your corpus is really just a character
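The underlying task, cleaning each string and then getting back a tabular structure in the original shape, can be sketched without tm (or pandas) in plain Python; the row layout and column names here are made up for illustration.

```python
import string

def strip_punct(text: str) -> str:
    """Analogue of tm's removePunctuation for a single string."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Original "data frame": a list of rows.
rows = [{"id": 1, "text": "Hello, world!"}, {"id": 2, "text": "It's fine."}]

# Clean the text column while keeping the row structure intact, so it can
# be written back out (e.g. with csv.DictWriter) in the original shape.
cleaned = [{**row, "text": strip_punct(row["text"])} for row in rows]
print(cleaned)
```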

Using my own corpus instead of movie_reviews corpus for Classification in NLTK

孤街浪徒 submitted on 2019-11-27 07:55:25

I use the following code, which I got from Classification using movie review corpus in NLTK/Python:

    import string
    from itertools import chain
    from nltk.corpus import movie_reviews as mr
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier as nbc
    import nltk

    stop = stopwords.words('english')
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0]) for i in mr.fileids()]
    word_features = FreqDist(chain(*[i for i, j in documents]))
    word_features = word_features.keys()[:100]
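The movie_reviews-specific part of that code is only the construction of (tokens, label) pairs; the same shape can be built from any labelled collection of texts. A self-contained sketch, using an invented in-memory corpus as a stand-in for your own files, a hard-coded stopword set instead of stopwords.words('english'), and Counter in place of FreqDist:

```python
import string
from collections import Counter
from itertools import chain

STOP = {"the", "a", "is", "this"}  # stand-in for stopwords.words('english')

# Stand-in for your own corpus: label -> list of raw documents.
my_corpus = {
    "pos": ["this movie is great great fun", "a lovely film"],
    "neg": ["this movie is awful", "the plot is a mess"],
}

# Same structure as the movie_reviews version: (filtered tokens, label).
documents = [
    ([w for w in text.split() if w not in STOP and w not in string.punctuation], label)
    for label, texts in my_corpus.items()
    for text in texts
]

# Equivalent of FreqDist(...).keys()[:100]: the most common words become features.
word_features = [w for w, _ in
                 Counter(chain(*[toks for toks, _ in documents])).most_common(100)]
print(word_features[:5])
```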

Keep document ID with R corpus

做~自己de王妃 submitted on 2019-11-27 02:02:19

Question: I have searched Stack Overflow and the web and can only find partial solutions, or solutions that no longer work due to changes in tm or qdap. The problem: I have a dataframe with an ID column and a Text column (a simple document id/name and some text). I have two issues:

Part 1: How can I create a tdm or dtm and maintain the document name/id? It only shows "character(0)" on inspect(tdm).

Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus,
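Both parts can be illustrated without tm: carry the id alongside the text through every transformation, and filter against a keep-list (the inverse of removing stopwords). The document ids and terms below are invented examples.

```python
# Part 1 idea: key everything by document id so the name survives cleaning.
docs = {"doc1": "beach sand waves sun", "doc2": "city sidewalk taxi sun"}

# Part 2 idea: keep ONLY these terms (opposite of custom stopword removal).
keep = {"beach", "city", "sun"}

kept = {doc_id: [w for w in text.split() if w in keep]
        for doc_id, text in docs.items()}
print(kept)
```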

How to “update” an existing Named Entity Recognition model - rather than creating from scratch?

亡梦爱人 submitted on 2019-11-26 23:19:27

Question: Please see the tutorial steps for OpenNLP Named Entity Recognition: Link to tutorial. I am using the "en-ner-person.bin" model found here. The tutorial has instructions for training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data? Say I have a list of 500 additional person names that are otherwise not recognized as persons: how do I generate a new model?

Answer 1: Sorry it took me a while to put together a decent code

How to create a word cloud from a corpus in Python?

做~自己de王妃 submitted on 2019-11-26 22:33:21

Question: In Creating a subset of words from a corpus in R, the answerer easily converts a term-document matrix into a word cloud. Is there a similar function in any Python library that takes either a raw text file, an NLTK corpus, or a Gensim MmCorpus and turns it into a word cloud? The result would look somewhat like this:

Answer 1: Here's a blog post which does just that: http://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html The whole code is here: https://github.com/amueller/word_cloud
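Whatever the input format (raw text, NLTK corpus, or Gensim corpus), a word cloud ultimately needs term frequencies. This sketch computes them with Counter; the actual rendering would then be handed to a library such as the word_cloud project linked in the answer (which accepts precomputed frequencies). The sample sentence is invented.

```python
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Lowercase, tokenize, and count terms: the input a word cloud needs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

freqs = term_frequencies("Corpus tools love a corpus: corpus, tools, words.")
print(freqs.most_common(2))
```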

nltk words corpus does not contain “okay”?

别来无恙 submitted on 2019-11-26 22:26:19

Question: Why doesn't the NLTK words corpus contain "okay", "ok", or "Okay"?

    >>> from nltk.corpus import words
    >>> words.words().__contains__("check")
    True
    >>> words.words().__contains__("okay")
    False
    >>> len(words.words())
    236736

Any ideas why?

Answer 1: TL;DR

    from nltk.corpus import words
    from nltk.corpus import wordnet
    manywords = words.words() + wordnet.words()

In long: from the docs, nltk.corpus.words is a list of words from "http://en.wikipedia.org/wiki/Words_(Unix)". Which in Unix, you can do:
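The fix in the answer boils down to unioning two vocabularies. The same idea with small invented stand-in lists in place of the real nltk.corpus.words and nltk.corpus.wordnet data:

```python
# Stand-ins for the two NLTK vocabularies; the real lists come from
# nltk.corpus.words (a Unix-style dictionary) and WordNet's lemma names.
unix_words = ["check", "fine", "good"]
wordnet_lemmas = ["okay", "ok", "check"]

# Union into a set for fast membership tests.
manywords = set(unix_words) | set(wordnet_lemmas)
print("okay" in manywords, "check" in manywords)
```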
