corpus

DocumentTermMatrix error on Corpus argument

浪子不回头ぞ submitted on 2019-11-27 10:33:34

I have the following code:

    # returns string w/o leading or trailing whitespace
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)
    news_corpus <- Corpus(VectorSource(news_raw$text))  # a column of strings
    corpus_clean <- tm_map(news_corpus, tolower)
    corpus_clean <- tm_map(corpus_clean, removeNumbers)
    corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
    corpus_clean <- tm_map(corpus_clean, removePunctuation)
    corpus_clean <- tm_map(corpus_clean, stripWhitespace)
    corpus_clean <- tm_map(corpus_clean, trim)
    news_dtm <- DocumentTermMatrix(corpus_clean)  # errors here

When I run the
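The pipeline above lowercases, strips numbers, drops stopwords, removes punctuation, and normalizes whitespace. To illustrate what those steps compute, here is a minimal plain-Python sketch of the same cleaning sequence; the stopword set is a tiny hypothetical stand-in for stopwords('english'), not the real list.

```python
import re
import string

STOPWORDS = {"the", "a", "an", "and", "of", "to"}  # tiny stand-in for stopwords('english')

def clean(text: str) -> str:
    """Apply the same steps as the tm pipeline: lowercase, remove
    numbers, drop stopwords, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)                          # removeNumbers
    words = [w for w in text.split() if w not in STOPWORDS]  # removeWords
    text = " ".join(words)
    text = text.translate(str.maketrans("", "", string.punctuation))  # removePunctuation
    return re.sub(r"\s+", " ", text).strip()                 # stripWhitespace + trim

print(clean("The 3 Dogs barked, loudly!!"))  # -> dogs barked loudly
```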

Make dataframe of top N frequent terms for multiple corpora using tm package in R

旧巷老猫 submitted on 2019-11-27 10:10:52

Question: I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, ultimately ending up with an output table like:

    corpus1    corpus2
    "beach"    "city"
    "sand"     "sidewalk"
    ...        ...
    [10th most frequent word]

By definition, findFreqTerms(corpus1, N) returns all of the terms which appear N times or more. To do this by hand I could change N until I got 10 or so terms returned, but the output of findFreqTerms is listed alphabetically so
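Outside of tm, "top 10 by frequency rather than by a count threshold" is just a sort over term counts. A minimal Python sketch with collections.Counter, using made-up tokenized corpora (the corpus contents here are invented examples):

```python
from collections import Counter

def top_terms(docs, n=10):
    """Return up to n most frequent terms across a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    return [term for term, _ in counts.most_common(n)]

corpus1 = [["beach", "sand", "beach"], ["sand", "beach", "waves"]]
corpus2 = [["city", "sidewalk", "city"], ["city", "taxi"]]

# One column per corpus, analogous to the desired output table.
table = {"corpus1": top_terms(corpus1), "corpus2": top_terms(corpus2)}
print(table)
```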

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

妖精的绣舞 submitted on 2019-11-27 09:31:29

Question: My project uses NLTK. How can I list the project's corpus and model requirements so they can be installed automatically? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, is there any way to freeze that same list of requirements (like pip freeze)?

Answer 1: The NLTK site lists a command-line interface for downloading packages and collections at the bottom of this page: http://www.nltk.org/data The command-line usage varies by which version of Python you
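One way to pin the requirements is to keep the package ids in a plain Python list and hand them to the `nltk.downloader` command-line module mentioned in the answer. This sketch only builds the command (so it runs even without NLTK installed); the package names are examples, not a claim about any particular project's needs.

```python
import sys

# Example corpus/model ids a project might depend on.
NLTK_REQUIREMENTS = ["punkt", "stopwords", "words"]

def downloader_cmd(packages):
    """Build the argv for `python -m nltk.downloader <pkg...>`,
    which installs NLTK data packages without the GUI."""
    return [sys.executable, "-m", "nltk.downloader"] + list(packages)

print(downloader_cmd(NLTK_REQUIREMENTS))
# To actually download:
# subprocess.run(downloader_cmd(NLTK_REQUIREMENTS), check=True)
```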

R tm package vcorpus: Error in converting corpus to data frame

扶醉桌前 submitted on 2019-11-27 08:05:28

I am using the tm package to clean up some data with the following code:

    mycorpus <- Corpus(VectorSource(x))
    mycorpus <- tm_map(mycorpus, removePunctuation)

I then want to convert the corpus back into a data frame in order to export a text file that contains the data in the original data-frame format. I have tried the following:

    dataframe <- as.data.frame(mycorpus)

But this returns an error:

    Error in as.data.frame.default(mycorpus) :
      cannot coerce class "c("VCorpus", "Corpus")" to a data.frame

How can I convert a corpus into a data frame?

MrFlick: Your corpus is really just a character
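The underlying task, cleaning each string and then getting back a tabular structure in the original shape, can be sketched without tm (or pandas) in plain Python; the row layout and column names here are made up for illustration.

```python
import string

def strip_punct(text: str) -> str:
    """Analogue of tm's removePunctuation for a single string."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Original "data frame": a list of rows.
rows = [{"id": 1, "text": "Hello, world!"}, {"id": 2, "text": "It's fine."}]

# Clean the text column while keeping the row structure intact, so it can
# be written back out (e.g. with csv.DictWriter) in the original shape.
cleaned = [{**row, "text": strip_punct(row["text"])} for row in rows]
print(cleaned)
```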

Using my own corpus instead of movie_reviews corpus for Classification in NLTK

孤街浪徒 submitted on 2019-11-27 07:55:25

I use the following code, which I got from Classification using movie review corpus in NLTK/Python:

    import string
    from itertools import chain
    from nltk.corpus import movie_reviews as mr
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier as nbc
    import nltk

    stop = stopwords.words('english')
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0]) for i in mr.fileids()]
    word_features = FreqDist(chain(*[i for i, j in documents]))
    word_features = word_features.keys()[:100]
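The movie_reviews-specific part of that code is only the construction of (tokens, label) pairs; the same shape can be built from any labelled collection of texts. A self-contained sketch, using an invented in-memory corpus as a stand-in for your own files, a hard-coded stopword set instead of stopwords.words('english'), and Counter in place of FreqDist:

```python
import string
from collections import Counter
from itertools import chain

STOP = {"the", "a", "is", "this"}  # stand-in for stopwords.words('english')

# Stand-in for your own corpus: label -> list of raw documents.
my_corpus = {
    "pos": ["this movie is great great fun", "a lovely film"],
    "neg": ["this movie is awful", "the plot is a mess"],
}

# Same structure as the movie_reviews version: (filtered tokens, label).
documents = [
    ([w for w in text.split() if w not in STOP and w not in string.punctuation], label)
    for label, texts in my_corpus.items()
    for text in texts
]

# Equivalent of FreqDist(...).keys()[:100]: the most common words become features.
word_features = [w for w, _ in
                 Counter(chain(*[toks for toks, _ in documents])).most_common(100)]
print(word_features[:5])
```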

Keep document ID with R corpus

做~自己de王妃 submitted on 2019-11-27 02:02:19

Question: I have searched Stack Overflow and the web and can only find partial solutions, or solutions that no longer work due to changes in tm or qdap. The problem: I have a dataframe with an ID column and a Text column (a simple document id/name and some text). I have two issues:

Part 1: How can I create a tdm or dtm and maintain the document name/id? It only shows "character(0)" on inspect(tdm).

Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus,
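Both parts can be illustrated without tm: carry the id alongside the text through every transformation, and filter against a keep-list (the inverse of removing stopwords). The document ids and terms below are invented examples.

```python
# Part 1 idea: key everything by document id so the name survives cleaning.
docs = {"doc1": "beach sand waves sun", "doc2": "city sidewalk taxi sun"}

# Part 2 idea: keep ONLY these terms (opposite of custom stopword removal).
keep = {"beach", "city", "sun"}

kept = {doc_id: [w for w in text.split() if w in keep]
        for doc_id, text in docs.items()}
print(kept)
```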

How to “update” an existing Named Entity Recognition model - rather than creating from scratch?

亡梦爱人 submitted on 2019-11-26 23:19:27

Question: Please see the tutorial steps for OpenNLP Named Entity Recognition: Link to tutorial. I am using the "en-ner-person.bin" model found here. The tutorial has instructions for training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data? Say I have a list of 500 additional person names that are otherwise not recognized as persons: how do I generate a new model?

Answer 1: Sorry it took me a while to put together a decent code

How to create a word cloud from a corpus in Python?

做~自己de王妃 submitted on 2019-11-26 22:33:21

Question: In Creating a subset of words from a corpus in R, the answerer easily converts a term-document matrix into a word cloud. Is there a similar function in any Python library that takes either a raw text file, an NLTK corpus, or a Gensim MmCorpus and turns it into a word cloud? The result would look somewhat like this:

Answer 1: Here's a blog post which does just that: http://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html The whole code is here: https://github.com/amueller/word_cloud
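Whatever the input format (raw text, NLTK corpus, or Gensim corpus), a word cloud ultimately needs term frequencies. This sketch computes them with Counter; the actual rendering would then be handed to a library such as the word_cloud project linked in the answer (which accepts precomputed frequencies). The sample sentence is invented.

```python
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Lowercase, tokenize, and count terms: the input a word cloud needs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

freqs = term_frequencies("Corpus tools love a corpus: corpus, tools, words.")
print(freqs.most_common(2))
```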

nltk words corpus does not contain “okay”?

别来无恙 submitted on 2019-11-26 22:26:19

Question: Why doesn't the NLTK words corpus contain "okay", "ok", or "Okay"?

    >>> from nltk.corpus import words
    >>> words.words().__contains__("check")
    True
    >>> words.words().__contains__("okay")
    False
    >>> len(words.words())
    236736

Any ideas why?

Answer 1: TL;DR

    from nltk.corpus import words
    from nltk.corpus import wordnet
    manywords = words.words() + wordnet.words()

In long: from the docs, nltk.corpus.words is a list of words from "http://en.wikipedia.org/wiki/Words_(Unix)". Which in Unix, you can do:
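The fix in the answer boils down to unioning two vocabularies. The same idea with small invented stand-in lists in place of the real nltk.corpus.words and nltk.corpus.wordnet data:

```python
# Stand-ins for the two NLTK vocabularies; the real lists come from
# nltk.corpus.words (a Unix-style dictionary) and WordNet's lemma names.
unix_words = ["check", "fine", "good"]
wordnet_lemmas = ["okay", "ok", "check"]

# Union into a set for fast membership tests.
manywords = set(unix_words) | set(wordnet_lemmas)
print("okay" in manywords, "check" in manywords)
```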
