I'm creating a word cloud using the wordcloud package in R, with the help of "Word Cloud in R".
I can do this easily enough, but I want to remove words from the word cloud. The words to exclude are in a file (an Excel file at the moment, though I could change that), and there are a couple of hundred of them. Any suggestions?
require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
ap.corpus <- Corpus(DataframeSource(data.frame(as.character(data.merged2[, 6]))))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
table(ap.d$freq)
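For completeness, the cloud itself is then drawn from ap.d with something like this (the parameter values here are just an example, not from the original post):
wordcloud(ap.d$word, ap.d$freq, min.freq = 2, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))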
@Tyler Rinker has given the answer (just add another call to removeWords()), but here's a bit more detail.
Let's say your Excel file is called nuts.xls and has a single column of words, like this:
stopwords
peanut
cashew
walnut
almond
macadamia
In R you might proceed like this:
library(gdata) # package with an xls import function
library(tm)
# Load the Excel file with the custom stoplist. Note a few of the arguments here:
# they strip the whitespace that Excel seems to insert and prevent the words from
# being imported as factors. Any argument to read.table() can be used, which is handy.
nuts <- read.xls("nuts.xls", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)
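# (An aside, not part of the original answer: if gdata's Perl dependency is a
# problem, the readxl package can read the same file; text columns import as
# character, so no stringsAsFactors argument is needed.)
# library(readxl)
# nuts <- as.data.frame(read_excel("nuts.xls"))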
# Now make some words to build a corpus and test a two-step stopword removal...
words1 <- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
words2 <- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
words3 <- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
words.all <- data.frame(rbind(words1, words2, words3))
words.corpus <- Corpus(DataframeSource(words.all))
# First remove the standard list of stopwords, like you've already worked out
words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
# Then remove the second set of stopwords, this time your custom set from the Excel
# file; note that it has to be a character vector containing the custom stopwords
words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)
# Have a look to see if it worked
inspect(words.corpus.nostopwords)
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$words1
, , , , apple, pear, orange, lime, mandarin, , ,
$words2
, , , , apple, pear, orange, lime, mandarin, , ,
$words3
, , , , apple, pear, orange, lime, mandarin, , ,
Success! The standard stopwords are gone, as are the words in the custom list from the Excel file. Undoubtedly there are other ways to do it.
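For example, one alternative (a sketch reusing the same nuts data frame from above) is to combine both lists and remove them in a single pass:
custom.stops <- c(stopwords("english"), nuts$stopwords)
words.corpus.nostopwords <- tm_map(words.corpus, removeWords, custom.stops)
inspect(words.corpus.nostopwords)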
Convert the data you want to turn into a word cloud into a data frame. Create a CSV file with the words you want to eliminate (with a header column named Words) and read it in as a data frame too. You can then use dplyr's anti_join():
library(dplyr)
# name the count column "Words" so it matches the join key below
allWords = as.data.frame(table(Words = bigWords$Words), stringsAsFactors = FALSE)
wordsToAvoid = read.csv("wordsToDrop.csv", stringsAsFactors = FALSE)
finalWords = anti_join(allWords, wordsToAvoid, by = "Words")
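A minimal self-contained sketch of that approach, with made-up stand-ins for bigWords and wordsToDrop.csv (the sample values are assumptions, not from the original answer):
library(dplyr)
# stand-in for the real token data: one word per row in a column called Words
bigWords = data.frame(Words = c("peanut", "apple", "apple", "pear", "cashew"),
                      stringsAsFactors = FALSE)
# stand-in for read.csv("wordsToDrop.csv"); the file would hold a single Words column
wordsToAvoid = data.frame(Words = c("peanut", "cashew"), stringsAsFactors = FALSE)
# count the words, keeping the column name "Words" so the join key matches
allWords = as.data.frame(table(Words = bigWords$Words), stringsAsFactors = FALSE)
# keep only the rows of allWords whose word does not appear in wordsToAvoid
finalWords = anti_join(allWords, wordsToAvoid, by = "Words")
finalWords # only apple (freq 2) and pear (freq 1) remain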
Source: https://stackoverflow.com/questions/8619941/how-do-i-remove-words-from-a-wordcloud