Question
I am trying to make a word cloud of publication keywords, for example: Educational data mining; collaborative learning; computer science, etc.
My current code is as follows:
KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year==2012)))
KeywordsCorpus <- tm_map(KeywordsCorpus, removePunctuation)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeNumbers)
# added tolower
KeywordsCorpus <- tm_map(KeywordsCorpus, tolower)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeWords, stopwords("english"))
# moved stripWhitespace
KeywordsCorpus <- tm_map(KeywordsCorpus, stripWhitespace)
KeywordsCorpus <- tm_map(KeywordsCorpus, PlainTextDocument)
dtm4 <- TermDocumentMatrix(KeywordsCorpus)
m4 <- as.matrix(dtm4)
v4 <- sort(rowSums(m4),decreasing=TRUE)
d4 <- data.frame(word = names(v4),freq=v4)
However, this code splits the keywords into individual words, whereas I need the multi-word phrases kept together. For instance, "Educational Data Mining" is one phrase that I want to show as a single term, but what I get is "Educational", "Data", "Mining". Is there a way to keep each compound phrase together? The semicolon might help as a separator.
Thanks.
Answer 1:
Here's a solution using a different text package that lets you form multi-word expressions either from statistically detected collocations or simply by forming all bigrams. The package is called quanteda.
library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.14’
First, the method for detecting the top 1,500 bigram collocations and replacing these collocations in the texts with their single-token versions (concatenated by the "_" character). Here I am using the package's built-in corpus of the US presidential inaugural address texts.
### for just the top 1500 collocations
# detect the collocations
colls <- collocations(inaugCorpus, n = 1500, size = 2)
# remove collocations containing stopwords
colls <- removeFeatures(colls, stopwords("SMART"))
## Removed 1,224 (81.6%) of 1,500 collocations containing one of 570 stopwords.
# replace the phrases with single-token versions
inaugCorpusColl2 <- phrasetotoken(inaugCorpus, colls)
# create the document-feature matrix
inaugColl2dfm <- dfm(inaugCorpusColl2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 9,741 feature types
## ... removed 430 features, from 570 supplied (glob) feature types
## ... complete.
## ... created a 57 x 9311 sparse dfm
## Elapsed time: 0.163 seconds.
# plot the wordcloud
set.seed(1000)
png("~/Desktop/wcloud1.png", width = 800, height = 800)
plot(inaugColl2dfm["2013-Obama", ], min.freq = 2, random.order = FALSE,
colors = sample(colors()[2:128]))
dev.off()
This results in the following plot. Note the collocations, such as "generation's_task" and "fellow_americans".
The version formed with all bigrams is easier, but results in a huge number of low frequency bigram features. For the word cloud, I selected a larger set of texts, not just the 2013 Obama address.
### version with all bi-grams
inaugbigramsDfm <- dfm(inaugCorpusColl2, ngrams = 2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... removed 54,200 features, from 570 supplied (glob) feature types
## ... indexing features: 64,108 feature types
## ... created a 57 x 9908 sparse dfm
## ... complete.
## Elapsed time: 3.254 seconds.
# plot the bigram wordcloud - more texts because for a single speech,
# almost none occur more than once
png("~/Desktop/wcloud2.png", width = 800, height = 800)
plot(inaugbigramsDfm[40:57, ], min.freq = 2, random.order = FALSE,
colors = sample(colors()[2:128]))
dev.off()
This produces a word cloud of bigrams across the selected speeches.
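The function names above come from quanteda 0.9.5.14 and were renamed in later releases. As a rough sketch only, the collocation approach might look like this in a more recent quanteda, assuming the companion packages quanteda.textstats and quanteda.textplots are installed. The `min_count = 5` threshold and the use of `stopwords("en")` instead of the SMART list are illustrative choices, not taken from the answer above.

```r
library(quanteda)
library(quanteda.textstats)  # provides textstat_collocations()
library(quanteda.textplots)  # provides textplot_wordcloud()

# Tokenize the built-in inaugural address corpus
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Drop stopwords but keep empty "pads" in their place, so that two
# words separated by a stopword are not treated as adjacent
toks_nostop <- tokens_remove(toks, stopwords("en"), padding = TRUE)

# Score two-word collocations that occur at least 5 times (assumed threshold)
colls <- textstat_collocations(toks_nostop, size = 2, min_count = 5)

# Join each detected collocation into a single "_"-concatenated token
toks_comp <- tokens_compound(toks_nostop, pattern = colls)

# Build the document-feature matrix, discarding the empty pads
inaug_dfm <- dfm(toks_comp, remove_padding = TRUE)

# Plot the word cloud for a single speech
set.seed(1000)
textplot_wordcloud(inaug_dfm["2013-Obama", ], min_count = 2,
                   random_order = FALSE)
```

Raising or lowering `min_count` changes how many phrases get compounded, so it is worth experimenting with on your own corpus.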
Answer 2:
OK, after a lot of research I found the perfect answer. First of all, if you want a word cloud of two-word phrases, these are called bigrams. There are R packages available for this, such as "tau" and "RWeka".
Answer 3:
The best suggestion for you is to follow the short five-minute video (link below):
https://youtu.be/HellsQ2JF2k
If you want the R code directly, here it is:
library(tm)
library(RWeka)         # NGramTokenizer(), Weka_control()
library(wordcloud)
library(RColorBrewer)  # brewer.pal()
mycorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year == 2012)))
# Text cleaning
# Convert the text to lower case
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
# Remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
# Remove common English stopwords
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
# Remove punctuation
mycorpus <- tm_map(mycorpus, removePunctuation)
# Eliminate extra white space
mycorpus <- tm_map(mycorpus, stripWhitespace)
# Inspect the first cleaned document
as.character(mycorpus[[1]])
# Bigrams
minfreq_bigram <- 2
token_delim <- " \\t\\r\\n.!?,;\"()"
bitoken <- NGramTokenizer(mycorpus, Weka_control(min = 2, max = 2, delimiters = token_delim))
# Count each bigram and sort by descending frequency
two_word <- data.frame(table(bitoken))
sort_two <- two_word[order(two_word$Freq, decreasing = TRUE), ]
wordcloud(sort_two$bitoken, sort_two$Freq, random.order = FALSE, scale = c(2, 0.35),
          min.freq = minfreq_bigram, colors = brewer.pal(8, "Dark2"), max.words = 150)
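Since the keywords in the question are already separated by semicolons, another option is to skip n-gram tokenization and count each semicolon-delimited entry as one term. The following is a minimal base-R sketch of that idea. The `keywords` vector is made-up sample data standing in for `Words$Author.Keywords`, and the plotting step assumes the wordcloud package is installed.

```r
# Hypothetical sample data standing in for Words$Author.Keywords
keywords <- c(
  "Educational data mining; collaborative learning; computer science",
  "Educational Data Mining; learning analytics",
  "Collaborative learning; educational data mining"
)

# Split each entry on ";", then trim and lowercase so that
# differently cased variants of the same phrase are merged
phrases <- unlist(strsplit(keywords, ";", fixed = TRUE))
phrases <- tolower(trimws(phrases))
phrases <- phrases[nzchar(phrases)]  # drop empty entries

# Count whole phrases, most frequent first
freq <- sort(table(phrases), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = as.integer(freq),
                stringsAsFactors = FALSE)

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(d$word, d$freq, min.freq = 1, random.order = FALSE)
}
```

Note that this has to run on the raw keyword strings: applying removePunctuation first would strip the semicolons that mark the phrase boundaries.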
Source: https://stackoverflow.com/questions/36479780/making-a-wordcloud-but-with-combined-words