问题
I am trying to find a code that actually works to find the most frequently used two and three word phrases in R text mining package (maybe there is another package for it that I do not know). I have been trying to use the tokenizer, but seem to have no luck.
If you worked on a similar situation in the past, could you post a code that is tested and actually works? Thank you so much!
回答1:
You can pass in a custom tokenizing function to tm
's DocumentTermMatrix
function, so if you have package tau
installed it's fairly straightforward.
library(tm); library(tau);
tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))
Where n
in the tokenize_ngrams
function is the number of words per phrase. This feature is also implemented in package RTextTools
, which further simplifies things.
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)
This returns a class of DocumentTermMatrix
for use with package tm
.
回答2:
This is part 5 of the FAQ of the tm package:
5. Can I use bigrams instead of single tokens in a term-document matrix?
Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
回答3:
This is my own made up creation for different purposes but I think may applicable to your needs too:
#User Defined Functions
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE))
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
strp <- function(x, digit.remove, apostrophe.remove){
x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
}
unlist(lapply(x, function(x) Trim(strp(x =x, digit.remove = digit.remove,
apostrophe.remove = apostrophe.remove)) ))
}
unblanker <- function(x)subset(x, nchar(x)>0)
#Fake Text Data
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"
#The code using Base R to Do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
subset(textDF2, characters%in%2:3)
回答4:
The corpus library has a function called term_stats
that does what you want:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
Here, count
is the number of appearances, and support
is the number of documents containing the term.
回答5:
I add a similar problem by using tm
and ngram
packages.
After debugging mclapply
, I saw there where problems on documents with less than 2 words with the following error
input 'x' has nwords=1 and n=2; must have nwords >= n
So I've added a filter to remove document with low word count number:
myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
})
Then my tokenize function looks like:
bigramTokenizer <- function(x) {
x <- as.character(x)
# Find words
one.list <- c()
tryCatch({
one.gram <- ngram::ngram(x, n = 1)
one.list <- ngram::get.ngrams(one.gram)
},
error = function(cond) { warning(cond) })
# Find 2-grams
two.list <- c()
tryCatch({
two.gram <- ngram::ngram(x, n = 2)
two.list <- ngram::get.ngrams(two.gram)
},
error = function(cond) { warning(cond) })
res <- unlist(c(one.list, two.list))
res[res != '']
}
Then you can test the function with:
dtmTest <- lapply(myCorpus.3, bigramTokenizer)
And finally:
dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
回答6:
Try tidytext package
library(dplyr)
library(tidytext)
library(janeaustenr)
library(tidyr
)
Suppose I have a dataframe CommentData that contains comment column and I want to find occurrence of two words together. Then try
bigram_filtered <- CommentData %>%
unnest_tokens(bigram, Comment, token= "ngrams", n=2) %>%
separate(bigram, c("word1","word2"), sep=" ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word) %>%
count(word1, word2, sort=TRUE)
The above code creates tokens, and then remove stop words that doesn't help in analysis(eg. the,an,to etc.) Then you count occurrence of these words. You will be then using unite function to combine individual words and record their occurrence.
bigrams_united <- bigram_filtered %>%
unite(bigram, word1, word2, sep=" ")
bigrams_united
回答7:
Try this code.
library(tm)
library(SnowballC)
library(class)
library(wordcloud)
keywords <- read.csv(file.choose(), header = TRUE, na.strings=c("NA","-","?"))
keywords_doc <- Corpus(VectorSource(keywords$"use your column that you need"))
keywords_doc <- tm_map(keywords_doc, removeNumbers)
keywords_doc <- tm_map(keywords_doc, tolower)
keywords_doc <- tm_map(keywords_doc, stripWhitespace)
keywords_doc <- tm_map(keywords_doc, removePunctuation)
keywords_doc <- tm_map(keywords_doc, PlainTextDocument)
keywords_doc <- tm_map(keywords_doc, stemDocument)
This is the bigrams or tri grams section that you could use
BigramTokenizer <- function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# creating of document matrix
keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
# remove sparse terms
keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95)
# Frequency of the words appearing
keyword.freq <- rowSums(as.matrix(keywords_naremoval))
subsetkeyword.freq <-subset(keyword.freq, keyword.freq >=20)
frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
# Sorting of the words
frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]
# Printing of the words
wordcloud(frequentKeywordDF$term, freq=frequentKeywordDF$freq, random.order = FALSE, rot.per=0.35, scale=c(5,0.5), min.freq = 30, colors = brewer.pal(8,"Dark2"))
Hope this helps. This is an entire code that you could use.
来源:https://stackoverflow.com/questions/8898521/finding-2-3-word-phrases-using-r-tm-package