Question
I have a column named text with 50k rows of tweets from a csv file (the tweets consist of sentences, phrases, etc.). I'm trying to count the frequency of several words in that column. Is there an easier way to do this than what I'm doing below?
# Reading my file
tweets <- read.csv('coffee.csv', header=TRUE)
# Doing a grepl per word (this is hard because I need to look for many words one by one)
coffee <- grepl("coffee", tweets$text, ignore.case = TRUE)
mugs <- grepl("mugs", tweets$text, ignore.case = TRUE)
# Calculate the % of times among all tweets (this is hard because I need to calculate one by one)
sum(coffee) / nrow(tweets)
sum(mugs) / nrow(tweets)
Expected output (assuming I have more than 2 words up there):
Word Freq
coffee 50
mugs 40
cup 64
pen 12
Answer 1:
You can create a vector of the words whose frequency/percentage you want to count and use sapply to calculate them.
words <- c('coffee', 'mugs')
data.frame(words, t(sapply(paste0('\\b', words, '\\b'), function(x) {
  tmp <- grepl(x, tweets$text)
  c(perc = mean(tmp) * 100,
    Freq = sum(tmp))
})), row.names = NULL) -> result
result
# words perc Freq
#1 coffee 33.33333 1
#2 mugs 66.66667 2
sapply is similar to a for loop in that it iterates over each word defined in words. grepl returns TRUE/FALSE values indicating whether the word is present in tweets$text, and this logical vector is stored in tmp. To count the frequency we use sum, and for the percentage we use mean. Word boundaries (\\b) are also added around each word so that only complete words match in the text; hence 'coffee' does not match 'coffees', etc.
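As a minimal sketch of the two points above (not part of the original answer): sum on a logical vector counts the TRUEs, mean gives their proportion, and the \\b boundary is what rejects partial matches.

```r
# A logical vector from grepl: sum counts matches, mean gives the proportion
hits <- grepl("\\bcoffee\\b", c("a coffee", "many coffees", "no match"))
sum(hits)   # 1 -- only the exact word "coffee" matches
mean(hits)  # 1/3 -- proportion of matching strings

# Without the word boundary, "coffees" would also match
sum(grepl("coffee", c("a coffee", "many coffees", "no match")))  # 2
```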
data
tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
                              'This has only mugs',
                              'This has nothing'))
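With this sample data, the same per-word counts can also be collected in one matrix step; a hedged alternative sketch (not the answer's code), passing each pattern to grepl via sapply and summing columns:

```r
# Each column of the matrix is the grepl result for one word;
# colSums/colMeans then give per-word counts and percentages
words <- c('coffee', 'mugs')
tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
                              'This has only mugs',
                              'This has nothing'))
hits <- sapply(paste0('\\b', words, '\\b'), grepl, x = tweets$text)
colSums(hits)         # match counts: 1, 2
colMeans(hits) * 100  # percentages: 33.33, 66.67
```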
Source: https://stackoverflow.com/questions/66073470/grepl-group-of-strings-and-count-frequency-of-all-using-r