Question
I'm trying to remove rows from my data frame that contain fewer than 5 words, e.g.:
mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))
head(mydf)
NO ARTICLE
1 34 The New York Times reports a lot of words here.
2 12 Greenwire reports a lot of words.
3 31 Only three words.
4 2 The Financial Times reports a lot of words.
5 9 Greenwire short.
6 13 The New York Times reports a lot of words again.
I want to remove rows with 5 or fewer words. How can I do that?
Answer 1:
Here are two ways:
mydf[sapply(gregexpr("\\W+", mydf$ARTICLE), length) >4,]
# NO ARTICLE
# 1 34 The New York Times reports a lot of words here.
# 2 12 Greenwire reports a lot of words.
# 4 2 The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.
mydf[sapply(strsplit(as.character(mydf$ARTICLE)," "),length)>5,]
# NO ARTICLE
# 1 34 The New York Times reports a lot of words here.
# 2 12 Greenwire reports a lot of words.
# 4 2 The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.
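The first finds the starting position of every run of non-word characters (spaces and punctuation) in each article and counts them. Because each sentence here ends with a period, that count equals the number of words, so > 4 keeps rows with five or more words.
The second splits each ARTICLE value on single spaces into a vector of words and takes the length of that vector; > 5 keeps rows with six or more words, which gives the same result on this data. This is probably the better approach.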
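As a self-contained sketch of the split-and-count idea, the example below rebuilds the sample data by hand instead of reading the Excel file. count_words and min_words are hypothetical names introduced here, and splitting on runs of whitespace ("\\s+") is a small variation on the single-space split above, so repeated spaces are not counted as extra words.

```r
# Sample data from the question, typed in by hand (an assumption; the
# original is read from an Excel file).
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count whitespace-separated words in each element.
count_words <- function(x) {
  x <- trimws(as.character(x))      # also handles factor columns
  n <- lengths(strsplit(x, "\\s+")) # number of pieces per element
  n[is.na(x)] <- 0L                 # treat missing text as zero words
  n
}

# Keep rows with at least min_words words (5 here; use 6 to drop
# rows with "5 or fewer" words).
min_words <- 5
kept <- mydf[count_words(mydf$ARTICLE) >= min_words, ]
print(kept)
```

Making the threshold an explicit variable keeps the "fewer than 5" versus "5 or fewer" choice in one place.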
Answer 2:
The word count (wc) function in the qdap package can facilitate this as well:
library(qdap)
dat <- read.transcript(text="34 The New York Times reports a lot of words here.
12 Greenwire reports a lot of words.
31 Only three words.
2 The Financial Times reports a lot of words.
9 Greenwire short.
13 The New York Times reports a lot of words again.",
col.names = qcv(NO, ARTICLE), sep=" ")
dat[wc(dat$ARTICLE) > 4, ]
## NO ARTICLE
## 1 34 The New York Times reports a lot of words here.
## 2 12 Greenwire reports a lot of words.
## 4 2 The Financial Times reports a lot of words.
## 6 13 The New York Times reports a lot of words again.
Source: https://stackoverflow.com/questions/22140149/how-to-remove-rows-from-a-data-frame-that-contain-only-few-words-in-r