How to remove rows from a data frame that contain only a few words in R?

梦想的初衷 submitted on 2021-02-05 07:55:10

Question


I'm trying to remove rows from my data frame that contain fewer than 5 words. For example:

mydf <- as.data.frame(read.xlsx("C:\\data.xlsx", 1, header=TRUE))

head(mydf)

     NO    ARTICLE
1    34    The New York Times reports a lot of words here.
2    12    Greenwire reports a lot of words.
3    31    Only three words.
4     2    The Financial Times reports a lot of words.
5     9    Greenwire short.
6    13    The New York Times reports a lot of words again.

I want to remove rows with 5 or fewer words. How can I do that?


Answer 1:


Here are two ways:

mydf[sapply(gregexpr("\\W+", mydf$ARTICLE), length) >4,]
#   NO                                          ARTICLE
# 1 34  The New York Times reports a lot of words here.
# 2 12                Greenwire reports a lot of words.
# 4  2      The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.


mydf[sapply(strsplit(as.character(mydf$ARTICLE)," "),length)>5,]
#   NO                                          ARTICLE
# 1 34  The New York Times reports a lot of words here.
# 2 12                Greenwire reports a lot of words.
# 4  2      The Financial Times reports a lot of words.
# 6 13 The New York Times reports a lot of words again.

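The first finds the starting position of every run of non-word characters (the spaces between words and the trailing period) and counts them. Because each article here ends with a period, that count equals the number of words; for text with no trailing punctuation it would be one less.

The second splits the ARTICLE column on single spaces into a vector containing the component words and calculates the length of that vector. This is probably a better approach. Note that the two thresholds differ: > 4 keeps rows with at least 5 words, while > 5 keeps rows with at least 6. Both give the same result on the sample data, so pick the cutoff you actually need.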
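To check the counts yourself, here is a minimal sketch that rebuilds the sample data from the question (so it runs without the .xlsx file) and counts words by splitting on runs of whitespace. The names word_count, min_words and kept are only illustrative, and the cutoff of 5 assumes the goal is to drop rows with fewer than 5 words.

```r
# Rebuild the sample data from the question so this runs without the .xlsx file.
mydf <- data.frame(
  NO = c(34, 12, 31, 2, 9, 13),
  ARTICLE = c(
    "The New York Times reports a lot of words here.",
    "Greenwire reports a lot of words.",
    "Only three words.",
    "The Financial Times reports a lot of words.",
    "Greenwire short.",
    "The New York Times reports a lot of words again."
  ),
  stringsAsFactors = FALSE
)

# Split on runs of whitespace so repeated spaces do not create empty "words".
word_count <- lengths(strsplit(trimws(mydf$ARTICLE), "\\s+"))
word_count
# 10 6 3 8 2 10

# Assumed cutoff: keep rows with at least 5 words.
min_words <- 5
kept <- mydf[word_count >= min_words, ]
kept$NO
# 34 12 2 13

# The separator count from gregexpr() depends on trailing punctuation:
sapply(gregexpr("\\W+", c("Only three words.", "Only three words")), length)
# 3 2
```

Setting min_words to 6 would reproduce the > 5 cutoff used with strsplit above.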




Answer 2:


The word count (wc) function in the qdap package can facilitate this as well:

# Load qdap first: read.transcript(), qcv() and wc() all come from it
library(qdap)

dat <- read.transcript(text="34    The New York Times reports a lot of words here.
12    Greenwire reports a lot of words.
31    Only three words.
2    The Financial Times reports a lot of words.
9    Greenwire short.
13    The New York Times reports a lot of words again.",
    col.names = qcv(NO, ARTICLE), sep="   ")

# Keep rows whose ARTICLE has more than 4 words
dat[wc(dat$ARTICLE) > 4, ]

##   NO                                          ARTICLE
## 1 34  The New York Times reports a lot of words here.
## 2 12                Greenwire reports a lot of words.
## 4  2      The Financial Times reports a lot of words.
## 6 13 The New York Times reports a lot of words again.


Source: https://stackoverflow.com/questions/22140149/how-to-remove-rows-from-a-data-frame-that-contain-only-few-words-in-r
