subsetting based on number of observations in a factor variable

半城伤御伤魂 提交于 2019-12-02 10:44:53

table, subset that, and match based on the names of that subset. Probably will want to droplevels thereafter.


EIDT

Some sample data:

set.seed(1234)
data <- data.frame(factor = factor(sample(10000:12999, 1000000, 
  TRUE, prob=rexp(3000))))

Has some categories with few cases

> min(table(data$factor))
[1] 1

Remove records from case with less than 100 of those with the same value of factor.

tbl <- table(data$factor)
data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])

Check:

> min(table(data$factor))
[1] 100

Note that data and factor are not very good names since they are also builtin functions.

cconnell

I figured it out using the following, as there is no reason to do things twice:

function (df, column, threshold) { 
    size <- nrow(df) 
    if (threshold < 1) threshold <- threshold * size 
    tab <- table(df[[column]]) 
    keep <- names(tab)[tab >  threshold] 
    drop <- names(tab)[tab <= threshold] 
    cat("Keep(",column,")",length(keep),"\n"); print(tab[keep]) 
    cat("Drop(",column,")",length(drop),"\n"); print(tab[drop]) 
    str(df) 
    df <- df[df[[column]] %in% keep, ] 
    str(df) 
    size1 <- nrow(df) 
    cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n") 
    df[[column]] <- factor(df[[column]], levels=keep) 
    df 
}
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!