subsetting based on number of observations in a factor variable

情话喂你 2021-01-25 15:34

How do you subset based on the number of observations in each level of a factor variable? I have a dataset with 1,000,000 rows and nearly 3000 levels, and I want to subset out

2 Answers
  • 2021-01-25 16:10

    I figured it out using the following, as there is no reason to do things twice:

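    # Keep only the rows whose level in `column` occurs more than `threshold` times.
    # A threshold below 1 is treated as a fraction of the total row count.
    # The original snippet was anonymous; the name subset_by_count is only a placeholder.
    subset_by_count <- function(df, column, threshold) {
        size <- nrow(df)
        if (threshold < 1) threshold <- threshold * size
        tab <- table(df[[column]])
        keep <- names(tab)[tab >  threshold]
        drop <- names(tab)[tab <= threshold]
        # Report which levels are kept and which are dropped
        cat("Keep(", column, ")", length(keep), "\n"); print(tab[keep])
        cat("Drop(", column, ")", length(drop), "\n"); print(tab[drop])
        str(df)
        df <- df[df[[column]] %in% keep, ]
        str(df)
        size1 <- nrow(df)
        cat("Rows:", size, "-->", size1, "(dropped", 100 * (size - size1) / size, "%)\n")
        # Rebuild the factor so the dropped levels disappear
        df[[column]] <- factor(df[[column]], levels = keep)
        df
    }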
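
A small illustration of how the threshold in the function above is interpreted: a value below 1 is read as a fraction of the rows, anything else as an absolute count. This is only a sketch; the toy data and the names toy, grp, keep_abs and keep_frac are made up for the example.

```r
# Toy data (hypothetical): level "a" occurs 6 times, "b" 3 times, "c" once.
toy <- data.frame(grp = factor(rep(c("a", "b", "c"), times = c(6, 3, 1))))
tab <- table(toy$grp)

# Absolute threshold: keep levels with more than 2 rows.
keep_abs <- names(tab)[tab > 2]

# Fractional threshold: 0.5 is scaled to 0.5 * nrow(toy) = 5 rows.
keep_frac <- names(tab)[tab > 0.5 * nrow(toy)]

print(keep_abs)   # "a" "b"
print(keep_frac)  # "a"
```

Note that the comparison is strict, so a level with exactly the threshold number of rows is dropped.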
    
  • 2021-01-25 16:26

    Build a table of the factor, subset that table to the levels you want to keep, and match the rows against the names of that subset. You will probably want to call droplevels afterwards.


    EDIT

    Some sample data:

    set.seed(1234)
    data <- data.frame(factor = factor(sample(10000:12999, 1000000, 
      TRUE, prob=rexp(3000))))
    

    Some categories have very few cases:

    > min(table(data$factor))
    [1] 1
    

    Remove the records whose value of factor occurs fewer than 100 times:

    tbl <- table(data$factor)
    data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])
    

    Check:

    > min(table(data$factor))
    [1] 100
    

    Note that data and factor are not very good names, since they are also the names of built-in functions.
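
As a possible alternative to the lookup via table, the per-row group size can be computed directly with ave and used as a filter. This is a sketch on made-up toy data, not the method from the answer above; the names df, grp, n_in_level and df_sub are placeholders chosen to avoid clashing with built-in functions, and the cutoff of 3 is arbitrary.

```r
# Toy data (hypothetical): "x" occurs 4 times, "y" twice, "z" 3 times.
df <- data.frame(grp = factor(rep(c("x", "y", "z"), times = c(4, 2, 3))))

# For every row, count how many rows share its level.
n_in_level <- ave(seq_len(nrow(df)), df$grp, FUN = length)

# Keep rows whose level occurs at least 3 times, then drop unused levels.
df_sub <- droplevels(df[n_in_level >= 3, , drop = FALSE])

print(table(df_sub$grp))
```

Because the count is attached to each row, no matching on level names is needed, at the cost of computing a vector as long as the data.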
