How do you subset based on the number of observations of the levels of a factor variable? I have a dataset with 1,000,000 rows and nearly 3000 levels, and I want to subset out
I figured it out using the following, as there is no reason to do things twice:
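function (df, column, threshold) {
  size <- nrow(df)
  # A threshold below 1 is treated as a fraction of the total row count
  if (threshold < 1) threshold <- threshold * size
  # Count the observations per level
  tab <- table(df[[column]])
  keep <- names(tab)[tab > threshold]   # levels with enough observations
  drop <- names(tab)[tab <= threshold]  # levels that are too rare
  cat("Keep(",column,")",length(keep),"\n"); print(tab[keep])
  cat("Drop(",column,")",length(drop),"\n"); print(tab[drop])
  str(df)
  # Keep only rows whose level survived; drop=FALSE keeps a one-column
  # data frame from collapsing to a vector
  df <- df[df[[column]] %in% keep, , drop=FALSE]
  str(df)
  size1 <- nrow(df)
  cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n")
  # Rebuild the factor so the dropped levels disappear
  df[[column]] <- factor(df[[column]], levels=keep)
  df
}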
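As a small illustration of the threshold rule in that function (a value below 1 is read as a fraction of the rows, anything else as an absolute count), here is a sketch on a toy vector. The helper name resolve_threshold and the toy data are made up for this example and are not part of the original function.

```r
# Toy factor: "a" x 6, "b" x 3, "c" x 1 (10 observations in total).
x <- factor(rep(c("a", "b", "c"), times = c(6, 3, 1)))
tab <- table(x)

# Hypothetical helper mirroring the rule used in the function above:
# thresholds below 1 are fractions of n, others are absolute counts.
resolve_threshold <- function(threshold, n) {
  if (threshold < 1) threshold * n else threshold
}

abs_cut  <- resolve_threshold(2, length(x))    # absolute count: 2
frac_cut <- resolve_threshold(0.5, length(x))  # 50% of 10 rows: 5

keep_abs  <- names(tab)[tab > abs_cut]   # levels with more than 2 rows
keep_frac <- names(tab)[tab > frac_cut]  # levels with more than 5 rows

print(keep_abs)
print(keep_frac)
```

Note that the comparison is strict (>), so a level with exactly threshold rows is dropped.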
Use table, subset that, and match based on the names of that subset. You will probably want to call droplevels thereafter.
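To make the individual steps concrete, here is a minimal sketch on a tiny data frame. The names toy, grp, val, counts, common and kept, as well as the cutoff of 2, are only chosen for this illustration.

```r
# Toy data: level "a" appears 4 times, "b" twice, "c" once.
toy <- data.frame(grp = factor(c("a", "a", "a", "a", "b", "b", "c")),
                  val = 1:7)

counts <- table(toy$grp)              # 1. count the rows per level
common <- names(counts)[counts >= 2]  # 2. subset the table, keep the names
kept   <- toy[toy$grp %in% common, , drop = FALSE]  # 3. match rows on those names
kept   <- droplevels(kept)            # 4. discard the now-unused level "c"

levels(kept$grp)  # "a" "b"
nrow(kept)        # 6
```

Without the droplevels step, "c" would remain as a level with zero observations and would still show up in later calls to table.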
EDIT
Some sample data:
set.seed(1234)
data <- data.frame(factor = factor(sample(10000:12999, 1000000,
TRUE, prob=rexp(3000))))
Some categories have only a few cases:
> min(table(data$factor))
[1] 1
Remove the records whose value of factor occurs fewer than 100 times:
tbl <- table(data$factor)
data <- droplevels(data[data$factor %in% names(tbl)[tbl >= 100],,drop=FALSE])
Check:
> min(table(data$factor))
[1] 100
Note that data and factor are not very good names, since they are also built-in functions.
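For what it's worth, the same approach with names that do not clash with built-in functions could look like the sketch below. The names obs, grp, grp_counts and min_n are hypothetical, and the data is a much smaller made-up sample so that it runs instantly.

```r
set.seed(1234)

# `obs` and `grp` are illustrative names that do not mask data() or factor().
obs <- data.frame(grp = factor(sample(letters[1:5], 200, TRUE,
                                      prob = c(50, 30, 15, 4, 1))))

min_n <- 20                   # assumed cutoff for this small sample
grp_counts <- table(obs$grp)  # observations per level

# Keep rows whose level occurs at least min_n times, then drop unused levels.
obs <- droplevels(obs[obs$grp %in% names(grp_counts)[grp_counts >= min_n], ,
                      drop = FALSE])

min(table(obs$grp))  # every remaining level has at least min_n rows
```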