I have a df made up of almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). df looks like:
ID Year Temp ph
1
This is not a very elegant solution, but it may work.
library(data.table)
df <- data.table(df)
f <- list()
for (i in unique(df$ID)) {
  # keep 500 randomly chosen rows of the current ID
  f[[as.character(i)]] <- df[ID == i][sample(.N, 500)]
}
dfnew <- rbindlist(f)
library(data.table) #1
df <- data.table(df) #2
df[,group_num := sample(2,.N,replace = TRUE,prob = c(500,.N-500)/.N),by = "ID"] #3
df_sample = df[group_num == 1,] #4
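Note that line #3 keeps each row independently with probability 500/.N, so each ID ends up with roughly, not exactly, 500 rows. For exactly 500 rows per ID, change lines #3 and #4 to: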
df[,random_num := sample(.N,.N),by="ID"]
df_sample = df[random_num <=500,]
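To make the difference between the two variants concrete, here is a minimal sketch on made-up data. The table `toy` and its columns are hypothetical stand-ins for the asker's `df`, and each ID is assumed to have more than 500 rows.

```r
library(data.table)

set.seed(1)
# Hypothetical toy data: 3 IDs with 2,000 rows each
toy <- data.table(ID = rep(1:3, each = 2000), Temp = rnorm(6000))

# Variant A: each row is kept independently with probability 500 / .N,
# so the per-ID counts vary around 500
toy[, group_num := sample(2, .N, replace = TRUE, prob = c(500, .N - 500) / .N), by = "ID"]
counts_a <- toy[group_num == 1, .N, by = ID]

# Variant B: assign a random rank within each ID and keep ranks 1..500,
# which gives exactly 500 rows per ID
toy[, random_num := sample(.N, .N), by = "ID"]
counts_b <- toy[random_num <= 500, .N, by = ID]

print(counts_a)
print(counts_b)
```

Variant B is the one to use if the sample size per ID has to be exact.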
In case you have big datasets, a data.table solution could go like this:
library(data.table)
# Generate 26 mil rows random data
set.seed(2019)
dt <- data.table(c1 = sample(length(LETTERS)*10^6),
c2 = sample(LETTERS, replace = TRUE))
# For each letter, sample 500 rows
dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]
# We indeed sampled 500 rows for each letter
dt_sample[, .N, by = c2][order(c2)]
#> c2 N
#> 1: A 500
#> 2: D 500
#> 3: G 500
#> 4: I 500
#> 5: M 500
#> 6: N 500
#> 7: O 500
#> 8: P 500
#> 9: Q 500
#> 10: R 500
#> 11: S 500
#> 12: T 500
#> 13: U 500
#> 14: V 500
#> 15: W 500
#> 16: Y 500
#> 17: Z 500
Created on 2019-04-23 by the reprex package (v0.2.1)
If your data is unbalanced, in the sense that some groups have fewer rows than your desired sample size, cap the sample size defensively at min(500, .N) (see "Sample random rows within each group in a data.table"). For example:
dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
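As a quick check of the capped sample size, the following sketch uses a small, deliberately unbalanced table. The name `dt_unbal` and the group sizes are invented for illustration. It also shows an alternative formulation that samples row indices with `.I`, which may be faster on large tables because it avoids building `.SD` for every group.

```r
library(data.table)

set.seed(42)
# Hypothetical unbalanced data: group "a" has 1,000 rows, group "b" only 120
dt_unbal <- data.table(c1 = seq_len(1120),
                       c2 = rep(c("a", "b"), times = c(1000, 120)))

# Capped sample: at most 500 rows per group, or the whole group if it is smaller
dt_capped <- dt_unbal[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
sizes <- dt_capped[, .N, by = c2]
print(sizes)

# Alternative: sample row numbers per group with .I, then subset once
idx <- dt_unbal[, .I[sample(.N, min(500, .N))], by = c2]$V1
dt_capped_i <- dt_unbal[idx]
sizes_i <- dt_capped_i[, .N, by = c2]
print(sizes_i)
```

Without the `min()`, `sample()` would stop with an error on group "b", because it cannot draw 500 rows without replacement from 120.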
Try this:
library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
An approach for when one of the IDs has fewer rows than the sample size (500 in the question). Here I used the mtcars data set with a sample size of 8:
n <- 8
df <- mtcars
df$ID <- df$cyl
FUN <- function(x, n) {
if (length(x) <= n) return(x)
x[x %in% sample(x, n)]
}
df[unlist(lapply(split(1:nrow(df), df$ID), FUN, n = n)), ]
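The same idea can be wrapped in a small reusable helper. This is only a sketch in base R: the function name `sample_per_group` and its arguments are made up for this example and are not part of any package.

```r
# Hypothetical helper: sample up to n rows from each group of a data frame
sample_per_group <- function(data, group_col, n) {
  # row numbers of `data`, split by the grouping column
  rows_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  keep <- lapply(rows_by_group, function(idx) {
    # small groups are returned whole
    if (length(idx) <= n) return(idx)
    # otherwise draw n row numbers without replacement
    idx[sample.int(length(idx), n)]
  })
  # restore the original row order in the result
  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

set.seed(123)
out <- sample_per_group(mtcars, "cyl", 8)
print(table(out$cyl))
```

In mtcars the 6-cylinder group has only 7 cars, so it is kept whole, while the 4- and 8-cylinder groups are cut down to 8 rows each.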
This is available as the sample_n function in dplyr:
library(dplyr)
new_df <- df %>% group_by(ID) %>% sample_n(500)
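In dplyr 1.0.0 and later, sample_n is superseded by slice_sample. A minimal sketch, assuming a hypothetical data frame `toy_df` in which one ID has fewer than 500 rows:

```r
library(dplyr)

set.seed(7)
# Hypothetical data: ID 1 has 1,000 rows, ID 2 only 120
toy_df <- data.frame(ID = rep(c(1, 2), times = c(1000, 120)),
                     Temp = rnorm(1120))

new_df <- toy_df %>%
  group_by(ID) %>%
  slice_sample(n = 500) %>%  # up to 500 random rows per ID
  ungroup()

id_counts <- count(new_df, ID)
print(id_counts)
```

When sampling without replacement, slice_sample truncates n to the group size, so ID 2 keeps all of its 120 rows, whereas sample_n(500) would stop with an error on that group.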