from data table, randomly select one row per group

偶尔善良 提交于 2019-11-27 14:50:56

OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.

dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
bramtayl

You can use dplyr

library(dplyr)

dt %>%
  group_by(z) %%
  sample_n(1)

I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:

# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")

Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun ' solution seems to give the same grouping:

set.seed(2017)
dt <- data.table(c1 = sample(52*10^6), 
                 c2 = sample(LETTERS, replace = TRUE), 
                 c3 = sample(10^5, replace = TRUE), 
                 c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user  system elapsed 
# 13.87    0.49   14.33 

# @akrun' solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user  system elapsed 
# 11.89    0.10   12.01 

# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)], 
          y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE

For sampling more than one row per group check here

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!